WO2021190146A1 - Picture processing method, apparatus, storage medium, and electronic device - Google Patents

Picture processing method, apparatus, storage medium, and electronic device

Info

Publication number
WO2021190146A1
WO2021190146A1 PCT/CN2021/074706 CN2021074706W WO2021190146A1 WO 2021190146 A1 WO2021190146 A1 WO 2021190146A1 CN 2021074706 W CN2021074706 W CN 2021074706W WO 2021190146 A1 WO2021190146 A1 WO 2021190146A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
text
area
feature map
image
Prior art date
Application number
PCT/CN2021/074706
Other languages
English (en)
French (fr)
Inventor
刘鹏
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2021190146A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words

Definitions

  • This application belongs to the field of electronic technology, and in particular relates to an image processing method, device, storage medium, and electronic equipment.
  • the embodiments of the present application provide a picture processing method, device, storage medium, and electronic equipment, which can improve the flexibility of recognizing characters in pictures.
  • an image processing method including:
  • an image processing device including:
  • the obtaining module is used to obtain the picture to be processed
  • a calling module for calling a pre-trained image semantic segmentation model to divide the picture to be processed into multiple regions, where each region corresponds to a category, and the categories include text categories, table categories, and image categories;
  • a determining module configured to determine a target area from the multiple areas
  • the recognition module is used to perform character recognition processing on the target area to recognize the characters in the target area.
  • an embodiment of the present application provides a storage medium on which a computer program is stored.
  • the computer program is executed on a computer, the computer is caused to execute the process in the image processing method provided by the embodiment of the present application.
  • an embodiment of the present application further provides an electronic device, including a memory and a processor, wherein the processor is configured to execute the process in the image processing method provided in the embodiments of the present application by calling a computer program stored in the memory.
  • FIG. 1 is a schematic diagram of the first flow of a picture processing method provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of a picture to be processed provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a scenario provided by an embodiment of the present application.
  • Fig. 4 is a second schematic diagram of a picture processing method provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a network structure of an image semantic segmentation model provided by an embodiment of the present application.
  • Fig. 6 is a schematic structural diagram of a picture processing apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the first structure of an electronic device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a second structure of an electronic device provided by an embodiment of the present application.
  • Fig. 9 is a schematic structural diagram of an image processing circuit provided by an embodiment of the present application.
  • An embodiment of the application provides an image processing method, including:
  • in some embodiments, before acquiring the picture to be processed, the method further includes:
  • the sample picture including a plurality of sample areas, each sample area corresponding to a category;
  • the target area includes a table area
  • the method further includes:
  • the target area further includes a text area, and after filling the text into the table, it further includes:
  • the typeset table and the text recognized from the text area are output.
  • the output of the typeset table and the text recognized from the text area includes:
  • the typeset table and the text recognized from the text area are output to the editing interface.
  • the performing character recognition processing on the target area to recognize characters in the target area includes:
  • in some embodiments, before the using of the character recognition model to perform character recognition processing on the target area to recognize the characters in the target area, the method further includes:
  • if the length of the picture to be processed is greater than a preset length, performing cropping processing on the picture to be processed to crop the picture to be processed into multiple sub-pictures, wherein each sub-picture corresponds to a region;
  • the execution subject of the embodiments of the present application may be an electronic device such as a smart phone or a tablet computer.
  • FIG. 1 is a schematic diagram of the first process of the image processing method provided by an embodiment of the present application.
  • the process may include:
  • the category of the picture to be processed may include at least two categories. For example, if a picture includes text and pictures, the category of the picture may include text category and picture category, and the electronic device may determine the picture as a picture to be processed. For another example, if a picture includes text, picture, and table, the category of the picture may include text category, picture category, and table category, and the electronic device may determine the picture as a picture to be processed.
  • the picture to be processed may be as shown in FIG. 2.
  • the to-be-processed picture G1 includes text, table, and picture.
  • the category of the picture G1 to be processed may include a text category, a table category, and a picture category.
  • the pre-trained image semantic segmentation model is called to divide the image to be processed into multiple regions, where each region corresponds to a category, and the category includes a text category, a table category, and a picture category.
  • after acquiring the image to be processed and before performing text recognition on it, the electronic device first calls the pre-trained image semantic segmentation model to divide the image to be processed into multiple regions, where each region corresponds to a category, and the categories include the text category, the table category, and the picture category. As shown in Fig. 3, the picture G1 to be processed will be divided into three areas, namely the text area A1 where the text is located, the table area A2 where the table is located, and the picture area A3 where the picture is located.
  • the text area A1 corresponds to the text category
  • the table area A2 corresponds to the table category
  • the picture area A3 corresponds to the picture category.
  • the electronic device may pre-train the u-net network, and use the trained u-net network as a pre-trained image semantic segmentation model.
  • the target area is determined from a plurality of areas.
  • the user can specify the target area from multiple areas. For example, if the user wants to recognize the text in the table area, the user can click on the table area of the picture to be processed.
  • the electronic device receives the user's click operation, the electronic device can determine the target area as the table area according to the position clicked by the user's click operation.
  • the electronic device may set one or more of the table area, the text area, or the picture area as the preset area in advance. After calling the pre-trained image semantic segmentation model to divide the to-be-processed picture into multiple regions, the electronic device may determine the region matching the preset region among the multiple regions as the target region. For example, if the preset area includes a table area and a text area, and the multiple areas include a table area, a text area, and a picture area, then the target area may be a table area and a text area. For another example, if the preset area is a text area, and the multiple areas include a table area, a text area, and a picture area, then the target area may be a text area.
  • character recognition processing is performed on the target area to recognize the characters in the target area.
  • the electronic device can perform text recognition processing on the text area A1 to recognize the text in the text area A1.
  • the electronic device can also output the characters in the target area. For example, the electronic device can save the text in the text area A1 in an editable form, such as word, TXT format, etc.
  • the pre-trained image semantic segmentation model can be called to divide the picture to be processed into multiple regions, so that when only the text in a certain one of the multiple regions needs to be recognized, that region can be determined as the target region;
  • when the text in several of the regions needs to be recognized, those regions can be determined as the target areas, and then text recognition processing is performed on the determined target areas. It can be seen that the image processing method provided by the embodiment of the present application can improve the flexibility of recognizing the text in a picture.
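  • As a minimal illustrative sketch of this flow, assuming a hypothetical segment_fn wrapper that returns a per-pixel class map and a hypothetical ocr_fn wrapper for character recognition (class indices 0 = background, 1 = text, 2 = table, 3 = picture, following the convention used later in this description):

```python
import numpy as np

# Class-index convention used later in this description.
TEXT, TABLE, PICTURE = 1, 2, 3

def process_picture(image, segment_fn, ocr_fn, target_categories=(TEXT, TABLE)):
    """Segment the picture, pick the target regions, and run text recognition.

    segment_fn(image) -> HxW array of class indices (hypothetical model wrapper).
    ocr_fn(region_pixels) -> recognized string (hypothetical OCR wrapper).
    """
    class_map = segment_fn(image)            # per-pixel categories
    results = {}
    for category in target_categories:       # regions selected as the target area
        mask = class_map == category
        if not mask.any():
            continue
        # Crop the bounding box of the region and recognize its text.
        ys, xs = np.where(mask)
        region = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        results[category] = ocr_fn(region)
    return results
```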
  • FIG. 4 is a schematic diagram of a second flow of the image processing method provided by an embodiment of the present application.
  • the flow may include:
  • the electronic device obtains a picture to be processed.
  • the category of the picture to be processed may include at least two categories. For example, if a picture includes text and pictures, the category of the picture may include a text category and a picture category, and the electronic device may determine the picture as a picture to be processed. For another example, if a picture includes text, picture, and table, the category of the picture may include text category, picture category, and table category, and the electronic device may determine the picture as a picture to be processed.
  • the embodiment of the present application uses a model to process the picture to be processed, and the model usually has some requirements on the attributes of the input picture; the picture to be processed should meet these requirements so that the model can process it normally.
  • the electronic device may preprocess the picture to make the picture meet the requirements of the model.
  • for example, the model may require the size of the input picture to be a preset size, such as 256×256. If the picture acquired by the electronic device is not of the preset size, the electronic device needs to adjust the size of the picture to the preset size to obtain the picture to be processed.
  • for example, the model may require that the pixel values of the input picture be normalized, i.e. each pixel value should be a real number in [0,1]. If the picture obtained by the electronic device is not normalized, the electronic device should normalize it to obtain the picture to be processed.
  • typically, the pixel values of a picture are expressed as integers in [0,255], which can be normalized by dividing by 255. It is understandable that normalization can have different definitions. For example, in another normalization definition, the pixel values should be real numbers in [-1,1]. For different normalization definitions, the normalization approach should be adjusted accordingly.
  • the picture to be processed may be a color picture or a grayscale picture.
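  • A minimal preprocessing sketch along these lines, using the 256×256 preset size and the [0,1] normalization mentioned above as example values:

```python
import numpy as np
from PIL import Image

PRESET_SIZE = (256, 256)   # example preset size from the description

def preprocess(path, grayscale=True):
    """Resize the picture to the preset size and normalize pixel values to [0, 1]."""
    img = Image.open(path).convert("L" if grayscale else "RGB")
    img = img.resize(PRESET_SIZE)
    arr = np.asarray(img, dtype=np.float32) / 255.0   # integers in [0,255] -> reals in [0,1]
    # For a [-1, 1] normalization convention instead, use: arr = arr * 2.0 - 1.0
    return arr
```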
  • the electronic device inputs the image to be processed into the encoding module to obtain a set of encoding feature maps.
  • the electronic device inputs the set of encoded feature maps into the decoding module to obtain a target image, where each pixel in the target image corresponds to a category, and the category includes a text category, a table category, and a picture category.
  • the electronic device can create an image semantic segmentation model in advance, train the image semantic segmentation model, and then use the trained model as the pre-trained image semantic segmentation model and save the pre-trained image semantic segmentation model.
  • the pre-trained image semantic segmentation model may include an encoding module and a decoding module.
  • the electronic device can input the picture to be processed into the encoding module to obtain a set of encoding feature maps.
  • the set of encoding feature maps includes multiple encoding feature maps, and the sizes of the multiple encoding feature maps may be the same or different.
  • the electronic device may input the set of coded feature maps to the decoding module to obtain a target image, where the target image may be a dual-channel image.
  • the pixel value of each pixel in the target image of one channel represents the category to which the pixel belongs
  • the pixel value of each pixel in the target image of the other channel represents the probability that the pixel belongs to a certain category.
  • for example, the pixel value of each pixel in the category channel of the target image can take a value from 0 to 3, where 0 represents the background category, 1 represents the text category, 2 represents the table category, and 3 represents the picture category.
  • the electronic device can obtain 4 channel feature maps after inputting the set of encoded feature maps into the decoding module. Then, the electronic device can obtain the target image according to the feature maps of these 4 channels.
  • the pixel value of each pixel in the feature map of each channel represents the probability that the pixel belongs to one of the four categories.
  • the sum of the pixel values of the pixels at the same position in the feature maps of the 4 channels is 1.
  • the pixel value of each pixel in the feature map of channel C1 represents the probability that the pixel belongs to the background category
  • the pixel value of each pixel in the feature map of channel C2 represents the probability that the pixel belongs to the text category.
  • the pixel value of each pixel in the feature map of channel C3 indicates the probability that the pixel belongs to the table category
  • the pixel value of each pixel in the feature map of channel C4 indicates the probability that the pixel belongs to the picture category.
  • the electronic device may use the maximum pixel value of the pixel at the same position in the feature map of the four channels as the pixel value of the pixel at the corresponding position of the target image of one of the channels. For example, suppose that the pixel values of pixels at the same position in the feature maps of 4 channels are 0.1, 0.1, 0.1, and 0.7 respectively. Then 0.7 can be used as the pixel value of the pixel at the corresponding position of the target image of one of the channels. And if 0.7 corresponds to the probability that the pixel belongs to the picture category, then the pixel value of the corresponding position of the target image of the other channel can take the value 3 (indicating that the pixel belongs to the picture category, and the probability of belonging to the picture category is 0.7) .
  • the number of categories is determined by the pre-training process. For example, if three categories are used in the pre-training process, namely a table category, a text category and a picture category, the category corresponding to the pixel in the target image is one of the three categories.
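  • A short sketch of deriving the dual-channel target image from the 4-channel probability maps described above: an argmax across channels gives the category channel, and the corresponding maximum probability gives the other channel:

```python
import numpy as np

def to_target_image(prob_maps):
    """prob_maps: array of shape (4, H, W); per-channel probabilities that sum to 1
    across channels (0=background, 1=text, 2=table, 3=picture).

    Returns a (2, H, W) target image: channel 0 holds the category index,
    channel 1 holds the probability of that category."""
    category = np.argmax(prob_maps, axis=0)     # e.g. 3 for the picture category
    probability = np.max(prob_maps, axis=0)     # e.g. 0.7 in the example above
    return np.stack([category.astype(np.float32), probability], axis=0)
```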
  • the electronic device uses the category corresponding to each pixel in the target image to divide the picture to be processed into multiple regions, where each region corresponds to a category.
  • the size of the target image is consistent with the size of the picture to be processed. Then, after obtaining the target image, the electronic device can use the category corresponding to each pixel in the target image to divide the picture to be processed into multiple regions. If the picture to be processed is as shown in Figure 3, then each pixel in the area of the target image corresponding to the text area A1 corresponds to the text category, each pixel in the area corresponding to the table area A2 corresponds to the table category, and each pixel in the area corresponding to the picture area A3 corresponds to the picture category, so that the picture to be processed can be divided into the text area A1, the table area A2, and the picture area A3.
  • the electronic device determines the target area from the multiple areas.
  • the user can specify the target area from multiple areas. For example, if the user wants to recognize the text in the table area, the user can click on the table area of the picture to be processed.
  • the electronic device receives the user's click operation, the electronic device can determine the target area as the table area according to the position clicked by the user's click operation.
  • the electronic device may set one or more of the table area, the text area, or the picture area as the preset area in advance. After calling the pre-trained image semantic segmentation model to divide the to-be-processed picture into multiple regions, the electronic device may determine the region matching the preset region among the multiple regions as the target region. For example, if the preset area includes a table area and a text area, and the multiple areas include a table area, a text area, and a picture area, then the target area may be a table area and a text area. For another example, if the preset area is a text area, and the multiple areas include a table area, a text area, and a picture area, then the target area may be a text area.
  • the electronic device performs character recognition processing on the target area to recognize the characters in the target area.
  • the electronic device can perform text recognition processing on the text area A1 to recognize the text in the text area A1.
  • the electronic device can also output the characters in the target area. For example, the electronic device can save the text in the text area A1 in an editable form, such as word, TXT format, etc.
  • the electronic device may also receive a save instruction from the user, and the save instruction is used to indicate the save format of the text in the recognized target area. Subsequently, the electronic device can save the recognized text in the target area according to the save instruction. For example, if the save instruction instructs to save the text in the recognized target area in the word format, then the electronic device may save the text in the recognized target area in the word format.
  • the encoding module may include a first encoding sub-module, a second encoding sub-module, a third encoding sub-module, a fourth encoding sub-module, and a fifth encoding sub-module
  • the process 202 may include:
  • the electronic device invokes the first encoding sub-module to perform encoding processing on the picture to be processed to obtain the first encoding feature map;
  • the electronic device invokes the second encoding sub-module to perform encoding processing on the first encoding feature map to obtain the second encoding feature map;
  • the electronic device invokes the third encoding sub-module to perform encoding processing on the second encoding feature map to obtain the third encoding feature map;
  • the electronic device invokes the fourth encoding submodule to perform encoding processing on the third encoding feature map to obtain the fourth encoding feature map;
  • the electronic device calls the fifth encoding submodule to perform encoding processing on the fourth encoding feature map to obtain the fifth encoding feature map;
  • the first encoding feature map, the second encoding feature map, the third encoding feature map, the fourth encoding feature map, and the fifth encoding feature map constitute a set of encoding feature maps.
  • encoding an image can be understood as performing down-sampling processing, pooling processing, or convolution processing on the image.
  • the electronic device may call the first encoding sub-module to perform convolution processing on the image to be processed (grayscale image) to obtain the first encoding feature map.
  • the first encoding sub-module may include multiple convolution kernels.
  • the number of convolution kernels can be 64 or 128, etc.
  • the size of the convolution kernel can be 7×7 or 8×8, etc.
  • the step size is 2 or 3, etc., and there is no specific limitation here.
  • the size of the first coding feature map is smaller than the size of the picture to be processed.
  • the electronic device may call the first encoding sub-module to perform convolution processing on the picture to be processed (color picture) to obtain the first encoding feature map.
  • the first encoding sub-module may include multiple convolution kernels.
  • the number of convolution kernels is 64, and the size of the convolution kernel can be determined according to the actual situation.
  • since the picture to be processed is a color picture, each convolution kernel can have 3 channels (one per color channel).
  • the convolution processing here computes the matrix inner product of the picture to be processed with each 3-channel convolution kernel to obtain the 64-dimensional first encoding feature map.
  • the electronic device can call the second encoding sub-module to perform pooling processing on the first encoded feature map to obtain a pooled feature map, where the pooling scale is 2×2. Then, the electronic device can call the bottleneck structure of the second encoding sub-module to process the pooled feature map multiple times, such as 2 or 3 times, to obtain the second encoding feature map.
  • the dimension of the first encoding feature map may be 64, and the dimension of the second encoding feature map may be 256.
  • the size of the second encoding feature map is smaller than the size of the first encoding feature map.
  • the electronic device can call the bottleneck structure of the third encoding submodule to process the second encoding feature map multiple times, such as 3 or 4 times, to obtain the third encoding feature map.
  • the dimension of the third coding feature map may be 512.
  • the size of the third encoding feature map is smaller than the size of the second encoding feature map.
  • the electronic device can call the bottleneck structure of the fourth encoding submodule to process the third encoding feature map multiple times, such as 4 or 5 times, to obtain the fourth encoding feature map.
  • the dimension of the fourth encoding feature map may be 1024.
  • the size of the fourth encoding feature map is smaller than the size of the third encoding feature map.
  • the electronic device can call the bottleneck structure of the fifth encoding submodule to process the fourth encoding feature map multiple times, such as 3 or 4 times, to obtain the fifth encoding feature map.
  • the dimension of the fifth coding feature map may be 512.
  • the size of the fifth encoding feature map is smaller than the size of the fourth encoding feature map.
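  • The description does not spell out the internals of the bottleneck structure; the sketch below assumes a ResNet-style residual bottleneck (1×1, 3×3, 1×1 convolutions plus a skip path), which is consistent with the channel counts given above, and shows how the first two encoding sub-modules could be composed:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Assumed ResNet-style bottleneck: 1x1 reduce, 3x3, 1x1 expand, plus a skip path."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        mid = out_ch // 4
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Match the skip path when the channel count or spatial size changes.
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# First two encoding sub-modules, following the dimensions given in this description:
# a 7x7 / stride-2 convolution with 64 kernels, then 2x2 pooling and bottlenecks up to 256 channels.
first_encode = nn.Sequential(nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True))
second_encode = nn.Sequential(nn.MaxPool2d(2), Bottleneck(64, 256), Bottleneck(256, 256))
```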
  • the decoding module may include a first decoding sub-module, a second decoding sub-module, a third decoding sub-module, a fourth decoding sub-module, and a fifth decoding sub-module.
  • the process 203 may include:
  • the electronic device calls the first decoding sub-module to decode the fifth encoded feature map to obtain the first decoded feature map, and performs fusion processing on the first decoded feature map and a target feature map determined according to the fourth encoded feature map to obtain the first fusion feature map;
  • the electronic device calls the second decoding submodule to decode the first fusion feature map to obtain a second decoded feature map, and perform fusion processing on the second decoded feature map and the third encoded feature map to obtain the second fusion feature map;
  • the electronic device calls the third decoding submodule to decode the second fusion feature map to obtain a third decoded feature map, and perform fusion processing on the third decoded feature map and the second encoded feature map to obtain the third fusion feature map;
  • the electronic device calls the fourth decoding sub-module to decode the third fusion feature map to obtain a fourth decoded feature map, and perform fusion processing on the fourth decoded feature map and the first encoded feature map to obtain the fourth fusion feature map;
  • the electronic device calls the fifth decoding sub-module to decode the fourth fusion feature map to obtain a fifth decoded feature map, performs fusion processing on the fifth decoded feature map and the picture to be processed to obtain the fifth fusion feature map, and determines the target image according to the fifth fusion feature map.
  • decoding an image can be understood as performing upsampling, de-pooling, deconvolution, or convolution processing on the image.
  • the electronic device may call the first decoding submodule to perform deconvolution processing or up-sampling processing on the fifth encoded feature map to obtain the first decoded feature map.
  • the size of the up-sampling can be determined according to the actual situation.
  • the convolution kernel corresponding to the deconvolution can also be determined according to the actual situation.
  • the electronic device may perform fusion processing on the first decoded feature map and the target feature map determined according to the fourth encoded feature map to obtain the first fused feature map.
  • the number of channels of the first fusion feature map is the sum of the number of channels of the target feature map and the number of channels of the first decoded feature map.
  • the target feature map determined according to the fourth encoding feature map is obtained by performing convolution processing on the fourth encoding feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device can call the second decoding sub-module to perform convolution and up-sampling processing on the first fused feature map to obtain a second decoded feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device may call the second decoding sub-module to perform fusion processing on the second decoding feature map and the third encoding feature map to obtain the second fusion feature map.
  • the number of channels of the second fusion feature map is the sum of the number of channels of the second decoded feature map and the number of channels of the third encoded feature map.
  • the electronic device can call the third decoding sub-module to perform convolution processing on the second fused feature map to obtain a third decoded feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device may call the third decoding sub-module to perform fusion processing on the third decoding feature map and the second encoding feature map to obtain a third fusion feature map.
  • the number of channels in the third fusion feature map is the sum of the number of channels in the third decoded feature map and the number of channels in the second encoded feature map.
  • the electronic device may call the fourth decoding sub-module to perform convolution processing on the third fused feature map to obtain a fourth decoded feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device may call the fourth decoding sub-module to perform fusion processing on the fourth decoding feature map and the first encoding feature map to obtain a fourth fusion feature map.
  • the number of channels of the fourth fusion feature map is the sum of the number of channels of the fourth decoded feature map and the number of channels of the first encoded feature map.
  • the electronic device may call the fifth decoding sub-module to perform convolution processing on the fourth fused feature map to obtain the fifth decoded feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device may call the fifth decoding sub-module to perform fusion processing on the fifth decoded feature map and the picture to be processed to obtain the fifth fused feature map.
  • the number of channels of the fifth fusion feature map is the sum of the number of channels of the fifth decoded feature map and the number of channels of the picture to be processed.
  • the electronic device may also call the fifth decoding sub-module to determine the target image according to the fifth fusion feature map.
  • the electronic device calling the fifth decoding sub-module to determine the target image according to the fifth fusion feature map may include: the electronic device calling the fifth decoding sub-module to perform convolution processing on the fifth fusion feature map to obtain a sixth decoding feature map.
  • the convolution kernel corresponding to the convolution processing can be determined according to the actual situation.
  • the electronic device invokes the fifth decoding submodule to perform convolution processing on the sixth decoding feature map to obtain a seventh decoding feature map.
  • the seventh decoding feature map is a feature map of 4 channels.
  • the pixel value of each pixel in the feature map of each channel is used to represent the probability that the pixel belongs to one of the four categories.
  • the feature map of each channel corresponds to a category.
  • the electronic device may call the fifth decoding sub-module to determine the target image according to the seventh decoding feature map.
  • the target image is a dual-channel image
  • the electronic device calling the fifth decoding sub-module to determine the target image according to the seventh decoding feature map may include: the electronic device uses the maximum pixel value among the pixels at the same position in the 4-channel seventh decoding feature map as the pixel value of the pixel at the corresponding position of the target image in one channel. For example, suppose that the pixel values of a pixel at the same position in the 4-channel seventh decoding feature map are 0.1, 0.1, 0.1, and 0.7, respectively; then 0.7 can be used as the pixel value of the pixel at the corresponding position of the target image in that channel. And if 0.7 corresponds to the probability that the pixel belongs to the picture category, then the pixel value of the corresponding position in the other channel of the target image can take the value 3 (indicating that the pixel belongs to the picture category, with a probability of 0.7).
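  • A minimal sketch of one decoding sub-module as described above: upsample the incoming feature map, fuse it with the corresponding encoding feature map by channel concatenation (so the channel counts add up), and convolve. The final 1×1 convolution followed by a softmax (the softmax is an assumption here) yields a 4-channel map whose per-pixel values sum to 1, matching the description of the seventh decoding feature map; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class DecodeStep(nn.Module):
    """One decoding sub-module: upsample, concatenate with the skip feature map, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.up(x)                      # decoded feature map
        x = torch.cat([x, skip], dim=1)     # fusion: channel counts add up
        return self.relu(self.conv(x))

# Final 1x1 convolution producing the 4-channel map (background/text/table/picture),
# followed by a softmax so per-pixel probabilities across channels sum to 1.
head = nn.Sequential(nn.Conv2d(64, 4, 1), nn.Softmax(dim=1))
```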
  • before the process 201, the method may further include:
  • the electronic device obtains a sample picture, the sample picture includes a plurality of sample areas, and each sample area corresponds to a category;
  • the electronic device obtains the image semantic segmentation model to be trained
  • the electronic device uses the sample pictures to train the image semantic segmentation model to be trained.
  • the electronic device can collect multiple screenshots of a mobile phone, and then annotate each pixel of the multiple screenshots with an image annotation tool to obtain a marked screenshot, which can be used as a sample picture.
  • each pixel can be marked as one of four categories: background category, text category, picture category and table category.
  • the electronic device can mark each pixel of the multiple screenshots with a different color. For example, if a certain pixel of a screenshot is a background category, the electronic device can mark it as black (RGB[0,0,0]); if a certain pixel of a screenshot is a text category, the electronic device can Mark it as red (RGB[255,0,0]); if a certain pixel of a screenshot is a picture category, the electronic device can mark it as green (RGB[0,255,0]); if a certain screenshot is If a certain pixel is a table type, the electronic device can mark it as blue (RGB[0,0,255]).
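  • Under the color convention above, a labeled screenshot can be converted into a per-pixel class-index map for training; a sketch assuming exact RGB matches and the class indices used earlier (0 = background, 1 = text, 2 = table, 3 = picture):

```python
import numpy as np

# Color convention from the description.
COLOR_TO_CLASS = {
    (0, 0, 0): 0,       # background -> black
    (255, 0, 0): 1,     # text       -> red
    (0, 0, 255): 2,     # table      -> blue
    (0, 255, 0): 3,     # picture    -> green
}

def mask_from_annotation(rgb_mask):
    """rgb_mask: (H, W, 3) uint8 annotation image; returns an (H, W) class-index map."""
    label = np.zeros(rgb_mask.shape[:2], dtype=np.int64)
    for color, cls in COLOR_TO_CLASS.items():
        label[np.all(rgb_mask == np.array(color, dtype=np.uint8), axis=-1)] = cls
    return label
```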
  • the electronic device can also select parts of the text areas (the areas where pixels marked as the text category are located), the table areas (the areas where pixels marked as the table category are located), and the picture areas (the areas where pixels marked as the picture category are located) from the marked screenshots, and then randomly combine them to form multiple combined pictures; these combined pictures can also be used as sample pictures.
  • the number of combined pictures may be the same as or different from the number of marked screenshots.
  • the electronic device can divide the multiple sample pictures into a training set, a verification set, and a test set.
  • the ratio of training set, validation set and test set can be 3:1:1.
  • the training set may include 6000 sample pictures
  • the validation set and test set may include 2000 sample pictures respectively.
  • the electronic device can obtain the image semantic segmentation model to be trained, set the training parameters, and select the learning rate.
  • the image semantic segmentation model to be trained may include an encoding module and a decoding module.
  • the encoding module may include a first encoding sub-module, a second encoding sub-module, a third encoding sub-module, a fourth encoding sub-module, and a fifth encoding sub-module.
  • the decoding module may include a first decoding sub-module, a second decoding sub-module, a third decoding sub-module, a fourth decoding sub-module, and a fifth decoding sub-module.
  • the learning rate can be 1×10⁻⁵.
  • the maximum number of samples drawn in a single batch can be 64.
  • the number of training iterations can be 300, with verification performed and the saved model output updated once per iteration.
  • the electronic device can use the sample pictures to train the image semantic segmentation model to be trained, adjusting its parameters until the model converges, so as to obtain the pre-trained image semantic segmentation model.
  • the electronic device can also perform data enhancement operations such as rotation, zooming, flipping, translation, noise addition, and blurring on the sample pictures to obtain data-enhanced images, and then use the data-enhanced images to train the image semantic segmentation model to be trained, so as to improve the recognition ability and generalization ability of the model.
  • the rotation angle can be within [-0.1, 0.1] rad.
  • the zoom factor can be [0.8,1.2].
  • the cross-entropy loss function can be used as the loss function of the sample image.
  • the formula of the cross-entropy loss function may be: L_log(Y, P) = -(1/N) · Σ_i Σ_k y_{i,k} · log(P_{i,k}), summing over the N pixels i and the categories k, where:
  • L log (Y, P) represents the loss value of the sample image
  • y_{i,k} represents the true category of the i-th pixel of the sample picture, expressed as a one-hot indicator over the categories k
  • P_{i,k} represents the probability that the i-th pixel of the sample picture belongs to the k-th category
  • N represents the number of pixels.
  • the loss value corresponding to the image semantic segmentation model to be trained may be the average value of the sum of the loss values of all sample pictures.
  • the electronic device can calculate the gradient of each parameter according to the loss value corresponding to the image semantic segmentation model to be trained, and then update the parameters of the entire network through the back propagation algorithm.
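  • A small sketch of the per-picture cross-entropy loss given above, with y_{i,k} treated as a one-hot indicator of the true category and P_{i,k} the predicted probability:

```python
import numpy as np

def cross_entropy_loss(true_classes, probs, eps=1e-12):
    """true_classes: (N,) int class index per pixel; probs: (N, K) predicted probabilities.

    Returns L_log(Y, P) = -(1/N) * sum_i sum_k y_{i,k} * log(P_{i,k})."""
    n = true_classes.shape[0]
    picked = probs[np.arange(n), true_classes]    # P_{i,k} where y_{i,k} = 1
    return float(-np.mean(np.log(picked + eps)))
```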
  • the verification set can be used to verify the image semantic segmentation model to be trained, the evaluation function is used to calculate the evaluation value and the evaluation loss value of this iteration, and the saved image semantic segmentation model is then output.
  • the sum of the evaluation value and the evaluation loss value can be 1.
  • the formula of the evaluation function for a single sample picture can be: IoU = |X ∩ Y| / |X ∪ Y|, where:
  • IoU represents the evaluation value of the sample image
  • X represents the prediction result of the sample image by the image semantic segmentation model
  • Y represents the actual annotation result of the sample image
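  • A small sketch of the IoU evaluation for a single sample picture, treating X and Y as the sets of pixels predicted and labeled as a given category; the evaluation loss can then be taken as 1 - IoU, so the two sum to 1 as noted above:

```python
import numpy as np

def iou(pred, target, category):
    """pred, target: (H, W) class-index maps; returns |X ∩ Y| / |X ∪ Y| for one category."""
    x = pred == category
    y = target == category
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 1.0                      # nothing predicted or labeled for this category
    return float(np.logical_and(x, y).sum() / union)

def evaluation_loss(pred, target, category):
    return 1.0 - iou(pred, target, category)
```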
  • the categories corresponding to pixels are not limited to the above categories, and may also be other categories.
  • the category corresponding to the pixel may also include the bubble category.
  • when the screenshots collected by the electronic device include document pictures in Word, PPT, or PDF formats, the pictures usually include headers, footers, and titles. Therefore, the categories corresponding to pixels can also include a header category, a footer category, and a title category.
  • the target area may include a table area, and after the process 206, it may also include:
  • the electronic device recognizes the number of rows and columns of the table in the table area
  • the electronic device generates a table based on the number of rows and columns
  • the electronic device fills the text into the table.
  • the electronic device can recognize the number of rows and columns of the table in the table area after recognizing the text in the table area, and generate the table according to the number of rows and columns of the table, and then Fill the text into the table.
  • the format of the table may be an editable format such as excel format, so that the user can edit the table.
  • editing operations can include operations such as copy, paste, and delete.
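  • A minimal sketch of generating an editable table from the recognized number of rows and columns and the recognized cell text; CSV is used here purely for illustration (the description mentions Excel as one editable format):

```python
import csv

def build_table(rows, cols, cell_texts, path="table.csv"):
    """cell_texts: recognized strings in row-major order; missing cells stay empty."""
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for idx, text in enumerate(cell_texts[: rows * cols]):
        grid[idx // cols][idx % cols] = text
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(grid)    # the user can open and edit the resulting file
    return grid
```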
  • the target area also includes a text area, and after "filling text into the table", it also includes:
  • the electronic device typesets the form and the text recognized from the text area according to the typesetting format of the picture to be processed
  • the electronic device outputs the typeset table and the text recognized from the text area.
  • the electronic device can also typeset and combine the recognized text and the tables according to the typesetting format of the picture to be processed, so that the typesetting format of the final output result is consistent with the typesetting format of the picture to be processed, and the user does not need to manually typeset the recognized text.
  • the electronic device outputs the typeset table and the text recognized from the text area may include:
  • the electronic device displays an editing interface, which is an interface for users to perform editing operations
  • the electronic device outputs the typeset table and the text recognized from the text area to the editing interface.
  • the editing interface may be a word document editing interface, a memo editing interface, a short message editing interface, and other interfaces that can be used by the user to perform editing operations.
  • for example, the electronic device can automatically start the Word document application and enter the Word document editing interface, that is, the electronic device displays the editing interface. Then, the electronic device can output the typeset table and the text recognized from the text area to the editing interface, so that the user can perform corresponding editing operations on the typeset table and the recognized text on the editing interface.
  • editing operations can include operations such as copy, paste, delete, modify, and add.
  • for a table, the user can add a new row or column and enter new content in the newly added row or column, to finally form a new table.
  • for text, the user can add other text, or delete some text, and so on.
  • one of the multiple areas may be a picture area. After "filling text into the table”, it may also include:
  • the electronic device obtains the picture in the picture area
  • the electronic device typesets the form and the text recognized from the text area according to the typesetting format of the picture to be processed", which can include:
  • the electronic device typesets tables, pictures and text recognized from the text area according to the typesetting format of the pictures to be processed;
  • the electronic device outputs the typeset form and the text recognized from the text area", which can include:
  • the electronic device outputs the typeset tables, pictures, and text recognized from the text area.
  • the electronic device can also crop out the picture area, and typeset and combine the recognized text, tables, and the cropped pictures according to the typesetting format of the picture to be processed, so that the typesetting format of the final output result is consistent with that of the picture to be processed, and the user does not need to manually typeset the recognized text, tables, and pictures into the same format as the picture to be processed.
  • the electronic device outputs the typeset tables, pictures, and text recognized from the text area may include:
  • the electronic device displays an editing interface, which is an interface for users to perform editing operations
  • the electronic device outputs the typeset tables, pictures, and text recognized from the text area to the editing interface.
  • for example, the electronic device can automatically start the Word document application and enter the Word document editing interface, that is, the electronic device displays the editing interface. Then, the electronic device can output the typeset tables, pictures, and text recognized from the text area to the editing interface, so that the user can perform corresponding editing operations on the typeset tables, pictures, and recognized text on the editing interface.
  • the editing operations on the pictures may include: zooming in, zooming out, flipping, and deleting the pictures.
  • the process 206 may include:
  • the electronic device obtains the pre-trained text recognition model
  • the electronic device invokes the pre-trained character recognition model to perform character recognition processing on the target area, so as to recognize the characters in the target area.
  • the electronic device can obtain a text recognition model to be trained, and train the text recognition model to be trained to obtain a trained model, and the trained model can be used as a pre-trained text recognition model.
  • the electronic device can call the pre-trained character recognition model to perform character recognition processing on the target area to recognize the characters in the target area, thereby improving the accuracy of character recognition.
  • before the electronic device uses the text recognition model to perform text recognition processing on the target area to recognize the text in the target area, the method may further include:
  • the electronic device determines whether the length of the picture to be processed is greater than a preset length;
  • if so, the electronic device performs cropping processing on the picture to be processed to crop the picture to be processed into multiple sub-pictures, wherein each sub-picture corresponds to a region;
  • in this case, the electronic device using the character recognition model to perform character recognition processing on the target area to recognize the text in the target area can include:
  • the electronic device uses the character recognition model to perform character recognition processing on each sub-picture, so as to recognize the characters in each sub-picture.
  • the electronic device can determine whether the length of the image to be processed is greater than the preset length, and if the length of the image to be processed is greater than the preset length, the electronic device can perform crop processing on the image to be processed , To crop the picture to be processed into multiple sub-pictures, where each sub-picture corresponds to a region. After obtaining multiple sub-pictures, the electronic device can use the character recognition model to perform character recognition processing on each sub-picture.
  • in other words, however many regions the pre-trained image semantic segmentation model divides the picture to be processed into, the electronic device can crop out that many corresponding sub-pictures from the picture to be processed.
  • the preset length can be determined according to the length of the picture supported by the text recognition model. For example, if the length of the picture supported by the text recognition model is 256 pixels, the preset length is 256 pixels.
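  • A sketch of this cropping step, assuming the picture is cropped along its region boundaries so that each sub-picture contains one region, with the 256-pixel preset length from the example above:

```python
import numpy as np

PRESET_LENGTH = 256   # example: picture length supported by the character recognition model

def crop_into_subpictures(image, class_map):
    """image: (H, W[, C]); class_map: (H, W) region categories from the segmentation model.

    Returns one sub-picture per region when the picture is longer than PRESET_LENGTH."""
    if image.shape[0] <= PRESET_LENGTH:
        return [image]
    sub_pictures = []
    for category in np.unique(class_map):
        if category == 0:
            continue                      # skip the background region
        ys, xs = np.where(class_map == category)
        sub_pictures.append(image[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
    return sub_pictures
```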
  • the electronic device may also use a trained character recognition model obtained from elsewhere as a pre-trained character recognition model.
  • the process 205 may include:
  • the electronic device receives the biometric information input by the user
  • the electronic device determines the area corresponding to the biometric information among the multiple areas as the target area.
  • the biometric information may include fingerprint information, voiceprint information, facial feature information, iris information, and so on.
  • the electronic device may preset the correspondence between the biometric information and the area. For example, assume that the fingerprint J1 corresponds to the text area, the fingerprint J2 corresponds to the table area, and the fingerprint J3 corresponds to the picture area.
  • after the electronic device divides the picture to be processed into three areas, namely a text area, a table area, and a picture area, the electronic device can receive fingerprint information input by the user and determine the area corresponding to the fingerprint information as the target area. For example, if the fingerprint information input by the user is fingerprint J1, the target area is the text area.
  • the electronic device can separate the picture area and the table area from the picture to be processed, and fill the removed areas in the picture to be processed with white (RGB[255,255,255]) to obtain a text picture. Subsequently, the electronic device can perform word recognition processing on the text picture to recognize the words in the text picture. Then, the electronic device can extract the recognized text and save it in a text format.
  • the text format is an editable format, so that the user can edit the recognized text.
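  • A sketch of separating out the picture and table areas and filling them with white (RGB[255,255,255]) to obtain the text picture, using the class-index convention from earlier (1 = text):

```python
import numpy as np

TEXT_CATEGORY = 1   # class-index convention used earlier

def extract_text_picture(image_rgb, class_map):
    """image_rgb: (H, W, 3) uint8 picture; class_map: (H, W) per-pixel categories.

    Pixels outside the text regions are filled with white so only the text remains."""
    text_picture = image_rgb.copy()
    text_picture[class_map != TEXT_CATEGORY] = (255, 255, 255)
    return text_picture
```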
  • the electronic device can also perform text recognition on the table area to recognize the text in the table area.
  • the electronic device can recognize the number of rows and columns of the table in the table area, and generate a table in an editable format according to the number of rows and columns of the table. After that, the electronic device can fill the text recognized from the table area into the table, so that the user can edit the table.
  • the electronic device can also typeset and combine the text, table, and separated pictures recognized from the text image according to the typeset format of the image to be processed, so that the typeset format of the final output result is consistent with the typeset format of the image to be processed.
  • the text and tables in the final output result can be edited, and the pictures in the final output result can be stretched and flipped.
  • the electronic device may also provide a text recognition interface in the picture area, so that when the text in the picture area needs to be recognized, the recognition of the text in the picture area can be triggered.
  • FIG. 5 is a schematic diagram of the network structure of the image semantic segmentation model provided by an embodiment of the application.
  • the image semantic segmentation model includes an encoding module and a decoding module.
  • the encoding module is mainly used for feature extraction, extracting feature maps of different resolutions from the input data.
  • the decoding module is mainly used for up-sampling, and each up-sampling is fused with the corresponding feature map output by the encoding module.
  • sample pictures can be obtained to train the model, so as to finally obtain a trained model, that is, a pre-trained image semantic segmentation model.
  • the pre-trained image semantic segmentation model can be called to divide the image to be processed into multiple regions.
  • the size of the picture to be processed can be 300×300, and the dimension (number of channels) is 1.
  • the picture to be processed provided in the embodiment of the present application may generally include three areas: a table, a picture, and a text, as shown in FIG. 3.
  • the picture to be processed may further include a background area. As shown in FIG. 3, the background area may be an area other than the text area A1, the table area A2, and the picture area A3 in the picture G1 to be processed.
  • "calling the pre-trained image semantic segmentation model to divide the picture G1 to be processed into multiple regions” may include:
  • the electronic device can input the picture to be processed into the pre-trained image semantic segmentation model, and use the convolution kernel cv1 to perform convolution processing on the picture to be processed to obtain the first encoded feature map, wherein the dimension of the first encoded feature map is 64.
  • the number of convolution kernel cv1 is 64
  • the size of convolution kernel cv1 is 7×7
  • the step size is 2.
  • the size of the first coding feature map is smaller than the size of the picture to be processed.
  • the electronic device may use a filter with a size of 2×2 to pool the first encoded feature map to obtain the feature map F1.
  • the dimension of the feature map F1 is 64.
  • the size of the feature map F1 is smaller than the size of the first coding feature map.
  • the electronic device can use the bottleneck structure b1 to process the feature map F1 to obtain the feature map F2.
  • the dimension of the feature map F2 is 256.
  • the size of the feature map F2 is the same as the size of the feature map F1.
  • the electronic device can use the bottleneck structure b2 to process the feature map F2 to obtain the second encoded feature map.
  • the dimension of the second coding feature map is 256.
  • the size of the second encoding feature map is the same as the size of the feature map F2.
  • the electronic device can use the bottleneck structure b3 to process the second encoded feature map to obtain the feature map F3.
  • the dimension of the feature map F3 is 256.
  • the size of the feature map F3 is smaller than the size of the second encoding feature map.
  • the electronic device can use the bottleneck structure b4 to process the feature map F3 to obtain the feature map F4.
  • the dimension of the feature map F4 is 512.
  • the size of the feature map F4 is the same as the size of the feature map F3.
  • the electronic device can use the bottleneck structure b5 to process the feature map F4 to obtain the feature map F5.
  • the dimension of the feature map F5 is 512.
  • the size of the feature map F5 is the same as the size of the feature map F4.
  • the electronic device can use the bottleneck structure b6 to process the feature map F5 to obtain the third encoded feature map.
  • the dimension of the third coding feature map is 512.
  • the size of the third encoding feature map is the same as the size of the feature map F5.
  • the electronic device can use the bottleneck structure b7 to process the third encoded feature map to obtain the feature map F6.
  • the dimension of the feature map F6 is 512.
  • the size of the feature map F6 is smaller than the size of the third encoding feature map.
  • the electronic device can use the bottleneck structure b8 to process the feature map F6 to obtain the feature map F7.
  • the dimension of the feature map F7 is 1024.
  • the size of the feature map F7 is the same as the size of the feature map F6.
  • the electronic device can use the bottleneck structure b9 to process the feature map F7 to obtain the feature map F8.
  • the dimension of the feature map F8 is 1024.
  • the size of the feature map F8 is the same as the size of the feature map F7.
  • the electronic device can use the bottleneck structure b10 to process the feature map F8 to obtain the feature map F9.
  • the dimension of the feature map F9 is 1024.
  • the size of the feature map F9 is the same as the size of the feature map F8.
  • the electronic device can use the bottleneck structure b11 to process the feature map F9 to obtain the feature map F10.
  • the dimension of the feature map F10 is 1024.
  • the size of the feature map F10 is the same as the size of the feature map F9.
  • the electronic device can use the bottleneck structure b12 to process the feature map F10 to obtain the fourth encoded feature map. Wherein, the dimension of the fourth coding feature map is 1024.
  • the size of the fourth encoding feature map is the same as the size of the feature map F10.
  • the electronic device may use the convolution kernel cv2 to perform convolution processing on the fourth encoded feature map to obtain the target feature map.
  • the dimension of the target feature map is 512.
  • the size of the target feature map is the same as the size of the feature map F10.
  • the electronic device can use the bottleneck structure b13 to process the fourth encoded feature map to obtain the feature map F11.
  • the dimension of the feature map F11 is 1024.
  • the size of the feature map F11 is smaller than the size of the feature map F10.
  • the electronic device can use the bottleneck structure b14 to process the feature map F11 to obtain the feature map F12.
  • the dimension of the feature map F12 is 2048.
  • the size of the feature map F12 is the same as the size of the feature map F11.
  • the electronic device can use the bottleneck structure b15 to process the feature map F12 to obtain the feature map F13.
  • the dimension of the feature map F13 is 2048.
  • the size of the feature map F13 is the same as the size of the feature map F12.
  • the electronic device can use the bottleneck structure b16 to process the feature map F13 to obtain the feature map F14.
  • the dimension of the feature map F14 is 2048.
  • the size of the feature map F14 is the same as the size of the feature map F13.
  • the electronic device can use the convolution kernel cv3 to perform convolution processing on the feature map F14 to obtain the fifth encoded feature map.
  • the dimension of the fifth encoded feature map is 512.
  • the size of the fifth encoding feature map is the same as the size of the feature map F14.
  • the number of convolution kernels cv3 is 512.
  • the size of convolution kernel cv3 is 1 × 1, and the stride is 1 (a minimal sketch of this channel-reduction convolution is given below).
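  • as a concrete illustration of the convolution kernel cv3 described above (512 kernels of size 1 × 1 with stride 1, reducing the 2048-dimensional feature map F14 to the 512-dimensional fifth encoded feature map), a minimal sketch follows; the spatial size used in the example is illustrative only.

```python
import torch
import torch.nn as nn

# cv3: 512 kernels of size 1x1, stride 1 -> channel reduction only, spatial size unchanged.
cv3 = nn.Conv2d(in_channels=2048, out_channels=512, kernel_size=1, stride=1)

feature_map_f14 = torch.randn(1, 2048, 10, 10)   # batch, channels, H, W (H/W illustrative)
fifth_encoded = cv3(feature_map_f14)

print(fifth_encoded.shape)  # torch.Size([1, 512, 10, 10]) -- same size, 512 dimensions
```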
  • the electronic device may perform up-sampling (uc1) processing on the fifth encoded feature map to obtain the first decoded feature map.
  • the dimension of the first decoding feature map is 512.
  • the size of the first decoding feature map is larger than the size of the fifth encoding feature map.
  • the multiple of upsampling can be 2 times, 4 times, etc. (depending on the specific situation).
  • the electronic device may perform fusion processing on the first decoded feature map and the target feature map to obtain the first fused feature map.
  • the dimension of the first fusion feature map is 1024.
  • the size of the first fusion feature map is the same as the size of the first decoded feature map.
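  • the decode path repeatedly upsamples a decoded feature map and fuses it with an encoder feature map; the channel counts above (512 + 512 → 1024) are consistent with channel-wise concatenation. The sketch below assumes bilinear upsampling by a factor of 2 and concatenation along the channel axis; both choices are assumptions, since the text leaves the upsampling multiple and the exact fusion operator open.

```python
import torch
import torch.nn.functional as F

def upsample_and_fuse(decoded, skip, scale_factor=2):
    """Upsample `decoded`, then fuse it with the encoder feature map `skip`
    by concatenating along the channel dimension (assumed fusion operator)."""
    up = F.interpolate(decoded, scale_factor=scale_factor,
                       mode="bilinear", align_corners=False)
    return torch.cat([up, skip], dim=1)

fifth_encoded = torch.randn(1, 512, 10, 10)        # illustrative spatial sizes
target_feature_map = torch.randn(1, 512, 20, 20)   # derived from the fourth encoded map via cv2

first_fused = upsample_and_fuse(fifth_encoded, target_feature_map)
print(first_fused.shape)  # torch.Size([1, 1024, 20, 20]) -- 512 + 512 channels
```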
  • the electronic device may use the convolution kernel cv4 to perform convolution processing on the first fusion feature map to obtain the feature map F15.
  • the dimension of the feature map F15 is 512.
  • the size of the feature map F15 and the size of the first fusion feature map may be the same or different.
  • the number of convolution kernels cv4 is 256.
  • the size of convolution kernel cv4 is 1 × 1 or 3 × 3, and the stride is 1.
  • the electronic device can perform up-sampling (uc2) processing on the feature map F15 to obtain the second decoded feature map.
  • the dimension of the second decoding feature map is 512.
  • the size of the second decoded feature map is larger than the size of the feature map F15.
  • the electronic device may perform fusion processing on the second decoded feature map and the third encoded feature map to obtain the second fused feature map.
  • the dimension of the second fusion feature map is 1024.
  • the size of the second fusion feature map may be the same as the size of the second decoded feature map.
  • the electronic device may use the convolution kernel cv5 to perform convolution processing on the second fusion feature map to obtain the feature map F16.
  • the dimension of the feature map F16 is 256.
  • the size of the feature map F16 and the size of the second fusion feature map may be the same or different, depending on the size of the convolution kernel.
  • the electronic device can perform up-sampling (uc3) processing on the feature map F16 to obtain the third decoded feature map.
  • the dimension of the third decoding feature map is 256.
  • the size of the third decoded feature map is larger than the size of the feature map F16.
  • the electronic device may perform fusion processing on the third decoded feature map and the second encoded feature map to obtain a third fused feature map.
  • the dimension of the third fusion feature map is 512.
  • the size of the third fusion feature map is the same as the size of the third decoded feature map.
  • the electronic device may use the convolution kernel cv6 to perform convolution processing on the third fusion feature map to obtain the feature map F17.
  • the dimension of the feature map F17 is 128.
  • the size of the feature map F17 and the size of the third fusion feature map may be the same or different, depending on the actual situation.
  • the electronic device can perform up-sampling (uc4) processing on the feature map F17 to obtain a fourth decoded feature map.
  • the dimension of the fourth decoding feature map is 128.
  • the size of the fourth decoded feature map is larger than the size of the feature map F17.
  • the electronic device may perform fusion processing on the fourth decoded feature map and the first encoded feature map to obtain a fourth fused feature map.
  • the dimension of the fourth fusion feature map is 192.
  • the size of the fourth fusion feature map is the same as the size of the fourth decoded feature map.
  • the electronic device may use the convolution kernel cv7 to perform convolution processing on the fourth fusion feature map to obtain the feature map F18.
  • the dimension of the feature map F18 is 64.
  • the size of the feature map F18 and the size of the fourth fusion feature map may be the same or different, depending on the actual situation.
  • the electronic device can perform up-sampling (uc5) processing on the feature map F18 to obtain the fifth decoded feature map.
  • the dimension of the fifth decoding feature map is 64.
  • the size of the fifth decoded feature map is larger than the size of the feature map F18.
  • the electronic device may perform fusion processing on the fifth decoded feature map and the picture to be processed to obtain the fifth fused feature map.
  • the dimension of the fifth fusion feature map is 65.
  • the size of the fifth fusion feature map is the same as the size of the fifth decoded feature map.
  • the electronic device can use the convolution kernel cv8 to perform convolution processing on the fifth fusion feature map to obtain the feature map F19.
  • the dimension of the feature map F19 is 32.
  • the size of the feature map F19 and the size of the fifth fusion feature map may be the same or different.
  • the electronic device can use the convolution kernel cv9 to perform convolution processing on the feature map F19 to obtain the feature map F20.
  • the dimension of the feature map F20 is 4.
  • the size of the feature map F20 is the same as the size of the picture to be processed.
  • the pixel value of each pixel in the feature map F20 of each dimension represents the probability that the pixel belongs to a certain category.
  • the feature map F20 of each dimension corresponds to one category.
  • the feature map F20 of the first dimension may correspond to the background category, and the pixel value of each pixel in the feature map of this dimension represents the probability that the pixel belongs to the background category.
  • the feature map F20 of the second dimension can correspond to the text category
  • the feature map F20 of the third dimension can correspond to the table category.
  • the feature map F20 in the fourth dimension may correspond to the picture category.
  • the sum of the pixel values at the same position in the feature map of 4 dimensions is 1.
  • the electronic device can determine the maximum pixel value at the same position in the feature map F20 of 4 dimensions.
  • the electronic device can determine the target image according to the maximum pixel value.
  • the dimension of the target image is 2.
  • the pixel value of the pixel in the target image in one dimension is the maximum pixel value obtained by the electronic device.
  • the pixel value of the pixel in the target image in another dimension indicates the category to which the pixel belongs, and the value can be 0 to 3, corresponding to the background category, text category, table category, and picture category, respectively. For example, if the electronic device determines that the pixel value of a certain pixel in the feature map F20 corresponding to the table category is the largest, then the electronic device can use this pixel value as the pixel value of the pixel at the corresponding position of the target image, and use the table category as the category corresponding to this pixel (a minimal sketch of this step is given below).
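  • a minimal sketch of the step described above, which turns the 4-dimensional probability map F20 into the 2-dimensional target image (one dimension holding the maximum probability, the other the index of the winning category: 0 background, 1 text, 2 table, 3 picture), assuming F20 is a probability map of shape (4, H, W):

```python
import numpy as np

# f20: per-category probabilities, shape (4, H, W); channels are assumed to be
# ordered as background, text, table, picture, and to sum to 1 at each pixel.
f20 = np.random.rand(4, 8, 8)
f20 = f20 / f20.sum(axis=0, keepdims=True)

max_prob = f20.max(axis=0)        # dimension 1 of the target image: winning probability
category = f20.argmax(axis=0)     # dimension 2 of the target image: 0..3 category index

target_image = np.stack([max_prob, category.astype(np.float32)], axis=0)  # shape (2, H, W)
```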
  • when the feature maps to be fused differ in size, the electronic device can crop the larger feature map so that the feature maps to be fused have the same size.
  • the electronic device can crop the periphery of the larger feature map, so that the cropped feature map has the same size as the smaller feature map (a minimal crop sketch is given below).
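  • a minimal sketch of the crop described above, assuming feature maps in NCHW layout and an even size difference; the helper name crop_to_match is illustrative:

```python
import torch

def crop_to_match(larger, smaller):
    """Crop the borders of `larger` so its spatial size matches `smaller` (NCHW layout)."""
    dh = larger.shape[2] - smaller.shape[2]
    dw = larger.shape[3] - smaller.shape[3]
    top, left = dh // 2, dw // 2
    return larger[:, :, top:top + smaller.shape[2], left:left + smaller.shape[3]]

a = torch.randn(1, 512, 22, 22)
b = torch.randn(1, 512, 20, 20)
print(crop_to_match(a, b).shape)  # torch.Size([1, 512, 20, 20])
```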
  • the network structure of the image semantic segmentation model is only an example provided in this application, and is not used to limit this application.
  • the network structure of the image semantic segmentation model of the present application may also be a unet network structure or a variant derived from the unet network structure, and there is no specific limitation here.
  • FIG. 6 is a schematic structural diagram of a picture processing apparatus according to an embodiment of the application.
  • the image processing device 300 includes: an acquiring module 301, a calling module 302, a determining module 303, and a recognition module 304.
  • the obtaining module 301 is used to obtain a picture to be processed;
  • the calling module 302 is configured to call a pre-trained image semantic segmentation model to divide the picture to be processed into multiple regions, where each region corresponds to a category, and the categories include a text category, a table category, and a picture category;
  • the determining module 303 is configured to determine a target area from the multiple areas;
  • the recognition module 304 is configured to perform character recognition processing on the target area to recognize characters in the target area.
  • the obtaining module 301 may be used to: obtain a sample picture, where the sample picture includes a plurality of sample regions and each sample region corresponds to a category; obtain the image semantic segmentation model to be trained; and train the image semantic segmentation model to be trained by using the sample pictures.
  • the target area includes a table area;
  • the identification module 304 can be used to: identify the number of rows and columns of the table in the table area; generate a table based on the number of rows and columns; and fill the recognized text into the table (a minimal sketch of this table-filling step is given below).
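  • a minimal sketch of the table-filling step described above; the assumption that the recognized text arrives as an ordered, row-major list of cell strings, and the choice of CSV as the editable output format, are illustrative, not taken from the application:

```python
import csv

def build_table(num_rows, num_cols, cell_texts):
    """Generate a num_rows x num_cols table and fill it with recognized cell text
    (assumed to be supplied in row-major order; missing cells stay empty)."""
    table = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for idx, text in enumerate(cell_texts[: num_rows * num_cols]):
        table[idx // num_cols][idx % num_cols] = text
    return table

table = build_table(2, 3, ["name", "age", "city", "Alice", "30", "Shenzhen"])
with open("table.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(table)   # an editable format, e.g. for later export to Excel
```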
  • the target area further includes a text area;
  • the recognition module 304 can be used to: typeset the table and the text recognized from the text area according to the typesetting format of the picture to be processed; and output the typeset table and the text recognized from the text area.
  • the recognition module 304 can be used to: display an editing interface, which is an interface for users to perform editing operations; and output the typeset table and the text recognized from the text area to the editing interface.
  • the recognition module 304 may be used to: obtain a pre-trained text recognition model; and use the text recognition model to perform text recognition processing on the target area to recognize the text in the target area.
  • the recognition module 304 may be used to: when the multiple areas are all target areas, determine whether the length of the picture to be processed is greater than a preset length; if the length of the picture to be processed is greater than the preset length, crop the picture to be processed into a plurality of sub-pictures, where each sub-picture corresponds to a region; and use the text recognition model to perform text recognition processing on each sub-picture, so as to recognize the text in each sub-picture (a minimal sketch of this long-picture handling is given below).
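  • a minimal sketch of the long-picture handling described above; the preset length of 256 pixels and the representation of a region as a (top, bottom) row range are assumptions for illustration:

```python
from PIL import Image

PRESET_LENGTH = 256  # assumed: maximum picture length supported by the text recognition model

def crop_into_sub_pictures(picture, regions):
    """Crop the picture into one sub-picture per region; `regions` is assumed to be
    a list of (top, bottom) row ranges produced by the semantic segmentation step."""
    if picture.height <= PRESET_LENGTH:
        return [picture]
    return [picture.crop((0, top, picture.width, bottom)) for top, bottom in regions]

picture = Image.new("L", (300, 900))                 # illustrative picture to be processed
sub_pictures = crop_into_sub_pictures(picture, [(0, 300), (300, 600), (600, 900)])
# each sub-picture would then be passed to the text recognition model separately
```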
  • the embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed on a computer, the computer is caused to execute the process in the picture processing method provided in this embodiment.
  • an embodiment of the present application also provides an electronic device, including a memory and a processor, where the processor is configured to execute the process in the picture processing method provided in this embodiment by calling a computer program stored in the memory.
  • the above-mentioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • the electronic device 400 may include components such as a camera module 401, a memory 402, and a processor 403. Those skilled in the art can understand that the structure of the electronic device shown in FIG. 7 does not constitute a limitation on the electronic device, and may include more or fewer components than those shown in the figure, or a combination of certain components, or different component arrangements.
  • the camera module 401 may include a lens, an image sensor, and an image signal processor.
  • the lens is used to collect an external light source signal and provide it to the image sensor.
  • the image sensor senses the light source signal from the lens, converts it into a digitized original image, namely a RAW image, and provides the RAW image to the image signal processor for processing.
  • the image signal processor can perform format conversion, noise reduction, and other processing on the RAW image to obtain a YUV image.
  • RAW is an unprocessed and uncompressed format, which can be vividly called a "digital negative".
  • YUV is a color coding method, where Y represents luminance and U and V represent chrominance components. Human eyes can intuitively perceive the natural features contained in YUV images (an illustrative RGB-to-YUV conversion is sketched below).
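  • as a concrete reference for the YUV description above, one common convention (BT.601, full range) converts RGB to YUV as follows; the exact coefficients used by any particular image signal processor are not specified in the application:

```python
def rgb_to_yuv(r, g, b):
    """BT.601 full-range RGB -> YUV (one common convention; coefficients illustrative)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b          # luminance
    u = -0.14713 * r - 0.28886 * g + 0.436 * b     # chrominance (blue projection)
    v = 0.615 * r - 0.51499 * g - 0.10001 * b      # chrominance (red projection)
    return y, u, v

print(rgb_to_yuv(255, 0, 0))  # a pure red pixel
```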
  • the memory 402 can be used to store application programs and data.
  • the application program stored in the memory 402 contains executable code.
  • Application programs can be composed of various functional modules.
  • the processor 403 executes various functional applications and data processing by running application programs stored in the memory 402.
  • the processor 403 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and executes the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
  • the processor 403 in the electronic device will load the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 will run the application programs stored in the memory 402, thereby executing: obtaining a picture to be processed; calling a pre-trained image semantic segmentation model to divide the picture to be processed into multiple regions, where each region corresponds to a category and the categories include a text category, a table category, and a picture category; determining a target area from the multiple regions; and performing text recognition processing on the target area to recognize the text in the target area.
  • the electronic device 400 may include a camera module 401, a memory 402, a processor 403, a touch screen 404, a speaker 405, a microphone 406 and other components.
  • the camera module 401 may include an image processing circuit, which may be implemented by hardware and/or software components, and may include various processing units that define an image signal processing (Image Signal Processing) pipeline.
  • the image processing circuit may at least include a camera, an image signal processor (Image Signal Processor, ISP processor), a control logic, an image memory, a display, and so on.
  • the camera may at least include one or more lenses and image sensors.
  • the image sensor may include a color filter array (such as a Bayer filter). The image sensor can obtain the light intensity and wavelength information captured by each imaging pixel of the image sensor, and provide a set of raw image data that can be processed by the image signal processor.
  • the image signal processor can process the original image data pixel by pixel in a variety of formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the image signal processor may perform one or more image processing operations on the original image data and collect statistical information about the image data. Among them, the image processing operations can be performed with the same or different bit depth accuracy.
  • the original image data can be stored in the image memory after being processed by the image signal processor.
  • the image signal processor can also receive image data from the image memory.
  • the image memory may be a part of a memory device, a storage device, or an independent dedicated memory in an electronic device, and may include DMA (Direct Memory Access) features.
  • the image signal processor can perform one or more image processing operations, such as temporal filtering.
  • the processed image data can be sent to the image memory for additional processing before being displayed.
  • the image signal processor may also receive processed data from the image memory, and perform image data processing in the original domain and in the RGB and YCbCr color spaces on the processed data.
  • the processed image data can be output to a display for viewing by the user and/or further processed by a graphics engine or GPU (Graphics Processing Unit).
  • the output of the image signal processor can also be sent to the image memory, and the display can read image data from the image memory.
  • the image memory may be configured to implement one or more frame buffers.
  • the statistical data determined by the image signal processor can be sent to the control logic.
  • the statistical data may include the statistical information of the image sensor such as automatic exposure, automatic white balance, automatic focus, flicker detection, black level compensation, and lens shading correction.
  • the control logic may include a processor and/or microcontroller that executes one or more routines (such as firmware).
  • routines can determine the control parameters of the camera and the ISP control parameters based on the received statistical data.
  • the control parameters of the camera may include camera flash control parameters, lens control parameters (for example, focal length for focusing or zooming), or a combination of these parameters.
  • ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (for example, during RGB processing).
  • FIG. 9 is a schematic diagram of the structure of the image processing circuit in this embodiment. As shown in FIG. 9, for ease of description, only various aspects of the image processing technology related to the embodiments of the present application are shown.
  • the image processing circuit may include: a camera, an image signal processor, a control logic, an image memory, and a display.
  • the camera may include one or more lenses and image sensors.
  • the camera may be either a telephoto camera or a wide-angle camera.
  • the first image collected by the camera is transmitted to the image signal processor for processing.
  • the image signal processor may send statistical data of the first image (such as the brightness of the image, the contrast value of the image, the color of the image, etc.) to the control logic.
  • the control logic can determine the control parameters of the camera according to the statistical data, so that the camera can perform operations such as autofocus and automatic exposure according to the control parameters.
  • the first image can be stored in the image memory after being processed by the image signal processor.
  • the image signal processor can also read the image stored in the image memory for processing.
  • the first image can be directly sent to the display for display after being processed by the image signal processor.
  • the display can also read the image in the image memory for display.
  • the electronic device may also include a CPU and a power supply module.
  • the CPU is connected to the logic controller, image signal processor, image memory, and display, and the CPU is used to implement global control.
  • the power supply module is used to supply power to each module.
  • the application program stored in the memory 402 contains executable code.
  • Application programs can be composed of various functional modules.
  • the processor 403 executes various functional applications and data processing by running application programs stored in the memory 402.
  • the processor 403 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and executes the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
  • the touch display screen 404 may be used to receive a user's touch control operation on the electronic device.
  • the speaker 405 can play sound signals.
  • the sensor 406 may include a gyroscope sensor, an acceleration sensor, a direction sensor, a magnetic field sensor, etc., which may be used to obtain the current posture of the electronic device 400.
  • the processor 403 in the electronic device will load the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 will run the application programs stored in the memory 402, thereby executing: obtaining a picture to be processed; calling a pre-trained image semantic segmentation model to divide the picture to be processed into multiple regions, where each region corresponds to a category and the categories include a text category, a table category, and a picture category; determining a target area from the multiple regions; and performing text recognition processing on the target area to recognize the text in the target area.
  • the processor 403 may also perform: obtaining a sample picture, where the sample picture includes a plurality of sample regions and each sample region corresponds to a category; obtaining the image semantic segmentation model to be trained; and training the image semantic segmentation model to be trained by using the sample pictures.
  • the target area includes a table area;
  • after the processor 403 performs text recognition processing on the target area to recognize the text in the target area, it may also execute: recognizing the number of rows and columns of the table in the table area; generating a table according to the number of rows and columns; and filling the text into the table.
  • the target area further includes a text area;
  • after the processor 403 fills the text into the table, it may also execute: typesetting the table and the text recognized from the text area according to the typesetting format of the picture to be processed; and outputting the typeset table and the text recognized from the text area.
  • when the processor 403 outputs the typeset table and the text recognized from the text area, it may execute: displaying an editing interface, which is an interface for the user to perform editing operations; and outputting the typeset table and the text recognized from the text area to the editing interface.
  • when the processor 403 performs text recognition processing on the target area to recognize the text in the target area, it may execute: obtaining a pre-trained text recognition model; and using the text recognition model to perform text recognition processing on the target area to recognize the text in the target area.
  • before the processor 403 uses the text recognition model to perform text recognition processing on the target area to recognize the text in the target area, it may also execute: when the multiple areas are all target areas, determining whether the length of the picture to be processed is greater than a preset length; if the length of the picture to be processed is greater than the preset length, cropping the picture to be processed into a plurality of sub-pictures, where each sub-picture corresponds to a region. When the processor 403 uses the text recognition model to perform text recognition processing on the target area to recognize the text in the target area, it may execute: using the text recognition model to perform text recognition processing on each sub-picture, so as to recognize the text in each sub-picture.
  • the picture processing device provided in the embodiment of the present application belongs to the same concept as the picture processing method in the above embodiment, and any method provided in the picture processing method embodiments can be run on the picture processing device.
  • for details of the implementation process, refer to the embodiment of the picture processing method, which will not be repeated here.
  • the computer program may be stored in a computer readable storage medium, such as stored in a memory, and executed by at least one processor.
  • the execution process may include the process of the embodiment of the image processing method.
  • the storage medium may be a magnetic disk, an optical disc, a read only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), etc.
  • for the image processing device of the embodiment of the present application, its functional modules may be integrated in one processing chip, or each module may exist alone physically, or two or more modules may be integrated in one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Abstract

The present application discloses a picture processing method and apparatus, a storage medium, and an electronic device. The method includes: obtaining a picture to be processed; calling a pre-trained image semantic segmentation model to divide the picture to be processed into multiple regions, where each region corresponds to a category and the categories include a text category, a table category, and a picture category; determining a target area from the multiple regions; and performing text recognition processing on the target area to recognize the text in the target area.

Description

图片处理方法、装置、存储介质及电子设备
本申请要求于2020年3月27日提交中国专利局、申请号为202010230790.8、申请名称为“图片处理方法、装置、存储介质及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于电子技术领域,尤其涉及一种图片处理方法、装置、存储介质及电子设备。
背景技术
在现实生活中,文字无处不在。然而,很多文字信息最初是通过拍摄以图片形式存在的,如身份证、银行卡、护照、名片、票据、书籍等等。若需要获取这些文字信息,则需要对图片中的文字进行识别,并输出识别出的文字。
在实际应用中,在需要对图片中的文字进行识别时,可能存在需要识别一张图片中的所有文字的用户需求,也可能存在仅需要识别一张图片的部分区域中的文字的用户需求。
发明内容
本申请实施例提供一种图片处理方法、装置、存储介质及电子设备,可以提高对图片中的文字进行识别的灵活性。
第一方面,本申请实施例提供一种图片处理方法,包括:
获取待处理图片;
调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
从所述多个区域中确定出目标区域;
对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
第二方面,本申请实施例提供一种图片处理装置,包括:
获取模块,用于获取待处理图片;
调用模块,用于调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
确定模块,用于从所述多个区域中确定出目标区域;
识别模块,用于对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
第三方面,本申请实施例提供一种存储介质,其上存储有计算机程序,当所述计算机程序在计算机上执行时,使得所述计算机执行本申请实施例提供的图片处理方法中的流程。
第四方面,本申请实施例还提供一种电子设备,包括存储器,处理器,所述处理器通过调用所述存储器中存储的计算机程序,用于执行本申请实施例提供的图片处理方法中的流程。
附图说明
下面结合附图,通过对本申请的具体实施方式详细描述,将使本申请的技术方案及其有益效果显而易见。
图1是本申请实施例提供的图片处理方法的第一种流程示意图。
图2是本申请实施例提供的待处理图片示意图。
图3是本申请实施例提供的场景示意图。
图4是本申请实施例提供的图片处理方法的第二种示意图。
图5是本申请实施例提供的图像语义分割模型的网络结构示意图。
图6是本申请实施例提供的图片处理装置的结构示意图。
图7是本申请实施例提供的电子设备的第一种结构示意图。
图8是本申请实施例提供的电子设备的第二种结构示意图。
图9是本申请实施例提供的图像处理电路的结构示意图。
具体实施方式
请参照图示,其中相同的组件符号代表相同的组件,本申请的原理是以实施在一适当的运算环境中来举例说明。以下的说明是基于所例示的本申请具体实施例,其不应被视为限制本申请未在此详述的其它具体实施例。
本申请实施例提供一种图片处理方法,包括:
获取待处理图片;
调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
从所述多个区域中确定出目标区域;
对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一种实施方式中,所述获取待处理图片之前,还包括:
获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;
获取待训练的图像语义分割模型;
利用所述样本图片对所述待训练的图像语义分割模型进行训练。
在一种实施方式中,所述目标区域包括表格区域,所述对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之后,还包括:
识别所述表格区域中的表格的行数与列数;
根据所述行数与列数,生成表格;
将所述文字填充至所述表格中。
在一种实施方式中,所述目标区域还包括文本区域,所述将所述文字填充至所述表格中之后,还包括:
根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;
输出排版后的表格和从所述文本区域中识别出的文字。
在一种实施方式中,所述输出排版后的表格和从所述文本区域中识别出的文字,包括:
显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;
将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
在一种实施方式中,所述对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字,包括:
获取预训练的文字识别模型;
利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一种实施方式中,所述利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之前,还包括:
当所述多个区域均为目标区域时,判断所述待处理图片的长度是否大于预设长度;
若所述待处理图片的长度大于预设长度,则对所述待处理图片进行裁切处理,以将所述待处理图片裁切为多个子图片,其中,每个子图片与一区域对应;
所述利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字,包括:
利用所述文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
可以理解的是,本申请实施例的执行主体可以是诸如智能手机或平板电脑等电子设备。
请参阅图1,图1是本申请实施例提供的图片处理方法的第一种流程示意图,流程可以包括:
在101中,获取待处理图片。
其中,该待处理图片的类别可至少包括两种类别。比如,若某图片中包括文本和图片,则该图片的 类别可以包括文本类别和图片类别,电子设备可将该图片确定为待处理图片。又比如,若某图片中包括文本、图片和表格,则该图片的类别可以包括文本类别、图片类别和表格类别,电子设备可将该图片确定为待处理图片。
例如,该待处理图片可如图2所示。该待处理图片G1中包括文本、表格和图片。该待处理图片G1的类别可以包括文本类别、表格类别和图片类别。
在102中,调用预训练的图像语义分割模型将待处理图片划分为多个区域,其中,每个区域对应一类别,该类别包括文本类别、表格类别和图片类别。
相关技术中,在对图片进行文字识别时,通常会将该图片中包含的全部文字识别处理。以如图2所示的待处理图片G1为例,若采用相关技术提供的方案对该待处理图片G1进行文字识别,那么,无论是图片中的文本区域的文字,还是表格区域的文字,亦或是图片区域中的文字,均会被识别并输出。
然而,用户可能仅需要得到文本区域的文字,若采用相关技术的方案,用户还得将识别出来的表格区域和图片区域的文字删除,才能得到文本区域的文字,这一过程相当麻烦。因此,在本申请实施例中,在获取待处理图片,并对待处理图片进行文字识别之前,电子设备会先调用预训练的图像语义分割模型将待处理图片划分为多个区域。其中,每个区域对应一类别,该类别包括文本类别、表格类别和图片类别。如图3所示,待处理图片G1将会被划分为3个区域,分别为文本所在的文本区域A1、表格所在的表格区域A2和图片所在的图片区域A3。其中,文本区域A1对应文本类别,文本区域A2对应表格类别,文本区域A3对应图片类别。
在一些实施例中,电子设备可预先对u-net网络进行训练,并将训练好的u-net网络作为预训练的图像语义分割模型。
在103中,从多个区域中确定出目标区域。
比如,可以由用户从多个区域中指定出目标区域。例如,若用户想识别表格区域的文字,那么,用户可点击待处理图片的表格区域。当电子设备接收到用户的点击操作时,电子设备可根据用户的点击操作所点击的位置确定目标区域为表格区域。
在一些实施例中,电子设备可预先将表格区域、文本区域或图片区域中的一种或多种设置为预设区域。在调用预训练的图像语义分割模型将待处理图片划分为多个区域之后,电子设备可将多个区域中与预设区域匹配的区域确定为目标区域。例如,若预设区域包括表格区域和文本区域,多个区域包括表格区域、文本区域和图片区域,那么,目标区域可以为表格区域和文本区域。又例如,若预设区域为文本区域,多个区域包括表格区域、文本区域和图片区域,那么,目标区域可以为文本区域。
在104中,对目标区域进行文字识别处理,以识别得到目标区域中的文字。
例如,请继续参阅图3,若目标区域为文本区域A1,那么,电子设备可对该文本区域A1进行文字识别处理,以识别得到该文本区域A1中的文字。
在识别得到目标区域中的文字之后,电子设备还可将目标区域中的文字输出。例如,电子设备可将文本区域A1中的文字保存为可编辑的形式,如word、TXT格式等。
本申请实施例中,可调用预训练的图像语义分割模型将待处理图片划分为多个区域,从而在仅需要识别多个区域中的某一个区域的文字时,可以将该区域确定为目标区域;在需要识别出整个待处理图片中的文字时,可将该多个区域均确定为目标区域,再对确定出的目标区域进行文字识别处理。可见,本申请实施例提供的图片处理方法可以提高对图片中的文字进行识别的灵活性。
请参阅图4,图4是本申请实施例提供的图片处理方法的第二种流程示意图,流程可以包括:
在201中,电子设备获取待处理图片。
其中,该待处理图片的类别可至少包括两种类别。比如,若某图片中包括文本和图片,则该图片的类别可以包括文本类别和图片类别,电子设备可将该图片确定为待处理图片。又比如,若某图片中包括文本、图片和表格,则该图片的类别可以包括文本类别、图片类别和表格类别,电子设备可将该图片确定为待处理图片。
需要说明的是,上述类别仅仅是本申请实施例提供的一种示例,并不用于限制本申请。
还需要说明的是,由于本申请实施例采用模型对待处理图片进行处理,而模型通常对输入的图片的属性有一些要求,待处理图片应当符合这些要求,以使模型能够正常处理。
可以理解的是,当电子设备获取的图片为不符合模型要求的图片时,电子设备可对该图片进行预处理,以使该图片符合模型的要求。
例如,假设模型要求输入图片的尺寸为预设尺寸,例如256×256。若电子设备获取的图片不为预设尺寸,那么,电子设备需将该图片的尺寸调整为预设尺寸,得到待处理图片。
又例如,假设模型要求输入图片的像素值应当归一化,例如,像素值应为[0,1]之间的实数,若电子设备获取的图片未归一化,电子设备应当将其归一化,得到待处理图片。例如,某图片的像素值表示为[0,255]之间的整数,可以通过除以255的方式进行归一化。可以理解的是,归一化可以有不同的定义,例如在另一种归一化的定义中,像素值应当为[-1,1]之间的实数,针对不同的归一化定义,归一化的方式应当相应地调整。
其中,该待处理图片可以为彩色图片,也可以为灰度图片。
在202中,电子设备将待处理图片输入编码模块,得到编码特征图集合。
在203中,电子设备将编码特征图集合输入解码模块,得到目标图像,其中,该目标图像中的每个像素点对应一类别,该类别包括文本类别、表格类别和图片类别。
可以理解的是,电子设备可预先创建一图像语义分割模型,并对该图像语义分割模型进行训练,然后将训练好的语义分割模型作为预训练的图像语义分割模型,并保存该预训练的图像语义分割模型。其中,该预训练的图像语义分割模型可包括编码模块和解码模块。
在得到待处理图片之后,电子设备可将该待处理图片输入该编码模块,得到编码特征图集合。其中,编码特征图集合包括多个编码特征图,多个编码特征图的尺寸可以相同,也可以不相同。
当得到编码特征图集合之后,电子设备可将该编码特征图集合输入该解码模块,得到目标图像,其中,目标图像可为一个双通道的图像。其中,一个通道的目标图像中每个像素点的像素值表示该像素点所属的类别,另一个通道的目标图像中每个像素点的像素值表示该像素点属于某个类别的概率。例如,另一个通道的图像中的像素点的像素值可以取值为0~3,其中,0表示背景类别、1表示文本类别、2表示表格类别、3表示图片类别。
在一些实施例中,假设类别有4类,电子设备将该编码特征图集合输入该解码模块之后,可得到4个通道的特征图。然后,电子设备可根据这4个通道的特征图,得到目标图像。
其中,每个通道的特征图中的每个像素点的像素值表示该像素点属于4个类别中的其中一个类别的概率。4个通道的特征图中相同位置的像素点的像素值之和为1。比如,通道C1的特征图中每个像素点的像素值表示该像素点属于背景类别的概率,通道C2的特征图中每个像素点的像素值表示该像素点属于文本类别的概率,通道C3的特征图中每个像素点的像素值表示该像素点属于表格类别的概率,通道C4的特征图中每个像素点的像素值表示该像素点属于图片类别的概率。
电子设备可将4个通道的特征图中相同位置的像素点的最大像素值作为其中一个通道的目标图像的相应位置的像素点的像素值。例如,假设4个通道的特征图中某相同位置的像素点的像素值分别为0.1、0.1、0.1、0.7。则可将0.7作为其中一个通道的目标图像的相应位置的像素点的像素值。且若0.7对应为该像素点属于图片类别的概率,那么另一个通道的目标图像的相应位置的像素值可取值为3(表示该像素点属于图片类别,且属于图片类别的概率为0.7)。
其中,类别的数量由前期训练过程决定。例如,若前期训练过程采用了3个类别,分别为表格类别、文本类别和图片类别,则该目标图像中的像素点对应的类别即为该3个类别中的其中一个类别。
在204中,电子设备利用目标图像中的每个像素点对应的类别,将待处理图片划分为多个区域,其中,每个区域对应一类别。
其中,目标图像的大小与待处理图片的大小一致。那么,当得到目标图像之后,电子设备可利用目标图像中的每个像素点对应的类别,将待处理图片划分为多个区域。若待处理图片如图3所示,那么,目标图像中与文本区域A1对应的区域中的每个像素点均对应文本类别,目标图像中与表格区域A2对 应的区域中的每个像素点均对应表格类别,目标图像中与图片区域A3对应的区域中的每个像素点均对应类别,从而可知,该待处理图片可被划分为文本区域A1、表格区域A2和图片区域A3。
在205中,电子设备从多个区域中确定出目标区域。
比如,可以由用户从多个区域中指定出目标区域。例如,若用户想识别表格区域的文字,那么,用户可点击待处理图片的表格区域。当电子设备接收到用户的点击操作时,电子设备可根据用户的点击操作所点击的位置确定目标区域为表格区域。
在一些实施例中,电子设备可预先将表格区域、文本区域或图片区域中的一种或多种设置为预设区域。在调用预训练的图像语义分割模型将待处理图片划分为多个区域之后,电子设备可将多个区域中与预设区域匹配的区域确定为目标区域。例如,若预设区域包括表格区域和文本区域,多个区域包括表格区域、文本区域和图片区域,那么,目标区域可以为表格区域和文本区域。又例如,若预设区域为文本区域,多个区域包括表格区域、文本区域和图片区域,那么,目标区域可以为文本区域。
在206中,电子设备对目标区域进行文字识别处理,以识别得到目标区域中的文字。
例如,请继续参阅图3,若目标区域为文本区域A1,那么,电子设备可对该文本区域A1进行文字识别处理,以识别得到该文本区域A1中的文字。
在识别得到目标区域中的文字之后,电子设备还可将目标区域中的文字输出。例如,电子设备可将文本区域A1中的文字保存为可编辑的形式,如word、TXT格式等。
在一些实施例中,在识别得到目标区域中的文字之后,电子设备还可接收用户的保存指令,该保存指令用于指示将识别出的目标区域中的文字的保存格式。随后,电子设备可根据该保存指令保存识别出的目标区域中的文字。例如,若该保存指令指示将识别出的目标区域中的文字保存为word格式,那么,电子设备可将识别出的目标区域中的文字保存为word格式。
在一些实施例中,编码模块可包括第一编码子模块、第二编码子模块、第三编码子模块、第四编码子模块和第五编码子模块,流程202可以包括:
电子设备调用第一编码子模块对待处理图片进行编码处理,得到第一编码特征图;
电子设备调用第二编码子模块对第一编码特征图进行编码处理,得到第二编码特征图;
电子设备调用第三编码子模块对第二编码特征图进行编码处理,得到第三编码特征图;
电子设备调用第四编码子模块对第三编码特征图进行编码处理,得到第四编码特征图;
电子设备调用第五编码子模块对第四编码特征图进行编码处理,得到第五编码特征图;
第一编码特征图、第二编码特征图、第三编码特征图、第四编码特征图和第五编码特征图构成编码特征图集合。
其中,对图像进行编码处理可以理解为对图像进行下采样处理、池化处理或卷积处理等。
比如,电子设备可调用第一编码子模块对待处理图片(灰度图片)进行卷积处理,得到第一编码特征图。其中,第一编码子模块可包括多个卷积核。卷积核的数量可以为64个或128个等,卷积核的大小可以为7×7或者8×8等,步长为2或者3等,此处不作具体限制。第一编码特征图的尺寸小于待处理图片的尺寸。
在一些实施例中,电子设备可调用第一编码子模块对待处理图片(彩色图片)进行卷积处理,得到第一编码特征图。其中,第一编码子模块可包括多个卷积核。卷积核的数量为64,卷积核的大小可根据实际情况而定。卷积核可为3个维度的卷积核。该处的卷积处理即计算待处理图片与每一3个维度的卷积核的矩阵内积,得到64个维度的第一编码特征图。
其中,假设某彩色图片的像素矩阵为
Figure PCTCN2021074706-appb-000001
某卷积核为
Figure PCTCN2021074706-appb-000002
则该彩色图片与卷积核的矩阵内积为
Figure PCTCN2021074706-appb-000003
当得到第一编码特征图之后,电子设备可调用第二编码子模块对该第一编码特征图进行池化处理,得到池化特征图,其中,池化尺度为2×2。然后,电子设备可调用第二编码子模块的瓶颈结构对池化特 征图进行多次,如2次或3次处理,得到第二编码特征图。其中,第一编码特征图的维度可以为64,第二编码特征图的维度可以为256。第二编码特征图的尺寸小于第一编码特征图的尺寸。
当得到第二编码特征图之后,电子设备可调用第三编码子模块的瓶颈结构对该第二编码特征图进行多次,如3次或4次处理,得到第三编码特征图。第三编码特征图的维度可以为512。第三编码特征图的尺寸小于第二编码特征图的尺寸。
当得到第三编码特征图之后,电子设备可调用第四编码子模块的瓶颈结构对该第三编码特征图进行多次,如4次或5次处理,得到第四编码特征图。第四编码特征图的维度可以为1024。第四编码特征图的尺寸小于第三编码特征图的尺寸。
当得到第四编码特征图之后,电子设备可调用第五编码子模块的瓶颈结构对该第四编码特征图进行多次,如3次或4次处理,得到第五编码特征图。第五编码特征图的维度可以为512。第五编码特征图的尺寸小于第四编码特征图的尺寸。
在一些实施例中,解码模块可包括第一解码子模块、第二解码子模块、第三解码子模块、第四解码子模块和第五解码子模块,流程203,可以包括:
电子设备调用第一解码子模块对第五编码特征图进行解码处理,得到第一解码特征图,并对第一解码特征图与根据第四编码特征图确定的目标特征图进行融合处理,得到第一融合特征图;
电子设备调用第二解码子模块对第一融合特征图进行解码处理,得到第二解码特征图,并对第二解码特征图与第三编码特征图进行融合处理,得到第二融合特征图;
电子设备调用第三解码子模块对第二融合特征图进行解码处理,得到第三解码特征图,并对第三解码特征图与第二编码特征图进行融合处理,得到第三融合特征图;
电子设备调用第四解码子模块对第三融合特征图进行解码处理,得到第四解码特征图,并对第四解码特征图与第一编码特征图进行融合处理,得到第四融合特征图;
电子设备调用第五解码子模块对第四融合特征图进行解码处理,得到第五解码特征图,并对第五解码特征图与待处理图片进行融合处理,得到第五融合特征图,并根据第五融合特征图确定目标图像。
其中,对图像进行解码处理可以理解为对图像进行上采样处理、反池化处理、反卷积处理或卷积处理等。
比如,电子设备可调用第一解码子模块对第五编码特征图进行反卷积处理或上采样处理,得到第一解码特征图。其中,上采样的尺寸可根据实际情况确定。反卷积对应的卷积核也可根据实际情况确定。随后,电子设备可对第一解码特征图与根据第四编码特征图确定的目标特征图进行融合处理,得到第一融合特征图。其中,第一融合特征图的通道数为目标特征图的通道数与第一解码特征图的通道数之和。根据第四编码特征图确定的目标特征图为对第四编码特征图进行卷积处理得到。其中,卷积处理对应的卷积核可根据实际情况确定。
随后,电子设备可调用第二解码子模块对第一融合特征图进行卷积及上采样处理,得到第二解码特征图。其中,该卷积处理对应的卷积核可根据实际情况确定。电子设备可调用第二解码子模块对第二解码特征图与第三编码特征图进行融合处理,得到第二融合特征图。其中,第二融合特征图的通道数为第二解码特征图的通道数与第三编码特征图的通道数之和。
接着,电子设备可调用第三解码子模块对第二融合特征图进行卷积处理,得到第三解码特征图。其中,该卷积处理对应的卷积核可根据实际情况确定。电子设备可调用第三解码子模块对第三解码特征图与第二编码特征图进行融合处理,得到第三融合特征图。其中,第三融合特征图的通道数为第三解码特征图的通道数与第二编码特征图的通道数之和。
之后,电子设备可调用第四解码子模块对第三融合特征图进行卷积处理,得到第四解码特征图。其中,该卷积处理对应的卷积核可根据实际情况确定。电子设备可调用第四解码子模块对第四解码特征图与第一编码特征图进行融合处理,得到第四融合特征图。其中,第四融合特征图的通道数为第四解码特征图的通道数与第一编码特征图的通道数之和。
随后,电子设备可调用第五解码子模块对第四融合特征图进行卷积处理,得到第五解码特征图。其 中,该卷积处理对应的卷积核可根据实际情况确定。电子设备可调用第五解码子模块对第五解码特征图与待处理图片进行融合处理,得到第五融合特征图。其中,第五融合特征图的通道数为第五解码特征图的通道数与待处理图片的通道数之和。电子设备还可调用该第五解码子模块根据该第五融合特征图确定目标图像。
其中,电子设备调用该第五解码子模块根据该第五融合特征图确定目标图像,可以包括:电子设备调用该第五解码子模块对该第五融合特征图进行卷积处理,得到第六解码特征图。其中,该卷积处理对应的卷积核可根据实际情况确定。电子设备调用该第五解码子模块对该第六解码特征图进行卷积处理,得到第七解码特征图。其中,假设共有4个类别,那么第七解码特征图为4个通道的特征图。每个通道的特征图中的每个像素点的像素值用于表示该像素点属于4个类别中的其中一个类别的概率。每个通道的特征图与一类别对应。随后,电子设备可调用第五解码子模块根据该第七解码特征图,确定目标图像。
其中,目标图像为双通道图像,电子设备可调用第五解码子模块根据该第七解码特征图,确定目标图像可以为:电子设备可将4个通道的第七编码特征图中相同位置的像素点的最大像素值作为一个通道的目标图像的相应位置的像素点的像素值。例如,假设4个通道的第七编码特征图中某相同位置的像素点的像素值分别为0.1、0.1、0.1、0.7。则可将0.7作为一个通道的目标图像的相应位置的像素点的像素值。且若0.7对应为该像素点属于图片类别的概率,那么另一个通道的目标图像的相应位置的像素值可取值为3(表示该像素点属于图片类别,且属于图片类别的概率为0.7)。
在一些实施例中,在流程201之前,还可以包括:
电子设备获取样本图片,样本图片包括多个样本区域,每个样本区域对应一类别;
电子设备获取待训练的图像语义分割模型;
电子设备利用样本图片对待训练的图像语义分割模型进行训练。
比如,电子设备可收集多张手机截图,再通过图像标注工具对该多张截图的每个像素点进行标注,得到标注好的截图,该标注好的截图可作为样本图片。其中,每个像素点可标注为背景类别、文本类别、图片类别和表格类别四种类别中的一种。
例如,电子设备可用不同的颜色标记该多张截图的每个像素点。例如,若某截图的某个像素点为背景类别,则电子设备可将其标注为黑色(RGB[0,0,0]);若某截图的某个像素点为文本类别,则电子设备可将其标注为红色(RGB[255,0,0]);若某截图的某个像素点为图片类别,则电子设备可将其标注为绿色(RGB[0,255,0]);若某截图的某个像素点为表格类别,则电子设备可将其标注为蓝色(RGB[0,0,255])。
在一些实施例中,若手机截图如图2所示,即文本类别的像素点、表格类别的像素点和图片类别的像素点均集中在某一区域。那么,电子设备也可从多张标注好的截图中,选取出部分文本区域(标注为文本类别的像素点集中所在的区域)、表格区域(标注为表格类别的像素点集中所在的区域)和图片区域(标注为图片类别的像素点集中所在的区域),然后进行随机组合,组合出多张组合图片,该多张组合图片也可以作为样本图片。其中,组合图片的数量可以与标注好的截图的数量相同,也可以不相同。
当按照上述方式得到多个样本图片之后,电子设备可将该多个样本图片划分为训练集、验证集和测试集。其中,训练集、验证集和测试集的比例可以为3:1:1。例如,训练集可包括6000个样本图片,验证集和测试集可分别包括2000个样本图片。
随后,电子设备可获取待训练的图像语义分割模型,设置训练参数,选取学习率。其中,该待训练的图像语义分割模型可包括编码模块和解码模块。编码模块可包括第一编码子模块、第二编码子模块、第三编码子模块、第四编码子模块和第五编码子模块。解码模块可包括第一解码子模块、第二解码子模块、第三解码子模块、第四解码子模块和第五解码子模块。学习率可以为1×10 -5。最大单次样本总数为64。训练迭代次数为300次。每迭代1次进行验证并更新一次模型输出。
接着,电子设备可利用样本图片对该待训练的图像语义分割模型进行训练,以对待训练的图像语义分割模型的参数进行调整,直至待训练的图像语义分割模型收敛,得到预训练的图像语义分割模型。
在一些实施例中,电子设备还可对样本图片进行旋转、缩放、翻转、平移、加噪声、模糊等数据增强操作,得到数据增强图像,再利用该数据增强图像对该待训练的图像语义分割模型进行训练,从而可 提高模型的识别能力和泛化能力。其中,旋转角度可以为[-0,1,0,1]rad。缩放倍数可以为[0.8,1.2]。
其中,在该待训练的图像语义分割模型的训练过程中,可采用交叉熵损失函数作为样本图片的损失函数。
在一些实施例中,交叉熵损失函数的公式可以为:
L_{log}(Y, P) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{i,k}\,\log P_{i,k}
其中,L log(Y,P)表示样本图片的损失值,y i,k表示样本图片的第i个像素点的真实类别。P i,k表示样本图片的第i个像素点属于第k个类别的概率。N表示像素点的数量。
可以理解的是,上述公式仅仅是本申请提供的一种示例,并不用于限制本申请。在实际应用中,还可以采用其他交叉熵损失函数作为样本图片的损失函数,此处不作具体限制。
在本申请实施例中,该待训练的图像语义分割模型对应的损失值可以为所有样本图片的损失值之和的平均值。
在该待训练的图像语义分割模型的训练过程中,电子设备可根据待训练的图像语义分割模型对应的损失值计算各个参数的梯度,然后通过反向传播算法,对整个网络的参数进行更新。
在该待训练的图像语义分割模型的训练过程中,每迭代一次,可采用验证集对该待训练的图像语义分割模型进行验证,并利用评价函数计算出该次迭代的评价值和评价损失值,并输出保存好的图像语义分割模型。其中,评价值与评价损失值之和可以为1。单个样本图片的评价函数的公式可以为:
IoU = \frac{|X \cap Y|}{|X \cup Y|}
其中,IoU表示样本图片的评价值,X表示图像语义分割模型对样本图片的预测结果,Y表示样本图片的真实标注结果。
其中,当验证结果趋于收敛时,可停止对图像语义分割模型进行训练。
需要说明的是,在实际应用中,像素点对应的类别不仅仅限于上述类别,还可以是其他类别。例如,当电子设备收集的截图为即时通讯类应用的聊天界面时,由于该类截图中包含有聊天气泡,因此,像素点对应的类别还可以包括气泡类别。又例如,当电子设备收集的截图包括word、ppt或pdf等格式的文档类图片时,由于该类图片中通常包括页眉、页脚和标题。因此,像素点对应的类别还可以包括页眉类别、页脚类别和标题类别。
在一些实施例中,目标区域可以包括表格区域,流程206之后,还可以包括:
电子设备识别表格区域中的表格的行数与列数;
电子设备根据行数与列数,生成表格;
电子设备将文字填充至表格中。
比如,当目标区域包括表格区域时,电子设备可在识别出表格区域中的文字之后,识别表格区域中的表格的行数与列数,并根据表格的行数与列数,生成表格,再将文字填充至表格中。其中,该表格的格式可以为excel格式等可编辑的格式,从而使得用户可对该表格进行编辑操作。其中,编辑操作可包括复制、粘贴、删除等操作。
在一些实施例中,目标区域还包括文本区域,“将文字填充至表格中”之后,还包括:
电子设备根据待处理图片的排版格式,对表格和从文本区域中识别出的文字进行排版;
电子设备输出排版后的表格和从文本区域中识别出的文字。
比如,当目标区域包括文本区域和表格区域时,在将从表格区域识别出的文字填充至表格中之后,电子设备还可按照待处理图片中的排版格式对从文本区域中识别出的文字、表格进行排版组合,使得最终输出结果的排版格式与待处理图片的排版格式一致,从而使得用户无需对识别出的文字进行手动排版。
在一些实施例中,“电子设备输出排版后的表格和从文本区域中识别出的文字”,可以包括:
电子设备显示编辑界面,编辑界面为供用户进行编辑操作的界面;
电子设备将排版后的表格和从文本区域中识别出的文字输出至编辑界面。
其中,编辑界面可以为word文档编辑界面、备忘录编辑界面、短信编辑界面等可供用户进行编辑操作的界面。
比如,当对表格和从文本区域中识别出的文字进行排版之后,电子设备可自动开启word文档应用,并进入word文档编辑界面,电子设备即显示编辑界面。然后,电子设备可将排版后的表格和从文本区域中识别出的文字输出至该编辑界面,从而使得用户可在该编辑界面对排版后的表格和从文本区域中识别出的文字进行相应的编辑操作。
其中,编辑操作可包括复制、粘贴、删除、修改、新增等操作。例如,对于表格来说,用户可新增一行或一列,并在新增的行和列中输入新的内容,以最终形成新的表格。对于从文本区域中识别出的文字来说,用户可新增其他文字,或者删除一些文字,等等。
在一些实施例中,多个区域中的其中一个区域可以为图片区域,“将文字填充至表格中”之后,还可以包括:
电子设备获取图片区域中的图片;
“电子设备根据待处理图片的排版格式,对表格和从文本区域中识别出的文字进行排版”,可以包括:
电子设备根据待处理图片的排版格式,对表格、图片和从文本区域中识别出的文字进行排版;
“电子设备输出排版后的表格和从文本区域中识别出的文字”,可以包括:
电子设备输出排版后的表格、图片和从文本区域中识别出的文字。
比如,当多个区域中的其中一个区域为图片区域,目标区域包括文本区域和表格区域时,在将从表格区域识别出的文字填充至表格中之后,电子设备还可裁切出图片区域中的图片,并按照待处理图片中的排版格式对从文本区域中识别出的文字、表格和图片进行排版组合,使得最终输出结果的排版格式与待处理图片的排版格式一致,从而使得用户无需手动将从文本区域中识别出的文字、表格和图片排版成与待处理图片相同的格式。
在一些实施例中,“电子设备输出排版后的表格、图片和从文本区域中识别出的文字”,可以包括:
电子设备显示编辑界面,编辑界面为供用户进行编辑操作的界面;
电子设备将排版后的表格、图片和从文本区域中识别出的文字输出至编辑界面。
比如,当对表格、图片和从文本区域中识别出的文字进行排版之后,电子设备可自动开启word文档应用,并进入word文档编辑界面,电子设备即显示编辑界面。然后,电子设备可将排版后的表格、图片和从文本区域中识别出的文字输出至该编辑界面,从而使得用户可在该编辑界面对排版后的表格、图片和从文本区域中识别出的文字进行相应的编辑操作。
其中,对图片进行编辑操作可以包括:对图片进行放大、缩小、翻转、删除等操作。
在一些实施例中,流程206,可以包括:
电子设备获取预训练的文字识别模型;
电子设备调用预训练的文字识别模型对目标区域进行文字识别处理,以识别得到目标区域中的文字。
比如,电子设备可获取一待训练的文字识别模型,并对该待训练的文字识别模型进行训练,得到训练好的模型,该训练好的模型可作为预训练的文字识别模型。当从多个区域中确定出目标区域之后,电子设备可调用该预训练的文字识别模型对目标区域进行文字识别处理,以识别得到目标区域中的文字,从而提高了文字识别的精度。
在一些实施例中,“电子设备利用文字识别模型对目标区域进行文字识别处理,以识别得到目标区域中的文字”之前,还可以包括:
当多个区域均为目标区域时,电子设备判断待处理图片的长度是否大于预设长度;
若待处理图片的长度大于预设长度,则电子设备对待处理图片进行裁切处理,以将待处理图片裁切为多个子图片,其中,每个子图片与一区域对应;
“电子设备利用文字识别模型对目标区域进行文字识别处理,以识别得到目标区域中的文字”,可 以包括:
电子设备利用文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
可以理解的是,当输入文字识别模型中的图片的长度过长时,可能影响到文字识别模型的识别结果。因此,当需要对整个待处理图片进行文字识别时,电子设备可判断待处理图片的长度是否大于预设长度,若待处理图片的长度大于预设长度,电子设备可对待处理图片进行裁切处理,以将待处理图片裁切为多个子图片,其中,每个子图片与一区域对应。当得到多个子图片之后,电子设备可利用文字识别模型对每个子图片进行文字识别处理。
其中,子图片的数量与区域的数量对应,电子设备调用预训练的图像语义分割模型将待处理图片划分为多少个区域,则电子设备可根据待处理图片相应裁切出多少个子图片。预设长度可以根据文字识别模型所支持的图片长度确定。例如,若文字识别模型所支持的图片长度为256像素,则预设长度为256像素。
在一些实施例中,电子设备也可将从别处获取的训练好的文字识别模型作为预训练的文字识别模型。
在一些实施例中,流程205,可以包括:
电子设备接收用户输入的生物特征信息;
电子设备将多个区域中与生物特征信息对应的区域确定为目标区域。
其中,生物特征信息可包括指纹信息、声纹信息、人脸特征信息、虹膜信息等。
比如,电子设备可预先设置生物特征信息与区域的对应关系。例如,假设设置指纹J1对应文本区域,指纹J2对应表格区域,指纹J3对应图片区域。
当电子设备将待处理图片划分为文本区域、表格区域和图片区域这3个区域之后,电子设备可接收用户输入的指纹信息,并将该指纹信息对应的区域确定为目标区域。例如,假设用户输入的指纹信息为指纹J1,则目标区域为文本区域。
在一些实施例中,在将待处理图片划分为多个区域(如表格区域、文本区域和图片区域)之后,电子设备可将图片区域和表格区域从待处理图片中分离出来,并将待处理图片中被扣除的区域填充为白色(RGB[255,255,255]),得到文本图片。随后,电子设备可对该文本图片进行文字识别处理,以识别得到该文本图片中的文字。然后,电子设备可提取该识别出的文字,并保存为文本格式。该文本格式为可编辑的格式,从而使得用户可对该识别出的文字进行编辑操作。电子设备也可对表格区域进行文字识别,以识别得到表格区域中的文字。然后,电子设备可识别表格区域中的表格的行数和列数,并根据表格的行数和列数生成可编辑格式的表格。之后,电子设备可将从表格区域中识别出的文字填充至该表格中,以供用户对该表格进行编辑操作。
之后,电子设备还可按照待处理图片中的排版格式对从文本图片中识别出的文字、表格、分离出的图片进行排版组合,使得最终输出结果的排版格式与待处理图片的排版格式一致。其中,最终输出结果中的文字和表格部分可编辑,最终输出结果中的图片可拉伸翻转。
在一些实施例中,电子设备也可提供图片区域的文字识别接口,从而在需要对图片区域的文字进行识别时,可触发识别得到图片区域的文字。
请参阅图5,图5为本申请实施例提供的图像语义分割模型的网络结构示意图。
该图像语义分割模型包括编码模块和解码模块。其中,编码模块主要用于进行特征提取,从输入的数据中提取不同分辨率的特征图。解码模块主要用于进行上采样,且每上采样一次,就和编码模块输出的相应特征图融合。前期,可以获取样本图片对该模型进行训练,从而最终得到训练好的模型,即预训练的图像语义分割模型。
后期,可调用该预训练的图像语义分割模型将待处理图片划分为多个区域。其中,待处理图片的大小可为300×300,维度(通道数)为1。需要说明的是,本申请实施例所提供的待处理图片通常可包括表格、图片和文本三个区域,如图3所示。在一些实施例中,待处理图片还可包括背景区域,如图3所示,背景区域可为待处理图片G1中除文本区域A1、表格区域A2和图片区域A3之外的区域。
其中,“调用该预训练的图像语义分割模型将待处理图片G1划分为多个区域”,可以包括:
电子设备可将待处理图片输入该预训练的图像语义分割模型中,利用卷积核cv1对该待处理图片进行卷积处理,得到第一编码特征图,其中,该第一编码特征图的维度为64。卷积核cv1的数量为64,卷积核cv1的大小为7×7,步长为2。第一编码特征图的尺寸小于待处理图片的尺寸。
电子设备可利用尺寸为2×2的过滤器对第一编码特征图进行池化(pool)处理,得到特征图F1。其中,该特征图F1的维度为64。该特征图F1的尺寸小于第一编码特征图的尺寸。电子设备可利用瓶颈结构b1对特征图F1进行处理,得到特征图F2。其中,该特征图F2的维度为256。该特征图F2的尺寸跟特征图F1的尺寸相同。电子设备可利用瓶颈结构b2对特征图F2进行处理,得到第二编码特征图。其中,第二编码特征图的维度为256。第二编码特征图的尺寸与特征图F2的尺寸相同。
电子设备可利用瓶颈结构b3对第二编码特征图进行处理,得到特征图F3。其中,特征图F3的维度为256。特征图F3的尺寸小于第二编码特征图的尺寸。电子设备可利用瓶颈结构b4对特征图F3进行处理,得到特征图F4。其中,特征图F4的维度为512。特征图F4的尺寸与特征图F3的尺寸相同。电子设备可利用瓶颈结构b5对特征图F4进行处理,得到特征图F5。其中,特征图F5的维度为512。特征图F5的尺寸与特征图F4的尺寸相同。电子设备可利用瓶颈结构b6对特征图F5进行处理,得到第三编码特征图。其中,第三编码特征图的维度为512。第三编码特征图的尺寸与特征图F5的尺寸相同。
电子设备可利用瓶颈结构b7对第三编码特征图进行处理,得到特征图F6。其中,特征图F6的维度为512。特征图F6的尺寸小于第三编码特征图的尺寸。电子设备可利用瓶颈结构b8对特征图F6进行处理,得到特征图F7。其中,特征图F7的维度为1024。特征图F7的尺寸与特征图F6的尺寸相同。电子设备可利用瓶颈结构b9对特征图F7进行处理,得到特征图F8。其中,特征图F8的维度为1024。特征图F8的尺寸与特征图F7的尺寸相同。电子设备可利用瓶颈结构b10对特征图F8进行处理,得到特征图F9。其中,特征图F9的维度为1024。特征图F9的尺寸与特征图F8的尺寸相同。电子设备可利用瓶颈结构b11对特征图F9进行处理,得到特征图F10。其中,特征图F10的维度为1024。特征图F10的尺寸与特征图F9的尺寸相同。电子设备可利用瓶颈结构b12对特征图F10进行处理,得到第四编码特征图。其中,第四编码特征图的维度为1024。第四编码特征图的尺寸与特征图F10的尺寸相同。电子设备可利用卷积核cv2对第四编码特征图进行卷积处理,得到目标特征图。其中,目标特征图的维度为512。目标特征图的尺寸与特征图F10的尺寸相同。
电子设备可利用瓶颈结构b13对第四编码特征图进行处理,得到特征图F11。其中,特征图F11的维度为1024。特征图F11的尺寸小于特征图F10的尺寸。电子设备可利用瓶颈结构b14对特征图F11进行处理,得到特征图F12。其中,特征图F12的维度为2048。特征图F12的尺寸与特征图F11的尺寸相同。电子设备可利用瓶颈结构b15对特征图F12进行处理,得到特征图F13。其中,特征图F13的维度为2048。特征图F13的尺寸与特征图F12的尺寸相同。电子设备可利用瓶颈结构b16对特征图F13进行处理,得到特征图F14。其中,特征图F14的维度为2048。特征图F14的尺寸与特征图F13的尺寸相同。电子设备可利用卷积核cv3对特征图F14进行卷积处理,得到第五编码特征图。其中,第五编码特征图的维度为512。第五编码特征图的尺寸与特征图F14的尺寸相同。卷积核cv3的数量为512,卷积核cv3的大小为1×1,步长为1。
电子设备可对第五编码特征图进行上采样(uc1)处理,得到第一解码特征图。其中,第一解码特征图的维度为512。第一解码特征图的尺寸大于第五编码特征图的尺寸。上采样的倍数可以为2倍、4倍等(视具体情况而定)。电子设备可对第一解码特征图与目标特征图进行融合处理,得到第一融合特征图。其中,第一融合特征图的维度为1024。第一融合特征图的尺寸与第一解码特征图的尺寸相同。
电子设备可利用卷积核cv4对第一融合特征图进行卷积处理,得到特征图F15。其中,特征图F15的维度为512。特征图F15的尺寸与第一融合特征图的尺寸可以相同,也可以不相同。卷积核cv4的数量为256,卷积核cv4的大小为1×1或3×3,步长为1。电子设备可对特征图F15进行上采样(uc2)处理,得到第二解码特征图。其中,第二解码特征图的维度为512。第二解码特征图的尺寸大于特征图F15的尺寸。其中,采样倍数可根据实际情况设置。电子设备可对第二解码特征图和第三编码特征图进行融合处理,得到第二融合特征图。其中,第二融合特征图的维度为1024。第二融合特征图的尺寸与第 三解码特征图的尺寸可以相同。
电子设备可利用卷积核cv5对第二融合特征图进行卷积处理,得到特征图F16。其中,特征图F16的维度为256。特征图F16的尺寸与第一融合特征图的尺寸可以相同,也可以不相同,视卷积核的大小而定。电子设备可对特征图F16进行上采样(uc3)处理,得到第三解码特征图。其中,第三解码特征图的维度为256。第三解码特征图的尺寸大于特征图F16的尺寸。电子设备可对第三解码特征图和第二编码特征图进行融合处理,得到第三融合特征图。其中,第三融合特征图的维度为512。第三融合特征图的尺寸与第三解码特征图的尺寸相同。
电子设备可利用卷积核cv6对第三融合特征图进行卷积处理,得到特征图F17。其中,特征图F17的维度为128。特征图F17的尺寸与第三融合特征图的尺寸可以相同,也可以不相同,视实际情况而定。电子设备可对特征图F17进行上采样(uc4)处理,得到第四解码特征图。其中,第四解码特征图的维度为128。第四解码特征图的尺寸大于特征图F17的尺寸。电子设备可对第四解码特征图与第一编码特征图进行融合处理,得到第四融合特征图。其中,第四融合特征图的维度为192。第四融合特征图的尺寸与第四解码特征图的尺寸相同。
电子设备可利用卷积核cv7对第四融合特征图进行卷积处理,得到特征图F18。其中,特征图F18的维度为64。特征图F18的尺寸与第四融合特征图的尺寸可以相同,也可以不相同,视实际情况而定。电子设备可对特征图F18进行上采样(uc5)处理,得到第五解码特征图。其中,第五解码特征图的维度为64。第五解码特征图的尺寸大于特征图F18的尺寸。电子设备可对第五解码特征图与待处理图片进行融合处理,得到第五融合特征图。其中,第五融合特征图的维度为65。第五融合特征图的尺寸与第五解码特征图的尺寸相同。
电子设备可利用卷积核cv8对第五融合特征图进行卷积处理,得到特征图F19。其中,特征图F19的维度为32。特征图F19的尺寸与第五融合特征图的尺寸可以相同,也可以不相同。电子设备可利用卷积核cv9对特征图F19进行卷积处理,得到特征图F20。其中,特征图F20的维度为4。特征图F20的尺寸与待处理图片的尺寸相同。其中,每个维度的特征图F20中每个像素点的像素值表示该像素点属于某个类别的概率。每个维度的特征图F20均与1个类别对应。例如,第一个维度的特征图F20可对应背景类别,则该维度的特征图中的每个像素点的像素值表示该像素点属于背景类别的概率。以此类推,第二个维度的特征图F20可对应文本类别,第三个维度的特征图F20可对应表格类别。第四个维度的特征图F20可对应图片类别。其中,4个维度的特征图中相同位置的像素值之和为1。电子设备可确定4个维度的特征图F20中相同位置的最大像素值。然后,电子设备可根据最大像素值确定目标图像。其中,目标图像的维度为2。其中一个维度的目标图像中像素点的像素值为电子设备得到的最大像素值。另一个维度的目标图像中像素点的像素值表示该像素点所属的类别,取值可为0~3,分别对应背景类别、文本类别、表格类别和图片类别。例如,假设电子设备确定表格类别对应的特征图F20中某个像素点的像素值最大,那么,电子设备可将该像素值作为目标图像的相应位置的像素点的像素值,并将表格类别作为该像素点对应的类别。
需要说明的是,由于需要进行融合处理的特征图可能会存在尺寸不相同的问题。因此,当需要进行融合处理的特征图的尺寸不相同时,电子设备可对尺寸较大的特征图进行裁剪处理,以使需要进行融合处理的特征图的尺寸相同。其中,电子设备可对尺寸较大的特征图的四周进行裁剪,使得裁剪后的特征图与尺寸较小的特征图的尺寸相同。
还需要说明的是,该图像语义分割模型的网络结构仅仅是本申请提供的一种示例,并不用于限制本申请。本申请的图像语义分割模型的网络结构还可为unet网络结构或者根据unet网络结构做出的变形结构,等等,此处不作具体限制。
请参阅图6,图6为本申请实施例提供的图片处理装置的结构示意图。该图片处理装置300包括:获取模块301,调用模块302,确定模块303。
获取模块301,用于获取待处理图片;
调用模块302,用于调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每 个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
确定模块303,用于从所述多个区域中确定出目标区域;
识别模块304,用于对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一些实施例中,所述获取模块301,可以用于:获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;获取待训练的图像语义分割模型;利用所述样本图片对所述待训练的图像语义分割模型进行训练。
在一些实施例中,所述目标区域包括表格区域,识别模块304,可以用于:识别所述表格区域中的表格的行数与列数;根据所述行数与列数,生成表格;将所述文字填充至所述表格中。
在一些实施例中,所述目标区域还包括文本区域,识别模块304,可以用于:根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;输出排版后的表格和从所述文本区域中识别出的文字。
在一些实施例总,识别模块304,可以用于:显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
在一些实施例中,所述识别模块304,可以用于:获取预训练的文字识别模型;利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一些实施例中,所述识别模块304,可以用于:当所述多个区域均为目标区域时,判断所述待处理图片的长度是否大于预设长度;若所述待处理图片的长度大于预设长度,则对所述待处理图片进行裁切处理,以将所述待处理图片裁切为多个子图片,其中,每个子图片与一区域对应;利用所述文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
本申请实施例提供一种计算机可读的存储介质,其上存储有计算机程序,当所述计算机程序在计算机上执行时,使得所述计算机执行如本实施例提供的图片处理方法中的流程。
本申请实施例还提供一种电子设备,包括存储器,处理器,所述处理器通过调用所述存储器中存储的计算机程序,用于执行本实施例提供的图片处理方法中的流程。
例如,上述电子设备可以是诸如平板电脑或者智能手机等移动终端。请参阅图7,图7为本申请实施例提供的电子设备的结构示意图。
该电子设备400可以包括摄像模组401、存储器402、处理器403等部件。本领域技术人员可以理解,图7中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
摄像模组401可以包括透镜、图像传感器和图像信号处理器,其中透镜用于采集外部的光源信号提供给图像传感器,图像传感器感应来自于透镜的光源信号,将其转换为数字化的原始图像,即RAW图像,并将该RAW图像提供给图像信号处理器处理。图像信号处理器可以对该RAW图像进行格式转换,降噪等处理,得到YUV图像。其中,RAW是未经处理、也未经压缩的格式,可以将其形象地称为“数字底片”。YUV是一种颜色编码方法,其中Y表示亮度,U表示色度,V表示浓度,人眼从YUV图像中可以直观的感受到其中所包含的自然特征。
存储器402可用于存储应用程序和数据。存储器402存储的应用程序中包含有可执行代码。应用程序可以组成各种功能模块。处理器403通过运行存储在存储器402的应用程序,从而执行各种功能应用以及数据处理。
处理器403是电子设备的控制中心,利用各种接口和线路连接整个电子设备的各个部分,通过运行或执行存储在存储器402内的应用程序,以及调用存储在存储器402内的数据,执行电子设备的各种功能和处理数据,从而对电子设备进行整体监控。
在本实施例中,电子设备中的处理器403会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行代码加载到存储器402中,并由处理器403来运行存储在存储器402中的应用程序,从而执行:
获取待处理图片;
调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
从所述多个区域中确定出目标区域;
对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
请参阅图8,电子设备400可以包括摄像模组401、存储器402、处理器403、触摸显示屏404、扬声器405、麦克风406等部件。
摄像模组401可以包括图像处理电路,图像处理电路可以利用硬件和/或软件组件实现,可包括定义图像信号处理(Image Signal Processing)管线的各种处理单元。图像处理电路至少可以包括:摄像头、图像信号处理器(Image Signal Processor,ISP处理器)、控制逻辑器、图像存储器以及显示器等。其中摄像头至少可以包括一个或多个透镜和图像传感器。图像传感器可包括色彩滤镜阵列(如Bayer滤镜)。图像传感器可获取用图像传感器的每个成像像素捕捉的光强度和波长信息,并提供可由图像信号处理器处理的一组原始图像数据。
图像信号处理器可以按多种格式逐个像素地处理原始图像数据。例如,每个图像像素可具有8、10、12或14比特的位深度,图像信号处理器可对原始图像数据进行一个或多个图像处理操作、收集关于图像数据的统计信息。其中,图像处理操作可按相同或不同的位深度精度进行。原始图像数据经过图像信号处理器处理后可存储至图像存储器中。图像信号处理器还可从图像存储器处接收图像数据。
图像存储器可为存储器装置的一部分、存储设备、或电子设备内的独立的专用存储器,并可包括DMA(Direct Memory Access,直接直接存储器存取)特征。
当接收到来自图像存储器的图像数据时,图像信号处理器可进行一个或多个图像处理操作,如时域滤波。处理后的图像数据可发送给图像存储器,以便在被显示之前进行另外的处理。图像信号处理器还可从图像存储器接收处理数据,并对所述处理数据进行原始域中以及RGB和YCbCr颜色空间中的图像数据处理。处理后的图像数据可输出给显示器,以供用户观看和/或由图形引擎或GPU(Graphics Processing Unit,图像处理器)进一步处理。此外,图像信号处理器的输出还可发送给图像存储器,且显示器可从图像存储器读取图像数据。在一种实施方式中,图像存储器可被配置为实现一个或多个帧缓冲器。
图像信号处理器确定的统计数据可发送给控制逻辑器。例如,统计数据可包括自动曝光、自动白平衡、自动聚焦、闪烁检测、黑电平补偿、透镜阴影校正等图像传感器的统计信息。
控制逻辑器可包括执行一个或多个例程(如固件)的处理器和/或微控制器。一个或多个例程可根据接收的统计数据,确定摄像头的控制参数以及ISP控制参数。例如,摄像头的控制参数可包括照相机闪光控制参数、透镜的控制参数(例如聚焦或变焦用焦距)、或这些参数的组合。ISP控制参数可包括用于自动白平衡和颜色调整(例如,在RGB处理期间)的增益水平和色彩校正矩阵等。
请参阅图9,图9为本实施例中图像处理电路的结构示意图。如图9所示,为便于说明,仅示出与本申请实施例相关的图像处理技术的各个方面。
例如图像处理电路可以包括:摄像头、图像信号处理器、控制逻辑器、图像存储器、显示器。其中,摄像头可以包括一个或多个透镜和图像传感器。在一些实施例中,摄像头可为长焦摄像头或广角摄像头中的任一者。
摄像头采集的第一图像传输给图像信号处理器进行处理。图像信号处理器处理第一图像后,可将第一图像的统计数据(如图像的亮度、图像的反差值、图像的颜色等)发送给控制逻辑器。控制逻辑器可根据统计数据确定摄像头的控制参数,从而摄像头可根据控制参数进行自动对焦、自动曝光等操作。第一图像经过图像信号处理器进行处理后可存储至图像存储器中。图像信号处理器也可以读取图像存储器中存储的图像以进行处理。另外,第一图像经过图像信号处理器进行处理后可直接发送至显示器进行显示。显示器也可以读取图像存储器中的图像以进行显示。
此外,图中没有展示的,电子设备还可以包括CPU和供电模块。CPU和逻辑控制器、图像信号处理器、图像存储器和显示器均连接,CPU用于实现全局控制。供电模块用于为各个模块供电。
存储器402存储的应用程序中包含有可执行代码。应用程序可以组成各种功能模块。处理器403通过运行存储在存储器402的应用程序,从而执行各种功能应用以及数据处理。
处理器403是电子设备的控制中心,利用各种接口和线路连接整个电子设备的各个部分,通过运行或执行存储在存储器402内的应用程序,以及调用存储在存储器402内的数据,执行电子设备的各种功能和处理数据,从而对电子设备进行整体监控。
触摸显示屏404可以用于接收用户对电子设备的触摸控制操作。扬声器405可以播放声音信号。传感器406可包括陀螺仪传感器、加速度传感器、方向传感器、磁场传感器等,其可用于获取电子设备400的当前姿态。
在本实施例中,电子设备中的处理器403会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行代码加载到存储器402中,并由处理器403来运行存储在存储器402中的应用程序,从而执行:
获取待处理图片;
调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
从所述多个区域中确定出目标区域;
对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一种实施方式中,处理器403执行获取待处理图片之前,还可以执行:获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;获取待训练的图像语义分割模型;利用所述样本图片对所述待训练的图像语义分割模型进行训练。
在一种实施方式中,所述目标区域包括表格区域,处理器403执行对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之后,还可以执行:识别所述表格区域中的表格的行数与列数;根据所述行数与列数,生成表格;将所述文字填充至所述表格中。
在一种实施方式中,所述目标区域还包括文本区域,处理器403执行将所述文字填充至所述表格中之后,还可以执行:根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;输出排版后的表格和从所述文本区域中识别出的文字。
在一种实施方式中,处理器403执行输出排版后的表格和从所述文本区域中识别出的文字时,可以执行:显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
在一种实施方式中,处理器403执行对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字时,可以执行:获取预训练的文字识别模型;利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
在一种实施方式中,处理器403执行利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之前,还可以执行:当所述多个区域均为目标区域时,判断所述待处理图片的长度是否大于预设长度;若所述待处理图片的长度大于预设长度,则对所述待处理图片进行裁切处理,以将所述待处理图片裁切为多个子图片,其中,每个子图片与一区域对应;处理器403执行利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字时,可以执行:利用所述文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对图片处理方法的详细描述,此处不再赘述。
本申请实施例提供的所述图片处理装置与上文实施例中的图片处理方法属于同一构思,在所述图片处理装置上可以运行所述图片处理方法实施例中提供的任一方法,其具体实现过程详见所述图片处理方法实施例,此处不再赘述。
需要说明的是,对本申请实施例所述图片处理方法而言,本领域普通技术人员可以理解实现本申请实施例所述图片处理方法的全部或部分流程,是可以通过计算机程序来控制相关的硬件来完成,所述计 算机程序可存储于一计算机可读取存储介质中,如存储在存储器中,并被至少一个处理器执行,在执行过程中可包括如所述图片处理方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)等。
对本申请实施例的所述图片处理装置而言,其各功能模块可以集成在一个处理芯片中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中,所述存储介质譬如为只读存储器,磁盘或光盘等。
以上对本申请实施例所提供的一种图片处理方法、装置、存储介质以及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种图片处理方法,其中,包括:
    获取待处理图片;
    调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
    从所述多个区域中确定出目标区域;
    对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
  2. 根据权利要求1所述的图片处理方法,其中,所述获取待处理图片之前,还包括:
    获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;
    获取待训练的图像语义分割模型;
    利用所述样本图片对所述待训练的图像语义分割模型进行训练。
  3. 根据权利要求1所述的图片处理方法,其中,所述目标区域包括表格区域,所述对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之后,还包括:
    识别所述表格区域中的表格的行数与列数;
    根据所述行数与列数,生成表格;
    将所述文字填充至所述表格中。
  4. 根据权利要求3所述的图片处理方法,其中,所述目标区域还包括文本区域,所述将所述文字填充至所述表格中之后,还包括:
    根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;
    输出排版后的表格和从所述文本区域中识别出的文字。
  5. 根据权利要求4所述的图片处理方法,其中,所述输出排版后的表格和从所述文本区域中识别出的文字,包括:
    显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;
    将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
  6. 根据权利要求1所述的图片处理方法,其中,所述对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字,包括:
    获取预训练的文字识别模型;
    利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
  7. 根据权利要求6所述的图片处理方法,其中,所述利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字之前,还包括:
    当所述多个区域均为目标区域时,判断所述待处理图片的长度是否大于预设长度;
    若所述待处理图片的长度大于预设长度,则对所述待处理图片进行裁切处理,以将所述待处理图片裁切为多个子图片,其中,每个子图片与一区域对应;
    所述利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字,包括:
    利用所述文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
  8. 一种图片处理装置,其中,包括:
    获取模块,用于获取待处理图片;
    调用模块,用于调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
    确定模块,用于从所述多个区域中确定出目标区域;
    识别模块,用于对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
  9. 根据权利要求8所述的图片处理装置,其中,所述获取模块,用于:
    获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;
    获取待训练的图像语义分割模型;
    利用所述样本图片对所述待训练的图像语义分割模型进行训练。
  10. 根据权利要求8所述的图片处理装置,其中,所述目标区域包括表格区域,所述识别模块,用于:
    识别所述表格区域中的表格的行数与列数;
    根据所述行数与列数,生成表格;
    将所述文字填充至所述表格中。
  11. 根据权利要求10所述的图片处理装置,其中,所述目标区域还包括文本区域,所述识别模块,用于:
    根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;
    输出排版后的表格和从所述文本区域中识别出的文字。
  12. 根据权利要求11所述的图片处理装置,其中,所述识别模块,用于:
    显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;
    将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
  13. 一种存储介质,其中,所述存储介质中存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行权利要求1所述的图片处理方法。
  14. 一种电子设备,其中,所述电子设备包括处理器和存储器,所述存储器中存储有计算机程序,所述处理器通过调用所述存储器中存储的所述计算机程序,用于执行:
    获取待处理图片;
    调用预训练的图像语义分割模型将所述待处理图片划分为多个区域,其中,每个区域对应一类别,所述类别包括文本类别、表格类别和图片类别;
    从所述多个区域中确定出目标区域;
    对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
  15. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:
    获取样本图片,所述样本图片包括多个样本区域,每个样本区域对应一类别;
    获取待训练的图像语义分割模型;
    利用所述样本图片对所述待训练的图像语义分割模型进行训练。
  16. 根据权利要求14所述的电子设备,其中,所述目标区域包括表格区域,所述处理器用于执行:
    识别所述表格区域中的表格的行数与列数;
    根据所述行数与列数,生成表格;
    将所述文字填充至所述表格中。
  17. 根据权利要求16所述的电子设备,其中,所述目标区域还包括文本区域,所述处理器用于执行:
    根据所述待处理图片的排版格式,对所述表格和从所述文本区域中识别出的文字进行排版;
    输出排版后的表格和从所述文本区域中识别出的文字。
  18. 根据权利要求17所述的电子设备,其中,所述处理器用于执行:
    显示编辑界面,所述编辑界面为供用户进行编辑操作的界面;
    将排版后的表格和从所述文本区域中识别出的文字输出至所述编辑界面。
  19. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:
    获取预训练的文字识别模型;
    利用所述文字识别模型对所述目标区域进行文字识别处理,以识别得到所述目标区域中的文字。
  20. 根据权利要求19所述的电子设备,其中,所述处理器用于执行:
    当所述多个区域均为目标区域时,判断所述待处理图片的长度是否大于预设长度;
    若所述待处理图片的长度大于预设长度,则对所述待处理图片进行裁切处理,以将所述待处理图片 裁切为多个子图片,其中,每个子图片与一区域对应;
    利用所述文字识别模型对每个子图片进行文字识别处理,以识别得到每个子图片中的文字。
PCT/CN2021/074706 2020-03-27 2021-02-01 图片处理方法、装置、存储介质及电子设备 WO2021190146A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010230790.8A CN111444922A (zh) 2020-03-27 2020-03-27 图片处理方法、装置、存储介质及电子设备
CN202010230790.8 2020-03-27

Publications (1)

Publication Number Publication Date
WO2021190146A1 true WO2021190146A1 (zh) 2021-09-30

Family

ID=71651349

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074706 WO2021190146A1 (zh) 2020-03-27 2021-02-01 图片处理方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN111444922A (zh)
WO (1) WO2021190146A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118072128A (zh) * 2024-04-19 2024-05-24 深圳爱莫科技有限公司 一种细粒度多模态大模型训练方法
CN118313837A (zh) * 2024-06-07 2024-07-09 青岛云创智通科技有限公司 一种基于大数据的客户关系管理系统

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444922A (zh) * 2020-03-27 2020-07-24 Oppo广东移动通信有限公司 图片处理方法、装置、存储介质及电子设备
CN111881902B (zh) * 2020-07-28 2023-06-27 平安科技(深圳)有限公司 训练样本制作方法、装置、计算机设备及可读存储介质
CN112016545A (zh) * 2020-08-11 2020-12-01 中国银联股份有限公司 一种包含文本的图像生成方法及装置
CN111931666B (zh) * 2020-08-13 2024-02-13 中国工商银行股份有限公司 凭证自动化处理系统及方法
CN112000834B (zh) * 2020-08-26 2024-08-09 北京百度网讯科技有限公司 文档处理方法、装置、系统、电子设备及存储介质
CN113762260A (zh) * 2020-09-09 2021-12-07 北京沃东天骏信息技术有限公司 一种版面图片的处理方法、装置、设备及存储介质
CN112256168A (zh) * 2020-09-30 2021-01-22 北京百度网讯科技有限公司 一种手写内容电子化的方法、装置、电子设备及存储介质
CN112465691A (zh) * 2020-11-25 2021-03-09 北京旷视科技有限公司 图像处理方法、装置、电子设备和计算机可读介质
CN112560767A (zh) * 2020-12-24 2021-03-26 南方电网深圳数字电网研究院有限公司 文档签名识别方法、装置及计算机可读存储介质
CN112668583A (zh) * 2021-01-07 2021-04-16 浙江星汉信息技术股份有限公司 图像识别方法、装置以及电子设备
CN113011274B (zh) * 2021-02-24 2024-04-09 南京三百云信息科技有限公司 图像识别方法、装置、电子设备及存储介质
CN112818961A (zh) * 2021-03-26 2021-05-18 北京东方金朔信息技术有限公司 图像特征识别方法及装置
CN113239660A (zh) * 2021-04-29 2021-08-10 维沃移动通信(杭州)有限公司 文本显示方法、装置及电子设备
CN113240687A (zh) * 2021-05-17 2021-08-10 Oppo广东移动通信有限公司 图像处理方法、装置、电子设备和可读存储介质
CN113449620A (zh) * 2021-06-17 2021-09-28 深圳思谋信息科技有限公司 基于语义分割的表格检测方法、装置、设备和介质
CN113808033A (zh) * 2021-08-06 2021-12-17 上海深杳智能科技有限公司 图像文档校正方法、系统、终端及介质
CN115035530A (zh) * 2022-04-21 2022-09-09 阿里巴巴达摩院(杭州)科技有限公司 图像处理方法、图像文本获得方法、装置及电子设备
CN114863434B (zh) * 2022-04-21 2023-05-23 北京百度网讯科技有限公司 文字分割模型的获取方法、文字分割方法及其装置
CN114911963B (zh) * 2022-05-12 2023-09-01 星环信息科技(上海)股份有限公司 一种模板图片分类方法、装置、设备、存储介质及产品
CN115880704B (zh) * 2023-02-16 2023-06-16 中国人民解放军总医院第一医学中心 一种病例的自动编目方法、系统、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09190502A (ja) * 1996-01-10 1997-07-22 Canon Inc 画像処理装置およびその方法
JP2002269573A (ja) * 2001-03-08 2002-09-20 Ricoh Co Ltd 文書認識方法及びその装置並びに記録媒体
CN107622271A (zh) * 2016-07-15 2018-01-23 科大讯飞股份有限公司 手写文本行提取方法及系统
CN109522816A (zh) * 2018-10-26 2019-03-26 北京慧流科技有限公司 表格识别方法及装置、计算机存储介质
CN109933756A (zh) * 2019-03-22 2019-06-25 腾讯科技(深圳)有限公司 基于ocr的图像转档方法、装置、设备及可读存储介质
CN110704153A (zh) * 2019-10-10 2020-01-17 深圳前海微众银行股份有限公司 界面逻辑解析方法、装置、设备及可读存储介质
CN111444922A (zh) * 2020-03-27 2020-07-24 Oppo广东移动通信有限公司 图片处理方法、装置、存储介质及电子设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002056356A (ja) * 2000-08-11 2002-02-20 Ricoh Co Ltd 文字認識装置、文字認識方法および記録媒体
CN109785843B (zh) * 2017-11-14 2024-03-26 上海寒武纪信息科技有限公司 图像处理装置和方法
CN110807404A (zh) * 2019-10-29 2020-02-18 上海眼控科技股份有限公司 基于深度学习的表格线检测方法、装置、终端、存储介质

Also Published As

Publication number Publication date
CN111444922A (zh) 2020-07-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21776836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21776836

Country of ref document: EP

Kind code of ref document: A1