CN113139534A - Two-stage secure multi-party computation image text positioning and recognition method - Google Patents

Two-stage secure multi-party computation image text positioning and recognition method

Info

Publication number
CN113139534A
CN113139534A (application number CN202110488731.5A)
Authority
CN
China
Prior art keywords
single character
picture
feature
positioning
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110488731.5A
Other languages
Chinese (zh)
Other versions
CN113139534B (en)
Inventor
茹超飞
黄征
郭捷
邱卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110488731.5A
Publication of CN113139534A
Application granted
Publication of CN113139534B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

A two-stage secure multi-party computation method for privacy-preserving picture text positioning and recognition. The picture character positioning and recognition scheme involves no plaintext transmission: the picture information and all content transmitted between the user side and the cloud server are encrypted through secure multi-party computation, which meets the requirement of picture privacy protection while keeping the service reliable and the security high. The invention separates the encrypted character positioning service from the recognition network service, and provides a segmentation-based single-character positioning network and a fully convolutional single-character recognition network suited to a secure multi-party computation framework, which can respectively provide encrypted single-character positioning and single-character recognition services.

Description

Two-stage secure multi-party computation image text positioning and recognition method
Technical Field
The invention relates to a technology in the field of image processing, in particular to a privacy-preserving picture text positioning and recognition method based on two-stage secure multi-party computation, that is, a method that uses secure multi-party computation to position and recognize the single characters in a picture without revealing the real picture information.
Background
Most existing cloud services that provide picture character positioning and recognition algorithms require the real picture information: to receive the service, users usually have to hand the genuine picture to the server side, which cannot meet privacy-protection requirements. Privacy-preserving computation technologies, represented by secure multi-party computation, carry out computation tasks among multiple participants while keeping the data from being disclosed; they resolve the contradiction between usability and privacy in data circulation, emphasize data ownership and secure processing, and thereby make picture information usable while the privacy of the picture is protected. Secure multi-party computation protocols allow multiple participants to compute jointly over their data without actually sharing the inputs, using cryptographic techniques such as homomorphic encryption, secret sharing and oblivious transfer. Research on secure multi-party computation mainly addresses how to securely evaluate an agreed function without a trusted third party.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a privacy-preserving picture text positioning and recognition method based on two-stage secure multi-party computation, which positions and recognizes the single characters of an encrypted picture on the basis of a secure multi-party computation scheme. Unlike ordinary character recognition work, it fully protects the privacy and security of the user; the secure multi-party computation scheme is the technical means on which the invention rests.
The invention is realized by the following technical scheme:
the invention relates to a two-stage safe multi-party computing privacy protection picture text positioning and identifying method, which comprises the following steps:
1) based on a function secret sharing protocol in secure multi-party computing, a user encrypts information of each pixel in a picture through a secret function in the secret function, and transmits the encrypted picture information to a cloud server;
2) the cloud server performs feature extraction on the picture information based on the trained single character positioning model to obtain three levels of encrypted picture single character positioning feature maps, and transmits the encrypted picture single character positioning feature maps back to the user side;
3) the user decrypts the encrypted picture single character positioning feature map by using a secret function locally, obtains the pixel point position of a single character text from the decrypted feature map by using a progressive expansion algorithm, and calculates the minimum bounding rectangle of the single character region to obtain the single character text box coordinate;
4) the user screens out the single characters sequentially arranged in the same text line by the spatial distance of the coordinates of the single character text box, and then the sequentially arranged single characters are encrypted by the secret function in sequence by the secret function and then are respectively transmitted to the cloud server;
5) the cloud server performs single character recognition on the encrypted single character picture based on the trained single character recognition model to obtain a single character recognition result and transmits the single character recognition result to the user side;
6) the user decrypts the single character recognition result locally by using the secret function, and arranges the decryption results in sequence to obtain the character recognition result.
The secret function f is divided, according to the number of users and servers involved, into secret shares f1, ..., fn. Each secret share function serves as part of the key: the original plaintext input can be restored only when all secret shares are obtained, and computation on the plaintext can be carried out with the secret shares without ever exposing the plaintext, so that encrypted convolution, pooling and activation-function operations are realized.
The encryption in step 1 means: the user encrypts picture P pixel by pixel using secret share f1, while the server holds secret share f2 and can compute on the encrypted Ek(P). The computation proceeds through communication between the server and the user; because feature-map extraction can be regarded as a series of continuous functions, the computation of the feature extraction network can be replaced by computation of the secret function, after which the server obtains the encrypted feature map Ek(Fchar) and transmits it back to the user side. The corresponding decryption means: the user side holds all the secret shares needed to decrypt and recover Fchar, and locally restores the single-character positioning feature maps.
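As an illustration of why the feature-extraction computation can run on encrypted data, the following is a minimal sketch, assuming plain two-party additive secret sharing over a prime modulus and a single linear convolution. It is only a toy stand-in: the patent's function secret sharing protocol, its communication pattern, and the interactive sub-protocols needed for pooling and activation functions are not shown.

```python
# Toy sketch: additive secret sharing of a picture and share-wise evaluation of one
# convolution. This is NOT the patent's function secret sharing protocol; it only
# illustrates why linear layers can be evaluated on encrypted shares.
import numpy as np

P = 2**31 - 1  # all arithmetic is done modulo a large prime


def share(x):
    """Split integer tensor x into shares x1 + x2 = x (mod P); one share alone reveals nothing."""
    x1 = np.random.randint(0, P, size=x.shape, dtype=np.int64)
    return x1, (x - x1) % P


def reconstruct(x1, x2):
    return (x1 + x2) % P


def conv2d(img, kernel):
    """Plain 'valid' 2-D convolution with integer arithmetic modulo P."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel) % P
    return out


picture = np.random.randint(0, 256, size=(8, 8)).astype(np.int64)   # the user's picture
kernel = np.random.randint(0, 5, size=(3, 3)).astype(np.int64)      # one filter of the model

p1, p2 = share(picture)              # user keeps p1, sends p2
f1 = conv2d(p1, kernel)              # one share of the feature map
f2 = conv2d(p2, kernel)              # the other share (the "encrypted Ek(Fchar)")
feature_map = reconstruct(f1, f2)    # only the holder of both shares recovers Fchar

assert np.array_equal(feature_map, conv2d(picture, kernel))
```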
The encryption by the secret function in step 4 means: the client encrypts each sorted single-character picture Pic with secret share f1 to obtain Ek(Pic); the server runs the character recognition network on the encrypted picture to obtain the encrypted recognition result Ek(Char); and the user decrypts the results in the order in which the single-character pictures were transmitted to obtain the character recognition result.
The feature extraction comprises the following specific steps:
the size of the input picture is (N, C, H, W), where N represents the training batch size, generally 1, C represents the number of picture channels, generally 3, and H, W are the height and width of the picture. After the input picture is obtained, four convolution layers of the feature network extract four features C of the picture from low to high2,C3,C4,C5The size of the picture is respectively 4, 8, 16 and 32 times of the down-sampling of the original picture, wherein the bottom layer characteristic C2Large size and small receptive field, and more attention is paid to the bottom-level detail information of the picture, C5The feature receptive field of the highest layer of the image is the largest and has high-layer semantic information. When the features are fused, the network adopts simple addition operation instead of concat used by general feature fusion so as to reduce the network calculation amount. Obtaining various feature outputs, then utilizing two times of upsampling to make feature graphs consistent in size, then utilizing addition operation to successively implement fusion feature operation, P5Characteristic layer pass through C5Convolution is obtained by changing the number of channels. C4And P5Fusion to give P4I.e. P4=Up(P5)+C4. Then F4And C3Fusion to give P3I.e. P3=Up(P4)+C3,P3And C2Fusion to give P2I.e. F2=Up(P3)+C2. Finally, the three layers of fused features are sampled and fused again to obtain the output single character outer contour, inner contour and single character center three-level features Fchar1,Fchar2And Fchar3. I.e. (F)char1,Fchar2,Fchar3)=(P2+Up(P3)+Up(Up(P4))+Up(Up(Up(P5)))). The output feature map size is (3, H/4, W/4). The value at each position of the feature map matrix represents the probability that the pixel is a single word.
Secondly, after the picture feature maps are obtained, an up-sampling algorithm with bilinear interpolation enlarges the feature maps four times, back to the original picture size. At this point there are three feature maps, each the same size as the original picture: the first, the outer-contour feature map, corresponds to the region of the minimum circumscribed rectangle of each single character; the second, the inner-contour feature map, corresponds to that rectangle shrunk to 0.7 times; and the third, the centre feature map, corresponds to the rectangle shrunk to 0.5 times, and is used to determine the character centre. Every value in these feature maps is mapped to the range 0-1 by a sigmoid function; the mapped value represents the probability that the pixel carries the corresponding feature. With a threshold of 0.9, pixels whose probability exceeds the threshold are treated as having the feature and take the value 1, while pixels below the threshold take the value 0, which yields three 0-1 binary maps of the same size as the original picture, corresponding respectively to the pixels of the single-character outer contour, inner contour and centre. The connected regions determined on the centre map are expanded to the inner-contour boundary with breadth-first search (BFS), and the resulting inner-contour regions are then expanded by the same search to the outer-contour boundary. At that point all pixels inside an outer contour represent the region of one character, and the minimum circumscribed rectangle of the connected region is the positioning box of that single character. Because the character centre is determined from a dedicated feature layer, the algorithm can effectively separate adjacent single characters and find the positioning box of each one.
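The post-processing just described can be sketched as follows; it runs on the user side, on the decrypted and up-sampled maps. Simplifying assumptions: 4-connected breadth-first expansion, axis-aligned bounding boxes instead of minimum circumscribed rectangles, and OpenCV used only for the initial connected components.

```python
# Simplified progressive-expansion sketch: centre regions are grown to the inner contour
# and then to the outer contour by breadth-first search.
from collections import deque

import cv2
import numpy as np


def expand(seed_labels, mask):
    """Grow labelled seed regions inside the 0/1 mask; a pixel belongs to whichever
    region reaches it first (the first-come-first-claimed rule)."""
    labels = seed_labels.copy()
    queue = deque(zip(*np.nonzero(labels)))
    h, w = mask.shape
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                queue.append((ny, nx))
    return labels


def locate_characters(prob_maps, thresh=0.9):
    """prob_maps: (3, H, W) sigmoid-mapped outer/inner/centre maps at original picture size."""
    outer, inner, centre = (prob_maps > thresh).astype(np.uint8)
    _, seeds = cv2.connectedComponents(centre)   # one seed region per character centre
    seeds = expand(seeds, inner)                 # centre -> inner-contour boundary
    seeds = expand(seeds, outer)                 # inner -> outer-contour boundary
    boxes = []
    for label in range(1, int(seeds.max()) + 1):
        ys, xs = np.nonzero(seeds == label)
        if xs.size:                              # axis-aligned bounding box of one character
            boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes
```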
The single-character recognition is specifically as follows. The coordinate information of the obtained single-character positioning boxes is used to decide which single characters lie in the same text line: the rectangle of each single-character box is enlarged to 1.25 times its original size with a scaling function and tested for intersection with the other boxes; single characters whose enlarged boxes intersect are regarded as text in the same text line and are sorted by their abscissa. The user side crops the corresponding regions from the original picture using the character coordinates, encrypts the crops and transmits them to the cloud server in order for single-character recognition. After a single-character picture is resized to a fixed size Pic, it passes through several alternating convolutional and pooling layers, a view operation flattens the result into a one-dimensional vector, a fully connected layer fc outputs the probability of each character class, and the argmax function finally gives the single-character class with the highest probability, i.e. Char = argmax(fc(view((MaxPool(Conv(Pic)))^n))), where the convolution, activation-function and max-pooling operations are repeated n times. The output dimension of the fully connected layer fc is the number of predicted single-character classes, generally the roughly three thousand commonly used Chinese characters plus upper- and lower-case English letters and Arabic numerals.
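A PyTorch sketch of such a single-character classifier is shown below; the input size, the number of convolution/pooling repetitions, the channel widths and the exact class count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CharRecognizer(nn.Module):
    """Conv -> activation -> max-pool repeated n times, then view, fc and argmax."""

    def __init__(self, num_classes=3000 + 52 + 10, size=32):
        super().__init__()
        blocks, c_in = [], 3
        for c_out in (32, 64, 128):                      # n = 3 repetitions here
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
            c_in = c_out
        self.features = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_in * (size // 2 ** 3) ** 2, num_classes)

    def forward(self, pic):                              # pic: (N, 3, size, size) resized crop
        x = self.features(pic)
        x = x.view(x.size(0), -1)                        # "view" flattens to a 1-D vector
        scores = self.fc(x)                              # class scores (softmax omitted,
        return scores.argmax(dim=1)                      #  argmax is unchanged by it)


crop = torch.rand(1, 3, 32, 32)                          # one decrypted, resized crop
print(CharRecognizer()(crop))                            # predicted single-character class index
```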
The invention also relates to a system for realizing the method, comprising a single-character feature extraction network unit, a progressive expansion post-processing unit and a single-character recognition network unit, wherein: the single-character feature extraction network unit is connected with the progressive expansion post-processing unit and transmits the single-character feature map information to it, and the progressive expansion post-processing unit is connected with the single-character recognition network unit and transmits the single-character pictures to it.
Technical effects
The invention as a whole overcomes the defect that the prior art cannot carry out efficient text positioning and recognition while guaranteeing picture privacy. Compared with the prior art, the text positioning and recognition scheme for the picture involves no plaintext transmission: the picture information and all content transmitted between the user side and the cloud server are encrypted through secure multi-party computation, meeting the requirement of picture privacy protection with reliable service and high security. The invention separates the encrypted character positioning service from the recognition network service, and provides a segmentation-based single-character positioning network and a fully convolutional single-character recognition network suited to a secure multi-party computation framework, which can respectively provide encrypted single-character positioning and single-character recognition services.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2a and FIG. 2b are schematic diagrams of a single character positioning model according to the present invention;
FIGS. 3a and 3b are schematic diagrams of the word recognition model of the present invention.
Detailed Description
As shown in Fig. 1, the present embodiment relates to a privacy-preserving picture text positioning and recognition method based on secure multi-party computation, in which the server provides a remote invocation interface over services such as HTTP and the text positioning and recognition models, trained on plaintext, are deployed at the server side; the method specifically includes the following steps:
1) based on a function secret sharing protocol in secure multi-party computation, the user encrypts the information of every pixel in the picture with the secret function and transmits the encrypted picture information to the cloud server;
2) the cloud server performs feature extraction on the encrypted picture information with the trained single-character positioning model to obtain the three-level encrypted single-character positioning feature maps Fchar1, Fchar2 and Fchar3, and transmits them back to the user side;
3) the user locally decrypts the encrypted single-character positioning feature maps with the secret function, obtains the pixel positions of each single character from the decrypted feature maps with the progressive expansion algorithm, and computes the minimum bounding rectangle of each single-character region to obtain the single-character text box coordinates;
4) the user uses the spatial distance between the single-character text box coordinates to screen out the single characters that belong to the same text line in order, encrypts these ordered single characters one by one with the secret function, and transmits them separately to the cloud server;
5) the cloud server performs single-character recognition on each encrypted single-character picture with the trained single-character recognition model and transmits the single-character recognition results to the user side;
6) the user locally decrypts the single-character recognition results with the secret function and arranges the decrypted results in order to obtain the character recognition result.
As shown in Figs. 2a and 2b, the single-character positioning model includes a single-character feature extraction module and a progressive expansion post-processing module, wherein: the single-character feature extraction module performs convolution computation on the pixel information of the text picture to obtain three layers of single-character feature maps, and the progressive expansion post-processing module, working on these three layers of feature map information, expands the connected regions formed by the pixels of each layer out to the boundary of the next layer, up to the single-character outer contour, with a breadth-first search under a first-come-first-claimed rule, thereby obtaining the pixel region occupied by each single character.
The single-character positioning model is trained on plaintext pictures of a specific data set; the plaintext labels are 0-1 binary maps of the single-character outer contour, inner contour and centre, obtained by scaling the original polygon of each single-character box to 1.0, 0.7 and 0.5 times its size with the Vatti clipping polygon-scaling algorithm.
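Label generation of this kind is typically implemented with the pyclipper binding of the Vatti/Clipper polygon-offsetting algorithm, as sketched below. The offset distance d = A(1 - r^2)/L is a customary approximation for shrinking a polygon to roughly ratio r and is an assumption here, as is the use of OpenCV merely to rasterize the masks.

```python
import cv2
import numpy as np
import pyclipper


def shrink(polygon, ratio):
    """Shrink a character polygon (list of (x, y) points) to roughly `ratio` of its size."""
    poly = np.asarray(polygon, dtype=np.int64)
    x, y = poly[:, 0], poly[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))   # shoelace formula
    perimeter = np.sum(np.hypot(*(poly - np.roll(poly, 1, axis=0)).T))
    d = area * (1.0 - ratio ** 2) / max(perimeter, 1e-6)                    # assumed offset rule
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(poly.tolist(), pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-d)                                                # Vatti/Clipper offsetting
    return shrunk[0] if shrunk else poly.tolist()


def make_label_maps(char_boxes, h, w):
    """Build the three 0-1 label maps: outer contour (1.0x), inner contour (0.7x), centre (0.5x)."""
    masks = np.zeros((3, h, w), dtype=np.uint8)
    for box in char_boxes:                       # box: polygon of one single-character frame
        for i, ratio in enumerate((1.0, 0.7, 0.5)):
            pts = np.array(shrink(box, ratio), dtype=np.int32)
            cv2.fillPoly(masks[i], [pts], 1)
    return masks


# toy example: one character box in a 64x64 picture
labels = make_label_maps([[(10, 10), (40, 10), (40, 30), (10, 30)]], 64, 64)
print(labels.sum(axis=(1, 2)))    # pixel counts shrink from outer to inner to centre
```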
The single-character feature extraction module adopts a ResNet without the fully connected layer. The ResNet outputs five layers of features; the first layer is too close to the bottom and does not take part in the computation, while the remaining four layers of outputs conv2, conv3, conv4 and conv5 correspond respectively to the four features C2, C3, C4 and C5 used in the feature extraction described above. The total text positioning loss function is a weighted combination of the loss functions of the three predicted maps, namely Loss = 0.5*Loss1 + 0.25*Loss2 + 0.25*Loss3, where Loss1, Loss2 and Loss3 are computed with Dice Loss or Smooth L1 Loss and represent the losses between the single-character outer-contour, inner-contour and centre feature maps and their labels.
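A sketch of the weighted loss using the Dice formulation, one of the two options named above; the particular Dice variant and the sigmoid applied to the predictions are assumptions.

```python
import torch


def dice_loss(pred, target, eps=1e-6):
    """pred, target: (N, H, W); pred is assumed to be sigmoid-mapped to [0, 1]."""
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.pow(2).sum(dim=(1, 2)) + target.pow(2).sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()


def text_positioning_loss(pred_maps, label_maps):
    """pred_maps, label_maps: (N, 3, H, W), ordered outer contour, inner contour, centre."""
    loss1 = dice_loss(torch.sigmoid(pred_maps[:, 0]), label_maps[:, 0])   # outer contour
    loss2 = dice_loss(torch.sigmoid(pred_maps[:, 1]), label_maps[:, 1])   # inner contour
    loss3 = dice_loss(torch.sigmoid(pred_maps[:, 2]), label_maps[:, 2])   # character centre
    return 0.5 * loss1 + 0.25 * loss2 + 0.25 * loss3
```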
The single-character feature extraction module is further provided with an optimizer; Adam, which combines the AdaGrad and RMSProp optimization algorithms, is selected for training optimization.
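A minimal plaintext training-step sketch with torch.optim.Adam; the one-layer stand-in model and the per-map L1 placeholder loss are illustrative only, standing in for the full positioning network and the Dice/Smooth L1 losses described above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)        # stand-in for the ResNet + fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

picture = torch.rand(1, 3, 64, 64)                       # toy plaintext training sample
labels = torch.randint(0, 2, (1, 3, 64, 64)).float()     # toy outer/inner/centre label maps

for step in range(10):                       # a few optimization steps
    optimizer.zero_grad()
    pred = torch.sigmoid(model(picture))
    per_map = (pred - labels).abs().mean(dim=(0, 2, 3))  # placeholder per-map loss
    loss = 0.5 * per_map[0] + 0.25 * per_map[1] + 0.25 * per_map[2]
    loss.backward()
    optimizer.step()
```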
As shown in Figs. 3a and 3b, the single-character recognition model is implemented with a convolutional neural network comprising several convolutional layers with activation functions, max-pooling layers, a fully connected layer and a softmax layer, wherein: the convolutional layers perform matrix convolution on the image pixel information to obtain local image features, the max-pooling layers reduce the dimension of the image features by max pooling, the fully connected layer maps the image features to a classification probability vector, and the softmax layer normalizes the classification probability vector to obtain the final single-character classification result.
The single-character recognition model is pre-trained on plaintext, with Cross Entropy Loss as the loss function.
In practical experiments under a PySyft secure multi-party computation environment, the method was run on text pictures of 416 × 416 pixels; positioning and recognition of the picture characters completed within about 600 seconds, with an accuracy above 80%.
Compared with the prior art, the invention can complete single-character positioning and recognition in a relatively short time while maintaining a reasonable level of accuracy.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method, characterized by comprising the following steps:
1) based on a function secret sharing protocol in secure multi-party computation, the user encrypts the information of every pixel in the picture with the secret function and transmits the encrypted picture information to a cloud server;
2) the cloud server performs feature extraction on the encrypted picture information with the trained single-character positioning model to obtain three levels of encrypted single-character positioning feature maps, and transmits them back to the user side;
3) the user locally decrypts the encrypted single-character positioning feature maps with the secret function, obtains the pixel positions of each single character from the decrypted feature maps with a progressive expansion algorithm, and computes the minimum bounding rectangle of each single-character region to obtain the single-character text box coordinates;
4) the user uses the spatial distance between the single-character text box coordinates to screen out the single characters that belong to the same text line in order, encrypts these ordered single characters one by one with the secret function, and transmits them separately to the cloud server;
5) the cloud server performs single-character recognition on each encrypted single-character picture with the trained single-character recognition model and transmits the single-character recognition results to the user side;
6) the user locally decrypts the single-character recognition results with the secret function and arranges the decrypted results in order to obtain the character recognition result;
the secret function being divided, according to the number of users and servers involved, into secret shares f1, ..., fn, where each secret share function serves as part of the key, the original plaintext input can be restored only when all secret shares are obtained, and the computation on the plaintext can be carried out with the secret shares without exposing the plaintext.
2. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the encryption in step 1 means: the user encrypts picture P pixel by pixel with secret share f1, while the server holds secret share f2 and can compute on the encrypted Ek(P); the computation proceeds through communication between the server and the user, and because feature-map extraction can be regarded as a series of continuous functions, the computation of the feature extraction network can be replaced by computation of the secret function, after which the server obtains the encrypted feature map Ek(Fchar) and transmits it back to the user side; the corresponding decryption means: the user side holds all the secret shares needed to decrypt and recover Fchar, and locally restores the single-character positioning feature maps.
3. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the encryption by the secret function in step 4 means: the client encrypts each sorted single-character picture Pic with secret share f1 to obtain Ek(Pic); the server runs the character recognition network on the encrypted picture to obtain the encrypted recognition result Ek(Char); and the user decrypts the results in the order in which the single-character pictures were transmitted to obtain the character recognition result.
4. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the feature extraction comprises the following specific steps:
firstly, the size of the input picture is (N, C, H, W), where N is the training batch size, C is the number of picture channels (generally 3), and H and W are the height and width of the picture; after the input picture is obtained, four convolutional stages of the feature network extract four features of the picture from low to high, C2, C3, C4 and C5, whose sizes correspond to 4x, 8x, 16x and 32x down-sampling of the original picture; after the feature outputs are obtained, 2x up-sampling is used to make the feature maps the same size and the addition operation fuses them step by step; the P5 feature layer is obtained from C5 by a convolution that changes the number of channels; C4 is fused with P5 to give P4, i.e. P4 = Up(P5) + C4; P4 is fused with C3 to give P3, i.e. P3 = Up(P4) + C3; P3 is fused with C2 to give P2, i.e. P2 = Up(P3) + C2; finally the fused layers are up-sampled and fused once more to obtain the output three-level features for the single-character outer contour, inner contour and centre, Fchar1, Fchar2 and Fchar3, i.e. the encrypted single-character positioning feature maps (Fchar1, Fchar2, Fchar3) = P2 + Up(P3) + Up(Up(P4)) + Up(Up(Up(P5))); the output feature map size is (3, H/4, W/4); the value at each position of the feature map matrix is the probability that the pixel belongs to a single character;
secondly, after the picture feature maps are obtained, an up-sampling algorithm with bilinear interpolation first enlarges the feature maps four times, back to the original picture size, where the first, outer-contour feature map corresponds to the region of the minimum circumscribed rectangle of each single character, the second, inner-contour feature map corresponds to that rectangle shrunk to 0.7 times, and the third, centre feature map corresponds to the rectangle shrunk to 0.5 times and is used to determine the character centre; every value in the feature maps is mapped to 0-1 by a sigmoid function, the mapped value being the probability that the pixel carries the corresponding feature; pixels whose probability exceeds a threshold are treated as having the feature and take the value 1, while pixels below the threshold take the value 0, which yields three 0-1 binary maps of the same size as the original picture, corresponding respectively to the pixels of the single-character outer contour, inner contour and centre; the connected regions determined on the centre map are expanded to the inner-contour boundary with breadth-first search (BFS), and the resulting inner-contour regions are then expanded by the same search to the outer-contour boundary; at that point all pixels inside an outer contour form the character region, and the minimum circumscribed rectangle of the connected region is the positioning box of that single character.
5. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the single-character recognition is specifically: the coordinate information of the single-character positioning boxes is used to decide which single characters lie in the same text line, the rectangle of each single-character box being enlarged to 1.25 times its original size with a scaling function and tested for intersection with the other boxes; single characters whose enlarged boxes intersect are regarded as text in the same text line and sorted by abscissa; the corresponding regions are cropped from the original picture with the character coordinates, encrypted and transmitted to the cloud server in order for single-character recognition, the probability of each character class is obtained, and the argmax function finally gives the single-character class with the highest probability, i.e. Char = argmax(fc(view((MaxPool(Conv(Pic)))^n))), where the convolution, activation-function and max-pooling operations are repeated n times and the output dimension of the fully connected layer fc is the number of predicted single-character classes.
6. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the single-character positioning model comprises a single-character feature extraction module and a progressive expansion post-processing module, wherein: the single-character feature extraction module performs convolution computation on the pixel information of the text picture to obtain three layers of single-character feature maps, and the progressive expansion post-processing module, working on the three layers of feature map information, expands the connected regions formed by the pixels of each layer out to the boundary of the next layer, up to the single-character outer contour, with a breadth-first search under a first-come-first-claimed rule, thereby obtaining the pixel region occupied by each single character.
7. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 6, characterized in that the single-character feature extraction module adopts a ResNet without the fully connected layer; the ResNet outputs five layers of features, of which the first layer is too close to the bottom and does not take part in the computation, while the remaining four layers of outputs conv2, conv3, conv4 and conv5 correspond respectively to the four features C2, C3, C4 and C5 extracted in the feature extraction above; the total text positioning loss function is a weighted combination of the loss functions of the three predicted maps, namely Loss = 0.5*Loss1 + 0.25*Loss2 + 0.25*Loss3, where Loss1, Loss2 and Loss3 are computed with Dice Loss or Smooth L1 Loss and represent the losses between the single-character outer-contour, inner-contour and centre feature maps and their labels.
8. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 6 or 7, characterized in that the single-character feature extraction module is further provided with an optimizer, and Adam, which combines the AdaGrad and RMSProp optimization algorithms, is selected by the optimizer for training optimization.
9. The two-stage secure multi-party computation privacy-preserving picture text positioning and recognition method according to claim 1, characterized in that the single-character recognition model is implemented with a convolutional neural network comprising several convolutional layers with activation functions, max-pooling layers, a fully connected layer and a softmax layer, wherein: the convolutional layers perform matrix convolution on the image pixel information to obtain local image features, the max-pooling layers reduce the dimension of the image features by max pooling, the fully connected layer maps the image features to a classification probability vector, and the softmax layer normalizes the classification probability vector to obtain the final single-character classification result.
10. A system for implementing the method of any preceding claim, comprising a single-character feature extraction network unit, a progressive expansion post-processing unit and a single-character recognition network unit, wherein: the single-character feature extraction network unit is connected with the progressive expansion post-processing unit and transmits the single-character feature map information to it, and the progressive expansion post-processing unit is connected with the single-character recognition network unit and transmits the single-character pictures to it.
CN202110488731.5A 2021-05-06 2021-05-06 Two-stage secure multi-party computation image text positioning and recognition method Active CN113139534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110488731.5A CN113139534B (en) 2021-05-06 2021-05-06 Two-stage secure multi-party computation image text positioning and recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110488731.5A CN113139534B (en) 2021-05-06 2021-05-06 Two-stage secure multi-party computation image text positioning and recognition method

Publications (2)

Publication Number Publication Date
CN113139534A true CN113139534A (en) 2021-07-20
CN113139534B CN113139534B (en) 2022-07-15

Family

ID=76816873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110488731.5A Active CN113139534B (en) Two-stage secure multi-party computation image text positioning and recognition method

Country Status (1)

Country Link
CN (1) CN113139534B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629924A (en) * 2012-03-30 2012-08-08 上海交通大学 Private information retrieval method in environment of a plurality of servers
CN103532960A (en) * 2013-10-21 2014-01-22 上海合合信息科技发展有限公司 Text encryption interaction method, encryption method and device, decryption method and device
US20170317983A1 (en) * 2016-04-28 2017-11-02 Xerox Corporation Image document processing in a client-server system including privacy-preserving text recognition
CN108446680A (en) * 2018-05-07 2018-08-24 西安电子科技大学 A kind of method for secret protection in face authentication system based on edge calculations
CN109039650A (en) * 2018-08-08 2018-12-18 朱俊 A kind of method and its system merging the online information trustship of biological feature encryption technology
CN110175610A (en) * 2019-05-23 2019-08-27 上海交通大学 A kind of bill images text recognition method for supporting secret protection
CN110276279A (en) * 2019-06-06 2019-09-24 华东师范大学 A kind of arbitrary shape scene text detection method based on image segmentation
CN111243698A (en) * 2020-01-14 2020-06-05 暨南大学 Data security sharing method, storage medium and computing device
CN111797409A (en) * 2020-03-26 2020-10-20 中南林业科技大学 Big data Chinese text carrier-free information hiding method
CN111541679A (en) * 2020-04-17 2020-08-14 武汉大学 Image security retrieval method based on secret sharing in cloud environment
CN111726231A (en) * 2020-05-26 2020-09-29 河海大学 Public key encryption method and device supporting search of richly expressed keywords
CN112202919A (en) * 2020-10-22 2021-01-08 中国科学院信息工程研究所 Picture ciphertext storage and retrieval method and system under cloud storage environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEHMOOD BARYALAI, ET AL.: "Towards Privacy-Preserving Classification in Neural Networks", IEEE PRIVACY SECURITY AND, 31 December 2017 (2017-12-31) *
TANG PENG ET AL.: "A Survey of Privacy Protection Techniques in Deep Learning" (深度学习中的隐私保护技术综述), 魔石观察, 10 June 2019 (2019-06-10) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611131A (en) * 2022-05-10 2022-06-10 支付宝(杭州)信息技术有限公司 Method, device and system for determining common data for protecting privacy

Also Published As

Publication number Publication date
CN113139534B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
CN110084734B (en) Big data ownership protection method based on object local generation countermeasure network
Younus et al. Video steganography using knight tour algorithm and LSB method for encrypted data
Song et al. Protection of image ROI using chaos-based encryption and DCNN-based object detection
CN111507386B (en) Method and system for detecting encryption communication of storage file and network data stream
CN113537633B (en) Prediction method, device, equipment, medium and system based on longitudinal federal learning
CN106952212B (en) A kind of HOG image characteristics extraction algorithm based on vector homomorphic cryptography
Shen et al. Optical selective encryption based on the FRFCM algorithm and face biometric for the medical image
US20050108282A1 (en) System and method of searching for image data in a storage medium
CN107729935A (en) The recognition methods of similar pictures and device, server, storage medium
CN113434716B (en) Cross-modal information retrieval method and device
CN111797409A (en) Big data Chinese text carrier-free information hiding method
Rajput et al. Privacy-preserving human action recognition as a remote cloud service using RGB-D sensors and deep CNN
CN113139534B (en) Two-stage safe multi-party calculation image text positioning and identifying method
CN114049280A (en) Image erasing and repairing method and device, equipment, medium and product thereof
Lodeiro-Santiago et al. Secure UAV-based system to detect small boats using neural networks
CN110992367B (en) Method for semantically segmenting image with occlusion region
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
Cui et al. Multitask identity-aware image steganography via minimax optimization
Prakash et al. Privacy preserving facial recognition against model inversion attacks
Sarmah et al. Optimization models in steganography using metaheuristics
WO2022241307A1 (en) Image steganography utilizing adversarial perturbations
Deepa et al. Steganalysis on images based on the classification of image feature sets using SVM classifier
CN116383470B (en) Image searching method with privacy protection function
US20240119714A1 (en) Image recognition model training method and apparatus
CN111476865B (en) Image protection method for image recognition based on deep learning neural network

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant