CN116543397A - Text similarity calculation method and device, electronic equipment and storage medium - Google Patents

Text similarity calculation method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN116543397A
CN116543397A
Authority
CN
China
Prior art keywords
text
image
paragraph
identified
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575147.2A
Other languages
Chinese (zh)
Inventor
张小亮
李东欣
李茂林
戚纪纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Superred Technology Co Ltd
Original Assignee
Beijing Superred Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Superred Technology Co Ltd filed Critical Beijing Superred Technology Co Ltd
Priority to CN202310575147.2A priority Critical patent/CN116543397A/en
Publication of CN116543397A publication Critical patent/CN116543397A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a text similarity calculation method and device, an electronic device, and a storage medium, in the field of text matching. The method and device improve the efficiency of judging and calculating the similarity between texts.

Description

Text similarity calculation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of text matching, and in particular, to a method and apparatus for calculating text similarity, an electronic device, and a storage medium.
Background
With the continuous development of network technology, information leakage incidents have increased. When sensitive information leaks, serious economic losses and safety problems result. At present, information leakage occurs mainly in two ways: first, sensitive information is leaked in text form; second, sensitive information is recorded in an image and leaked in the form of that image.
If sensitive information leaks in the second form, the common practice at present is to perform OCR on the text in the image to obtain that text, and then calculate the similarity between it and the text corresponding to the sensitive information, so as to determine whether sensitive information exists in the image and, further, whether it has leaked. However, performing OCR recognition on the image and then calculating similarity consumes substantial computing resources and is inefficient.
Disclosure of Invention
In order to improve the efficiency of judging and calculating the similarity between texts, the present application provides a text similarity calculation method and device, an electronic device, and a storage medium.
In a first aspect, the present application provides a text similarity calculation method, which adopts the following technical scheme:
a text similarity calculation method, comprising:
acquiring an image to be identified;
judging whether at least one text paragraph exists in the image to be identified;
if the at least one text paragraph exists, determining a text image from the image to be identified, wherein the text image comprises the at least one text paragraph;
extracting feature vectors of the text image to obtain text feature vectors;
And calculating the similarity between preset sensitive information feature vectors and the text feature vectors.
By adopting this technical scheme, after the image to be recognized is obtained, if at least one text paragraph exists in it, a text image corresponding to the text paragraph(s) is determined from the image to be recognized, and a feature vector is extracted from the text image to obtain a text feature vector. Because the preset sensitive information is likewise represented as a feature vector, the similarity between the preset sensitive information feature vector and the text feature vector can be calculated directly, so as to judge whether the preset sensitive information exists in the image to be recognized and, in turn, whether the sensitive information has leaked. Compared with the process of first performing OCR on the text image and then calculating text-to-text similarity, this scheme omits the OCR step; calculating the similarity between two feature vectors is faster and more convenient than calculating the similarity between texts, occupies fewer computing resources, and is more efficient.
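The claimed steps can be sketched as follows. This is an illustrative outline only; the callable parameters (detect_paragraphs, crop_text_image, extract_features, similarity) are hypothetical stand-ins for the detection, segmentation, and embedding models described later, not names from the patent.

```python
def text_similarity_pipeline(image, sensitive_vec, detect_paragraphs,
                             crop_text_image, extract_features, similarity):
    """Sketch of steps S101-S105: detect text paragraphs, crop them into
    a text image, embed it, and compare against the sensitive-info vector."""
    paragraphs = detect_paragraphs(image)            # S102: any text paragraphs?
    if not paragraphs:
        return None                                  # no text, nothing to compare
    text_image = crop_text_image(image, paragraphs)  # S103: text-only image
    text_vec = extract_features(text_image)          # S104: text feature vector
    return similarity(sensitive_vec, text_vec)       # S105: similarity score
```

Note that OCR never appears in the pipeline: the text image is embedded directly, which is the source of the claimed efficiency gain.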
In another possible implementation manner, the determining whether at least one text paragraph exists in the image to be identified includes:
Extracting features of the image to be identified to obtain global features;
text paragraph detection is carried out on the global features, and detection results are obtained;
and judging whether the at least one text paragraph exists in the image to be identified or not based on the detection result.
By adopting this technical scheme, feature extraction on the image to be identified yields global features that characterize the overall condition of the image, including its spatial structure information. Performing text paragraph detection on these global features therefore allows a more accurate and rapid determination of whether at least one text paragraph exists in the image to be identified.
In another possible implementation manner, if the at least one text paragraph exists, determining a text image from the images to be identified, the method further includes:
obtaining a text box corresponding to each text paragraph, wherein the text box is obtained by detecting the text paragraph of the global feature;
performing rotating ROI alignment processing based on the text boxes corresponding to each text paragraph and the global features to obtain rotated text boxes corresponding to each text paragraph;
performing mask segmentation processing based on the rotated text boxes corresponding to each text paragraph to obtain a mask segmentation result;
Wherein determining a text image from the images to be identified comprises:
and cutting and splicing the images to be identified based on the mask segmentation result to obtain the text image.
By adopting this technical scheme, text boxes are used to mark text paragraphs during detection, making each paragraph clearer. Rotated ROI Align processing is applied to each text box to obtain a rotated text box that fits the pose of the text paragraph more closely and reduces noise inside the box. Mask segmentation is then performed based on the rotated text boxes to obtain mask segmentation results, which identify the text paragraph regions more accurately, retaining those regions while removing irrelevant regions such as noise. Cropping and stitching the image to be identified according to the mask segmentation results therefore yields the text image more accurately.
In another possible implementation manner, if the number of the text boxes is at least two, the clipping and stitching processing is performed based on the mask segmentation result and the image to be identified to obtain the text image, including:
Cutting the image to be identified based on each mask segmentation result to obtain text sub-images corresponding to each text paragraph;
and splicing the text sub-images corresponding to the text paragraphs respectively to obtain the text image.
By adopting this technical scheme, the image to be identified is cropped according to the mask segmentation results to obtain a text sub-image for each text paragraph, and the sub-images are stitched into a single text image containing all text paragraphs, which facilitates the subsequent similarity calculation.
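A minimal sketch of the crop-and-stitch step, assuming images and masks are represented as nested lists of grayscale pixel values (the patent does not specify a representation). Each sub-image is cropped to the bounding box of its mask's foreground, and sub-images are stacked vertically with zero padding:

```python
def crop_with_mask(image, mask):
    """Crop the image to the bounding box of the mask's 1-entries.
    Images and masks are nested lists (rows of pixel values)."""
    rows = [i for i, r in enumerate(mask) if any(r)]
    cols = [j for j in range(len(mask[0])) if any(r[j] for r in mask)]
    return [[image[i][j] for j in cols] for i in rows]

def stitch_vertically(sub_images):
    """Stack text sub-images top to bottom, zero-padding rows to a
    common width so the result is rectangular."""
    width = max(len(r) for img in sub_images for r in img)
    return [row + [0] * (width - len(row)) for img in sub_images for row in img]
```

A real implementation would crop the exact mask region rather than its bounding box; the bounding-box version is used here to keep the sketch short.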
In another possible implementation manner, each rotated text box corresponds to a rotation coefficient, and the stitching the text sub-images corresponding to the text paragraphs to obtain the text image includes:
based on the rotation coefficients corresponding to the rotated text boxes, reversely rotating the corresponding text sub-images to obtain reversely rotated text sub-images corresponding to the text sub-images;
and splicing the text sub-images after the reverse rotation to obtain the text image.
By adopting this technical scheme, each text sub-image is reversely rotated according to its rotation coefficient, so that tilted text paragraphs are adjusted to a horizontal state and all sub-images share a consistent pose. The text image stitched from the reversely rotated sub-images is therefore neater and more standard, which further facilitates the subsequent similarity calculation and improves the accuracy of the result.
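The reverse rotation can be illustrated on point coordinates. Assuming the rotation coefficient is an angle in radians (the patent does not define its exact form), undoing the tilt means rotating by the negative of that angle about the box center:

```python
import math

def reverse_rotate_points(points, angle_rad, cx=0.0, cy=0.0):
    """Rotate points by -angle_rad about (cx, cy), undoing the tilt
    recorded by the rotation coefficient of a rotated text box."""
    c, s = math.cos(-angle_rad), math.sin(-angle_rad)
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c) for x, y in points]
```

In practice the same transform would be applied to every pixel of the sub-image (e.g. via an affine warp) rather than to a point list.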
In another possible implementation manner, the dimension of the preset sensitive information feature vector is a preset dimension, and the extracting the feature vector from the text image to obtain the text feature vector includes:
and extracting semantic feature vectors of the text image according to the preset dimension to obtain text feature vectors.
By adopting this technical scheme, semantic feature vector extraction is performed on the text image and the result is converted into a feature vector of the preset dimension, so that the dimension of the obtained text feature vector is consistent with that of the preset sensitive information feature vector, improving the accuracy of the subsequent similarity calculation.
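One common way to map a raw semantic vector to a fixed preset dimension is a linear projection; the patent does not specify the mechanism, so the weight matrix below is a hypothetical placeholder (in practice it would be learned by the embedding model):

```python
def project_to_preset_dim(vec, weight):
    """Map a raw semantic vector to the preset dimension via a linear
    projection; weight has one row per output (preset) dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]
```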
In another possible implementation manner, the preset sensitive information feature vector corresponds to preset sensitive information, and the method further includes:
If the similarity reaches a preset similarity threshold value, determining a target text paragraph and a target position, wherein the target text paragraph is a text paragraph in which the preset sensitive information is located, and the target position is a position in which the preset sensitive information is located in the target text paragraph;
labeling the target text paragraph and the target position in the text image to obtain a labeled text image;
and outputting the annotated text image and the image to be identified.
By adopting this technical scheme, if the similarity reaches the preset similarity threshold, the preset sensitive information is present in the text image, so the target text paragraph and target position are annotated to make them more conspicuous, and the annotated text image and the image to be identified are output. A worker can thus more intuitively check the leakage of the preset sensitive information and its source.
In a second aspect, the present application provides a text similarity calculation device, which adopts the following technical scheme:
a text similarity calculation device comprising:
the image acquisition module is used for acquiring an image to be identified;
The judging module is used for judging whether at least one text paragraph exists in the image to be identified;
an image determining module, configured to determine, when the at least one text paragraph exists, a text image from the images to be identified, where the text image includes the at least one text paragraph;
the vector extraction module is used for extracting the feature vector of the text image to obtain a text feature vector;
and the similarity calculation module is used for calculating the similarity between the preset sensitive information feature vector and the text feature vector.
By adopting this technical scheme, after the image acquisition module acquires the image to be identified, if the judging module determines that at least one text paragraph exists in it, the image determining module determines the corresponding text image from the image to be identified, and the vector extraction module extracts a feature vector from the text image to obtain a text feature vector. Because the preset sensitive information is likewise represented as a feature vector, the similarity calculation module directly calculates the similarity between the preset sensitive information feature vector and the text feature vector, so as to judge whether the preset sensitive information exists in the image to be identified and, in turn, whether the sensitive information has leaked. Compared with first performing OCR on the text image and then calculating similarity, this omits the OCR step, occupies fewer computing resources, and is more efficient.
In another possible implementation manner, the determining module is specifically configured to, when determining whether there is at least one text paragraph in the image to be identified:
extracting features of the image to be identified to obtain global features;
text paragraph detection is carried out on the global features, and detection results are obtained;
and judging whether the at least one text paragraph exists in the image to be identified or not based on the detection result.
In another possible implementation manner, if the at least one text paragraph exists, the apparatus further includes:
the text box acquisition module is used for acquiring a text box corresponding to each text paragraph, and the text box is obtained by detecting the text paragraph of the global feature;
the rotating ROI alignment processing module is used for carrying out rotating ROI alignment processing on the basis of the text boxes corresponding to each text paragraph and the global features to obtain rotated text boxes corresponding to each text paragraph;
the mask segmentation processing module is used for performing mask segmentation processing on the basis of the rotated text boxes corresponding to each text paragraph to obtain a mask segmentation result;
the image determining module is specifically configured to, when determining a text image from the images to be identified:
And cutting and splicing the images to be identified based on the mask segmentation result to obtain the text image.
In another possible implementation manner, if the number of the text boxes is at least two, the image determining module is specifically configured to, when performing clipping and stitching processing based on the mask segmentation result and the image to be identified to obtain the text image:
cutting the image to be identified based on each mask segmentation result to obtain text sub-images corresponding to each text paragraph;
and splicing the text sub-images corresponding to the text paragraphs respectively to obtain the text image.
In another possible implementation manner, each rotated text box corresponds to a rotation coefficient, and the image determining module is specifically configured to, when the text sub-images corresponding to the respective text paragraphs are spliced to obtain the text image:
based on the rotation coefficients corresponding to the rotated text boxes, reversely rotating the corresponding text sub-images to obtain reversely rotated text sub-images corresponding to the text sub-images;
and splicing the text sub-images after the reverse rotation to obtain the text image.
In another possible implementation manner, the dimension of the feature vector of the preset sensitive information is a preset dimension, and the vector extraction module is specifically configured to, when extracting the feature vector of the text image to obtain the text feature vector:
and extracting semantic feature vectors of the text image according to the preset dimension to obtain text feature vectors.
In another possible implementation manner, the preset sensitive information feature vector corresponds to preset sensitive information, and the apparatus further includes:
the paragraph and position determining module is used for determining a target text paragraph and a target position when the similarity reaches a preset similarity threshold, wherein the target text paragraph is a text paragraph in which the preset sensitive information is located, and the target position is a position in which the preset sensitive information is located in the target text paragraph;
the labeling module is used for labeling the target text paragraph and the target position in the text image to obtain a labeled text image;
and the image output module is used for outputting the annotated text image and the image to be identified.
In a third aspect, the present application provides an electronic device, which adopts the following technical scheme:
An electronic device, the electronic device comprising:
at least one processor;
a memory;
at least one application, wherein the at least one application is stored in the memory and configured to be executed by the at least one processor to perform the text similarity calculation method according to any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
a computer-readable storage medium, which when executed in a computer, causes the computer to perform a text similarity calculation method according to any one of the first aspects.
In summary, the present application includes at least one of the following beneficial technical effects:
1. After the image to be recognized is obtained, if at least one text paragraph exists in it, a text image corresponding to the text paragraph(s) is determined from the image to be recognized, a feature vector is extracted from the text image to obtain a text feature vector, and the similarity between the preset sensitive information feature vector and the text feature vector is calculated. Because the preset sensitive information is represented as a feature vector, the similarity can be calculated directly, so as to judge whether the preset sensitive information exists in the image to be recognized and whether it has leaked. Compared with first performing OCR on the text image and then calculating text-to-text similarity, this omits the OCR step; comparing feature vectors is faster and more convenient, occupies fewer computing resources, and is more efficient;
2. Text paragraphs are marked with text boxes during detection, making each paragraph clearer. Rotated ROI alignment processing is applied to the text boxes to obtain rotated text boxes that fit the pose of the text paragraphs more closely and reduce noise inside the boxes. Mask segmentation is then performed based on the rotated text boxes; the resulting mask segmentation results represent the text paragraph regions more accurately, retaining those regions while removing irrelevant regions such as noise. Cropping and stitching the image to be identified according to the mask segmentation results therefore yields the text image more accurately.
Drawings
Fig. 1 is a flow chart of a text similarity calculation method according to an embodiment of the present application.
Fig. 2 is an exemplary diagram of a rotational ROI alignment process in an embodiment of the present application.
FIG. 3 is an exemplary diagram of a mask segmentation process in an embodiment of the present application.
Fig. 4 is an overall flowchart of a text similarity calculation method in an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a text similarity calculating device according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 1. a text paragraph; 2. a text box; 3. noise; 4. a rotated text box; 5. mask segmentation results; 61. detecting branches; 62. splitting branches; 63. a similarity calculation branch; 70. a text similarity calculating device; 701. an image acquisition module; 702. a judging module; 703. an image determining module; 704. a vector extraction module; 705. a similarity calculation module; 80. an electronic device; 801. a processor; 802. a bus; 803. a memory; 804. a transceiver.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
Those skilled in the art may, after reading this specification, make modifications to the embodiments that do not involve a creative contribution, but such modifications are protected by patent law only within the scope of the claims of this application.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, unless otherwise specified, the term "/" generally indicates that the associated object is an "or" relationship.
Embodiments of the present application are described in further detail below with reference to the drawings attached hereto.
The embodiment of the application provides a text similarity calculation method, which is executed by an electronic device. The electronic device may be a server or a terminal device; the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, tablet computer, notebook computer, or desktop computer, and the terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited herein. As shown in fig. 1, the method includes step S101, step S102, step S103, step S104, and step S105, wherein,
Step S101, an image to be identified is acquired.
Specifically, the number of images to be identified may be one, two, or more. In this embodiment, for convenience of description, the number of images to be identified is taken to be one as an example. The image to be identified can be input into the electronic device by a worker through a storage device such as a USB flash drive or hard disk; the electronic device can also acquire it from a cloud server or from the Internet. Whether sensitive information is recorded in the image to be identified is then judged, so as to determine whether the sensitive information has leaked.
Step S102, judging whether at least one text paragraph exists in the image to be recognized.
For the embodiment of the application, text is usually recorded in an image in the form of paragraphs. Therefore, judging whether at least one text paragraph exists in the image to be recognized amounts to judging whether characters exist in it, which in turn determines whether the subsequent steps of checking the image for sensitive information need to be performed.
Step S103, if at least one text paragraph exists, determining a text image from the images to be identified.
Wherein the text image includes at least one text paragraph.
The electronic device determines that at least one text paragraph exists in the image to be identified, i.e., that text is present in image form. To facilitate the subsequent determination of whether sensitive information is recorded in the image to be identified, a text image containing the at least one text paragraph is determined from it; that is, the text image contains only text.
And step S104, extracting the feature vector of the text image to obtain the text feature vector.
Specifically, after determining the text image, the electronic device may input it into a related model to extract the feature vector, thereby obtaining the text feature vector. The related model may be a Transformer network model, which performs semantic feature vector extraction on the text image to obtain the text feature vector. In other embodiments, the related model may be another model capable of extracting feature vectors, which is not limited herein.
In step S105, the similarity between the preset sensitive information feature vector and the text feature vector is calculated.
For the embodiment of the application, the preset sensitive information feature vector is obtained by converting the preset sensitive information into a vector. The preset sensitive information can be input into the electronic device by a worker through an input device such as a keyboard or touch screen, and may be a word or a sentence. The electronic device then converts the preset sensitive information into a feature vector, i.e., the preset sensitive information feature vector. By calculating the similarity between the preset sensitive information feature vector and the text feature vector, the electronic device can judge whether the preset sensitive information is recorded in the image to be identified and, thereby, whether it has leaked.
Specifically, the electronic device may calculate a cosine distance value between the preset sensitive information feature vector and the text feature vector and characterize the similarity by this value. The electronic device may store a preset cosine distance threshold. After obtaining the cosine distance value, it judges whether the value reaches the threshold; if so, the similarity between the two vectors is high, so it can be determined that the preset sensitive information is recorded in the image to be identified and has leaked. If not, the similarity is low, so it is determined that the preset sensitive information is not recorded in the image to be identified and has not leaked.
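The comparison step can be sketched as follows; the threshold value 0.9 is an illustrative assumption, not a value from the patent:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 1.0 means
    identical direction, 0.0 means orthogonal (no similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def contains_sensitive_info(text_vec, sensitive_vec, threshold=0.9):
    """Decision rule: similarity at or above the preset threshold is
    treated as the sensitive information being present in the image."""
    return cosine_similarity(text_vec, sensitive_vec) >= threshold
```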
In order to more accurately determine whether there is a text paragraph in the image to be recognized, step S102 of determining whether there is at least one text paragraph in the image to be recognized specifically includes: step S1021 (not shown), step S1022 (not shown), and step S1023 (not shown), wherein,
step S1021, extracting features of the image to be identified to obtain global features.
Specifically, the electronic device may perform preprocessing such as a normalization operation, i.e., subtracting the mean of the pixel values from each pixel value in the image to be identified and dividing by the variance of the pixel values, so as to obtain an image of the specified type that is convenient for subsequent feature extraction. The electronic device may then input the preprocessed image into a network model for feature extraction to obtain the global features. The network model may be a trained convolutional neural network (CNN), or another network model capable of feature extraction, which is not limited herein. The global features characterize the overall condition of the image to be identified, such as its spatial structure information, facilitating subsequent operations. In other embodiments, the CNN model may be combined with a feature pyramid network (FPN) to improve the feature extraction effect.
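The normalization described above can be sketched on a flat list of pixel values. Note that many pipelines divide by the standard deviation rather than the variance; the version below follows the text as written:

```python
def normalize_pixels(pixels):
    """Normalize pixel values as described: subtract the mean of the
    pixel values, then divide by their variance. (Common practice uses
    the standard deviation; the text specifies the variance.)"""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    if var == 0:
        return [0.0] * n  # constant image: nothing to normalize
    return [(p - mean) / var for p in pixels]
```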
Step S1022, text paragraph detection is carried out on the global features, and detection results are obtained.
For the embodiments of the present application, the electronic device may input the global features into a Region Proposal Network (RPN) for text paragraph detection. If a text paragraph is detected, the electronic device uses the text detector to classify the regions of the image to be identified into those where a text paragraph is present and those where it is not; for example, a region corresponding to a text paragraph is labeled "1", and a region other than a text paragraph is labeled "0". According to these labels, the electronic device ignores the regions without a text paragraph, that is, such regions do not undergo subsequent operations such as text segmentation and similarity calculation.
Step S1023, judging whether at least one text paragraph exists in the image to be recognized based on the detection result.
For the embodiment of the present application, continuing the example of step S1022, the electronic device determines whether the label "1" exists. If "1" exists, at least one text paragraph exists in the image to be recognized; if "1" does not exist, no text paragraph exists, that is, there is no text in the image to be recognized.
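The "1"/"0" judgment and the subsequent filtering of non-text regions can be sketched as follows (illustrative only; the label scheme follows step S1022):

```python
def has_text_paragraph(region_labels):
    # region_labels: labels from the text detector in step S1022,
    # 1 for a region containing a text paragraph, 0 otherwise.
    return 1 in region_labels

def text_regions(region_labels, regions):
    # Keep only the regions labeled 1; regions labeled 0 are ignored and
    # take no part in text segmentation or similarity calculation.
    return [r for r, lab in zip(regions, region_labels) if lab == 1]
```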
If at least one text passage exists, the method includes step Sa (not shown), step Sb (not shown), and step Sc (not shown), wherein,
step Sa, obtaining a text box corresponding to each text paragraph.
The text box is obtained by detecting text paragraphs of the global features.
For the embodiment of the present application, continuing the example of step S1022, if the RPN detects that a text paragraph exists in the image to be identified, a rectangular box is used to frame the text paragraph. Specifically, the center of the rectangular box is the center of the text paragraph. A coordinate system is established at this center, and the rectangular box is drawn according to the extension length w of the text paragraph along the x-axis of the coordinate system and its extension width h along the vertical y-axis, so that the whole text paragraph is completely framed and the coordinates of the four vertices of the rectangular box can be obtained from the coordinate system. By acquiring the coordinates corresponding to the four vertices of each text box, the electronic device obtains the text box of each text paragraph, facilitating subsequent operations on the text boxes.
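The four vertex coordinates follow directly from the center (cx, cy), length w, and width h described above, as this sketch shows:

```python
def text_box_vertices(cx, cy, w, h):
    # Rectangle centred on the text paragraph centre (cx, cy), extending
    # length w along the x-axis and width h along the y-axis; returns the
    # four vertex coordinates (top-left, top-right, bottom-right, bottom-left).
    hw, hh = w / 2.0, h / 2.0
    return [(cx - hw, cy - hh), (cx + hw, cy - hh),
            (cx + hw, cy + hh), (cx - hw, cy + hh)]
```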
And step Sb, performing rotating ROI alignment processing based on the text boxes corresponding to each text paragraph and the global features to obtain the rotated text boxes corresponding to each text paragraph.
Specifically, referring to fig. 2, as shown in (a) of fig. 2, text paragraph 1 may not appear in a standard horizontal writing form in the image to be recognized, but may lie at a certain angle to the horizontal, i.e., in an inclined state. The area selected by text box 2, obtained from text paragraph detection, therefore contains noise 3 and other irrelevant information in addition to text paragraph 1. According to the position of text box 2, the local features corresponding to its selected area can be determined from the global features. These local features represent relevant information about text paragraph 1, such as its pose, so text box 2 is subjected to rotated ROI alignment according to the local features, yielding the rotated text box 4 shown in fig. 2 (b). The rotated text box 4 fits the pose of text paragraph 1 more closely, reducing the noise 3 and other irrelevant information of the surrounding area.
And step Sc, performing mask segmentation processing based on the rotated text boxes corresponding to each text paragraph to obtain a mask segmentation result.
For the embodiment of the present application, referring to fig. 3, after the rotated text box 4 is obtained as shown in fig. 3 (c), the area where text paragraph 1 is located is determined more accurately. The electronic device then performs mask segmentation on the local features within the rotated text box 4 through the segmentation detector, obtaining the mask segmentation result 5 shown in fig. 3 (d). The mask segmentation result 5 is a mask image segmented along the edge of text paragraph 1.
Therefore, in step S103, a text image is determined from the images to be identified, which specifically includes clipping and stitching based on the mask segmentation result and the images to be identified, so as to obtain the text image.
Specifically, after the electronic device obtains the mask segmentation result, clipping and splicing are performed along the edge of the mask segmentation result from the image to be identified, so that the image corresponding to the text paragraph can be clipped under the condition of reducing noise and other irrelevant information, and further the text image is obtained.
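The clipping along the mask edge can be sketched as follows, assuming the image and the mask are arrays of the same shape (an illustrative sketch, not the claimed implementation):

```python
import numpy as np

def crop_along_mask(image, mask):
    # Crop the image to the bounding box of the mask segmentation result
    # and zero out pixels outside the paragraph edge, so that noise and
    # other irrelevant information are reduced.
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    cropped = image[y0:y1, x0:x1].copy()
    cropped[mask[y0:y1, x0:x1] == 0] = 0
    return cropped
```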
If the number of the text boxes is at least two, cutting and splicing processing is carried out based on the mask segmentation result and the image to be identified to obtain a text image, and the method specifically comprises the following steps:
cutting the image to be identified based on the mask segmentation results to obtain text sub-images corresponding to the text paragraphs.
And splicing the text sub-images corresponding to the text paragraphs to obtain text images.
For the embodiment of the application, if the number of text boxes is at least two, two or more text paragraphs exist in the image to be identified, and the electronic device cuts the text sub-image corresponding to each text paragraph from the image to be identified according to the respective mask segmentation result. After obtaining the text sub-images corresponding to each text paragraph, the electronic device splices all the text sub-images into a whole text image.
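One possible splicing layout is sketched below: pad each text sub-image on the right to a common width and stack them vertically. The layout is an assumption for illustration; the application does not prescribe a particular arrangement:

```python
import numpy as np

def stitch_sub_images(sub_images):
    # Pad each text sub-image to the widest width, then stack the padded
    # sub-images vertically into a single whole text image.
    width = max(img.shape[1] for img in sub_images)
    padded = [np.pad(img, ((0, 0), (0, width - img.shape[1])))
              for img in sub_images]
    return np.vstack(padded)
```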
In order to obtain more standard and tidy text images, each rotated text box is correspondingly provided with a rotation coefficient, and text sub-images corresponding to each text paragraph are spliced to obtain the text images, specifically comprising the following steps:
and reversely rotating the corresponding text sub-images based on the rotation coefficients corresponding to the rotated text boxes respectively to obtain reversely rotated text sub-images corresponding to the text sub-images respectively.
And splicing the text sub-images after the reverse rotation to obtain the text image.
Specifically, the rotation coefficient describes the rotation from the original (pre-rotation) text box to the rotated text box, the rotation being clockwise or anticlockwise about an axis passing vertically through the center point of the text box plane. For example, if the rotation coefficient of a rotated text box A is an anticlockwise rotation of 30°, the rotated text box A was obtained by rotating the corresponding original text box 30° anticlockwise about its center point; that is, the text paragraph in text box A forms an angle of 30° with the horizontal. After cutting out the text sub-image corresponding to rotated text box A, the electronic device rotates it in reverse by 30° about the center point of rotated text box A according to the rotation coefficient, i.e., rotates it 30° clockwise, so that the text sub-image and the text within it are converted to a horizontal state. It should be understood that if the text paragraph in the original text box is already horizontal, that is, the original text box is horizontal, the rotation coefficient of the corresponding rotated text box is 0.
The electronic device reversely rotates each text sub-image according to its corresponding rotation coefficient, so that every text sub-image and text paragraph is in a horizontal state. The electronic device then splices the reversely rotated text sub-images, obtaining a text image recording all text paragraphs. With all text sub-images and text paragraphs horizontal, the spliced text image is more standard and the subsequent similarity calculation result is more accurate.
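The reverse rotation can be sketched at the level of a single coordinate: a point of the rotated box is rotated about the box center by the opposite of the rotation coefficient. A full image would apply the same transform to every pixel (e.g. via an affine warp in an image library); only the coordinate math is shown here:

```python
import math

def reverse_rotate_point(x, y, cx, cy, rotation_coefficient_deg):
    # Undo the rotation coefficient: rotate (x, y) about the centre
    # (cx, cy) by the opposite angle (anticlockwise positive), returning
    # the point of the text sub-image to its horizontal state.
    theta = math.radians(-rotation_coefficient_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))
```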
In order to improve accuracy of similarity calculation, the dimension of the preset sensitive information feature vector is a preset dimension, and in step S104, feature vector extraction is performed on the text image to obtain a text feature vector, which specifically includes extracting semantic feature vectors of the text image according to the preset dimension to obtain the text feature vector.
Specifically, the electronic device may convert the preset sensitive information into a feature vector of a preset dimension in a Word Embedding (Word Embedding) manner, that is, map-embed the preset sensitive information into a numerical vector space, so as to obtain the feature vector of the preset dimension corresponding to the preset sensitive information. Assume that the size of the preset dimension is 512.
Similarly, after obtaining the text image, the electronic device may preprocess it in the manner of step S1021 to obtain a text image of the specified type. The electronic device may input the preprocessed text image into a Transformer model for semantic feature vector extraction, then pool, linearly map and normalize the extraction result to convert it into a 512-dimensional feature vector, so that the dimension of the obtained text feature vector is consistent with that of the preset sensitive information feature vector, making the calculation result of the subsequent similarity calculation more accurate.
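The pooling, linear mapping and normalization described above can be sketched as follows; the 64-token/64-dimension input shape and the random projection matrix are assumptions standing in for the Transformer output and the learned mapping:

```python
import numpy as np

def to_text_feature(token_features, projection):
    # token_features: (num_tokens, model_dim) semantic features from the
    # Transformer; projection: (model_dim, 512) linear mapping (assumed
    # learned). Pool, linearly map, then L2-normalize, so the text feature
    # vector matches the 512-dimensional sensitive-information vector.
    pooled = token_features.mean(axis=0)   # pooling over tokens
    mapped = pooled @ projection           # linear mapping to 512 dims
    return mapped / np.linalg.norm(mapped) # normalization to unit length
```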
In order for staff to learn in time that the preset sensitive information has leaked, the preset sensitive information feature vector corresponds to preset sensitive information, and the method further includes step S1 (not shown in the figure), step S2 (not shown in the figure), and step S3 (not shown in the figure), wherein step S1 may be performed after step S105, wherein,
step S1, if the similarity reaches a preset similarity threshold, determining a target text paragraph and a target position.
The target text paragraph is a text paragraph in which preset sensitive information is located, and the target position is a position in which the preset sensitive information is located in the target text paragraph.
In the embodiment of the present application, if the similarity is represented by a cosine distance value, the preset similarity threshold may be represented by a preset cosine distance threshold. When the similarity reaches the preset similarity threshold, it can be determined that the preset sensitive information is recorded in the text image, that is, the preset sensitive information has leaked. Further, the electronic device can scan the text feature vector with a sliding window to obtain a plurality of local feature vectors in sequence, and sequentially calculate the cosine distance value between the preset sensitive information feature vector and each local feature vector, thereby determining the text paragraph in which the preset sensitive information is located and its position within that paragraph.
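The sliding-window localization can be sketched as follows, assuming the window scan has already produced a sequence of local feature vectors; the 0.8 threshold is an assumed example value:

```python
import numpy as np

def locate_sensitive(local_vectors, sensitive_vec, threshold=0.8):
    # Scan the local feature vectors obtained by the sliding window and
    # return the window indices whose cosine similarity to the preset
    # sensitive information feature vector reaches the threshold.
    s = np.asarray(sensitive_vec, dtype=float)
    s = s / np.linalg.norm(s)
    hits = []
    for i, v in enumerate(local_vectors):
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)
        if float(np.dot(v, s)) >= threshold:
            hits.append(i)
    return hits
```

The returned indices map back to positions in the text paragraph, giving the target position of step S1.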
And S2, marking the target text paragraph and the target position in the text image to obtain a marked text image.
For the embodiments of the present application, after the electronic device determines the target text paragraph and the target position, different modes can be used to mark them, so that the target text paragraph and the target position are distinguished and made more obvious.
And step S3, outputting the annotated text image and the image to be identified.
Specifically, the electronic device can control a display device such as a display screen to output the marked text image and the image to be identified, so that staff can more intuitively check the leakage of preset sensitive information and the source of the leak.
Referring to fig. 4, an embodiment of the present application provides a text similarity calculation method, after an image to be identified is obtained, the image to be identified is preprocessed, global features of the image to be identified are extracted, and whether at least one text paragraph exists in the image to be identified is determined according to the global features through a detection branch.
Referring to fig. 4, specifically, the detection branch 61 performs text paragraph detection on the global features. If at least one text paragraph exists, it frames each text paragraph with a text box, classifies the regions of the image to be identified into text paragraph regions and non-text regions through the text detector in the detection branch, determines the coordinates of the four vertices and the center position of the text box corresponding to each text paragraph, and characterizes the text box by these coordinates.
Referring to fig. 4, after passing the detection branch 61, the electronic device ignores the image to be recognized in which no text paragraph exists, and the region in the image to be recognized in which no text paragraph exists, and reserves a text box in which the text paragraph exists. In the segmentation branch 62, the text box is rotated ROI alignment according to the global feature and the coordinates, so as to obtain a rotated text box, and mask segmentation processing is performed on the local feature in the rotated text box through the segmentation detector according to the rotated text box, so as to obtain a mask segmentation result.
Referring to fig. 4, after obtaining the mask segmentation result, the image to be processed is cut and spliced in the similarity calculation branch 63 based on the mask segmentation result to obtain a text image, and the text image is subjected to semantic feature vector extraction according to a preset dimension to obtain a text feature vector. And the electronic equipment performs feature vector conversion on the preset sensitive information according to the preset dimension to obtain a preset sensitive information feature vector. And then calculating the similarity between the preset sensitive information feature vector and the text feature vector.
The above embodiments describe a text similarity calculation method from the viewpoint of the method flow; the following embodiments describe a text similarity calculation device from the viewpoint of virtual modules or virtual units.
The embodiment of the present application provides a text similarity calculating device 70, as shown in fig. 5, the text similarity calculating device 70 may specifically include:
an image acquisition module 701, configured to acquire an image to be identified;
a judging module 702, configured to judge whether at least one text paragraph exists in the image to be identified;
an image determining module 703, configured to determine, when at least one text paragraph exists, a text image from the images to be identified, where the text image includes the at least one text paragraph;
The vector extraction module 704 is configured to extract a feature vector from the text image to obtain a text feature vector;
the similarity calculating module 705 is configured to calculate a similarity between a preset sensitive information feature vector and a text feature vector.
The embodiment of the present application provides a text similarity calculating device 70. After the image acquisition module 701 obtains an image to be identified, if the judging module 702 judges that at least one text paragraph exists in the image to be identified, the image determining module 703 determines a text image corresponding to the at least one text paragraph from the image to be identified, and the vector extraction module 704 extracts feature vectors from the text image to obtain a text feature vector. Since the preset sensitive information is also represented by a feature vector, the similarity calculating module 705 directly calculates the similarity between the preset sensitive information feature vector and the text feature vector, thereby judging whether the preset sensitive information exists in the image to be identified and, further, whether the sensitive information has been compromised. Extracting the text feature vector from the text image and calculating its similarity to the preset sensitive information feature vector omits the OCR recognition process, compared with first performing OCR recognition on the text image and then calculating text similarity; calculating the similarity between feature vectors is also more convenient and faster than calculating the similarity between texts, occupies fewer computing resources, and is more efficient.
In one possible implementation manner of this embodiment of the present application, when determining whether there is at least one text paragraph in the image to be identified, the determining module 702 is specifically configured to:
extracting features of the image to be identified to obtain global features;
text paragraph detection is carried out on the global features, and a detection result is obtained;
and judging whether at least one text paragraph exists in the image to be identified based on the detection result.
In one possible implementation manner of the embodiment of the present application, if at least one text paragraph exists, the apparatus 70 further includes:
the text box acquisition module is used for acquiring a text box corresponding to each text paragraph, wherein the text box is obtained by detecting the text paragraph of the global feature;
the rotating ROI alignment processing module is used for carrying out rotating ROI alignment processing on the basis of the text boxes corresponding to each text paragraph and the global features to obtain rotated text boxes corresponding to each text paragraph;
the mask segmentation processing module is used for performing mask segmentation processing on the basis of the rotated text boxes corresponding to each text paragraph to obtain a mask segmentation result;
wherein, the image determining module 703 is specifically configured to, when determining a text image from the images to be identified:
and cutting and splicing the images to be identified based on the mask segmentation result to obtain text images.
In one possible implementation manner of this embodiment of the present application, if the number of text boxes is at least two, the image determining module 703 is specifically configured to, when performing clipping and stitching processing based on the mask segmentation result and the image to be identified to obtain a text image:
cutting the image to be identified based on the mask segmentation results to obtain text sub-images corresponding to the text paragraphs;
and splicing the text sub-images corresponding to the text paragraphs to obtain text images.
In one possible implementation manner of this embodiment of the present application, each rotated text box corresponds to a rotation coefficient, and when the image determining module 703 is configured to splice text sub-images corresponding to each text paragraph to obtain a text image, the method is specifically configured to:
based on the rotation coefficients corresponding to the rotated text boxes, reversely rotating the corresponding text sub-images to obtain reversely rotated text sub-images corresponding to the text sub-images;
and splicing the text sub-images after the reverse rotation to obtain the text image.
In one possible implementation manner of the embodiment of the present application, the dimension of the preset sensitive information feature vector is a preset dimension, and the vector extraction module 704 is specifically configured to:
And extracting semantic feature vectors of the text image according to preset dimensions to obtain text feature vectors.
In one possible implementation manner of this embodiment, the preset sensitive information feature vector corresponds to preset sensitive information, and the apparatus 70 further includes:
the paragraph and position determining module is used for determining a target text paragraph and a target position when the similarity reaches a preset similarity threshold value, wherein the target text paragraph is a text paragraph where preset sensitive information is located, and the target position is a position where the preset sensitive information is located in the target text paragraph;
the labeling module is used for labeling the target text paragraph and the target position in the text image to obtain a labeled text image;
and the image output module is used for outputting the annotated text image and the image to be identified.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the text similarity calculating device 70 described above may refer to the corresponding process in the foregoing method embodiment, and will not be described herein again.
In an embodiment of the present application, as shown in fig. 6, an electronic device 80 shown in fig. 6 includes: a processor 801 and a memory 803. The processor 801 is coupled to a memory 803, such as via a bus 802. Optionally, the electronic device 80 may also include a transceiver 804. It should be noted that, in practical applications, the transceiver 804 is not limited to one, and the structure of the electronic device 80 is not limited to the embodiments of the present application.
The processor 801 may be a CPU (Central Processing Unit), general-purpose processor, DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), FPGA (Field Programmable Gate Array) or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 801 may also be a combination implementing computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 802 may include a path to transfer information between the aforementioned components. Bus 802 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. Bus 802 may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean there is only one bus or one type of bus.
The memory 803 may be, but is not limited to, ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory), CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 803 is used for storing the application program code for executing the solution of the present application, and execution is controlled by the processor 801. The processor 801 is configured to execute the application code stored in the memory 803 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. But may also be a server or the like. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
The present application provides a computer readable storage medium having a computer program stored thereon which, when run on a computer, causes the computer to perform the corresponding method embodiments described above. Compared with the related art, after the image to be identified is obtained, if at least one text paragraph exists in it, the text image corresponding to the at least one text paragraph is determined from the image to be identified, and feature vector extraction is performed on the text image to obtain a text feature vector. Since the preset sensitive information is also represented by a feature vector, the similarity between the preset sensitive information feature vector and the text feature vector is calculated directly, so as to judge whether the preset sensitive information exists in the image to be identified and, further, whether the sensitive information has been compromised. Extracting the text feature vector from the text image and calculating its similarity to the preset sensitive information feature vector omits the OCR recognition process, compared with first performing OCR recognition on the text image and then calculating text similarity; calculating the similarity between feature vectors is also more convenient and faster than calculating the similarity between texts, occupies fewer computing resources, and is more efficient.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A text similarity calculation method, comprising:
acquiring an image to be identified;
Judging whether at least one text paragraph exists in the image to be identified;
if the at least one text paragraph exists, determining a text image from the image to be identified, wherein the text image comprises the at least one text paragraph;
extracting feature vectors of the text image to obtain text feature vectors;
and calculating the similarity between preset sensitive information feature vectors and the text feature vectors.
2. The method for calculating text similarity according to claim 1, wherein the determining whether at least one text paragraph exists in the image to be recognized comprises:
extracting features of the image to be identified to obtain global features;
text paragraph detection is carried out on the global features, and detection results are obtained;
and judging whether the at least one text paragraph exists in the image to be identified or not based on the detection result.
3. The text similarity calculation method according to claim 2, wherein if there is said at least one text passage, the method further comprises:
obtaining a text box corresponding to each text paragraph, wherein the text box is obtained by detecting the text paragraph of the global feature;
Performing rotating ROI alignment processing based on the text boxes corresponding to each text paragraph and the global features to obtain rotated text boxes corresponding to each text paragraph;
performing mask segmentation processing based on the rotated text boxes corresponding to each text paragraph to obtain a mask segmentation result;
wherein determining a text image from the images to be identified comprises:
and cutting and splicing the images to be identified based on the mask segmentation result to obtain the text image.
4. The text similarity calculating method according to claim 3, wherein if the number of text boxes is at least two, the clipping and stitching process is performed based on the mask segmentation result and the image to be identified to obtain the text image, including:
cutting the image to be identified based on each mask segmentation result to obtain text sub-images corresponding to each text paragraph;
and splicing the text sub-images corresponding to the text paragraphs respectively to obtain the text image.
5. The method for calculating text similarity according to claim 4, wherein each rotated text box corresponds to a rotation coefficient, and the stitching the text sub-images corresponding to the text paragraphs to obtain the text image includes:
Based on the rotation coefficients corresponding to the rotated text boxes, reversely rotating the corresponding text sub-images to obtain reversely rotated text sub-images corresponding to the text sub-images;
and splicing the text sub-images after the reverse rotation to obtain the text image.
6. The method for calculating text similarity according to claim 1, wherein the dimension of the feature vector of the preset sensitive information is a preset dimension, and the feature vector extraction is performed on the text image to obtain a text feature vector, which includes:
and extracting semantic feature vectors of the text image according to the preset dimension to obtain text feature vectors.
7. The text similarity calculation method according to claim 1, wherein the preset sensitive information feature vector corresponds to preset sensitive information, and the method further comprises:
if the similarity reaches a preset similarity threshold value, determining a target text paragraph and a target position, wherein the target text paragraph is a text paragraph in which the preset sensitive information is located, and the target position is a position in which the preset sensitive information is located in the target text paragraph;
Labeling the target text paragraph and the target position in the text image to obtain a labeled text image;
and outputting the annotated text image and the image to be identified.
8. A text similarity calculation device, comprising:
an image acquisition module configured to acquire an image to be identified;
a judging module configured to judge whether at least one text paragraph exists in the image to be identified;
an image determination module configured to determine, when the at least one text paragraph exists, a text image from the image to be identified, wherein the text image includes the at least one text paragraph;
a vector extraction module configured to perform feature vector extraction on the text image to obtain a text feature vector;
and a similarity calculation module configured to calculate the similarity between a preset sensitive-information feature vector and the text feature vector.
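The module layout of claim 8 can be mirrored as a thin pipeline class. The detector and encoder are injected stand-ins (the patent does not prescribe concrete models), and the similarity module is realized here as a cosine similarity:

```python
import numpy as np

class TextSimilarityDevice:
    """Sketch of the claim-8 modules as one pipeline (illustrative only)."""

    def __init__(self, detect_paragraphs, extract_vector, sensitive_vec):
        self.detect_paragraphs = detect_paragraphs   # judging module
        self.extract_vector = extract_vector         # vector extraction module
        self.sensitive_vec = np.asarray(sensitive_vec, dtype=float)

    def run(self, image):
        """Return the similarity score, or None if no text paragraph exists."""
        paragraphs = self.detect_paragraphs(image)
        if not paragraphs:                           # judging module: no text
            return None
        vec = np.asarray(self.extract_vector(image, paragraphs), dtype=float)
        denom = np.linalg.norm(vec) * np.linalg.norm(self.sensitive_vec)
        return float(vec @ self.sensitive_vec / denom) if denom else 0.0
```

Keeping the modules injectable matches the device claim's decomposition while leaving each module's internals open.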
9. An electronic device, comprising:
at least one processor;
a memory;
at least one application program, wherein the at least one application program is stored in the memory and configured to be executed by the at least one processor to perform the text similarity calculation method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed in a computer, causes the computer to perform the text similarity calculation method according to any one of claims 1 to 7.
CN202310575147.2A 2023-05-19 2023-05-19 Text similarity calculation method and device, electronic equipment and storage medium Pending CN116543397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310575147.2A CN116543397A (en) 2023-05-19 2023-05-19 Text similarity calculation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116543397A true CN116543397A (en) 2023-08-04

Family

ID=87443394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310575147.2A Pending CN116543397A (en) 2023-05-19 2023-05-19 Text similarity calculation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116543397A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173161A (en) * 2023-10-30 2023-12-05 杭州海康威视数字技术股份有限公司 Content security detection method, device, equipment and system
CN117173161B (en) * 2023-10-30 2024-02-23 杭州海康威视数字技术股份有限公司 Content security detection method, device, equipment and system

Similar Documents

Publication Publication Date Title
CN109117848B (en) Text line character recognition method, device, medium and electronic equipment
CN107545241B (en) Neural network model training and living body detection method, device and storage medium
US20220215557A1 (en) Edge detection method and device, electronic equipment, and computer-readable storage medium
CN111291661B (en) Method and equipment for identifying text content of icon in screen
CN112016638B (en) Method, device and equipment for identifying steel bar cluster and storage medium
US11861919B2 (en) Text recognition method and device, and electronic device
CN111488873B (en) Character level scene text detection method and device based on weak supervision learning
CN111709414A (en) AR device, character recognition method and device thereof, and computer-readable storage medium
CN116543397A (en) Text similarity calculation method and device, electronic equipment and storage medium
US9342753B2 (en) Image processing apparatus, image processing method, and computer program product
CN110728193B (en) Method and device for detecting richness characteristics of face image
US20200081452A1 (en) Map processing device, map processing method, and computer readable medium
CN113129298B (en) Method for identifying definition of text image
US9830530B2 (en) High speed searching method for large-scale image databases
CN112102404B (en) Object detection tracking method and device and head-mounted display equipment
WO2015080723A1 (en) Determine the shape of a representation of an object
CN113762455A (en) Detection model training method, single character detection method, device, equipment and medium
US20230036812A1 (en) Text Line Detection
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
CN114120305A (en) Training method of text classification model, and recognition method and device of text content
CN114387600A (en) Text feature recognition method and device, computer equipment and storage medium
CN111599080B (en) Spliced paper money detection method and device, financial machine tool equipment and storage medium
KR20210098930A (en) Method and apparatus for detection element image in drawings
CN112966671A (en) Contract detection method and device, electronic equipment and storage medium
CN112801960A (en) Image processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination