CN118097670B - Text image processing method and system based on multi-mode and SAM technology fusion - Google Patents

Text image processing method and system based on multi-mode and SAM technology fusion

Info

Publication number
CN118097670B
CN118097670B
Authority
CN
China
Prior art keywords
image
text
frame
features
proposal
Prior art date
Legal status
Active
Application number
CN202410517576.9A
Other languages
Chinese (zh)
Other versions
CN118097670A (en)
Inventor
黄余
王甫
龙陈杰
王贵英
汤琳
赵爱军
Current Assignee
Mianyang Normal University
Original Assignee
Mianyang Normal University
Priority date
Filing date
Publication date
Application filed by Mianyang Normal University filed Critical Mianyang Normal University
Priority to CN202410517576.9A priority Critical patent/CN118097670B/en
Publication of CN118097670A publication Critical patent/CN118097670A/en
Application granted granted Critical
Publication of CN118097670B publication Critical patent/CN118097670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of text image processing and discloses a text image processing method and system based on the fusion of multi-mode and SAM technology. Image features are extracted from the preselected frames and text features are extracted from the text data; the isolated proposal scores of the preselected frames are obtained; the image features and the text features are fused, a semantic tree is established, and whether spatial relation words exist in the semantic tree is judged; if not, the preselected frame with the highest isolated proposal score is selected as the answer frame; if so, all the text features are superposed and the preselected frame containing all the text features is selected as the answer frame. Compared with the prior art, introducing the SAM model to crop the preselected frames makes the subject features in the preselected frames clearer and reduces irrelevant background interference; at the same time, superposing the text features allows several pieces of identical text data to supplement one another's text features, which enhances the accuracy of recognizing and locating the text features and the image features.

Description

Text image processing method and system based on multi-mode and SAM technology fusion
Technical Field
The invention relates to the technical field of text image processing, in particular to a text image processing method and a text image processing system based on multi-mode and SAM technology fusion.
Background
With the development of artificial intelligence technology, users can modify certain attributes of an image according to their own needs, and can determine the specific position of an object in an image given a text description and the context of the associated image. Such tasks require a model to understand visual content and natural language instructions at the same time; they are multi-modal vision-language tasks and are widely applied in the field of text image processing.
The Chinese patent with application publication number CN111062865B discloses an image processing method, apparatus, computer device and storage medium. It acquires image features and language features and extracts modification information from them through a shared feature space; because the shared feature space performs precise matching learning on the data of the two different modalities, namely the image features and the language features, the modification information extracted from the shared feature space is more accurate, the target state in the language text is fully fused, and the accuracy of image modification is improved.
The prior art has the following defects:
When identifying the specific position in an image of an object mentioned in a text, existing text image processing methods recognize the text features and the object features in the image as a whole. Because the preselected frames marked on the image contain a large number of image features, a certain amount of irrelevant background interference data is mixed into the preselected frames, which slows down the subsequent recognition and localization of the object. Moreover, the recognition of text features and image features follows a single mode, so the text features of several pieces of identical text data cannot supplement one another, which further reduces the processing efficiency and accuracy of the text image.
In view of the above, the present invention proposes a text image processing method and system based on multi-modal and SAM technique fusion to solve the above-mentioned problems.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides the following technical scheme for achieving the purposes: the text image processing method based on the multi-mode and SAM technology fusion is applied to a text image server and comprises the following steps:
S1: Screening out a pre-selected frame from the initial pre-selected frame based on the screening criteria, and extracting image features from the pre-selected frame;
S2: analyzing input text data, and extracting text features from the text data;
S3: based on the image features and the text features, obtaining an isolated proposal score of the preselected frame;
S4: feature fusion is carried out on the image features and the text features, a semantic tree is established, and whether space relation words exist in the semantic tree is judged; if the spatial relationship word does not exist, S5 is executed; if the spatial relationship word exists, executing S6:
s5: selecting a preselected frame with the highest score of the isolated proposal as an answer frame;
S6: and carrying out feature superposition on all the text features, and selecting a preselected box containing all the text features as an answer box.
Further, the screening criteria are: filtering out an initial pre-selected frame which is less than 5% of the original image area;
the screening method of the preselection frame comprises the following steps:
Acquiring a priori value of the object area in the image through a target detection algorithm;
Automatically generating a large number of initial pre-selected frames in the image through a target detection algorithm, and marking the area value of each initial pre-selected frame;
the initial preselection frames with the area value smaller than 5% of the prior value of the original image area are removed, and the rest initial preselection frames are marked as preselection frames.
Further, the image feature extraction method comprises the following steps:
Inputting the preselection frame as an input prompt into the SAM model, and cutting out an area with the same size as the area value of the preselection frame from the original image to form an independent image part;
blurring processing is carried out on each cut-out independent image part through an image processing technology;
respectively importing the image parts subjected to cutting and blurring processing into an RN50x16 model and a ViT-B/32 model to respectively obtain feature vector representations of the two models;
And adding the feature vector representations of the two models to generate an output result of a set model, wherein the output result of the set model is the image feature.
Further, the text feature extraction method comprises the following steps:
cleaning the text data through the BERT model, and removing punctuation marks, stop words, numbers and special characters in the text data;
Carrying out word normalization processing on the cleaned text data to obtain neat text data;
word segmentation processing is carried out on the regular text data, so that the text data is converted into a word set;
converting the word set into a feature vector representation by a word embedded natural language processing technology;
The associations between words are established by a Transformer encoder and converted into new representations, and these new representations are the text features.
Further, the method for obtaining the isolated proposal score comprises the following steps:
Identifying and measuring the length and width of the n preselected frames through computer vision techniques, and respectively obtaining n length values and n width values;
Comparing the n length values with the n width values one by one to generate n long widths;
The expression of the long width is: K_i = α·(C_i / W_i);
where K_i is the i-th long width, C_i is the i-th length value, W_i is the i-th width value, and α is a weight factor;
Scanning the n preselected frames subjected to blurring processing to obtain n first scan frame images;
In each of the n first scan frame images, arbitrarily selecting one pixel point as a circle center and drawing n first detection circles, such that the distance values between the n first detection circles and the four boundaries of their scan frame images are all larger than a preset safety distance;
Sequentially marking the pixel values and the numbers of the pixel points in the n first detection circles, and accumulating and averaging the pixel values of the pixel points in the n first detection circles to obtain n blurred pixel values;
Scanning the n preselected frames not subjected to blurring processing to obtain n second scan frame images;
Drawing, in the n second scan frame images, n second detection circles corresponding to the first detection circles;
Sequentially marking the pixel values and the numbers of the pixel points in the n second detection circles, and accumulating and averaging the pixel values of the pixel points in the n second detection circles to obtain n original pixel values;
Performing difference comparison between the n blurred pixel values and the n corresponding original pixel values to obtain n pixel difference values;
The expression of the pixel difference value is: X_i = |M_i - Y_i|;
where X_i is the i-th pixel difference value, M_i is the i-th blurred pixel value, and Y_i is the i-th original pixel value;
Obtaining n isolated proposal scores based on the n long widths and the n pixel difference values;
The expression of the isolated proposal score is: G_i = β1·K_i + β2·X_i;
where G_i is the i-th isolated proposal score, and β1 and β2 are weight factors.
Further, the feature fusion method comprises the following steps:
Carrying out semantic recognition on the image features through a natural language processing technology, and marking the phrase obtained after the semantic recognition as an image element;
Sequentially counting the corresponding number of each image element, and arranging each image element in a descending order;
Marking the image elements with the number larger than a preset number threshold as active image elements, and marking the rest image elements as passive image elements;
Carrying out semantic recognition on the text features through a natural language processing technology, and marking the phrases obtained after the semantic recognition as text elements;
fusing the active image elements and the text elements one by one in an element-by-element product mode to obtain a first fusion characteristic;
The passive image elements and the text elements are fused one by one in a mode of element-by-element product, and a second fusion characteristic is obtained;
the first fusion feature and the second fusion feature are added end-to-end.
Further, the method for judging whether the space relation words exist in the semantic tree comprises the following steps:
Establishing dependency analysis of sentences corresponding to text features through a natural language processing library spaCy, and extracting semantic trees from the dependency analysis;
traversing each predicate in the semantic tree, finding the position of each predicate in the sentence, and marking;
searching words containing spatial relations in sentences as spatial relation words;
checking the position of each predicate and noun entities adjacent to each predicate;
when a spatial relationship exists between the predicate and the noun entity, judging that a spatial relationship word exists in the semantic tree;
when the spatial relationship does not exist between the predicate and the noun entity, judging that the spatial relationship word does not exist in the semantic tree.
Further, the selection method of the preselected frame with the highest score of the isolated proposal as the answer frame comprises the following steps:
Arranging the isolated proposal scores of the n preselected frames in descending order from large to small;
When the first-ranked isolated proposal score is unique, selecting the preselected frame corresponding to the first-ranked isolated proposal score as the answer frame;
When the first-ranked isolated proposal score is not unique, comparing the long widths of the preselected frames sharing the first-ranked isolated proposal score and arranging them in descending order;
Selecting the preselected frame corresponding to the first-ranked long width as the answer frame.
Further, the selection of a pre-selected box containing all text features as answer boxes includes:
Within the preselected frame, identifying the positions of the predicate and of the noun entities through computer vision techniques, and measuring the distances between the predicate and the m noun entities to obtain m values to be tested;
Taking the differences between the m values to be tested and a preset distance error value to obtain m proposal values;
The expression of the proposal value is: P_j = D_j - E;
where P_j is the j-th proposal value, D_j is the j-th value to be tested, and E is the preset distance error value;
Accumulating the m proposal values to obtain the sub-score of the predicate, and traversing the q predicates to obtain q sub-scores;
The expression of the sub-score is: S_k = Σ_j P_k,j;
where S_k is the k-th sub-score and P_k,j is the j-th proposal value of the k-th predicate;
Accumulating the q sub-scores to obtain the final score of the sentence;
The expression of the final score of the sentence is: Z = Σ_k S_k;
where Z is the final score of the sentence and S_k is the k-th sub-score;
Traversing the n preselected frames to obtain n final scores of the sentence, and selecting the preselected frame with the highest final score of the sentence as the answer frame.
The text image processing system based on the multi-mode and SAM technology fusion is applied to a text image server and is used for realizing the text image processing method based on the multi-mode and SAM technology fusion, and comprises an image feature extraction module, a text feature extraction module, an isolated proposal scoring calculation module, a spatial relation word judgment module, a first selection module and a second selection module, wherein the modules are connected through a wired or wireless network mode;
The image feature extraction module is used for screening out a pre-selected frame from the initial pre-selected frame based on a screening criterion and extracting image features from the pre-selected frame;
The text feature extraction module is used for analyzing the input text data and extracting text features from the text data;
the isolated proposal score calculation module is used for acquiring isolated proposal scores of the preselected frames based on the image characteristics and the text characteristics;
The spatial relation word judging module is used for carrying out feature fusion on the image features and the text features, establishing a semantic tree and judging whether spatial relation words exist in the semantic tree or not;
The first selection module is used for selecting a preselected frame with the highest score of the isolated proposal as an answer frame;
And the second selection module is used for carrying out feature superposition on all the text features and selecting a preselected frame containing all the text features as an answer frame.
The text image processing method and the system based on the multi-mode and SAM technology fusion have the technical effects and advantages that:
The method comprises the steps of screening out preselected frames from the initial preselected frames based on a screening criterion and extracting image features from the preselected frames; analyzing the input text data and extracting text features from it; obtaining the isolated proposal scores of the preselected frames based on the image features and the text features; fusing the image features and the text features, establishing a semantic tree, and judging whether spatial relation words exist in the semantic tree; selecting the preselected frame with the highest isolated proposal score as the answer frame; or superposing all the text features and selecting the preselected frame containing all the text features as the answer frame. Compared with the prior art, introducing the SAM model to crop the preselected frames allows the relatively large original image to be divided into a plurality of independent small blocks, so that the subject features within the preselected frames are clearer and irrelevant background interference is reduced; at the same time, superposing the text features allows the preselected frame containing the spatial relation words to be selected adaptively, achieving the effect of several pieces of identical text data mutually supplementing one another's text features, which further enhances the accuracy of recognizing and locating the text features and the image features and improves the processing efficiency of the text image.
Drawings
Fig. 1 is a flow chart of a text image processing method based on multi-mode and SAM technique fusion provided in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a text image processing system based on multi-mode and SAM technology fusion according to embodiment 2 of the present invention;
fig. 3 is a schematic diagram of a structural schematic diagram of an electronic device according to embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a computer readable storage medium according to embodiment 4 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1: referring to fig. 1, the text image processing method based on multi-mode and SAM technique fusion in this embodiment is applied to a text image server, and includes:
S1: Screening out a pre-selected frame from the initial pre-selected frame based on the screening criteria, and extracting image features from the pre-selected frame;
The screening criterion is used to screen out, from the initial preselected frames, those preselected frames whose size meets the requirement; since the sizes of the initial preselected frames differ and the objects in the image occupy a certain area, initial preselected frames with too small an area cannot fully and accurately express the objects in the image, so preselected frames that can express all of the information about the objects in the image need to be screened out from the large number of initial preselected frames;
the screening criteria are: filtering out an initial preselection frame which is smaller than 5% of the original image area, wherein the rest initial preselection frames are preselection frames;
the screening method of the preselection frame comprises the following steps:
acquiring a priori value of the object area in the image through a target detection algorithm; the prior value represents an advanced measurement value of the area of the object in the image and is used as a comparison basis for removing and screening an initial pre-selected frame, and the prior value is obtained after the area measurement is carried out on the position and the boundary of the object after the position and the boundary of the object are identified based on a target detection algorithm;
Automatically generating a large number of initial pre-selected frames in the image through a target detection algorithm, and marking the area value of each initial pre-selected frame;
Removing the initial preselected frames whose area value is smaller than 5% of the prior value of the object area in the original image, and marking the remaining initial preselected frames as preselected frames; this is related to the data characteristics of the data set and does not involve small targets, so filtering out small initial preselected frames is a prior tailored to the characteristics of the data set; by removing the initial preselected frames with small areas, frames that are too small or unimportant can be eliminated, improving the efficiency and accuracy of the subsequent model;
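As a minimal sketch of this screening step, assuming each initial preselected frame is an (x0, y0, x1, y1) tuple and that prior_area is the prior value of the object area obtained from the target detection algorithm:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def screen_preselected_boxes(initial_boxes: List[Box], prior_area: float,
                             ratio: float = 0.05) -> List[Box]:
    """Keep only initial preselected frames whose area is at least `ratio`
    (5% by default) of the prior object-area value."""
    kept = []
    for (x0, y0, x1, y1) in initial_boxes:
        area = max(0, x1 - x0) * max(0, y1 - y0)
        if area >= ratio * prior_area:
            kept.append((x0, y0, x1, y1))
    return kept
```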
The image features refer to the feature parameters within a preselected frame that represent the comprehensive information of the objects in the image; the feature parameters comprise color features, shape features, texture features and spatial relation features. Color features are global features that describe the surface properties of the scene corresponding to the image or image region; texture features are likewise global features that describe the surface properties of the scene corresponding to the image or image region; shape features have two types of representation, outline features and area features, where the outline features are mainly aimed at the outer boundary of the object and the area features relate to the whole shape region; spatial relation features refer to the mutual spatial positions or relative directional relationships among the plurality of targets segmented in the image, and these relationships can be divided into connection/adjacency relationships, overlap relationships, containment/inclusion relationships, and the like;
When the image features are acquired, the object information in the image can be comprehensively and accurately represented through the image features, so that the object in the image can be conveniently detected and identified later;
The image feature extraction method comprises the following steps:
inputting the preselection frame as an input prompt into the SAM model, and cutting out an area with the same size as the area value of the preselection frame from the original image to form an independent image part; the original image with larger whole can be divided into a plurality of independent small blocks in a dividing mode, and big data with more numerical values are independent into small data with fewer numerical values, so that the data calculation amount can be reduced, interference caused by irrelevant backgrounds is reduced, and the calculation efficiency of a model is improved;
Blurring processing is carried out on each cut-out independent image part through an image processing technology; the blurring process can make the image part look smoother and reduce details so as to reduce the definition and details of the image and facilitate the subsequent calculation process of the image part;
Respectively importing the cropped and blurred image portions into an RN50x16 model and a ViT-B/32 model to obtain the feature vector representations of the two models; RN50x16 refers to a scaled-up ResNet-50 image encoder (the 16x compute-scaled variant used in CLIP), and ViT-B/32 is the Vision Transformer model, which divides the image portion into patches and captures the relationships between the global and local image through self-attention mechanisms;
Adding the feature vector representations of the two models to generate an output result of a set model, wherein the output result of the set model is the image feature;
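A condensed sketch of this per-frame image-feature step, assuming the OpenAI CLIP package (clip.load with the "RN50x16" and "ViT-B/32" backbones), PIL and torch are available; the SAM-prompted segmentation is abstracted to a plain box crop here, and the linear projection used so that the two differently sized embeddings can be added is an assumption, since the text does not state how the two vectors are reconciled.

```python
import torch
import clip                            # OpenAI CLIP package (github.com/openai/CLIP)
from PIL import Image, ImageFilter

device = "cuda" if torch.cuda.is_available() else "cpu"
rn_model, rn_pre = clip.load("RN50x16", device=device)
vit_model, vit_pre = clip.load("ViT-B/32", device=device)
# Assumed projection to a shared width so the two embeddings can be summed;
# the text only states that the two feature vector representations are added.
proj = torch.nn.Linear(rn_model.visual.output_dim, vit_model.visual.output_dim).to(device)

def image_feature_for_box(image_path: str, box, blur_radius: float = 2.0) -> torch.Tensor:
    """Crop the preselected frame, blur it, and sum the two encoder outputs."""
    crop = Image.open(image_path).convert("RGB").crop(box)   # stands in for the SAM-prompted cut
    crop = crop.filter(ImageFilter.GaussianBlur(blur_radius))
    with torch.no_grad():
        f_rn = rn_model.encode_image(rn_pre(crop).unsqueeze(0).to(device)).float()
        f_vit = vit_model.encode_image(vit_pre(crop).unsqueeze(0).to(device)).float()
    return proj(f_rn) + f_vit          # summed representation used as the image feature
```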
S2: analyzing input text data, and extracting text features from the text data;
When the text data is input, the input text data is required to be analyzed, text features which can be directly utilized in the text data are extracted, and fusion matching is carried out on the text features and the image features through the extracted text features, so that a basis can be provided for the selection of a subsequent preselection frame;
the text feature extraction method comprises the following steps:
Cleaning the text data through the BERT model, and removing punctuation marks, stop words, numbers, special characters and the like in the text data; the cleaning mode can quickly delete useless information affecting the reading speed of the text data in the text data, so that only data capable of expressing text semantics are reserved in the text data, and a simplification effect is achieved on the text data;
Carrying out word normalization processing on the cleaned text data to obtain neat text data; the word normalization processing can ensure the normalization and reasonability of text data and keep the continuity and accuracy of the text data;
Word segmentation processing is carried out on the regular text data, so that the text data is converted into a word set; the text data after word segmentation is divided into individual word, sub-word or character identifiers which can be understood and processed by the model;
Converting the word set into a feature vector representation by a word embedded natural language processing technology; each dimension represented by the feature vector represents a word, and the dimension of the corresponding position of the word in the text data is 1 if the word appears, and is 0 if the word does not appear, so that the model can perform mathematical operation;
Establishing the relations between words through a Transformer encoder and converting them into new representations, the new representations being the text features;
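A minimal sketch of this text-feature step using the Hugging Face transformers library; the cleaning and normalization are approximated with a regular expression, lower-casing and a small illustrative stop-word list, and the token representations are taken from BERT's own Transformer encoder, which is an assumption rather than a statement of the disclosed pipeline.

```python
import re
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

STOP_WORDS = {"a", "an", "the", "of", "is", "are"}   # illustrative stop-word list only

def extract_text_features(text: str) -> torch.Tensor:
    # Cleaning: strip punctuation, digits and special characters, then normalize case.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    # Tokenize the word set and encode it; the final hidden states serve as text features.
    inputs = tokenizer(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)      # (sequence_length, 768) token representations
```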
S3: based on the image features and the text features, obtaining an isolated proposal score of the preselected frame;
The isolated proposal scores refer to the similarity between the image data and the text data after the image parts corresponding to the preselected frames are cut and subjected to fuzzy processing, and the size of the isolated proposal score of each preselected frame is different because the image characteristics in the image parts corresponding to each preselected frame are inconsistent, so that the isolated proposal scores of the preselected frames need to be calculated in order to obtain the similarity between the preselected frames and the text data;
the method for acquiring the isolated proposal score comprises the following steps:
Identifying and measuring the length and width of the n preselected frames through computer vision techniques, and respectively obtaining n length values and n width values;
Comparing the n length values with the n width values one by one to generate n long widths; through the comparison of the length value and the width value, the squareness of the shape of the image portion corresponding to the preselected frame can be obtained, so as to represent the squareness of the preselected frame after cropping and to indicate how high or low the subsequent cropping quality is;
The expression of the long width is: K_i = α·(C_i / W_i);
where K_i is the i-th long width, C_i is the i-th length value, W_i is the i-th width value, and α is a weight factor; when the ratio of length to width is closer to 1, the shape of the image portion corresponding to the preselected frame is closer to a square;
Exemplarily, α may take a value of 0.438; in addition, it should be noted that the size of a weight factor is a specific numerical value obtained by quantizing each item of data so as to facilitate subsequent comparison, and its size depends on the numbers of image features and text features, with a corresponding weight factor preliminarily set for each group of image features and text features by a person skilled in the art;
Scanning the n preselected frames subjected to blurring processing to obtain n first scan frame images;
In each of the n first scan frame images, arbitrarily selecting one pixel point as a circle center and drawing n first detection circles, such that the distance values between the n first detection circles and the four boundaries of their scan frame images are all larger than a preset safety distance; the preset safety distance is used to limit the minimum distance between a detection circle and the boundary of its scan frame image, so that the detection circle lies at a non-boundary position of the scan frame image, preventing the boundary of the detection circle from overlapping the boundary of the scan frame image and facilitating the identification and statistics of the pixel values of the pixel points within the detection circle; the preset safety distance is obtained by coefficient optimization after collecting a large number of minimum distance values between detection-circle boundaries and scan-frame-image boundaries at which the pixel values of the pixel points within the detection circle can be accurately identified and counted;
Sequentially marking the pixel values and the numbers of the pixel points in the n first detection circles, and accumulating and averaging the pixel values of the pixel points in the n first detection circles to obtain n blurred pixel values;
Scanning the n preselected frames not subjected to blurring processing to obtain n second scan frame images;
Drawing, in the n second scan frame images, n second detection circles corresponding to the first detection circles;
Sequentially marking the pixel values and the numbers of the pixel points in the n second detection circles, and accumulating and averaging the pixel values of the pixel points in the n second detection circles to obtain n original pixel values;
Performing difference comparison between the n blurred pixel values and the n corresponding original pixel values to obtain n pixel difference values;
The expression of the pixel difference value is: X_i = |M_i - Y_i|;
where X_i is the i-th pixel difference value, M_i is the i-th blurred pixel value, and Y_i is the i-th original pixel value;
Obtaining n isolated proposal scores based on the n long widths and the n pixel difference values;
The expression of the isolated proposal score is: G_i = β1·K_i + β2·X_i;
where G_i is the i-th isolated proposal score, and β1 and β2 are weight factors;
Exemplarily, β1 may take a value of 0.622 and β2 a value of 0.378; the setting logic of β1 and β2 is consistent with the setting logic of α;
It should be noted that, by drawing the first detection circle and the second detection circle, a region with a reasonable size and convenient identification processing can be selected from the scanning frame image, so that all pixel point data in the scanning frame image are identified, counted and calculated, the data volume is effectively reduced, and the calculation load is further reduced;
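To make the scoring concrete, the sketch below computes the long width, the blurred-versus-original pixel difference inside a detection circle, and the isolated proposal score as a weighted combination; the weighted-sum form, the symbol names and the default weight values follow the reconstruction above and should be read as one plausible reading of the disclosure rather than its exact formula.

```python
import numpy as np

def circle_mean(frame: np.ndarray, center, radius: int) -> float:
    """Mean pixel value inside a detection circle drawn on a grayscale frame image."""
    h, w = frame.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2
    return float(frame[mask].mean())

def isolated_proposal_score(length: float, width: float,
                            blurred_frame: np.ndarray, original_frame: np.ndarray,
                            center, radius: int,
                            alpha: float = 0.438, beta1: float = 0.622, beta2: float = 0.378) -> float:
    long_width = alpha * (length / width)                 # K_i
    m_blur = circle_mean(blurred_frame, center, radius)   # blurred pixel value M_i
    y_orig = circle_mean(original_frame, center, radius)  # original pixel value Y_i
    pixel_diff = abs(m_blur - y_orig)                     # X_i
    return beta1 * long_width + beta2 * pixel_diff        # G_i
```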
S4: Feature fusion is carried out on the image features and the text features, a semantic tree is established, and whether spatial relation words exist in the semantic tree is judged; if no spatial relation word exists, S5 is executed; if a spatial relation word exists, S6 is executed;
Feature fusion refers to a measure of matching and combining the acquired features with different expression types, and because the image features are used as the expression of the image data and the text features are used as the expression of the text data, after the image features and the text features are acquired, the image features and the text features are required to be combined and scored according to the result after the feature combination, so that the similarity of the feature fusion of the image features and the text features is numerically represented and provided for a follow-up preselection frame to be used as a basis for accurate selection;
The feature fusion method comprises the following steps:
Carrying out semantic recognition on the image features through a natural language processing technology, and marking the phrase obtained after the semantic recognition as an image element;
Sequentially counting the corresponding number of each image element, and arranging each image element in a descending order; the image elements can be orderly arranged in a descending order mode, so that the image elements with more numbers can be arranged in front, and processing calculation with higher priority can be performed later;
Marking the image elements with the number larger than a preset number threshold as active image elements, and marking the rest image elements as passive image elements; the preset quantity threshold value is a numerical value basis for distinguishing the image elements, so that the image elements are distinguished into active image elements and passive image elements, and the image elements can be classified in a mode of distinguishing the active image elements from the passive image elements, so that the active image elements and the passive image elements can be respectively and independently fused, the phenomenon of mutual cross influence during fusion of a large quantity of image elements is avoided, the efficiency of feature fusion is improved, and the phenomenon of feature fusion errors is avoided; the preset quantity threshold value is obtained through coefficient optimization after a large number of positive image elements and corresponding quantity values of negative image elements of the history are acquired;
Carrying out semantic recognition on the text features through a natural language processing technology, and marking the phrases obtained after the semantic recognition as text elements;
Fusing the active image elements and the text elements one by one in an element-by-element product mode to obtain a first fusion characteristic; the information represented by the two feature vectors can be cross-fused by adopting element-by-element product, so that the correlation between the two feature vector representations is enhanced, and meanwhile, the information on each dimension is reserved, thereby being beneficial to capturing richer information in a feature space;
The passive image elements and the text elements are fused one by one in a mode of element-by-element product, and a second fusion characteristic is obtained;
Adding the first fusion feature and the second fusion feature end to end;
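A small torch sketch of this fusion step, assuming the image elements and the text element have already been encoded as equal-length vectors; splitting by a count threshold and taking element-by-element products follows the description above, while averaging the per-element products into a single first (or second) fusion feature is an added assumption.

```python
import torch
from typing import List

def fuse_features(image_elements: List[torch.Tensor], element_counts: List[int],
                  text_element: torch.Tensor, count_threshold: int = 3) -> torch.Tensor:
    """image_elements: list of (D,) vectors with a parallel list of occurrence counts;
    text_element: (D,) vector. Returns the first and second fusion features joined end to end."""
    active = [e for e, c in zip(image_elements, element_counts) if c > count_threshold]
    passive = [e for e, c in zip(image_elements, element_counts) if c <= count_threshold]
    zero = torch.zeros_like(text_element)
    first = torch.stack([e * text_element for e in active]).mean(dim=0) if active else zero
    second = torch.stack([e * text_element for e in passive]).mean(dim=0) if passive else zero
    return torch.cat([first, second], dim=0)
```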
The spatial relation words refer to the positions and relation words of predicates in sentences corresponding to text features in a semantic tree, are used for representing whether the predicates meet spatial heuristic rules or not, and the spatial relation between objects is processed according to the spatial heuristic rules, so that the position relation between the entities can be determined by using the extracted dependency paths and the semantic tree;
The spatial heuristic rules are: a complex sentence is decomposed into basic primitives and common primitives, where the basic primitives are the predicates applicable to the objects and the common primitives are the spatial relations among the objects; the predicates are processed through CLIP, and the spatial relations are processed through the spatial heuristic rules;
The judging method for whether the space relation words exist in the semantic tree comprises the following steps:
Establishing dependency analysis of sentences corresponding to text features through a natural language processing library spaCy, and extracting semantic trees from the dependency analysis; each name block is a node, and the dependency path of the head of the name block is the relation among contained target entities;
Traversing each predicate in the semantic tree, finding the position of each predicate in the sentence, and marking; illustratively, the predicate is "a cat", and the positions in the sentence include, but are not limited to, a start position and an end position;
Searching words containing spatial relations in sentences as spatial relation words; illustratively, the spatial relationship terms are "next to", "in front of";
checking the position of each predicate and noun entities adjacent to each predicate;
when a spatial relationship exists between the predicate and the noun entity, judging that a spatial relationship word exists in the semantic tree;
When no spatial relation exists between the predicate and the noun entity, judging that no spatial relation word exists in the semantic tree;
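A condensed sketch of the spatial-relation check with spaCy (assuming the en_core_web_sm model is installed); the fixed list of spatial phrases and the preposition-to-noun test are simplifications of the dependency-tree procedure described above.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
SPATIAL_PHRASES = {"next to", "in front of", "behind", "above", "below", "under", "beside"}

def has_spatial_relation(sentence: str) -> bool:
    doc = nlp(sentence)
    text = doc.text.lower()
    # First look for an explicit spatial phrase in the sentence.
    if not any(phrase in text for phrase in SPATIAL_PHRASES):
        return False
    # Then confirm that some preposition in the dependency tree governs a noun entity.
    for tok in doc:
        if tok.dep_ == "prep" and any(t.pos_ in ("NOUN", "PROPN") for t in tok.subtree):
            return True
    return False
```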
S5: Selecting a preselected frame with the highest isolated proposal score as an answer frame;
When judging that no spatial relation word exists in the semantic tree, directly calculating an isolated proposal score by using the whole sentence at the moment, and selecting one with the highest score from the calculated isolated proposal scores;
the selection method for selecting the preselected frame with the highest score of the isolated proposal as the answer frame comprises the following steps:
Arranging the isolated proposal scores of the n preselected frames in descending order from large to small;
When the first-ranked isolated proposal score is unique, there is only one preselected frame corresponding to the first-ranked isolated proposal score, and that preselected frame is selected as the answer frame;
When the first-ranked isolated proposal score is not unique, there are several preselected frames corresponding to the first-ranked isolated proposal score; the long widths of these preselected frames are compared and arranged in descending order; the size of the isolated proposal score is influenced by the long width and the pixel difference value, and the long width, as an intuitive representation of the size of the cropped region of the preselected frame, can directly and proportionally represent the overall quality of the preselected frame, so the preselected frame with the larger long width is taken as the answer frame with higher priority;
Selecting the preselected frame corresponding to the first-ranked long width as the answer frame;
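The selection rule above (highest isolated proposal score, with ties broken by the larger long width) can be written directly, as in this short sketch with parallel lists of boxes, scores and long widths.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def select_answer_box(boxes: List[Box], scores: List[float], long_widths: List[float]) -> Box:
    best_score = max(scores)
    tied = [i for i, s in enumerate(scores) if s == best_score]
    if len(tied) == 1:                       # the first-ranked score is unique
        return boxes[tied[0]]
    # Several frames share the top score: prefer the one with the larger long width.
    return boxes[max(tied, key=lambda i: long_widths[i])]
```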
S6: Feature superposition is carried out on all the text features, and a preselected frame containing all the text features is selected as an answer frame;
when judging that the space relation words exist in the semantic tree, at the moment, carrying out feature superposition on the text features, analyzing the final score after the text features are superposed, and further selecting an answer frame;
the selection of a pre-selected box containing all text features as answer boxes includes:
Within the preselected frame, identifying the positions of the predicate and of the noun entities through computer vision techniques, and measuring the distances between the predicate and the m noun entities to obtain m values to be tested;
Taking the differences between the m values to be tested and a preset distance error value to obtain m proposal values; the preset distance error value is an error compensation value for the distance between the predicate and a noun entity; because a noun entity has a certain boundary thickness that affects the measurement accuracy of the distance value, the mathematical error value is subtracted after the distance between the predicate and the noun entity is measured, so as to ensure the accuracy of the proposal value;
The expression of the proposal value is: P_j = D_j - E;
where P_j is the j-th proposal value, D_j is the j-th value to be tested, and E is the preset distance error value;
Accumulating the m proposal values to obtain the sub-score of the predicate, and traversing the q predicates to obtain q sub-scores;
The expression of the sub-score is: S_k = Σ_j P_k,j;
where S_k is the k-th sub-score and P_k,j is the j-th proposal value of the k-th predicate;
Accumulating the q sub-scores to obtain the final score of the sentence;
The expression of the final score of the sentence is: Z = Σ_k S_k;
where Z is the final score of the sentence and S_k is the k-th sub-score;
Traversing the n preselected frames to obtain n final scores of the sentence, and selecting the preselected frame with the highest final score of the sentence as the answer frame;
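A short sketch of the superposition scoring, assuming that for each preselected frame the predicate-to-noun-entity distances have already been measured; following the reconstruction above, each proposal value is a measured distance minus the preset error value, sub-scores accumulate the proposal values per predicate, and the sentence's final score accumulates the sub-scores.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def sentence_final_score(distances_per_predicate: List[List[float]], error_value: float) -> float:
    """distances_per_predicate: one list of measured predicate-to-noun-entity
    distances per predicate, all within a single preselected frame."""
    final = 0.0
    for distances in distances_per_predicate:
        sub_score = sum(d - error_value for d in distances)   # accumulated proposal values
        final += sub_score
    return final

def select_by_superposition(boxes: List[Box], distances_per_box: List[List[List[float]]],
                            error_value: float) -> Box:
    scores = [sentence_final_score(d, error_value) for d in distances_per_box]
    return boxes[max(range(len(boxes)), key=lambda i: scores[i])]
```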
In this embodiment, preselected frames are screened out from the initial preselected frames based on the screening criterion and image features are extracted from them; the input text data is analyzed and text features are extracted from it; the isolated proposal scores of the preselected frames are obtained based on the image features and the text features; the image features and the text features are fused, a semantic tree is established, and whether spatial relation words exist in the semantic tree is judged; the preselected frame with the highest isolated proposal score is selected as the answer frame, or all the text features are superposed and the preselected frame containing all the text features is selected as the answer frame. Compared with the prior art, introducing the SAM model to crop the preselected frames allows the relatively large original image to be divided into a plurality of independent small blocks, so that the subject features within the preselected frames are clearer and irrelevant background interference is reduced; at the same time, superposing the text features allows the preselected frame containing the spatial relation words to be selected adaptively, achieving the effect of several pieces of identical text data mutually supplementing one another's text features, which further enhances the accuracy of recognizing and locating the text features and the image features and improves the processing efficiency of the text image.
Example 2: referring to fig. 2, a part of the description of embodiment 1 is not described in detail in this embodiment, and a text image processing system based on multi-mode and SAM technology fusion is provided, which is applied to a text image server and is used for implementing a text image processing method based on multi-mode and SAM technology fusion, and includes an image feature extraction module, a text feature extraction module, an isolated proposal score calculation module, a spatial relation word judgment module, a first selection module and a second selection module, wherein the modules are connected through a wired or wireless network;
The image feature extraction module is used for screening out a pre-selected frame from the initial pre-selected frame based on a screening criterion and extracting image features from the pre-selected frame;
The text feature extraction module is used for analyzing the input text data and extracting text features from the text data;
the isolated proposal score calculation module is used for acquiring isolated proposal scores of the preselected frames based on the image characteristics and the text characteristics;
The spatial relation word judging module is used for carrying out feature fusion on the image features and the text features, establishing a semantic tree and judging whether spatial relation words exist in the semantic tree or not;
The first selection module is used for selecting a preselected frame with the highest score of the isolated proposal as an answer frame;
And the second selection module is used for carrying out feature superposition on all the text features and selecting a preselected frame containing all the text features as an answer frame.
Example 3: referring to fig. 3, the disclosure provides an electronic device, including a processor and a memory;
wherein the memory stores a computer program for the processor to call;
the processor executes the text image processing method based on the multi-mode and SAM technology fusion by calling the computer program stored in the memory.
Since the electronic device described in this embodiment is an electronic device for implementing the text image processing method based on the combination of multi-mode and SAM technology in embodiment 1 of the present application, based on the text image processing method based on the combination of multi-mode and SAM technology described in this embodiment, those skilled in the art can understand the specific implementation manner of the electronic device and various modifications thereof, so how to implement the method in this embodiment of the present application in this electronic device will not be described in detail herein. As long as the person skilled in the art implements the electronic device adopted by the text image processing method based on the fusion of the multi-mode and SAM technology in the embodiment of the present application, the electronic device belongs to the scope of protection intended by the present application.
Example 4: referring to fig. 4, the present embodiment disclosure provides a computer readable storage medium having stored thereon a computer program that is erasable;
when the computer program is run, the text image processing method based on the multi-mode and SAM technology fusion is implemented.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. The text image processing method based on the multi-mode and SAM technology fusion is applied to a text image server and is characterized by comprising the following steps:
S1: Screening out a pre-selected frame from the initial pre-selected frame based on the screening criteria, and extracting image features from the pre-selected frame;
The image feature extraction method comprises the following steps:
Inputting the preselection frame as an input prompt into the SAM model, and cutting out an area with the same size as the area value of the preselection frame from the original image to form an independent image part;
blurring processing is carried out on each cut-out independent image part through an image processing technology;
respectively importing the image parts subjected to cutting and blurring processing into an RN50x16 model and a ViT-B/32 model to respectively obtain feature vector representations of the two models;
Adding the feature vector representations of the two models to generate an output result of a set model, wherein the output result of the set model is the image feature;
S2: analyzing input text data, and extracting text features from the text data;
S3: based on the image features and the text features, obtaining an isolated proposal score of the preselected frame;
the isolated proposal scoring refers to the similarity of image data and text data after cutting and blurring processing of the image part corresponding to the preselected frame;
the method for acquiring the isolated proposal score comprises the following steps:
Identifying and measuring the length and width of the n preselected frames through computer vision techniques, and respectively obtaining n length values and n width values;
Comparing the n length values with the n width values one by one to generate n long widths;
The expression of the long width is: K_i = α·(C_i / W_i);
where K_i is the i-th long width, C_i is the i-th length value, W_i is the i-th width value, and α is a weight factor;
Scanning the n preselected frames subjected to blurring processing to obtain n first scan frame images;
In each of the n first scan frame images, arbitrarily selecting one pixel point as a circle center and drawing n first detection circles, such that the distance values between the n first detection circles and the four boundaries of their scan frame images are all larger than a preset safety distance;
Sequentially marking the pixel values and the numbers of the pixel points in the n first detection circles, and accumulating and averaging the pixel values of the pixel points in the n first detection circles to obtain n blurred pixel values;
Scanning the n preselected frames not subjected to blurring processing to obtain n second scan frame images;
Drawing, in the n second scan frame images, n second detection circles corresponding to the first detection circles;
Sequentially marking the pixel values and the numbers of the pixel points in the n second detection circles, and accumulating and averaging the pixel values of the pixel points in the n second detection circles to obtain n original pixel values;
Performing difference comparison between the n blurred pixel values and the n corresponding original pixel values to obtain n pixel difference values;
The expression of the pixel difference value is: X_i = |M_i - Y_i|;
where X_i is the i-th pixel difference value, M_i is the i-th blurred pixel value, and Y_i is the i-th original pixel value;
Obtaining n isolated proposal scores based on the n long widths and the n pixel difference values;
The expression of the isolated proposal score is: G_i = β1·K_i + β2·X_i;
where G_i is the i-th isolated proposal score, and β1 and β2 are weight factors;
S4: feature fusion is carried out on the image features and the text features, a semantic tree is established, and whether space relation words exist in the semantic tree is judged; if the spatial relationship word does not exist, S5 is executed; if the spatial relationship word exists, executing S6:
s5: selecting a preselected frame with the highest score of the isolated proposal as an answer frame;
S6: and carrying out feature superposition on all the text features, and selecting a preselected box containing all the text features as an answer box.
2. The text image processing method based on multi-modal and SAM technique fusion according to claim 1, wherein the screening criteria are: filtering out an initial pre-selected frame which is less than 5% of the original image area;
the screening method of the preselection frame comprises the following steps:
Acquiring a priori value of the object area in the image through a target detection algorithm;
Automatically generating a large number of initial pre-selected frames in the image through a target detection algorithm, and marking the area value of each initial pre-selected frame;
removing the initial preselected frames whose area value is smaller than 5% of the prior area value, and marking the remaining initial preselected frames as preselected frames.
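A minimal sketch of this screening step follows, assuming the initial frames come from any target-detection backend as (x, y, w, h) tuples and that the prior area value is already available; the names and the 5% threshold parameter are illustrative only.

def screen_preselected_frames(initial_boxes, prior_area, ratio=0.05):
    # keep only initial frames whose area reaches 5% of the prior area value
    return [(x, y, w, h) for (x, y, w, h) in initial_boxes if w * h >= ratio * prior_area]

# example: with a prior area of 10_000 pixels, a 10x10 frame is filtered out
print(screen_preselected_frames([(0, 0, 10, 10), (5, 5, 120, 80)], prior_area=10_000))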
3. The text image processing method based on multi-mode and SAM technique fusion according to claim 2, wherein the text feature extraction method includes:
cleaning the text data through the BERT model, and removing punctuation marks, stop words, numbers and special characters in the text data;
Carrying out word normalization processing on the cleaned text data to obtain neat text data;
word segmentation processing is carried out on the regular text data, so that the text data is converted into a word set;
converting the word set into a feature-vector representation by word-embedding natural language processing techniques;
establishing associations between words by a Transformer encoder and converting them into new token representations, which are the text features.
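The text-feature pipeline of claim 3 could be sketched with the Hugging Face transformers package as below; the model name, the stop-word list, and the cleaning rules are assumptions, since the claim does not fix them.

import re
import torch
from transformers import BertTokenizer, BertModel

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}          # assumed minimal list

def extract_text_features(text):
    # cleaning: drop punctuation, digits and special characters, normalize case
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    # word segmentation and stop-word removal -> word set
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    # word embedding plus Transformer encoder: token-level hidden states
    inputs = tokenizer(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)            # (tokens, 768) text features

print(extract_text_features("A red cup sits to the left of the laptop.").shape)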
4. A text image processing method based on multi-modal and SAM technique fusion according to claim 3, characterized in that the feature fusion method comprises:
Carrying out semantic recognition on the image features through a natural language processing technology, and marking the phrase obtained after the semantic recognition as an image element;
Sequentially counting the corresponding number of each image element, and arranging each image element in a descending order;
Marking the image elements with the number larger than a preset number threshold as active image elements, and marking the rest image elements as passive image elements;
Carrying out semantic recognition on the text features through a natural language processing technology, and marking the phrases obtained after the semantic recognition as text elements;
fusing the active image elements and the text elements one by one by element-wise product to obtain first fusion features;
fusing the passive image elements and the text elements one by one by element-wise product to obtain second fusion features;
the first fusion features and the second fusion features are joined end to end.
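A minimal sketch of this fusion step, assuming every image element and text element is already encoded as a fixed-length vector; the pairing by position and reading the end-to-end joining as concatenation are assumptions about details the claim leaves open.

import numpy as np

def fuse_features(active_img, passive_img, text_elems):
    # element-wise product of active image elements with text elements
    first = [a * t for a, t in zip(active_img, text_elems)]
    # element-wise product of passive image elements with text elements
    second = [p * t for p, t in zip(passive_img, text_elems)]
    # join the first and second fusion features end to end
    return np.concatenate(first + second)

active = [np.ones(4), 2 * np.ones(4)]
passive = [3 * np.ones(4), 4 * np.ones(4)]
text = [np.arange(4.0), np.arange(4.0)]
print(fuse_features(active, passive, text).shape)          # (16,)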
5. The text image processing method based on multi-modal and SAM technique fusion according to claim 4, wherein the method for determining whether a spatial relationship word exists in the semantic tree comprises:
Establishing dependency analysis of sentences corresponding to text features through a natural language processing library spaCy, and extracting semantic trees from the dependency analysis;
traversing each predicate in the semantic tree, finding the position of each predicate in the sentence, and marking;
searching words containing spatial relations in sentences as spatial relation words;
checking the position of each predicate and noun entities adjacent to each predicate;
when a spatial relationship exists between the predicate and the noun entity, judging that a spatial relationship word exists in the semantic tree;
when the spatial relationship does not exist between the predicate and the noun entity, judging that the spatial relationship word does not exist in the semantic tree.
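The spatial-relation check could be sketched on top of spaCy's dependency parse as follows; the list of spatial words and the subtree-based adjacency test are assumptions, since the claim only requires a predicate with a spatially related noun entity.

import spacy

SPATIAL_WORDS = {"left", "right", "above", "below", "on", "under",
                 "behind", "in", "front", "next", "between"}   # assumed list

def has_spatial_relation(sentence):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "VERB":                               # predicate
            # look in the predicate's subtree for a spatial word near a noun entity
            for child in token.subtree:
                if child.lower_ in SPATIAL_WORDS and any(
                        t.pos_ in ("NOUN", "PROPN") for t in child.subtree):
                    return True
    return False

print(has_spatial_relation("The cup is standing to the left of the laptop."))
print(has_spatial_relation("The cup is red."))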
6. The text image processing method based on multi-modal and SAM technique fusion according to claim 5, wherein the selecting method of the preselected box with the highest isolated proposal score as the answer box includes:
arranging the isolated proposal scores of the n preselected frames in descending order from large to small;
when the first-ranked isolated proposal score is unique, selecting the preselected frame corresponding to the first-ranked isolated proposal score as the answer frame;
when the first-ranked isolated proposal score is not unique, comparing the long-width values of the preselected frames that share the first-ranked isolated proposal score, and arranging them in descending order;
selecting the preselected frame corresponding to the first-ranked long-width value as the answer frame.
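A minimal sketch of this selection rule, assuming each candidate is a tuple of its isolated proposal score, its long-width value, and the frame itself; sorting on the pair implements the tie-break.

def select_answer_frame(candidates):
    # candidates: (isolated_score, long_width, frame); rank by score, then long-width
    ranked = sorted(candidates, key=lambda c: (c[0], c[1]), reverse=True)
    return ranked[0][2]

cands = [(0.8, 1.2, "frame A"), (0.9, 1.1, "frame B"), (0.9, 1.6, "frame C")]
print(select_answer_frame(cands))   # frame C: tied top score, larger long-width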
7. The text image processing method based on multi-modal and SAM technique fusion according to claim 6, wherein the method of selecting the preselected box containing all text features as the answer box comprises:
within the preselected frame, identifying the positions of the predicates and noun entities by computer vision technology, and measuring the distance between each predicate and its noun entities to obtain m values to be tested;
taking the difference between each of the m values to be tested and a preset distance error value to obtain m proposal values;
the expression of the proposal value is:

T_j = |d_j - d_0|

where T_j is the j-th proposal value, d_j is the j-th value to be tested, and d_0 is the preset distance error value;
accumulating the m proposal values to obtain the sub-score of a predicate, and traversing the k predicates to obtain k sub-scores;
the expression of the sub-score is:

Z_u = Σ_{j=1..m} T_{u,j}

where Z_u is the u-th sub-score and T_{u,j} is the j-th proposal value of the u-th predicate;
accumulating the k sub-scores to obtain the final score of the sentence;
the expression of the final score of the sentence is:

F = Σ_{u=1..k} Z_u

where F is the final score of the sentence and Z_u is the u-th sub-score;
traversing the n preselected frames to obtain n final sentence scores, and selecting the preselected frame with the highest final sentence score as the answer frame.
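A minimal sketch of the claim-7 scoring, taking the measured predicate-to-entity distances as given; the absolute-difference form of the proposal value matches the reconstruction above and is itself an assumption, as are all names.

def sentence_final_score(distances_per_predicate, d0):
    # sub-score of each predicate = sum of its proposal values |d_j - d0|
    return sum(sum(abs(d - d0) for d in distances) for distances in distances_per_predicate)

def select_answer_frame(per_frame_distances, d0):
    # one list of per-predicate distance lists per preselected frame
    scores = [sentence_final_score(d, d0) for d in per_frame_distances]
    return max(range(len(scores)), key=scores.__getitem__)

frames = [[[10.0, 12.0], [8.0]], [[30.0, 26.0], [40.0]]]
print(select_answer_frame(frames, d0=9.0))   # index of the frame with the highest final score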
8. A text image processing system based on multi-mode and SAM technology fusion, applied to a text image server and used for realizing the text image processing method based on multi-mode and SAM technology fusion, characterized by comprising an image feature extraction module, a text feature extraction module, an isolated proposal score calculation module, a spatial relation word judgment module, a first selection module and a second selection module, wherein the modules are connected through a wired or wireless network;
The image feature extraction module is used for screening out a pre-selected frame from the initial pre-selected frame based on a screening criterion and extracting image features from the pre-selected frame;
The text feature extraction module is used for analyzing the input text data and extracting text features from the text data;
the isolated proposal score calculation module is used for acquiring isolated proposal scores of the preselected frames based on the image characteristics and the text characteristics;
The spatial relation word judging module is used for carrying out feature fusion on the image features and the text features, establishing a semantic tree and judging whether spatial relation words exist in the semantic tree or not;
The first selection module is used for selecting a preselected frame with the highest score of the isolated proposal as an answer frame;
And the second selection module is used for carrying out feature superposition on all the text features and selecting a preselected frame containing all the text features as an answer frame.
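A minimal sketch of how the six claim-8 modules could be composed in a single process; all class and method names are assumptions, and the wired or wireless transport between modules is abstracted away here.

class TextImagePipeline:
    def __init__(self, img_mod, txt_mod, score_mod, spatial_mod, first_sel, second_sel):
        self.img_mod, self.txt_mod = img_mod, txt_mod
        self.score_mod, self.spatial_mod = score_mod, spatial_mod
        self.first_sel, self.second_sel = first_sel, second_sel

    def answer_frame(self, image, text):
        frames, img_feats = self.img_mod(image)        # screen frames, extract image features
        txt_feats = self.txt_mod(text)                 # extract text features
        scores = self.score_mod(img_feats, txt_feats, frames)
        if self.spatial_mod(img_feats, txt_feats):     # spatial relation word present?
            return self.second_sel(frames, txt_feats)  # frame containing all text features
        return self.first_sel(frames, scores)          # frame with the highest isolated score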
CN202410517576.9A 2024-04-28 2024-04-28 Text image processing method and system based on multi-mode and SAM technology fusion Active CN118097670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410517576.9A CN118097670B (en) 2024-04-28 2024-04-28 Text image processing method and system based on multi-mode and SAM technology fusion

Publications (2)

Publication Number Publication Date
CN118097670A 2024-05-28
CN118097670B 2024-06-28

Family

ID=91159772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410517576.9A Active CN118097670B (en) 2024-04-28 2024-04-28 Text image processing method and system based on multi-mode and SAM technology fusion

Country Status (1)

Country Link
CN (1) CN118097670B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359492A (en) * 2022-09-01 2022-11-18 上海鱼尔网络科技有限公司 Text image matching model training method, picture labeling method, device and equipment
CN117036706A (en) * 2023-08-11 2023-11-10 北京无代码科技有限公司 Image segmentation method and system based on multi-modal dialogue language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4266195A1 (en) * 2022-04-19 2023-10-25 Microsoft Technology Licensing, LLC Training of text and image models


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant