CN112418199B - Multi-modal information extraction method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN112418199B (application CN202110093438.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- area
- target image
- image
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/413 — Classification of content, e.g. text, photographs or tables (under G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition › G06V30/40 Document-oriented image-based pattern recognition › G06V30/41 Analysis of document content)
- G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds (under G06V10/00 Arrangements for image or video recognition or understanding › G06V10/20 Image preprocessing › G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region)
- G06V30/414 — Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text (under G06V30/00 › G06V30/40 › G06V30/41)
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a multi-modal information extraction method and apparatus, an electronic device, and a storage medium, wherein the multi-modal information extraction method comprises the following steps: acquiring a target image of an object to be extracted, the target image comprising image content and text content; identifying a corresponding image region and a corresponding text region in the target image; and extracting corresponding multi-modal information from the image region and the text region, the multi-modal information comprising a target image and a target text. The method, apparatus, electronic device, and storage medium can automatically extract multi-modal information from the object to be extracted, greatly reducing the manual workload and labor cost of multi-modal information extraction, and are also well suited to extracting multi-modal information from printed image-text data.
Description
Technical Field
The present application relates to the field of information extraction technology, and in particular to a multi-modal information extraction method and apparatus, an electronic device, and a storage medium.
Background
Multi-modal learning, i.e., multi-modal machine learning, is one of the key breakthrough directions in today's artificial intelligence field. It aims to give machine learning methods the ability to process and understand information from multiple source modalities; popular research directions in multi-modal learning currently involve learning across images, video, audio, and semantics.
Multi-modal learning generally requires a high-quality multi-modal dataset for training to ensure that it achieves the desired effect in a specific application. At present, most high-quality multi-modal datasets are general-domain datasets, and datasets for specific fields and industries are scarce. In fact, over their long development, individual fields and industries have accumulated large amounts of professional printed image-text data, such as professional books and various document materials. Building multi-modal datasets from such data requires extracting multi-modal information, which today is done mainly by hand. Manual extraction, however, requires extensive human participation in collection and labeling, so the labor cost is too high, and it is tedious and time-consuming when applied to printed image-text data.
Disclosure of Invention
An object of the embodiments of the present application is to provide a multi-modal information extraction method and apparatus, an electronic device, and a storage medium that can automatically extract multi-modal information from an object to be extracted, thereby greatly reducing the manual workload and labor cost of multi-modal information extraction, while also being well suited to extracting multi-modal information from printed image-text data.
In a first aspect, an embodiment of the present application provides a multi-modal information extraction method, including:
acquiring a target image of an object to be extracted, wherein the target image comprises image content and text content;
identifying and obtaining a corresponding image area and a corresponding text area in the target image according to the target image;
and extracting corresponding multi-modal information according to the image area and the text area, wherein the multi-modal information comprises a target image and a target text.
In the above implementation, the multi-modal information extraction method of the embodiments of the present application identifies the corresponding image region and text region in the acquired target image of the object to be extracted, and automatically extracts the corresponding multi-modal information, which includes the target image and the target text, from those regions. Because the multi-modal information comes directly from the object to be extracted, the manual workload and labor cost of multi-modal information extraction are greatly reduced, and the method is also well suited to, and makes more convenient, the extraction of multi-modal information from printed image-text data.
Further, the acquiring a target image of an object to be extracted includes:
acquiring an initial image of an object to be extracted;
and preprocessing the initial image to obtain a target image of the object to be extracted.
In the above implementation, preprocessing the initial image of the object to be extracted yields a good target image, and such a target image facilitates multi-modal information extraction and helps ensure the quality of the extracted multi-modal information.
Further, the identifying and obtaining the image region and the text region corresponding to the target image according to the target image includes:
and identifying and obtaining corresponding image regions and text regions in the target image according to the area of each connected domain in the target image and a preset connected domain segmentation threshold.
In the above implementation, using the area of each connected domain in the target image of the object to be extracted together with a preset connected-domain segmentation threshold, the method can quickly and accurately identify the corresponding image regions and text regions in the target image, making multi-modal information extraction more convenient.
Further, the extracting, according to the image region and the text region, corresponding multi-modal information includes:
searching, merging and filtering the image areas to obtain a target image area;
and extracting corresponding multi-modal information according to the target image region and the text region.
In the above implementation, searching, merging, and filtering the image regions better normalizes them and eliminates the interference of any redundant image information in the target image, so that the target image regions are obtained more reliably and the quality of the extracted multi-modal information is further improved.
Further, the extracting, according to the target image region and the text region, corresponding multi-modal information includes:
identifying and obtaining a title text area and a description text area according to the target image area and the text area;
extracting corresponding multi-modal information according to the target image area, the title text area and the description text area;
the target text comprises a target title text and a target description text.
In the above implementation, identifying the title text region and the description text region from the target image region and the text region divides the text regions more accurately, making the extracted multi-modal information more accurate. Since the target text does not need to be divided manually, the manual workload and labor cost of multi-modal information extraction are further reduced.
Further, the identifying and obtaining a title text region and a description text region according to the target image region and the text region includes:
acquiring position information of each target image area;
performing text search according to the position information of each target image area by using a preset search distance and a preset search area to obtain a suspected title text area;
and identifying and obtaining a title text area and a description text area according to the area, the length and the width of the suspected title text area.
In the above implementation, the method performs a text search from the acquired position information of each target image region, using a preset search distance and preset search areas, to obtain suspected title text regions, and then identifies the title text region and the description text region from the area, length, and width of each suspected title text region, so that both can be identified quickly and accurately.
Further, the extracting corresponding multi-modal information according to the target image region, the title text region, and the description text region includes:
associating the corresponding target image region, the title text region and the description text region;
extracting corresponding multi-modal information according to the associated target image region, title text region and description text region;
and forming corresponding image-text information pairs by the corresponding target image, the target title text and the target description text.
In the above implementation, the corresponding target image region, title text region, and description text region are associated with one another; the corresponding multi-modal information is extracted from the associated regions; and the corresponding target image, target title text, and target description text form a corresponding image-text information pair. The extracted multi-modal information is therefore more accurate, and since the target image, target title text, and target description text need not be associated manually, the manual workload and labor cost of multi-modal information extraction are further reduced.
In a second aspect, an embodiment of the present application provides a multimodal information extraction apparatus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target image of an object to be extracted, and the target image comprises image content and text content;
the identification module is used for identifying and obtaining the corresponding image area and the text area in the target image according to the target image;
and the extraction module is used for extracting corresponding multi-modal information according to the image region and the text region, wherein the multi-modal information comprises a target image and a target text.
In the above implementation, the multi-modal information extraction apparatus of the embodiments of the present application identifies the corresponding image region and text region in the acquired target image of the object to be extracted, and automatically extracts the corresponding multi-modal information, which includes the target image and the target text, from those regions. Because the multi-modal information comes directly from the object to be extracted, the manual workload and labor cost of multi-modal information extraction are greatly reduced, and the apparatus is also well suited to extracting multi-modal information from printed image-text data.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to make the electronic device execute the above-mentioned multimodal information extraction method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the above-mentioned multimodal information extraction method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a multi-modal information extraction method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of step S110 according to a first embodiment of the present application;
fig. 3 is a schematic flowchart of step S130 according to a first embodiment of the present application;
fig. 4 is a block diagram of a multi-modal information extraction apparatus according to a second embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
As noted in the Background, most high-quality multi-modal datasets are general-domain datasets, while individual fields and industries have accumulated large amounts of professional printed image-text data, such as professional books and various document materials. The multi-modal information needed to build datasets from such data is currently extracted mainly by hand, which requires extensive manual collection and labeling, makes labor costs too high, and is tedious and time-consuming for printed image-text data.
In view of the above problems in the prior art, the present application provides a multi-modal information extraction method and apparatus, an electronic device, and a storage medium that can automatically extract multi-modal information from an object to be extracted, thereby greatly reducing the manual workload and labor cost of multi-modal information extraction, and that are well suited to extracting multi-modal information from printed image-text data.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a multimodal information extraction method provided in an embodiment of the present application. The multimodal information extraction method described below in the embodiments of the present application can be applied to a server.
The multi-modal information extraction method in the embodiment of the application comprises the following steps:
step S110, a target image of an object to be extracted is obtained, and the target image comprises image content and text content.
In this embodiment, the object to be extracted may be printed image-text data, an electronic document, a web page, and so on.
It can be understood that the target image of the object to be extracted is an image corresponding to the object to be extracted.
Step S120, identifying and obtaining an image region and a text region corresponding to the target image according to the target image.
In this embodiment, the target image of the object to be extracted includes image content and text content, and an image region and a text region corresponding to the image content and the text content in the target image can be identified and obtained according to the target image of the object to be extracted.
Step S130, extracting corresponding multi-modal information according to the image region and the text region, where the multi-modal information includes the target image and the target text.
In this embodiment, the target image and the target text correspond to the image content and the text content of the target image of the object to be extracted.
According to the multi-modal information extraction method described above, the corresponding image region and text region in the target image are identified from the acquired target image of the object to be extracted, and the corresponding multi-modal information, which includes the target image and the target text, is extracted automatically from those regions. Because the multi-modal information comes from the object to be extracted, the manual workload and labor cost of multi-modal information extraction are greatly reduced, and the method is also well suited to, and convenient for, extracting multi-modal information from printed image-text data.
In order to better obtain a target image of an object to be extracted, an embodiment of the present application provides a possible implementation manner, referring to fig. 2, fig. 2 is a schematic flow diagram of step S110 provided in the embodiment of the present application, and the multi-modal information extraction method in the embodiment of the present application, in which step S110, a target image of an object to be extracted is obtained, and the target image includes image content and text content, and the method may include the following steps:
step S111, obtaining an initial image of an object to be extracted;
and step S112, preprocessing the initial image to obtain a target image of the object to be extracted.
Specifically, the initial image of the object to be extracted is an image corresponding to the object to be extracted.
The preprocessing applied to the initial image of the object to be extracted may include image denoising, grayscale conversion, binarization, erosion and dilation, and similar processing.
The target image of the object to be extracted can be regarded as being composed of a plurality of connected domains, wherein each character in the target image of the object to be extracted is a small connected domain, and the image part in the target image of the object to be extracted is a relatively large connected domain.
In the above process, preprocessing the initial image of the object to be extracted yields a good target image, and such a target image facilitates multi-modal information extraction and helps ensure the quality of the extracted multi-modal information.
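As an illustration, the preprocessing chain just described (grayscale conversion, binarization, dilation) can be sketched in a few lines of NumPy. This is only a minimal stand-in, not the patent's implementation: the function name `preprocess`, the fixed threshold, and the 3×3 dilation kernel are all assumptions, and a real pipeline would typically also denoise and might use a library such as OpenCV.

```python
import numpy as np

def preprocess(rgb, thresh=128):
    """Illustrative preprocessing: grayscale -> binarize -> 3x3 dilation.
    (Names and parameters are assumptions, not taken from the patent.)"""
    # Luminance-weighted grayscale conversion
    gray = rgb @ np.array([0.299, 0.587, 0.114])
    # Binarize: 1 = ink (dark pixel), 0 = background
    binary = (gray < thresh).astype(np.uint8)
    # 3x3 dilation: a pixel turns on if any 8-neighbour is on, fusing
    # nearby strokes so characters and pictures form larger blobs
    padded = np.pad(binary, 1)
    h, w = binary.shape
    dilated = np.zeros_like(binary)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            dilated |= padded[dy:dy + h, dx:dx + w]
    return dilated
```

Dilating the binary image enlarges the connected domains, so that the area-based segmentation in the next step can separate large image blocks from small character-sized blocks.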
In order to identify and obtain an image region and a text region corresponding to a target image more quickly and accurately, embodiments of the present application provide a possible implementation manner, where the multimodal information extraction method according to the embodiments of the present application, when identifying and obtaining the image region and the text region corresponding to the target image according to the target image, may:
and identifying and obtaining corresponding image regions and text regions in the target image according to the area of each connected domain in the target image and a preset connected domain segmentation threshold.
Specifically, a connected domain in the target image whose area is greater than or equal to the preset connected-domain segmentation threshold is determined to belong to a corresponding image region, while a connected domain whose area is smaller than the threshold is determined to be a corresponding initial text region; an initial text region may be a single character in the target image of the object to be extracted, and the text regions are then obtained from the initial text regions.
In the above process, using the area of each connected domain in the target image of the object to be extracted together with a preset connected-domain segmentation threshold, the method can quickly and accurately identify the corresponding image regions and text regions in the target image, making multi-modal information extraction more convenient.
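A minimal sketch of the area-based split: label the connected domains of a binary grid with a breadth-first search, then compare each domain's pixel area against the segmentation threshold. The function names and the threshold value are illustrative assumptions; the patent does not prescribe a particular labelling algorithm.

```python
from collections import deque

def connected_components(grid):
    """4-connected components of a binary grid (list of lists of 0/1).
    Returns a list of components, each a list of (row, col) pixels."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] and not seen[r][c]:
                q, comp = deque([(r, c)]), []
                seen[r][c] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def classify_by_area(comps, seg_threshold):
    """Area >= threshold -> image region; smaller -> initial text region."""
    image_regions = [c for c in comps if len(c) >= seg_threshold]
    text_regions = [c for c in comps if len(c) < seg_threshold]
    return image_regions, text_regions
```

A large picture block yields one big component, while each character yields a small one, which is exactly the distinction the preset segmentation threshold draws.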
In order to better specify an image region and eliminate interference of redundant image information that may exist in a target image, an embodiment of the present application provides a possible implementation manner, referring to fig. 3, fig. 3 is a schematic flowchart of step S130 provided in the embodiment of the present application, a multi-modal information extraction method in the embodiment of the present application, in step S130, extracts corresponding multi-modal information according to the image region and a text region, where the multi-modal information includes the target image and a target text, and may include the following steps:
step S131, searching, merging and filtering the image areas to obtain a target image area;
in step S132, corresponding multi-modal information is extracted and obtained according to the target image region and the text region.
Specifically, because the edges of images may be irregular, the outer contour of a connected domain may also be irregular. When searching and merging the image regions, the minimum circumscribed upright rectangle of the connected domain corresponding to each image may be taken as the image region; candidate regions that intersect or contain one another are then merged, and the minimum circumscribed upright rectangle of the union of such regions is uniformly taken as the merged candidate region.
Because the target image of the object to be extracted may contain redundant image information in areas such as headers, footers, and margins, image regions with a small area that lie close to the page edge can be filtered out, finally yielding the target image regions.
In the above process, searching, merging, and filtering the image regions better normalizes them and eliminates the interference of any redundant image information in the target image, so that the target image regions are obtained more reliably and the quality of the extracted multi-modal information is further improved.
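The merge-and-filter step can be sketched with axis-aligned boxes given as `(x0, y0, x1, y1)` tuples: boxes that intersect or contain one another are repeatedly replaced by their bounding union, and small boxes hugging the page edge are then discarded. The names, box convention, and filter thresholds are assumptions made for illustration.

```python
def merge_boxes(boxes):
    """Repeatedly merge axis-aligned boxes (x0, y0, x1, y1) that
    intersect or contain one another into their bounding union."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        out = []
        while boxes:
            a = boxes.pop()
            i = 0
            while i < len(boxes):
                b = boxes[i]
                # overlap or containment test for axis-aligned boxes
                if a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]:
                    a = (min(a[0], b[0]), min(a[1], b[1]),
                         max(a[2], b[2]), max(a[3], b[3]))
                    boxes.pop(i)
                    merged = True
                else:
                    i += 1
            out.append(a)
        boxes = out
    return boxes

def filter_margins(boxes, page_w, page_h, min_area, margin):
    """Drop small boxes that hug the page edge (header/footer noise)."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        area = (x1 - x0) * (y1 - y0)
        near_edge = (x0 < margin or y0 < margin or
                     x1 > page_w - margin or y1 > page_h - margin)
        if not (area < min_area and near_edge):
            kept.append((x0, y0, x1, y1))
    return kept
```

Taking the union's bounding rectangle after each merge mirrors the patent's use of the minimum circumscribed rectangle of the region union as the merged candidate region.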
Alternatively, when the corresponding multi-modal information is extracted according to the target image region and the text region, the method may include:
identifying and obtaining a title text area and a description text area according to the target image area and the text area;
extracting corresponding multi-modal information according to the target image area, the title text area and the description text area;
the target text includes a target title text and a target description text.
Specifically, the title text corresponding to the title text region may be a title of the image.
In the above process, identifying the title text region and the description text region from the target image region and the text region divides the text regions more accurately, making the extracted multi-modal information more accurate; since the target text does not need to be divided manually, the manual workload and labor cost of multi-modal information extraction are further reduced.
Alternatively, when the title text region and the description text region are identified according to the target image region and the text region, the following steps may be performed:
acquiring position information of each target image area;
performing text search according to the position information of each target image area by using a preset search distance and a preset search area to obtain a suspected title text area;
and identifying and obtaining a title text area and a description text area according to the area, the length and the width of the suspected title text area.
Specifically, the position information of each target image area may be the coordinates of the upper left corner and the lower right corner of the minimum circumscribed rectangle of each target image area.
The preset search areas may be to the left of, to the right of, and below the minimum circumscribed upright rectangle of the target image region, and the preset search distance is the distance searched within each preset search area; the search distances of different preset search areas may be the same or different.
When identifying the title text region and the description text region from the area, length, and width of a suspected title text region, a suspected title text region whose area is smaller than a preset area threshold and whose aspect ratio satisfies a preset threshold is determined to be a title text region; otherwise, it is determined to be a description text region.
In the above process, the method performs a text search from the acquired position information of each target image region, using a preset search distance and preset search areas, to obtain suspected title text regions, and then identifies the title text region and the description text region from the area, length, and width of each suspected title text region, so that both can be identified quickly and accurately.
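Under the assumptions that the search looks left, right, and below each image box for the nearest text box within the preset distance, and that a title is a small, elongated box, the title search and classification can be sketched as follows (all names and thresholds are hypothetical, not taken from the patent):

```python
def find_title_region(image_box, text_boxes, search_dist):
    """Look left, right, and below an image box for the nearest text
    box within search_dist; that box is the suspected title region."""
    ix0, iy0, ix1, iy1 = image_box
    best, best_d = None, None
    for tx0, ty0, tx1, ty1 in text_boxes:
        candidates = []
        if tx1 <= ix0:            # text entirely to the left
            candidates.append(ix0 - tx1)
        if tx0 >= ix1:            # text entirely to the right
            candidates.append(tx0 - ix1)
        if ty0 >= iy1:            # text below the image
            candidates.append(ty0 - iy1)
        for d in candidates:
            if d <= search_dist and (best_d is None or d < best_d):
                best, best_d = (tx0, ty0, tx1, ty1), d
    return best

def is_title(box, max_area, min_ratio):
    """Small area and elongated aspect ratio -> title; else description."""
    w, h = box[2] - box[0], box[3] - box[1]
    return w * h < max_area and w / max(h, 1) >= min_ratio
```

A figure caption is typically a single short line close to the image, which is what the combined distance, area, and aspect-ratio tests capture.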
Optionally, when the corresponding multi-modal information is extracted according to the target image region, the title text region, and the description text region, the method may include:
associating the corresponding target image area, the title text area and the description text area;
extracting corresponding multi-modal information according to the associated target image region, title text region and description text region;
and forming corresponding image-text information pairs by the corresponding target image, the target title text and the target description text.
Specifically, when extracting the corresponding multi-modal information from the associated target image region, title text region, and description text region, the description text regions associated with the same target image may first be spliced together, after which the corresponding multi-modal information is extracted.
In the above process, the method associates the corresponding target image region, title text region, and description text region, extracts the corresponding multi-modal information from the associated regions, and forms a corresponding image-text information pair from the corresponding target image, target title text, and target description text. The extracted multi-modal information is therefore more accurate, and since the target image, target title text, and target description text need not be associated manually, the manual workload and labor cost of multi-modal information extraction are further reduced.
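Once each target image has been associated with its title region and its description regions, forming the image-text information pairs reduces to splicing the description fragments and packaging the triple. A minimal sketch under assumed names and data shapes:

```python
def build_pairs(associations):
    """associations: list of (image_id, title_text, description_fragments),
    where description_fragments is the in-order list of text pieces
    associated with that image. Fragments are spliced before the
    (image, title, description) triple is emitted as one pair."""
    pairs = []
    for image_id, title, fragments in associations:
        description = " ".join(fragments)
        pairs.append({"image": image_id, "title": title, "description": description})
    return pairs
```

Each resulting dictionary is one image-text information pair of the kind the extracted multi-modal information comprises.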
Example two
To perform the method of the above embodiment and achieve the corresponding functions and technical effects, a multi-modal information extraction apparatus is provided below.
Referring to fig. 4, fig. 4 is a block diagram illustrating a structure of a multi-modal information extraction apparatus according to an embodiment of the present application.
The multi-modal information extraction device of the embodiment of the application comprises:
an obtaining module 210, configured to obtain a target image of an object to be extracted, where the target image includes image content and text content;
the identification module 220 is configured to identify and obtain a corresponding image region and a corresponding text region in the target image according to the target image;
the extracting module 230 is configured to extract and obtain corresponding multi-modal information according to the image region and the text region, where the multi-modal information includes a target image and a target text.
The multi-modal information extraction apparatus of the embodiment of the present application identifies the corresponding image region and text region in the acquired target image of the object to be extracted, and automatically extracts the corresponding multi-modal information, which includes the target image and the target text, from those regions. Because the multi-modal information comes from the object to be extracted, the manual workload and labor cost of multi-modal information extraction are greatly reduced, and the apparatus is also well suited to extracting multi-modal information from printed image-text data.
As an optional implementation manner, the obtaining module 210 may specifically be configured to:
acquiring an initial image of an object to be extracted;
and preprocessing the initial image to obtain a target image of the object to be extracted.
As an alternative implementation, the identifying module 220 may be specifically configured to:
and identifying and obtaining corresponding image regions and text regions in the target image according to the area of each connected domain in the target image and a preset connected domain segmentation threshold.
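One way to read the connected-domain step is: label the connected components of a binarized page, then compare each component's area against a preset segmentation threshold to decide whether it is an image region or a text region. The sketch below uses a stdlib-only BFS labeling; the 4-connectivity choice and the threshold value are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch: connected-component labeling plus area-threshold classification.
from collections import deque

def label_components(grid):
    """Return the connected components of a binary grid, each as a list of
    (row, col) pixels, using 4-connectivity BFS."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def classify_by_area(components, threshold):
    """Components at or above the threshold area count as image regions,
    the rest as text regions (threshold value is an assumption)."""
    images = [c for c in components if len(c) >= threshold]
    texts = [c for c in components if len(c) < threshold]
    return images, texts

grid = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
]
components = label_components(grid)
images, texts = classify_by_area(components, threshold=4)
```

In the toy grid, the 9-pixel blob classifies as an image region and the isolated pixel as a text region; a production system would compute areas on the full-resolution page instead.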
As an optional implementation manner, the extraction module 230 may specifically be configured to:
searching, merging and filtering the image areas to obtain a target image area;
and extracting corresponding multi-modal information according to the target image area and the text area.
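The "searching, merging and filtering" step can be sketched as: repeatedly merge candidate boxes that overlap, then filter out boxes below a minimum area so that only plausible target image areas remain. The `(x1, y1, x2, y2)` box format and the `min_area` value are illustrative assumptions.

```python
# Hedged sketch of merging overlapping image-area candidates and filtering
# out fragments that are too small to be a target image area.
def overlaps(a, b):
    """True when axis-aligned boxes a and b intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_and_filter(boxes, min_area):
    boxes = list(boxes)
    changed = True
    while changed:                 # repeat until no pair of boxes overlaps
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    boxes[i] = merge(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return [b for b in boxes
            if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]

result = merge_and_filter(
    [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 101, 101)], min_area=50)
```

The two overlapping boxes merge into one 15x15 target image area, while the 1x1 fragment is filtered away.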
Alternatively, when the extraction module 230 extracts corresponding multi-modal information according to the target image region and the text region, it may:
identifying and obtaining a title text area and a description text area according to the target image area and the text area;
extracting corresponding multi-modal information according to the target image area, the title text area and the description text area;
the target text includes a target title text and a target description text.
Optionally, when the extraction module 230 identifies the title text region and the description text region according to the target image region and the text region, it may:
acquiring position information of each target image area;
performing text search according to the position information of each target image area by using a preset search distance and a preset search area to obtain a suspected title text area;
and identifying and obtaining a title text area and a description text area according to the area, the length and the width of the suspected title text area.
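One plausible reading of this search: for each target image area, take the nearest text box whose vertical gap from the image is within the preset search distance and whose horizontal span overlaps the search area below the image; that hit is the suspected title text area. A candidate is then confirmed as a title only if its area, length and width look caption-like (short and wide). All thresholds below are assumptions for the sketch, not values from the patent.

```python
# Hedged sketch: position-based title search below an image area, followed by
# a shape check on the suspected title text area.
def find_suspected_title(image_box, text_boxes, search_distance):
    """Nearest text box below image_box within search_distance whose
    horizontal span overlaps the image (the preset search area)."""
    x1, y1, x2, y2 = image_box
    best, best_gap = None, None
    for tb in text_boxes:
        gap = tb[1] - y2                          # vertical gap below image
        h_overlap = min(x2, tb[2]) > max(x1, tb[0])
        if 0 <= gap <= search_distance and h_overlap:
            if best is None or gap < best_gap:
                best, best_gap = tb, gap
    return best

def looks_like_title(box, max_height=30, min_aspect=3.0):
    """Accept short, wide boxes as titles; thresholds are assumptions."""
    w, h = box[2] - box[0], box[3] - box[1]
    return h <= max_height and w / h >= min_aspect

title = find_suspected_title(
    (0, 0, 100, 80),
    [(10, 90, 90, 110), (0, 300, 100, 320)],
    search_distance=20)
```

The distant text box at y=300 is rejected by the search distance; the nearby one qualifies as the suspected title and passes the shape check, while remaining text below it would be treated as description.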
Optionally, when the extracting module 230 extracts corresponding multi-modal information according to the target image region, the title text region, and the description text region, it may:
associating the corresponding target image area, the title text area and the description text area;
extracting corresponding multi-modal information according to the associated target image region, title text region and description text region;
and forming corresponding image-text information pairs by the corresponding target image, the target title text and the target description text.
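The final association step pairs each target image with its title and description to form an image-text information pair. A minimal sketch, where the dictionary layout and all sample values are illustrative choices, not prescribed by the patent:

```python
# Hedged sketch: forming image-text information pairs from associated regions.
def build_pairs(associations):
    """associations: iterable of (image, title_text, description_text)
    triples, already associated by position as described above."""
    return [
        {"image": img, "title": title, "description": desc}
        for img, title, desc in associations
    ]

pairs = build_pairs([("img_01.png", "Figure 1", "An example caption.")])
```

Emitting one record per triple keeps the extracted multi-modal information self-describing for downstream consumers.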
The multi-modal information extraction device can implement the multi-modal information extraction method of the first embodiment. The alternatives in the first embodiment are also applicable to the present embodiment, and are not described in detail here.
The rest of the embodiments of the present application may refer to the contents of the first embodiment, and in this embodiment, details are not repeated.
Example three
An embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to make the electronic device execute the above-mentioned multi-modal information extraction method.
Alternatively, the electronic device may be a server.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the above-mentioned multi-modal information extraction method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Claims (7)
1. A method for multimodal information extraction, comprising:
acquiring a target image of an object to be extracted, wherein the target image comprises image content and text content;
identifying and obtaining a corresponding image area and a corresponding text area in the target image according to the target image;
extracting corresponding multi-modal information according to the image area and the text area, wherein the multi-modal information comprises a target image and a target text;
the extracting and obtaining corresponding multi-modal information according to the image region and the text region comprises:
searching, merging and filtering the image areas to obtain a target image area;
extracting corresponding multi-modal information according to the target image area and the text area;
the extracting and obtaining corresponding multi-modal information according to the target image region and the text region comprises:
identifying and obtaining a title text area and a description text area according to the target image area and the text area;
extracting corresponding multi-modal information according to the target image area, the title text area and the description text area;
the target text comprises a target title text and a target description text;
the identifying and obtaining a title text region and a description text region according to the target image region and the text region comprises:
acquiring position information of each target image area;
performing text search according to the position information of each target image area by using a preset search distance and a preset search area to obtain a suspected title text area;
and identifying and obtaining a title text area and a description text area according to the area, the length and the width of the suspected title text area.
2. The multimodal information extraction method according to claim 1, wherein the acquiring a target image of an object to be extracted includes:
acquiring an initial image of an object to be extracted;
and preprocessing the initial image to obtain a target image of the object to be extracted.
3. The multi-modality information extraction method according to claim 1, wherein the identifying and obtaining of the corresponding image region and text region in the target image according to the target image comprises:
and identifying and obtaining corresponding image regions and text regions in the target image according to the area of each connected domain in the target image and a preset connected domain segmentation threshold.
4. The method according to claim 1, wherein the extracting corresponding multimodal information from the target image region, the title text region and the description text region comprises:
associating the corresponding target image region, the title text region and the description text region;
extracting corresponding multi-modal information according to the associated target image region, title text region and description text region;
and forming corresponding image-text information pairs by the corresponding target image, the target title text and the target description text.
5. A multimodal information extraction apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target image of an object to be extracted, and the target image comprises image content and text content;
the identification module is used for identifying and obtaining the corresponding image area and the text area in the target image according to the target image;
the extraction module is used for extracting corresponding multi-modal information according to the image area and the text area, wherein the multi-modal information comprises a target image and a target text;
the extraction module is specifically used for searching, merging and filtering the image areas to obtain target image areas; extracting corresponding multi-modal information according to the target image area and the text area;
when the extraction module extracts corresponding multi-modal information according to the target image region and the text region, a title text region and a description text region are identified and obtained according to the target image region and the text region;
extracting corresponding multi-modal information according to the target image area, the title text area and the description text area;
the target text comprises a target title text and a target description text;
the extraction module acquires the position information of each target image area when a title text area and a description text area are identified and obtained according to the target image area and the text area;
performing text search according to the position information of each target image area by using a preset search distance and a preset search area to obtain a suspected title text area;
and identifying and obtaining a title text area and a description text area according to the area, the length and the width of the suspected title text area.
6. An electronic device comprising a memory for storing a computer program and a processor that executes the computer program to cause the electronic device to perform the multimodal information extraction method as claimed in any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the multimodal information extraction method as claimed in any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110093438.9A CN112418199B (en) | 2021-01-25 | 2021-01-25 | Multi-modal information extraction method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110093438.9A CN112418199B (en) | 2021-01-25 | 2021-01-25 | Multi-modal information extraction method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112418199A CN112418199A (en) | 2021-02-26 |
CN112418199B true CN112418199B (en) | 2022-03-01 |
Family
ID=74783203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110093438.9A Active CN112418199B (en) | 2021-01-25 | 2021-01-25 | Multi-modal information extraction method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418199B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779934B (en) * | 2021-08-13 | 2024-04-26 | 远光软件股份有限公司 | Multi-mode information extraction method, device, equipment and computer readable storage medium |
CN115797943B (en) * | 2023-02-08 | 2023-05-05 | 广州数说故事信息科技有限公司 | Video text content extraction method, system and storage medium based on multiple modes |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112101386A (en) * | 2020-09-25 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Text detection method and device, computer equipment and storage medium |
CN112184738A (en) * | 2020-10-30 | 2021-01-05 | 北京有竹居网络技术有限公司 | Image segmentation method, device, equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10902588B2 (en) * | 2018-08-13 | 2021-01-26 | International Business Machines Corporation | Anatomical segmentation identifying modes and viewpoints with deep learning across modalities |
CN109272440B (en) * | 2018-08-14 | 2023-11-03 | 阿基米德(上海)传媒有限公司 | Thumbnail generation method and system combining text and image content |
2021
- 2021-01-25 CN CN202110093438.9A patent/CN112418199B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112101386A (en) * | 2020-09-25 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Text detection method and device, computer equipment and storage medium |
CN112184738A (en) * | 2020-10-30 | 2021-01-05 | 北京有竹居网络技术有限公司 | Image segmentation method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112418199A (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020200251B2 (en) | Label and field identification without optical character recognition (OCR) | |
US8965127B2 (en) | Method for segmenting text words in document images | |
US8467614B2 (en) | Method for processing optical character recognition (OCR) data, wherein the output comprises visually impaired character images | |
US10643094B2 (en) | Method for line and word segmentation for handwritten text images | |
Fabrizio et al. | Text detection in street level images | |
CA2656425A1 (en) | Recognizing text in images | |
CN110110325B (en) | Repeated case searching method and device and computer readable storage medium | |
CN112418199B (en) | Multi-modal information extraction method and device, electronic equipment and storage medium | |
CN108154132A (en) | Method, system and equipment for extracting characters of identity card and storage medium | |
CN111753120A (en) | Method and device for searching questions, electronic equipment and storage medium | |
CN106778777B (en) | Vehicle matching method and system | |
CN115828874A (en) | Industry table digital processing method based on image recognition technology | |
CN114359533B (en) | Page number identification method based on page text and computer equipment | |
Hidayatullah et al. | License plate detection and recognition for Indonesian cars | |
Chiu et al. | Picture detection in document page images | |
CN109101973B (en) | Character recognition method, electronic device and storage medium | |
Kamola et al. | Image-based logical document structure recognition | |
CN110826488A (en) | Image identification method and device for electronic document and storage equipment | |
CN110728240A (en) | Method and device for automatically identifying title of electronic file | |
Thilagavathy et al. | Fuzzy based edge enhanced text detection algorithm using MSER | |
Radzid et al. | Text line segmentation for mushaf Al-Quran using hybrid projection based neighbouring properties | |
CN106548162B (en) | A method of automatically extracting band name human face data from news pages | |
Huang et al. | Chinese historic image threshold using adaptive K-means cluster and Bradley’s | |
Cao et al. | Character segmentation and restoration of Qin-Han bamboo slips using local auto-focus thresholding method | |
CN113077410A (en) | Image detection method, device and method, chip and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||