CN113343982B - Entity relation extraction method, device and equipment for multi-modal feature fusion - Google Patents

Entity relation extraction method, device and equipment for multi-modal feature fusion

Info

Publication number
CN113343982B
CN113343982B (application CN202110666465.0A)
Authority
CN
China
Prior art keywords
region
features
image
feature
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110666465.0A
Other languages
Chinese (zh)
Other versions
CN113343982A
Inventor
李煜林
庾悦晨
钦夏孟
章成全
姚锟
韩钧宇
刘经拓
丁二锐
吴甜
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110666465.0A
Publication of CN113343982A
Application granted
Publication of CN113343982B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

According to embodiments of the present disclosure, a method, apparatus, device, medium, and program product for entity relationship extraction with multimodal feature fusion are provided. The present disclosure relates to the technical field of artificial intelligence, in particular to the fields of computer vision and deep learning, and can be applied to smart city and smart finance scenarios. The scheme is as follows: determining, for each of a plurality of regions in an image comprising characters, a visual feature of the region and a plurality of character text features of the region, each character text feature corresponding to one character in the region; determining, for each region, a region visual semantic feature of the region based on the visual feature of the region and the plurality of character text features; determining relationship information of the plurality of regions based on the region visual semantic features, the relationship information indicating at least the degree of association between any two regions of the plurality of regions; associating regions among the plurality of regions based on the relationship information; and extracting entity relationships for the acquired entities. In this way, the accuracy of text recognition can be improved.

Description

Entity relation extraction method, device and equipment for multi-modal feature fusion
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the field of computer vision and deep learning technology, applicable to smart cities and smart financial scenarios, and more particularly to a method, apparatus, device, computer-readable storage medium and computer program product for entity relationship extraction for multimodal feature fusion.
Background
With the development of information technology, neural networks are widely used for various machine learning tasks such as computer vision, speech recognition, and information retrieval. Document information extraction aims to automatically extract specific information of interest to a user, including information entities and their relationships, from documents (e.g., requests, notifications, reports, meeting minutes, contracts, bids, inspection reports, and maintenance worksheets). Processing images of documents using neural networks to extract the information they contain is considered an effective approach. However, the accuracy of text recognition has yet to be improved.
Disclosure of Invention
According to example embodiments of the present disclosure, a method, apparatus, device, computer-readable storage medium, and computer program product are provided for multi-modal feature fused entity relationship extraction.
In a first aspect of the present disclosure, a method for entity relationship extraction processing for multimodal feature fusion is provided. The method comprises the following steps: determining, for each of a plurality of regions in an image comprising characters, a visual feature of the region and a plurality of character text features of the region, the character text features corresponding to one character in the region; determining, for each region, a region visual semantic feature of the region based on the visual features of the region and the plurality of character text features; based on the visual semantic features of the regions, determining relationship information of a plurality of regions, wherein the relationship information at least indicates the association degree between any two regions in the plurality of regions; associating areas among the plurality of areas based on the relationship information; and extracting entity relationships for the acquired entities.
In a second aspect of the present disclosure, an entity relationship extraction processing apparatus for multimodal feature fusion is provided. The device comprises: a first feature determination module configured to determine, for each of a plurality of regions in an image including characters, a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to one character in the region; a second feature determination module configured to determine, for each region, a region visual semantic feature of the region based on the visual feature of the region and the plurality of character text features; a relationship information determining module configured to determine relationship information of a plurality of regions based on the region visual semantic features, the relationship information indicating at least a degree of association between any two regions of the plurality of regions; a first region association module configured to associate regions of the plurality of regions based on the relationship information; and a first extraction module configured to extract an entity relationship for the acquired entity.
In a third aspect of the present disclosure, an electronic device is provided that includes one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions for implementing the method of the first aspect of the present disclosure by a processor.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are included to provide a better understanding of the present disclosure, and are not to be construed as limiting the disclosure, wherein:
FIG. 1 illustrates a schematic diagram of an example of a system 100 of entity relationship extraction in which multimodal feature fusion can be implemented in some embodiments of the present disclosure;
FIG. 2 illustrates an exemplary image 200 including characters in an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a process 300 for entity relationship extraction for multimodal feature fusion in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates a flowchart of a process 400 for determining regional visual semantic features according to some embodiments of the present disclosure;
FIG. 5 shows a schematic block diagram of an entity relationship extraction apparatus 500 for multimodal feature fusion in accordance with an embodiment of the disclosure; and
fig. 6 illustrates a block diagram of a device 600 capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
In embodiments of the present disclosure, the term "model" refers to an entity capable of processing an input and providing a corresponding output. Taking a neural network model as an example, it generally includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Models used in deep learning applications (also referred to as "deep learning models") typically include many hidden layers, thereby extending the depth of the network. The layers of a neural network model are connected in sequence such that the output of a previous layer is used as the input of a subsequent layer, where the input layer receives the input of the neural network model and the output of the output layer is the final output of the neural network model. Each layer of the neural network model includes one or more nodes (also referred to as processing nodes or neurons), each of which processes input from the previous layer. The terms "neural network," "model," "network," and "neural network model" are used interchangeably herein.
As mentioned above, there is a need to improve the accuracy of text recognition. Conventional schemes generally fall into three categories: (1) Manual entry. This approach is not suited to intelligent office systems: it cannot be automated and labor costs are high. (2) Region search by locating text entities. This method is limited to documents with a fixed layout and therefore has a limited range of application. (3) Relation extraction based on named entities, in which relationships are judged in plain text from the semantic features of the context. Because entity extraction on plain text ignores the visual layout of the document content, semantic confusion easily arises. Therefore, the conventional schemes have low accuracy in recognizing characters in an image.
Example embodiments of the present disclosure propose a scheme for entity relationship extraction with multimodal feature fusion. In this scheme, an image to be processed, which includes characters to be recognized, is first acquired. The image may be divided into a plurality of regions according to the rows or columns in which the characters are located, and for each region, the text features of the characters in that region and the visual features of that region (image appearance features, position features, etc.) may be determined. Then, according to the determined visual features of the region and the text features of the characters in the region, a feature fusion operation, for example, is performed on the visual features of the region to determine the region visual semantic features of the region. Next, according to the region visual semantic features of each region, the degree of association between every two regions of the plurality of regions is determined; the higher the degree of association, the greater the possibility that a relationship exists between the characters in the two regions. The regions are then associated with each other according to the determined degrees of association. Finally, according to the entity to be determined, extraction is performed in the associated regions based on the entity name and entity value of the entity. According to embodiments of the present disclosure, the relationships between different regions can be determined accurately by comprehensively considering the position, visual and text features of the characters and regions in an image. Entities in the associated regions can thus be accurately related, improving the accuracy of text recognition.
Fig. 1 illustrates a schematic diagram of an example of a system 100 of entity relationship extraction in which multi-modal feature fusion can be implemented in some embodiments of the present disclosure. As shown in fig. 1, system 100 includes a computing device 110. Computing device 110 may be any device having computing capabilities, such as a personal computer, tablet computer, wearable device, cloud server, mainframe, distributed computing system, and the like.
Computing device 110 obtains input 120. For example, the input 120 may be an image, video, audio, text, and/or multimedia file, among others. The computing device 110 may apply the input 120 to the network model 130 to generate a processing result 140 corresponding to the input 120 using the network model 130. In some embodiments, the network model 130 may be, but is not limited to, an OCR recognition model, an image classification model, a semantic segmentation model, an object detection model, or another neural network model associated with image processing. The network model 130 may be implemented using any suitable network structure including, but not limited to, a Support Vector Machine (SVM) model, a Bayesian model, a random forest model, and various deep learning/neural network models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), deep reinforcement learning networks (DQNs), and the like. The scope of the present disclosure is not limited in this respect.
The system 100 may further comprise a training data acquisition means, a model training means and a model application means (not shown). In some embodiments, the plurality of means described above may be implemented in different physical computing devices, respectively. Alternatively, at least some of the plurality of means described above may be implemented in the same computing device. For example, the training data acquisition means and the model training means may be implemented in the same computing device, while the model application means may be implemented in another computing device.
The input 120 may be input data to be processed (e.g., image data), the network model 130 is an image processing model (e.g., a trained image classification model), and the processing result 140 may be a prediction result corresponding to the input 120 (e.g., a classification result, a semantic segmentation result, or a target recognition result of an image).
In some embodiments, the processing result 140 may be the characters corresponding to a plurality of entities to be determined in the text, for example, the entity "name" corresponds to "Zhang San", the entity "date" corresponds to "2021-01", the entity "amount" corresponds to "200", and so on. In some embodiments, the processing result 140 may also be the degrees of association of multiple regions in the image. Alternatively, in some embodiments, the processing result 140 may also be a classification result for each character in the image to be processed. Methods according to embodiments of the present disclosure may be applied as needed to obtain different processing results 140, and the present disclosure is not limited herein.
In some embodiments, to reduce the computational load on the model, the computing device 110 may further process the input 120 (e.g., an image). For example, computing device 110 may resize and normalize the image described above to form a preprocessed image. In some embodiments, for an input 120 in the form of an image, operations such as cropping, rotation, and flipping may also be applied to the image.
It should be understood that the system 100 illustrated in fig. 1 is merely one example in which embodiments of the present disclosure may be implemented and is not intended to limit the scope of the present disclosure. Embodiments of the present disclosure are equally applicable to other systems or architectures.
Fig. 2 illustrates an exemplary image 200 including characters according to an embodiment of the present disclosure.
In order to clearly illustrate the embodiments hereinafter, before describing the embodiments of the present disclosure, an image 200 including characters is first described with reference to fig. 2.
As shown in FIG. 2, image 200 includes a plurality of regions 210-270 (indicated by dashed rectangular boxes), each of which may include a plurality of characters, for example, region 210 may include a plurality of characters 211-217. The region may refer herein to the area occupied by a line of characters or a line of text in the image 200, or the area occupied by a column of characters or a column of text in the image 200. The region may be any shape and the disclosure is not limited herein. The characters may be text in various languages. Hereinafter, description will be made with reference to fig. 2 as an example image.
The detailed multimodal feature fusion entity relationship extraction process is further described below in conjunction with fig. 2-4.
Fig. 3 illustrates a flow chart of a process 300 for entity relationship extraction for multimodal feature fusion in accordance with an embodiment of the disclosure.
Process 300 may be implemented by computing device 110 in fig. 1. For ease of description, process 300 will be described with reference to FIG. 1.
At step 310 of fig. 3, computing device 110 determines, for each of a plurality of regions in image 200 that includes characters, a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to one of the characters in the region. For example, the computing device 110 determines, for each of the plurality of regions 210-270 in the image 200, visual features of the regions and character text features of the characters 211-217, 221, 223, 231, 233, 241, 243, …, 271, 273.
The visual feature of a region may represent the image appearance feature of the region in the image and its position feature. The computing device 110 may determine the image appearance feature of the region by a suitable algorithm or model, for example from a feature map obtained by processing the image 200 through a convolutional layer. The computing device 110 may determine the position feature of the region by determining the location of the region in the image 200 through a suitable algorithm or model, and may add the position feature and the image appearance feature to determine the visual feature. As for the text features of the characters in the region, computing device 110 may utilize optical character recognition techniques to determine the character text features of the characters.
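A minimal sketch of this per-region feature gathering is given below. It assumes the region boxes and OCR results are already available; get_appearance, encode_position and encode_char are illustrative stand-ins (passed as parameters) for the appearance-feature, position-feature and character-feature models described later in this description, not components defined by the patent.

```python
def region_inputs(image, regions, ocr_chars, get_appearance, encode_position, encode_char):
    """Gather, per region, a visual feature (appearance + position) and per-character text features."""
    features = []
    for box, chars in zip(regions, ocr_chars):
        f = get_appearance(image, box)         # image appearance feature of the region
        s = encode_position(box)               # position feature of the region
        t = [encode_char(c) for c in chars]    # one text feature per character in the region
        features.append({"visual": f + s,      # visual feature = appearance + position
                         "chars": t})
    return features
```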
At step 320 of fig. 3, computing device 110 determines, for each region, a region visual semantic feature of the region based on the visual features of the region and the plurality of character text features. For example, after determining the visual features of the regions and the character text features of the characters, computing device 110 may further process the features to determine the region visual semantic features of the regions for subsequent region association.
In particular, computing device 110 may fuse the visual features of the region with the plurality of character text features and then perform feature enhancement on the fused features to determine the region visual semantic features of the region. The region visual semantic features can accurately represent not only the text features of the characters included in the region, but also the visual, spatial and position features of the region in the image.
In step 330 of fig. 3, the computing device 110 determines relationship information for the plurality of regions based on the region visual semantic features, the relationship information indicating at least a degree of association between any two regions of the plurality of regions. After accurately representing the region visual semantic features of the regions, the computing device 110 may set a matrix of learnable parameters P and then determine the relationship information between the regions according to the following equation (1):
A = σ(M P M^t)    Formula (1)

where M is the matrix of region visual semantic features, M^t represents the transpose of M, and the dimensions and parameters of the learnable parameter matrix P can be set according to M. The relationship information may be as shown in Table 1 below:
TABLE 1

| Degree of association | Region 210 | Region 220 | Region 230 | Region 240 | Region 250 | Region 260 | Region 270 |
| Region 210 | -    | 0    | 0    | 0    | 0    | 0    | 0    |
| Region 220 | 0    | -    | 1    | 0.1  | 0.15 | 0.13 | 0.24 |
| Region 230 | 0    | 1    | -    | 0.2  | 0.2  | 0.3  | 0.3  |
| Region 240 | 0    | 0.1  | 0.2  | -    | 1    | 0.14 | 0.15 |
| Region 250 | 0    | 0.15 | 0.2  | 1    | -    | 0    | 0    |
| Region 260 | 0    | 0.13 | 0.3  | 0.14 | 0    | -    | 1    |
| Region 270 | 0    | 0.24 | 0.3  | 0.15 | 0    | 1    | -    |
Here each number indicates a degree of association: a higher number represents a higher degree of association. The numerals are merely exemplary and are not intended to limit the scope of the present disclosure.
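A minimal sketch of formula (1) in PyTorch follows, assuming M stacks the n region visual semantic features row-wise and that the feature dimension is 768 (the 768 value and random initialization are illustrative assumptions, not specified by the patent at this point).

```python
import torch

n, d = 7, 768                              # 7 regions as in image 200; assumed feature dimension
M = torch.randn(n, d)                      # region visual semantic features, one row per region
P = torch.nn.Parameter(torch.randn(d, d))  # learnable parameter matrix P

A = torch.sigmoid(M @ P @ M.t())           # (n, n) pairwise association degrees, as in Table 1
```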
In step 340 of fig. 3, computing device 110 associates regions of the plurality of regions based on the relationship information. For example, after the degree of association between the plurality of areas is determined, characters in the areas may be recognized and extracted according to information to be determined.
In some embodiments, computing device 110 may determine the degree of association between a first region of the plurality of regions and each region of the plurality of regions other than the first region, and associate the target region having the highest degree of association with the first region. For example, Table 1 above records a corresponding degree of association for every pair of regions; from it, computing device 110 may determine that the degrees of association between first region 220 and regions 210 and 230-270 other than the first region are 0, 1, 0.1, 0.15, 0.13 and 0.24, respectively. Region 230, with the highest degree of association of 1, may then be determined as the target region.
Alternatively, in some embodiments, if the computing device 110 determines that the first region has the same highest degree of association with two regions other than the first region, both regions may be treated as target regions, from which the entity name and entity value to be extracted are then determined. Multiple regions may thus be associated, and the disclosure is not limited in this regard.
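A minimal sketch of this argmax-style association is shown below. It assumes the association matrix A from formula (1) and ignores the diagonal (self-association); picking a single best target per region is the simple case described above, not a complete treatment of ties.

```python
import torch

def associate(A: torch.Tensor):
    """A: (n, n) association matrix; returns, for each region, the index of its target region."""
    scores = A.clone()
    scores.fill_diagonal_(float("-inf"))   # a region is not associated with itself
    best = scores.argmax(dim=1)            # e.g. region 220 -> region 230 in Table 1
    return [int(j) for j in best]
```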
Because the determined region visual semantic features combine rich image, text and spatial features, the associations between regions can be determined accurately, which lays a foundation for entity extraction. After determining the associated regions, computing device 110 may extract the characters in the target region and the first region according to the desired information.
At step 350 of FIG. 3, computing device 110 extracts entity relationships for the acquired entities. After the association relationship between the areas is determined, content related to the entity may be determined from among the plurality of areas according to the entity to be determined.
In some embodiments, when performing entity extraction for images of known structure, computing device 110 may first obtain an entity to be determined, the entity having an associated entity name and entity value, e.g., the entity name "name" and the entity value "Zhang San". Computing device 110 may then determine a first region that includes the entity name, and extract a first character in the first region. For example, computing device 110 extracts the first character "name" in first region 220. Computing device 110 then determines the associated target region based on the determined first region. Having determined that the first region 220 is associated with the target region 230, as described above in step 340, computing device 110 may finally treat the characters "Zhang San" included in region 230 as the entity value.
In some embodiments, for the entity name "name," computing device 110 may determine a paraphrase or synonym of "name" and determine that a character with that meaning is included in first region 220. Computing device 110 may then extract the target characters "Zhang San" in the associated target region 230 determined by the steps described above; the extracted target characters "Zhang San" are the entity value of the entity "name".
Alternatively, in some embodiments, the computing device 110 obtains the entity name "address" of the entity to be determined. The computing device 110 does not find an area in the image 200 that includes the character "address" or that includes a character that has a similar meaning to the character "address". Computing device 110 may, for example, issue a prompt to the user via the user interface that the entity was not found. Or computing device 110 may mark the entity value of "address" as 0 in the returned entity extraction result.
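A minimal sketch of this lookup is given below. It assumes the per-region text and the region associations from step 340 are already available; the helper name and the synonym handling are illustrative assumptions, and returning None stands in for the "entity not found" behavior described above.

```python
def extract_entity(entity_name, region_texts, associations, synonyms=()):
    """region_texts: list of the characters in each region; associations: region index -> target index."""
    candidates = (entity_name, *synonyms)
    first = next((i for i, txt in enumerate(region_texts)
                  if any(c in txt for c in candidates)), None)
    if first is None:
        return None                      # entity name not found, e.g. "address" in image 200
    target = associations[first]         # most associated region, e.g. region 220 -> region 230
    return region_texts[target]          # its characters are taken as the entity value

# e.g. extract_entity("name", texts, assoc) could return "Zhang San"
```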
This recognition method is particularly advantageous for images with a known text structure, because extracting entities by determining the relationships between regions saves computational effort. In addition, because the region relationships are determined accurately, the accuracy of text recognition is improved.
According to the embodiment of the disclosure, the relation between the areas can be accurately determined by recombining and fusing the visual characteristics and the text characteristics of each area in the image, so that the accuracy of text recognition can be improved. Further, the entity content of the entity to be determined can be accurately extracted.
With continued reference to fig. 2, for step 310 "computing device 110 determines, for each of a plurality of regions in image 200 that includes characters, a visual feature of the region and a plurality of character text features of the region", this embodiment provides an alternative implementation, specifically implemented as follows:
Computing device 110 may first determine the image features of image 200. The visual features of each region are then determined based on the image features and the region position information in image 200 of each of the plurality of regions, and the plurality of character text features are determined based on the region position information and the characters included in the region. For example, computing device 110 may use a ResNet50 convolutional neural network (ResNet: Residual Network) to extract a feature map of image 200 and take the feature map as the image feature of image 200. Note that the above neural network is merely exemplary, and any suitable neural network model (e.g., ResNet34, ResNet101) may be applied to determine the image features of image 200.
Alternatively, computing device 110 may determine color features, texture features, shape features, spatial relationship features, and the like, of image 200 (and the characters included therein), respectively, using suitable algorithms. The determined features are then fused (e.g., concatenated and summed in matrix form) to determine the features of image 200.
After determining the image features of image 200, computing device 110 determines visual features of the respective regions from the image features. The visual features of a region may represent the apparent features of the image and its positional features of the region in the image.
In particular, computing device 110 may determine the region position information in image 200 for each of the plurality of regions in image 200, determine the region feature of the region according to the determined image feature and the region position information, and then combine the feature corresponding to the region position information with the region feature to determine the visual feature of the region.
For example, computing device 110 may first determine the location of each region in image 200 as the region position information. The computing device 110 may apply the EAST algorithm to predict the locations of the plurality of regions 210-270 in image 200 that include characters. For example, the output of image 200 after the EAST algorithm may be the plurality of dashed boxes (the plurality of regions) shown in fig. 2, each enclosing a plurality of characters. The computing device 110 may determine the region position information of each region in image 200 from the plurality of dashed boxes. In some embodiments, the region position information may be represented by the coordinates of four points of the region (the coordinates of the four vertices of the dashed rectangular box). Alternatively, in one embodiment, in the case where the sizes of the plurality of regions are the same, the region position information may be represented by the center point coordinates of the regions. The location of a region in the image may also be determined by any suitable model and algorithm. After determining the location of a region, computing device 110 may encode the location information into a vector (e.g., a 768-dimensional vector) as the region position feature (hereinafter denoted S).
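A minimal sketch of encoding the four vertex coordinates into a 768-dimensional position feature S follows; the single linear layer is an assumed, illustrative encoder, the patent only requires some encoding into a vector.

```python
import torch

pos_encoder = torch.nn.Linear(8, 768)    # 4 vertices x (x, y) = 8 values -> 768-dim vector

def region_position_feature(corners: torch.Tensor) -> torch.Tensor:
    """corners: (4, 2) tensor of the dashed box's vertex coordinates; returns the position feature S."""
    return pos_encoder(corners.flatten().float())
```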
In some embodiments, computing device 110 may determine the region features of a region based on the determined features of image 200 and the region position information described above. For example, computing device 110 may use ROI (region of interest) pooling (a pooling operation that determines the feature of a region of interest within the feature map of an image) to extract, from the image feature map of image 200, the image appearance feature at the location of the region as the region feature of that region (hereinafter denoted F).
Alternatively, the computing device 110 may segment the image 200 into a plurality of sub-images according to the above-determined position information, and then determine image features of the plurality of sub-images as region features of the respective regions using an appropriate model and algorithm. The method for determining the image features of the sub-images is described above (e.g., the method for determining the image features of the image 200 is described above), and will not be described in detail herein.
Additionally or alternatively, in case the region location information of the region is already clear (e.g. for an image of a file of a predetermined format), different regions in the image 200 may be identified separately from the predetermined location information to determine the region characteristics of the respective region.
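A minimal sketch of the ROI-pooling-based region feature extraction described above is given below, using torchvision's roi_align as one possible implementation; the output size, the projection layer and the 768 dimension are illustrative assumptions.

```python
import torch
import torchvision

def region_features(feature_map, boxes, image_width, out_dim=768):
    """feature_map: (1, C, H', W') from the backbone; boxes: (n, 4) float (x1, y1, x2, y2) in image coordinates."""
    scale = feature_map.shape[-1] / image_width               # map image coordinates to feature-map coordinates
    pooled = torchvision.ops.roi_align(feature_map, [boxes],
                                       output_size=(1, 1),
                                       spatial_scale=scale)   # (n, C, 1, 1)
    proj = torch.nn.Linear(feature_map.shape[1], out_dim)     # project appearance features to 768-dim
    return proj(pooled.flatten(1))                            # (n, 768) region features F
```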
After determining the region features and the location features of the corresponding regions in the image, computing device 110 may combine them into visual features of the regions, e.g., where F and S are feature vectors of the same dimension (e.g., vectors of 768 dimensions), computing device 110 may determine the visual features using equation (2) as follows:
Visual feature = F + S    Formula (2)
The above combination of features in the form of a vector sum is merely exemplary; other suitable combinations exist, and the disclosure is not limited thereto. It can be appreciated that the visual feature of a region blends the image appearance feature and the position feature of the region and is richer than the image feature alone, which lays a foundation for the subsequent character recognition task and makes the final processing result more accurate.
Next, computing device 110 may determine character text characteristics of the characters. For example, computing device 110 may determine each character within the dashed box of image 200 using optical character recognition technology (OCR) based on the location information described above.
In some embodiments, to handle different character-string lengths in the image, characters of different lengths may be converted to the same length. For example, computing device 110 may determine, from image 200, the region (e.g., region 210) whose character string is longest, and take that longest length, e.g., 4, as the fixed character length. For the characters in the other regions 220-270, strings shorter than 4 may be padded with a particular symbol, and the individual regions 210-270 are then recognized. Note that the longest character length of 4 is merely exemplary; other lengths (e.g., 5, 6, or the longest length the model can handle) are possible for images including different characters, and the disclosure is not limited thereto. In some embodiments, computing device 110 may utilize a variable-length character recognition model, such as a CRNN character recognition model, to directly recognize the characters in each region, and encode each character as a vector as its character text feature. For ease of representation, assume that n regions are located and that the i-th region includes k_i characters; the resulting sequence of character text features is:
T = (t_1, t_2, …, t_n) = (c_{1,1}, c_{1,2}, …, c_{1,k_1}, c_{2,1}, c_{2,2}, …, c_{2,k_2}, …, c_{n,1}, …, c_{n,k_n})
where T represents the character text features of all characters in the image, t_1 to t_n represent the character text features of the characters in each region, c_{i,j} represents the character text feature of a single character, i ∈ [1, n], j ∈ [1, k_i]. With the visual features of the regions already determined, the character text features within the regions further allow the corresponding regions to be represented more accurately, making character recognition and extraction in the regions more accurate.
Alternatively, to save computational cost, computing device 110 may directly determine the character text features of the characters through a suitable algorithm or model, without first performing OCR recognition and then re-encoding the recognized characters as character text features.
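A minimal sketch of building the fixed-length character text feature sequence T is shown below; the "<pad>" token and the ocr_encode helper (passed as a parameter) are illustrative assumptions standing in for the padding symbol and character encoder described above.

```python
def build_text_features(region_chars, ocr_encode, pad="<pad>"):
    """region_chars: list of character strings, one per region; returns the flattened sequence T."""
    max_len = max(len(chars) for chars in region_chars)       # fixed length, e.g. 4
    T = []
    for chars in region_chars:
        padded = list(chars) + [pad] * (max_len - len(chars)) # pad shorter regions with a special symbol
        T.extend(ocr_encode(c) for c in padded)               # c_{i,1}, ..., c_{i,k_i}
    return T                                                  # (t_1, ..., t_n) flattened
```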
Fig. 4 illustrates a schematic diagram of a process 400 for determining regional visual semantic features according to some embodiments of the present disclosure. The present embodiment provides other alternative implementations for step 320 "determine the region visual semantic features of the region based on the visual features of the region and the plurality of character text features" for each region.
At step 410 of fig. 4, computing device 110 fuses the visual features of the plurality of regions and the plurality of character text features to obtain image visual semantic features.
In some embodiments, the computing device 110 may determine the image visual semantic features according to the following equation (3):
V = Concat(T, F + S)    Formula (3)
That is, the determined visual features F + S and the character text features T of all characters in the image are concatenated to obtain the image visual semantic features of image 200.
In some embodiments, computing device 110 may set different weights for character text feature T, region feature F, and region location information S to determine image visual semantic features according to the following equation (4):
V = Concat(αT, βF + γS)    Formula (4)
Here α, β and γ can be set according to test results or the requirements of the application scenario.
Alternatively, in some embodiments, computing device 110 may also combine region feature F and region position feature S using the AdaIN algorithm according to equation (5) below:

AdaIN(x, y) = σ(y) · (x - μ(x)) / σ(x) + μ(y)    Formula (5)

where μ denotes the mean and σ the standard deviation, and x can be set to F and y to S (or vice versa). The image visual semantic features may then be determined according to the following equation (6):
V = Concat(T, AdaIN(F, S))    Formula (6)
Note that the above-described fusion of character text features T, region features F and region position features S to determine the image visual semantic features V is merely exemplary; other suitable fusion methods besides summation, concatenation, AdaIN, or combinations thereof may be employed, and the disclosure is not limited thereto.
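A minimal sketch of formulas (3) to (6) follows. It assumes T has shape (sum of k_i, d) and F, S have shape (n, d); the epsilon term and the per-row statistics in the AdaIN variant are implementation assumptions.

```python
import torch

def adain(x, y, eps=1e-5):
    """AdaIN(x, y): align the mean/std of x to those of y, as in formula (5)."""
    mu_x, std_x = x.mean(-1, keepdim=True), x.std(-1, keepdim=True) + eps
    mu_y, std_y = y.mean(-1, keepdim=True), y.std(-1, keepdim=True)
    return std_y * (x - mu_x) / std_x + mu_y

def image_visual_semantic_features(T, F, S, alpha=1.0, beta=1.0, gamma=1.0, use_adain=False):
    """Formula (3)/(4) by default; formula (6) when use_adain=True (weights unused in that case)."""
    fused = adain(F, S) if use_adain else beta * F + gamma * S
    return torch.cat([alpha * T, fused], dim=0)   # V = Concat(text features, fused visual features)
```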
At step 420 of fig. 4, computing device 110 enhances the image visual semantic features to obtain enhanced image visual semantic features. To enhance the image visual semantic features, computing device 110 may further fuse the visual features F + S and the character text features T within the fused features V described above using a suitable algorithm. For example, a multi-layer bidirectional Transformer encoder (Bidirectional Encoder Representations from Transformers, BERT) may be utilized to enhance the information representation of the image visual semantic features across the spatial, visual, semantic and other modalities. We define the initial input of the encoder as H_0 = V, and the encoding performed by the encoder is defined according to the following formula (7):
H_l = σ(W_{l1} H_{l-1} (W_{l2} H_{l-1})^t) H_{l-1}    Formula (7)

where H_{l-1} and H_l represent the input features and output features of the l-th encoding layer, respectively, and t denotes the matrix transposition operation. The model uses fully connected layers (W_{l1}, W_{l2}) to transform the feature H_{l-1} and compute a weight matrix, which is then multiplied with H_{l-1} to obtain the fused encoding feature H_l of the l-th layer. σ is the normalization function sigmoid. By stacking multiple encoding layers, the visual features F + S and the character text features T exchange information during encoding and are finally recombined into richer enhanced image visual semantic features H. As can be seen from formula (7), the dimension of H does not change: each term in H corresponds to a term in V, except that each term in H fuses the features of its associated terms. Note that the above encoder and formulas are merely exemplary, and the information in the features may be fused in any suitable manner.
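The single layer below is one possible reading of formula (7) as reconstructed above (a sigmoid-weighted, attention-like mixing of the sequence), offered as a sketch rather than the patented implementation; the 768 dimension and bias-free linear layers are assumptions.

```python
import torch

class FusionLayer(torch.nn.Module):
    """One encoding layer: H_l = sigmoid(W_l1 H (W_l2 H)^t) H, applied to H = H_{l-1}."""
    def __init__(self, dim=768):
        super().__init__()
        self.w1 = torch.nn.Linear(dim, dim, bias=False)   # W_{l1}
        self.w2 = torch.nn.Linear(dim, dim, bias=False)   # W_{l2}

    def forward(self, h):                                      # h: (seq_len, dim) = H_{l-1}
        weights = torch.sigmoid(self.w1(h) @ self.w2(h).t())   # (seq_len, seq_len) weight matrix
        return weights @ h                                     # H_l, same shape as H_{l-1}
```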
At step 430 of fig. 4, computing device 110 averages a plurality of character text features in an area in the enhanced image visual semantic features to obtain area text features for the respective area. The enhanced image visual semantic feature H obtained above can be expressed as:
H = (x_{1,1}, x_{1,2}, …, x_{1,k_1}, x_{2,1}, x_{2,2}, …, x_{2,k_2}, …, x_{n,1}, …, x_{n,k_n}, y_1, …, y_n)
where x_{i,j} is the enhanced feature corresponding to character text feature c_{i,j}, y_i is the enhanced feature corresponding to the visual feature F + S of the i-th region, i ∈ [1, n], j ∈ [1, k_i]. Computing device 110 may average the enhanced character text features x_{i,j} belonging to the same region within the enhanced image visual semantic features to obtain a region text feature q_i representing that region, i.e., q_i = (1/k_i) Σ_{j=1}^{k_i} x_{i,j}.
at step 440 of fig. 4, computing device 110 determines region visual semantic features for the respective region based on the respective ones of the region text features and the enhanced image visual semantic features.
In some embodiments, computing device 110 may perform a Hadamard product operation between the region text feature q_i of a region and the enhanced visual feature y_i of that region to obtain the region visual semantic features M, where M = {m_i; m_i = q_i ⊙ y_i}. In some embodiments, a Kronecker product operation may also be performed between q_i and y_i. Alternatively, in some embodiments, the region visual semantic features may be determined by way of a standard vector product. The product operation merely serves to fuse the text features of the characters in the region with the visual, spatial and position features of the region, and other suitable fusion operations may be used; the disclosure is not limited in this respect.
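A minimal sketch of steps 430-440 is shown below, assuming the enhanced character features and enhanced visual features have already been split out of H; the tensor layout is an illustrative assumption.

```python
import torch

def region_visual_semantic_features(H_chars, H_visual, region_lengths):
    """H_chars: (sum of k_i, d) enhanced character features x_{i,j};
       H_visual: (n, d) enhanced visual features y_i; region_lengths: [k_1, ..., k_n]."""
    q, start = [], 0
    for k in region_lengths:
        q.append(H_chars[start:start + k].mean(dim=0))   # region text feature q_i (per-region average)
        start += k
    Q = torch.stack(q)                                   # (n, d) region text features
    return Q * H_visual                                  # m_i = q_i ⊙ y_i  ->  M, one row per region
```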
The spatial, semantic, and visual features of each region can be combined together by various means of combination (e.g., summation), fusion (e.g., concatenation, AdaIN), enhancement, and multiplication to form features representing the region, which can significantly increase the accuracy of the subsequent entity relationship extraction.
Fig. 5 shows a schematic block diagram of an entity relationship extraction apparatus 500 for multimodal feature fusion in accordance with an embodiment of the disclosure. As shown in fig. 5, the apparatus 500 includes: a first feature determination module 510 configured to determine, for each of a plurality of regions in an image comprising characters, a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to one character in the region; a second feature determination module 520 configured to determine, for each region, a region visual semantic feature of the region based on the visual feature of the region and the plurality of character text features; a relationship information determining module 530 configured to determine relationship information of a plurality of regions based on the region visual semantic features, the relationship information indicating at least a degree of association between any two regions of the plurality of regions; a first region association module 540 configured to associate regions of the plurality of regions based on the relationship information; and a first extraction module 550 configured to extract entity relationships for the acquired entities.
In some embodiments, wherein the first feature determination module 510 may comprise: an image feature determination module configured to determine an image feature of an image including a character; a first visual feature determination module configured to determine a visual feature of a region based on the image feature and region position information of each of a plurality of regions in the image; and a character text feature determination module configured to determine a plurality of character text features based on the region position information and the characters included in the region.
In some embodiments, wherein the first visual characteristic determination module may comprise: a region position information determination module configured to determine region position information of each of a plurality of regions in an image in the image; a region feature determination module configured to determine a region feature of a region based on the image feature and the region position information; and a second visual feature determination module configured to combine the region location information and the region features to determine visual features of the region.
In some embodiments, wherein the second feature determination module 520 may comprise: the image visual semantic feature determining module is configured to fuse the visual features of the plurality of areas and the text features of the plurality of characters to obtain the image visual semantic features; the enhancement module is configured to enhance the visual semantic features of the image to obtain enhanced visual semantic features of the image; the regional text feature determining module is configured to average a plurality of character text features in one region in the enhanced image visual semantic features to obtain regional text features of the corresponding region; and a region visual semantic feature determination module configured to determine region visual semantic features of respective regions based on respective ones of the region text features and the enhanced image visual semantic features.
In some embodiments, wherein the first region association module 540 may include: a degree of association determination module configured to determine a degree of association between a first region of the plurality of regions and a region other than the first region of the plurality of regions, respectively; and a second region association module configured to associate the target region having the highest degree of association with the first region.
In some embodiments, the first extraction module 550 may comprise: an entity acquisition module configured to acquire an entity to be determined, the entity having an associated entity name and entity value; a region determination module configured to determine a first region including the entity name; a target region determination module configured to determine an associated target region based on the determined first region; and an entity value determination module configured to take the characters included in the target region as the entity value.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as process 300 and process 400. For example, in some embodiments, process 300 and process 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of process 300 and process 400 described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform process 300 and process 400 in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing an apparatus of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS services ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A multi-modal feature fusion entity relationship extraction method comprises the following steps:
determining, for each of a plurality of regions in an image comprising characters, a visual feature of the region and a plurality of character text features of the region, the character text features corresponding to one character in the region;
determining, for each region, a region visual semantic feature of the region based on the visual features of the region and the plurality of character text features;
determining relationship information of the plurality of regions based on the region visual semantic features, wherein the relationship information at least indicates the association degree between any two regions in the plurality of regions;
associating regions of the plurality of regions based on the relationship information; and
for the obtained entity, extracting the entity relationship,
wherein determining the region visual semantic features of the region comprises:
determining a location feature of the region based on region location information of the region in the image;
determining a region feature of the region based on the image feature of the image and the region position information;
adding the position feature and the region feature, and concatenating the result with the plurality of character text features, to determine an image visual semantic feature of the region;
determining enhanced image visual semantic features for each of the plurality of regions by taking the image visual semantic feature of each region as the initial input feature of a layer-wise fusion equation, such that the position features, the region features, and the plurality of character text features are fused,
wherein the initial input H_0 is the image visual semantic feature, H_{l-1} and H_l respectively represent the input features and output features of the l-th layer, W_{l1} and W_{l2} are fully-connected layers, σ is the sigmoid normalization function, and t represents the transposition operation on a matrix;
performing average operation on the enhanced character text features corresponding to the plurality of character text features in the enhanced image visual semantic features to obtain region text features representing the region; and
and multiplying the region text features by the corresponding enhanced visual features in the enhanced image visual semantic features to determine the region visual semantic features of the region.
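Since the fusion equation referenced in claim 1 is not reproduced in this text, the following Python/PyTorch sketch shows only one plausible layer that is consistent with the stated symbols: H_0 is the image visual semantic feature, W_l1 and W_l2 are fully-connected layers, σ is the sigmoid function, and the transpose enters through an attention-style pairwise term. The class name, feature width and number of layers are illustrative assumptions, not the patented formula.

```python
import torch
import torch.nn as nn


class FusionLayer(nn.Module):
    """One enhancement layer mapping H_{l-1} (N x d) to H_l (N x d).

    A hypothetical form consistent with the claim's symbols: two
    fully-connected layers W_l1 and W_l2, a sigmoid-normalised pairwise
    term built with a matrix transpose, and propagation of the input
    features through that term so every token mixes with all others.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)   # W_l1
        self.w2 = nn.Linear(dim, dim)   # W_l2

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_tokens, dim) — per region, the position feature plus the
        # region feature, concatenated with the character text features.
        affinity = torch.sigmoid(self.w1(h) @ self.w2(h).t())   # (N, N)
        return affinity @ h                                      # H_l


# Usage: stack a few layers; h starts as the image visual semantic feature H_0.
h = torch.randn(32, 256)
for layer in [FusionLayer(256), FusionLayer(256)]:
    h = layer(h)            # enhanced image visual semantic features
```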
2. The method of claim 1, wherein determining, for each of a plurality of regions in an image comprising characters, a visual feature of a region and a plurality of character text features of the region comprises:
determining image features of the image comprising characters;
determining a visual feature of each of a plurality of regions in the image based on the image feature and region location information of the region in the image; and
the plurality of character text features are determined based on the region position information and characters included in the region.
3. The method of claim 2, wherein determining the visual characteristics of the region based on the image characteristics and region location information in the image for each of a plurality of regions in the image comprises:
determining region position information of each of a plurality of regions in the image;
determining a region feature of the region based on the image feature and the region position information; and
the region location information and the region features are combined to determine visual features of the region.
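For claims 2 and 3, the sketch below illustrates how a region's visual feature could be assembled from the global image feature and the region position information, assuming a CNN-style feature map, torchvision's roi_align for the region feature, and concatenation as the "combine" step; the backbone stride, channel count, image size and linear projection are assumptions, not the patent's prescribed implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

# Global image feature map from some CNN backbone over the document image
# (channel count and stride-4 resolution over a 512x512 image are assumed).
feature_map = torch.randn(1, 256, 128, 128)

# Region position information: (x1, y1, x2, y2) boxes in image coordinates.
region_boxes = torch.tensor([[10., 20., 200., 60.],
                             [15., 80., 220., 120.]])

# Region features: pool the image features inside each region box.
pooled = roi_align(feature_map, [region_boxes], output_size=(7, 7),
                   spatial_scale=0.25)           # (K, 256, 7, 7)
region_feats = pooled.mean(dim=(2, 3))           # (K, 256)

# Position features: normalised box coordinates projected to the same width.
pos_proj = nn.Linear(4, 256)                     # hypothetical projection
pos_feats = pos_proj(region_boxes / 512.0)       # (K, 256)

# Combine the region position information and the region features,
# here by concatenation, to obtain the visual feature of each region.
visual_feats = torch.cat([pos_feats, region_feats], dim=-1)   # (K, 512)
```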
4. The method of claim 1, wherein determining, for each region, the region visual semantic feature of the region based on the visual feature of the region and the plurality of character text features comprises:
fusing the visual features of the plurality of regions and the plurality of character text features to obtain an image visual semantic feature V;
enhancing the visual semantic features of the image to obtain enhanced visual semantic features of the image;
averaging a plurality of character text features in one region in the enhanced image visual semantic features to obtain a region text feature Q of the corresponding region; and
and determining the region visual semantic features M of the corresponding region based on the region text features and the corresponding visual features Y in the enhanced image visual semantic features.
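The pooling and gating of claim 4 (and the corresponding steps of claim 1) can be sketched as follows; the token layout — one enhanced visual token followed by the region's enhanced character tokens — is an assumed arrangement for illustration.

```python
import torch

# Enhanced features of one region after the fusion layers: the first row is
# assumed to be the region's enhanced visual token Y, the remaining rows its
# enhanced character text tokens.
enhanced = torch.randn(1 + 12, 256)

y = enhanced[0]                # corresponding enhanced visual feature Y
q = enhanced[1:].mean(dim=0)   # region text feature Q: average of character tokens
m = q * y                      # region visual semantic feature M (element-wise product)
```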
5. The method of claim 1, wherein associating regions of the plurality of regions comprises:
determining a degree of association between a first region of the plurality of regions and regions of the plurality of regions other than the first region, respectively; and
and associating the target region having the highest association degree with the first region.
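One way to realise the association of claim 5 is to score every ordered pair of regions and keep, for each first region, the target region with the highest score. The projected dot-product scorer below is an assumption; the claim only requires some association degree derived from the relationship information.

```python
import torch
import torch.nn as nn

num_regions, dim = 6, 256
m = torch.randn(num_regions, dim)            # region visual semantic features M

# Hypothetical association-degree model: a projected dot product.
query, key = nn.Linear(dim, dim), nn.Linear(dim, dim)

with torch.no_grad():
    scores = query(m) @ key(m).t()           # (N, N) pairwise association degrees
    scores.fill_diagonal_(float('-inf'))     # a region is not matched to itself
    target_of = scores.argmax(dim=1)         # index of each region's target region
```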
6. The method of claim 5, wherein extracting entity relationships for the acquired entities comprises:
acquiring an entity to be determined, wherein the entity has an associated entity name and entity value;
determining the first region including the entity name;
determining the associated target region based on the determined first region; and
and taking the characters included in the target region as the entity value.
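Claim 6 amounts to a key-value lookup once regions have been associated. The plain-Python sketch below assumes a `region_texts` list of recognised characters per region and a `target_of` mapping produced by the previous step; both names and the sample data are hypothetical.

```python
# Hypothetical post-processing after region association.
region_texts = ["Name", "Zhang San", "Date", "2021-06-16"]
target_of = [1, 0, 3, 2]                 # assumed association result


def extract_entity_value(entity_name: str) -> str:
    """Locate the first region containing the entity name and return the
    characters of its associated target region as the entity value."""
    for idx, text in enumerate(region_texts):
        if entity_name in text:
            return region_texts[target_of[idx]]
    return ""


print(extract_entity_value("Date"))      # -> "2021-06-16"
```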
7. An entity relationship extraction device for multi-modal feature fusion, comprising:
a first feature determination module configured to determine, for each of a plurality of regions in an image including characters, a visual feature of the region and a plurality of character text features of the region, the character text feature corresponding to one character in the region;
a second feature determination module configured to determine, for each region, a region visual semantic feature of the region based on the visual features of the region and the plurality of character text features;
a relationship information determination module configured to determine relationship information of the plurality of regions based on the region visual semantic features, the relationship information indicating at least a degree of association between any two regions of the plurality of regions;
a first region association module configured to associate regions of the plurality of regions based on the relationship information; and
a first extraction module configured to extract entity relationships for the acquired entities, wherein the second feature determination module is further configured to:
determining a location feature of the region based on region location information of the region in the image;
determining a region feature of the region based on the image feature of the image and the region position information;
adding the position feature and the region feature, and concatenating the result with the plurality of character text features, to determine an image visual semantic feature of the region;
determining enhanced image visual semantic features for each of the plurality of regions by taking the image visual semantic feature of each region as the initial input feature of a layer-wise fusion equation, such that the position features, the region features, and the plurality of character text features are fused,
wherein the initial input H_0 is the image visual semantic feature, H_{l-1} and H_l respectively represent the input features and output features of the l-th layer, W_{l1} and W_{l2} are fully-connected layers, σ is the sigmoid normalization function, and t represents the transposition operation on a matrix;
performing average operation on the enhanced character text features corresponding to the plurality of character text features in the enhanced image visual semantic features to obtain region text features representing the region; and
and multiplying the region text features by the corresponding enhanced visual features in the enhanced image visual semantic features to determine the region visual semantic features of the region.
8. The apparatus of claim 7, wherein the first feature determination module comprises:
an image feature determination module configured to determine an image feature of the image including the character;
a first visual feature determination module configured to determine a visual feature of each of a plurality of regions in the image based on the image feature and region location information of the region in the image; and
and a character text feature determination module configured to determine the plurality of character text features based on the region position information and the characters included in the region.
9. The apparatus of claim 8, wherein the first visual characteristic determination module comprises:
a region position information determination module configured to determine region position information of each of a plurality of regions in the image;
a region feature determination module configured to determine a region feature of the region based on the image feature and the region position information; and
a second visual feature determination module configured to combine the region location information and the region features to determine visual features of the region.
10. The apparatus of claim 7, wherein the second feature determination module comprises:
the image visual semantic feature determining module is configured to fuse the visual features of the plurality of regions and the plurality of character text features to obtain image visual semantic features;
the enhancement module is configured to enhance the visual semantic features of the image to obtain enhanced visual semantic features of the image;
the regional text feature determining module is configured to average a plurality of character text features in one region in the enhanced image visual semantic features to obtain regional text features of the corresponding region; and
and a region visual semantic feature determination module configured to determine the region visual semantic features of the respective region based on the region text features and the corresponding visual features in the enhanced image visual semantic features.
11. The apparatus of claim 7, wherein the first region association module comprises:
a degree of association determination module configured to determine a degree of association between a first region of the plurality of regions and a region of the plurality of regions other than the first region, respectively; and
and a second region association module configured to associate a target region having the highest degree of association with the first region.
12. The apparatus of claim 11, wherein the first extraction module comprises:
an entity acquisition module configured to acquire an entity to be determined, the entity having an associated entity name and entity value;
a region determination module configured to determine the first region including the entity name;
a target region determination module configured to determine the associated target region based on the determined first region; and
and the entity value determining module is configured to take characters included in the target area as the entity value.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
CN202110666465.0A 2021-06-16 2021-06-16 Entity relation extraction method, device and equipment for multi-modal feature fusion Active CN113343982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110666465.0A CN113343982B (en) 2021-06-16 2021-06-16 Entity relation extraction method, device and equipment for multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110666465.0A CN113343982B (en) 2021-06-16 2021-06-16 Entity relation extraction method, device and equipment for multi-modal feature fusion

Publications (2)

Publication Number Publication Date
CN113343982A CN113343982A (en) 2021-09-03
CN113343982B true CN113343982B (en) 2023-07-25

Family

ID=77476059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110666465.0A Active CN113343982B (en) 2021-06-16 2021-06-16 Entity relation extraction method, device and equipment for multi-modal feature fusion

Country Status (1)

Country Link
CN (1) CN113343982B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511864B (en) * 2022-04-19 2023-01-13 腾讯科技(深圳)有限公司 Text information extraction method, target model acquisition method, device and equipment
CN115130473B (en) * 2022-04-20 2023-08-25 北京百度网讯科技有限公司 Key information extraction method, model training method, related device and electronic equipment
CN115359383B (en) * 2022-07-07 2023-07-25 北京百度网讯科技有限公司 Cross-modal feature extraction and retrieval and model training method, device and medium
CN116152817B (en) * 2022-12-30 2024-01-02 北京百度网讯科技有限公司 Information processing method, apparatus, device, medium, and program product
CN116486420B (en) * 2023-04-12 2024-01-12 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341813A (en) * 2017-06-15 2017-11-10 西安电子科技大学 SAR image segmentation method based on structure learning and sketch characteristic inference network
WO2020111844A2 (en) * 2018-11-28 2020-06-04 Seoul National University R&DB Foundation Method and apparatus for enhancing image feature point in visual slam by using object label
CN111783457A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN112949415A (en) * 2021-02-04 2021-06-11 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100734964B1 (en) * 1998-11-06 2007-07-03 The Trustees of Columbia University in the City of New York Video description system and method
WO2017070656A1 (en) * 2015-10-23 2017-04-27 Hauptmann Alexander G Video content retrieval system
CN107807987B (en) * 2017-10-31 2021-07-02 广东工业大学 Character string classification method and system and character string classification equipment
CN108171283B (en) * 2017-12-31 2020-06-16 厦门大学 Image content automatic description method based on structured semantic embedding
CN108171213A (en) * 2018-01-22 2018-06-15 北京邮电大学 A kind of Relation extraction method for being applicable in picture and text knowledge mapping
CN110245231B (en) * 2019-05-16 2023-01-20 创新先进技术有限公司 Training sample feature extraction method, device and equipment for messy codes
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN110956651B (en) * 2019-12-16 2021-02-19 哈尔滨工业大学 Terrain semantic perception method based on fusion of vision and vibrotactile sense
CN112001368A (en) * 2020-09-29 2020-11-27 北京百度网讯科技有限公司 Character structured extraction method, device, equipment and storage medium
CN112801010B (en) * 2021-02-07 2023-02-14 华南理工大学 Visual rich document information extraction method for actual OCR scene
CN112949477B (en) * 2021-03-01 2024-03-15 苏州美能华智能科技有限公司 Information identification method, device and storage medium based on graph convolution neural network
CN112949621A (en) * 2021-03-16 2021-06-11 新东方教育科技集团有限公司 Method and device for marking test paper answering area, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341813A (en) * 2017-06-15 2017-11-10 西安电子科技大学 SAR image segmentation method based on structure learning and sketch characteristic inference network
WO2020111844A2 (en) * 2018-11-28 2020-06-04 Seoul National University R&DB Foundation Method and apparatus for enhancing image feature point in visual slam by using object label
CN111783457A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN112949415A (en) * 2021-02-04 2021-06-11 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium

Also Published As

Publication number Publication date
CN113343982A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113343982B (en) Entity relation extraction method, device and equipment for multi-modal feature fusion
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN113378580B (en) Document layout analysis method, model training method, device and equipment
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN113204615B (en) Entity extraction method, device, equipment and storage medium
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN107291845A (en) A kind of film based on trailer recommends method and system
CN112861830B (en) Feature extraction method, device, apparatus, storage medium, and program product
CN113887615A (en) Image processing method, apparatus, device and medium
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN112508005B (en) Method, apparatus, device and storage medium for processing image
CN115130473B (en) Key information extraction method, model training method, related device and electronic equipment
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN112784967B (en) Information processing method and device and electronic equipment
CN114445833A (en) Text recognition method and device, electronic equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN113822275A (en) Image language identification method and related equipment thereof
CN116431767B (en) Text image query method, device, electronic equipment and storage medium
CN115147850B (en) Training method of character generation model, character generation method and device thereof
CN112966140B (en) Field identification method, field identification device, electronic device, storage medium and program product
CN116486420B (en) Entity extraction method, device and storage medium of document image
CN113343979B (en) Method, apparatus, device, medium and program product for training a model
CN114202648B (en) Text image correction method, training device, electronic equipment and medium
CN113469249B (en) Image classification model training method, classification method, road side equipment and cloud control platform
CN116071625B (en) Training method of deep learning model, target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant