CN110991149A - Multi-modal entity linking method and entity linking system - Google Patents

Info

Publication number
CN110991149A
CN110991149A
Authority
CN
China
Prior art keywords
entity
object recognition
picture
model
recognition model
Prior art date
Legal status
Pending
Application number
CN201911101194.3A
Other languages
Chinese (zh)
Inventor
徐叶强
王峰
窦任荣
吴云标
谢海博
Current Assignee
Guangzhou Aixue Information Technology Co Ltd
Original Assignee
Guangzhou Aixue Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Aixue Information Technology Co Ltd filed Critical Guangzhou Aixue Information Technology Co Ltd
Priority to CN201911101194.3A priority Critical patent/CN110991149A/en
Publication of CN110991149A publication Critical patent/CN110991149A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal entity linking method and system. The linking method comprises the following steps. Generating an object recognition model: collecting and labeling pictures, and preprocessing the collected and labeled pictures; constructing an object recognition model; training the object recognition model. Generating an entity link library: acquiring an entity corpus, and associating entities with picture tags to obtain an entity link library. Entity linking: preprocessing a picture obtained by shooting, inputting it into the object recognition model to obtain an object recognition result, and looking up the recognition result in the entity link library to obtain the entity's text result. The invention achieves entity disambiguation through object recognition on pictures and realizes multi-modal entity linking from pictures to text. Specifically, common everyday objects are photographed with a camera, the objects in the pictures are recognized, and the recognition results are linked to the corresponding entities, thereby realizing multi-modal picture-to-text entity linking.

Description

Multi-modal entity linking method and entity linking system
Technical Field
The invention relates to the fields of deep learning, digital image processing, and knowledge graphs, and in particular to the application of image recognition and entity linking technology.
Background
Entity linking refers to extracting entity mentions from a piece of text and, after disambiguation, mapping each mention to a unique entity in a specified knowledge base. Entity linking helps computers find important semantic information in sentences and distinguish the different meanings a word takes in different contexts, and is indispensable for helping computers understand natural language.
At present, entity linking technology is widely applied in information extraction, information retrieval, content analysis, automatic question answering, knowledge base expansion, and other fields. Its limitation, however, is that it applies only to text.
In real life, information is carried not only by text but also by other modalities such as voice, video, and pictures. To date, no entity linking technique spanning the two different modalities of picture and text has appeared.
In view of the above, it is desirable to provide a multi-modal entity linking technique based on picture object recognition.
Disclosure of Invention
The invention first provides a multi-modal entity linking method that performs object recognition on pictures of common everyday objects and links the recognition results to the corresponding entities, thereby realizing multi-modal entity linking from pictures to text.
The invention further provides a multi-modal entity linking system.
To achieve this purpose, the technical scheme of the invention is as follows:
A multi-modal entity linking method, comprising the following steps:
(I) generating an object recognition model: collecting and labeling pictures, and preprocessing the collected and labeled pictures; constructing an object recognition model; training the object recognition model;
(II) generating an entity link library: acquiring an entity corpus, and associating entities with picture tags to obtain an entity link library;
(III) entity linking: preprocessing a picture obtained by shooting, inputting it into the object recognition model to obtain an object recognition result, and looking up the recognition result in the entity link library to obtain the entity's text result.
Preferably, the picture preprocessing methods include encoding, thresholding or filtering operations, and normalization.
Preferably, the process of constructing the object recognition model is as follows: an Inception V3 deep neural network model is adopted to construct the object recognition model; the model's input is a picture of the object to be recognized, and its output is the object's name and corresponding probability. The Inception structure uses 1×1 convolution kernels to reduce dimensionality, and the fully connected layer is replaced by simple global average pooling.
Preferably, the specific process of training the object recognition model is as follows: the model is trained with the deep learning software library TensorFlow; the preprocessed pictures are input as training samples; parameters such as the learning rate and number of iterations are set; and model training is performed to finally obtain the Inception V3 model with the best training effect.
Preferably, entities and picture tags are associated in the entity link library as follows: after the entity library is obtained, its entities are associated with the picture entities through manual labeling; after manual labeling, an entity-picture label library is obtained.
Preferably, the entity linking step further includes displaying the entity linking result after the entity's text result is obtained; the retrieval result is presented through a visual display or voice broadcast.
The invention also provides a multi-modal entity linking system comprising the following modules:
an object recognition model generation module, which collects and labels pictures, preprocesses the collected and labeled pictures, constructs an object recognition model, and trains the object recognition model;
an entity link library generation module, which acquires an entity corpus and associates entities with picture tags to obtain an entity link library;
an entity linking module, which preprocesses a picture obtained by shooting, inputs it into the object recognition model to obtain an object recognition result, and looks up the recognition result in the entity link library to obtain the entity's text result.
The invention also proposes a readable storage medium on which a computer program is stored; when executed by a processor, the program carries out the steps of the above method.
The invention also provides a multi-modal entity link generation device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor executes the program to realize the steps of the above method.
Preferably, the device further comprises an intelligent desk lamp; the memory and the processor are embedded in the intelligent desk lamp, which includes a sound pickup device.
The innovation of the invention is that object recognition on pictures is used to achieve entity disambiguation, realizing multi-modal entity linking from pictures to text. In practice, common everyday objects are photographed with camera-equipped hardware, the objects in the pictures are recognized, and the recognition results are finally linked to the corresponding entities, thereby achieving multi-modal picture-to-text entity linking.
Drawings
FIG. 1 is a multi-modal entity linking flow diagram;
FIG. 2 is a diagram of the Inception V3 model network architecture;
FIGS. 3 and 4 are schematic diagrams illustrating the relationship between the apple parent class and its child classes.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
A multi-modal entity linking method, comprising the steps of:
(I) Object recognition model generation
1. Picture collection and labeling
One purpose of the invention is to recognize objects that are common in daily life, so object pictures can be captured and collected with photographic equipment. In addition, the ImageNet project published by Fei-Fei Li's team comprises 20,000 categories and more than 14 million labeled pictures, and can serve as one of the picture training sets. The collection and labeling results for the picture training set are stored in a picture-entity label library.
2. Picture preprocessing
After the training set is collected, to improve object recognition, the apparent characteristics of each picture (such as color distribution, overall brightness, and size) should be made as consistent as possible, so the pictures are preprocessed as needed. Common preprocessing methods include encoding, thresholding or filtering operations, and normalization.
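As a minimal illustration of these preprocessing steps, the following Python sketch decodes a picture, applies a simple filtering operation, and normalizes pixel values; the file path, the median-filter choice, and the 299×299 target size (Inception V3's input size) are assumptions for illustration, not values fixed by the invention.

```python
import numpy as np
from PIL import Image, ImageFilter

def preprocess(path, size=(299, 299)):
    """Decode, filter, resize, and normalize one training picture."""
    img = Image.open(path).convert("RGB")          # decode to a fixed color space
    img = img.filter(ImageFilter.MedianFilter(3))  # simple denoising filter operation
    img = img.resize(size)                         # unify picture size
    arr = np.asarray(img, dtype=np.float32)
    return arr / 127.5 - 1.0                       # scale pixel values to [-1, 1]
```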
3. Object recognition model construction
Traditional object recognition adopts statistics-based methods, but with the development of deep learning in recent years, practice has shown that deep learning methods perform far better than statistical ones. The invention adopts an Inception V3 deep neural network model to build the object recognition model. Compared with other deep neural network models such as AlexNet and VGGNet, the Inception V3 model has fewer network parameters, which speeds up model training and loading. The network also introduces the Inception structure to replace the traditional pattern of a simple convolution followed by an activation function.
FIG. 2 shows the Inception V3 object recognition model network architecture. The model's input is a picture of the object to be recognized, and its output is the object's name and corresponding probability. The Inception structure uses 1×1 convolution kernels to reduce dimensionality, which effectively addresses the otherwise large amount of computation, and replacing the fully connected layer with simple global average pooling reduces the number of parameters.
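As a sketch of this construction step, the stock Keras Inception V3 architecture (which already contains the 1×1 convolutions and global average pooling described above) can be instantiated as follows; training from scratch with 1000 output classes is an assumption that mirrors the embodiment below.

```python
import tensorflow as tf

# Build an Inception V3 recognizer: input is a 299x299 RGB object picture,
# output is a probability for each of the 1000 predefined object classes.
model = tf.keras.applications.InceptionV3(
    weights=None,                 # train from scratch on the collected pictures
    input_shape=(299, 299, 3),
    classes=1000,
)
```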
4. Object recognition model training
After the model is built, it is trained with the open-source deep learning software library TensorFlow. The pictures obtained from preprocessing are input as training samples, parameters such as the learning rate and number of iterations are set, and model training is performed to finally obtain the Inception V3 model with the best training effect.
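Continuing the construction sketch above, a minimal TensorFlow/Keras training loop might look as follows; the learning rate, epoch count, and the random stand-in data are illustrative assumptions only.

```python
import numpy as np

# Stand-in training data; in practice these are the preprocessed, labeled pictures.
train_images = np.random.rand(8, 299, 299, 3).astype("float32")
train_labels = np.random.randint(0, 1000, size=8)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # set the learning rate
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=5)  # set the iteration count and train
model.save("inception_v3_best.h5")               # keep the best-performing model
```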
(II) Entity link library generation
1. Entity corpus collection
Knowledge bases usable as the entity corpus include Wikipedia, Baidu Baike, Freebase, YAGO, and the like. These knowledge bases contain abundant entities and entity attribute values, and entity data can be collected from them with a web crawler to obtain an entity library.
2. Entity-picture tag association
After the entity library is obtained, its entities need to be associated with the picture entities through manual labeling. For example, the entity "apple" may cover the two entities "fruit apple" and "apple computer", which are labeled to correspond to the fruit apple and the Apple computer in the picture library, respectively.
After manual labeling, an entity-picture label library is obtained.
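A toy sketch of the resulting entity-picture label library follows: each picture-library label is mapped to a knowledge-base entity record. The dictionary structure and the kb_id values are illustrative assumptions, mirroring the "apple" example above.

```python
# Entity-picture label library: picture label -> knowledge-base entity record.
entity_link_library = {
    "fruit apple":    {"entity": "apple (fruit)",    "kb_id": "E_fruit_apple"},
    "apple computer": {"entity": "apple (computer)", "kb_id": "E_apple_computer"},
    "orange":         {"entity": "orange (fruit)",   "kb_id": "E_orange"},
}
```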
(III) Entity linking
1. Picture taking
A target picture is acquired with photographic equipment.
2. Picture preprocessing
The picture must be preprocessed before object recognition, for the same purpose as in object recognition model generation.
3. Object recognition
The preprocessed picture is input into the object recognition model to obtain an object recognition result.
4. Entity linking
The object recognition result is looked up in the entity-picture mapping library to obtain the entity's text result. For example, a photographed "fruit apple" is linked to the "fruit apple" entity, as shown in FIG. 3, and a photographed "Apple computer" is linked to the "apple computer" entity, as shown in FIG. 4. In the text domain, however, both are "apple" entities.
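A sketch of this lookup step, reusing the toy library from section (II); the function name and the fallback behavior for unlabeled results are assumptions.

```python
def link_entity(recognized_label, library):
    """Map a recognized picture label to its knowledge-base entity."""
    record = library.get(recognized_label)
    if record is None:
        return None  # no entity has been associated with this picture label yet
    return record["entity"]

print(link_entity("fruit apple", entity_link_library))  # -> apple (fruit)
```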
5. Entity linking result presentation
The retrieval result is displayed through a visualization tool such as ECharts; it can also be announced by voice through audio equipment such as an intelligent desk lamp.
To address problems of traditional image recognition technology such as the need for manual preprocessing and low accuracy, many deep learning models have gradually been applied in the image recognition field, for example Deep Belief Networks (DBNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). These network models can automatically learn and extract image features.
Entity linking refers to finding the correct candidate entity description in a knowledge base and is a key technology for constructing a knowledge graph. It mainly comprises candidate entity generation and candidate entity ranking; the finally linked entity is determined by computing the similarity between the entity mention and the candidate entities. Traditional entity linking generally requires named entity recognition first, i.e., predefined entities are extracted from a given text, and entity linking is performed on the extracted entities. In the embodiment of the invention, an object name, i.e., an entity mention, is obtained by constructing, training, and packaging an object recognition model; the input image-modality information is converted into text-modality information; knowledge base entity linking is performed using the entity mention; and the result is finally displayed visually, realizing the multi-modal entity linking method and entity linking system.
The invention uses the Inception V3 model for object recognition: an object picture is input, the model is called to recognize the object in the picture, and the recognition result is returned. For example, when a picture containing an "orange" is input, the recognition result "orange, 0.9564" is returned on the page. This process converts object picture information in the image modality into object name information in the text modality. The object name is an entity name, and entity linking is performed by querying the corresponding entity in the knowledge base to obtain its description information. Finally, the entity description information is displayed visually with ECharts.
In addition, the invention can be combined with an intelligent desk lamp in the following application scenario: a child places an object (such as an apple) under the intelligent desk lamp; an object picture is obtained through the lamp's camera; the object recognition model is called to return the object name; and the object name is linked to the corresponding entity in the knowledge base to obtain the entity's description information. The desk lamp can read out the object recognition result and the description information. A mini-program bound to the intelligent desk lamp can also synchronize the object recognition and entity linking history. For preschool and lower-grade children, the invention can help them learn about common everyday objects and help parents with the difficulties of early knowledge education.
Embodiment:
A multi-modal entity linking method, comprising the following steps:
step S101, an object recognition model is constructed and trained.
This step builds an inclusion V3 object recognition model based on the TensorFlow framework. Compared with AlexNet and VGGNet models, the Incepton V3 model has fewer network parameters and can accelerate the training and loading speed of the model. Meanwhile, the network model introduces an inclusion structure to replace the traditional operation technology of simple convolution plus an activation function. Fig. 2 is a diagram of an inclusion V3 object identification model network architecture. The input of the model is an object picture to be recognized, and the output is the name of the object and the corresponding probability thereof. The Incepton structure uses a 1 x 1 convolution kernel to reduce the dimension, and can effectively solve the problem of large calculation amount. The number of parameters can be reduced by replacing the fully connected layer with a simple global average pooling. The inclusion V3 model can identify predefined 1000 classes of common objects, each with a separate numbering correspondence. And training an Incep V3 model to obtain a pb model file, loading files of classification names corresponding to the classification character strings and files of classification numbers corresponding to the classification character strings respectively, establishing a mapping relation between the classification numbers and the corresponding classification names, transmitting the classification numbers, and returning object classification names.
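A sketch of the loading and label plumbing this step describes; the file names and the tab-separated format of the two lookup files are assumptions for illustration.

```python
import tensorflow as tf

# Load the frozen .pb graph produced by training (file name assumed).
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("inception_v3.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name="")

def load_label_map(name_file, number_file):
    """Chain the two lookup files into a {class_number: class_name} mapping."""
    string_to_name = dict(line.rstrip("\n").split("\t")
                          for line in open(name_file, encoding="utf-8"))
    string_to_number = dict(line.rstrip("\n").split("\t")
                            for line in open(number_file, encoding="utf-8"))
    return {int(num): string_to_name[s] for s, num in string_to_number.items()}
```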
Since the Inception V3 object recognition model can only recognize the 1000 predefined classes of common objects, its recognition is poor for object pictures outside those classes. Therefore, pictures of new object classes must be added and the model trained again so that the new classes can be recognized. Specifically, the training pictures of each new class are added to the original training set to form a new training set for training, as sketched below.
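Reusing preprocess() and the training arrays from the sketches above, extending the training set with one new class might look as follows; the picture paths and the new class id are assumptions.

```python
import numpy as np

new_pics = np.stack([preprocess(p)
                     for p in ["new_class/sample_01.jpg", "new_class/sample_02.jpg"]])
new_labels = np.full(len(new_pics), 1000)  # id assigned to the newly added class

train_images = np.concatenate([train_images, new_pics])
train_labels = np.concatenate([train_labels, new_labels])
# The model's output layer must also grow (classes=1001) before retraining on
# this enlarged set, since the original model only covers the 1000 predefined classes.
```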
Step S102: the object recognition model is packaged.
First, the input object picture is initialized and loaded and the Inception V3 model file is loaded; a model object is then constructed, and the picture is computed with TensorFlow. Finally, the probabilities of the 1000 predefined classes for the input object picture are obtained through the Inception V3 model and sorted; the class with the highest probability among the 1000 predefined classes is returned; and the object's name and probability are output as the recognition result.
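A sketch of this packaging step, combining the earlier model and label-map sketches; the function name and the top-k interface are assumptions.

```python
import numpy as np

def recognize(model, picture, id_to_name, top_k=1):
    """Sort the 1000 class probabilities and return the most likely object(s)."""
    probs = model.predict(picture[np.newaxis, ...])[0]  # shape (1000,)
    ranked = np.argsort(probs)[::-1][:top_k]            # classes by descending probability
    return [(id_to_name[int(i)], float(probs[i])) for i in ranked]
```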
Step S103: the object recognition model is called and the recognition result is returned.
In this step, the Inception V3 object recognition model is packaged as a RESTful interface. The back end mainly receives the object picture transmitted by the front end and stores it at the designated path on the server. Specifically, the path of the object recognition interface and the upload path for object picture files are first added to the configuration file. It is then checked whether the upload path for the object picture file exists; if so, a timestamp and the file name are spliced into a new file name, which is joined to the path, and the file is stored. An HttpPost object is then created; the HTTP request uses the POST method, with the packaged object recognition model interface path as its input. The object picture path information is received, the picture is read, the object recognition model is called, and the recognition result is returned. Finally, the obtained result is returned to the front end for display.
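A hedged sketch of such a RESTful wrapper using Flask; the route name, the upload directory, and the recognize_picture helper (assumed to wrap the packaged model) are illustrative assumptions, not the patent's actual interface.

```python
import os
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
UPLOAD_DIR = "uploads"  # configured upload path for object picture files

@app.route("/recognize", methods=["POST"])  # assumed object recognition interface path
def recognize_endpoint():
    f = request.files.get("picture")
    if f is None or f.filename == "":
        return jsonify(error="the uploaded picture file must not be empty"), 400
    os.makedirs(UPLOAD_DIR, exist_ok=True)         # make sure the upload path exists
    filename = f"{int(time.time())}_{f.filename}"  # splice timestamp and file name
    path = os.path.join(UPLOAD_DIR, filename)
    f.save(path)
    name, prob = recognize_picture(path)  # assumed helper wrapping the packaged model
    return jsonify(name=name, probability=prob)    # result goes back to the front end
```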
Step S104: the recognition result is linked to the corresponding entity in the knowledge base.
The object name, i.e., the entity name, is obtained through the object recognition model; this step converts the input image-modality information into text-modality information. The entity and attribute information corresponding to the entity name are queried through a knowledge graph constructed on an ontology (a knowledge base built on the ontology, in which the recognized object is linked to an entity in the knowledge base), thereby linking the entity name to the corresponding entity and obtaining knowledge related to it. The generated owl file is loaded with the Apache Jena tool, and the SPARQL API is called from a Java program to use the query processing function of the Jena framework, querying the knowledge base for the entity concept and attribute information corresponding to the object recognition result. Finally, the queried knowledge is returned to a system page for display. FIG. 3 illustrates the relationship between the apple parent class and its child classes: the parent class is apple, and the child classes comprise ten categories such as ID, alias, nature and taste, distribution area, and nutritive value. For example, if the object recognition result is an apple, "apple" is the entity name; entity linking technology links it to the corresponding entity "apple" in the knowledge base, and the attribute information of that entity is returned for visual display.
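The patent performs this query with Apache Jena from Java; as a hedged Python-side equivalent, rdflib can load the same owl file and run a SPARQL query for an entity's attributes. The file name and the use of rdfs:label for the entity name are assumptions.

```python
import rdflib

g = rdflib.Graph()
g.parse("knowledge_base.owl")  # the generated owl ontology file (name assumed)

# Query every property and value of the entity labelled "apple".
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?property ?value WHERE {
    ?entity rdfs:label "apple" .
    ?entity ?property ?value .
}
"""
for prop, value in g.query(query):
    print(prop, value)
```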
Step S105: visual display with ECharts.
The overall effect of the entity linking system is as follows: a picture of the object to be recognized is uploaded; clicking the "recognize" button yields the picture's recognition result, including the object's name and corresponding probability, which is displayed on the page. The uploaded picture file must not be empty and must not exceed 1 MB in size; the size of the uploaded file is checked, and otherwise a picture file smaller than 1 MB must be re-uploaded.
Meanwhile, knowledge related to the object can be obtained from the object name, i.e., the entity name, through entity linking. For example, if the recognition result of the uploaded picture is an apple, knowledge of the apple, including its ID, functions, distribution area, nutritive value, and the like, can be obtained through entity linking. Finally, the obtained encyclopedic knowledge of the object is displayed visually in chart form with ECharts.
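The display itself uses ECharts; as a hedged sketch on the Python side, pyecharts (the Python ECharts binding) can render the linked entity's attributes as a chart. The attribute names and counts below are stand-in values.

```python
from pyecharts import options as opts
from pyecharts.charts import Bar

chart = (
    Bar()
    .add_xaxis(["alias", "distribution area", "nutritive value"])
    .add_yaxis("apple", [3, 5, 8])  # stand-in counts of attribute entries
    .set_global_opts(title_opts=opts.TitleOpts(title="Entity: apple"))
)
chart.render("entity_apple.html")  # writes an interactive HTML chart
```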
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A multi-modal entity linking method, comprising the steps of:
(I) generating an object recognition model: collecting and labeling pictures, and preprocessing the collected and labeled pictures; constructing an object recognition model; training the object recognition model;
(II) generating an entity link library: acquiring an entity corpus, and associating entities with picture tags to obtain an entity link library;
(III) entity linking: preprocessing a picture obtained by shooting, inputting it into the object recognition model to obtain an object recognition result, and looking up the recognition result in the entity link library to obtain the entity's text result.
2. The method of claim 1, wherein the picture preprocessing methods are: encoding, thresholding or filtering operations, and normalization.
3. The method of claim 2, wherein the process of constructing the object recognition model is: an Inception V3 deep neural network model is adopted to construct the object recognition model, the model's input being a picture of the object to be recognized and its output being the object's name and corresponding probability; the Inception structure uses 1×1 convolution kernels to reduce dimensionality, and the fully connected layer is replaced by simple global average pooling.
4. The method of claim 3, wherein the specific process of training the object recognition model is: the model is trained with the deep learning software library TensorFlow; the preprocessed pictures are input as training samples; parameters such as the learning rate and number of iterations are set; and model training is performed to finally obtain the Inception V3 model with the best training effect.
5. The method of claim 4, wherein entities and picture tags are associated in the entity link library as follows: after the entity library is obtained, its entities are associated with the picture entities through manual labeling; after manual labeling, an entity-picture label library is obtained.
6. The method of claim 5, wherein the entity linking further comprises displaying the entity linking result after the entity's text result is obtained, the retrieval result being presented through a visual display or voice broadcast.
7. A multi-modal entity linking system, comprising the following modules:
an object recognition model generation module, which collects and labels pictures, preprocesses the collected and labeled pictures, constructs an object recognition model, and trains the object recognition model;
an entity link library generation module, which acquires an entity corpus and associates entities with picture tags to obtain an entity link library;
an entity linking module, which preprocesses a picture obtained by shooting, inputs it into the object recognition model to obtain an object recognition result, and looks up the recognition result in the entity link library to obtain the entity's text result.
8. A readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method of any one of claims 1-6.
9. A multi-modal entity link generation device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to realize the steps of the method of any one of claims 1-6.
10. The device of claim 9, further comprising an intelligent desk lamp, wherein the memory and the processor are embedded in the intelligent desk lamp, and the intelligent desk lamp comprises a sound pickup device.
CN201911101194.3A 2019-11-12 2019-11-12 Multi-mode entity linking method and entity linking system Pending CN110991149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911101194.3A CN110991149A (en) 2019-11-12 2019-11-12 Multi-mode entity linking method and entity linking system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911101194.3A CN110991149A (en) 2019-11-12 2019-11-12 Multi-mode entity linking method and entity linking system

Publications (1)

Publication Number Publication Date
CN110991149A true CN110991149A (en) 2020-04-10

Family

ID=70083926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911101194.3A Pending CN110991149A (en) 2019-11-12 2019-11-12 Multi-mode entity linking method and entity linking system

Country Status (1)

Country Link
CN (1) CN110991149A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416384A (en) * 2018-03-05 2018-08-17 苏州大学 A kind of image tag mask method, system, equipment and readable storage medium storing program for executing
CN108509420A (en) * 2018-03-29 2018-09-07 赵维平 Gu spectrum and ancient culture knowledge mapping natural language processing method
CN109002834A (en) * 2018-06-15 2018-12-14 东南大学 Fine granularity image classification method based on multi-modal characterization
CN109034248A (en) * 2018-07-27 2018-12-18 电子科技大学 A kind of classification method of the Noise label image based on deep learning
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Min Lin, Qiang Chen, Shuicheng Yan: "Network in Network", arXiv:1312.4400 [cs.NE] *
C. Szegedy, W. Liu, Y. Jia, et al.: "Going Deeper with Convolutions", 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
Jiang Xinmeng: "Research on the Application of Convolutional Neural Networks Based on TensorFlow", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology *
Wang Meng: "Research on Entity Linking Methods and Implementation of an Entity Linking System in the Information Security Field", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, 2018, No. 12 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system
CN111949815A (en) * 2020-08-17 2020-11-17 西南石油大学 Artificial intelligent corrosion material selection system based on image recognition technology and working method
CN112256951A (en) * 2020-09-09 2021-01-22 青岛大学 Intelligent household seedling planting system
CN112163109A (en) * 2020-09-24 2021-01-01 中国科学院计算机网络信息中心 Entity disambiguation method and system based on picture
CN112347768A (en) * 2020-10-12 2021-02-09 出门问问(苏州)信息科技有限公司 Entity identification method and device
CN113672092A (en) * 2021-08-26 2021-11-19 南京邮电大学 VR live-action teaching model big data teaching knowledge mining method and system

Similar Documents

Publication Publication Date Title
CN110837579B (en) Video classification method, apparatus, computer and readable storage medium
CN110119786B (en) Text topic classification method and device
CN110991149A (en) Multi-mode entity linking method and entity linking system
CN113312500B (en) Method for constructing event map for safe operation of dam
CN109271493B (en) Language text processing method and device and storage medium
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN113010702B (en) Interactive processing method and device for multimedia information, electronic equipment and storage medium
CN112995690B (en) Live content category identification method, device, electronic equipment and readable storage medium
CN115203338A (en) Label and label example recommendation method
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN114329181A (en) Question recommendation method and device and electronic equipment
CN113011126A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN117709866A (en) Method and system for generating bidding document and computer readable storage medium
Jishan et al. Bangla language textual image description by hybrid neural network model
CN117216535A (en) Training method, device, equipment and medium for recommended text generation model
CN114676705B (en) Dialogue relation processing method, computer and readable storage medium
CN113569068B (en) Descriptive content generation method, visual content encoding and decoding method and device
CN114491209A (en) Method and system for mining enterprise business label based on internet information capture
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN116383426B (en) Visual emotion recognition method, device, equipment and storage medium based on attribute
Ullah et al. A review of multi-modal learning from the text-guided visual processing viewpoint
US11354894B2 (en) Automated content validation and inferential content annotation
CN116523041A (en) Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment
Alberola et al. Artificial Vision and Language Processing for Robotics: Create end-to-end systems that can power robots with artificial vision and deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200410