WO2022198854A1 - Method and apparatus for extracting multi-modal POI feature - Google Patents

Publication number: WO2022198854A1
Application number: PCT/CN2021/107383
Authority: WIPO (PCT)
Prior art keywords: poi, feature representation, feature, sample, image
Other languages: French (fr), Chinese (zh)
Inventors: Fan Miao (范淼), Huang Jizhou (黄际洲), Wang Haifeng (王海峰)
Original assignee: Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)
Priority applications: KR1020227044369A (published as KR20230005408A), JP2022576469A (published as JP2023529939A)

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F16/00 Information retrieval, database structures and file system structures therefor
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/29 Geographical information databases
    • G06F16/908 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/909 Retrieval characterised by using metadata using geographical or spatial information, e.g. location
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Definitions

  • the present disclosure relates to the field of computer application technology, and in particular to big data technology in the field of artificial intelligence.
  • POI: Point of Interest.
  • the number of POIs represents the value of the entire system to a certain extent.
  • Comprehensive POI information is the necessary information to enrich the map information system.
  • each POI includes at least multiple modal information, such as name, coordinates, and image.
  • the digital medium and presentation of this information vary. For example, names are generally text in a certain language, coordinates are generally numbers in at least two dimensions, and images are in the form of images. Therefore, a multimodal POI refers to a physical entity described by multiple digital media.
  • the POI information is stored in a relational database.
  • the POI information needs to be queried from the relational database. This requires the ability to quickly calculate the similarity between multi-modal POIs; since similarity calculation is based on POI features, how to extract POI features becomes the key.
  • the present disclosure provides a method and apparatus for extracting multimodal POI features.
  • a method for extracting multimodal POI features including:
  • extracting a visual feature representation of the POI from an image of the POI by using an image feature extraction model; extracting a semantic feature representation from text information of the POI by using a text feature extraction model; extracting a spatial feature representation from spatial location information of the POI by using a spatial feature extraction model; and fusing the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
  • a device for extracting multimodal POI features comprising:
  • a visual feature extraction module for extracting the visual feature representation of the POI from the image of the POI by using an image feature extraction model
  • a semantic feature extraction module for extracting semantic feature representation from the text information of the POI by using a text feature extraction model
  • a spatial feature extraction module for extracting a spatial feature representation from the spatial location information of the POI by using a spatial feature extraction model
  • the feature fusion module is used for fusing the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
  • an electronic device comprising:
  • at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to perform the method as described above.
  • a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
  • the embodiments of the present disclosure provide a method to extract feature vector representations of multiple modal fusions for each POI, thereby providing a basis for subsequent similarity calculation between POIs.
  • FIG. 1 is a flowchart of a method for extracting multimodal POI features provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a training image feature extraction model provided by an embodiment of the present disclosure
  • FIG. 3 is a training flow chart of a fully connected network provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of an apparatus for extracting multi-modal POI features provided by an embodiment of the present disclosure
  • FIG. 5 is a block diagram of an electronic device used to implement embodiments of the present disclosure.
  • the similarity calculation is usually performed separately on the images of two POIs, on their names, and on their coordinates. That is to say, the similarity of the features of each modality must be calculated independently, which is computationally complex and time-consuming.
  • the core idea of the present disclosure is to extract feature representations fused by multiple modalities for each POI, so as to provide a basis for subsequent similarity calculation between POIs. The method provided by the present disclosure will be described in detail below with reference to the embodiments.
  • FIG. 1 is a flowchart of a method for extracting multi-modal POI features according to an embodiment of the present disclosure, and the execution body of the method is an apparatus for extracting multi-modal POI features.
  • the device can be embodied as an application located on the server side, or as a functional unit such as a plug-in or a software development kit (SDK) within an application located on the server side, or can also be located in a terminal with strong computing power, which is not particularly limited in this embodiment of the present disclosure.
  • the method may include the following steps:
  • In step 101, a visual feature representation of the POI is extracted from the image of the POI using an image feature extraction model.
  • In step 102, a semantic feature representation is extracted from the text information of the POI using a text feature extraction model.
  • In step 103, a spatial feature representation is extracted from the spatial location information of the POI using a spatial feature extraction model.
  • In step 104, the visual feature representation, semantic feature representation and spatial feature representation of the POI are fused to obtain a multimodal feature representation of the POI.
  • Steps 101 to 103 shown in the above embodiment follow only one possible execution order; they may also be executed sequentially in other orders, or in parallel.
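Taken together, steps 101 to 104 can be sketched as a simple pipeline. Everything below is illustrative: the extractor functions, vector dimensions, and plain list concatenation are stand-ins for the trained models described later, not the patent's actual implementation.

```python
def extract_visual(image_pixels):
    # stand-in for the image feature extraction model (step 101)
    return [sum(image_pixels) / len(image_pixels)] * 4

def extract_semantic(text):
    # stand-in for the text feature extraction model (step 102)
    return [float(len(text))] * 3

def extract_spatial(lat, lon):
    # stand-in for the spatial feature extraction model (step 103)
    return [lat, lon]

def extract_multimodal_poi_feature(image_pixels, text, lat, lon):
    # step 104: fuse (here, simply concatenate) the three representations
    return (extract_visual(image_pixels)
            + extract_semantic(text)
            + extract_spatial(lat, lon))

feature = extract_multimodal_poi_feature([0.2, 0.4, 0.6], "Baidu Building", 40.056, 116.308)
```

The result is a single fused vector per POI, which is what makes a single similarity computation between two POIs possible.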
  • step 101 ie "using an image feature extraction model to extract a visual feature representation of POI from an image containing a POI signboard" will be described in detail.
  • the image in the POI information is usually the image containing the POI signboard.
  • For example, a real-scene picture of a store contains the store's signboard, which usually shows the store's name and sometimes also its slogan.
  • As another example, a real-scene picture of a building contains the building's signboard, which usually shows the building's name.
  • As another example, a real-scene picture of a school contains the school's signboard, which usually shows the school's name.
  • a visual feature representation can be extracted from an image containing the shape of the main body of the building. Images of these POIs can be obtained from the POI database.
  • this step may specifically include the following steps S11 to S12:
  • step S11 the signboard area is extracted from the image containing the POI signboard using the object detection technique.
  • a pre-trained signboard discrimination model can be used. The real-scene image is first segmented into regions; since a signboard is generally a closed area in the image, the closed areas can be identified and segmented out. Each identified closed area is then input into the signboard discrimination model, which outputs a judgment of whether that closed area is a signboard area.
  • the signboard discrimination model is actually a classification model. Some real-scene images can be collected in advance, with signboard areas and non-signboard areas annotated as positive and negative samples respectively; the signboard discrimination model is then obtained by training the classification model on these samples.
  • step S12 the visual feature representation of POI is extracted from the signboard area by using the image feature extraction model obtained by pre-training.
  • the image feature extraction model can be pre-trained based on a deep neural network. After the signboard area is input into the image feature extraction model, the image feature extraction model extracts the visual feature representation of POI from the signboard area.
  • Training samples can be obtained first.
  • the training sample used for training the image feature extraction model is referred to as the first training sample.
  • the expressions such as “first” and “second” involved in the present disclosure do not have a limiting effect on quantity, order, size, etc., but are only used to distinguish names.
  • the above-mentioned first training samples include image samples and category labels of the image samples.
  • the category annotation can be the object depicted in the image; for example, an image containing a cat is annotated as "cat", and an image containing a dog is annotated as "dog".
  • the category annotation can also be the category of the object represented by the image. For example, an image containing a specific hospital is marked as a hospital, and an image containing a specific school is marked as a school.
  • the image samples are then used as the input of the deep neural network, as shown in Figure 2, and the category annotations of the image samples are used as the target output of the classification network.
  • two networks are involved, that is, a deep neural network and a classification network.
  • the deep neural network extracts the visual feature representation from the image sample and then inputs it into the classification network, and the classification network outputs the classification result of the image sample according to the visual feature representation.
  • the training objective is to minimize the difference between the classification results output by the classification network and the corresponding class labels.
  • after training, the trained deep neural network is taken as the image feature extraction model. That is to say, both a deep neural network and a classification network are used during training, but the final image feature extraction model consists only of the deep neural network; the classification network merely assists the training of the deep neural network.
  • the deep neural network used in the above training process can be, but is not limited to, ResNet (Residual Network) 50, ResNet101, EfficientNet, and the like.
  • the loss function adopted by the classification network can be, but is not limited to, Large-Softmax, A-Softmax, AM-Softmax, CosFace, ArcFace, etc.
  • step 102 ie "using a text feature extraction model to extract semantic feature representation from text information of POI" will be described in detail.
  • the text information of the POI involved in this step may be text information obtained from the POI database, such as the POI name, description information, and evaluation information. It may also be text information of the POI recognized, using text recognition technology, from the image containing the POI signboard. That is, after the signboard area is identified from the image containing the POI signboard, OCR (Optical Character Recognition) is used to recognize text from the signboard area, such as the POI name and advertising slogans, as the text information of the POI.
  • the text feature extraction model utilized in this step can adopt, but is not limited to, the following:
  • the first is the Word Embedding model.
  • Word Embedding models such as Word2Vec (word vectors), GloVe, etc. can be used.
  • the second is a pre-trained language model.
  • pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) and ERNIE (Enhanced Representation through kNowledge IntEgration, which uses entity information to enhance language representation) can be used.
  • the third is to use the existing POI text data to fine-tune the pre-trained language model.
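A minimal sketch of the first option (Word Embedding): represent POI text as the average of its word vectors. The tiny vector table and three dimensions are invented for illustration; a real system would load pre-trained Word2Vec or GloVe vectors.

```python
# toy word-vector table (hypothetical values; real systems load Word2Vec/GloVe)
EMBED = {
    "coffee": [0.9, 0.1, 0.0],
    "shop":   [0.7, 0.2, 0.1],
    "school": [0.0, 0.1, 0.9],
}

def text_feature(text, dim=3):
    # average the word vectors; unknown words map to zero vectors
    vecs = [EMBED.get(w, [0.0] * dim) for w in text.lower().split()]
    if not vecs:
        return [0.0] * dim
    return [sum(component) / len(vecs) for component in zip(*vecs)]

f = text_feature("Coffee Shop")
```

Averaging is the simplest pooling choice; a pre-trained language model (the second and third options) would instead produce a contextual sentence representation.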
  • step 103 that is, "using a spatial feature extraction model to extract a spatial feature representation from the spatial location information of the POI" will be described in detail below.
  • the spatial position information of the POI involved in this step mainly refers to the information for marking the spatial position of the POI in a certain form, such as coordinate information.
  • the spatial feature representation can be extracted directly from the spatial location information of the POI by using the spatial feature extraction model.
  • the present disclosure provides a preferred embodiment, which may specifically include the following steps S21 to S22:
  • step S21 hash coding is performed on the spatial location information of the POI to obtain a hash code.
  • geohash: a latitude and longitude address encoding.
  • geohash uses a string to represent a pair of longitude and latitude coordinates. After geohash encoding, the hash codes of two coordinates located in the same block share the same leading characters and differ only in the last few characters.
  • step S22 the hash code is converted into a spatial feature representation using a spatial feature extraction model.
  • the spatial feature extraction model used in this step can use the Word Embedding model, that is, the hash code is converted into a quantifiable spatial feature representation by this embedding method.
  • a similarity task can be used to further train the model.
  • the training target is: the closer the positions of two POIs, the higher the similarity between the spatial feature representations output by the Word Embedding model.
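Step S21 can be illustrated with a minimal geohash encoder (a sketch of the standard algorithm, not the patent's code; the Word Embedding conversion of step S22 is omitted). Each base-32 character refines the previous cell, so nearby coordinates share a common prefix:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=11):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, bit_count = [], 0, 0
    even = True  # even bit positions refine longitude, odd ones latitude
    while len(code) < precision:
        rng = lon_range if even else lat_range
        val = lon if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits = bits << 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:          # every 5 bits become one base-32 character
            code.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(code)

h = geohash_encode(57.64911, 10.40744)   # a commonly cited test point
```

Because `geohash_encode(lat, lon, 6)` is always a prefix of `geohash_encode(lat, lon, 11)`, coordinates in the same block agree on their leading characters, which is exactly the property the embedding step exploits.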
  • step 104 namely, "merging the visual feature representation, semantic feature representation and spatial feature representation of POI to obtain a multi-modal feature representation of POI" will be described in detail below.
  • the visual feature representation, semantic feature representation and spatial feature representation of the above POI can be directly spliced, and the spliced feature can be used as the multi-modal feature representation of the POI.
  • this method is relatively rigid and lacks learning ability, so its expressiveness is naturally limited.
  • the present disclosure provides a preferred fusion method, which may specifically include the following steps S31 to S32:
  • step S31 the visual feature representation, semantic feature representation and spatial feature representation of the POI are spliced to obtain spliced features.
  • the visual feature representation, the semantic feature representation and the spatial feature representation can be spliced end to end in a preset order.
  • when the dimensions of the feature vectors differ, a preset value such as 0 can be used to pad them.
  • step S32 the spliced feature is input into a pre-trained fully connected network, and the multimodal feature representation of the POI output by the fully connected network is obtained.
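Steps S31 and S32 amount to a concatenation followed by one learned layer. In this sketch the per-modality vectors and the fully connected weights are random stand-ins (in practice the weights come from the training procedure below), and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# per-modality representations (illustrative dimensions)
visual = rng.normal(size=4)    # from the image feature extraction model
semantic = rng.normal(size=3)  # from the text feature extraction model
spatial = rng.normal(size=2)   # from the spatial feature extraction model

# step S31: splice the three representations end to end in a preset order
spliced = np.concatenate([visual, semantic, spatial])   # 9-dim vector

# step S32: a fully connected layer (random here, pre-trained in practice)
W = rng.normal(size=(9, 6))
b = np.zeros(6)
multimodal = np.maximum(spliced @ W + b, 0.0)           # ReLU activation
```

Unlike raw concatenation, the fully connected layer has trainable weights, which is what gives this fusion method its learning ability.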
  • the process may include the following steps:
  • a second training sample is obtained, where the second training sample includes POI samples and category labels for the POI samples.
  • Some POIs with image, text and spatial location information can be obtained in advance as POI samples, and the categories of these POIs can be annotated.
  • the labels are hospitals, buildings, schools, bus stops, shops, etc. These POI samples and their category labels are used as second training samples to train the fully connected network used in feature fusion.
  • a visual feature representation of the POI sample is extracted from the image of the POI sample using an image feature extraction model.
  • a textual feature extraction model is used to extract semantic feature representations from the textual information of the POI samples.
  • a spatial feature representation is extracted from the spatial location information of the POI samples using a spatial feature extraction model.
  • Similarly, steps 302 to 304 follow only one possible execution order; they may also be executed sequentially in other orders, or in parallel.
  • the visual feature representation, the semantic feature representation and the spatial feature representation of the POI samples are spliced to obtain splicing features of the POI samples.
  • the spliced feature of the POI sample is input into the fully connected network, and the multimodal feature representation of the POI sample output by the fully connected network is obtained; the multimodal feature representation is input into the classification network, and the category label of the POI sample is used as the target output of the classification network to train the fully connected network and the classification network.
  • the loss function used by the classification network can be, but is not limited to, Large-Softmax, A-Softmax, AM-Softmax, CosFace, ArcFace, etc.
  • the model parameters of the image feature extraction model, the text feature extraction model and the spatial feature extraction model can remain unchanged, or can be updated in the above training process.
  • a multimodal feature representation is obtained for each POI in the manner of the above method embodiment, and the multimodal feature representation of each POI can be stored in a database.
  • the multimodal feature representation of POIs can be used to calculate the similarity between POIs. Specific application scenarios may include, but are not limited to, automated POI production, intelligent retrieval, recommendation, and so on.
  • a collector or a collection device shoots an image containing a POI signboard, and saves the POI's image, name, coordinates and other information.
  • the massive POI data collected historically is extracted and stored in a database using the method in the above-mentioned embodiments of the present disclosure.
  • distributed Redis can be used as the feature library for the multi-modal feature representations.
  • the storage structure can take the form of key (key)-value (value) pairs.
  • for a newly collected POI, the method in the above-mentioned embodiments of the present disclosure is also used to extract its multi-modal feature representation, which is then used for retrieval and matching in the feature database, for example with retrieval methods such as NN (Nearest Neighbor) and ANN (Approximate Nearest Neighbor).
  • the retrieval process is based on calculating the similarity between the multimodal feature representation of the newly collected POI and the multimodal feature representations of existing POIs in the database, so as to judge whether the newly collected POI data belongs to an existing POI. POI data that is not retrieved and matched, or that cannot be processed automatically due to unrecognized text, insufficient image clarity, wrong coordinates, and so on, is submitted to a manual operation platform for processing.
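The storage-and-matching flow above can be sketched as follows. The POI ids, vectors, and 0.9 threshold are invented for illustration, and the exhaustive cosine-similarity scan is a stand-in for the Redis-backed NN/ANN retrieval described above.

```python
import math

# toy feature library keyed by POI id (stand-in for the Redis key-value store)
feature_db = {
    "poi:coffee_shop_001": [0.9, 0.1, 0.3],
    "poi:school_017":      [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def match(new_feature, threshold=0.9):
    # exhaustive nearest-neighbour scan; production systems use ANN indexes
    best_id = max(feature_db, key=lambda k: cosine(new_feature, feature_db[k]))
    best_sim = cosine(new_feature, feature_db[best_id])
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

poi_id, sim = match([0.88, 0.12, 0.28])  # a newly collected POI's fused feature
```

When `match` returns `None`, the new record would go to the manual operation platform, mirroring the fallback described above.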
  • FIG. 4 is a schematic diagram of an apparatus for extracting multi-modal POI features provided by an embodiment of the present disclosure.
  • the apparatus may include: a visual feature extraction module 401 , a semantic feature extraction module 402 , a spatial feature extraction module 403 and
  • the feature fusion module 404 may further include a first model training unit 405 , a text acquisition unit 406 , a second model training unit 407 and a similarity calculation unit 408 .
  • the main functions of each unit are as follows:
  • the visual feature extraction module 401 is used for extracting the visual feature representation of the POI from the image of the POI by using the image feature extraction model.
  • the semantic feature extraction module 402 is used for extracting the semantic feature representation from the text information of the POI by using the text feature extraction model.
  • the spatial feature extraction module 403 is used for extracting the spatial feature representation from the spatial location information of the POI by using the spatial feature extraction model.
  • the feature fusion module 404 is used to fuse the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multimodal feature representation of the POI.
  • the visual feature extraction module 401 can use the target detection technology to extract the signboard area from the image containing the POI signboard; use the image feature extraction model obtained by pre-training to extract the visual feature representation of the POI from the signboard area.
  • the first model training unit 405 is used to pre-train the image feature extraction model in the following manner: obtain first training samples, each including an image sample and a category label for the image sample; take the image sample as the input of a deep neural network and the category label of the image sample as the target output of a classification network, and train the deep neural network and the classification network; the deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs the classification result of the image sample according to the visual feature representation.
  • after training, the trained deep neural network is taken as the image feature extraction model.
  • the text obtaining unit 406 is configured to obtain the text information of the POI from the POI database; and/or, use the text recognition technology to recognize and obtain the text information of the POI from the image containing the POI signboard.
  • the text feature extraction model may include, but is not limited to: a Word Embedding model, a pre-trained language model, or a model obtained by fine-tuning the pre-trained language model with existing POI text data.
  • the spatial feature extraction module 403 is specifically configured to perform hash coding on the spatial location information of the POI to obtain a hash code; and use a spatial feature extraction model to convert the hash code into a spatial feature representation.
  • the spatial feature extraction model may include the Word Embedding model.
  • the feature fusion module 404 can be specifically used to splice the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain a spliced feature, and to input the spliced feature into a pre-trained fully connected network to obtain the multimodal feature representation of the POI output by the fully connected network.
  • the second model training unit 407 is used for pre-training to obtain a fully connected network in the following manner:
  • obtain second training samples, each including a POI sample and the category label of the POI sample; use the image feature extraction model to extract the visual feature representation of the POI sample from the image of the POI sample; use the text feature extraction model to extract the semantic feature representation from the text information of the POI sample; use the spatial feature extraction model to extract the spatial feature representation from the spatial location information of the POI sample; splice the visual feature representation, semantic feature representation and spatial feature representation of the POI sample to obtain the spliced feature of the POI sample; input the spliced feature of the POI sample into the fully connected network to obtain the multimodal feature representation of the POI sample output by the fully connected network; and input the multimodal feature representation into the classification network, using the category label of the POI sample as the target output of the classification network, to train the fully connected network and the classification network.
  • the similarity calculation unit 408 is configured to calculate the similarity between POIs based on the multimodal feature representation of POIs.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 5 it is a block diagram of an electronic device according to an embodiment of the present disclosure.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 500 includes a computing unit 501 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500.
  • the computing unit 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to bus 504 .
  • Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disk; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • Computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various specialized artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
  • the computing unit 501 performs the various methods and processes described above, such as the extraction method of multimodal POI features.
  • the extraction method of multimodal POI features may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508 .
  • part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509.
  • When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method for extracting multimodal POI features described above may be performed.
  • the computing unit 501 may be configured to perform the extraction method of multimodal POI features by any other suitable means (eg, by means of firmware).
  • Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on the remote machine or server.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • The systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein may be implemented on a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user may interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
  • A computer system can include clients and servers.
  • Clients and servers are generally remote from each other and usually interact through a communication network.
  • The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for extracting a multi-modal POI feature, which relate to big data technology in the field of artificial intelligence. The method comprises: extracting a visual feature representation of a POI from an image of the POI by using an image feature extraction model; extracting a semantic feature representation from text information of the POI by using a text feature extraction model; extracting a spatial feature representation from spatial position information of the POI by using a spatial feature extraction model; and fusing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI, so as to obtain a multi-modal feature representation of the POI. By means of the method, a feature vector representation that fuses multiple modalities is extracted for each POI, thereby providing a basis for subsequent calculation of the similarity between POIs.

Description

Method and Apparatus for Extracting Multimodal POI Features

This application claims priority to Chinese patent application No. 202110312700.4, filed on March 24, 2021, and entitled "Method and Apparatus for Extracting Multimodal POI Features".
Technical Field

The present disclosure relates to the field of computer application technology, and in particular to big data technology in the field of artificial intelligence.

Background

A POI (Point of Interest) in a geographic information system may be an actually existing geographic entity such as a building, a shop, a school, or a bus station. For a geographic information system, the number of POIs represents, to a certain extent, the value of the entire system. Comprehensive POI information is essential for enriching a map information system. Generally speaking, each POI includes information of at least multiple modalities, such as a name, coordinates, and an image. The digital media and forms of expression of these kinds of information differ: a name is generally text in some language, coordinates are generally numbers in at least two dimensions, and an image is in image form. A multimodal POI therefore refers to a physical entity described by multiple kinds of digital media.

POI information is usually stored in a relational database, and in many application scenarios POI information needs to be queried from that database. This requires the ability to quickly compute the similarity between multimodal POIs, and similarity computation is in turn based on POI features, so how to extract POI features becomes the key problem.
Summary of the Invention

In view of this, the present disclosure provides a method and apparatus for extracting multimodal POI features.
According to a first aspect of the present disclosure, a method for extracting multimodal POI features is provided, including:

extracting a visual feature representation of a POI from an image of the POI by using an image feature extraction model;

extracting a semantic feature representation from text information of the POI by using a text feature extraction model;

extracting a spatial feature representation from spatial location information of the POI by using a spatial feature extraction model; and

fusing the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multimodal feature representation of the POI.
According to a second aspect of the present disclosure, an apparatus for extracting multimodal POI features is provided, including:

a visual feature extraction module configured to extract a visual feature representation of a POI from an image of the POI by using an image feature extraction model;

a semantic feature extraction module configured to extract a semantic feature representation from text information of the POI by using a text feature extraction model;

a spatial feature extraction module configured to extract a spatial feature representation from spatial location information of the POI by using a spatial feature extraction model; and

a feature fusion module configured to fuse the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multimodal feature representation of the POI.
According to a third aspect of the present disclosure, an electronic device is provided, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method described above.

According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program which, when executed by a processor, implements the method described above.

It can be seen from the above technical solutions that the embodiments of the present disclosure provide a method that extracts, for each POI, a feature vector representation fusing multiple modalities, thereby providing a basis for subsequent similarity computation between POIs.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings

The accompanying drawings are used for a better understanding of the present solution and do not constitute a limitation of the present disclosure, wherein:
Fig. 1 is a flowchart of a method for extracting multimodal POI features provided by an embodiment of the present disclosure;

Fig. 2 is a schematic diagram of training an image feature extraction model provided by an embodiment of the present disclosure;

Fig. 3 is a flowchart of training a fully connected network provided by an embodiment of the present disclosure;

Fig. 4 is a schematic diagram of an apparatus for extracting multimodal POI features provided by an embodiment of the present disclosure; and

Fig. 5 is a block diagram of an electronic device used to implement embodiments of the present disclosure.
Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding; they should be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

In existing traditional similarity computation approaches, similarity is usually computed separately for the images of two POIs, for the names of the two POIs, and for the coordinates of the two POIs. That is, similarity must be computed separately for the features of each modality, which is computationally complex and time-consuming. To address this problem, the core idea of the present disclosure is to extract, for each POI, a feature representation fusing multiple modalities, thereby providing a basis for subsequent similarity computation between POIs. The method provided by the present disclosure is described in detail below with reference to embodiments.
Fig. 1 is a flowchart of a method for extracting multimodal POI features provided by an embodiment of the present disclosure. The method is executed by an apparatus for extracting multimodal POI features. The apparatus may be embodied as an application located on a server side, or as a functional unit such as a plug-in or software development kit (SDK) in an application located on the server side, or may be located in a computer terminal with strong computing power, which is not particularly limited in this embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
In 101, a visual feature representation of a POI is extracted from an image of the POI by using an image feature extraction model.

In 102, a semantic feature representation is extracted from text information of the POI by using a text feature extraction model.

In 103, a spatial feature representation is extracted from spatial location information of the POI by using a spatial feature extraction model.

In 104, the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI are fused to obtain a multimodal feature representation of the POI.

Steps 101 to 103 shown in the above embodiment represent only one possible execution order; they may also be executed sequentially in other orders, or executed in parallel.
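The data flow of steps 101 to 104 can be sketched as follows. This is a minimal illustration only: the three extractor functions are stubs with assumed output dimensions (128/64/32), and fusion is shown as plain concatenation rather than the trained fully connected network described later.

```python
import numpy as np

# Stand-ins for the three extraction models of steps 101-103. The dimensions
# and constant outputs are illustrative assumptions, not the real models.
def extract_visual(image: np.ndarray) -> np.ndarray:
    return np.ones(128)   # visual feature representation (step 101)

def extract_semantic(text: str) -> np.ndarray:
    return np.ones(64)    # semantic feature representation (step 102)

def extract_spatial(lat: float, lng: float) -> np.ndarray:
    return np.ones(32)    # spatial feature representation (step 103)

def extract_multimodal_poi_feature(image, text, lat, lng) -> np.ndarray:
    v = extract_visual(image)
    s = extract_semantic(text)
    p = extract_spatial(lat, lng)
    # Step 104: fuse the three representations (here by simple concatenation).
    return np.concatenate([v, s, p])

feat = extract_multimodal_poi_feature(np.zeros((224, 224, 3)), "POI name", 39.9, 116.4)
print(feat.shape)  # (224,)
```

A single fused vector per POI is what makes downstream POI-to-POI similarity a single vector comparison instead of three per-modality comparisons.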
Each of the above steps is described in detail below with reference to embodiments. First, the above step 101, i.e., "extracting a visual feature representation of the POI from an image containing a POI signboard by using an image feature extraction model", is described in detail.

The image in the POI information is usually an image containing a POI signboard. For example, a real-scene photo of a shop contains the shop's signboard, which usually includes the shop's name and sometimes its advertising slogan. Similarly, a real-scene photo of a building contains the building's signboard, which is usually the building's name, and a real-scene photo of a school contains the school's signboard, which is usually the school's name. Such images containing POI signboards are highly distinctive within the POI information, so as a preferred embodiment, the present disclosure may extract the visual feature representation of the POI from an image containing the POI signboard.

In addition to extracting the visual feature representation of the POI from an image containing a POI signboard, it may also be extracted from other types of POI images. For example, for a building-type POI with a distinctive shape, the visual feature representation may be extracted from an image containing the shape of the building's main body. These POI images can be obtained from a POI database.
As one preferred embodiment, this step may specifically include the following steps S11 to S12.

In step S11, a signboard region is extracted from the image containing the POI signboard by using an object detection technique.

In this step, object detection techniques such as YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), or Faster R-CNN (Faster Region-based Convolutional Neural Networks) may be used to identify the signboard region in the image containing the POI signboard; on top of these object detection techniques, optimizations such as FPN (Feature Pyramid Networks) may further be combined. These object detection methods are relatively mature technologies and are not described in detail here.
In addition to object detection techniques, other approaches may be used to extract the signboard region. For example, a pre-trained signboard discrimination model may be used. The real-scene image is first divided into regions: since a signboard in a real-scene image is generally a closed region, closed regions can be identified and segmented in the image, each identified closed region is input into the signboard discrimination model, and the model outputs a judgment of whether the closed region is a signboard region.

The signboard discrimination model is actually a classification model. Some real-scene images can be collected in advance, with signboard regions and non-signboard regions annotated as positive and negative samples respectively, and the classification model is then trained to obtain the signboard discrimination model.
In step S12, the visual feature representation of the POI is extracted from the signboard region by using a pre-trained image feature extraction model.

The image feature extraction model may be pre-trained based on a deep neural network. After the signboard region is input into the image feature extraction model, the model extracts the visual feature representation of the POI from the signboard region.
The training process of the image feature extraction model is described below. Training samples are obtained first. In this embodiment, the training samples used to train the image feature extraction model are referred to as first training samples. It should be noted that expressions such as "first" and "second" in the present disclosure do not imply any limitation on quantity, order, or size, and are used only to distinguish names.

The first training samples include image samples and category labels of the image samples. A category label may be the object depicted in the image; for example, an image containing a cat is labeled as cat, and an image containing a dog is labeled as dog. A category label may also be the type of the depicted object; for example, an image containing a specific hospital is labeled as hospital, and an image containing a specific school is labeled as school.

The image samples are then used as the input of a deep neural network, and, as shown in Fig. 2, the category labels of the image samples are used as the target output of a classification network. The training of the image feature extraction model in this embodiment involves two networks: a deep neural network and a classification network. The deep neural network extracts a visual feature representation from an image sample and inputs it into the classification network, which outputs a classification result for the image sample based on the visual feature representation. The training objective is to minimize the difference between the classification results output by the classification network and the corresponding category labels. After training ends, for example when the value of the loss function falls below a preset threshold or the number of training iterations reaches a preset threshold, the image feature extraction model is obtained from the trained deep neural network. That is, both the deep neural network and the classification network are used during training, but the final image feature extraction model uses only the deep neural network; the classification network serves to assist the training of the deep neural network.

The deep neural network used in the above training process may be, but is not limited to, ResNet (Residual Network)-50, ResNet-101, EfficientNet, and the like. The loss function used by the classification network may be, but is not limited to, Large-Softmax, A-Softmax, AM-Softmax, CosFace, ArcFace, and the like.
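The "train with a classification head, keep only the backbone" pattern above can be sketched at toy scale. This is a hedged illustration, not the disclosure's implementation: the ResNet/EfficientNet backbone is replaced by a one-layer ReLU network, the margin losses (ArcFace etc.) by plain softmax cross-entropy, and the images by random 16-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image samples": 100 16-dim inputs; the label is a simple function of them.
X = rng.normal(size=(100, 16))
y = (X[:, 0] > 0).astype(int)

W_backbone = rng.normal(scale=0.1, size=(16, 8))   # "deep neural network": 16 -> 8 features
W_head = rng.normal(scale=0.1, size=(8, 2))        # classification network: 8 -> 2 classes

def forward(X):
    feats = np.maximum(X @ W_backbone, 0.0)        # visual feature representation (ReLU)
    logits = feats @ W_head
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return feats, probs

losses = []
for _ in range(200):
    feats, probs = forward(X)
    loss = -np.log(probs[np.arange(len(y)), y] + 1e-9).mean()
    losses.append(loss)
    # Backpropagation of softmax cross-entropy through both networks.
    d_logits = probs.copy()
    d_logits[np.arange(len(y)), y] -= 1.0
    d_logits /= len(y)
    dW_head = feats.T @ d_logits
    d_feats = d_logits @ W_head.T
    d_feats[feats <= 0] = 0.0                      # ReLU gradient
    dW_backbone = X.T @ d_feats
    W_head -= 0.5 * dW_head
    W_backbone -= 0.5 * dW_backbone

# After training, the classification head is discarded; the backbone alone
# plays the role of the image feature extraction model.
extract_visual = lambda x: np.maximum(x @ W_backbone, 0.0)
print(losses[0] > losses[-1], extract_visual(X[:1]).shape)
```

The key point the sketch shows is that the classification network exists only to shape the backbone's features during training; inference uses the backbone output directly.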
The above step 102, i.e., "extracting a semantic feature representation from the text information of the POI by using a text feature extraction model", is described in detail below.

The text information of the POI involved in this step may be text information obtained from a POI database, such as the POI name, description information, or review information. It may also be text information of the POI recognized from an image containing the POI signboard by using a character recognition technique; that is, after the signboard region is identified in the image containing the POI signboard, OCR (Optical Character Recognition) is used to recognize characters in the signboard region, such as the POI name and advertising slogans, as the text information of the POI.
The text feature extraction model used in this step may be, but is not limited to, one of the following:

First, a word embedding model.

For example, word embedding models such as Word2Vec or GloVe may be used.

Second, a pre-trained language model.

For example, pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) or ERNIE (Enhanced Representation through kNowledge IntEgration) may be used.

Third, a model obtained by fine-tuning a pre-trained language model on existing POI text data.
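The first option (a word embedding model) can be illustrated by averaging word vectors over the POI text. The tiny vocabulary and 4-dimensional vectors below are made up for illustration; in practice real Word2Vec/GloVe vectors, or a pre-trained language model such as BERT/ERNIE, would supply the representation.

```python
import numpy as np

# Hypothetical mini-vocabulary of word vectors (illustrative values only).
EMB = {
    "peking":     np.array([0.9, 0.1, 0.0, 0.2]),
    "duck":       np.array([0.1, 0.8, 0.3, 0.0]),
    "restaurant": np.array([0.0, 0.2, 0.9, 0.1]),
}
UNK = np.zeros(4)  # fallback vector for out-of-vocabulary tokens

def semantic_feature(text: str) -> np.ndarray:
    """Semantic feature representation as the mean of the word vectors."""
    tokens = text.lower().split()
    vecs = [EMB.get(t, UNK) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else UNK

feat = semantic_feature("Peking Duck Restaurant")
print(feat.shape)  # (4,)
```

A pre-trained language model would replace the averaging with a contextual encoder, but the output plays the same role: a fixed-size vector for the POI's text.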
The above step 103, i.e., "extracting a spatial feature representation from the spatial location information of the POI by using a spatial feature extraction model", is described in detail below.

The spatial location information of the POI involved in this step mainly refers to information that marks the spatial position of the POI in a certain form, such as coordinate information. The spatial feature extraction model may be used directly to extract the spatial feature representation from the spatial location information of the POI.

Considering that many POIs are actually very close to each other, that current positioning accuracy can be controlled at the meter level, and that in a map information system it is more desirable to group POIs by block, the present disclosure provides a preferred embodiment, which may specifically include the following steps S21 to S22.
In step S21, the spatial location information of the POI is hash-encoded to obtain a hash code.

Coordinate information may be encoded using, for example, geohash (an encoding of latitude and longitude). Geohash represents the latitude and longitude coordinates as a single string; after geohash encoding, the hash codes of two coordinates located in the same block share the same leading characters and differ only in the trailing characters.
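The shared-prefix property of geohash can be demonstrated with a minimal hand-rolled encoder. This is a sketch for illustration (production code would normally use an existing geohash library); it follows the standard scheme of interleaving longitude and latitude bits, longitude first, into base-32 characters.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash_encode(lat: float, lng: float, precision: int = 8) -> str:
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    code = []
    even = True          # bits are interleaved, starting with longitude
    bit_count, ch = 0, 0
    while len(code) < precision:
        rng_ = lng_range if even else lat_range
        val = lng if even else lat
        mid = (rng_[0] + rng_[1]) / 2
        if val >= mid:                # binary-subdivide the range, emit one bit
            ch = (ch << 1) | 1
            rng_[0] = mid
        else:
            ch <<= 1
            rng_[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:            # every 5 bits become one base-32 character
            code.append(BASE32[ch])
            bit_count, ch = 0, 0
    return "".join(code)

# Two nearby POIs share a long common prefix.
a = geohash_encode(39.9042, 116.4074)   # near Tiananmen, Beijing
b = geohash_encode(39.9150, 116.4040)   # roughly 1 km away
print(a[:4], b[:4])  # both start with "wx4g"
```

The common prefix is what lets an embedding over hash codes (step S22) place POIs in the same block close together in feature space.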
In step S22, the hash code is converted into a spatial feature representation by using the spatial feature extraction model.

The spatial feature extraction model used in this step may be a word embedding model; that is, the hash code is converted into a quantifiable spatial feature representation by way of embedding.

In this embodiment, the word embedding model may be further trained with a similarity task, with the training objective that the closer two POIs are in position, the higher the similarity between the spatial feature representations output by the word embedding model.
The above step 104, i.e., "fusing the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multimodal feature representation of the POI", is described in detail below.

In this step, the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI may be directly concatenated, and the concatenated feature used as the multimodal feature representation of the POI. However, this approach is rather rigid, lacks learning capability, and the resulting representation is naturally less accurate.
Therefore, the present disclosure provides a preferred fusion approach, which may specifically include the following steps S31 to S32.

In step S31, the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI are concatenated to obtain a concatenated feature.

In this step, the visual feature representation, the semantic feature representation, and the spatial feature representation may be concatenated end to end in a preset order. Where the feature representations have different vector dimensions, a preset value such as 0 may be used for padding.

In step S32, the concatenated feature is input into a pre-trained fully connected network, and the multimodal feature representation of the POI output by the fully connected network is obtained.
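Steps S31 and S32 can be sketched as follows. The feature dimensions, the random (untrained) weight matrix, and the tanh activation are illustrative assumptions only; in the disclosure the fully connected network is trained as described next with reference to Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)

visual   = rng.normal(size=128)   # visual feature representation
semantic = rng.normal(size=64)    # semantic feature representation
spatial  = rng.normal(size=32)    # spatial feature representation

# Step S31: concatenate end to end in a preset order (128 + 64 + 32 = 224).
concat = np.concatenate([visual, semantic, spatial])

# Step S32: pass through a fully connected layer to get the fused representation.
W = rng.normal(scale=0.05, size=(224, 96))   # untrained weights, for shape only
b = np.zeros(96)
multimodal = np.tanh(concat @ W + b)

print(concat.shape, multimodal.shape)  # (224,) (96,)
```

Unlike raw concatenation, the learned projection can weight and mix the modalities, which is why the text calls plain concatenation rigid.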
The training process of the above fully connected network is described in detail below. As shown in Fig. 3, the process may include the following steps:

In 301, second training samples are obtained, the second training samples including POI samples and category labels of the POI samples.

Some POIs having image, text, and spatial location information may be obtained in advance as POI samples, and the categories of these POIs annotated, for example as hospital, building, school, bus station, shop, and so on. These POI samples and their category labels are used as the second training samples to train the fully connected network used in feature fusion.
在302中,利用图像特征提取模型从POI样本的图像中提取POI样本的视觉特征表示。At 302, a visual feature representation of the POI sample is extracted from the image of the POI sample using an image feature extraction model.
在303中,利用文本特征提取模型从POI样本的文本信息中提取语义特征表示。In 303, a textual feature extraction model is used to extract semantic feature representations from the textual information of the POI samples.
在304中,利用空间特征提取模型从POI样本的空间位置信息中提取空间特征表示。In 304, a spatial feature representation is extracted from the spatial location information of the POI samples using a spatial feature extraction model.
上述步骤302~步骤304中的特征提取方式参见之前方法实施例中的相关记载，在此不做赘述。同样，示出的步骤302~304仅为其中一种实现顺序，也可以采用其他顺序先后执行，也可以并行执行。For the feature extraction methods in the foregoing steps 302 to 304, reference may be made to the relevant descriptions in the previous method embodiments, and details are not repeated here. Likewise, the order of steps 302 to 304 shown here is only one possible implementation order; they may also be executed sequentially in another order, or executed in parallel.
在305中,将POI样本的视觉特征表示、语义特征表示以及空间特征表示进行拼接,得到POI样本的拼接特征。In 305, the visual feature representation, the semantic feature representation and the spatial feature representation of the POI samples are spliced to obtain splicing features of the POI samples.
在306中，将POI样本的拼接特征输入全连接网络，获取全连接层输出的POI样本的多模态特征表示；将多模态特征表示输入分类网络，将POI样本的类别标注作为分类网络的目标输出，训练全连接网络和分类网络。In 306, the spliced feature of the POI sample is input into the fully connected network, and the multimodal feature representation of the POI sample output by the fully connected layer is obtained; the multimodal feature representation is input into the classification network, the category label of the POI sample is used as the target output of the classification network, and the fully connected network and the classification network are trained.
其中，分类网络采用的损失函数可以采用但不限于Large-Softmax、A-Softmax、AM-Softmax、CosFace、ArcFace等。The loss function used by the classification network may be, but is not limited to, Large-Softmax, A-Softmax, AM-Softmax, CosFace, ArcFace, etc.
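As one illustration of the listed margin-based losses, an AM-Softmax-style loss (cross entropy over scaled cosine logits, with an additive margin subtracted from the target-class cosine similarity) can be sketched as follows. The scale s=30 and margin m=0.35 are commonly used values assumed here for illustration; the disclosure does not fix them.

```python
import numpy as np

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    """AM-Softmax sketch: L2-normalise features and class weights, subtract an
    additive margin m from the target-class cosine, scale by s, then apply
    cross entropy."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w                                  # (batch, num_classes) cosines
    cos[np.arange(len(labels)), labels] -= m     # additive margin on target class
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
loss = am_softmax_loss(rng.normal(size=(4, 8)), rng.normal(size=(8, 3)),
                       np.array([0, 1, 2, 0]))
```

During training, the value of this loss would drive the parameter updates of the fully connected network and the classification network described below.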
在上述训练过程中，重点训练全连接网络和分类网络，利用损失函数的值更新全连接网络和分类网络的参数。对于图像特征提取模型、文本特征提取模型和空间特征提取模型的模型参数，可以保持不变，也可以在上述训练过程中参与更新。In the above training process, the training focuses on the fully connected network and the classification network, and the value of the loss function is used to update the parameters of the fully connected network and the classification network. The model parameters of the image feature extraction model, the text feature extraction model and the spatial feature extraction model may remain unchanged, or may also be updated during the above training process.
在采用上述方法实施例中的方式针对各POI分别得到各POI的多模态特征表示后，可以将各POI的多模态特征表示存储于数据库。POI的多模态特征表示可以用以进行POI之间的相似度计算。具体的应用场景可以包括但不限于POI的自动化生产、智能检索与推荐等。After the multimodal feature representation of each POI is obtained in the manner of the above method embodiments, the multimodal feature representation of each POI can be stored in a database. The multimodal feature representations of POIs can be used to calculate the similarity between POIs. Specific application scenarios may include, but are not limited to, automated POI production, intelligent retrieval and recommendation, and the like.
以POI的自动化生产为例，采集员或采集装置拍摄包含POI招牌的图像，并保存POI的图像、名称、坐标等信息。对于历史采集到的海量POI数据，采用本公开上述实施例中的方式提取多模态特征表示后存储于数据库，例如采用分布式redis作为多模态特征表示的特征库。存储结构可以采用key-value（键-值）对的形式。Taking the automated production of POIs as an example, a collector or a collection device shoots an image containing a POI signboard, and saves the image, name, coordinates and other information of the POI. For the massive POI data collected historically, multimodal feature representations are extracted in the manner of the above embodiments of the present disclosure and stored in a database; for example, distributed redis may be used as the feature library for the multimodal feature representations. The storage structure may take the form of key-value pairs.
对于新采集的POI数据，同样采用本公开上述实施例中的方式提取多模态特征表示，然后利用多模态特征表示在特征库中进行检索匹配，例如采用NN（Nearest Neighbor，最近邻检索）、ANN（Approximate Nearest Neighbor，近似最近邻检索）等检索方式。检索过程基于新采集的POI的多模态特征表示与数据库中已有POI的多模态特征表示之间的相似度的计算，以此判断该新采集的POI数据是否为已有POI的数据。对于一些未检索匹配上的POI数据，或者因为诸如文本无法识别、图像清晰度不足、坐标错误等所引起的自动化无法处理的POI数据，则提交给人工平台进行作业。For newly collected POI data, the multimodal feature representation is likewise extracted in the manner of the above embodiments of the present disclosure, and then used for retrieval and matching in the feature library, for example using retrieval methods such as NN (Nearest Neighbor) and ANN (Approximate Nearest Neighbor) retrieval. The retrieval process is based on calculating the similarity between the multimodal feature representation of the newly collected POI and the multimodal feature representations of existing POIs in the database, so as to judge whether the newly collected POI data belongs to an existing POI. POI data that fails to match, or POI data that cannot be processed automatically due to, for example, unrecognizable text, insufficient image clarity or wrong coordinates, is submitted to a manual platform for processing.
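The store-and-match flow described above can be sketched as follows. A plain Python dict stands in for the distributed redis key-value feature library, and brute-force nearest-neighbor search over cosine similarity stands in for an NN/ANN index; the key format `poi:<id>` and the similarity threshold of 0.9 are illustrative assumptions, not values given in the disclosure.

```python
import numpy as np

feature_store = {}  # key -> multimodal feature; stands in for a distributed redis library

def add_poi(poi_id, feature):
    """Store an L2-normalised multimodal feature under a key-value pair."""
    feature_store[poi_id] = feature / np.linalg.norm(feature)

def match(feature, threshold=0.9):
    """Brute-force nearest neighbor over cosine similarity. A deployment would
    use an ANN index instead of scanning every stored POI."""
    query = feature / np.linalg.norm(feature)
    best_id, best_sim = None, -1.0
    for poi_id, stored in feature_store.items():
        sim = float(query @ stored)
        if sim > best_sim:
            best_id, best_sim = poi_id, sim
    if best_sim >= threshold:
        return best_id, best_sim   # judged to be data of an existing POI
    return None, best_sim          # unmatched: submit to the manual platform

add_poi("poi:1001", np.array([1.0, 0.0, 0.2]))
add_poi("poi:1002", np.array([0.1, 1.0, 0.0]))
matched, sim = match(np.array([0.9, 0.05, 0.2]))  # feature of a newly collected POI
```

The same similarity computation also underlies the retrieval and recommendation scenarios mentioned above.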
以上是对本公开所提供方法进行的详细描述,下面结合实施例对本公开所提供的装置进行详细描述。The above is a detailed description of the method provided by the present disclosure, and the device provided by the present disclosure is described in detail below with reference to the embodiments.
图4为本公开实施例提供的多模态POI特征的提取装置的示意图，如图4中所示，该装置可以包括：视觉特征提取模块401、语义特征提取模块402、空间特征提取模块403和特征融合模块404，还可以进一步包括第一模型训练单元405、文本获取单元406、第二模型训练单元407和相似度计算单元408。其中各组成单元的主要功能如下：FIG. 4 is a schematic diagram of an apparatus for extracting multimodal POI features provided by an embodiment of the present disclosure. As shown in FIG. 4, the apparatus may include: a visual feature extraction module 401, a semantic feature extraction module 402, a spatial feature extraction module 403 and a feature fusion module 404, and may further include a first model training unit 405, a text acquisition unit 406, a second model training unit 407 and a similarity calculation unit 408. The main functions of each unit are as follows:
视觉特征提取模块401,用于利用图像特征提取模型从POI的图像中提取POI的视觉特征表示。The visual feature extraction module 401 is used for extracting the visual feature representation of the POI from the image of the POI by using the image feature extraction model.
语义特征提取模块402,用于利用文本特征提取模型从POI的文本信息中提取语义特征表示。The semantic feature extraction module 402 is used for extracting the semantic feature representation from the text information of the POI by using the text feature extraction model.
空间特征提取模块403,用于利用空间特征提取模型从POI的空间位置信息中提取空间特征表示。The spatial feature extraction module 403 is used for extracting the spatial feature representation from the spatial location information of the POI by using the spatial feature extraction model.
特征融合模块404，用于对POI的视觉特征表示、语义特征表示以及空间特征表示进行融合，得到POI的多模态特征表示。The feature fusion module 404 is used to fuse the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multimodal feature representation of the POI.
作为一种优选的实施方式,视觉特征提取模块401可以利用目标检测技术从包含POI招牌的图像中提取招牌区域;利用预先训练得到的图像特征提取模型从招牌区域中提取POI的视觉特征表示。As a preferred embodiment, the visual feature extraction module 401 can use the target detection technology to extract the signboard area from the image containing the POI signboard; use the image feature extraction model obtained by pre-training to extract the visual feature representation of the POI from the signboard area.
第一模型训练单元405，用于采用如下方式预先训练得到图像特征提取模型：获取第一训练样本，第一训练样本包括：图像样本以及对图像样本的类别标注；将图像样本作为深度神经网络的输入，将图像样本的类别标注作为分类网络的目标输出，训练深度神经网络和分类网络；其中，深度神经网络从图像样本中提取视觉特征表示后输入分类网络，分类网络依据视觉特征表示输出对图像样本的分类结果；训练结束后，利用训练得到的深度神经网络得到图像特征提取模型。The first model training unit 405 is configured to pre-train the image feature extraction model in the following manner: obtaining a first training sample, where the first training sample includes an image sample and a category label for the image sample; taking the image sample as the input of a deep neural network and the category label of the image sample as the target output of a classification network, and training the deep neural network and the classification network, where the deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs a classification result of the image sample according to the visual feature representation; and after the training, obtaining the image feature extraction model from the trained deep neural network.
文本获取单元406,用于从POI数据库中获取POI的文本信息;和/或,利用文字识别技术从包含POI招牌的图像中识别得到POI的文本信息。The text obtaining unit 406 is configured to obtain the text information of the POI from the POI database; and/or, use the text recognition technology to recognize and obtain the text information of the POI from the image containing the POI signboard.
其中,文本特征提取模型可以包括但不限于:Word Embedding模型、预训练语言模型或者利用已有的POI文本数据对预训练语言模型进行微调后得到的模型。The text feature extraction model may include, but is not limited to: a Word Embedding model, a pre-trained language model, or a model obtained by fine-tuning the pre-trained language model with existing POI text data.
作为一种优选的实施方式,空间特征提取模块403,具体用于对POI的空间位置信息进行哈希编码,得到哈希码;利用空间特征提取模型将哈希码转化为空间特征表示。As a preferred embodiment, the spatial feature extraction module 403 is specifically configured to perform hash coding on the spatial location information of the POI to obtain a hash code; and use a spatial feature extraction model to convert the hash code into a spatial feature representation.
其中,空间特征提取模型可以包括Word Embedding模型。Among them, the spatial feature extraction model may include the Word Embedding model.
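As a concrete illustration of the spatial feature extraction described above, the sketch below hash-encodes a coordinate with a geohash-style interleaved binary code and then looks the code up in an embedding table, in the spirit of a Word Embedding model. The specific hash scheme, the 20-bit code length, and the 32-dimensional embedding are assumptions made for illustration; the disclosure only specifies hash coding of the spatial position followed by an embedding-based conversion.

```python
import numpy as np

def geo_hash(lng, lat, bits=20):
    """A geohash-style interleaved binary code (an assumed concrete hash scheme).
    Even bits bisect the longitude range, odd bits the latitude range, so nearby
    points share a common prefix."""
    lng_lo, lng_hi, lat_lo, lat_hi = -180.0, 180.0, -90.0, 90.0
    code = []
    for i in range(bits):
        if i % 2 == 0:  # even bits split longitude
            mid = (lng_lo + lng_hi) / 2
            code.append("1" if lng >= mid else "0")
            lng_lo, lng_hi = (mid, lng_hi) if lng >= mid else (lng_lo, mid)
        else:           # odd bits split latitude
            mid = (lat_lo + lat_hi) / 2
            code.append("1" if lat >= mid else "0")
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
    return "".join(code)

rng = np.random.default_rng(0)
embedding_table = {}  # hash code -> spatial embedding (Word-Embedding-style lookup)

def spatial_feature(lng, lat, dim=32):
    """Convert a hash code into a spatial feature representation. Embeddings are
    randomly initialised here; in practice they would be learned."""
    code = geo_hash(lng, lat)
    if code not in embedding_table:
        embedding_table[code] = rng.normal(size=dim)
    return embedding_table[code]
```

Two POIs falling into the same hash cell map to the same code and therefore share one spatial embedding.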
作为一种优选的实施方式，特征融合模块404可以具体用于将POI的视觉特征表示、语义特征表示以及空间特征表示进行拼接，得到拼接特征；将拼接特征输入预先训练得到的全连接网络，获取全连接网络输出的POI的多模态特征表示。As a preferred embodiment, the feature fusion module 404 may be specifically configured to splice the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain a spliced feature, input the spliced feature into a pre-trained fully connected network, and obtain the multimodal feature representation of the POI output by the fully connected network.
第二模型训练单元407,用于采用如下方式预先训练得到全连接网络:The second model training unit 407 is used for pre-training to obtain a fully connected network in the following manner:
获取第二训练样本，第二训练样本包括POI样本以及对POI样本的类别标注；利用图像特征提取模型从POI样本的图像中提取POI样本的视觉特征表示；利用文本特征提取模型从POI样本的文本信息中提取语义特征表示；利用空间特征提取模型从POI样本的空间位置信息中提取空间特征表示；将POI样本的视觉特征表示、语义特征表示以及空间特征表示进行拼接，得到POI样本的拼接特征；将POI样本的拼接特征输入全连接网络，获取全连接层输出的POI样本的多模态特征表示；将多模态特征表示输入分类网络，将POI样本的类别标注作为分类网络的目标输出，训练全连接网络和分类网络。Obtaining a second training sample, where the second training sample includes POI samples and category labels for the POI samples; extracting the visual feature representation of a POI sample from the image of the POI sample using the image feature extraction model; extracting a semantic feature representation from the text information of the POI sample using the text feature extraction model; extracting a spatial feature representation from the spatial location information of the POI sample using the spatial feature extraction model; splicing the visual feature representation, semantic feature representation and spatial feature representation of the POI sample to obtain the spliced feature of the POI sample; inputting the spliced feature of the POI sample into the fully connected network, and obtaining the multimodal feature representation of the POI sample output by the fully connected layer; and inputting the multimodal feature representation into the classification network, taking the category label of the POI sample as the target output of the classification network, and training the fully connected network and the classification network.
相似度计算单元408,用于基于POI的多模态特征表示,计算POI之间的相似度。The similarity calculation unit 408 is configured to calculate the similarity between POIs based on the multimodal feature representation of POIs.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for related parts.
根据本公开的实施例,本公开还提供了一种电子设备、一种可读存储介质和一种计算机程序产品。According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
如图5所示,是根据本公开实施例的电子设备的框图。电子设备旨在表示各种形式的数字计算机,诸如,膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置,诸如,个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例,并且不意在限制本文中描述的和/或者要求的本公开的实现。As shown in FIG. 5 , it is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
如图5所示，设备500包括计算单元501，其可以根据存储在只读存储器（ROM）502中的计算机程序或者从存储单元508加载到随机访问存储器（RAM）503中的计算机程序，来执行各种适当的动作和处理。在RAM 503中，还可存储设备500操作所需的各种程序和数据。计算单元501、ROM 502以及RAM 503通过总线504彼此相连。输入/输出（I/O）接口505也连接至总线504。As shown in FIG. 5, the device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502 and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
设备500中的多个部件连接至I/O接口505，包括：输入单元506，例如键盘、鼠标等；输出单元507，例如各种类型的显示器、扬声器等；存储单元508，例如磁盘、光盘等；以及通信单元509，例如网卡、调制解调器、无线通信收发机等。通信单元509允许设备500通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard, a mouse, etc.; an output unit 507, such as various types of displays, speakers, etc.; a storage unit 508, such as a magnetic disk, an optical disk, etc.; and a communication unit 509, such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
计算单元501可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元501的一些示例包括但不限于中央处理单元（CPU）、图形处理单元（GPU）、各种专用的人工智能（AI）计算芯片、各种运行机器学习模型算法的计算单元、数字信号处理器（DSP）、以及任何适当的处理器、控制器、微控制器等。计算单元501执行上文所描述的各个方法和处理，例如多模态POI特征的提取方法。例如，在一些实施例中，多模态POI特征的提取方法可被实现为计算机软件程序，其被有形地包含于机器可读介质，例如存储单元508。The computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the method for extracting multimodal POI features. For example, in some embodiments, the method for extracting multimodal POI features may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508.
在一些实施例中，计算机程序的部分或者全部可以经由ROM 502和/或通信单元509而被载入和/或安装到设备500上。当计算机程序加载到RAM 503并由计算单元501执行时，可以执行上文描述的多模态POI特征的提取方法的一个或多个步骤。备选地，在其他实施例中，计算单元501可以通过其他任何适当的方式（例如，借助于固件）而被配置为执行多模态POI特征的提取方法。In some embodiments, part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method for extracting multimodal POI features described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method for extracting multimodal POI features by any other suitable means (for example, by means of firmware).
此处描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、场可编程门阵列（FPGA）、专用集成电路（ASIC）、专用标准产品（ASSP）、芯片上系统的系统（SOC）、复杂可编程逻辑设备（CPLD）、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括：实施在一个或者多个计算机程序中，该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释，该可编程处理器可以是专用或者通用可编程处理器，可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令，并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs, which are executable and/or interpretable on a programmable system including at least one programmable processor; the programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
用于实施本公开的方法的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器，使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可完全在机器上执行、部分地在机器上执行，作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
为了提供与用户的交互，可以在计算机上实施此处描述的系统和技术，该计算机具有：用于向用户显示信息的显示装置（例如，CRT（阴极射线管）或者LCD（液晶显示器）监视器）；以及键盘和指向装置（例如，鼠标或者轨迹球），用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈（例如，视觉反馈、听觉反馈、或者触觉反馈）；并且可以用任何形式（包括声输入、语音输入或者触觉输入）来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
可以将此处描述的系统和技术实施在包括后台部件的计算系统（例如，作为数据服务器）、或者包括中间件部件的计算系统（例如，应用服务器）、或者包括前端部件的计算系统（例如，具有图形用户界面或者网络浏览器的用户计算机，用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互）、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信（例如，通信网络）来将系统的部件相互连接。通信网络的示例包括：局域网（LAN）、广域网（WAN）和互联网。The systems and techniques described herein may be implemented in a computing system that includes a backend component (e.g., as a data server), or a computing system that includes a middleware component (e.g., an application server), or a computing system that includes a front-end component (e.g., a user computer having a graphical user interface or web browser through which a user may interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include: local area networks (LAN), wide area networks (WAN) and the Internet.
计算机系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。A computer system can include clients and servers. Clients and servers are generally remote from each other and usually interact through a communication network. The relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
应该理解，可以使用上面所示的各种形式的流程，重新排序、增加或删除步骤。例如，本申请中记载的各步骤可以并行地执行也可以顺序地执行也可以不同的次序执行，只要能够实现本公开公开的技术方案所期望的结果，本文在此不进行限制。It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially or in a different order; as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, no limitation is imposed herein.
上述具体实施方式,并不构成对本公开保护范围的限制。本领域技术人员应该明白的是,根据设计要求和其他因素,可以进行各种修改、组合、子组合和替代。任何在本公开的精神和原则之内所作的修改、等同替换和改进等,均应包含在本公开保护范围之内。The above-mentioned specific embodiments do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present disclosure should be included within the protection scope of the present disclosure.

Claims (23)

  1. 一种多模态兴趣点POI特征的提取方法，包括：A method for extracting a multimodal point of interest (POI) feature, comprising:
    利用图像特征提取模型从POI的图像中提取所述POI的视觉特征表示;Extract the visual feature representation of the POI from the image of the POI using an image feature extraction model;
    利用文本特征提取模型从所述POI的文本信息中提取语义特征表示;Utilize a text feature extraction model to extract semantic feature representation from the text information of the POI;
    利用空间特征提取模型从所述POI的空间位置信息中提取空间特征表示;Extract spatial feature representation from the spatial location information of the POI using a spatial feature extraction model;
    对所述POI的视觉特征表示、语义特征表示以及空间特征表示进行融合,得到所述POI的多模态特征表示。The visual feature representation, semantic feature representation and spatial feature representation of the POI are fused to obtain the multimodal feature representation of the POI.
  2. 根据权利要求1所述的方法,其中,所述利用图像特征提取模型从POI的图像中提取所述POI的视觉特征表示包括:The method according to claim 1, wherein the extracting the visual feature representation of the POI from the image of the POI using an image feature extraction model comprises:
    利用目标检测技术从包含POI招牌的图像中提取招牌区域;Extract signboard regions from images containing POI signboards using object detection techniques;
    利用预先训练得到的图像特征提取模型从所述招牌区域中提取所述POI的视觉特征表示。The visual feature representation of the POI is extracted from the signboard area using a pre-trained image feature extraction model.
  3. 根据权利要求1或2所述的方法,其中,所述图像特征提取模型采用如下方式预先训练得到:The method according to claim 1 or 2, wherein the image feature extraction model is pre-trained in the following manner:
    获取第一训练样本,所述第一训练样本包括:图像样本以及对图像样本的类别标注;Obtain a first training sample, where the first training sample includes: an image sample and a category label for the image sample;
    将所述图像样本作为深度神经网络的输入，将所述图像样本的类别标注作为分类网络的目标输出，训练所述深度神经网络和所述分类网络；其中，所述深度神经网络从所述图像样本中提取视觉特征表示后输入所述分类网络，所述分类网络依据所述视觉特征表示输出对所述图像样本的分类结果；taking the image sample as the input of a deep neural network and the category label of the image sample as the target output of a classification network, and training the deep neural network and the classification network, wherein the deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs a classification result of the image sample according to the visual feature representation;
    训练结束后,利用训练得到的所述深度神经网络得到所述图像特征提取模型。After the training, the image feature extraction model is obtained by using the deep neural network obtained by training.
  4. 根据权利要求1所述的方法,其中,所述POI的文本信息包括:The method according to claim 1, wherein the text information of the POI comprises:
    从POI数据库中获取的所述POI的文本信息;和/或,Textual information of the POI obtained from the POI database; and/or,
    利用文字识别技术从包含POI招牌的图像中识别得到的所述POI的文本信息。The text information of the POI is recognized from the image containing the POI signboard by using the text recognition technology.
  5. 根据权利要求1所述的方法,其中,所述文本特征提取模型包括:The method according to claim 1, wherein the text feature extraction model comprises:
    词嵌入Word Embedding模型、预训练语言模型或者利用已有的POI文本数据对预训练语言模型进行微调后得到的模型。a word embedding (Word Embedding) model, a pre-trained language model, or a model obtained by fine-tuning a pre-trained language model with existing POI text data.
  6. 根据权利要求1所述的方法,其中,利用空间特征提取模型从所述POI的空间位置信息中提取空间特征表示包括:The method according to claim 1, wherein extracting a spatial feature representation from the spatial location information of the POI by using a spatial feature extraction model comprises:
    对所述POI的空间位置信息进行哈希编码,得到哈希码;Hash coding is performed on the spatial location information of the POI to obtain a hash code;
    利用空间特征提取模型将所述哈希码转化为空间特征表示。The hash code is converted into a spatial feature representation using a spatial feature extraction model.
  7. 根据权利要求1或6所述的方法,其中,所述空间特征提取模型包括词嵌入模型。The method of claim 1 or 6, wherein the spatial feature extraction model comprises a word embedding model.
  8. 根据权利要求1所述的方法,其中,对所述POI的视觉特征表示、语义特征表示以及空间特征表示进行融合,得到所述POI的多模态特征表示包括:The method according to claim 1, wherein, fusing the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multimodal feature representation of the POI comprises:
    将所述POI的视觉特征表示、语义特征表示以及空间特征表示进行拼接,得到拼接特征;Splicing the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain splicing features;
    将所述拼接特征输入预先训练得到的全连接网络,获取所述全连接网络输出的所述POI的多模态特征表示。The splicing feature is input into a pre-trained fully connected network, and a multimodal feature representation of the POI output by the fully connected network is obtained.
  9. 根据权利要求8所述的方法,其中,所述全连接网络采用如下方式预先训练得到:The method according to claim 8, wherein the fully connected network is pre-trained in the following manner:
    获取第二训练样本,所述第二训练样本包括POI样本以及对所述POI样本的类别标注;obtaining a second training sample, where the second training sample includes a POI sample and a category label for the POI sample;
    利用所述图像特征提取模型从所述POI样本的图像中提取所述POI样本的视觉特征表示;Extract the visual feature representation of the POI sample from the image of the POI sample using the image feature extraction model;
    利用所述文本特征提取模型从所述POI样本的文本信息中提取语义特征表示;Using the text feature extraction model to extract semantic feature representations from the text information of the POI samples;
    利用空间特征提取模型从所述POI样本的空间位置信息中提取空间特征表示;Extract spatial feature representation from the spatial location information of the POI sample by using a spatial feature extraction model;
    将所述POI样本的视觉特征表示、语义特征表示以及空间特征表示进行拼接,得到所述POI样本的拼接特征;Splicing the visual feature representation, semantic feature representation and spatial feature representation of the POI sample to obtain the splicing feature of the POI sample;
    将所述POI样本的拼接特征输入全连接网络,获取所述全连接层输出的所述POI样本的多模态特征表示;Input the splicing feature of the POI sample into a fully connected network, and obtain the multimodal feature representation of the POI sample output by the fully connected layer;
    将所述多模态特征表示输入分类网络，将所述POI样本的类别标注作为所述分类网络的目标输出，训练所述全连接网络和所述分类网络。The multimodal feature representation is input into the classification network, the category label of the POI sample is taken as the target output of the classification network, and the fully connected network and the classification network are trained.
  10. 根据权利要求1所述的方法,该方法还包括:The method of claim 1, further comprising:
    基于POI的多模态特征表示,计算POI之间的相似度。Based on the multimodal feature representation of POIs, the similarity between POIs is calculated.
  11. 一种多模态POI特征的提取装置,包括:A device for extracting multimodal POI features, comprising:
    视觉特征提取模块,用于利用图像特征提取模型从POI的图像中提取所述POI的视觉特征表示;A visual feature extraction module for extracting the visual feature representation of the POI from the image of the POI by using an image feature extraction model;
    语义特征提取模块,用于利用文本特征提取模型从所述POI的文本信息中提取语义特征表示;a semantic feature extraction module for extracting semantic feature representation from the text information of the POI by using a text feature extraction model;
    空间特征提取模块,用于利用空间特征提取模型从所述POI的空间位置信息中提取空间特征表示;a spatial feature extraction module for extracting a spatial feature representation from the spatial location information of the POI by using a spatial feature extraction model;
    特征融合模块,用于对所述POI的视觉特征表示、语义特征表示以及空间特征表示进行融合,得到所述POI的多模态特征表示。The feature fusion module is used for fusing the visual feature representation, semantic feature representation and spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
  12. 根据权利要求11所述的装置，其中，所述视觉特征提取模块，具体用于利用目标检测技术从包含POI招牌的图像中提取招牌区域；利用预先训练得到的图像特征提取模型从所述招牌区域中提取所述POI的视觉特征表示。The apparatus according to claim 11, wherein the visual feature extraction module is specifically configured to extract a signboard area from an image containing a POI signboard using a target detection technology, and to extract the visual feature representation of the POI from the signboard area using a pre-trained image feature extraction model.
  13. 根据权利要求11或12所述的装置,还包括:The apparatus of claim 11 or 12, further comprising:
    第一模型训练单元，用于采用如下方式预先训练得到所述图像特征提取模型：获取第一训练样本，所述第一训练样本包括：图像样本以及对图像样本的类别标注；将所述图像样本作为深度神经网络的输入，将所述图像样本的类别标注作为分类网络的目标输出，训练所述深度神经网络和所述分类网络；其中，所述深度神经网络从所述图像样本中提取视觉特征表示后输入所述分类网络，所述分类网络依据所述视觉特征表示输出对所述图像样本的分类结果；训练结束后，利用训练得到的所述深度神经网络得到所述图像特征提取模型。a first model training unit, configured to pre-train the image feature extraction model in the following manner: obtaining a first training sample, where the first training sample includes an image sample and a category label for the image sample; taking the image sample as the input of a deep neural network and the category label of the image sample as the target output of a classification network, and training the deep neural network and the classification network, wherein the deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs a classification result of the image sample according to the visual feature representation; and after the training, obtaining the image feature extraction model from the trained deep neural network.
  14. The apparatus according to claim 11, further comprising:
    a text acquisition unit configured to acquire text information of the POI from a POI database, and/or to recognize the text information of the POI from an image containing a POI signboard by using a text recognition technology.
  15. The apparatus according to claim 11, wherein the text feature extraction model comprises:
    a word embedding model, a pre-trained language model, or a model obtained by fine-tuning a pre-trained language model with existing POI text data.
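As one hypothetical illustration of the word-embedding option in claim 15, a semantic feature for a POI's text can be taken as the mean of per-word embedding vectors. The toy vocabulary and 16-dimensional table below are invented for the sketch; a real system would use a pre-trained (or fine-tuned) language model instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and randomly initialized 16-dim embedding table.
vocab = {"coffee": 0, "shop": 1, "book": 2, "store": 3}
emb = rng.normal(size=(len(vocab), 16))

def semantic_feature(poi_text):
    """Mean of word embeddings as a simple semantic feature representation."""
    ids = [vocab[w] for w in poi_text.lower().split() if w in vocab]
    return emb[ids].mean(axis=0)
```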
  16. The apparatus according to claim 11, wherein the spatial feature extraction module is specifically configured to: perform hash coding on the spatial location information of the POI to obtain a hash code; and convert the hash code into the spatial feature representation by using the spatial feature extraction model.
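The hash coding of claim 16 can be illustrated with a geohash-style interleaved binary subdivision of latitude/longitude, followed by a word-embedding-style lookup (consistent with claim 17) that maps the code to a dense spatial feature. The 12-bit code length and 8-dimensional table are arbitrary choices for this sketch:

```python
import numpy as np

def geo_hash(lat, lon, bits=12):
    """Geohash-style code: alternately bisect the longitude and latitude ranges."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code = 0
    for i in range(bits):
        rng_ = lon_rng if i % 2 == 0 else lat_rng   # even bits: longitude
        val = lon if i % 2 == 0 else lat
        mid = (rng_[0] + rng_[1]) / 2
        bit = val >= mid
        if bit:
            rng_[0] = mid                           # keep upper half
        else:
            rng_[1] = mid                           # keep lower half
        code = (code << 1) | int(bit)
    return code

# Word-embedding-style lookup table over all 2**12 possible hash codes.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(2**12, 8))       # assumed 8-dim spatial features

def spatial_feature(lat, lon):
    """Spatial feature representation: embedding of the location's hash code."""
    return embedding_table[geo_hash(lat, lon)]
```

Nearby coordinates fall into the same cell and therefore share a hash code and embedding, which is the property the spatial feature relies on.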
  17. The apparatus according to claim 11 or 16, wherein the spatial feature extraction model comprises a word embedding model.
  18. The apparatus according to claim 11, wherein the feature fusion module is specifically configured to: concatenate the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a concatenated feature; and input the concatenated feature into a pre-trained fully connected network to obtain the multi-modal feature representation of the POI output by the fully connected network.
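The fusion of claim 18 — concatenating the three per-modality representations and passing them through a fully connected network — reduces to a few array operations. The per-modality dimensions (128/768/32), the single tanh layer, and the random weights below are placeholders standing in for a pre-trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-modality feature representations of one POI (dimensions are assumptions).
visual = rng.normal(size=128)    # from the image feature extraction model
semantic = rng.normal(size=768)  # from the text feature extraction model
spatial = rng.normal(size=32)    # from the spatial feature extraction model

# Concatenate the three representations into one vector.
concat = np.concatenate([visual, semantic, spatial])

# Stand-in for the pre-trained fully connected network: one tanh layer.
W_fc = rng.normal(scale=0.01, size=(concat.size, 256))
b_fc = np.zeros(256)
multimodal = np.tanh(concat @ W_fc + b_fc)   # multi-modal POI feature representation
```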
  19. The apparatus according to claim 18, further comprising:
    a second model training unit configured to pre-train the fully connected network in the following manner:
    obtaining a second training sample, the second training sample comprising a POI sample and a category label of the POI sample; extracting a visual feature representation of the POI sample from an image of the POI sample by using the image feature extraction model; extracting a semantic feature representation from text information of the POI sample by using the text feature extraction model; extracting a spatial feature representation from spatial location information of the POI sample by using the spatial feature extraction model; concatenating the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI sample to obtain a concatenated feature of the POI sample; inputting the concatenated feature of the POI sample into the fully connected network to obtain a multi-modal feature representation of the POI sample output by the fully connected network; and inputting the multi-modal feature representation into a classification network, taking the category label of the POI sample as a target output of the classification network, and training the fully connected network and the classification network.
  20. The apparatus according to claim 11, further comprising:
    a similarity calculation unit configured to calculate a similarity between POIs based on their multi-modal feature representations.
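Claim 20 does not fix the similarity metric, so the choice below is an assumption; a common option for comparing the multi-modal feature vectors of two POIs is cosine similarity:

```python
import numpy as np

def poi_similarity(f1, f2):
    """Cosine similarity between two multi-modal POI feature vectors."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which makes the value usable directly as a ranking signal, e.g. for POI deduplication or retrieval.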
  21. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
  22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
  23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
PCT/CN2021/107383 2021-03-24 2021-07-20 Method and apparatus for extracting multi-modal poi feature WO2022198854A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020227044369A KR20230005408A (en) 2021-03-24 2021-07-20 Method and apparatus for extracting multi-modal POI features
JP2022576469A JP2023529939A (en) 2021-03-24 2021-07-20 Multimodal POI feature extraction method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110312700.4A CN113032672A (en) 2021-03-24 2021-03-24 Method and device for extracting multi-modal POI (Point of interest) features
CN202110312700.4 2021-03-24

Publications (1)

Publication Number Publication Date
WO2022198854A1 (en)

Family

ID=76473210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107383 WO2022198854A1 (en) 2021-03-24 2021-07-20 Method and apparatus for extracting multi-modal poi feature

Country Status (4)

Country Link
JP (1) JP2023529939A (en)
KR (1) KR20230005408A (en)
CN (1) CN113032672A (en)
WO (1) WO2022198854A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032672A (en) * 2021-03-24 2021-06-25 北京百度网讯科技有限公司 Method and device for extracting multi-modal POI (Point of interest) features
CN113657274B (en) 2021-08-17 2022-09-20 北京百度网讯科技有限公司 Table generation method and device, electronic equipment and storage medium
CN113807102B (en) * 2021-08-20 2022-11-01 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for establishing semantic representation model
CN113807218B (en) * 2021-09-03 2024-02-20 科大讯飞股份有限公司 Layout analysis method, device, computer equipment and storage medium
CN114821622B (en) * 2022-03-10 2023-07-21 北京百度网讯科技有限公司 Text extraction method, text extraction model training method, device and equipment
CN114911787B (en) * 2022-05-31 2023-10-27 南京大学 Multi-source POI data cleaning method integrating position and semantic constraint
CN114861889B (en) * 2022-07-04 2022-09-27 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN115455129B (en) * 2022-10-14 2023-08-25 阿里巴巴(中国)有限公司 POI processing method, POI processing device, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109472232A (en) * 2018-10-31 2019-03-15 山东师范大学 Video semanteme characterizing method, system and medium based on multi-modal fusion mechanism
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium
CN112200317A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-modal knowledge graph construction method
CN113032672A (en) * 2021-03-24 2021-06-25 北京百度网讯科技有限公司 Method and device for extracting multi-modal POI (Point of interest) features

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN104166982A (en) * 2014-06-30 2014-11-26 复旦大学 Image optimization clustering method based on typical correlation analysis
KR102092392B1 (en) * 2018-06-15 2020-03-23 네이버랩스 주식회사 Method and system for automatically collecting and updating information about point of interest in real space
CN111460077B (en) * 2019-01-22 2021-03-26 大连理工大学 Cross-modal Hash retrieval method based on class semantic guidance
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN115966061A (en) * 2022-12-28 2023-04-14 上海帜讯信息技术股份有限公司 Disaster warning processing method, system and device based on 5G message
CN115966061B (en) * 2022-12-28 2023-10-24 上海帜讯信息技术股份有限公司 Disaster early warning processing method, system and device based on 5G message
CN116665228A (en) * 2023-07-31 2023-08-29 恒生电子股份有限公司 Image processing method and device
CN116665228B (en) * 2023-07-31 2023-10-13 恒生电子股份有限公司 Image processing method and device
CN116805531A (en) * 2023-08-24 2023-09-26 安徽通灵仿生科技有限公司 Pediatric remote medical system
CN116805531B (en) * 2023-08-24 2023-12-05 安徽通灵仿生科技有限公司 Pediatric remote medical system

Also Published As

Publication number Publication date
CN113032672A (en) 2021-06-25
JP2023529939A (en) 2023-07-12
KR20230005408A (en) 2023-01-09

Similar Documents

Publication Publication Date Title
WO2022198854A1 (en) Method and apparatus for extracting multi-modal poi feature
CN112949415B (en) Image processing method, apparatus, device and medium
CN112084790A (en) Relation extraction method and system based on pre-training convolutional neural network
WO2021093308A1 (en) Method and apparatus for extracting poi name, device, and computer storage medium
WO2018177316A1 (en) Information identification method, computing device, and storage medium
CN110826335B (en) Named entity identification method and device
WO2022227769A1 (en) Training method and apparatus for lane line detection model, electronic device and storage medium
WO2022174552A1 (en) Method and apparatus for obtaining poi state information
WO2021208696A1 (en) User intention analysis method, apparatus, electronic device, and computer storage medium
CN114490998B (en) Text information extraction method and device, electronic equipment and storage medium
US20230041943A1 (en) Method for automatically producing map data, and related apparatus
CN112989097A (en) Model training and picture retrieval method and device
CN114092948B (en) Bill identification method, device, equipment and storage medium
CN115359383A (en) Cross-modal feature extraction, retrieval and model training method, device and medium
CN113705716B (en) Image recognition model training method and device, cloud control platform and automatic driving vehicle
CN113407610B (en) Information extraction method, information extraction device, electronic equipment and readable storage medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN114764874B (en) Deep learning model training method, object recognition method and device
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
CN114691918B (en) Radar image retrieval method and device based on artificial intelligence and electronic equipment
CN115329132A (en) Method, device and equipment for generating video label and storage medium
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN112818972A (en) Method and device for detecting interest point image, electronic equipment and storage medium
CN115482436B (en) Training method and device for image screening model and image screening method
CN112541496B (en) Method, device, equipment and computer storage medium for extracting POI (point of interest) names

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21932468; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022576469; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 20227044369; Country of ref document: KR; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21932468; Country of ref document: EP; Kind code of ref document: A1)