CN113032672A - Method and device for extracting multi-modal POI (Point of interest) features

Info

Publication number: CN113032672A
Application number: CN202110312700.4A
Authority: CN (China)
Prior art keywords: poi, sample, feature, image, representation
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 范淼 (Fan Miao), 黄际洲 (Huang Jizhou), 王海峰 (Wang Haifeng)
Current and original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Filing date: 2021-03-24
Publication date: 2021-06-25

Application events:

    • Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
    • Priority to CN202110312700.4A
    • Publication of CN113032672A
    • Priority to JP2022576469A
    • Priority to PCT/CN2021/107383 (WO2022198854A1)
    • Priority to KR1020227044369A (KR20230005408A)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/908Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/909Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9537Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a method and an apparatus for extracting multi-modal POI features, relating to big data technology in the field of artificial intelligence. The specific implementation scheme is as follows: extracting a visual feature representation of the POI from an image of the POI by using an image feature extraction model; extracting a semantic feature representation from the text information of the POI by using a text feature extraction model; extracting a spatial feature representation from the spatial position information of the POI by using a spatial feature extraction model; and fusing the visual, semantic, and spatial feature representations of the POI to obtain the multi-modal feature representation of the POI. The embodiments of the disclosure thus extract a multi-modality fused feature vector representation for each POI, providing a basis for subsequent similarity calculation between POIs.

Description

Method and device for extracting multi-modal POI (Point of interest) features
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a big data technology in the field of artificial intelligence.
Background
A POI (Point of Interest) is a geographic entity that actually exists in a geographic information system, such as a building, a shop, a school, or a bus stop. For a geographic information system, the number of POIs reflects, to some extent, the value of the overall system. Comprehensive POI information is essential to a rich map information system. Generally, each POI includes information of at least multiple modalities, such as a name, coordinates, and an image. The digital media and manner of presentation of such information vary: a name is typically text in some language, coordinates are typically numbers in at least two dimensions, and an image is presented in image form. Thus, a multi-modal POI is a physical entity described by multiple kinds of digital media.
Generally, POI information is stored in a relational database, and in many application scenarios it needs to be queried from that database. This requires the ability to quickly compute the similarity between multi-modal POIs, and such computation is based on POI features; how to extract POI features therefore becomes a key issue.
Disclosure of Invention
In view of the above, the present disclosure provides a method and an apparatus for extracting multi-modal POI features.
According to a first aspect of the present disclosure, there is provided a method for extracting multi-modal POI features, including:
extracting visual feature representation of the POI from an image of the POI by using an image feature extraction model;
extracting semantic feature representation from the text information of the POI by using a text feature extraction model;
extracting spatial feature representation from the spatial position information of the POI by using a spatial feature extraction model;
and fusing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
According to a second aspect of the present disclosure, there is provided an apparatus for extracting multi-modal POI features, comprising:
the visual feature extraction module is used for extracting visual feature representation of the POI from the image of the POI by utilizing an image feature extraction model;
the semantic feature extraction module is used for extracting semantic feature representation from the text information of the POI by utilizing a text feature extraction model;
the spatial feature extraction module is used for extracting spatial feature representation from the spatial position information of the POI by using a spatial feature extraction model;
and the feature fusion module is used for fusing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical scheme, the method extracts the feature vector representation fused by multiple modalities for each POI, so that a basis is provided for similarity calculation between subsequent POIs.
It should be understood that what is described in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a method for extracting multi-modal POI features according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a training image feature extraction model provided in an embodiment of the present disclosure;
fig. 3 is a training flowchart of a fully connected network provided by an embodiment of the present disclosure;
fig. 4 is a schematic diagram of an apparatus for extracting multi-modal POI features provided in an embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In conventional similarity calculation, similarity is usually computed separately for the images of two POIs, for their names, and then for their coordinates. That is, similarity must be calculated modality by modality, which is complicated and time-consuming. To address this problem, the core idea of the present disclosure is to extract a multi-modality fused feature representation for each POI, thereby providing a basis for subsequent similarity calculation between POIs. The method provided by the present disclosure is described in detail below with reference to embodiments.
Fig. 1 is a flowchart of a method for extracting multi-modal POI features according to an embodiment of the present disclosure; the method is executed by an apparatus for extracting multi-modal POI features. The apparatus may be embodied as an application on the server side, as a functional unit such as a plug-in or Software Development Kit (SDK) within such an application, or on a computer terminal with strong computing capability; this embodiment of the present disclosure places no particular limitation on this.
As shown in fig. 1, the method may include the steps of:
in 101, a visual feature representation of a POI is extracted from an image of the POI using an image feature extraction model.
At 102, a semantic feature representation is extracted from textual information of the POI using a textual feature extraction model.
In 103, a spatial feature representation is extracted from the spatial location information of the POI using a spatial feature extraction model.
At 104, the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI are fused to obtain a multi-modal feature representation of the POI.
The order of steps 101 to 103 shown above is only one possible implementation sequence; the steps may also be executed in other orders or in parallel.
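Before each step is examined in detail, the overall flow can be summarized in a short sketch. The code below is illustrative only: the model objects and field names (image_model, poi["image"], and so on) are hypothetical stand-ins for the models described in the rest of this disclosure, written here in Python/PyTorch.

```python
# A minimal sketch of steps 101 to 104, assuming three already-trained
# extraction models and a fusion network; every name here is hypothetical.
import torch

def extract_multimodal_poi_feature(poi, image_model, text_model,
                                   spatial_model, fusion_net):
    visual = image_model(poi["image"])        # 101: visual feature representation
    semantic = text_model(poi["text"])        # 102: semantic feature representation
    spatial = spatial_model(poi["location"])  # 103: spatial feature representation
    # 104: fuse the three representations (concatenation followed by a fully
    # connected network, as detailed later in this disclosure)
    return fusion_net(torch.cat([visual, semantic, spatial], dim=-1))
```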
The above steps are described in detail below with reference to embodiments. First, step 101, i.e., extracting a visual feature representation of the POI from an image of the POI using an image feature extraction model, is described in detail.
The image in the POI information is generally an image containing the POI's signboard. For example, a live-action image of a shop includes the shop's signboard, which usually bears the shop's name and perhaps some advertising slogans; a live-action image of a building includes the building's signboard, usually the building's name; a live-action image of a school includes the school's signboard, usually the school's name. Such images containing POI signboards are highly recognizable among the POI information, so as a preferred embodiment the present disclosure extracts the visual feature representation of a POI from an image containing its signboard.
In addition to extracting visual feature representations of POIs from images containing POI signs, other types of POI images may be extracted. For example, for a building POI with a significant shape, a visual feature representation can be extracted from the image containing the shape of the building subject. Images of these POIs may be obtained from a POI database.
As one of the preferred embodiments, the present step may specifically include the following steps S11 to S12:
in step S11, a signboard region is extracted from an image containing a POI signboard using an object detection technique.
In this step, a target detection technology such as YOLO (You Only Look Once), SSD (Single Shot multibox Detector), or Faster R-CNN (Faster Region-based Convolutional Neural Network) may be used to identify the signboard region from an image containing a POI signboard, and an optimization such as FPN (Feature Pyramid Network) may further be combined with the above target detection technology. These target detection methods are mature technologies and are not described in detail here.
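For illustration, a minimal sketch of this detection step follows, using torchvision's off-the-shelf Faster R-CNN + FPN detector. This is only one possible realization: in practice the detector would be fine-tuned on labeled signboard data, and the score threshold here is a placeholder.

```python
# A hedged sketch of extracting candidate signboard regions with a
# Faster R-CNN + FPN detector from torchvision (an assumption, not the
# patent's prescribed detector).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_signboard_regions(image: torch.Tensor, score_thresh: float = 0.5):
    """image: float tensor of shape (3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        pred = detector([image])[0]
    keep = pred["scores"] > score_thresh
    return pred["boxes"][keep]  # candidate signboard boxes as (x1, y1, x2, y2)

boxes = detect_signboard_regions(torch.rand(3, 480, 640))  # dummy live-action image
```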
In addition to object detection techniques, other means of extracting the signboard region may be used, for example a previously trained signboard discrimination model. The live-action image is first divided into regions; since a signboard generally forms a closed region in the live-action image, region identification and division can be performed on the image, each determined closed region is input into the signboard discrimination model, and the model outputs a judgment of whether that closed region is a signboard region.
The signboard discrimination model is in fact a classification model: some live-action images can be collected in advance, signboard regions and non-signboard regions labeled in them as positive and negative samples respectively, and a classification model trained on them to obtain the signboard discrimination model.
In step S12, the visual feature representation of the POI is extracted from the signboard area using the image feature extraction model trained in advance.
The image feature extraction model can be obtained by pre-training based on a deep neural network, and after the image feature extraction model is input into the signboard area, the image feature extraction model extracts visual feature representation of POI from the signboard area.
The training process of the image feature extraction model is described below. Training samples are obtained first; in this embodiment, the training sample used for training the image feature extraction model is referred to as the first training sample. Note that expressions such as "first" and "second" in the present disclosure do not limit number, order, or size; they serve only to distinguish names.
The first training sample comprises an image sample and a class label of the image sample. The label may be the object represented by the image: for example, an image containing a cat is labeled cat, and an image containing a dog is labeled dog. The label may also be the type of the object shown in the image: for example, an image of a specific hospital may be labeled hospital, and an image of a specific school may be labeled school.
The image sample is then used as the input of a deep neural network and, as shown in fig. 2, the class label of the image sample is used as the target output of a classification network. The training of the image feature extraction model in this embodiment therefore involves two networks: a deep neural network and a classification network. The deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs a classification result for the image sample according to that visual feature representation. The training objective is to minimize the difference between the classification result output by the classification network and the corresponding class label. After training finishes (for example, when the value of the loss function falls below a preset threshold, or the number of training iterations reaches a preset limit), the deep neural network obtained from training is used as the image feature extraction model. In other words, both networks are used during training, but the final image feature extraction model uses only the deep neural network; the classification network merely assists its training.
The deep neural network used in the training process can be, but is not limited to, ResNet-50 or ResNet-101 (Residual Networks), EfficientNet, and the like. The loss function used by the classification network can be, but is not limited to, Large-Margin Softmax (L-Softmax), A-Softmax, AM-Softmax, CosFace, ArcFace, and the like.
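The following is a hedged sketch of the Fig. 2 setup in PyTorch/torchvision. The class count, dummy batch, and hyperparameters are placeholders, and plain cross-entropy stands in for the margin-based losses named above.

```python
# Illustrative training step: deep neural network (ResNet-50 backbone) plus an
# auxiliary classification network, as in Fig. 2; all data here is dummy data.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 100                              # placeholder label set size
backbone = models.resnet50(weights=None)       # the deep neural network
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()                    # backbone now emits visual features
classifier = nn.Linear(feat_dim, num_classes)  # the auxiliary classification network

criterion = nn.CrossEntropyLoss()              # stand-in for L-Softmax/ArcFace etc.
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=1e-3)

# One dummy batch of first training samples (image sample, class label).
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, num_classes, (4,))

features = backbone(images)        # visual feature representation
logits = classifier(features)      # classification result
loss = criterion(logits, labels)   # minimize the gap to the class labels
optimizer.zero_grad()
loss.backward()
optimizer.step()
# After training, only `backbone` is kept as the image feature extraction model.
```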
The above step 102, i.e., "extracting semantic feature representation from text information of POI by using text feature extraction model", will be described in detail.
The text information of the POI referred to in this step may be text information obtained from the POI database, such as the POI name, description information, and review information. It may also be text information recognized from an image containing the POI signboard using character recognition technology: after the signboard region is recognized from the image, characters such as the POI name and advertising slogans are recognized from that region by OCR (Optical Character Recognition) and used as the POI's text information.
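For illustration, the OCR route can be sketched as follows, assuming the pytesseract binding and an installed Simplified Chinese language pack; the file name is hypothetical.

```python
# A hedged sketch of recognizing POI text from a cropped signboard region.
from PIL import Image
import pytesseract

sign_crop = Image.open("signboard_crop.jpg")   # hypothetical crop from step S11
poi_text = pytesseract.image_to_string(sign_crop, lang="chi_sim")
print(poi_text)  # e.g., the shop name and advertising slogans on the signboard
```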
The text feature extraction model used in this step may adopt, but is not limited to, the following:
first, the word Embedding model.
For example, a Word Embedding model such as Word2Vec or GloVe may be employed.
Second, the language model is pre-trained.
For example, a pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers) or ERNIE (Enhanced Representation through kNowledge IntEgration) can be used; a minimal usage sketch follows after this list.
Third, a model obtained by fine-tuning a pre-trained language model on existing POI text data.
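The sketch below illustrates the second option, assuming the Hugging Face transformers library; the checkpoint name is only an example, and taking the [CLS] vector is one common choice of sentence-level feature rather than the patent's prescribed one.

```python
# A hedged sketch of extracting a semantic feature representation with a
# pre-trained language model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # example checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

def semantic_feature(poi_text: str) -> torch.Tensor:
    inputs = tokenizer(poi_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] token as the representation

vec = semantic_feature("example POI name")  # e.g., a name recognized from a signboard
```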
The following describes step 103, i.e., "extracting a spatial feature representation from spatial position information of a POI by using a spatial feature extraction model" in detail.
The spatial position information of the POI referred to in this step mainly refers to information that marks the POI's spatial position in some form, such as coordinate information. The spatial feature representation can be extracted from this information directly using the spatial feature extraction model.
However, many POIs are in practice close to one another, and current positioning accuracy can be controlled to the meter level, so in a map information system it is often more desirable to distinguish POIs at the block level. The present disclosure therefore provides a preferred embodiment, which may specifically include the following steps S21 to S22:
in step S21, the spatial position information of the POI is hash-coded to obtain a hash code.
For coordinate information, a geohash (a latitude-and-longitude address encoding) may be used, for example. A geohash represents the two coordinates, longitude and latitude, as a single character string; after geohash encoding, the hash codes of two coordinates on the same block share their leading characters and differ only in the trailing ones.
In step S22, the hash code is converted into a spatial feature representation using a spatial feature extraction model.
The spatial feature extraction model adopted in this step can be a Word Embedding model; that is, embedding is used to convert the hash code into a quantifiable spatial feature representation.
In this embodiment, the Word Embedding model may be further trained with a similarity task, the training target being: the closer two POIs are in position, the higher the similarity between the spatial feature representations output by the Word Embedding model.
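A hedged sketch of steps S21 and S22 follows, assuming the pygeohash package and a simple character-level embedding in place of the similarity-trained Word Embedding model; the precision and embedding dimension are placeholders.

```python
# S21: geohash-encode the coordinates; S22: embed the hash code.
import torch
import torch.nn as nn
import pygeohash

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"     # the geohash alphabet
char_embedding = nn.Embedding(len(BASE32), 16)  # placeholder embedding table

def spatial_feature(lat: float, lon: float, precision: int = 8) -> torch.Tensor:
    code = pygeohash.encode(lat, lon, precision=precision)  # S21: hash code
    ids = torch.tensor([BASE32.index(c) for c in code])
    # S22: averaging character embeddings is a stand-in for the trained
    # Word Embedding model described above.
    return char_embedding(ids).mean(dim=0)

vec = spatial_feature(39.9042, 116.4074)  # e.g., coordinates near central Beijing
```

Because nearby coordinates share their leading geohash characters, embeddings of such codes naturally start from overlapping inputs, which is what the similarity training target then reinforces.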
The following describes in detail the step 104 of "fusing the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multi-modal feature representation of the POI".
In this step, the visual, semantic, and spatial feature representations of the POI could simply be concatenated and the concatenation used as the multi-modal feature representation. However, such direct merging is rigid and has no learning ability, so the resulting representation is naturally less accurate.
Therefore, the present disclosure provides a preferred fusion method, which may specifically include the following steps S31 to S32:
in step S31, the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI are spliced to obtain a spliced feature.
In this step, the visual, semantic, and spatial feature representations can be spliced end to end in a preset order. Where the vector dimensions of the feature representations differ, they can be padded with a preset value, for example 0.
In step S32, the spliced feature is input into a fully connected (FC) network obtained by pre-training, and the multi-modal feature representation of the POI output by the fully connected network is acquired.
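A minimal sketch of steps S31 and S32 follows; the dimensions and network shape are placeholders, and the fully connected network would be the one trained as described below.

```python
# S31: splice the three representations; S32: pass through the FC network.
import torch
import torch.nn as nn

visual_dim, semantic_dim, spatial_dim, fused_dim = 2048, 768, 16, 512  # placeholders

fusion_net = nn.Sequential(                  # stand-in for the pre-trained FC network
    nn.Linear(visual_dim + semantic_dim + spatial_dim, fused_dim),
    nn.ReLU(),
    nn.Linear(fused_dim, fused_dim),
)

def fuse(visual, semantic, spatial):
    spliced = torch.cat([visual, semantic, spatial], dim=-1)  # S31: splicing
    return fusion_net(spliced)                # S32: multi-modal feature representation

fused = fuse(torch.randn(1, visual_dim),
             torch.randn(1, semantic_dim),
             torch.randn(1, spatial_dim))
```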
The training process for the fully-connected network described above is described in detail below. As shown in fig. 3, the process may include the steps of:
in 301, a second training sample is obtained, the second training sample comprising a POI sample and a class label for the POI sample.
Some POIs having image, text, and spatial position information can be obtained in advance as POI samples, and their categories labeled, for example hospital, building, school, bus station, store, and so on. The POI samples together with their category labels serve as the second training samples for training the fully connected network used in feature fusion.
In 302, a visual feature representation of the POI sample is extracted from an image of the POI sample using an image feature extraction model.
In 303, semantic feature representations are extracted from the textual information of the POI sample using a textual feature extraction model.
At 304, a spatial feature representation is extracted from the spatial location information of the POI sample using a spatial feature extraction model.
The feature extraction in steps 302 to 304 follows the related description in the previous method embodiment and is not repeated here. Likewise, steps 302 to 304 are shown in only one possible execution order and may be executed in other orders or in parallel.
In 305, the visual feature representation, the semantic feature representation and the spatial feature representation of the POI sample are spliced to obtain a splicing feature of the POI sample.
In 306, the splicing feature of the POI sample is input into the fully connected network, and the multi-modal feature representation of the POI sample output by the fully connected network is acquired; the multi-modal feature representation is then input into a classification network, with the class label of the POI sample as the classification network's target output, and the fully connected network and the classification network are trained.
The loss function adopted by the classification network can be, but is not limited to, Large-Margin Softmax (L-Softmax), A-Softmax, AM-Softmax, CosFace, ArcFace, and the like.
The training focuses on the fully connected network and the classification network, whose parameters are updated using the value of the loss function. The model parameters of the image, text, and spatial feature extraction models can be kept unchanged, or they can also participate in the updating during training.
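The Fig. 3 flow (steps 301 to 306) can be sketched as below, reusing `fuse`, `fusion_net`, and the placeholder dimensions from the sketch above; the extractor outputs are stood in by random tensors, and cross-entropy stands in for the margin-based losses listed above.

```python
# Illustrative training step for the fully connected network and the
# classification network; the three extraction models are assumed frozen here.
import torch
import torch.nn as nn

num_poi_classes = 6   # e.g., hospital, building, school, bus station, store
poi_classifier = nn.Linear(fused_dim, num_poi_classes)  # classification network

optimizer = torch.optim.Adam(
    list(fusion_net.parameters()) + list(poi_classifier.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One dummy batch of second training samples; in practice the three features
# come from the image, text, and spatial extraction models (steps 302 to 304).
batch = 8
visual = torch.randn(batch, visual_dim)
semantic = torch.randn(batch, semantic_dim)
spatial = torch.randn(batch, spatial_dim)
labels = torch.randint(0, num_poi_classes, (batch,))

fused = fuse(visual, semantic, spatial)          # steps 305 and 306
loss = criterion(poi_classifier(fused), labels)  # class label as target output
optimizer.zero_grad()
loss.backward()
optimizer.step()  # only the FC network and classifier parameters update here
```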
With the method of the above embodiments, a multi-modal feature representation is obtained for each POI and may be stored in a database. These multi-modal feature representations can then be used to compute similarity between POIs. Specific application scenarios include, but are not limited to, automated POI production, intelligent retrieval, and recommendation.
Taking automated POI production as an example: a collector or collection device captures an image containing the POI signboard and stores information such as the POI's image, name, and coordinates. For the massive, historically acquired POI data, multi-modal feature representations are extracted as in the above embodiments of the present disclosure and stored in a database, for example a distributed Redis serving as the feature library of multi-modal feature representations. The storage structure may take the form of key-value pairs.
For newly acquired POI data, the multi-modal feature representation is extracted in the manner described in the above embodiments, and search matching is then performed in the feature library using that representation, for example by NN (Nearest Neighbor) or ANN (Approximate Nearest Neighbor) search. The retrieval is based on computing the similarity between the multi-modal feature representation of the newly acquired POI and those of the existing POIs in the database, so as to judge whether the newly acquired data belongs to an existing POI. POI data that find no match, or that cannot be processed automatically because of unrecognizable text, insufficient image clarity, wrong coordinates, and the like, are submitted to a manual platform for handling.
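For illustration, this matching step can be sketched with the faiss library; the dimension, data, and acceptance threshold are placeholders, and an exact inner-product index stands in for whichever NN/ANN index is deployed.

```python
# A hedged sketch of matching a newly acquired POI against the feature library.
import numpy as np
import faiss

dim = 512
library = np.random.rand(10000, dim).astype("float32")  # stored multi-modal features
faiss.normalize_L2(library)                 # cosine similarity via inner product
index = faiss.IndexFlatIP(dim)              # exact NN; IVF/HNSW variants give ANN
index.add(library)

query = np.random.rand(1, dim).astype("float32")        # newly acquired POI feature
faiss.normalize_L2(query)
similarities, ids = index.search(query, 5)  # top-5 most similar existing POIs

# If the best similarity clears a threshold, treat the new record as an
# existing POI; otherwise hand it to the manual platform.
is_existing = similarities[0, 0] > 0.9      # placeholder threshold
```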
The above is a detailed description of the method provided by the present disclosure, and the following is a detailed description of the apparatus provided by the present disclosure with reference to the embodiments.
Fig. 4 is a schematic diagram of an apparatus for extracting multi-modal POI features provided in an embodiment of the present disclosure, and as shown in fig. 4, the apparatus may include: the visual feature extraction module 401, the semantic feature extraction module 402, the spatial feature extraction module 403, and the feature fusion module 404 may further include a first model training unit 405, a text acquisition unit 406, a second model training unit 407, and a similarity calculation unit 408. The main functions of each component unit are as follows:
the visual feature extraction module 401 is configured to extract a visual feature representation of the POI from an image of the POI by using an image feature extraction model.
A semantic feature extraction module 402, configured to extract a semantic feature representation from the text information of the POI by using a text feature extraction model.
A spatial feature extraction module 403, configured to extract a spatial feature representation from the spatial location information of the POI by using a spatial feature extraction model.
And the feature fusion module 404 is configured to fuse the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multi-modal feature representation of the POI.
As a preferred embodiment, the visual feature extraction module 401 may utilize object detection techniques to extract a signboard region from an image containing a POI signboard; and extracting visual feature representations of POI from the signboard area by using an image feature extraction model obtained by training in advance.
The first model training unit 405 is configured to pre-train the image feature extraction model in the following manner: obtaining a first training sample comprising an image sample and a class label of the image sample; taking the image sample as the input of a deep neural network and the class label of the image sample as the target output of a classification network, and training the deep neural network and the classification network, where the deep neural network extracts a visual feature representation from the image sample and inputs it into the classification network, and the classification network outputs a classification result of the image sample according to the visual feature representation; and after training, using the trained deep neural network as the image feature extraction model.
A text acquiring unit 406, configured to acquire text information of a POI from a POI database; and/or recognizing text information of the POI from the image containing the POI signboard by utilizing a character recognition technology.
The text feature extraction model may include, but is not limited to: the model is obtained by carrying out fine tuning on a Word Embedding model, a pre-training language model or the pre-training language model by utilizing the existing POI text data.
As a preferred embodiment, the spatial feature extraction module 403 is specifically configured to perform hash coding on spatial location information of a POI to obtain a hash code; and converting the hash code into a spatial feature representation by using a spatial feature extraction model.
The spatial feature extraction model may include a Word Embedding model.
As a preferred embodiment, the feature fusion module 404 may be specifically configured to splice the visual, semantic, and spatial feature representations of the POI to obtain a spliced feature, input the spliced feature into a fully connected network obtained by pre-training, and acquire the multi-modal feature representation of the POI output by the fully connected network.
A second model training unit 407, configured to pre-train to obtain a fully connected network in the following manner:
acquiring a second training sample, the second training sample comprising a POI sample and a class label of the POI sample; extracting a visual feature representation of the POI sample from an image of the POI sample by using the image feature extraction model; extracting a semantic feature representation from the text information of the POI sample by using the text feature extraction model; extracting a spatial feature representation from the spatial position information of the POI sample by using the spatial feature extraction model; splicing the visual, semantic, and spatial feature representations of the POI sample to obtain the splicing feature of the POI sample; inputting the splicing feature of the POI sample into the fully connected network and acquiring the multi-modal feature representation of the POI sample output by the fully connected network; and inputting the multi-modal feature representation into a classification network, with the class label of the POI sample as the classification network's target output, and training the fully connected network and the classification network.
A similarity calculation unit 408 for calculating a similarity between the POIs based on the multi-modal feature representation of the POIs.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 performs the methods and processes described above, such as the method of extracting multi-modal POI features. For example, in some embodiments, the method of extracting multi-modal POI features may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method of extracting multi-modal POI features described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method of extracting multi-modal POI features in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A method for extracting multi-modal POI (point of interest) features comprises the following steps:
extracting visual feature representation of the POI from an image of the POI by using an image feature extraction model;
extracting semantic feature representation from the text information of the POI by using a text feature extraction model;
extracting spatial feature representation from the spatial position information of the POI by using a spatial feature extraction model;
and fusing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
2. The method of claim 1, wherein said extracting a visual feature representation of a POI from an image of the POI using an image feature extraction model comprises:
extracting a signboard region from an image containing a POI signboard by using a target detection technology;
and extracting visual feature representations of the POI from the signboard area by using an image feature extraction model obtained by training in advance.
3. The method according to claim 1 or 2, wherein the image feature extraction model is pre-trained in the following way:
obtaining a first training sample, the first training sample comprising: an image sample and a class label of the image sample;
taking the image sample as the input of a deep neural network, taking the class label of the image sample as the target output of a classification network, and training the deep neural network and the classification network; the deep neural network extracts visual feature representations from the image samples and inputs the visual feature representations into the classification network, and the classification network outputs classification results of the image samples according to the visual feature representations;
and after the training is finished, obtaining the image feature extraction model by using the deep neural network obtained by training.
4. The method of claim 1, wherein the text information of the POI comprises:
text information of the POI acquired from a POI database; and/or,
text information of the POI recognized from an image containing the POI signboard by using a character recognition technology.
5. The method of claim 1, wherein the text feature extraction model comprises:
the Word Embedding model, the pre-training language model or the model obtained by utilizing the existing POI text data to finely adjust the pre-training language model.
6. The method of claim 1, wherein extracting a spatial feature representation from the spatial location information of the POI using a spatial feature extraction model comprises:
carrying out hash coding on the spatial position information of the POI to obtain a hash code;
and converting the hash code into a spatial feature representation by using a spatial feature extraction model.
7. The method of claim 1 or 6, wherein the spatial feature extraction model comprises a word embedding model.
8. The method of claim 1, wherein fusing the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a multi-modal feature representation of the POI comprises:
splicing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI to obtain a splicing feature;
inputting the splicing characteristics into a fully-connected network obtained through pre-training, and acquiring multi-modal characteristic representation of the POI output by the fully-connected network.
9. The method of claim 8, wherein the fully connected network is pre-trained by:
acquiring a second training sample, wherein the second training sample comprises a POI sample and a category label of the POI sample;
extracting a visual feature representation of the POI sample from an image of the POI sample by using the image feature extraction model;
extracting semantic feature representation from the text information of the POI sample by using the text feature extraction model;
extracting spatial feature representation from the spatial position information of the POI sample by using a spatial feature extraction model;
splicing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI sample to obtain the splicing feature of the POI sample;
inputting the splicing feature of the POI sample into a fully connected network, and acquiring the multi-modal feature representation of the POI sample output by the fully connected network;
and inputting the multi-modal feature representation into a classification network, taking the class label of the POI sample as the target output of the classification network, and training the fully connected network and the classification network.
10. The method of claim 1, further comprising:
calculating similarity between POIs based on the multi-modal feature representations of the POIs.
11. An apparatus for extracting multi-modal POI features, comprising:
the visual feature extraction module is used for extracting visual feature representation of the POI from the image of the POI by utilizing an image feature extraction model;
the semantic feature extraction module is used for extracting semantic feature representation from the text information of the POI by utilizing a text feature extraction model;
the spatial feature extraction module is used for extracting spatial feature representation from the spatial position information of the POI by using a spatial feature extraction model;
and the feature fusion module is used for fusing the visual feature representation, the semantic feature representation and the spatial feature representation of the POI to obtain the multi-modal feature representation of the POI.
12. The apparatus of claim 11, wherein the visual feature extraction module is specifically configured to extract a signboard region from an image containing a POI signboard using a target detection technique; and extracting visual feature representations of the POI from the signboard area by using an image feature extraction model obtained by training in advance.
13. The apparatus of claim 11 or 12, further comprising:
the first model training unit is used for obtaining the image feature extraction model through pre-training in the following mode: obtaining a first training sample, the first training sample comprising: the method comprises the following steps of (1) carrying out image sample and class marking on the image sample; taking the image sample as the input of a deep neural network, taking the class label of the image sample as the target output of a classification network, and training the deep neural network and the classification network; the deep neural network extracts visual feature representations from the image samples and inputs the visual feature representations into the classification network, and the classification network outputs classification results of the image samples according to the visual feature representations; and after the training is finished, obtaining the image feature extraction model by using the deep neural network obtained by training.
14. The apparatus of claim 11, further comprising:
the text acquisition unit is used for acquiring text information of the POI from a POI database; and/or recognizing text information of the POI from the image containing the POI signboard by utilizing a character recognition technology.
15. The apparatus of claim 11, wherein the text feature extraction model comprises:
the Word Embedding model, the pre-training language model or the model obtained by utilizing the existing POI text data to finely adjust the pre-training language model.
16. The apparatus according to claim 11, wherein the spatial feature extraction module is specifically configured to hash the spatial location information of the POI to obtain a hash code; and converting the hash code into a spatial feature representation by using a spatial feature extraction model.
17. The apparatus of claim 11 or 16, wherein the spatial feature extraction model comprises a word embedding model.
18. The apparatus according to claim 11, wherein the feature fusion module is specifically configured to splice the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI to obtain a spliced feature; and to input the spliced feature into a fully connected network obtained through pre-training and acquire the multi-modal feature representation of the POI output by the fully connected network.
19. The apparatus of claim 18, further comprising:
a second model training unit, configured to pre-train to obtain the fully-connected network in the following manner:
acquiring a second training sample, wherein the second training sample comprises a POI sample and a class label of the POI sample; extracting a visual feature representation of the POI sample from an image of the POI sample by using the image feature extraction model; extracting a semantic feature representation from the text information of the POI sample by using the text feature extraction model; extracting a spatial feature representation from the spatial position information of the POI sample by using the spatial feature extraction model; splicing the visual feature representation, the semantic feature representation, and the spatial feature representation of the POI sample to obtain the splicing feature of the POI sample; inputting the splicing feature of the POI sample into the fully connected network, and acquiring the multi-modal feature representation of the POI sample output by the fully connected network; and inputting the multi-modal feature representation into a classification network, taking the class label of the POI sample as the target output of the classification network, and training the fully connected network and the classification network.
20. The apparatus of claim 11, further comprising:
and the similarity calculation unit is used for calculating the similarity between the POIs based on the multi-modal feature representation of the POIs.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
Application CN202110312700.4A, filed 2021-03-24: Method and device for extracting multi-modal POI (Point of interest) features. Status: Pending. Published as CN113032672A.

Priority Applications (4)

    • CN202110312700.4A (priority date 2021-03-24, filed 2021-03-24), published as CN113032672A: Method and device for extracting multi-modal POI (Point of interest) features
    • JP2022576469A, published as JP2023529939A: Multimodal POI feature extraction method and apparatus
    • PCT/CN2021/107383, published as WO2022198854A1: Method and apparatus for extracting multi-modal POI feature
    • KR1020227044369A, published as KR20230005408A: Method and apparatus for extracting multi-modal POI features

Applications Claiming Priority (1)

    • CN202110312700.4A (priority date 2021-03-24, filed 2021-03-24), published as CN113032672A: Method and device for extracting multi-modal POI (Point of interest) features

Publications (1)

    • CN113032672A, published 2021-06-25

Family

ID: 76473210

Family Applications (1)

    • CN202110312700.4A (filed 2021-03-24): CN113032672A, pending

Country Status (4)

    • CN: CN113032672A
    • JP: JP2023529939A
    • KR: KR20230005408A
    • WO: WO2022198854A1

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807218A (en) * 2021-09-03 2021-12-17 iFLYTEK Co., Ltd. Layout analysis method, layout analysis device, computer equipment and storage medium
CN113807102A (en) * 2021-08-20 2021-12-17 Beijing Baidu Netcom Science and Technology Co Ltd Method, device, equipment and computer storage medium for establishing semantic representation model
JP2022088602A (en) * 2021-08-17 2022-06-14 Beijing Baidu Netcom Science and Technology Co Ltd Table generation method, device, electronic apparatus, storage medium and program
CN114821622A (en) * 2022-03-10 2022-07-29 Beijing Baidu Netcom Science and Technology Co Ltd Text extraction method, text extraction model training method, device and equipment
CN114861889A (en) * 2022-07-04 2022-08-05 Beijing Baidu Netcom Science and Technology Co Ltd Deep learning model training method, target object detection method and device
CN114911787A (en) * 2022-05-31 2022-08-16 Nanjing University Multi-source POI data cleaning method fusing position and semantic constraints
WO2022198854A1 (en) * 2021-03-24 2022-09-29 Beijing Baidu Netcom Science and Technology Co Ltd Method and apparatus for extracting multi-modal poi feature
CN115455129A (en) * 2022-10-14 2022-12-09 Alibaba (China) Co., Ltd. POI processing method and device, electronic equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966061B (en) * 2022-12-28 2023-10-24 Shanghai Zhixun Information Technology Co., Ltd. Disaster early warning processing method, system and device based on 5G message
CN116665228B (en) * 2023-07-31 2023-10-13 Hundsun Technologies Inc. Image processing method and device
CN116805531B (en) * 2023-08-24 2023-12-05 Anhui Tongling Bionic Technology Co., Ltd. Pediatric remote medical system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104166982A (en) * 2014-06-30 2014-11-26 Fudan University Image optimization clustering method based on typical correlation analysis
CN111460077A (en) * 2019-01-22 2020-07-28 Dalian University of Technology Cross-modal Hash retrieval method based on class semantic guidance
CN112101165A (en) * 2020-09-07 2020-12-18 Tencent Technology (Shenzhen) Co., Ltd. Interest point identification method and device, computer equipment and storage medium
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
US20210097103A1 (en) * 2018-06-15 2021-04-01 Naver Labs Corporation Method and system for automatically collecting and updating information about point of interest in real space

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472232B (en) * 2018-10-31 2020-09-29 Shandong Normal University Video semantic representation method, system and medium based on multi-mode fusion mechanism
CN112200317B (en) * 2020-09-28 2024-05-07 Southwest China Institute of Electronic Technology (No. 10 Research Institute of China Electronics Technology Group Corporation) Multi-mode knowledge graph construction method
CN113032672A (en) * 2021-03-24 2021-06-25 Beijing Baidu Netcom Science and Technology Co Ltd Method and device for extracting multi-modal POI (Point of interest) features

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104166982A (en) * 2014-06-30 2014-11-26 Fudan University Image optimization clustering method based on typical correlation analysis
US20210097103A1 (en) * 2018-06-15 2021-04-01 Naver Labs Corporation Method and system for automatically collecting and updating information about point of interest in real space
CN111460077A (en) * 2019-01-22 2020-07-28 Dalian University of Technology Cross-modal Hash retrieval method based on class semantic guidance
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN112101165A (en) * 2020-09-07 2020-12-18 Tencent Technology (Shenzhen) Co., Ltd. Interest point identification method and device, computer equipment and storage medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022198854A1 (en) * 2021-03-24 2022-09-29 Beijing Baidu Netcom Science and Technology Co Ltd Method and apparatus for extracting multi-modal poi feature
JP7300034B2 2021-08-17 2023-06-28 Beijing Baidu Netcom Science and Technology Co Ltd Table generation method, device, electronic device, storage medium and program
JP2022088602A (en) * 2021-08-17 2022-06-14 Beijing Baidu Netcom Science and Technology Co Ltd Table generation method, device, electronic apparatus, storage medium and program
CN113807102A (en) * 2021-08-20 2021-12-17 Beijing Baidu Netcom Science and Technology Co Ltd Method, device, equipment and computer storage medium for establishing semantic representation model
CN113807102B (en) * 2021-08-20 2022-11-01 Beijing Baidu Netcom Science and Technology Co Ltd Method, device, equipment and computer storage medium for establishing semantic representation model
CN113807218B (en) * 2021-09-03 2024-02-20 iFLYTEK Co., Ltd. Layout analysis method, device, computer equipment and storage medium
CN113807218A (en) * 2021-09-03 2021-12-17 iFLYTEK Co., Ltd. Layout analysis method, layout analysis device, computer equipment and storage medium
CN114821622A (en) * 2022-03-10 2022-07-29 Beijing Baidu Netcom Science and Technology Co Ltd Text extraction method, text extraction model training method, device and equipment
CN114911787A (en) * 2022-05-31 2022-08-16 Nanjing University Multi-source POI data cleaning method fusing position and semantic constraints
CN114911787B (en) * 2022-05-31 2023-10-27 Nanjing University Multi-source POI data cleaning method integrating position and semantic constraint
CN114861889B (en) * 2022-07-04 2022-09-27 Beijing Baidu Netcom Science and Technology Co Ltd Deep learning model training method, target object detection method and device
CN114861889A (en) * 2022-07-04 2022-08-05 Beijing Baidu Netcom Science and Technology Co Ltd Deep learning model training method, target object detection method and device
CN115455129B (en) * 2022-10-14 2023-08-25 Alibaba (China) Co., Ltd. POI processing method, POI processing device, electronic equipment and storage medium
CN115455129A (en) * 2022-10-14 2022-12-09 Alibaba (China) Co., Ltd. POI processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022198854A1 (en) 2022-09-29
KR20230005408A (en) 2023-01-09
JP2023529939A (en) 2023-07-12

Similar Documents

Publication Publication Date Title
CN113032672A (en) Method and device for extracting multi-modal POI (Point of interest) features
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
CN112949415A (en) Image processing method, apparatus, device and medium
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN112633276A (en) Training method, recognition method, device, equipment and medium
CN113705716B (en) Image recognition model training method and device, cloud control platform and automatic driving vehicle
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN114490998A (en) Text information extraction method and device, electronic equipment and storage medium
CN112989097A (en) Model training and picture retrieval method and device
CN114120304A (en) Entity identification method, device and computer program product
CN114419035A (en) Product identification method, model training device and electronic equipment
CN113157829A (en) Method and device for comparing interest point names, electronic equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN114092948A (en) Bill identification method, device, equipment and storage medium
CN113139110A (en) Regional feature processing method, device, equipment, storage medium and program product
CN116246287B (en) Target object recognition method, training device and storage medium
CN115482436B (en) Training method and device for image screening model and image screening method
CN113344121B (en) Method for training a sign classification model and sign classification
CN112818972B (en) Method and device for detecting interest point image, electronic equipment and storage medium
CN113657364A (en) Method, device, equipment and storage medium for recognizing character mark
CN114972910A (en) Image-text recognition model training method and device, electronic equipment and storage medium
CN114596442A (en) Image identification method, device, equipment and storage medium
CN112580620A (en) Sign picture processing method, device, equipment and medium
CN113361522A (en) Method and device for determining character sequence and electronic equipment
CN114091463B (en) Regional work order random point analysis method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination