WO2022130734A1 - Metadata extraction program

Metadata extraction program

Info

Publication number
WO2022130734A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
information
image
intent
metadata
Prior art date
Application number
PCT/JP2021/036456
Other languages
French (fr)
Japanese (ja)
Inventor
基光 白川
Original Assignee
ソプラ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソプラ株式会社
Publication of WO2022130734A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates

Definitions

  • the present invention relates to a metadata extraction program suitable for extracting metadata from information contained in an image of an information medium and further identifying an appropriate image corresponding to a received conversational sentence.
  • Metadata is associated with each image of each information medium.
  • the metadata referred to here is a description of the attribute values of the determined attributes for the image of the information medium.
  • for example, for an image of a hot spring pamphlet, the metadata includes the content of the open-air bath photograph, that is, information indicating that the content of the photograph is an "open-air bath"
  • as well as incidental impressions such as the comfort and warmth suggested by the steam in the open-air bath photograph and the scenery around the open-air bath.
  • the work of associating metadata with images of such information media is, however, very laborious.
  • text information in an image can easily be converted into metadata using a well-known technique such as OCR, but it is difficult to immediately convert a photograph into metadata in the same way.
  • metadata that is particularly useful in business often has to be defined by humans, which is time-consuming.
  • although methods for generating metadata from an image have been proposed in the past (see, for example, Patent Document 1), there is no mention of generating metadata that would otherwise have to rely on human visual discrimination, nor of generating and associating metadata that makes it easier to search images of information media after the fact against a received conversational sentence.
  • the present invention has been devised in view of the above problems, and its object is to provide a metadata extraction program and system that extract metadata from information contained in an image, in particular metadata that would otherwise have to rely on human visual discrimination, and that generate and associate metadata that is more convenient for subsequent searches.
  • the metadata extraction program according to the present invention extracts metadata from information contained in an image of an information medium, and causes a computer to execute a feature map generation step of generating a feature map in which features are extracted from the image of the information medium, and an inference step of referring to one or more individual inference models in which feature maps and feature labels for each element are associated with each other, and extracting the feature label for each element as metadata from the feature map generated in the feature map generation step.
  • the image specifying program includes a conversation sentence reception step of accepting a conversation sentence, an entity extraction step of extracting one or more entities included in the one or more conversation sentences received in the conversation sentence reception step, and an image identification step of referring to an entity table in which entities consisting of feature labels and associative words derived from the feature labels are associated with images one-to-one or one-to-many, and identifying the image associated with the one or more entities extracted in the entity extraction step.
  • the metadata extraction system according to the present invention extracts metadata from information contained in an image of an information medium, and includes a feature map generation means for generating a feature map in which features are extracted from the image of the information medium, and an inference means that refers to one or more individual inference models in which feature maps and feature labels for each element are associated with each other, and extracts the feature label for each element as metadata from the feature map generated by the feature map generation means.
  • according to the present invention, for the various types of imaged information media, it is possible to generate metadata that makes an ex post facto search of the images of these information media more convenient.
  • even when conversational sentences are acquired by voice, it is possible to extract images of appropriate information media corresponding to the received conversational sentences with high accuracy, and to generate and associate metadata that makes this possible.
  • since this metadata includes associative words, it contains not only the keyword of the feature label itself but also various words that can be associated with it. Therefore, when specifying an image, even if only such an associative word is included in the conversational sentence, the image can still be identified from it. For this reason, it is also possible to generate metadata corresponding to sensory impressions such as "hot", "heavy", or "fast" that a person viewing the image receives.
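As a rough illustration of the two flows just described, the following Python sketch strings together the claimed steps (feature map generation, inference, association of associative words, entity extraction, image identification). All function names and the toy models are illustrative assumptions, not taken from the patent.

```python
# A minimal, self-contained sketch of the two flows described above.
# All function and variable names are illustrative assumptions, not taken from the patent.

def generate_feature_map(image_pixels):
    # Stand-in for the feature map generation step: here just a normalized copy of the pixels.
    mx = max(image_pixels) or 1
    return [p / mx for p in image_pixels]

def extract_metadata(image_pixels, individual_models, associative_model):
    feature_map = generate_feature_map(image_pixels)
    # Inference step: each individual model maps the common feature map to one feature label.
    labels = [model(feature_map) for model in individual_models]
    # Associating step: add associative words linked to each feature label.
    words = [w for lbl in labels for w in associative_model.get(lbl, [])]
    return labels + words

def identify_image(sentence, entity_table):
    # Entity extraction step (crude word split) and image identification step.
    tokens = set(sentence.lower().replace("?", "").split())
    return [img for img, ents in entity_table.items() if tokens & {e.lower() for e in ents}]

if __name__ == "__main__":
    models = [lambda fm: "car", lambda fm: "red"]            # toy individual inference models
    assoc = {"car": ["vehicle", "heavy"], "red": ["color"]}  # toy associative model set
    meta = extract_metadata([10, 20, 30], models, assoc)
    table = {"A1": meta}
    print(identify_image("Is there an image of a heavy vehicle?", table))  # -> ['A1']
```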
  • FIG. 1 is a diagram showing an overall configuration of a metadata extraction system.
  • FIG. 2 is a block configuration diagram of a metadata extraction device.
  • FIG. 3 is a diagram showing a comprehensive inference model according to the present embodiment.
  • FIG. 4 is a diagram showing an example of an image for which metadata is to be extracted.
  • FIG. 5 is a diagram showing an example of a table required for performing various inferences.
  • FIG. 6 is a diagram showing an associative word inference model according to the present embodiment.
  • FIG. 7 is a block configuration diagram of the image identification system.
  • FIG. 8 is a flowchart showing the processing operation of the metadata extraction system.
  • FIG. 9 is a diagram schematically showing the processing operation from the extraction of the metadata to the storage.
  • FIG. 10 is a diagram for explaining the operation of the image specifying system.
  • FIG. 11 is a diagram showing another example of the comprehensive inference model according to the present embodiment.
  • FIG. 1 shows the overall configuration of the metadata extraction system 100.
  • the metadata extraction system 100 includes a user terminal 10 that can access the public communication network 50, a metadata extraction device 1 connected to the public communication network 50, and an image identification system 2.
  • the public communication network 50 is an Internet communication network or the like, but when operating in a narrow area such as in the company, it may be configured by a LAN (Local Area Network). Further, the public communication network 50 may be configured by a so-called optical fiber communication network. Further, the public communication network 50 is not limited to the wired communication network, and may be realized by a wireless communication network.
  • the user terminal 10 includes any electronic device, such as a personal computer (PC), smartphone, tablet terminal, mobile phone, wearable terminal, or digital camera, that can display an image of an information medium or capture an image of the information medium.
  • the user terminal 10 captures an image A1 of an information medium by a digital camera mounted on the user terminal 10 or acquires an image A1 of an information medium from a server or the like (not shown) via a public communication network 50.
  • the information medium referred to here refers to all media showing information such as pamphlets, catalogs, company information and advertisements, and various documents and explanatory materials.
  • the image A1 is composed of such an information medium as image data.
  • the user terminal 10 transmits the image A1 thus acquired to the metadata extraction device 1 via the public communication network 50.
  • the metadata extraction device 1 is a device that extracts metadata about an image A1 received from a user terminal 10 via a public communication network 50, and is composed of, for example, a PC, a server, or the like. In this metadata extraction device 1, metadata is associated with each image A1.
  • the metadata referred to here is a description of the attribute values of determined attributes for the image of the information medium. For example, if the image is of a catalog of a car for sale, the metadata includes the information displayed in the image.
  • the content of the picture of the car, that is, information indicating that the content of the picture is a "car"
  • incidental information obtained from the photograph, such as the color of the car, the shape of the car, the atmosphere of the car, and the impression that the car looks fast.
  • the textualized version of this incidental information becomes the metadata associated with the image A1 of the car catalog as an information medium.
  • this metadata may also include information indicating what kind of information medium the image is in the first place; for example, information indicating the type of information medium, such as a pamphlet, an X-ray photograph, company information, a catalog, an advertising leaflet, or financial statements.
  • for example, if the image is of a hot spring pamphlet, the metadata includes the textual information that can be read from the characters displayed in the image, such as the inn name, room rate, the inn's address and contact information, and the check-in and check-out times.
  • the content of the picture of the open-air bath, that is, information indicating that the content of the picture is an "open-air bath"
  • incidental information obtained from the photographs, such as the comfort and warmth suggested by the steam in the picture of the open-air bath, the colors and beauty created by the harmony between the open-air bath and the surrounding landscape, and the menu of dishes that can be extracted from photographs of meals.
  • the textualized version of this incidental information becomes the metadata associated with the image A1 of the hot spring pamphlet as an information medium.
  • the metadata extraction device 1 collects an image A1 related to an information medium via the user terminal 10, and sequentially executes a processing operation of associating the metadata for each of the images A1.
  • the metadata extraction device 1 builds a database in which each metadata is associated with the image A1 by storing each of the images A1 to which the metadata is associated.
  • the image identification system 2 is a system for specifying an image desired by a user by referring to the database, constructed in the metadata extraction device 1, in which the images A1 are associated with their metadata.
  • the image specifying system 2 receives a conversational sentence directly from the user or from the user terminal 10. Then, the image specifying system 2 accesses the metadata extraction device 1, identifies an appropriate image corresponding to the received conversational sentence, displays it directly to the user, or transmits it to the user terminal 10.
  • the user who operates the user terminal 10 can acquire a desired image sent from the image specifying system 2 by specifying, by voice, the image to be viewed.
  • FIG. 2 shows a block configuration of the metadata extraction device 1.
  • the metadata extraction device 1 includes a central control unit 20, an execution unit 3 and an auxiliary storage unit 4 connected to the central control unit 20, respectively.
  • the central control unit 20 is, for example, a CPU (Central Processing Unit), and executes processing by calling a program stored in the execution unit 3.
  • the central control unit 20 controls each component mounted in the metadata extraction device 1.
  • the execution unit 3 includes an image acquisition unit 5, a feature map generation unit 6, a general inference unit 7, an associating unit 8, and an extraction unit 9.
  • the execution unit 3 is configured by a RAM (Random Access Memory)
  • a program corresponding to the configurations of the image acquisition unit 5 to the extraction unit 9 is stored in the execution unit 3.
  • the image acquisition unit 5 acquires the image A1 received from the user terminal 10 via the public communication network 50.
  • the feature map generation unit 6 generates a feature map 30, which will be described later, which is data obtained by extracting features from the image A1.
  • the feature map 30 is composed in units of pixels or, for example, of block areas each being an aggregate of a plurality of pixels, and the feature amounts of the analyzed image, obtained by well-known image analysis using a deep learning technique as necessary, are reflected on a two-dimensional image. As a result, it is possible to obtain a feature map 30 in which the characteristic portions of an object to be discriminated in a photographic image are highlighted.
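A minimal sketch of a feature map in this sense, assuming a simple per-block contrast measure as the feature amount; an actual implementation would more likely use CNN activations, but the idea of reflecting feature amounts onto a two-dimensional grid is the same.

```python
# Hedged sketch of a block-unit feature map: the image is divided into block areas and a
# per-block "feature amount" (here, simple local contrast) is laid out on a 2D grid.
import numpy as np

def block_feature_map(gray_image: np.ndarray, block: int = 8) -> np.ndarray:
    h, w = gray_image.shape
    rows, cols = h // block, w // block
    fmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = gray_image[i*block:(i+1)*block, j*block:(j+1)*block]
            fmap[i, j] = patch.std()        # featured (high-contrast) areas get high values
    return fmap / (fmap.max() or 1.0)

# Example: the edges of a bright square stand out in the resulting feature map.
img = np.zeros((64, 64))
img[20:36, 20:36] = 255.0
print(block_feature_map(img).round(2))
```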
  • the general inference unit 7 applies one general inference model DB1, composed of one or more individual inference models DB11 to DB16 described later, to the feature map 30 generated by the feature map generation unit 6, and thereby infers at least the content of one or more objects in the image A1 and generates metadata.
  • the associating unit 8 associates one or a plurality of associative words with the feature label as the overall inference result, which is the result of inference by the overall inference unit 7.
  • an associative word is a word associated with the word of the feature label. For example, if the feature label is "car", then "vehicle", "heavy", and so on are associative words; if the feature label is "hot spring", then "warm", "sulfur", "alkaline", "steam", "recreation", and so on are associative words.
  • the set in which a plurality of associative words is associated with each such feature label is the associative model set TB2, which is stored in the auxiliary storage unit 4.
  • the auxiliary storage unit 4 is, for example, an SSD (Solid State Drive) or an HDD (Hard Disk Drive), and is a table storage unit 11, a corpus storage unit 14, an entity storage unit 15, an image data storage unit 26, and a meta. It includes a data storage unit 27.
  • One or two or more tables are stored in the table storage unit 11.
  • the comprehensive inference model DB1, the comprehensive inference result table TB1, and the associative model set TB2 are stored in the table storage unit 11 of the auxiliary storage unit 4.
  • FIG. 3 shows an example of the comprehensive inference model DB1.
  • the comprehensive inference model DB1 is composed of one or more individual inference models DB11 to DB15.
  • the inputs of the individual inference models DB11 to DB15 are common feature maps, and the output is the individual inference result.
  • the individual inference result is composed of text data consisting of words indicating the inferred result.
  • the input data and the output data are related to each other through the degree of association.
  • the degree of association indicates the degree of connection between the input data and the output data. For example, it can be determined that the higher the degree of association, the stronger the connection of each data.
  • the degree of association is indicated by, for example, three or more values such as percentages or three or more stages, and may be associated with two values or two stages.
  • the individual inference models DB11 to DB15 are generated by machine learning using, as a learning data set, a plurality of feature maps for learning and a plurality of individual inference results for learning.
  • for the machine learning, for example, a convolutional neural network (CNN) is used, and, for example, deep learning is applied.
  • each individual inference model DB11 to DB15 is built from a learning data set so as to infer the content, color, characters, and the like of an object shown in an image of an information medium from a feature map.
  • for example, the individual inference models DB11, DB12, and DB14 are models for inferring the contents of objects displayed in the image of the information medium, such as "car" and "tree".
  • the individual inference model DB13 is a model for inferring the color of the object shown in the image, and the individual inference model DB15 is used as a model for inferring character strings shown in the image.
  • the individual inference model is independently provided for each element consisting of the content of the object, the color of the object, the character string projected on the image, and the like.
  • the individual inference result may be expressed by an inference probability, which is the probability with which each search solution is inferred; taking the individual inference model DB11 as an example, car: 96.999%, tree: 2.01%, and so on.
  • the general inference model DB1 may take one feature map 30 as input and output, as the general inference result, only search solutions whose inference probability is at or above a threshold value of, for example, 80%. Further, when no search solution exceeds the threshold and only search solutions below the threshold exist, a plurality of search solutions may be output; for example, for DB13 and DB15 a plurality of search solutions are output as the comprehensive inference result.
  • the individual inference results represented by the text data obtained as the search solutions of the individual individual inference models DB11 to DB15 are used as the comprehensive inference results.
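The thresholding behaviour described above can be sketched as follows; the helper name, the 80% threshold, and the fallback of returning several lower-confidence candidates mirror the example in the text, while the probability data are made up.

```python
# Minimal sketch of the comprehensive inference result (assumed helper and data names).
# Each individual model returns {label: probability}; the comprehensive result keeps the
# label(s) above the threshold, or several top candidates when none reaches it.
def comprehensive_inference(individual_outputs, threshold=0.80, fallback_k=3):
    result = []
    for element, probs in individual_outputs.items():
        above = [lbl for lbl, p in probs.items() if p >= threshold]
        if above:
            result.extend(above)
        else:
            top = sorted(probs, key=probs.get, reverse=True)[:fallback_k]
            result.extend(top)        # several lower-confidence candidates (e.g. DB13, DB15)
    return result

outputs = {
    "object": {"car": 0.9699, "tree": 0.0201},
    "color":  {"red": 0.55, "green": 0.40, "blue": 0.05},   # no label reaches 80%
}
print(comprehensive_inference(outputs))   # -> ['car', 'red', 'green', 'blue']
```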
  • the general inference result may be stored in the table storage unit 11 after being made into a table as shown in the general inference result table TB1.
  • the comprehensive inference result table TB1 is composed of an image of an information medium forming a feature map and a table in which the comprehensive inference results are linked to each other.
  • the associative model set TB2 is configured as a table in which feature labels and associative words are linked to each other.
  • the feature label here corresponds to the text data output as search solutions of the above-mentioned general inference result, such as "car", "tree", "red", "green", "fence", "company A", and "lineup A".
  • associative words are words that can be associated with these feature labels. For example, if the feature label is “car”, there are associative words such as “heavy” and “vehicle”. If the feature label is “tree”, there are “resources”, “nature”, etc. as associative words. Further, if “Lineup A”, which is a lineup for sale by a certain automobile manufacturer, is a feature label, “car” and “manufacturer” can be mentioned as associative words. If the feature label is "red”, the associative word is "color”.
  • the associative model set TB2, in which associative words are associated with each feature label in a one-to-one or one-to-many relationship, is set in advance on the system side (operator side) and stored in the table storage unit 11. By preparing such an associative model set TB2 in advance, it is possible, when a feature label is input, to output one or more associative words associated with that feature label.
  • this associative model set TB2 is not limited to a binary representation of whether or not a feature label and an associative word are associated with each other; as shown in FIG. 6, they may be associated with each other through degrees of association.
  • the input is a feature label and the output is an associative word.
  • the feature label and the associative word are associated with each other through the degree of association.
  • the degree of association indicates the degree of connection between the input data and the output data. For example, it can be determined that the higher the degree of association, the stronger the connection of each data.
  • the degree of association is indicated by, for example, three or more values such as percentages or three or more stages, and may be indicated by two values or two stages.
  • Each node of the hidden layer constituting such a degree of association may be composed of a node of a neural network.
  • the association between the feature labels and the associative words may be generated by machine learning using feature labels and associative words as a learning data set.
  • for this machine learning, for example, a convolutional neural network is used, and, for example, deep learning is applied.
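A sketch of what the associative model set TB2 could look like as a simple lookup with degrees of association; the labels, words, and degrees are illustrative, and the patent equally allows a plain binary association or a trained network.

```python
# Sketch of an associative model set like TB2: feature label -> (associative word, degree).
ASSOCIATIVE_MODEL = {
    "car":        [("vehicle", 0.9), ("heavy", 0.6)],
    "hot spring": [("warm", 0.9), ("steam", 0.8), ("sulfur", 0.5), ("recreation", 0.4)],
    "red":        [("color", 0.9)],
}

def associative_words(feature_label, min_degree=0.0):
    # Return the associative words linked to a feature label, optionally filtered by degree.
    return [w for w, d in ASSOCIATIVE_MODEL.get(feature_label, []) if d >= min_degree]

print(associative_words("hot spring", min_degree=0.5))  # -> ['warm', 'steam', 'sulfur']
```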
  • the entity storage unit 15 stores one or more entities and entity values.
  • An entity is one or more words associated with conversational sentence information.
  • a word is a unit that constitutes a sentence.
  • a word may simply be called, for example, a "word" or a "term", or may be considered as a kind of morpheme (for example, an independent word described later).
  • in the entity storage unit 15, as shown in the entity table TB3 of FIG. 5, one or more entity values are stored in association with each of one or more entities.
  • entity value is a character string that embodies the entity.
  • the entity usually corresponds to any one or more conversational text information in one or more conversational text information stored in the corpus storage unit 14. Therefore, in the entity storage unit 15, for example, one or two or more entities may be stored for each one or more conversational sentence information stored in the corpus storage unit 14.
  • the entity value is associated with that entity in the associative model set TB2.
  • those including "car” in the feature label and the associative word are associated with “car", “lineup A”, “heavy”, and “vehicle”. Therefore, when “car” is used as an entity, the entity values are “car”, “lineup A”, “heavy”, and “vehicle”.
  • the entity values are "sweet”, “delicious”, “crop”, “breeding", "in the soil” and the like.
  • in the entity table TB3, such entities and entity values are associated with each other on a one-to-one or one-to-many basis. Therefore, via the entity table TB3, the entity related to an entity value can be derived from that entity value, and conversely the entity values can be derived from the entity.
  • of the feature label and the associative word, the character string that yields the larger number of search results when searched may be set as the entity, and the other character string, with the smaller number of results, may be set as the entity value.
  • an image is further associated with each mutually associated entity and entity value, and stored.
  • the entity and the entity value are extracted from the associative model set TB2 as described above, and the feature label in the associative model set TB2 is associated with an image in the comprehensive inference result table TB1. Therefore, an image can be associated with an entity and an entity value from these correspondence relationships. As a result, as shown in FIG. 5, the relationship between the entity and the entity values can be stored in association with each image A1, A2, A3, and so on.
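A sketch of an entity table in the spirit of TB3, with the reverse lookup (from entity value to entity and image) derived from it; the image identifiers and values are illustrative.

```python
# Sketch of an entity table like TB3: image -> {entity: [entity values]}, plus the reverse
# lookup described above (derive the entity and image from an entity value).
ENTITY_TABLE = {
    "A1": {"car": ["car", "lineup A", "heavy", "vehicle"]},
    "A2": {"hot spring": ["warm", "steam", "open-air bath"]},
}

def entity_for_value(value):
    # Return (image, entity) pairs whose entity values contain the given value.
    return [(img, ent) for img, row in ENTITY_TABLE.items()
            for ent, values in row.items() if value in values]

print(entity_for_value("lineup A"))   # -> [('A1', 'car')]
```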
  • the word that constitutes the entity or entity value may be, for example, a collocation.
  • a collocation is a word that expresses a certain meaning by connecting two or more independent words, and may be called a compound word.
  • the collocations are, for example, "hot spring inn” which is a combination of “hot spring” and “inn”, "lineup A” which is a combination of “lineup” and “A”, etc.
  • any set of two or more words may be used.
  • the day conversion information is information for converting a day word into a date.
  • a day word is a word about a day.
  • a day word is usually a word associated with the entity name "date entity", for example, "last month", "yesterday", "last week", "this year", "this month", "last year", or "previous term", but any word that can be converted into a date may be used.
  • the day conversion information includes a day word and day information acquisition information.
  • the day information acquisition information is information for acquiring day information.
  • the day information is information about the day corresponding to the day word, and is information used when composing inquiry information.
  • the day information may be, for example, information indicating a date such as "April 1" or information indicating a period from a start date to an end date such as "4/1 to 4/30", but is not limited to these.
  • the day information acquisition information is, for example, a function name or a method name, but may be API information, may be a program itself, and is not limited thereto.
  • the day information acquisition information for the day word "last month” for example, the current time information (for example, "May 10, 2020 11:15”: the same applies hereinafter) is acquired, and the current time information has.
  • Obtain the previous month (for example, "April") for a month (for example, "May") refer to the calendar information of the previous month, and perform day information (for example, the first day to the last day of the previous month).
  • a program or the like for acquiring "4/1 to 4/30" etc. may be used.
  • day information acquisition information for the day word "this year" for example, the current time information is acquired, and the calendar information of the year (for example, "2020") possessed by the current time information is referred to from the first day of the year.
  • API information or the like for acquiring day information for example, "2020/1/1 to 2020/5/10" up to the day of the current time information may be used.
  • day information acquisition information for the day word "yesterday” is a method of acquiring the current time information and acquiring the day information of the day before the day of the current time information (for example, "5/9"), or The method name or the like may be used.
  • the corpus storage unit 14 stores one or more conversational sentence information.
  • Conversational sentence information is information on conversational sentences.
  • Conversational sentence information is usually an example sentence of a conversational sentence. Examples of sentences are, for example, "show me a pamphlet of a hot spring inn" and "show me a catalog of cars of XX brands", but are not limited to these.
  • the conversation text information may be a conversation text template.
  • templates are, for example, "Do you have an image of a {car}?", "Show me a {pamphlet} of a {vehicle}", "Tell me the {information} of a {car} of a {maker}", "{company} ...", and the like.
  • the information expressed by "{ }", such as {car}, included in the template is an entity, that is, a variable.
  • Conversational text information usually corresponds to the intent.
  • the intent can also be said to be information for specifying the processing operation.
  • processing operations are, for example, "image search", "pamphlet search", "image information search", and the like, and the conversational sentence information may be stored in the corpus table TB4 in association with the information specifying each of these processing operations.
  • the corpus storage unit 14 for example, one or two or more conversational sentence information is stored for each one or more intents stored in the intent storage unit 12.
  • FIG. 5 shows an example of the corpus table TB4 relating to the conversational sentence information (corpus) associated with each intent.
  • the corpus table TB4 usually stores one or more entity information for each one or more stored conversational sentence information.
  • the entity information is information about each one or more entities corresponding to one conversational sentence information.
  • the entity information has the start position, end position, and entity name of each entity in addition to the above-mentioned entity and entity value.
  • the start position here is the position where the entity starts in the conversation text information.
  • the start position is represented by, for example, a value (for example, "1", "4", etc.) indicating which character the first character of the entity is in the character string constituting the conversation sentence.
  • the end position is the position where the entity ends in the conversation text information, and is, for example, a value indicating the number of the last character of the entity (for example, "2", "5", etc.). ).
  • the expression format of the start position and the end position is not limited to these.
  • the start position and the end position may be referred to as offsets. Further, the offset may be expressed by the number of bytes, and is not limited to this.
  • the entity name is the name of the entity.
  • the entity name is, for example, "object entity”, “date entity”, “information entity”, etc., but the format is not limited to these as long as it is information that can express the attributes of the entity.
  • the object entity is an entity related to an object, such as {vehicle}, {car}, {company}, etc. in Table 1.
  • a date entity is an entity related to a date.
  • An information entity is an entity related to the required information.
  • the entity information may have, for example, an entity name and an order information when the conversation text information is a template.
  • the order information is a value indicating which variable the entity name corresponds to in one or more variables included in the template.
  • the structure of the entity information is not limited to this.
  • the corpus in the embodiment may be considered as, for example, each piece of conversational sentence information stored in the corpus storage unit 14, and can also be thought of as a set of one or more pieces of conversational sentence information together with the entity information corresponding to each piece of conversational sentence information.
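One corpus entry in the spirit of TB4 might look like the following; the 1-based start and end positions are computed for the English example sentence used later in the description, and the entity names follow that example.

```python
# Sketch of one corpus entry like TB4: a conversational sentence tied to an intent, with
# entity information giving each entity's start/end position (1-based) and entity name.
corpus_entry = {
    "intent": "image information search",
    "sentence": "Tell me the selling price information of the car of maker XX",
    "entities": [
        {"entity": "selling price information", "name": "information entity", "start": 13, "end": 37},
        {"entity": "car",                       "name": "car entity",         "start": 46, "end": 48},
        {"entity": "maker XX",                  "name": "maker entity",       "start": 53, "end": 60},
    ],
}

def slice_entity(entry, ent):
    # Recover the word from the sentence using its 1-based start/end positions.
    return entry["sentence"][ent["start"] - 1: ent["end"]]

for ent in corpus_entry["entities"]:
    assert slice_entity(corpus_entry, ent) == ent["entity"]
print("all entity positions check out")
```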
  • the above-mentioned associative model set TB2, entity table TB3, and corpus table TB4 may be, for example, a tabular database.
  • one or more item names are registered in the table, and one or two or more values are registered for each one or more item names.
  • the item name may be referred to as an attribute name, and each value of 1 or more corresponding to one item name may be referred to as an attribute value.
  • the table is, for example, a relational database table, TSV, Excel, CSV, etc., but the type thereof is not limited to these.
  • the image data storage unit 26 is an area for storing the images acquired via the image acquisition unit 5. Each image is associated with the parameters in the general inference result table TB1 and the entity table TB3, and is stored in the image data storage unit 26 so that the associated image can be read out immediately.
  • each model DB1 and tables TB1 to TB4 stored in the auxiliary storage unit 4 are read out and referred to when the central control unit 20 executes various processing operations on the execution unit 3.
  • these may be updated every time a feature label is newly extracted as metadata in the general inference unit 7 based on an image acquired by the image acquisition unit 5, or every time a new associative word is derived in the associating unit 8.
  • the extraction unit 9 compares the entity extracted by the image specifying system 2 described later with at least one of the feature label and the associative word, and at least a feature label and the associative word that at least partially match the extracted entity. Extract the image associated with one. In such a case, each model DB1 and tables TB1 to TB4 described above are referred to.
  • the extraction unit 9 transmits the extracted image to the user terminal 10, and the user terminal 10 displays the image.
  • FIG. 7 is a block diagram of the image identification system 2.
  • the image identification system 2 includes a storage unit 19, a reception unit 29, a processing unit 39, and an output unit 49.
  • the storage unit 19 includes a table storage unit 11, an intent storage unit 12, an API information storage unit 13, a corpus storage unit 14, an entity storage unit 15, and a day conversion information storage unit 18.
  • the reception unit 29 includes a conversation sentence reception means 21.
  • the conversation sentence receiving means 21 includes a voice receiving means 211 and a voice recognizing means 212.
  • the processing unit 39 includes a parameterization unit 30, an intent determination unit 31, a conversational sentence information determination unit 32, an entity acquisition unit 33, a parameter acquisition unit 34, an API information acquisition unit 35, an inquiry information configuration unit 36, and a search result acquisition means 37.
  • the parameter acquisition unit 34 includes a determination unit 341, a day information acquisition unit 342, an entity name acquisition unit 343, a translation item name acquisition unit 344, a table identifier acquisition unit 345, a primary key identifier acquisition unit 346, and a conversion parameter acquisition unit 347.
  • the output unit 49 includes a search result output means 41.
  • the storage unit 19 is a database that stores various types of information.
  • the various types of information include, for example, tables, intents, API information, corpora, entities, entity mapping information, PK items, and day conversion information. Information on tables and the like will be described later. In addition, other information will be explained in a timely manner.
  • the table storage unit 11 stores the same table as the general inference model DB1, the general inference result table TB1, and the associative model set TB2 stored in the table storage unit 11 in the metadata extraction device 1. In the image identification system 2, if the table storage unit 11 in the metadata extraction device 1 is used, the table storage unit 11 may be omitted on the image identification system 2 side.
  • the intent is information managed for each image-specific processing, and can be said to be information for specifying an image-specific processing operation.
  • the intent usually has a process operation name that identifies the business process.
  • the processing operation name is the name of the processing operation.
  • the processing operation is usually a business process executed via API. However, the processing operation may be, for example, a business process executed according to the SQL statement.
  • processing operation name usually corresponds to the API information described later. Therefore, it may be considered that the intent is associated with the API information, for example, via the processing operation name.
  • API information is information about API.
  • API is an interface for using the functions of a program.
  • APIs are software such as, for example, functions, methods, or execution modules.
  • the API is, for example, a Web API, but other APIs may be used.
  • the Web API is an API constructed by using a Web communication protocol such as HTTP or HTTPS. Since APIs such as WebAPI are known techniques, detailed description thereof will be omitted.
  • API information is information that corresponds to the intent. As described above, the API information corresponds to the intent, for example, through the processing operation name.
  • API information is usually information for searching an imaged information medium.
  • the API information may be, for example, information for registering information or performing processing based on the information.
  • API information has one or more parameter-specific information.
  • the parameter specific information is information that specifies a parameter. It may be said that the parameter is a value having a specific attribute. The value is usually a variable. Variables can also be called arguments.
  • the parameter is usually the information obtained by converting the entity, but it may be the entity itself.
  • the parameters are, for example, arguments given to the API or variables in the SQL statement.
  • the parameter may be composed of, for example, a set of an attribute name and a value.
  • the parameter specific information is, for example, a parameter name.
  • the parameter name is the name of the parameter.
  • the parameter-specific information is, for example, an attribute name, but any information that can specify the parameter may be used.
  • the corpus storage unit 14 stores a table similar to the corpus table TB4 stored in the corpus storage unit 14 in the metadata extraction device 1. In the image identification system 2, if the corpus storage unit 14 in the metadata extraction device 1 is used, the corpus storage unit 14 may be omitted on the image identification system 2 side.
  • the entity storage unit 15 stores the same table as the entity table TB3 stored in the entity storage unit 15 in the metadata extraction device 1. In the image identification system 2, if the entity storage unit 15 in the metadata extraction device 1 is used, the entity storage unit 15 may be omitted on the image identification system 2 side.
  • the reception unit 29 receives various information.
  • the various types of information are, for example, conversational sentences.
  • the reception unit 29 receives information such as conversational sentences from a terminal, for example, but may receive information via an input device such as a keyboard, a touch panel, or a microphone.
  • the reception unit 29 may receive information read from a recording medium such as a disk or a semiconductor memory, and the mode of reception is not particularly limited.
  • the conversation sentence receiving means 21 accepts a conversation sentence.
  • a conversational sentence is a sentence in which a person speaks, and can be said to be a sentence in natural language.
  • the reception of conversational sentences is, for example, reception by voice, but reception by text may also be used.
  • Voice is a voice made by a person.
  • a text is a character string obtained by voice recognition of a voice uttered by a person.
  • a character string consists of an array of one or more characters.
  • the voice receiving means 211 receives the voice of the conversational sentence.
  • the voice receiving means 211 receives the voice of the conversation sentence from the terminal, for example, in pairs with the terminal identifier, but may receive the voice via the microphone.
  • the terminal identifier is information that identifies the terminal.
  • the terminal identifier is, for example, a MAC address, an IP address, an ID, or the like, but any information that can identify the terminal may be used.
  • the terminal identifier may be a user identifier that identifies the user of the terminal.
  • the user identifier is, for example, an e-mail address, a telephone number, or the like, but may be an ID, an address, a name, or the like, and may be any information that can identify the user.
  • the voice recognition means 212 performs voice recognition processing on the voice received by the voice reception means 211, and acquires a conversational sentence which is a character string.
  • the voice recognition process is a known technique, and detailed description thereof will be omitted.
  • the processing unit 39 performs various processes.
  • the various processes include, for example, the processing of the parameterization means 30, the intent determination means 31, the conversational sentence information determination means 32, the entity acquisition unit 33, the parameter acquisition unit 34, the API information acquisition means 35, the inquiry information configuration unit 36, the search result acquisition means 37, the determination means 341, the day information acquisition means 342, the entity name acquisition means 343, the translation item name acquisition means 344, the table identifier acquisition means 345, the primary key identifier acquisition means 346, the conversion parameter acquisition means 347, and the like.
  • the various processes include, for example, various discriminations described in the flowchart.
  • the processing unit 39 performs the processing of, for example, the parameterization means 30 and the intent determination means 31 in response to the conversation sentence reception means 21 receiving the conversation sentence.
  • the processing unit 39 performs processing by the intent determination means 31 or the like for each one or more terminal identifiers.
  • the parameterizing means 30 parameterizes one or more entities included in one or more conversational sentences received by the conversational sentence receiving means 21.
  • the parameterizing means 30 may parameterize the entity corresponding to the conversational sentence information determined by the conversational sentence information determining means 32.
  • the parameterizing means 30 parameterizes, as an example, the independent words, that is, the entities, included in a conversational sentence input by voice. For example, in Japanese, the conversational sentence "Is there an image of lineup A?" and a variant of it that differs only in its particles have essentially the same meaning. Nevertheless, with conventional search, conversational sentences that differ only in particles are not always recognized as having the same meaning. Therefore, the parameterizing means 30 parameterizes the independent words "lineup A" and "image" included in these conversational sentences, that is, the entities.
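The effect of parameterization can be sketched as below: only independent (content) words are kept, so two sentences that differ only in function words parameterize identically. The English stop-word list is a crude stand-in for the morphological handling of Japanese particles, and all names are illustrative.

```python
# Hedged sketch of parameterization: keep only the independent words of a sentence.
STOP_WORDS = {"is", "there", "an", "a", "of", "the", "do", "you", "have", "any"}

def parameterize(sentence):
    tokens = sentence.lower().replace("?", "").split()
    return {t for t in tokens if t not in STOP_WORDS}

# Two sentences differing only in function words yield the same parameters.
print(sorted(parameterize("Is there an image of lineup-A?")))      # ['image', 'lineup-a']
print(sorted(parameterize("Do you have any image of lineup-A?")))  # ['image', 'lineup-a']
```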
  • the intent determination means 31 determines the intent corresponding to the conversation sentence received by the conversation sentence reception means 21.
  • the intent determination means 31 first acquires, for example, the text corresponding to the conversation sentence received by the conversation sentence reception means 21.
  • the text is, for example, the result of voice recognition of the conversational sentence received by the conversational sentence receiving means 21, but may be the conversational sentence itself received by the conversational sentence receiving means 21.
  • the intent determination means 31 voice-recognizes the conversation sentence and acquires the text.
  • the intent determination means 31 may acquire the text.
  • the intent determining means 31 acquires one or more independent words from the acquired text by, for example, performing morphological analysis and syntactic analysis.
  • the morphological analysis is a known technique, and detailed description thereof will be omitted.
  • the intent determining means 31 determines an intent having a processing operation name having a word that is the same as or similar to the acquired one or more independent words.
  • a synonym dictionary is stored in the storage unit 1.
  • a synonym dictionary is a dictionary related to synonyms.
  • in the synonym dictionary, a word possessed by a processing operation name and one or more synonyms of that word are registered.
  • for example, the intent determination means 31 acquires one or more independent words, such as "maker XX", "car", or "selling price information", from the conversation sentence, searches the intent storage unit 12 using each independent word as a key, and judges whether or not there is an intent having a processing operation name that matches the independent word. The match is, for example, an exact match, but may be a partial match. Then, when there is an intent having a processing operation name with a word that matches the independent word, the intent determining means 31 determines that intent.
  • if there is no such intent, the intent determining means 31 acquires one of the one or more synonyms corresponding to the independent word from the synonym dictionary, searches the intent storage unit 12 using that synonym as a key, and judges whether or not there is an intent having a processing operation name with a word matching that synonym. When there is such an intent, the intent determining means 31 determines that intent. If there is no such intent, the intent determining means 31 performs the same processing for the other synonyms to determine the intent. If there is no such intent for any synonym, the intent determining means 31 may output that the intent has not been determined.
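The two-pass matching just described (independent words first, then their synonyms from the synonym dictionary, matched against processing operation names) might be sketched as follows; the intents and synonyms are illustrative.

```python
# Sketch of intent determination: match independent words, then synonyms, against
# processing operation names (all data illustrative).
INTENTS = {"image search": "API-1", "pamphlet search": "API-2", "image information search": "API-3"}
SYNONYMS = {"picture": ["image"], "brochure": ["pamphlet"]}

def determine_intent(independent_words):
    # First pass: the independent words themselves (partial match against the operation name).
    for word in independent_words:
        for intent_name in INTENTS:
            if word in intent_name.split():
                return intent_name
    # Second pass: synonyms from the synonym dictionary.
    for word in independent_words:
        for syn in SYNONYMS.get(word, []):
            for intent_name in INTENTS:
                if syn in intent_name.split():
                    return intent_name
    return None   # output that the intent has not been determined

print(determine_intent(["brochure", "hot", "spring"]))   # -> 'pamphlet search'
```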
  • the conversation sentence information determining means 32 searches the corpus storage unit 14 using the intent determined by the intent determining means 31 as a key, and the conversation sentence receiving means 21 is selected from one or more conversation sentence information corresponding to the intent. Determines the conversational sentence information that most closely matches the conversational sentence received by.
  • the conversational sentence information that most closely resembles the conversational sentence is, for example, the conversational sentence information that has the highest degree of similarity to the conversational sentence. That is, the conversation sentence information determining means 32 calculates, for example, the degree of similarity between the accepted conversation sentence and each one or more conversation sentence information corresponding to the determined intent, and the conversation sentence information having the maximum similarity degree. To decide.
  • the conversation sentence information determining means 32 may search for a conversation template that matches the template in which the position of the noun of the accepted conversation sentence is used as a variable. That is, the corpus storage unit 14 stores a template in which one or more entity names are used as variables, and the conversation sentence information determining means 32 stores each entity name of one or two or more of the accepted conversation sentences. Acquires the position of, and determines the template corresponding to the position of the acquired entity name as conversational sentence information. The position of each one or more entity names in the conversation sentence is information indicating the number of the entity name in the template having one or more entity names.
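A sketch of picking the most similar conversational sentence information, assuming a simple token-overlap (Jaccard) similarity; the patent does not fix a particular similarity measure, so this is only one possible choice.

```python
# Hedged sketch of choosing the conversational sentence information with the highest
# similarity to the accepted conversational sentence.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def most_similar(received, candidates):
    return max(candidates, key=lambda c: jaccard(received, c))

corpus_for_intent = ["show me a pamphlet of a hot spring inn",
                     "show me a catalog of cars of XX brand"]
print(most_similar("please show me the hot spring inn pamphlet", corpus_for_intent))
# -> 'show me a pamphlet of a hot spring inn'
```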
  • the entity acquisition unit 33 corresponds to one or more entities corresponding to the conversational sentence information determined by the conversational sentence information determining means 32, and is one or more words included in the conversational sentence received by the conversational sentence receiving means 21. Get an entity.
  • the entity acquisition unit 33 acquires, for each of the one or more entities corresponding to the determined conversational sentence information, the start position and end position of the entity from the corpus storage unit 14, and acquires, from the received conversational sentence, the word specified by that start position and end position.
  • the parameter acquisition unit 34 acquires one or more parameters corresponding to each of the one or more entities acquired by the entity acquisition unit 33.
  • the acquired parameter is, for example, the acquired entity itself, but it may be information obtained by converting the acquired entity. That is, for example, when the acquired day word is included in one or more entities, the parameter acquisition unit 34 converts the day word into the parameter day information.
  • the determination means 341 constituting the parameter acquisition unit 34 determines whether or not a day word exists in the one or more entities acquired by the entity acquisition unit 33. Specifically, for example, one or more day words are stored in the storage unit 1, and the determination means 341 determines, for each of the acquired one or more entities, whether or not the entity matches any of the stored day words; when the determination result for at least one entity shows a match, it determines that a day word exists in the acquired one or more entities.
  • the day information acquisition means 342 acquires the day conversion information corresponding to the day word from the day conversion information storage unit 18, and acquires the day information, which is a parameter, by using the day conversion information.
  • the Japanese word "last year” and the like are stored in the storage unit 1, and the conversational sentence "Show me the last year's pamphlet of brand B" is accepted, and the three entities "brand B",
  • the determination means 341 determines that the day word exists in the acquired 3 entities because the entity "last month” matches the day word "last month”. do.
  • the current time information is then acquired, and the day information corresponding to the day word is acquired (for example, "4/1 to 4/30" for the day word "last month").
  • specifically, the day information acquisition means 342 acquires the day information acquisition information (for example, a program) corresponding to the day word "last year" from the day conversion information storage unit 18. Then, using the day information acquisition information, the day information acquisition means 342 acquires the current time information (for example, "May 10, 2020 11:15") from the built-in clock of the MPU (Micro Processing Unit), an NTP server, or the like, and obtains the previous year (for example, "2019") of the year of the current time information (for example, "2020"). Then, by referring to the calendar information of that previous year, the day information acquisition means 342 acquires the day information "January 1, 2019 to December 31, 2019", from the first day to the last day of the previous year.
  • for the day word "this year", the day information acquisition means 342 acquires the day information acquisition information (for example, API information) corresponding to the day word "this year" from the day conversion information storage unit 18. Then, using the day information acquisition information, the day information acquisition means 342 acquires the current time information from the built-in clock or the like, refers to the calendar information of the year of the current time information (for example, "2020"), and acquires the day information from the first day of that year up to the day of the current time information (for example, "January 1, 2020 to May 10, 2020").
  • for the day word "yesterday", the day information acquisition means 342 acquires the day information acquisition information (for example, a method) corresponding to the day word "yesterday" from the day conversion information storage unit 18. Then, using the day information acquisition information, the day information acquisition means 342 acquires the current time information from the built-in clock or the like, and acquires the day information of the day before the day of the current time information (for example, "5/9").
  • the entity name acquisition means 343 acquires the entity name corresponding to the entity from the entity storage unit 15 for each one or more entities acquired by the entity acquisition unit 33.
  • the entity name corresponding to an entity is the entity name paired with a start position and an end position that match or are close to the position of that entity in the conversational sentence from which the entity was acquired.
  • the entity name acquisition means 343 may acquire the entity name corresponding to the entity from the entity storage unit 15 for each of the one or more entities acquired by the entity acquisition unit 33, for example, by using the entity information associated with the entity.
  • for example, the entity name acquisition means 343 uses the three pieces of entity information stored in the corpus storage unit 14 in association with the conversational sentence information "Tell me the selling price information of the car of manufacturer XX".
  • using the first piece of entity information, which has the same start position and end position as "manufacturer XX" in the accepted conversational sentence "Tell me the selling price information of manufacturer XX's car", it acquires the "maker entity" corresponding to "manufacturer XX".
  • likewise, using the second piece of entity information, which has the same start position and end position as "car" in the conversational sentence "Tell me the selling price information of the car of manufacturer XX", it acquires the "car entity" corresponding to "car", and using the third piece of entity information, which has the same start position and end position as "selling price information", it acquires the "information entity" corresponding to "selling price information".
  • the API information acquisition means 35 acquires API information corresponding to the intent determined by the intent determination means 31 from the API information storage unit 13.
  • the API information acquisition means 35 acquires, for example, API information having a processing operation name corresponding to the intent determined by the intent determination means 31 from the API information storage unit 13.
  • the processing operation name "image information search” and three or more parameter specific information "company code, Cope_code”, “vehicle code, Car_code”, and “information code, Info_code” are stored.
  • the API information acquisition means 35 has the processing operation name "image information” possessed by the intent. Acquire API information 1 having "search”.
  • the inquiry information configuration unit 36 configures inquiry information by using one or more parameters acquired by the parameter acquisition unit 34 and the API information acquired by the API information acquisition means 35.
  • Inquiry information is information for information retrieval and is usually information that can be executed.
  • The inquiry information is, for example, a function or method into which arguments have been inserted, but it may also be a completed SQL statement or a set of a URL and parameters.
  • The inquiry information configuration unit 36 constructs the inquiry information, for example, by placing, at each of the one or more variables possessed by the API information acquired by the API information acquisition means 35, the corresponding parameter acquired by the parameter acquisition unit 34.
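As one way to picture how the inquiry information is assembled, the sketch below fills the variables of an assumed API template with acquired parameters. The SQL template, table name, and parameter values are assumptions made for this example; only the parameter names are taken from the API information example above.

```python
# Hypothetical API information: a processing operation name plus a template
# whose variables are filled with the parameters acquired for the intent.
api_information = {
    "operation": "image information search",
    "template": "SELECT * FROM images WHERE Cope_code = :Cope_code "
                "AND Car_code = :Car_code AND Info_code = :Info_code",
}

# Parameters acquired by the parameter acquisition unit (values are assumptions).
parameters = {"Cope_code": "C001", "Car_code": "V123", "Info_code": "PRICE"}

def build_inquiry_information(api_info: dict, params: dict) -> str:
    """Place each acquired parameter at the corresponding variable of the template."""
    inquiry = api_info["template"]
    for name, value in params.items():
        inquiry = inquiry.replace(f":{name}", repr(value))
    return inquiry

print(build_inquiry_information(api_information, parameters))
# SELECT * FROM images WHERE Cope_code = 'C001' AND Car_code = 'V123' AND Info_code = 'PRICE'
```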
  • The search result acquisition means 37 executes the inquiry information constructed by the inquiry information configuration unit 36 and acquires a search result by searching the storage unit 1 (database) using the parameters obtained by the parameterization means 30. Further, the search result acquisition means 37 may generate API information including the parameters obtained by the parameterization means 30 and search the storage unit 1 (database) on the basis of the generated API information. That is, API information may be generated by writing a new parameter, or by rewriting an already written parameter into a new parameter, and the database may be searched on the basis of the API information reflecting that parameter. Information for inquiries such as API information and SQL, and the detailed operation of the search result acquisition means 37, will be described with specific examples and modified examples.
  • the output unit 49 outputs various information.
  • the various types of information are, for example, images of the searched information medium.
  • In response to the reception unit 29 receiving information such as a conversational sentence paired with a terminal identifier, the output unit 49 transmits information such as the search result, which is the result of the various processing performed by the processing unit 39, to the terminal identified by that terminal identifier.
  • The output unit 49 may output information such as the search result via an output device such as a display or a speaker.
  • The output unit 49 may also print out the various types of information with a printer, store them in a recording medium, pass them to another program, or transmit them to an external device; the form of output is not particularly limited.
  • the search result output means 41 outputs the search result acquired via the search result acquisition means 37.
  • For example, in response to the conversational sentence reception means 21 receiving a conversational sentence from the user terminal 10, the search result output means 41 transmits to the user terminal 10 an image as the search result acquired by the search result acquisition means 37.
  • Alternatively, in response to the conversational sentence reception means 21 receiving a conversational sentence via an input device such as a microphone, the search result output means 41 may display an image as the search result acquired by the search result acquisition means 37 via an output device such as a display.
  • The storage unit 19, the table storage unit 11, the intent storage unit 12, the API information storage unit 13, the corpus storage unit 14, the entity storage unit 15, and the day conversion information storage unit 18 are preferably realized by a non-volatile recording medium such as a hard disk or a flash memory, but can also be realized by a volatile recording medium such as a RAM.
  • the process of storing information in the storage unit 19 or the like is not particularly limited.
  • Information may be stored in the storage unit 19 or the like via a recording medium, information transmitted via a network, a communication line, or the like may be stored in the storage unit 19 or the like, or information input via an input device may be stored in the storage unit 19 or the like.
  • the input device may be, for example, a keyboard, a mouse, a touch panel, a microphone, or the like.
  • the reception unit 29, the conversation text reception means 21, the voice reception means 211, and the voice recognition means 212 may or may not include the input device.
  • The reception unit 29 and the like can be realized by the driver software of the input device, or by the input device and its driver software. Further, the function of the reception unit 29 may be implemented in the user terminal 10, and the conversational sentence information acquired at the user terminal 10 may be sent to the image identification system 2 via the public communication network 50.
  • The processing unit 39, the intent determination means 31, the conversational sentence information determination means 32, the entity acquisition unit 33, the parameter acquisition unit 34, the API information acquisition means 35, the inquiry information configuration unit 36, the search result acquisition means 37, the determination means 341, the day information acquisition means 342, the entity name acquisition means 343, the translation item name acquisition means 344, the table identifier acquisition means 345, the primary key identifier acquisition means 346, and the conversion parameter acquisition means 347 can usually be realized by a CPU (Central Processing Unit) or an MPU, a memory, and the like.
  • the processing procedure of the processing unit 39 and the like is usually realized by software, and the software is recorded in a recording medium such as ROM. However, the processing procedure may be realized by hardware (dedicated circuit).
  • the output unit 49 and the search result output means 41 may or may not include output devices such as displays and speakers.
  • the output unit 49 and the like can be realized by the driver software of the output device, or by the output device and the driver software thereof.
  • The receiving function of the reception unit 29 and the like is usually realized by wireless or wired communication means (for example, a communication module such as a NIC (Network Interface Controller) or a modem), but it may also be realized by means for receiving a broadcast (for example, a broadcast receiving module).
  • the transmission function of the output unit 49 or the like is usually realized by a wireless or wired communication means, but may be realized by a broadcasting means (for example, a broadcasting module).
  • Steps S11 to S17 show a processing operation of extracting metadata from an image of each information medium by the metadata extraction device 1.
  • In step S11, the image acquisition unit 5 acquires an image of the information medium.
  • The image of the information medium is either captured via the user terminal 10 and received by the metadata extraction device 1 via the public communication network 50, or captured directly through a camera (not shown) attached to the metadata extraction device 1.
  • In step S12, the feature map generation unit 6 generates a feature map for the image acquired in step S11.
  • This feature map is generated in pixel units or, for example, in block-area units, each block being an aggregate of a plurality of pixels, by reflecting the feature amounts of the analyzed image onto a two-dimensional image.
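A minimal sketch of block-area feature-map generation follows. Plain per-block averaging is used here as a stand-in for the image analysis and deep-learning feature extraction actually contemplated, so the computation itself is an assumption made for illustration.

```python
import numpy as np

def generate_feature_map(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Produce a coarse 2-D feature map in block-area units.

    Each block of `block` x `block` pixels is summarized by its mean intensity;
    a real system would instead use learned (e.g. CNN) feature extraction here.
    """
    h, w = image.shape[:2]
    gray = image.mean(axis=2) if image.ndim == 3 else image
    fh, fw = h // block, w // block
    feature_map = np.zeros((fh, fw), dtype=np.float32)
    for i in range(fh):
        for j in range(fw):
            feature_map[i, j] = gray[i*block:(i+1)*block, j*block:(j+1)*block].mean()
    return feature_map

# Example: a dummy 64x64 RGB "information medium image".
dummy_image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(generate_feature_map(dummy_image).shape)  # (8, 8)
```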
  • In step S13, the general inference unit 7 extracts objects from the feature map.
  • In step S13, the content of each object in the image is inferred from the feature map by using the comprehensive inference model DB1 stored in the table storage unit 11, and metadata is generated.
  • The feature map, and the feature amounts constituting it, are input to the individual inference models DB11 to DB15 constituting the comprehensive inference model DB1 in pixel units or block-area units.
  • For each of the individual inference models DB11 to DB15, a learning model is built in advance so that it can discriminate its own element, such as the content of the object ("car", "tree", "fence", etc.), its color, or a character string. In step S13, these learning models are referred to and the contents of the objects are extracted.
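To make the flow of step S13 concrete, the sketch below runs several stand-in individual inference models over a feature map and collects their feature labels as metadata, applying a probability threshold of the kind mentioned elsewhere in this specification (for example 80%). The stub classifiers and their outputs are assumptions; they are not the trained models DB11 to DB15.

```python
import numpy as np

# Stand-in individual inference models: each returns (feature label, probability).
def infer_object(feature_map):   return "car", 0.97
def infer_color(feature_map):    return "red", 0.88
def infer_text(feature_map):     return "lineup A", 0.91

INDIVIDUAL_MODELS = [infer_object, infer_color, infer_text]

def comprehensive_inference(feature_map, threshold=0.80):
    """Collect feature labels from each individual model above the probability threshold."""
    labels = []
    for model in INDIVIDUAL_MODELS:
        label, prob = model(feature_map)
        if prob >= threshold:
            labels.append(label)
    return labels

feature_map = np.zeros((8, 8))
print(comprehensive_inference(feature_map))  # ['car', 'red', 'lineup A']
```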
  • the process proceeds to step S14, and the associating unit 8 refers to the associative model set TB2 and derives the associative word from the feature label as the overall inference result output in step S13.
  • The associative model set TB2 stores feature labels and associative words in association with each other. Therefore, by inputting a feature label, one or more associative words linked to that feature label can easily be extracted. Through these associative words, the metadata can be expanded beyond the content of the objects extractable from the image to include the atmosphere, sensations, imagery, and emotions that the image can evoke.
  • For example, the feature label "hot spring" is linked to associative words such as "warm", "sulfur", "alkaline", "steam", and "recreation". Therefore, when "hot spring" is extracted as a feature label as a result of the comprehensive inference, the imagery can be expanded beyond the feature label itself by these related associative words "warm", "sulfur", "alkaline", "steam", "recreation", and so on. In deriving the associative words in step S14, information posted on the Internet may be used as necessary. In such a case, words that appear with high frequency in the results of a search performed through a search engine using the feature label as a search term may be taken in as associative words.
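The derivation of associative words in step S14 can be pictured as a table lookup. In the sketch below, an in-memory dictionary stands in for the associative model set TB2; the particular word lists are assumptions based on the examples given in this description.

```python
# Stand-in for the associative model set TB2: feature label -> associative words.
ASSOCIATIVE_MODEL_SET = {
    "hot spring": ["warm", "sulfur", "alkaline", "steam", "recreation"],
    "car": ["vehicle", "heavy"],
}

def derive_associative_words(feature_labels):
    """Collect the associative words linked to each feature label (step S14)."""
    words = []
    for label in feature_labels:
        words.extend(ASSOCIATIVE_MODEL_SET.get(label, []))
    return words

print(derive_associative_words(["hot spring"]))
# ['warm', 'sulfur', 'alkaline', 'steam', 'recreation']
```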
  • In step S15, metadata is generated for the image acquired in step S11.
  • This metadata includes the associative word derived in step S14 in addition to the character string of the feature label as the individual inference result described above.
  • The metadata extraction device 1 may associate the metadata consisting of such feature labels and associative words with the image and then store the metadata in the metadata storage unit 27, or it may store the image with the associated metadata in the image data storage unit 26. At this time, the image may be stored as the image-metadata associative label TB5, in which the image is associated with its metadata (step S16).
  • FIG. 9 schematically shows the processing operation from extraction to storage of such metadata.
  • For the newly acquired images A1, A2, and so on, associative words are linked.
  • An image-metadata associative label TB5, in which each image A1, A2, ... is associated with its feature labels and the associative words derived from them, is generated and stored in the metadata storage unit 27.
  • As a result, each image A1, A2, ... is associated with metadata consisting of feature labels and associative words, and conversely, when such metadata is received as input, the images A1, A2, ... associated with it can be specified.
  • The process then proceeds to step S17, and the entity table TB3 and the corpus table TB4 are created from the generated metadata (feature labels and associative words).
  • As for the entity table, a table consisting of entities and entity values linked to each other is created from the feature labels and associative words as described above. At this time, the corresponding images A1, A2, ... are associated with each combination of an entity and its entity values. As a result, when an image is retrieved after the fact, the retrieval can be realized via the entity, which enhances the convenience of the search.
  • As for the corpus table, a corpus table TB4 linked to the intents and the conversational sentence information extracted by the image identification system 2, as described later, may be generated.
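The creation of the entity table TB3 in step S17 might be sketched as follows. The in-memory structures stand in for the stored tables, and the concrete labels and image identifiers are assumptions drawn from the examples in this description.

```python
# Metadata produced in steps S13-S15: per image, feature labels and associative words.
image_metadata = {
    "A1": {"feature_labels": ["car", "lineup A"], "associative_words": ["vehicle", "heavy"]},
    "A3": {"feature_labels": ["Ishiyakiimo"], "associative_words": ["sweet", "delicious"]},
}

def build_entity_table(metadata: dict) -> dict:
    """Build a minimal entity table: entity -> {entity values} with the images linked."""
    table = {}
    for image_id, md in metadata.items():
        # Here every feature label is treated as an entity, and all labels and words
        # attached to the same image become its entity values (an assumption).
        for entity in md["feature_labels"]:
            values = set(md["feature_labels"]) | set(md["associative_words"])
            entry = table.setdefault(entity, {"entity_values": set(), "images": set()})
            entry["entity_values"] |= values
            entry["images"].add(image_id)
    return table

entity_table = build_entity_table(image_metadata)
print(sorted(entity_table["car"]["images"]))  # ['A1']
```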
  • A method of specifying an image corresponding to the content of conversational sentence information, based on the entity table TB3, the corpus table TB4, and the image-metadata associative label TB5 sequentially created through steps S11 to S17 described above, will now be explained in detail with reference to FIG. 8.
  • The image corresponding to the content of the conversational sentence information is specified by the image specifying system 2 through steps S21 to S26.
  • In step S21, the reception unit 29 recognizes a conversational sentence uttered by voice.
  • The recognition of this conversational sentence may also be performed on a conversational sentence described in manually input text data, instead of being acquired from voice.
  • In step S22, the reception unit 29 converts the conversational sentence acquired through voice recognition into text data.
  • A known conversion method may be used.
  • In step S23, morphological analysis and parsing are performed on the conversational sentence converted into text data.
  • The process then proceeds to step S24, and one or more independent words are acquired from the text data subjected to morphological analysis and parsing in step S23.
  • The intent determination means 31 determines an intent having an action name that contains a word identical or similar to one or more of the acquired independent words. For example, when the conversational sentence is "Is there an image of lineup A?", the two independent words "lineup A" and "image" are acquired from the conversational sentence by morphological analysis, and each independent word is used as a key to search the corpus table TB4 in the intent storage unit 12; an intent having "image search" as its action name is thereby determined via a partial match on "lineup A".
  • In step S24, the entity contained in the conversational sentence is also acquired.
  • The acquired entity may further be converted into a parameter. For example, when the conversational sentence is "Is there an image of lineup A?", the two independent words "lineup A" and "image" are acquired from the conversational sentence by morphological analysis, and the entity is "lineup A".
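A minimal sketch of steps S23 and S24 follows. Simple substring matching is used in place of a real morphological analyzer, and the corpus table contents and known entities are assumptions made for the example.

```python
# Stand-in corpus table TB4: action name -> keywords that may appear in a sentence.
CORPUS_TABLE = {
    "image search": ["image", "picture"],
}

# Stand-in entities known to the system (drawn from the entity table).
KNOWN_ENTITIES = ["lineup A", "Ishiyakiimo"]

def determine_intent_and_entity(sentence: str):
    """Pick the intent whose keywords appear in the sentence, and the matching entities."""
    intent = None
    for action_name, keywords in CORPUS_TABLE.items():
        if any(k in sentence for k in keywords):
            intent = action_name
            break
    entities = [e for e in KNOWN_ENTITIES if e in sentence]
    return intent, entities

print(determine_intent_and_entity("Is there an image of lineup A?"))
# ('image search', ['lineup A'])
```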
  • In step S25, image identification processing is performed. This image identification processing is performed on the basis of the intent determined in step S24 and the extracted entity.
  • In the entity table TB3 stored in the entity storage unit 15 described above, an image is associated with each combination of an entity and an entity value. Therefore, the entity extracted from the conversational sentence in step S24 is compared with the entities and entity values in the entity table TB3, and the image corresponding to the matching entity or entity value is specified.
  • In the above example, the image A1 associated with "lineup A" as an entity value can thereby be identified.
  • Likewise, if "Ishiyakiimo" is extracted as an entity, the image A3 associated with "Ishiyakiimo" can be specified by comparing it with the entities and entity values in the entity table TB3.
  • the wording quoted in the entity table TB3 may be either an entity or an entity value.
  • the image metadata associative label TB5 may be referred to as an alternative to the entity table TB3.
  • the image metadata associative label TB5 also stores an image associated with the feature label and the associative word. Therefore, the feature label and the associative word corresponding to the entity extracted from the conversational sentence can be extracted in step S24, and the image associated with the extracted feature label and the associative word can be specified.
  • In specifying the image in step S25, the intent determined in step S24 may be further utilized.
  • the intent is associated with conversational sentence information, an entity, and an entity value in the corpus table TB4 stored in the corpus storage unit 14. Then, this entity and the entity value are associated with the image in the entity table TB3. Therefore, by referring to the corpus table TB4 and the entity table TB3 based on the extracted intent, it is possible to realize more accurate image identification.
  • In step S25, in the process of specifying the image, the image may also be searched for via an API.
  • In that case, the search result acquisition means 37 may generate API information including the extracted intent, entity, and the like, and perform an image search in the storage unit 1 (database) on the basis of the generated API information. That is, an image search may be performed on the basis of API information reflecting the parameters (the intent and the entity).
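The image identification of step S25 can be sketched as a lookup of the extracted entity against the entity table, gated by the determined intent. The data below reuses the assumed structures of the earlier sketches and is illustrative only.

```python
# Stand-in entity table TB3: entity value -> images associated with it.
ENTITY_VALUE_TO_IMAGES = {
    "lineup A": ["A1"],
    "Ishiyakiimo": ["A3"],
}

def identify_images(intent: str, entities: list) -> list:
    """Return the images linked to the extracted entities (only for an image-search intent)."""
    if intent != "image search":
        return []
    images = []
    for entity in entities:
        images.extend(ENTITY_VALUE_TO_IMAGES.get(entity, []))
    return images

print(identify_images("image search", ["lineup A"]))  # ['A1']
```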
  • After the image has been specified, the process proceeds to step S26, and the specified image is displayed.
  • When displaying the image to the user of the user terminal 10, the image is transmitted from the image specifying system 2 to the user terminal 10 via the public communication network 50 and displayed via the user terminal 10. Alternatively, the image may be displayed directly by the image specifying system 2; in such a case, the image is displayed via the output unit 49.
  • In this way, although imaged information media come in a wide variety of types, it is possible to generate metadata that makes an ex post facto search of the images of these information media more convenient and to associate that metadata with the images.
  • In particular, when a colloquial phrase such as "Is there an image of lineup A?" is acquired by voice, the image of the appropriate information medium corresponding to the received conversational sentence can be extracted with high accuracy, and metadata that makes this possible can be generated and associated.
  • Moreover, since this metadata includes associative words, it contains not only the keywords of the feature labels themselves but also various words that can be associated with them. Therefore, at the time of specifying an image, even when such an associative word is contained in the conversational sentence, the image can be specified from it.
  • the present invention is not limited to the above-described embodiment.
  • the general inference model DB1 shown in FIG. 3 may further include an individual inference model DB16 in addition to the individual inference models DB11 to DB15.
  • the individual inference model DB 16 is a database for inferring the type of information medium, and for example, the individual inference model DB 16 infers the type through the shape of the information medium.
  • the individual inference model DB 16 determines whether the information medium is, for example, a pamphlet, a catalog, financial statements, an attendance record, or an X-ray photograph.
  • The comprehensive inference result regarding the type of such an information medium is also turned into a feature label and likewise converted into an entity and entity values, so that the convenience of ex post facto searches can be enhanced.
  • 1 Metadata extraction device
  • 2 Image identification system
  • 3 Execution unit
  • 4 Auxiliary storage unit
  • 5 Image acquisition unit
  • 6 Feature map generation unit
  • 7 General inference unit
  • 8 Linking unit
  • 9 Extraction unit
  • 10 User terminal
  • 11 Table storage unit
  • 12 Intent storage unit
  • 13 API information storage unit
  • 14 Corpus storage unit
  • 15 Entity storage unit
  • 18 Day conversion information storage unit
  • 19 Storage unit
  • 20 Central control unit
  • 21 Conversational sentence reception means
  • 26 Image data storage unit
  • 27 Metadata storage unit
  • 29 Reception unit
  • 30 Parameterization means
  • 31 Intent determination means
  • 32 Conversational sentence information determination means
  • 33 Entity acquisition unit
  • 34 Parameter acquisition unit
  • 35 API information acquisition means
  • 36 Inquiry information configuration unit
  • 37 Search result acquisition means
  • 39 Processing unit
  • 41 Search result output means
  • 49 Output unit
  • 50 Public communication network
  • 100 Metadata extraction system
  • 211 Voice reception means
  • 212 Voice recognition means
  • 341 Judgment means
  • 342 Day information acquisition means
  • 343 Entity name acquisition means
  • 344 Translation item name acquisition means

Abstract

[Problem] To extract metadata from information included in an image. [Solution] A metadata extraction program for extracting metadata from information included in an image of an information medium, the metadata extraction program being characterized by causing a computer to execute a feature map generation step for generating a feature map obtained by extracting features from the image of the information medium, and an inference step for referring to one or more individual inference models each associating the feature map with a feature label for each element, and extracting the feature label for each element as metadata from the feature map generated in the feature map generation step.

Description

Metadata extraction program
 The present invention relates to a metadata extraction program suitable for extracting metadata from information contained in an image of an information medium and further identifying an appropriate image corresponding to a received conversational sentence.
 In recent years, information media such as pamphlets, catalogs, company guides and advertisements, as well as various documents and explanatory materials, have all come to be provided, shared, and stored after being converted into image data. If the number of images of such information media increases rapidly with the future shift to paperless work, their management will become complicated, and it will also take a great deal of labor for users to find the image of the information medium they actually want.
 In order to solve such a problem, metadata is associated with each image of each information medium. The metadata referred to here is a description of the attribute values of predetermined attributes for the image of the information medium. For example, in the case of an image of a hot-spring inn pamphlet, in addition to the character information that can be read from the text, such as the inn name, room rates, the inn's address and contact information, and the check-in and check-out times displayed in the image, there is incidental information that can be obtained from the photographs: the content of the open-air bath photograph (that is, information indicating that the photograph shows an "open-air bath"), the comfort and warmth conveyed by the steam rising from the open-air bath in the photograph, the colors and beauty created by the harmony with the surrounding scenery, and even the menu of dishes that can be extracted from the photograph of a meal. A textualized version of this incidental information becomes the metadata associated with the image of the hot-spring pamphlet as an information medium.
 Attaching metadata to the image of each information medium brings various benefits, particularly for data management. For example, associating metadata with an image supplements the incidental information about the information medium, so the overall quality of the data can be improved. Associating such metadata with an image also makes it easy to analyze, search, and extract the data. In particular, by searching via metadata keywords, the information media linked to those keywords can be browsed easily, which greatly enhances convenience.
JP-A-2017-68859
 However, the work of associating metadata with the images of such information media is very laborious. For example, character information in an image can easily be converted into metadata by using a well-known technique such as OCR, but photographs are difficult to convert into metadata immediately. In particular, in order to turn the contents and colors shown in a photograph, and the atmosphere and sensations it evokes, into metadata, there is no choice but to input text manually after a step of human visual judgment. Metadata that is highly useful in business, in particular, often requires definition by a human, which makes the work time-consuming.
 In addition, since imaged information media come in a wide variety of types, it is necessary to generate and associate metadata that makes an ex post facto search of the images of these information media more convenient. In particular, when a colloquial phrase such as "Is there an image of lineup A?" is acquired by voice, metadata corresponding to the appropriate image must have been associated with that image so that the image of the appropriate information medium corresponding to the received conversational sentence can be extracted immediately.
 Although methods for generating metadata from an image have been proposed (see, for example, Patent Document 1), they make no particular mention of generating the kind of metadata that would otherwise have to rely on human visual judgment, nor of generating and associating metadata that makes an ex post facto search of images of information media more convenient with respect to a received conversational sentence.
 The present invention has therefore been devised in view of the above-mentioned problems, and its object is to provide a metadata extraction program and system that, in extracting metadata from information contained in an image, can extract metadata even from images of information media that would otherwise have to rely on human visual judgment, and that can generate and associate metadata that makes ex post facto searches more convenient.
 The metadata extraction program according to the present invention extracts metadata from information contained in an image of an information medium, and is characterized by causing a computer to execute a feature map generation step of generating a feature map in which features are extracted from the image of the information medium, and an inference step of referring to one or more individual inference models in which the feature map and a feature label for each element are associated with each other, and extracting the feature label for each element as metadata from the feature map generated in the feature map generation step.
 The image specifying program according to the present invention is characterized by comprising a conversational sentence reception step of receiving a conversational sentence, an entity extraction step of extracting one or more entities contained in the one or more conversational sentences received in the conversational sentence reception step, and an image specifying step of referring to an entity table in which entities consisting of feature labels and derived associative words are linked to images in a one-to-one or one-to-many relationship, and specifying the image linked to the one or more entities extracted in the entity extraction step.
 The metadata extraction system according to the present invention is a metadata extraction system for extracting metadata from information contained in an image of an information medium, and is characterized by comprising a feature map generation means for generating a feature map in which features are extracted from the image of the information medium, and an inference means for referring to one or more individual inference models in which the feature map and a feature label for each element are associated with each other, and extracting the feature label for each element as metadata from the feature map generated by the feature map generation means.
 According to the present invention having the above-described configuration, the otherwise very laborious work of associating metadata with the image of each information medium can be performed automatically. Photographs can be converted into metadata without human visual judgment or manual definition, reducing the labor required.
 Further, according to the present invention, although imaged information media come in a wide variety of types, it is possible to generate metadata that makes an ex post facto search of the images of these information media more convenient and to associate it with the images. In particular, when a colloquial conversational sentence is acquired by voice, the image of the appropriate information medium corresponding to the received sentence can be extracted with high accuracy, and metadata that makes this possible can be generated and associated.
 Moreover, since this metadata includes associative words, it contains not only the keywords of the feature labels themselves but also various words that can be associated with them. Therefore, at the time of specifying an image, even when such an associative word is contained in the conversational sentence, the image can be specified from it. It is thus also possible to generate metadata that reflects sensory impressions such as "hot", "heavy", or "fast" that a person viewing the image would receive.
FIG. 1 is a diagram showing an overall configuration of a metadata extraction system.
FIG. 2 is a block configuration diagram of a metadata extraction device.
FIG. 3 is a diagram showing a comprehensive inference model according to the present embodiment.
FIG. 4 is a diagram showing an example of an image for which metadata is to be extracted.
FIG. 5 is a diagram showing an example of tables required for performing various inferences.
FIG. 6 is a diagram showing an associative word inference model according to the present embodiment.
FIG. 7 is a block configuration diagram of the image identification system.
FIG. 8 is a flowchart showing the processing operation of the metadata extraction system.
FIG. 9 is a diagram schematically showing the processing operation from the extraction of metadata to its storage.
FIG. 10 is a diagram for explaining the operation of the image specifying system.
FIG. 11 is a diagram showing another example of the comprehensive inference model according to the present embodiment.
 Hereinafter, the metadata extraction system to which the present invention is applied will be described in detail with reference to the drawings.
 Overall Configuration
 FIG. 1 shows the overall configuration of the metadata extraction system 100. The metadata extraction system 100 includes a user terminal 10 that can access the public communication network 50, a metadata extraction device 1 connected to the public communication network 50, and an image identification system 2.
 The public communication network 50 is an Internet communication network or the like, but when operating in a narrow area such as in the company, it may be configured by a LAN (Local Area Network). Further, the public communication network 50 may be configured by a so-called optical fiber communication network. Further, the public communication network 50 is not limited to the wired communication network, and may be realized by a wireless communication network.
 The user terminal 10 may be any electronic device that can display an image of an information medium or capture such an image, such as a personal computer (PC), a smartphone, a tablet terminal, a mobile phone, a wearable terminal, or a digital camera. The user terminal 10 captures an image A1 of an information medium with a digital camera mounted on it, or acquires such an image from a server or the like (not shown) via the public communication network 50. The information medium referred to here means any medium presenting information, such as a pamphlet, a catalog, a company guide, an advertisement, or various documents and explanatory materials. The image A1 is such an information medium rendered as image data. The user terminal 10 transmits the image A1 thus acquired to the metadata extraction device 1 via the public communication network 50.
 The metadata extraction device 1 is a device that extracts metadata for the image A1 received from the user terminal 10 via the public communication network 50, and is composed of, for example, a PC, a server, or the like. In the metadata extraction device 1, metadata is associated with each image A1. The metadata referred to here is a description of the attribute values of predetermined attributes for the image of the information medium. For example, for an image of a catalog of a car on sale, in addition to the character information that can be read from character strings displayed in the image, such as the car's seller, brand, price, fuel consumption, and performance, there is incidental information obtained from the photographs: the content of the photograph of the car (that is, information indicating that the photograph shows a "car"), the color and shape of the car, the atmosphere it gives off, and the impression that it looks fast. A textualized version of this incidental information becomes the metadata associated with the image A1 of the car as an information medium. Furthermore, this metadata may include information indicating what kind of information medium it is in the first place, for example information indicating whether the medium is a pamphlet, an X-ray photograph, a company guide, a catalog, an advertising leaflet, financial statements, or the like.
 Further, for an image of a hot-spring inn pamphlet, in addition to the character information that can be read from the text, such as the inn name, room rates, the inn's address and contact information, and the check-in and check-out times displayed in the image, there is incidental information obtained from the photographs: the content of the open-air bath photograph (that is, information indicating that the photograph shows an "open-air bath"), the comfort and warmth conveyed by the steam in that photograph, the colors and beauty created by the harmony with the surrounding scenery, and the menu of dishes that can be extracted from the photograph of a meal. A textualized version of this incidental information becomes the metadata associated with the image A1 of the hot-spring pamphlet as an information medium.
 The metadata extraction device 1 collects an image A1 related to an information medium via the user terminal 10, and sequentially executes a processing operation of associating the metadata for each of the images A1. The metadata extraction device 1 builds a database in which each metadata is associated with the image A1 by storing each of the images A1 to which the metadata is associated.
 The image identification system 2 is a system that specifies an image desired by the user by referring to the database constructed in the metadata extraction device 1, in which the images A1 and their metadata are associated. The image specifying system 2 receives a conversational sentence directly from the user or from the user terminal 10. The image specifying system 2 then accesses the metadata extraction device 1, identifies the appropriate image corresponding to the received conversational sentence, and either displays it directly to the user or transmits it to the user terminal 10. A user operating the user terminal 10 can thus obtain the desired image sent from the image specifying system 2 by specifying by voice the image he or she wants to view.
 Hereinafter, the configurations of the metadata extraction device 1 constituting the metadata extraction system 100 and the image identification system 2 will be described.
 Configuration of Metadata Extraction Device
 FIG. 2 shows a block configuration of the metadata extraction device 1. The metadata extraction device 1 includes a central control unit 20, and an execution unit 3 and an auxiliary storage unit 4 each connected to the central control unit 20.
 The central control unit 20 is, for example, a CPU (Central Processing Unit), and executes processing by calling a program stored in the execution unit 3. The central control unit 20 controls each component mounted in the metadata extraction device 1. The execution unit 3 includes an image acquisition unit 5, a feature map generation unit 6, a general inference unit 7, a linking unit 8, and an extraction unit 9. When the execution unit 3 is configured by a RAM (Random Access Memory), programs corresponding to the configurations of the image acquisition unit 5 to the extraction unit 9 are stored in it.
 The image acquisition unit 5 acquires the image A1 received from the user terminal 10 via the public communication network 50.
 The feature map generation unit 6 generates a feature map 30, described later, which is data obtained by extracting features from the image A1. The feature map 30 is composed in pixel units or, for example, in block-area units, each block being an aggregate of a plurality of pixels, and is obtained by reflecting the feature amounts of the analyzed image onto a two-dimensional image on the basis of well-known image analysis, using deep learning techniques as necessary. As a result, it is possible to obtain, for example, a feature map 30 in which the characteristic portions of an object to be discriminated in a photographic image are made to stand out.
 The general inference unit 7 uses one general inference model DB1 composed of one or more individual inference models DB11 to DB16, which will be described later, for the feature map 30 generated in the feature map generation unit 6. Using the general inference model DB1, the general inference unit 7 infers at least the content of at least one object in the image A1 and generates metadata.
 The associating unit 8 associates one or more associative words with each feature label in the overall inference result, which is the result of inference by the overall inference unit 7. An associative word is a word associated with the word of the feature label: for example, if the feature label is "car", then "vehicle", "heavy", and so on are associative words, and if the feature label is "hot spring", then "warm", "sulfur", "alkaline", "steam", "recreation", and so on are associative words. The set of one or more associative words linked to each such feature label constitutes the associative model set TB2, which is stored in the auxiliary storage unit 4.
 The auxiliary storage unit 4 is, for example, an SSD (Solid State Drive) or an HDD (Hard Disk Drive), and includes a table storage unit 11, a corpus storage unit 14, an entity storage unit 15, an image data storage unit 26, and a metadata storage unit 27.
 One or more tables are stored in the table storage unit 11. The tables stored in the table storage unit 11 include the comprehensive inference model DB1, the comprehensive inference result table TB1, and the associative model set TB2 held in the auxiliary storage unit 4.
 FIG. 3 shows an example of the comprehensive inference model DB. The comprehensive inference model DB1 is composed of one or more individual inference models DB11 to DB15. The input of each individual inference model DB11 to DB15 is a common feature map, and the output is an individual inference result. Each individual inference result is composed of text data consisting of words indicating the inferred result. The input data and the output data are related to each other through degrees of association. A degree of association indicates the degree of connection between the input data and the output data; for example, it can be judged that the higher the degree of association, the stronger the connection between the data. The degree of association may be expressed in three or more values or stages, such as percentages, or may be expressed in two values or two stages.
 The individual inference models DB11 to DB15 are generated by machine learning using a plurality of feature maps for learning and a plurality of individual inference results for learning as a learning data set. As the machine learning, for example, a convolutional neural network (CNN) is used, and, for example, deep learning is applied.
 Each of the individual inference models DB11 to DB15 is built from a learning data set for inferring, from a feature map, the content, color, character strings, and the like of the objects shown in the image of an information medium. For example, for the image of an information medium consisting of a car pamphlet as shown in FIG. 4, the individual inference models DB11, DB12, and DB14 are models for inferring the contents of the objects shown in the image, such as "car", "tree", and "fence", the individual inference model DB13 is a model for inferring the colors of the objects shown in the image, and the individual inference model DB15 is a model for inferring the character strings shown in the image. In this way, an individual inference model is provided independently for each element, such as the content of an object, the color of an object, or a character string shown in the image.
 The individual inference result may be expressed by an inference probability, that is, the probability with which each search solution is inferred; taking the individual inference model DB11 as an example, car: 96.999%, tree: 2.01%, and so on. The comprehensive inference model DB1 may take one feature map 30 as input and output, as the comprehensive inference result, only search solutions whose inference probability is at or above a threshold of, for example, 80%. When no search solution is at or above the threshold and only search solutions below the threshold exist, a plurality of such search solutions may be output. For example, DB13 and DB15 output a plurality of search solutions as the comprehensive inference result.
 In the comprehensive inference model DB1, the individual inference results, represented as text data and obtained as the search solutions of the individual inference models DB11 to DB15, together form the comprehensive inference result. In the example of FIG. 3, "car", "tree", "red, green", "fence", and "company A, lineup A" are obtained as the search solutions of the individual inference models DB11 to DB15 and are output as the comprehensive inference result. The comprehensive inference result may be tabulated as shown in the comprehensive inference result table TB1 and stored in the table storage unit 11. The comprehensive inference result table TB1 is a table in which the image of the information medium from which the feature map was formed and the comprehensive inference result are linked to each other. By preparing the comprehensive inference result table TB1, in which the image of each information medium is associated with its comprehensive inference result, and storing it in the table storage unit 11, the convenience of searches performed after the fact can be enhanced.
 As shown in FIG. 5, the associative model set TB2 is configured as a table in which feature labels and associative words are linked to each other. The feature labels here correspond to the text data "car", "tree", "red", "green", "fence", "company A", and "lineup A" output as search solutions in the above-mentioned comprehensive inference result.
 Associative words are words that can be associated with these feature labels. For example, if the feature label is "car", there are associative words such as "heavy" and "vehicle"; if the feature label is "tree", there are associative words such as "resources" and "nature". If "lineup A", a lineup on sale by a certain automobile manufacturer, is the feature label, "car" and "manufacturer" can be cited as associative words, and if the feature label is "red", "color" is an associative word.
 The associative model set TB2, in which associative words are linked to each such feature label in a one-to-one or one-to-many relationship, is set in advance on the system (operator) side and stored in the table storage unit 11. By preparing such an associative model set TB2 in advance, one or more associative words linked to a feature label can be output when that feature label is input.
 Incidentally, the associative model set TB2 is not limited to a binary representation of whether or not a feature label and an associative word are linked; they may instead be related to each other through degrees of association, as shown in FIG. 6.
 According to the example of FIG. 6, the input is a feature label and the output is an associative word. The feature label and the associative word are associated with each other through the degree of association. The degree of association indicates the degree of connection between the input data and the output data. For example, it can be determined that the higher the degree of association, the stronger the connection of each data. The degree of association is indicated by, for example, three or more values such as percentages or three or more stages, and may be indicated by two values or two stages. Each node of the hidden layer constituting such a degree of association may be composed of a node of a neural network.
 In such a case, the model is generated by machine learning using feature labels and associative words as a learning data set. As the machine learning, for example, a convolutional neural network is used, and, for example, deep learning is applied.
 By preparing such a neural network in advance as the associative model set TB2, it is possible to output associative words having a high degree of association with a feature label when that feature label is input.
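Where the associative model set carries degrees of association rather than simple linked/not-linked values, the associative words can be ranked and filtered. The sketch below uses assumed numeric degrees in place of a trained neural network.

```python
# Stand-in associative model with degrees of association (0.0 - 1.0, assumed values).
WEIGHTED_ASSOCIATIONS = {
    "hot spring": {"warm": 0.95, "steam": 0.90, "sulfur": 0.75, "recreation": 0.60},
}

def associative_words_ranked(feature_label: str, min_degree: float = 0.7):
    """Return associative words whose degree of association meets the threshold, strongest first."""
    pairs = WEIGHTED_ASSOCIATIONS.get(feature_label, {})
    selected = [(w, d) for w, d in pairs.items() if d >= min_degree]
    return sorted(selected, key=lambda wd: wd[1], reverse=True)

print(associative_words_ranked("hot spring"))
# [('warm', 0.95), ('steam', 0.9), ('sulfur', 0.75)]
```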
 The entity storage unit 15 stores one or more entities and entity values. An entity is one or more words associated with conversational sentence information. A word is a unit that constitutes a sentence; it may be referred to simply as a "word" or a "term", or may be considered a kind of morpheme (for example, an independent word, described later).
 In the entity storage unit 15, as in the entity table TB3 shown in FIG. 5, one or more entity values are stored in association with each of one or more entities. An entity value is a character string that embodies the entity.
 An entity usually corresponds to one or more pieces of conversational sentence information among the one or more pieces stored in the corpus storage unit 14. Therefore, the entity storage unit 15 may store, for example, one or more entities for each piece of conversational sentence information stored in the corpus storage unit 14.
 In the entity table TB3, when one feature label or associative word in the associative model set TB2 is taken as an entity, the words linked to it in the associative model set TB2 become its entity values. For example, in the associative model set TB2, the entries whose feature label or associative words include "car" are linked to "car", "lineup A", "heavy", and "vehicle". Therefore, when "car" is taken as an entity, the entity values are "car", "lineup A", "heavy", and "vehicle". When "Ishiyakiimo" is taken as an entity, the entity values are "sweet", "delicious", "crop", "breeding", "in the soil", and the like.
 エンティティテーブルTB3は、このようなエンティティとエンティティ値が互いに一対一、又は一対複数で紐付けられている。このため、このエンティティテーブルTB3を介して、エンティティ値からこれに関連するエンティティを導出することができ、またエンティティからエンティティ値を導出することも可能となる。 In the entity table TB3, such an entity and the entity value are associated with each other on a one-to-one basis or one-to-many basis. Therefore, the entity related to the entity can be derived from the entity value via the entity table TB3, and the entity value can be derived from the entity.
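A minimal sketch of this bidirectional lookup, assuming the entity table TB3 is held as an in-memory dictionary; the entries shown are the illustrative ones from the paragraph above.
# Illustrative stand-in for the entity table TB3 (entity -> entity values).
ENTITY_TABLE_TB3 = {
    "car": ["car", "lineup A", "heavy", "vehicle"],
    "stone-roasted sweet potato": ["sweet", "delicious", "crop", "selective breeding", "in the soil"],
}

def values_of(entity: str) -> list[str]:
    """Derive the entity values linked to an entity."""
    return ENTITY_TABLE_TB3.get(entity, [])

def entities_of(entity_value: str) -> list[str]:
    """Derive the entities linked to an entity value."""
    return [e for e, values in ENTITY_TABLE_TB3.items() if entity_value in values]

print(values_of("car"))          # ['car', 'lineup A', 'heavy', 'vehicle']
print(entities_of("lineup A"))   # ['car']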
When setting entities and entity values, for example, a server (not shown) may search for both the feature label and the associative word, set the character string that yields the larger number of search results as the entity, and set the other character string, which yields the smaller number of search results, as the entity value.
Further, in the entity table TB3, images are additionally linked to the entities and entity values that are associated with each other, and stored. As described above, the entities and entity values are extracted from the associative model set TB2, and the feature labels in the associative model set TB2 are linked to images in the comprehensive inference result table TB1. Through these linked correspondences, an image can therefore be associated with an entity and its entity values. As a result, as shown in FIG. 5, the relationships between the entities and entity values linked to each image A1, A2, A3, ... can be stored.
The words that constitute entities and entity values may be, for example, collocations. A collocation is a word formed by joining two or more independent words to express a particular meaning, and may also be called a compound word. Examples of collocations include "hot spring inn", which combines "hot spring" and "inn", and "lineup A", which combines "lineup" and "A"; a surname-and-given-name pair such as "Ichiro Nakamura" is also acceptable, as is any combination of two or more words.
The entity table TB3 also stores one or more pieces of day conversion information. Day conversion information is information for converting a day word into a date. A day word is a word relating to a day. A day word is usually a word associated with the entity name "date entity"; examples include "last month", "yesterday", "last week", "this year", "this month", "last year", "previous term", and "this fiscal year", but any information that can be converted into a date may be used.
Day conversion information comprises a day word and day information acquisition information. Day information acquisition information is information for acquiring day information. Day information is information about the day corresponding to a day word, and is used when constructing query information. Day information may be, for example, information indicating a date such as "April 1", or information indicating a period from a start date to an end date such as "4/1 to 4/30", but is not limited to these. Day information acquisition information is, for example, a function name or a method name, but may also be API information or the program itself, and is not limited to these.
Specifically, the day information acquisition information for the day word "last month" may be, for example, a program that acquires current time information (for example, "11:15 on May 10, 2020"; the same applies hereinafter), obtains the month preceding the month contained in that current time information (for example, "April" preceding "May"), refers to the calendar information for that preceding month, and acquires day information from the first day to the last day of that month (for example, "4/1 to 4/30").
The day information acquisition information for the day word "this year" may be, for example, API information that acquires current time information, refers to the calendar information for the year contained in that current time information (for example, "2020"), and acquires day information from the first day of that year up to the day contained in the current time information (for example, "2020/1/1 to 2020/5/10").
Further, the day information acquisition information for the day word "yesterday" may be a method that acquires current time information and acquires the day information of the day preceding the day contained in that current time information (for example, "5/9"), or the name of such a method.
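The three examples above can be summarized in a short sketch, assuming each day word maps to a function that converts the current time into day information; the day words, return formats, and the fixed "today" are illustrative assumptions.
from datetime import date, timedelta

def last_month(today: date) -> tuple[date, date]:
    """Return the first and last day of the month preceding the current month."""
    first_of_this_month = today.replace(day=1)
    last_of_prev_month = first_of_this_month - timedelta(days=1)
    return last_of_prev_month.replace(day=1), last_of_prev_month

def this_year(today: date) -> tuple[date, date]:
    """Return the period from the first day of the current year up to today."""
    return today.replace(month=1, day=1), today

def yesterday(today: date) -> date:
    """Return the day preceding the current day."""
    return today - timedelta(days=1)

DAY_CONVERSION_INFO = {"last month": last_month, "this year": this_year, "yesterday": yesterday}

today = date(2020, 5, 10)
print(DAY_CONVERSION_INFO["last month"](today))  # first and last day of April 2020
print(DAY_CONVERSION_INFO["yesterday"](today))   # May 9, 2020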
The corpus storage unit 14 stores one or more pieces of conversational sentence information. Conversational sentence information is information about a conversational sentence, and is usually an example sentence of a conversational sentence. Example sentences include "Show me a pamphlet of a hot spring inn" and "Show me a catalog of XX-brand cars", but they are not limited to these.
However, conversational sentence information may instead be a template of a conversational sentence. Templates are, for example, "Is there an image of a {car}?", "Show me a {pamphlet} of a {vehicle}", "Tell me the {information} of a {car} of a {maker}", and "Tell me the {information} of a {car} of a {company}"; the items written with "{" and "}" in a template, such as {car}, are entities, that is, variables.
Conversational sentence information is usually associated with an intent. An intent can be regarded as information for specifying a processing operation. Examples of processing operations include "image search", "pamphlet search", and "image information search", and the conversational sentence information may be stored in the corpus table TB4 in association with information specifying each of these processing operations.
That is, the corpus storage unit 14 stores, for example, one or more pieces of conversational sentence information for each of the one or more intents stored in the intent storage unit 12.
FIG. 5 shows an example of the corpus table TB4, which relates to the conversational sentence information (corpus) linked to each intent. The corpus table TB4 usually also stores one or more pieces of entity information for each stored piece of conversational sentence information. Entity information is information about each of the one or more entities associated with a piece of conversational sentence information. In addition to the entity and entity value described above, entity information includes the start position, end position, and entity name of each entity.
The start position here is the position at which the entity begins in the conversational sentence information. The start position is expressed, for example, as a value indicating which character of the character string constituting the conversational sentence is the first character of the entity (for example, "1" or "4"). Similarly, the end position is the position at which the entity ends in the conversational sentence information, expressed, for example, as a value indicating which character is the last character of the entity (for example, "2" or "5"). However, the representation of the start position and end position is not limited to these forms. The start position and end position may also be called offsets, and offsets may be expressed in bytes, without limitation.
An entity name is the name of an entity. Entity names are, for example, "object entity", "date entity", and "information entity", but the format is not limited to these as long as the information can express the attributes of the entity. An object entity is an entity relating to an object, such as {vehicle}, {car}, or {company} in Table 1. A date entity is an entity relating to a date. An information entity is an entity relating to the information being requested.
Alternatively, when the conversational sentence information is a template, the entity information may include, for example, an entity name and order information. The order information is a value indicating which of the one or more variables included in the template the entity name corresponds to. However, the structure of the entity information is not limited to this.
The corpus in the embodiment may be regarded, for example, as each piece of conversational sentence information stored in the corpus storage unit 14, or as the set of one or more pieces of conversational sentence information together with the entity information associated with each of them.
The associative model set TB2, entity table TB3, and corpus table TB4 described above may be, for example, tabular databases. In a table, for example, one or more item names are registered, and one or more values are registered for each item name. An item name may be called an attribute name, and each value corresponding to an item name may be called an attribute value. The table may be, for example, a relational database table, TSV, Excel, or CSV, but its type is not limited to these.
The image data storage unit 26 is an area for storing images acquired via the image acquisition unit 5. Each image is linked to parameters in the comprehensive inference result table TB1 and the entity table TB3, and the images are stored in the image data storage unit 26 so that a linked image can be read out immediately.
In the metadata storage unit 27, the model DB1 and the tables TB1 to TB4 stored in the auxiliary storage unit 4 are read out and referenced when the central control unit 20 causes the execution unit 3 to execute various processing operations. The model DB1 and the tables TB1 to TB4 stored in the auxiliary storage unit 4 may also be updated each time the comprehensive inference unit 7 extracts a new feature label as metadata based on an image acquired by the image acquisition unit 5, or each time the linking unit 8 derives a new associative word.
The extraction unit 9 compares an entity extracted by the image identification system 2, described later, with at least one of the feature labels and associative words, and extracts an image linked to a feature label or associative word that at least partially matches the extracted entity. In doing so, it refers to the model DB1 and the tables TB1 to TB4 described above. The extraction unit 9 transmits the extracted image to the user terminal 10, and the user terminal 10 displays the image.
Configuration of the image identification device
FIG. 7 is a block diagram of the image identification system 2. The image identification system 2 includes a storage unit 19, a reception unit 29, a processing unit 39, and an output unit 49.
The storage unit 19 includes a table storage unit 11, an intent storage unit 12, an API information storage unit 13, a corpus storage unit 14, an entity storage unit 15, and a day conversion information storage unit 18. The reception unit 29 includes a conversational sentence reception means 21. The conversational sentence reception means 21 includes a voice reception means 211 and a voice recognition means 212.
The processing unit 39 includes a parameterization means 30, an intent determination means 31, a conversational sentence information determination means 32, an entity acquisition unit 33, a parameter acquisition unit 34, an API information acquisition means 35, a query information construction unit 36, and a search result acquisition means 37. The parameter acquisition unit 34 includes a judgment means 341, a day information acquisition means 342, an entity name acquisition means 343, a translation item name acquisition means 344, a table identifier acquisition means 345, a primary key identifier acquisition means 346, and a conversion parameter acquisition means 347. The output unit 49 includes a search result output means 41.
The storage unit 19 is a database that stores various kinds of information, such as tables, intents, API information, corpora, entities, entity mapping information, PK items, and day conversion information. Tables and related information are described later; the other information is explained where relevant.
The table storage unit 11 stores tables equivalent to the comprehensive inference model DB1, the comprehensive inference result table TB1, and the associative model set TB2 stored in the table storage unit 11 of the metadata extraction device 1. If the image identification system 2 uses the table storage unit 11 of the metadata extraction device 1, this table storage unit 11 may be omitted on the image identification system 2 side.
The intent storage unit 12 stores one or more intents. An intent is information managed for each image identification process, and can be regarded as information for specifying an image identification processing operation. An intent usually has a processing operation name that specifies a business process. A processing operation name is the name of a processing operation. A processing operation is usually a business process executed via an API. However, a processing operation may also be, for example, a business process executed according to an SQL statement.
A processing operation name usually also corresponds to API information, described later. An intent may therefore be regarded as being associated with API information, for example, via the processing operation name.
The API information storage unit 13 stores one or more pieces of API information. API information is information about an API. An API is an interface for using the functions of a program. An API is software such as a function, a method, or an execution module. The API is, for example, a Web API, but other APIs may be used. A Web API is an API built using a Web communication protocol such as HTTP or HTTPS. Since APIs such as Web APIs are known technology, detailed description is omitted.
API information is information associated with an intent. As described above, API information corresponds to an intent, for example, via the processing operation name.
API information is usually information for searching imaged information media. However, API information may also be information for, for example, registering information or performing processing based on information.
API information has one or more pieces of parameter specification information. Parameter specification information is information that specifies a parameter. A parameter can be described as a value having a specific attribute. The value is usually a variable, and a variable may also be called an argument.
A parameter is usually information obtained by converting an entity, but it may be the entity itself. Parameters are, for example, arguments passed to an API or variables in an SQL statement.
A parameter may be composed of, for example, a pair of an attribute name and a value. Concrete examples of such pairs include "shain_code=2" and "sta_date=20190401,end_date=20190430", but the format is not limited to these.
Parameter specification information is, for example, a parameter name, which is the name of a parameter. Alternatively, parameter specification information may be, for example, an attribute name, or any other information capable of specifying a parameter.
The corpus storage unit 14 stores a table equivalent to the corpus table TB4 stored in the corpus storage unit 14 of the metadata extraction device 1. If the image identification system 2 uses the corpus storage unit 14 of the metadata extraction device 1, this corpus storage unit 14 may be omitted on the image identification system 2 side.
The entity storage unit 15 stores a table equivalent to the entity table TB3 stored in the entity storage unit 15 of the metadata extraction device 1. If the image identification system 2 uses the entity storage unit 15 of the metadata extraction device 1, this entity storage unit 15 may be omitted on the image identification system 2 side.
The reception unit 29 receives various kinds of information, such as conversational sentences. The reception unit 29 receives information such as conversational sentences from, for example, a terminal, but may also accept it via an input device such as a keyboard, touch panel, or microphone. Alternatively, the reception unit 29 may accept information read from a recording medium such as a disk or semiconductor memory; the manner of reception is not particularly limited.
The conversational sentence reception means 21 receives a conversational sentence. A conversational sentence is a sentence spoken by a person, and may be described as a sentence in natural language. A conversational sentence is received, for example, as speech, but may also be received as text. Speech is a voice uttered by a person. Text is a character string obtained by speech recognition of a voice uttered by a person. A character string is an array of one or more characters.
The voice reception means 211 receives the speech of a conversational sentence. The voice reception means 211 receives the speech of a conversational sentence paired with a terminal identifier from, for example, a terminal, but may also receive it via a microphone. A terminal identifier is information that identifies a terminal, such as a MAC address, an IP address, or an ID, but may be any information capable of identifying a terminal. The terminal identifier may also be a user identifier that identifies the user of the terminal. A user identifier is, for example, an e-mail address or a telephone number, but may be an ID, an address and name, or any other information capable of identifying the user.
The voice recognition means 212 performs speech recognition processing on the speech received by the voice reception means 211 and obtains a conversational sentence as a character string. Since speech recognition processing is known technology, detailed description is omitted.
The processing unit 39 performs various kinds of processing, for example the processing of the parameterization means 30, the intent determination means 31, the conversational sentence information determination means 32, the entity acquisition unit 33, the parameter acquisition unit 34, the API information acquisition means 35, the query information construction unit 36, the search result acquisition means 37, the judgment means 341, the day information acquisition means 342, the entity name acquisition means 343, the translation item name acquisition means 344, the table identifier acquisition means 345, the primary key identifier acquisition means 346, and the conversion parameter acquisition means 347. The various kinds of processing also include, for example, the various judgments described in the flowcharts.
The processing unit 39 performs the processing of the parameterization means 30, the intent determination means 31, and so on, for example, in response to the conversational sentence reception means 21 receiving a conversational sentence. When conversational sentences paired with terminal identifiers are transmitted from one or more terminals, the processing unit 39 performs the processing of the intent determination means 31 and so on for each terminal identifier.
The parameterization means 30 parameterizes one or more entities included in the one or more conversational sentences received by the conversational sentence reception means 21. The parameterization means 30 may also parameterize the entities corresponding to the conversational sentence information determined by the conversational sentence information determination means 32.
Specifically, the parameterization means 30 parameterizes the entities, for example the independent words, included in a conversational sentence input as speech. For example, comparing the conversational sentence "Is there an image of lineup A?" with the conversational sentence "Do you have any images about lineup A?", the two sentences differ only in their particles; the content words are identical. Despite this, conventional search processing did not always recognize them as having the same meaning and sometimes treated them as conversational sentences with different meanings. The parameterization means 30 therefore parameterizes the independent words, that is, the entities, "lineup A" and "image" contained in these conversational sentences.
The intent determination means 31 determines the intent corresponding to the conversational sentence received by the conversational sentence reception means 21.
Specifically, the intent determination means 31 first obtains, for example, the text corresponding to the conversational sentence received by the conversational sentence reception means 21. As described above, the text is, for example, the result of speech recognition of the received conversational sentence, but may also be the received conversational sentence itself.
That is, when a conversational sentence is received as speech, the intent determination means 31 performs speech recognition on the conversational sentence and obtains the text. When a conversational sentence is received as text, the intent determination means 31 simply obtains that text.
Next, the intent determination means 31 obtains one or more independent words from the obtained text by, for example, performing morphological analysis and syntactic analysis on it. Since morphological analysis is known technology, detailed description is omitted.
The intent determination means 31 then determines an intent having a processing operation name that contains a word identical or similar to one of the obtained independent words.
Specifically, for example, a synonym dictionary is stored in the storage unit 19. A synonym dictionary is a dictionary of synonyms. In the synonym dictionary, for each processing operation name constituting the intents stored in the intent storage unit 12, the words contained in that processing operation name and one or more synonyms of each word are registered.
For example, when the conversational sentence reception means 21 receives the conversational sentence "Tell me the sales price information for cars from maker XX", the intent determination means 31 obtains one or more independent words such as "maker XX", "car", and "sales price information" from the sentence, searches the intent storage unit 12 using each independent word as a key, and judges whether there is an intent having a processing operation name that matches the independent word. The match is, for example, an exact match, but may also be a partial match. If there is an intent specifying a processing operation that contains a word matching the independent word, the intent determination means 31 determines that intent. In this example, against Table 1, the intent "image information search" contains the word "maker", which partially matches the independent word "maker XX", the word "car", which matches "car", and the word "information", which partially matches "sales price information", so this intent is determined.
If there is no intent having a processing operation name containing a word that matches the independent word, the intent determination means 31 obtains, for example, one of the one or more synonyms corresponding to that independent word from the synonym dictionary, searches the intent storage unit 12 using that synonym as a key, and judges whether there is an intent having a processing operation name containing a word that matches the synonym. If there is such an intent, the intent determination means 31 determines it. If not, the intent determination means 31 performs the same processing for the other synonyms to determine an intent. If no such intent exists for any synonym, the intent determination means 31 may output an indication that no intent could be determined.
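A minimal sketch of this intent determination with a synonym fallback; the intents, the synonym dictionary, and the partial-match rule on the words of the processing operation name are illustrative assumptions.
INTENTS = ["image search", "pamphlet search", "image information search"]
SYNONYMS = {"picture": ["image"], "catalog": ["pamphlet"]}

def partial_match(word: str, operation_name: str) -> bool:
    """True if any word of the processing operation name partially matches the given word."""
    return any(token in word or word in token for token in operation_name.split())

def determine_intent(independent_words: list[str]) -> str | None:
    for word in independent_words:
        # Try the independent word itself first, then its registered synonyms.
        for candidate in [word] + SYNONYMS.get(word, []):
            for intent in INTENTS:
                if partial_match(candidate, intent):
                    return intent
    return None  # no intent could be determined

print(determine_intent(["maker XX", "car", "sales price information"]))
# 'image information search' (via the partial match on 'information')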
The conversational sentence information determination means 32 searches the corpus storage unit 14 using the intent determined by the intent determination means 31 as a key, and determines, from the one or more pieces of conversational sentence information corresponding to that intent, the piece of conversational sentence information that most closely approximates the conversational sentence received by the conversational sentence reception means 21.
The conversational sentence information that most closely approximates the conversational sentence is, for example, the piece of conversational sentence information with the highest similarity to the conversational sentence. That is, the conversational sentence information determination means 32, for example, calculates the similarity between the received conversational sentence and each piece of conversational sentence information corresponding to the determined intent, and determines the one with the maximum similarity.
Alternatively, the conversational sentence information determination means 32 may, for example, search for a conversation template that matches a template in which the positions of the nouns in the received conversational sentence are turned into variables. That is, the corpus storage unit 14 stores templates in which one or more entity names are variables, and the conversational sentence information determination means 32 obtains the positions of the one or more entity names in the received conversational sentence and determines, as the conversational sentence information, the template corresponding to the positions of the obtained entity names. The position of an entity name in a conversational sentence is information indicating the ordinal position of that entity name within a template having one or more entity names.
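A minimal sketch of determining the conversational sentence information with the maximum similarity to the received conversational sentence; the word-overlap (Jaccard) similarity used here is an illustrative assumption, and any similarity measure could be substituted.
def similarity(sentence_a: str, sentence_b: str) -> float:
    """Word-overlap similarity between two sentences."""
    words_a, words_b = set(sentence_a.split()), set(sentence_b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def most_similar(received: str, candidates: list[str]) -> str:
    """Return the candidate conversational sentence information with the maximum similarity."""
    return max(candidates, key=lambda candidate: similarity(received, candidate))

corpus = ["show me a pamphlet of a hot spring inn",
          "is there an image of lineup A",
          "tell me the sales price information of a car"]
print(most_similar("do you have an image of lineup A", corpus))
# 'is there an image of lineup A'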
The entity acquisition unit 33 acquires one or more entities that correspond to each of the one or more entities associated with the conversational sentence information determined by the conversational sentence information determination means 32, and that are words contained in the conversational sentence received by the conversational sentence reception means 21.
For example, for each of the one or more entities associated with the determined conversational sentence information, the entity acquisition unit 33 obtains the start position and end position of that entity from the corpus storage unit 14, and obtains from the received conversational sentence the word specified by that start position and end position.
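A minimal sketch of this position-based extraction, assuming the start and end positions are 1-indexed, inclusive character offsets, which is one of the representations allowed above; the sentence and offsets are illustrative.
def extract_entities(sentence: str, positions: list[tuple[int, int]]) -> list[str]:
    """Return the words specified by each (start position, end position) pair."""
    return [sentence[start - 1:end] for start, end in positions]

sentence = "lineup A image please"
# Illustrative entity information: "lineup A" occupies characters 1 to 8, "image" characters 10 to 14.
print(extract_entities(sentence, [(1, 8), (10, 14)]))  # ['lineup A', 'image']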
The parameter acquisition unit 34 acquires one or more parameters corresponding to each of the one or more entities acquired by the entity acquisition unit 33.
The acquired parameter is, for example, the acquired entity itself, but it may also be information obtained by converting the acquired entity. That is, for example, when a day word is included among the one or more acquired entities, the parameter acquisition unit 34 converts the day word into day information, which is the parameter.
The judgment means 341, which forms part of the parameter acquisition unit 34, judges whether a day word exists among the one or more entities acquired by the entity acquisition unit 33. Specifically, for example, one or more day words are stored in the storage unit 19, and the judgment means 341 determines, for each acquired entity, whether it matches any stored day word; if the determination result for at least one entity indicates a match, the judgment means 341 judges that a day word exists among the acquired entities.
When the judgment means 341 judges that a day word exists among the one or more acquired entities, the day information acquisition means 342 acquires the day conversion information corresponding to that day word from the day conversion information storage unit 18, and uses that day conversion information to acquire the day information, which is the parameter.
Specifically, for example, day words such as "last year" are stored in the storage unit 19, and when the conversational sentence "Show me brand B's pamphlet from last year" is received and the three entities "brand B", "last year", and "pamphlet" are acquired, the judgment means 341 judges that a day word exists among the three acquired entities because the entity "last year" matches the day word "last year". Current time information is then acquired, and day information (for example, "4/1 to 4/30") is obtained.
The day information acquisition means 342 acquires the day information acquisition information (for example, a program) corresponding to the day word "last year" from the day conversion information storage unit 18. Using that day information acquisition information, the day information acquisition means 342 acquires current time information (for example, "11:15 on May 10, 2020") from the built-in clock of the MPU (Micro Processing Unit), an NTP server, or the like, and obtains the year preceding the year contained in the current time information (for example, "2019" preceding "2020"). The day information acquisition means 342 then refers to the calendar information for that preceding year and acquires the day information from the first day to the last day of that year, "January 1, 2019 to December 31, 2019".
When the day word obtained from the conversational sentence is "this year", the day information acquisition means 342 acquires the day information acquisition information (for example, API information) corresponding to the day word "this year" from the day conversion information storage unit 18. Using that day information acquisition information, the day information acquisition means 342 acquires current time information from the built-in clock or the like, refers to the calendar information for the year contained in the current time information (for example, "2020"), and acquires the day information from the first day of that year up to the day contained in the current time information (for example, "January 1, 2020 to May 10, 2020").
When the acquired day word is "yesterday", the day information acquisition means 342 acquires the day information acquisition information (for example, a method) corresponding to the day word "yesterday" from the day conversion information storage unit 18. Using that day information acquisition information, the day information acquisition means 342 acquires current time information from the built-in clock or the like and acquires the day information of the day preceding the day contained in the current time information (for example, "5/9").
The entity name acquisition means 343 acquires, for each of the one or more entities acquired by the entity acquisition unit 33, the entity name corresponding to that entity from the entity storage unit 15.
The entity name corresponding to an entity is the entity name paired with a start position and end position that match or are similar to the position of that entity in the conversational sentence from which the entity was acquired. The entity name acquisition means 343 may acquire the entity name corresponding to each entity acquired by the entity acquisition unit 33 from the entity storage unit 15 using, for example, the entity information associated with that entity.
Specifically, for example, when the three entities "maker XX", "car", and "sales price information" are acquired from the received conversational sentence "Tell me the sales price information for cars from maker XX", the entity name acquisition means 343 uses, of the three pieces of entity information stored in the corpus storage unit 14 in association with the conversational sentence information "Tell me the sales price information for cars from maker XX", the first piece of entity information, which has the same start position and end position as "maker XX" in the received conversational sentence, to acquire the "maker entity" associated with "maker XX".
Likewise, the entity name acquisition means 343 uses, of the above three pieces of entity information, the second piece, which has the same start position and end position as "car" in the conversational sentence "Tell me the sales price information for cars from maker XX", to acquire the "car entity" associated with "car", and further uses the third piece, which has the same start position and end position as "sales price information" in that sentence, to acquire the "information entity" associated with "sales price information".
The API information acquisition means 35 acquires the API information corresponding to the intent determined by the intent determination means 31 from the API information storage unit 13.
The API information acquisition means 35 acquires, for example, API information having the processing operation name corresponding to the intent determined by the intent determination means 31 from the API information storage unit 13.
Specifically, when API information 1 having, for example, the processing operation name "image information search" and three or more pieces of parameter specification information such as "company code, Cope_code", "car code, Car_code", and "information code, Info_code" is stored in the API information storage unit 13, and the intent specified by the intent name "image information search" is obtained, the API information acquisition means 35 acquires API information 1, which has the processing operation name "image information search" possessed by that intent.
The query information construction unit 36 constructs query information using the one or more parameters acquired by the parameter acquisition unit 34 and the API information acquired by the API information acquisition means 35. Query information is information for retrieving information, and is usually executable information. Query information is, for example, a function or method with its arguments inserted, but it may also be a completed SQL statement or a combination of a URL and parameters.
The query information construction unit 36 constructs the query information, for example, by placing each parameter acquired by the parameter acquisition unit 34 at the location of the corresponding variable among the one or more variables in the API information acquired by the API information acquisition means 35.
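A minimal sketch of constructing query information in the URL-plus-parameters form mentioned above; the endpoint, parameter names, and values are hypothetical and only illustrate placing each parameter at its corresponding variable location.
API_INFO = {
    "processing_operation": "image information search",
    "endpoint": "https://example.invalid/api/image_info",  # hypothetical Web API endpoint
    "parameter_names": ["Cope_code", "Car_code", "Info_code"],
}

def build_query(api_info: dict, parameters: dict[str, str]) -> str:
    """Return query information as a URL with the acquired parameters inserted."""
    pairs = [f"{name}={parameters[name]}" for name in api_info["parameter_names"] if name in parameters]
    return api_info["endpoint"] + "?" + "&".join(pairs)

query = build_query(API_INFO, {"Cope_code": "2", "Car_code": "17", "Info_code": "price"})
print(query)  # https://example.invalid/api/image_info?Cope_code=2&Car_code=17&Info_code=price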
The search result acquisition means 37 executes the query information constructed by the query information construction unit 36 and acquires search results by searching the storage unit 19 (database) using the parameters obtained by the parameterization means 30. Alternatively, the search result acquisition means 37 may generate API information containing the parameters obtained by the parameterization means 30 and search the storage unit 19 (database) based on the generated API information. That is, the API information may be generated by writing new parameters, or by rewriting already written parameters with new ones, and the database may be searched based on the API information in which the parameters have been reflected. Information for queries such as API information and SQL, and the detailed operation of the search result acquisition means 37, are described in the specific examples and modifications.
The output unit 49 outputs various kinds of information, for example, images of the retrieved information media.
The output unit 49, for example, transmits information such as search results, which are the results of various processing performed by the processing unit 39 in response to the reception unit 29 receiving information such as a conversational sentence paired with a terminal identifier, to the terminal identified by that terminal identifier. Alternatively, for example, when the reception unit 29 accepts information such as a conversational sentence via an input device such as a touch panel or microphone, the output unit 49 may output information such as search results via an output device such as a display or speaker.
However, the output unit 49 may also print out various kinds of information with a printer, store them on a recording medium, pass them to another program, or transmit them to an external device; the manner of output is not particularly limited.
The search result output means 41 outputs the search results acquired via the search result acquisition means 37. For example, when the conversational sentence reception means 21 receives a conversational sentence from the user terminal 10, the search result output means 41 transmits the image acquired as the search result by the search result acquisition means 37 to that user terminal 10. Alternatively, for example, when the conversational sentence reception means 21 accepts a conversational sentence via an input device such as a microphone, the search result output means 41 may display the image acquired as the search result by the search result acquisition means 37 via an output device such as a display.
The storage unit 19, the table storage unit 11, the intent storage unit 12, the API information storage unit 13, the corpus storage unit 14, the entity storage unit 15, and the day conversion information storage unit 18 are preferably implemented on a non-volatile recording medium such as a hard disk or flash memory, but they can also be implemented on a volatile recording medium such as RAM.
The process by which information comes to be stored in the storage unit 19 and the like is not particularly limited. For example, information may come to be stored in the storage unit 19 and the like via a recording medium, information transmitted via a network, a communication line, or the like may come to be stored there, or information input via an input device may come to be stored there. The input device may be anything, for example, a keyboard, mouse, touch panel, or microphone.
The reception unit 29, the conversational sentence reception means 21, the voice reception means 211, and the voice recognition means 212 may or may not be regarded as including the input device. The reception unit 29 and the like can be realized by the driver software of the input device, or by the input device together with its driver software. The function of the reception unit 29 may also be implemented in the user terminal 10, in which case the conversational sentence information obtained at the user terminal 10 is sent to the image identification system 2 via the public communication network 50.
The processing unit 39, the intent determination means 31, the conversational sentence information determination means 32, the entity acquisition unit 33, the parameter acquisition unit 34, the API information acquisition means 35, the query information construction unit 36, the search result acquisition means 37, the judgment means 341, the day information acquisition means 342, the entity name acquisition means 343, the translation item name acquisition means 344, the table identifier acquisition means 345, the primary key identifier acquisition means 346, and the conversion parameter acquisition means 347 can usually be realized by a CPU (Central Processing Unit) or an MPU together with memory and the like. The processing procedures of the processing unit 39 and the like are usually realized by software, and the software is recorded on a recording medium such as ROM. However, the processing procedures may also be realized by hardware (dedicated circuits).
The output unit 49 and the search result output means 41 may or may not be regarded as including an output device such as a display or speaker. The output unit 49 and the like can be realized by the driver software of the output device, or by the output device together with its driver software.
The receiving function of the reception unit 29 and the like is usually realized by wireless or wired communication means (for example, a communication module such as a NIC (Network Interface Controller) or a modem), but may also be realized by broadcast receiving means (for example, a broadcast receiving module). The transmitting function of the output unit 49 and the like is usually realized by wireless or wired communication means, but may also be realized by broadcasting means (for example, a broadcasting module).
Next, the processing operation of the metadata extraction system 100 to which the present invention is applied will be described in detail with reference to the flowchart shown in FIG. 8.
Steps S11 to S17 show the processing operation in which the metadata extraction device 1 extracts metadata from the image of each information medium.
In step S11, the image acquisition unit 5 acquires an image of an information medium. The image of the information medium is captured via the user terminal 10 and received by the metadata extraction device 1 via the public communication network 50, or is captured directly via a camera (not shown) attached to the metadata extraction device 1.
 ステップS12では、特徴マップ生成部6により、ステップS11において取得された画像について特徴マップを生成する。この特徴マップの生成は、画素単位、又は例えば複数の画素の集合体であるブロック領域単位で、周知の画像解析に基づいて、必要に応じてディープラーニング技術を利用した、解析画像の特徴量を2次元画像上に反映させることで行う。 In step S12, the feature map generation unit 6 generates a feature map for the image acquired in step S11. This feature map is generated on a pixel-by-pixel basis, or, for example, on a block area basis, which is an aggregate of a plurality of pixels. This is done by reflecting it on a two-dimensional image.
 Next, the process proceeds to step S13, in which the comprehensive inference unit 7 extracts objects from the feature map. In step S13, the comprehensive inference model DB1 stored in the table storage unit 11 is used to infer the content of each object in the image from the feature map and to generate metadata.
 In this case, as shown in FIG. 3, the feature map, and hence the feature amounts constituting it, are input to the individual inference models DB11 to DB15 that constitute the comprehensive inference model DB1, on a per-pixel or per-block-region basis. To enable the content of the objects in the image to be discriminated, a learning model capable of discriminating each element, such as the content of an object ("car", "tree", "fence", etc.), its color, or a character string, is built in advance in each of the individual inference models DB11 to DB15. In step S13, these learning models are referred to and the content of each object is extracted. Each of the individual inference models DB11 to DB15 outputs results such as "car: 96.99%, tree: 2.01%" or "red: 50.01%, green: 37.7%"; an individual inference result is then output for one or more of these outputs on the basis of rules determined in advance by the system side or the operator side. This individual inference result becomes a feature label.
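 By way of non-limiting illustration, the sketch below shows how the outputs of the individual inference models could be reduced to feature labels; the 0.5 confidence threshold is an assumed example of the "rule determined in advance by the system side or the operator side".

    # Illustrative sketch of step S13 (individual inference results -> feature labels).
    from typing import Dict, List

    def to_feature_labels(model_outputs: Dict[str, Dict[str, float]],
                          threshold: float = 0.5) -> List[str]:
        labels = []
        for model_name, scores in model_outputs.items():
            # Keep the highest-scoring class of each individual inference model
            # if it clears the assumed threshold.
            label, score = max(scores.items(), key=lambda kv: kv[1])
            if score >= threshold:
                labels.append(label)
        return labels

    # Example mirroring the outputs quoted above.
    outputs = {
        "DB11_object": {"car": 0.9699, "tree": 0.0201},
        "DB12_color": {"red": 0.5001, "green": 0.377},
    }
    print(to_feature_labels(outputs))   # ['car', 'red']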
 Next, the process proceeds to step S14, in which the associating unit 8 refers to the associative model set TB2 and derives associative words from the feature labels output as the comprehensive inference result in step S13. As described above, the associative model set TB2 stores feature labels and associative words linked to each other. Therefore, by inputting a feature label, one or more associative words linked to it can easily be extracted. Through these associative words, the atmosphere, sensations, impressions, and emotions that the image evokes can be expanded upon, in addition to the content of the objects extractable from the image. For example, in the associative model set TB2, the feature label "hot spring" is linked to associative words such as "warm", "sulfur", "alkaline", "steam", and "recreation". Therefore, when "hot spring" is extracted as a feature label as the comprehensive inference result, the impression of the image can be expanded from the feature label to these related associative words ("warm", "sulfur", "alkaline", "steam", "recreation", and so on). In step S14, information published on the Internet may also be used as necessary when deriving the associative words. In that case, the feature label may be used as a search term in a search engine, and words that appear with high frequency in the search results may be taken in as associative words.
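 By way of non-limiting illustration, the associative model set TB2 can be pictured as a simple lookup from a feature label to its associative words; the sketch below uses a plain dictionary whose entries mirror the "hot spring" example above.

    # Illustrative sketch of step S14 (feature labels -> associative words).
    from typing import Dict, List

    ASSOCIATIVE_MODEL_SET_TB2: Dict[str, List[str]] = {
        "hot spring": ["warm", "sulfur", "alkaline", "steam", "recreation"],
        "car": ["drive", "road", "engine"],   # assumed entry for illustration
    }

    def derive_associative_words(feature_labels: List[str]) -> List[str]:
        words: List[str] = []
        for label in feature_labels:
            words.extend(ASSOCIATIVE_MODEL_SET_TB2.get(label, []))
        return words

    print(derive_associative_words(["hot spring"]))
    # ['warm', 'sulfur', 'alkaline', 'steam', 'recreation']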
 In step S15, metadata is generated for the image acquired in step S11. This metadata includes the associative words derived in step S14 in addition to the character strings of the feature labels obtained as the individual inference results described above. The metadata extraction device 1 may associate the metadata consisting of such feature labels and associative words with the image and store it in the metadata storage unit 27, or it may store the associated image in the image data storage unit 26. At this time, the image may be stored as an image-metadata associative label TB5 in which the image is linked with the metadata (step S16).
 FIG. 9 schematically shows the processing operations from the extraction of such metadata to its storage. For each newly acquired image A1, A2, ..., associative words are derived, on the basis of the associative model set, from the feature labels output as the comprehensive inference result in step S13. Then, an image-metadata associative label TB5 in which the feature labels and the associative words derived from them are linked to each image A1, A2, ... is generated and stored in the metadata storage unit 27. Because each image A1, A2, ... is thereby linked to metadata consisting of feature labels and associative words, when such metadata is later received as input, the images A1, A2, ... linked to it can conversely be identified.
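 By way of non-limiting illustration, one possible in-memory form of the image-metadata associative label TB5 and its reverse lookup is sketched below; the record and field names are assumptions, not part of this disclosure.

    # Illustrative sketch of steps S15-S16 (image-metadata associative label TB5).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ImageMetadataRecord:          # one row of TB5 (assumed layout)
        image_id: str                   # e.g. "A1"
        feature_labels: List[str] = field(default_factory=list)
        associative_words: List[str] = field(default_factory=list)

    tb5: List[ImageMetadataRecord] = [
        ImageMetadataRecord(
            image_id="A1",
            feature_labels=["hot spring"],
            associative_words=["warm", "sulfur", "alkaline", "steam", "recreation"],
        ),
    ]

    # Reverse lookup: given a metadata word, identify the linked images.
    def images_for(word: str) -> List[str]:
        return [r.image_id for r in tb5
                if word in r.feature_labels or word in r.associative_words]

    print(images_for("steam"))          # ['A1']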
 Next, the process proceeds to step S17, in which the entity table TB3 and the corpus table TB4 are created from the generated metadata (feature labels and associative words).
 The entity table TB3 is created from the feature labels and the associative words as a table in which entities and entity values are linked to each other, as described above. At this time, the corresponding images A1, A2, ... are linked to each combination of an entity and an entity value. As a result, when an image is read out later, this can be achieved via the entity, which improves the convenience of searching.
 When the corpus table TB4 is created, it may be generated by linking, in addition to the entities stored in the entity table TB3, the intents and conversational sentence information extracted by the image identification system 2 as described later.
 Next, a method of identifying an image corresponding to the content of conversational sentence information, on the basis of the entity table TB3, the corpus table TB4, and the image-metadata associative label TB5 created sequentially through steps S11 to S17 described above, will be described in detail with reference to FIG. 8.
 The image corresponding to the content of the conversational sentence information is identified by the image identification system 2 through steps S21 to S26.
 In step S21, the reception unit 29 recognizes a conversational sentence uttered by voice. Instead of being acquired from voice, the conversational sentence may also be recognized from a conversational sentence written in manually entered text data.
 Next, the process proceeds to step S22, in which the reception unit 29 converts the conversational sentence acquired through voice recognition into text data. A known conversion method may be used for the processing operation of converting the voice data into text data.
 Next, the process proceeds to step S23, in which morphological analysis and syntactic analysis are performed on the conversational sentence converted into text data.
 Next, the process proceeds to step S24, in which one or more independent words are acquired from the text data subjected to morphological analysis and syntactic analysis in step S23. The intent determination means 31 then determines an intent having an action name containing a word identical or similar to the acquired independent word(s). For example, when the conversational sentence is "Is there an image of lineup A?", the two independent words "lineup A" and "image" are acquired from the sentence by morphological analysis, the corpus table TB4 in the intent storage unit 12 is searched using each independent word as a key, and the intent that partially matches "lineup A" and has "image search" as its action name is determined.
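 By way of non-limiting illustration, intent determination in step S24 can be sketched as a partial-match search over the corpus table TB4; the table layout, field names, and intent identifiers below are assumptions.

    # Illustrative sketch of step S24 (independent words -> intent).
    from typing import List, Optional

    corpus_table_tb4 = [
        {"intent": "intent_image_search", "action": "image search",
         "phrases": ["image", "picture", "lineup A"]},
        {"intent": "intent_price_check", "action": "price check",
         "phrases": ["price", "cost"]},
    ]

    def determine_intent(independent_words: List[str]) -> Optional[str]:
        for entry in corpus_table_tb4:
            for word in independent_words:
                # Partial match between an acquired independent word and the
                # phrases registered for the intent.
                if any(word in phrase or phrase in word
                       for phrase in entry["phrases"]):
                    return entry["intent"]
        return None

    print(determine_intent(["lineup A", "image"]))   # intent_image_search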
 Also in step S24, the entities included in the conversational sentence are acquired. In step S24, the acquired entities may be further parameterized. For example, when the conversational sentence is "Is there an image of lineup A?", the two independent words "lineup A" and "image" are acquired from the sentence by morphological analysis, and of these, the entity is "lineup A".
 Next, the process proceeds to step S25, in which image identification processing is performed. This image identification processing is performed on the basis of the intent determined and the entities extracted in step S24.
 In the entity table TB3 stored in the entity storage unit 15 described above, the corresponding images are linked to the entities and entity values. Therefore, the entity extracted from the conversational sentence in step S24 is compared against the entities and entity values in the entity table TB3, and the image corresponding to the matched entity or entity value is identified.
 For example, as shown in FIG. 10, if "lineup A" is extracted as an entity, comparing it against the entities and entity values in the entity table TB3 identifies the image A1 linked to the entity value "lineup A". Likewise, if "stone-roasted sweet potato" ("ishiyakiimo") is extracted as an entity, comparing it against the entities and entity values in the entity table TB3 identifies the image A3 linked to the entity "stone-roasted sweet potato". The term referenced in the entity table TB3 in this way may thus be either an entity or an entity value. When identifying an image, the image-metadata associative label TB5 may also be referred to as an alternative to the entity table TB3. In the image-metadata associative label TB5, images are likewise stored linked to feature labels and associative words. Therefore, the feature labels and associative words corresponding to the entity extracted from the conversational sentence in step S24 can be looked up, and the image linked to those feature labels and associative words can be identified.
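 By way of non-limiting illustration, the comparison against the entity table TB3 in step S25 can be sketched as follows; the rows reproduce the "lineup A" / image A1 and "stone-roasted sweet potato" / image A3 examples, and the column layout is an assumption.

    # Illustrative sketch of step S25 (identifying images via the entity table TB3).
    from typing import List, Tuple

    # (entity, entity_value, image_id)
    entity_table_tb3: List[Tuple[str, str, str]] = [
        ("lineup", "lineup A", "A1"),
        ("lineup", "lineup B", "A2"),
        ("stone-roasted sweet potato", "stone-roasted sweet potato", "A3"),
    ]

    def identify_images(extracted_entity: str) -> List[str]:
        # The extracted term may match either the entity or the entity value.
        return [image_id for entity, value, image_id in entity_table_tb3
                if extracted_entity in (entity, value)]

    print(identify_images("lineup A"))                    # ['A1']
    print(identify_images("stone-roasted sweet potato"))  # ['A3']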
 The intent determined in step S24 may be further utilized when identifying the image in step S25. In the corpus table TB4 stored in the corpus storage unit 14, each intent is linked to conversational sentence information, entities, and entity values, and these entities and entity values are in turn linked to images in the entity table TB3. Therefore, by referring to the corpus table TB4, and through it the entity table TB3, on the basis of the extracted intent, more accurate image identification can be achieved.
 Furthermore, if the images linked to entities and entity values are organized, grouped, and stored per intent, the search can focus only on the group corresponding to the intent determined in step S24, so the image can be identified more quickly.
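 By way of non-limiting illustration, the per-intent grouping can be sketched as a mapping from intent to candidate images, so that only one group needs to be compared in step S25; the grouping key and contents below are assumptions.

    # Illustrative sketch of per-intent grouping of images.
    from collections import defaultdict
    from typing import Dict, List

    images_by_intent: Dict[str, List[str]] = defaultdict(list)
    images_by_intent["intent_image_search"].extend(["A1", "A2", "A3"])
    images_by_intent["intent_price_check"].extend(["A4"])

    def candidates_for(intent: str) -> List[str]:
        return images_by_intent.get(intent, [])

    print(candidates_for("intent_image_search"))   # ['A1', 'A2', 'A3']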
 In step S25, the image may be searched for via an API in the process of identifying it. In that case, the search result acquisition means 37 may generate API information including the extracted intent, entities, and the like, and search for the image in the storage unit 1 (database) on the basis of the generated API information. In other words, the image search may be performed on the basis of API information in which the parameters (intent and entities) are reflected.
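 By way of non-limiting illustration only, the sketch below shows one way API information reflecting the intent and entities could be assembled into a query; the endpoint URL, parameter names, and response format are all hypothetical and are not part of this disclosure.

    # Illustrative sketch of an API-based image search in step S25.
    import json
    from typing import List
    from urllib import parse, request

    def search_images(intent: str, entities: List[str]) -> List[str]:
        params = parse.urlencode({
            "intent": intent,                    # assumed parameter name
            "entities": ",".join(entities),      # assumed parameter name
        })
        url = "https://example.invalid/api/image-search?" + params  # placeholder endpoint
        with request.urlopen(url) as resp:       # would require a real endpoint
            return json.loads(resp.read())["image_ids"]   # assumed response field

    # Example call (hypothetical):
    # search_images("intent_image_search", ["lineup A"])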
 After the image is identified, the process proceeds to step S26, in which the identified image is displayed. When the image is to be displayed to the user of the user terminal 10, the image is transmitted from the image identification system 2 to the user terminal 10 via the public communication network 50 and displayed via the user terminal 10. The image may also be displayed directly from the image identification system 2, in which case it is displayed via the output unit 49.
 According to the present invention configured as described above, the work of associating metadata with the image of each information medium, which would otherwise be extremely laborious, can be performed automatically. Photographs can be converted into metadata without relying on human visual discrimination and definition, which reduces the burden of labor.
 Furthermore, according to the present invention, even though imaged information media come in a wide variety of types, metadata that makes subsequent searches of these images more convenient can be generated and associated with the images. In particular, when a colloquial conversational sentence such as "Is there an image of lineup A?" is acquired by voice, an appropriate information medium image corresponding to the received sentence can be extracted with high accuracy, and metadata that enables this can be generated and associated.
 Moreover, since this metadata also includes associative words, it contains not only the keywords of the feature labels themselves but also the various words that can be associated with them. Therefore, even when such an associative word is included in a conversational sentence, the image can still be identified from it at the time of image identification.
 The present invention is not limited to the embodiment described above. As shown in FIG. 11, the comprehensive inference model DB1 shown in FIG. 3 may further include an individual inference model DB16 in addition to the individual inference models DB11 to DB15.
 The individual inference model DB16 is a database for inferring the type of the information medium; for example, the individual inference model DB16 infers the type from the shape of the information medium. The individual inference model DB16 determines whether the information medium is, for example, a pamphlet, a catalog, financial statements, an attendance record, or an X-ray photograph. The comprehensive inference result concerning the type of the information medium is likewise converted into a feature label and, in the same way, into an entity and entity value, which further improves the convenience of subsequent searches.
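 By way of non-limiting illustration, the comprehensive inference model DB1 of FIG. 11 can be pictured as an ensemble to which a medium-type model DB16 is simply added; the callables and class names below are stand-ins, not part of this disclosure.

    # Illustrative sketch of extending DB1 with a medium-type model DB16.
    from typing import Callable, Dict

    FeatureMap = object   # stands in for the feature map produced in step S12

    comprehensive_inference_model_db1: Dict[str, Callable[[FeatureMap], Dict[str, float]]] = {
        "DB11_object":      lambda fm: {"car": 0.97, "tree": 0.02},
        "DB12_color":       lambda fm: {"red": 0.50, "green": 0.38},
        # ... DB13 to DB15 ...
        "DB16_medium_type": lambda fm: {"catalog": 0.91, "pamphlet": 0.07},
    }

    def run_all(feature_map: FeatureMap) -> Dict[str, Dict[str, float]]:
        return {name: model(feature_map)
                for name, model in comprehensive_inference_model_db1.items()}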
1 Metadata extraction device
2 Image identification system
3 Execution unit
4 Auxiliary storage unit
5 Image acquisition unit
6 Feature map generation unit
7 Comprehensive inference unit
8 Associating unit
9 Extraction unit
10 User terminal
11 Table storage unit
12 Intent storage unit
13 Information storage unit
14 Corpus storage unit
15 Entity storage unit
18 Day conversion information storage unit
19 Storage unit
20 Central control unit
21 Conversational sentence reception means
26 Image data storage unit
27 Metadata storage unit
29 Reception unit
30 Parameterization means
30 Feature map
31 Intent determination means
32 Conversational sentence information determination means
33 Entity acquisition unit
34 Parameter acquisition unit
35 Information acquisition means
36 Inquiry information configuration unit
37 Search result acquisition means
39 Processing unit
41 Search result output means
49 Output unit
50 Public communication network
100 Metadata extraction system
211 Voice reception means
212 Voice recognition means
341 Determination means
342 Day information acquisition means
343 Entity name acquisition means
344 Translation item name acquisition means
345 Table identifier acquisition means
346 Primary key identifier acquisition means
347 Conversion parameter acquisition means
 

Claims (4)

  1.  A metadata extraction program for extracting metadata from information contained in an image of an information medium, the program causing a computer to execute:
     a feature map generation step of generating a feature map in which features are extracted from the image of the information medium;
     an inference step of referring to one or more individual inference models in which feature maps and feature labels for respective elements are associated with each other, and extracting, as metadata, the feature label for each element from the feature map generated in the feature map generation step,
     wherein, in the inference step, an associative model set in which the feature labels associated in the individual inference models and associative words conceivable from those feature labels are linked to each other is referred to, and associative words are derived from the extracted feature labels;
     an entity table generation step of generating an entity table in which entities consisting of the feature labels extracted and the associative words derived in the inference step are linked to the image on a one-to-one or one-to-many basis;
     a corpus table generation step of generating a corpus table in which, for each intent for specifying a processing operation, conversational sentence information and entity information on one or more of the entities corresponding to the conversational sentence information are linked;
     a conversational sentence reception step of receiving a conversational sentence;
     an intent determination step of determining an intent for specifying a processing operation that corresponds to the conversational sentence received in the conversational sentence reception step;
     an entity extraction step of extracting one or more entities contained in the one or more conversational sentences received in the conversational sentence reception step, by referring to the corpus table generated in the corpus table generation step on the basis of the intent determined in the intent determination step; and
     an image identification step of referring to the entity table generated in the entity table generation step and identifying an image linked to the one or more entities extracted in the entity extraction step.
  2.  An image identification program causing a computer to execute:
     a conversational sentence reception step of receiving a conversational sentence;
     an intent determination step of determining an intent for specifying a processing operation that corresponds to the conversational sentence received in the conversational sentence reception step;
     an entity extraction step of extracting one or more entities contained in the one or more conversational sentences received in the conversational sentence reception step, by referring to a corpus table acquired in advance on the basis of the intent determined in the intent determination step; and
     an image identification step of referring to an entity table acquired in advance and identifying an image linked to the one or more entities extracted in the entity extraction step,
     wherein the image identification step refers to an entity table in which entities are linked to the image on a one-to-one or one-to-many basis, the entities consisting of the feature labels for respective elements extracted, with reference to one or more individual inference models in which feature maps and feature labels for respective elements are associated with each other, from a feature map of features extracted from the image of an information medium, and of associative words derived from the extracted feature labels with reference to an associative model set in which the feature labels associated in the individual inference models and associative words conceivable from those feature labels are linked to each other, and
     the entity extraction step refers to a corpus table in which, for each intent, conversational sentence information and entity information on one or more of the entities corresponding to the conversational sentence information are linked.
  3.  A metadata extraction system for extracting metadata from information contained in an image of an information medium, comprising:
     feature map generation means for generating a feature map in which features are extracted from the image of the information medium;
     inference means for referring to one or more individual inference models in which feature maps and feature labels for respective elements are associated with each other, and extracting, as metadata, the feature label for each element from the feature map generated by the feature map generation means,
     wherein the inference means refers to an associative model set in which the feature labels associated in the individual inference models and associative words conceivable from those feature labels are linked to each other, and derives associative words from the extracted feature labels;
     entity table generation means for generating an entity table in which entities consisting of the feature labels extracted and the associative words derived by the inference means are linked to the image on a one-to-one or one-to-many basis;
     corpus table generation means for generating a corpus table in which, for each intent for specifying a processing operation, conversational sentence information and entity information on one or more of the entities corresponding to the conversational sentence information are linked;
     conversational sentence reception means for receiving a conversational sentence;
     intent determination means for determining an intent for specifying a processing operation that corresponds to the conversational sentence received by the conversational sentence reception means;
     entity extraction means for extracting one or more entities contained in the one or more conversational sentences received by the conversational sentence reception means, by referring to the corpus table generated by the corpus table generation means on the basis of the intent determined by the intent determination means; and
     image identification means for referring to the entity table generated by the entity table generation means and identifying an image linked to the one or more entities extracted by the entity extraction means.
  4.  An image identification system comprising:
     conversational sentence reception means for receiving a conversational sentence;
     intent determination means for determining an intent for specifying a processing operation that corresponds to the conversational sentence received by the conversational sentence reception means;
     entity extraction means for extracting one or more entities contained in the one or more conversational sentences received by the conversational sentence reception means, by referring to a corpus table acquired in advance on the basis of the intent determined by the intent determination means; and
     image identification means for referring to an entity table acquired in advance and identifying an image linked to the one or more entities extracted by the entity extraction means,
     wherein the image identification means refers to an entity table in which entities are linked to the image on a one-to-one or one-to-many basis, the entities consisting of the feature labels for respective elements extracted, with reference to one or more individual inference models in which feature maps and feature labels for respective elements are associated with each other, from a feature map of features extracted from the image of an information medium, and of associative words derived from the extracted feature labels with reference to an associative model set in which the feature labels associated in the individual inference models and associative words conceivable from those feature labels are linked to each other, and
     the entity extraction means refers to a corpus table in which, for each intent, conversational sentence information and entity information on one or more of the entities corresponding to the conversational sentence information are linked.
PCT/JP2021/036456 2020-12-15 2021-10-01 Metadata extraction program WO2022130734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-207695 2020-12-15
JP2020207695A JP6902764B1 (en) 2020-12-15 2020-12-15 Metadata extraction program

Publications (1)

Publication Number Publication Date
WO2022130734A1 true WO2022130734A1 (en) 2022-06-23

Family

ID=76753240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/036456 WO2022130734A1 (en) 2020-12-15 2021-10-01 Metadata extraction program

Country Status (2)

Country Link
JP (1) JP6902764B1 (en)
WO (1) WO2022130734A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000353173A (en) * 1999-06-11 2000-12-19 Hitachi Ltd Method and device for sorting image with document, and recording medium
JP2010068434A (en) * 2008-09-12 2010-03-25 Toshiba Corp Metadata editing device and method for generating metadata

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Program Bot Framework", 29 October 2018, NIKKEI BP, ISBN: 978-4-8222-5372-1, article JOE MAYO, FUMITAKA OSAWA, MIKI SHIMIZU: "Table of Contents; Program Bot Framework", pages: 1 - 18, XP009537837 *

Also Published As

Publication number Publication date
JP2022094677A (en) 2022-06-27
JP6902764B1 (en) 2021-07-14

Similar Documents

Publication Publication Date Title
US10621988B2 (en) System and method for speech to text translation using cores of a natural liquid architecture system
US7565139B2 (en) Image-based search engine for mobile phones with camera
US20210089571A1 (en) Machine learning image search
US8577882B2 (en) Method and system for searching multilingual documents
US20150169525A1 (en) Augmented reality image annotation
EP2402867A1 (en) A computer-implemented method, a computer program product and a computer system for image processing
US20120215533A1 (en) Method of and System for Error Correction in Multiple Input Modality Search Engines
US20090144056A1 (en) Method and computer program product for generating recognition error correction information
JP2011070412A (en) Image retrieval device and image retrieval method
WO2022134701A1 (en) Video processing method and apparatus
KR20100114082A (en) Search based on document associations
WO2024046189A1 (en) Text generation method and apparatus
CN112069326A (en) Knowledge graph construction method and device, electronic equipment and storage medium
US20150146040A1 (en) Imaging device
CN112860642A (en) Court trial data processing method, server and terminal
US9165186B1 (en) Providing additional information for text in an image
WO2022130734A1 (en) Metadata extraction program
JP2005202939A (en) Method of creating xml file
JP6954549B1 (en) Automatic generators and programs for entities, intents and corpora
CN112236768A (en) Search text generation system and search text generation method
JP2022181319A (en) Video search apparatus, video search system, and program
JP7207543B2 (en) Information recommendation device, information recommendation system, information recommendation method, and information recommendation program
US10579738B2 (en) System and method for generating a multi-lingual and multi-intent capable semantic parser based on automatically generated operators and user-designated utterances relating to the operators
JP6107003B2 (en) Dictionary updating apparatus, speech recognition system, dictionary updating method, speech recognition method, and computer program
KR101350978B1 (en) System for managing personal relationship using application and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21906099

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21906099

Country of ref document: EP

Kind code of ref document: A1