WO2013035670A1

WO2013035670A1 - Object retrieval system and object retrieval method

Info

Publication number: WO2013035670A1
Application number: PCT/JP2012/072363
Authority: WO
Inventors: 貴志住吉; 義崇平松; 洋登永吉
Original assignee: 株式会社日立製作所
Priority date: 2011-09-09
Filing date: 2012-09-03
Publication date: 2013-03-14
Also published as: JP5844375B2; JPWO2013035670A1

Abstract

The purpose of the invention is to conveniently retrieve an object requested by a user by means of a spoken dialog, by obtaining the required object named in the spoken dialog, from an image acquired within a space. When a speech recognition event requesting retrieval of an object occurs, a key image related to a keyword in the request is extracted from an image database (136). If the number of extractions is 0, speech prompting for repetition is output from a speaker section (16). If the number of extractions is 1 or more, an image or images having a high degree of similarity to the key image are extracted from an image photographed by an environmentally installed camera. If the number of image extractions is 0, speech indicating that the corresponding object does not exist is output from the speaker section (16). If the number of retrieved images is 2 or more, speech indicating a search refinement query is output from the speaker section (16), and search refinement is performed. If no search refinement method results from the query, or if the number of retrieved images is 1, a location is identified from the retrieved image group, and speech that describes the location is output from the speaker section (16).

Description

Object search system and object search method

The present invention relates to an object search technique using a service robot, and more particularly to an effective technique for searching an object existing in a real space by voice conversation.

Currently, research and development of robot technology is thriving, and research and development of service robots that realize various services while communicating with human beings is advancing.

The service robot mainly looks like a human and can move in space by a moving mechanism such as legs and wheels (see Non-Patent Document 1, for example).

In addition, through interfaces such as microphones, cameras, speakers, and gesture mechanisms, communication is achieved by voice and gestures while looking at human faces. In order to realize the above-described communication, for example, various techniques such as voice recognition, image recognition, voice synthesis, and dialogue control are used.

The speech recognition installed in this type of service robot can usually accept only specific commands and cannot recognize words that are not set in advance. This is because the speech recognition algorithm usually compares the user's speech with preset words and selects the one with the closest acoustic match (likelihood).

For example, Non-Patent Document 2 is known as a technique for causing a service robot to learn the name of a new object. According to Non-Patent Document 2, when a user shows an object to a robot and speaks a name, the robot stores a voice section that is considered to represent the name of the object in the spoken voice together with the image.

After that, when the user shows the same object, the robot converts the voice quality of the stored speech associated with the image and speaks it to the user as the robot's own voice. This allows the user to understand that the robot has stored the name of the object.

Furthermore, as a technique for causing the service robot to learn the name of a new object, there is known, for example, a technique that detects an utterance teaching a name in a natural conversation, extracts the name of the object from the utterance, and learns the name by linking it to the object (see Patent Document 1).

On the other hand, as a technique for searching for an image desired by a user through dialogue using voice recognition, there is known a technique for efficiently searching for the image intended by the user by reducing the ambiguity of the search condition (see Patent Document 2).

In Patent Document 2, when a user inputs desired image features (name, position, size, etc.) through dialogue with a device by voice or text, an example of an image satisfying those features is generated and presented to the user.

Patent Document 1: JP 2010-282199 A
Patent Document 2: Japanese Patent Laid-Open No. 2003-196306

As described above, service robots move in space and communicate by voice and gesture through their interfaces in order to provide various value-added services closely tied to daily life. However, there is no technology that allows a user to easily search for an object existing in the space through voice dialogue.

An object of the present invention is to provide a technique that allows a user to easily search for a requested object through voice dialogue by obtaining the object name needed for the voice dialogue from images acquired in the space.

The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.

Of the inventions disclosed in this application, the outline of typical ones will be briefly described as follows.

That is, in order to achieve the above-described object, the present invention includes a first camera (an environment-installed camera) that acquires images and a dialog interface that searches for an object through voice dialogue, and realizes a system that searches, in a voice dialogue format, for an object that a user wants to find.

Further, in the present invention, the dialog interface of the object search system has a first database (environment-installed camera image database) that stores images acquired by the first camera, and a second database (image database) that stores object images and keyword lists related to those images.

The dialog interface has a control unit. The control unit extracts an image related to an object name from the second database based on the object name input by voice, and searches the first database for images having a high degree of similarity to the image extracted from the second database. The above-described problem is thereby solved.

In addition, the present invention can also be applied to a method using a system that extracts an image related to an object name from the second database based on the object name input by voice, and searches the first database for images having a high similarity to the image extracted from the second database.

Among the inventions disclosed in the present application, effects obtained by typical ones will be briefly described as follows.

-Objects that exist in real space can be easily searched by voice dialogue.

FIG. 1 is an explanatory diagram showing an example of the configuration of the object search system according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing an example of the dialog interface device provided in the object search system of FIG. 1.
FIG. 3 is a flowchart showing an example of the operation of the dialogue control program stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 4 is an explanatory diagram showing an example of the data contents of the color feature expression database stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 5 is an explanatory diagram showing an example of the data contents of the size feature expression database stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 6 is an explanatory diagram showing an example of the data contents of the shape feature expression database stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 7 is an explanatory diagram showing an example of the data structure of the image database stored in the storage device provided in the dialog interface device of FIG. 2, and of the stored data contents.
FIG. 8 is an explanatory diagram showing an example of the data structure of the environment-installed camera image database stored in the storage device provided in the dialog interface device of FIG. 2, and of the stored data contents.
FIG. 9 is a flowchart showing an example of the operation of the speech recognition program stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 10 is a conceptual diagram showing an example of the speech recognition dictionary stored in the storage device provided in the dialog interface device of FIG. 2.
FIG. 11 is an explanatory diagram showing an example of the dialog interface device according to Embodiment 2 of the present invention.
FIG. 12 is a flowchart showing an example of the operation of the dialogue control program stored in the storage device provided in the dialog interface device of FIG. 11.
FIG. 13 is an explanatory diagram showing an example of the data structure of the user identification database stored in the storage device provided in the dialog interface device of FIG. 11, and of the stored data contents.
FIG. 14 is an explanatory diagram showing an example of the data structure of the user database stored in the storage device provided in the dialog interface device of FIG. 11, and of the stored data contents.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. Note that components having the same function are denoted by the same reference symbols throughout the drawings for describing the embodiment, and the repetitive description thereof will be omitted.

(Embodiment 1)
FIG. 1 is an explanatory diagram showing an example of the configuration of the object search system according to the first embodiment of the present invention, and FIG. 2 is a block diagram showing an example of the dialog interface device provided in the object search system of FIG. 1. FIGS. 3 to 10 relate to the programs and data stored in the storage device provided in the dialog interface device of FIG. 2. FIG. 3 is a flowchart showing an example of the operation of the dialogue control program. FIG. 4 is an explanatory diagram showing an example of the data contents of the color feature expression database, FIG. 5 is an explanatory diagram showing an example of the data contents of the size feature expression database, and FIG. 6 is an explanatory diagram showing an example of the data contents of the shape feature expression database. FIG. 7 is an explanatory diagram showing an example of the data structure of the image database and the stored data contents, and FIG. 8 is an explanatory diagram showing an example of the data structure of the environment-installed camera image database and the stored data contents. FIG. 9 is a flowchart showing an example of the operation of the speech recognition program, and FIG. 10 is a conceptual diagram showing an example of the speech recognition dictionary.

<Summary of invention>
A first outline of the present invention is an object search system (object search system 1) that includes a first camera (environment-installed cameras 20a to 20c) that acquires images and a dialog interface (dialog interface device 10) that searches for an object through voice dialogue. The dialog interface has a first database (environment-installed camera image database 137) that stores images acquired by the first camera, a second database (image database 136) that stores object images and keyword lists related to those images, and a control unit (dialogue control program 131) that, based on an object name input by voice, extracts an image related to the object name from the second database and then searches the first database for, and extracts, images having a high similarity to the image extracted from the second database.

A second outline of the present invention is a method for searching for an object using an object search system (object search system 1) that includes a first camera (environment-installed cameras 20a to 20c) that acquires images and a dialog interface that searches for an object through voice dialogue. The method includes a step of storing images acquired by the first camera in a first database (environment-installed camera image database 137); a step of storing object images and keyword lists related to those images in a second database (image database 136); and a step in which the dialog interface, based on an object name obtained by recognizing input speech, extracts an image related to the object name from the second database and then searches the first database for, and extracts, images having a high similarity to the image extracted from the second database.

Hereinafter, the embodiment will be described in detail based on the above-described outline.

<Configuration of object search system>
In the first embodiment, the object search system 1 is a system for searching for an object existing in a real space such as an office by voice dialogue. As shown in FIG. 1, the object search system 1 includes a dialog interface device 10, environment-installed cameras 20a to 20c, and a network 30.

The dialog interface device 10 extracts images having a high degree of similarity to the object whose name the user inputs through voice dialogue or the like, and presents the extracted images, or the positions at which they were shot, to the user. The environment-installed cameras 20a to 20c are installed at arbitrary positions in the real space and take still images of the real space.

The dialog interface device 10 and the environment-installed cameras 20a to 20c are connected to each other via the network 30, and data can be transmitted and received via the network 30.

The network 30 is, for example, a wireless TCP/IP (Transmission Control Protocol/Internet Protocol) network. TCP/IP is the protocol suite used as standard on the Internet and similar networks.

Although FIG. 1 shows an example in which three environment-installed cameras 20a to 20c are provided, the number of environment-installed cameras may be any number of one or more. Furthermore, although the network 30 is described as a wireless TCP/IP network, it may be wired, and the communication method is not limited to TCP/IP.

<Configuration of dialog interface device>
FIG. 2 is a block diagram illustrating an example of the dialog interface device 10.

As shown in FIG. 2, the dialog interface device 10 includes a CPU (Central Processing Unit) 12, a storage device 13, a network interface 14, a microphone unit 15, a speaker unit 16, a camera unit 17, and a moving device 18.

The components of the dialog interface device 10 (the CPU 12, the storage device 13, the network interface 14, the microphone unit 15, the speaker unit 16, the camera unit 17, and the moving device 18) are connected to one another by a bus 11, and communication among them is established by, for example, a bus architecture. Note that the communication method between the components is not limited to the bus architecture, and communication may be established by another method.

The CPU 12 reads the various programs stored in the storage device 13 and, in accordance with the programs it has read, reads and writes data in the storage device 13, performs operations such as the four arithmetic operations, controls the network interface 14, the microphone unit 15, the speaker unit 16, the camera unit 17, and the moving device 18, and exchanges data with them.

In the present embodiment, the CPU 12 is described as a general-purpose CPU. However, the CPU 12 may be configured by a hardware chip that realizes a function equivalent to each program, for example.

The storage device 13 stores a dialogue control program 131, a speech recognition program 132, a dictionary creation program 133, an environment image acquisition program 134, an image database 136, an environment-installed camera image database 137, a speech recognition dictionary 138, a speech recognition acoustic model 139, a color feature expression database 140, a size feature expression database 141, a shape feature expression database 142, and the like.

The dialogue control program 131 causes the CPU 12 to carry out a voice dialogue and present the object being searched for to the user. The dialogue control program 131 includes a key image extraction subroutine, a search image extraction subroutine, a narrowing method determination subroutine, and the like.

The key image extraction subroutine extracts a key image group related to the keyword from the image database 136. The search image extraction subroutine extracts, as a search image group, those images taken by the environment-installed cameras 20a, 20b, and 20c that have a high similarity to the key image group. The narrowing method determination subroutine makes a narrowing inquiry to the user corresponding to the narrowing method obtained.

Furthermore, the voice recognition program 132 performs voice recognition and causes the CPU 12 to execute a process for issuing a voice recognition event as a recognition result. The dictionary creation program 133 causes the CPU 12 to execute processing for constructing a portion corresponding to the object name of the search request portion in the speech recognition dictionary (FIG. 10).

The environment image acquisition program 134 causes the CPU 12 to repeatedly acquire images and metadata as image information from the environment-installed cameras 20a to 20c through the network 30 and add them as new records to the environment-installed camera image database 137.

The data structures in the image database 136, the environment-installed camera image database 137, and the speech recognition dictionary 138 will be described later.

The network interface 14 is an interface for connecting the dialog interface device 10 to the network 30 (FIG. 1). The microphone unit 15 is installed to record sound in the environment (real space), in particular the user's voice. A microphone device in the microphone unit 15 observes the voice waveform, performs digital sampling, and allows the CPU 12 to refer to the data.

The speaker unit 16 is installed in the environment in order to make the user listen to the voice, and converts the data transmitted from the CPU 12 into an analog waveform and outputs it as a sound wave.

The camera unit 17 is installed in the environment to photograph a human face or an object in the environment, observes an image with a camera device, performs digital quantization, and allows the CPU 12 to refer to the data.

In FIG. 1, the environment-installed cameras 20a, 20b, and 20c are connected to the dialog interface device 10 via the network 30, but a configuration in which the camera unit 17 is used instead of the environment-installed cameras 20a, 20b, and 20c is also possible.

The moving device 18 includes, for example, a motor and a control unit that controls the motor. The moving device 18 operates according to a command from the CPU 12, and the control unit drives the motor to move the dialog interface device 10 itself.

Also, the moving device 18 may be equipped with a GPS (Global Positioning System) or an odometer that measures the current position of the dialog interface device 10 in order to accurately move to the position specified by the CPU 12. Furthermore, a laser range finder, a stereo camera, or the like may be mounted on the moving device 18 in order to avoid a collision with an obstacle. These may be provided separately from the moving device 18 and connected to the bus 11, for example.

Each component of the dialog interface device 10 described above (the CPU 12, the storage device 13, the network interface 14, the microphone unit 15, the speaker unit 16, the camera unit 17, and the moving device 18) has been described assuming that there is one of each, but the number is not limited to one.

<Operation example of dialogue control program>
FIG. 3 is a flowchart showing an example of the operation in the dialogue control program 131 stored in the storage device 13.

Here, the dialogue control program 131 is a program that is always executed while the dialogue interface device 10 is being used.

First, it is checked whether a voice recognition event has occurred (step S101). If none has occurred, the process waits until one does; if one has occurred, the process branches according to the type of the voice recognition event.

The voice recognition event checked in step S101 is issued when speech recognition is performed by the speech recognition program 132, described later, and carries the recognition result information of that speech recognition.

If the voice recognition event is a “search request” requesting a search for an object, the key image extraction subroutine (described later) in the dialogue control program 131 is executed based on the keyword of the voice recognition event, and a key image group related to the keyword is extracted from the image database 136 (FIG. 7) (step S102).

If the number of extractions in step S102 is 0 (step S103), there is no key image corresponding to the spoken keyword and the subsequent search process cannot be performed, so a voice prompting the user to speak again is output from the speaker unit 16 (step S104).

If the number of extractions is one or more, the search image extraction subroutine (described later) in the dialogue control program 131 is executed based on the extracted key image group (step S105). Of the images photographed by the environment-installed cameras 20a, 20b, and 20c, those having a high similarity to the key image group are extracted as the search image group.

If the number of search images extracted in step S105 is zero (step S106), a voice indicating that no object corresponding to the keyword exists in the environment (the real space in which the object search is performed) is output from the speaker unit 16 (step S107).

If the number of search images by the process of step S105 is one or more, it is determined whether or not the number of search images is a threshold value (for example, the number of search images is two) or more (step S108).

In the process of step S108, if the number of search images is equal to or greater than the threshold value (two or more), a narrowing method determination subroutine described later in the dialogue control program 131 is executed based on the search image group (step S109).

The narrowing-down method determination subroutine in the process of step S109 outputs a narrow-down inquiry voice from the speaker unit 16 in order to make a narrow-down inquiry corresponding to the obtained narrowing-down method to the user.

If it is determined as a result of the narrowing inquiry that there is no narrowing method, or if the number of search images is less than the threshold value (that is, one), the location is identified from the search image group, and a voice explaining the location is output from the speaker unit 16 (step S110). Thereafter, the process returns to step S102.
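The branching in steps S102 to S110 can be sketched as follows. This is only an illustrative outline, not the implementation of the dialogue control program 131: the function name, the callback parameters, the reply strings, and the use of a location string to stand for each search image are all assumptions made for the example.

```python
from typing import Callable, List, Optional

NARROWING_THRESHOLD = 2  # example threshold given for step S108


def handle_search_request(
    keyword: str,
    extract_key_images: Callable[[str], List[str]],
    extract_search_images: Callable[[List[str]], List[str]],
    choose_narrowing_question: Callable[[List[str]], Optional[str]],
) -> str:
    """Return the utterance to output for a "search request" event (hypothetical)."""
    key_images = extract_key_images(keyword)  # S102
    if not key_images:  # S103: no key image for the keyword
        return "Please say that again."  # S104
    search_images = extract_search_images(key_images)  # S105
    if not search_images:  # S106: nothing similar in the environment
        return f"There is no {keyword} here."  # S107
    if len(search_images) >= NARROWING_THRESHOLD:  # S108
        question = choose_narrowing_question(search_images)  # S109
        if question is not None:
            return question
    # S110: one image left, or no narrowing method available.
    # Each search image is represented here by its location (assumption).
    return f"The {keyword} is at {search_images[0]}."
```

Passing the three subroutines in as callbacks keeps the sketch independent of how key images, similarity search, and narrowing questions are actually realized.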

Here, the voice output from the speaker unit 16 may be a pre-recorded voice, or may be a voice waveform synthesized by a generally known text-to-speech technique or the like from text that combines the object name in the speech recognition result with a fixed phrase.

Further, the dialog interface device 10 may be provided with a display or the like, and an explanation of the location of the searched object may be shown on the display instead of, or together with, the voice. When the display is used, for example, a map of the space is displayed, and the search image group is displayed in association with the position on the map corresponding to the location.

Further, the explanation may be displayed from a stage partway through the narrowing process (for example, the processes in steps S108 to S109). Alternatively, an instruction may be sent to a PDA (Personal Digital Assistant) carried by the user, such as a mobile terminal or a head-mounted display, so that information is displayed at the position corresponding to the location of the search image group on a map or photographed image displayed by the PDA. The present embodiment does not limit the technique by which the dialog interface device 10 presents information to the user.

Next, when the voice recognition event in step S101 is “narrow down”, a narrowing process that narrows down the search image group based on the narrowing content of the voice recognition event is performed (step S111).

Here, an example of the narrowing process will be described.

For example, if the narrowing content is “color-red”, only those images in the search image group whose hue histogram has at least a certain proportion of its accumulated components in the neighborhood of red, or only the top-ranked images by that proportion, are selected as a new search image group.

To perform this process, the color feature expression information stored in the color feature expression database 140 shown in FIG. 4 is referred to. As shown in FIG. 4, the color feature expression database 140 holds, for each color name (left side of FIG. 4), the corresponding color components, for example the intensities of the three primary colors RGB (Red, Green, Blue) (shown as “RGB” on the right side of FIG. 4).
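As a rough illustration of color narrowing, the sketch below keeps only images in which a sufficient fraction of pixels has a hue near that of the named color. The table `COLOR_DB`, the hue window, the saturation cutoff, and the 30% ratio are hypothetical values chosen for the example, and each image is represented simply as a list of RGB pixels.

```python
import colorsys
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical stand-in for the color feature expression database (FIG. 4).
COLOR_DB: Dict[str, RGB] = {"red": (255, 0, 0), "blue": (0, 0, 255)}


def near_hue_ratio(pixels: List[RGB], target: RGB, width: float = 0.05) -> float:
    """Fraction of pixels whose hue lies within `width` of the target hue."""
    if not pixels:
        return 0.0
    target_hue = colorsys.rgb_to_hsv(*(c / 255.0 for c in target))[0]
    hits = 0
    for pixel in pixels:
        hue, sat, _ = colorsys.rgb_to_hsv(*(c / 255.0 for c in pixel))
        if sat < 0.2:
            continue  # grey, white, and black pixels carry no reliable hue
        diff = abs(hue - target_hue)
        if min(diff, 1.0 - diff) <= width:  # hue wraps around at 1.0
            hits += 1
    return hits / len(pixels)


def narrow_by_color(
    images: Dict[str, List[RGB]], color: str, min_ratio: float = 0.3
) -> List[str]:
    """Keep images whose near-target hue fraction is at least `min_ratio`."""
    target = COLOR_DB[color]
    return [
        name for name, px in images.items()
        if near_hue_ratio(px, target) >= min_ratio
    ]
```

Selecting the top-ranked images instead of applying a fixed ratio, as the text also allows, would only require sorting by `near_hue_ratio` rather than thresholding.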

In addition, when the narrowing content is “size-20 cm”, the size of the object in each image of the search image group is estimated, and only those whose size (for example, the long side or diagonal of the object) deviates from 20 cm by no more than an arbitrary set value, or only those with the smallest deviation, are selected as a new search image group.

When this process is performed, the size feature expression information stored in the size feature expression database 141 shown in FIG. 5 is referred to. The size feature expression database 141 holds, as size feature expressions, a range of values (“value (cm)” on the right side of FIG. 5) corresponding to each size expression (“size name” on the left side of FIG. 5), expressed for example in units of millimeters.
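Size narrowing can be sketched in the same way. The example below assumes that object sizes have already been estimated and are given in centimeters; the `SIZE_DB` ranges, the tolerance, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the size feature expression database (FIG. 5):
# size name -> (lower bound, upper bound), assumed here to be in cm.
SIZE_DB: Dict[str, Tuple[float, float]] = {
    "small": (0.0, 10.0),
    "large": (30.0, 100.0),
}


def narrow_by_size_value(
    sizes_cm: Dict[str, float], target_cm: float, tolerance_cm: float = 5.0
) -> List[str]:
    """Keep images within the tolerance; otherwise keep the closest one."""
    within = [
        name for name, size in sizes_cm.items()
        if abs(size - target_cm) <= tolerance_cm
    ]
    if within or not sizes_cm:
        return within
    closest = min(sizes_cm, key=lambda name: abs(sizes_cm[name] - target_cm))
    return [closest]


def narrow_by_size_name(sizes_cm: Dict[str, float], size_name: str) -> List[str]:
    """Keep images whose estimated size falls in the named range."""
    low, high = SIZE_DB[size_name]
    return [name for name, size in sizes_cm.items() if low <= size <= high]
```

For instance, with estimated sizes of 18 cm, 42 cm, and 8 cm, a request for "20 cm" keeps only the 18 cm object, while a request for "small" keeps only the 8 cm object.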

When the narrowing content is “shape-circle”, the shape of the object in each image of the search image group is estimated, and only those whose shape matches, or whose numerical score for that shape is at or above a set value or ranks highest, are selected as a new search image group.

In this process, the shape feature expression information stored in the shape feature expression database 142 shown in FIG. 6 is referred to. The shape feature expression database 142 holds, as shape feature expressions, the shape identifier (“identifier” on the right side of FIG. 6) corresponding to each shape expression (“model name” on the left side of FIG. 6).

Then, after narrowing down step S111, the processing after step S106 already described is performed.

If the voice recognition event in step S101 is “guidance request”, the moving device 18 is instructed to move to the location last presented in step S110, the dialog interface device 10 is moved (step S112), and the process returns to step S102.

If an exceptional situation occurs in the above processing, the user is notified of the exception and processing of the voice recognition event is skipped. Specifically, this applies, for example, when the voice recognition event is “narrow down” but no “search request” event has been processed beforehand, so that there is no search image group to narrow down, or when the event is “guidance request” but the place to guide to has not been determined.
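A minimal sketch of this exception check is shown below. The `DialogState` fields, the function name, and the message strings are hypothetical; only the two conditions come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogState:
    """Hypothetical state kept between voice recognition events."""
    search_images: List[str] = field(default_factory=list)
    last_location: Optional[str] = None


def check_event(event_type: str, state: DialogState) -> Optional[str]:
    """Return a notice for the user if the event must be skipped, else None."""
    if event_type == "narrow down" and not state.search_images:
        # No "search request" has produced a search image group yet.
        return "There is nothing to narrow down yet."
    if event_type == "guidance request" and state.last_location is None:
        # No location has been presented by step S110 yet.
        return "No location has been presented yet."
    return None
```

Calling `check_event` before dispatching each event lets the dialogue loop notify the user and skip the event without touching the search state.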

<Data structure and contents of image database>
FIG. 7 is an explanatory diagram showing an example of the data structure of the image database 136 stored in the storage device 13 and the contents of the stored data.

The image database 136 is a relational database consisting of an image, shown on the left side of FIG. 7, and a keyword list, shown on the right side of FIG. 7. For the image, the data of the image itself may be stored directly in the database, or only reference information such as a file name may be stored.

Alternatively, feature amount data obtained by converting the image for use in the similar image search described later may be stored together with the image or in place of the image.

<Key image extraction subroutine processing>
In the key image extraction subroutine, the image database 136 is searched for entries whose keyword list contains the keyword, and the key image group is obtained by extracting the images of those entries.
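The lookup itself is simple, as the following sketch suggests. The record layout (a file name plus a keyword list, mirroring FIG. 7) and the sample entries are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical records shaped like the image database of FIG. 7:
# an image reference plus its keyword list.
IMAGE_DB: List[Dict[str, object]] = [
    {"image": "mug_01.jpg", "keywords": ["mug", "cup"]},
    {"image": "cup_02.jpg", "keywords": ["cup", "teacup"]},
    {"image": "stapler_01.jpg", "keywords": ["stapler"]},
]


def extract_key_images(keyword: str, db: List[Dict[str, object]]) -> List[str]:
    """Return the images of all entries whose keyword list contains `keyword`."""
    return [str(rec["image"]) for rec in db if keyword in rec["keywords"]]
```

An empty result corresponds to the case in which the number of extractions is 0 and the user is prompted to speak again.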

The image database 136 is expected to increase the accuracy of object search as it becomes larger. However, the cost of constructing the image database 136 (such as creating a keyword list) also increases.

Therefore, as a technique for solving the increase in the construction cost, a method for automatically constructing an image database using a document with an image represented by an html (Hyper Text Markup Language) page on the Internet will be described below.

A large number of html pages can be acquired by crawling the Internet. Furthermore, an image can be acquired by referring to a URL (Uniform Resource Locator) to an image file included in an <img> tag that is a tag for displaying an image on an html page. The URL is a description method indicating an information location such as a document or an image existing on the Internet.
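Collecting image URLs from a crawled page can be sketched with Python's standard `html.parser` module. The class name `ImgCollector` and the decision to keep the `alt` attribute alongside `src` are choices made for this example; fetching the pages themselves is omitted.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ImgCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:  # skip <img> tags without a usable URL
            self.images.append((src, attributes.get("alt") or ""))


def collect_images(html_text: str) -> List[Tuple[str, str]]:
    parser = ImgCollector()
    parser.feed(html_text)
    return parser.images
```

The `alt` text gathered here is one of the attribute values from which a keyword list could later be derived.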

The keyword list of each acquired image can be obtained from the attribute values of the <img> tag and the text surrounding the <img> tag. For example, the attribute values and text are divided into morpheme strings by morphological analysis, and each substring is scored with a measure such as TF/IDF (Term Frequency/Inverse Document Frequency), where TF is the number of times the substring appears in the text related to the <img> tag and IDF is the reciprocal of the number of occurrences of the morpheme across all <img> tags in all HTML pages. The substrings whose score exceeds a set value, or that rank highest, are then used as the keyword list.
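The scoring described above can be sketched as follows. Two simplifications are assumed: whitespace splitting stands in for morphological analysis, and the IDF term is taken as the reciprocal of the number of image texts containing the term, following the definition given here rather than the logarithmic form common elsewhere. The function name and the threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def keyword_lists(texts: Dict[str, str], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Map each image to the terms whose TF * IDF score exceeds `min_score`."""
    # Whitespace tokenization is a stand-in for morphological analysis.
    tokenized = {img: text.lower().split() for img, text in texts.items()}

    # Document frequency: number of image texts in which each term occurs.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    result: Dict[str, List[str]] = {}
    for img, tokens in tokenized.items():
        term_freq = Counter(tokens)
        scores = {term: term_freq[term] / doc_freq[term] for term in term_freq}
        result[img] = sorted(t for t, s in scores.items() if s > min_score)
    return result
```

Terms that occur in the text of many images, such as function words, receive a low score and drop out of the keyword list, while terms specific to one image are kept.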

<Data structure and contents of environment-installed camera image database>
FIG. 8 is an explanatory diagram showing an example of the data structure in the environment-installed camera image database 137 and the contents of stored data.

The environment-installed camera image database 137 is a relational database composed of images and metadata. The metadata consists of data such as the shooting position, shooting angle, and shooting date and time.

The environment-installed camera image database 137 is constructed by the processing of the environment image acquisition program 134. As described above, the environment image acquisition program 134 repeatedly acquires images of various objects and their metadata from the environment-installed cameras 20a to 20c through the network 30 and adds them as new records to the environment-installed camera image database 137.

The timing for acquiring an image is an arbitrary time interval determined in advance, or a point in time when an image change is detected by analyzing a captured image, but is not limited to these methods.
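As one possible sketch of the record structure and of acquisition triggered by an image change, the following uses an in-memory SQLite table. The table name, column names, the mean absolute difference measure, and the threshold are all assumptions introduced for illustration; frames are represented as flat lists of 8-bit pixel values.

```python
import sqlite3
from datetime import datetime, timezone

def make_db():
    """Create an in-memory table holding images and their metadata."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE camera_images ("
        "id INTEGER PRIMARY KEY, image BLOB, position TEXT, "
        "angle REAL, shot_at TEXT)"
    )
    return db

def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def store_if_changed(db, previous, frame, position, angle, threshold=10.0):
    """Add a record only when the frame differs enough from the last stored one.

    Returns the frame to compare against next time.
    """
    if previous is not None and mean_abs_diff(previous, frame) < threshold:
        return previous
    db.execute(
        "INSERT INTO camera_images (image, position, angle, shot_at) "
        "VALUES (?, ?, ?, ?)",
        (bytes(frame), position, angle, datetime.now(timezone.utc).isoformat()),
    )
    return frame

db = make_db()
previous = None
for frame in ([0] * 16, [1] * 16, [200] * 16):
    previous = store_if_changed(db, previous, frame, "room A", 90.0)
stored = db.execute("SELECT COUNT(*) FROM camera_images").fetchone()[0]
```

Here the second frame is nearly identical to the first and is skipped, so two records are stored. Acquisition at a fixed time interval would simply call the insert unconditionally.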

When the environment-installed cameras 20a to 20c have a moving function, the position of the environment-installed cameras 20a to 20c can be obtained by providing position measuring means such as GPS (Global Positioning System) or an odometer.

If the shooting direction of the environment-installed cameras 20a to 20c can be controlled, the shooting direction can be acquired by, for example, querying the current camera orientation through an API (Application Programming Interface).

Further, the environment image acquisition program 134 may acquire an image by the camera 17 provided in the dialog interface device 10 in the same manner as acquiring images from the environment-installed cameras 20a to 20c.

In this way, if the dialog interface device 10 includes the camera 17 and the moving device 18, as in the case of a robot, they can be used for environment image acquisition as they are. Unlike typical environment-installed cameras, this has the advantage that images for object search can be acquired from the same viewpoint as the users.

<Search image extraction subroutine processing>
A processing example of the search image extraction subroutine in the dialog control program 131 will be described.

The search image extraction subroutine uses each image in the key image group as a key, measures its degree of coincidence with the objects included in the images in the environment-installed camera image database 137, and extracts the images whose degree of coincidence is equal to or higher than an arbitrary set value as a search image group.
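The thresholding step can be sketched as follows. Cosine similarity over precomputed feature vectors is used here purely as a stand-in for the degree of coincidence; the actual similar image search algorithm is the one cited in the description, and the names `extract_search_images`, `key_features`, and `camera_records` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def extract_search_images(key_features, camera_records, threshold):
    """Return IDs of camera images matching at least one key image.

    key_features must be a non-empty list of feature vectors.
    """
    hits = []
    for record_id, feature in camera_records.items():
        best = max(cosine(key, feature) for key in key_features)
        if best >= threshold:
            hits.append(record_id)
    return hits

key_features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
camera_records = {
    "img1": [1.0, 0.0, 0.1],
    "img2": [0.0, 1.0, 0.0],
    "img3": [0.8, 0.2, 0.0],
}
search_images = extract_search_images(key_features, camera_records, threshold=0.9)
```

With these example vectors, "img1" and "img3" are close to the key images and are extracted, while "img2" is not.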

As a technique for searching the environment-installed camera image database 137 for an image similar to one key image, a generally known similar image search algorithm (for example, the one described in Reference (1) below) is used.

Further, when an image included in the environment-installed camera image database 137 partially includes a plurality of objects, a partially matching similar image search algorithm (for example, the one described in Reference (1)) is used.

Reference (1): Tatsuya Harada, Hideki Nakayama, Yasuo Kuniyoshi, “AI Goggles: Wearable Image Annotation and Retrieval System with Additional Learning Functions” IEICE Transactions, Vol.J93-D, No.6, pp. 857-869, Jun. 2010.

<Narrowing method determination subroutine processing>
Next, a processing example of a narrowing method determination subroutine in the dialogue control program 131 will be described.

In the narrowing-down method determination subroutine, three narrowing-down methods are assumed: color, size, and shape. A simple approach is to apply these methods in an arbitrary order.

A more effective narrowing-down method determination method is shown below.

First, for each method, the measured values of all images in the search image group are obtained. For example, in the case of color, the color that is the main component of each search image is measured using a hue histogram. The distribution of the measurement results is then obtained, and for each narrowing-down method the average degree of narrowing expected from the user's response is estimated.

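That is, for each narrowing-down method $M$, let the possible response patterns be $X_M = \{X_{M1}, \ldots, X_{Mn}\}$ (in the case of color, $X_{M1}$ to $X_{Mn}$ correspond to color names), and let the number of search images remaining after narrowing down by each response be $N(X_{M1}), \ldots, N(X_{Mn})$. Then $M' = \operatorname{argmin}_{M} \left( \operatorname{avg}_{X_M} N(X_M) \right)$ is determined as the narrowing-down method.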
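A minimal sketch of this selection follows. It assumes that the measured value of each image has already been mapped to a response label (for example a color name) and that the average is taken uniformly over the response patterns that actually occur in the search image group; a probability-weighted average would be another possible reading. The function names and sample data are hypothetical.

```python
from collections import Counter

def average_remaining(values):
    """Average number of images left, over the observed response patterns."""
    counts = Counter(values)  # N(X) for each response pattern X
    return sum(counts.values()) / len(counts)

def choose_narrowing_method(measurements):
    """Return the method M' that minimizes the average remaining image count."""
    return min(measurements, key=lambda m: average_remaining(measurements[m]))

# One measured label per search image, for each narrowing-down method.
measurements = {
    "color": ["red", "red", "blue", "green"],
    "size": ["small", "small", "small", "large"],
    "shape": ["round", "round", "round", "round"],
}
best_method = choose_narrowing_method(measurements)
```

In this example the four images split into three color groups, two size groups, and one shape group, so asking about color leaves the fewest images on average and is selected.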

<Operation example of voice recognition program>
FIG. 9 is a flowchart showing an example of the operation in the voice recognition program 132.

As a general rule, the speech recognition program 132 is started when the dialog interface device 10 is activated and runs continuously during the period in which speech recognition is desired. It continuously refers to the voice waveform data recorded by the microphone unit 15.

In FIG. 9, first, the speech waveform observed by the microphone unit 15 is analyzed, it is determined whether or not speech is present, and a section in which speech is present is determined (step S201). The processing in step S201 can be realized by, for example, a known method called voice segment detection (Document (2)).

Further, instead of analyzing the voice waveform data, or as an auxiliary means, the dialog interface device 10 may be provided with a switch that the user operates to indicate the voice section, or the result of detecting a face image or a lip image from the image captured by the camera unit 17 may be used for voice section detection.
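The waveform-based determination of a voice section in step S201 could, for instance, be approximated by a simple frame-energy threshold. The following sketch is only an illustration under that assumption and is not the method of Document (2); the function name, frame length, and threshold are hypothetical.

```python
def detect_voice_sections(samples, frame_len, energy_threshold):
    """Return (start, end) sample index pairs of consecutive high-energy frames."""
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > energy_threshold:
            if start is None:
                start = i * frame_len  # a voice section begins
        elif start is not None:
            sections.append((start, i * frame_len))  # the section ends
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_len))
    return sections

samples = [0] * 4 + [5, -5, 5, -5] * 2 + [0] * 4
sections = detect_voice_sections(samples, frame_len=4, energy_threshold=1.0)
```

For this synthetic signal, the two middle frames exceed the threshold and are reported as a single section from sample 4 to sample 12.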

When the speech section is determined, the entry in the speech recognition dictionary that most closely matches the speech pattern of the section is obtained based on the speech recognition acoustic model (acoustic feature amounts of speech) (step S202). This can be realized, for example, by a known method called automatic speech recognition (see Document (2)). Then, the entry obtained in the process of step S202 is issued as a voice recognition event (step S203).

Document (2): “Basics of speech recognition” written by Lawrence Rabiner and Biing-Hwang Juang, supervised by Sadaaki Furui, published by NTT Advanced Technology Co., Ltd.

<Configuration of voice recognition dictionary>
FIG. 10 is a conceptual diagram showing an example of the speech recognition dictionary 138 stored in the storage device 13.

The voice recognition dictionary 138 is described by, for example, FSA (Finite State Automaton). The label assigned to the FSA transition is one element of a basic unit of a language such as a syllable used by the speech recognition acoustic model 139, and is used for time-series matching with a probability model in the corresponding speech recognition acoustic model 139. All paths from the start state to the end state of FSA are entries. The speech recognition dictionaries used in this embodiment are classified into three types: search requests (upper part in FIG. 10), narrowing down (middle part in FIG. 10), and guidance requests (lower part in FIG. 10).

The voice recognition event is composed of the label series of the selected entry and the classification (search request, narrowing down, or guidance request) to which the entry belongs.
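The relationship between the FSA, its entries, and a voice recognition event can be sketched as follows. The transition table, the word-level labels (the description uses units such as syllables), and the event layout are assumptions for illustration, and the FSA is assumed to be acyclic.

```python
def enumerate_entries(transitions, start, finals):
    """Yield the label sequence of every path from the start state to a final state."""
    stack = [(start, [])]
    while stack:
        state, labels = stack.pop()
        if state in finals:
            yield labels
        for label, next_state in transitions.get(state, []):
            stack.append((next_state, labels + [label]))

# Hypothetical "search request" dictionary: state -> [(label, next state)].
search_request_fsa = {
    0: [("where", 1)],
    1: [("is", 2)],
    2: [("the", 3)],
    3: [("book", 4), ("key", 4)],
}
entries = sorted(
    " ".join(path) for path in enumerate_entries(search_request_fsa, 0, {4})
)

# A voice recognition event: label series of the selected entry plus its classification.
event = {"labels": entries[0].split(), "classification": "search request"}
```

Each enumerated path is one dictionary entry; the recognizer would select the entry that best matches the speech and issue it together with its classification.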

<Operation of dictionary creation program>
The dictionary creation program constructs the portion corresponding to the object name in the search request portion of the speech recognition dictionary 138 shown in FIG. 10. Two methods are shown below.

In the first method, the portion is constructed using all keywords included in the keyword list of each entry in the image database 136.

In the second method, a similar image search over the images in the environment-installed camera image database 137 is performed using each image in the image database 136 as a key, and the portion is constructed using the keywords included in the keyword lists of the entries whose key image yields at least one search result.

The first method is simple, but depending on the size of the image database 136, the number of entries in the dictionary becomes enormous, which may cause a decrease in the accuracy of speech recognition. Therefore, by using the second method, it is expected that the ratio of words that are considered object names of objects existing in the environment increases in the dictionary.

However, since the second method requires a large amount of computation for similar image retrieval, it is conceivable to execute it at a low frequency, for example once a day.
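The two construction methods differ only in whether entries are filtered by the presence of a similar image in the environment. A minimal sketch, assuming the image database is given as pairs of an image identifier and its keyword list, and that the similar image search is abstracted into a hypothetical predicate `has_match`:

```python
def build_object_names(image_db, has_match=None):
    """Collect object-name keywords for the search request dictionary.

    image_db:  list of (image, keyword_list) entries.
    has_match: optional predicate returning True when the image has at least
               one similar image in the camera image database. When omitted,
               every entry is used (first method); otherwise only matching
               entries are used (second method).
    """
    names = set()
    for image, keyword_list in image_db:
        if has_match is None or has_match(image):
            names.update(keyword_list)
    return sorted(names)

image_db = [("cup.jpg", ["cup", "mug"]), ("car.jpg", ["car"])]
found_in_environment = {"cup.jpg"}  # stand-in for similar image search results

first_method = build_object_names(image_db)
second_method = build_object_names(image_db, lambda img: img in found_in_environment)
```

The second call yields the smaller vocabulary, which reflects the expected benefit for recognition accuracy at the cost of running the similar image search.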

Thereby, according to the first embodiment, by using the object search system 1, a user can easily search for a desired object by voice dialogue.

(Embodiment 2)
FIG. 11 is an explanatory diagram showing an example of a dialog interface device according to Embodiment 2 of the present invention. FIG. 12 is a flowchart showing an example of the operation of the dialog control program stored in the storage device provided in the dialog interface device of FIG. 11. FIG. 13 is an explanatory diagram showing an example of the data structure of the user identification database stored in the storage device provided in the dialog interface device of FIG. 11 and the contents of the stored data. FIG. 14 is an explanatory diagram showing an example of the data structure of the user database stored in the storage device provided in the dialog interface device of FIG. 11 and the contents of the stored data.

<Configuration of dialog interface device>
In the second embodiment, an example will be described in which the object search system 1 (FIG. 1) performs an object search using a user database 156 described later. The object search system 1 includes a dialog interface device 10, environment-installed cameras 20a to 20c, and a network 30 as in FIG. 1 of the first embodiment.

As shown in FIG. 11, the dialog interface device 10 includes, like the dialog interface device 10 of FIG. 2, a CPU 12, a storage device 13, a network interface 14, a microphone unit 15, a speaker unit 16, a camera unit 17, and a moving device 18.

In addition to information similar to that of FIG. 2 of the first embodiment, such as the dialogue control program 131, the speech recognition program 132, the dictionary creation program 133, the environment image acquisition program 134, the image database 136, the environment-installed camera image database 137, the speech recognition dictionary 138, and the speech recognition acoustic model 139, the storage device 13 newly stores a user identification database 155 and a user database 156.

Furthermore, the dialog control program 131 newly has a user identification subroutine in addition to the key image extraction subroutine, the search image extraction subroutine, and the narrowing-down method determination subroutine shown in FIG. 2 of the first embodiment. This user identification subroutine selects the user ID (a code for identifying a user) of the record whose face image has a high degree of similarity.

<Operation example of dialogue control program>
FIG. 12 is a flowchart showing an example of the operation in the dialogue control program 131.

In FIG. 12, the processing of steps S101 to S114 is the same as the processing of FIG. 3 of the first embodiment, so the description thereof will be omitted, and the newly added processing of steps S115 and S116 will be described.

The process of step S115 is executed first when the speech recognition event (the process of step S101) is a “search request” requesting an object search, and the process of step S116 is performed after the search image extraction subroutine of step S107 is executed.

As described above, when the speech recognition event in the process of step S101 is a “search request” requesting an object search, the user identification subroutine is executed prior to the key image extraction subroutine of step S102 (step S115).

Here, the operation of the user identification subroutine, which is the process of step S115, will be described.

First, an image is acquired from the camera 17. Subsequently, a face area is detected from the acquired image, and a face image is extracted. Then, the similarity between the face image of each record in the user identification database 155 and the extracted face image is calculated, and the user ID of the record having the face image with the highest similarity is selected.

Here, the user identification database 155 will be described.

FIG. 13 is an explanatory diagram showing an example of the data structure of the user identification database 155 and the contents of stored data. As shown in the figure, each record of the user identification database 155 stores a user ID (shown on the right side of FIG. 13), which is a code for identifying the user, and the user's face image (shown on the left side of FIG. 13) in association with each other.

Subsequently, in the user identification subroutine that is the process of step S115, if there is no user whose similarity exceeds a certain threshold value, a new user ID is assigned, associated with the extracted face image, and added to the user identification database 155 as a new record. The user ID selected or added in this way is returned to the main program.

Here, the face image in the user identification database 155 need not be the face image itself and may be, for example, a face image converted into a feature quantity, such as vector data, necessary for calculating the similarity. The detection of the face area and the calculation of the similarity of face images are performed using, for example, the algorithm described in Reference (1) above.
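The selection and registration logic of the user identification subroutine can be sketched as follows, assuming that face images are already converted into feature vectors. Cosine similarity, the integer user IDs, and the function name `identify_user` are assumptions for illustration; face detection and the actual similarity measure are outside this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_user(face_feature, user_identification_db, threshold):
    """Return the best matching user ID, registering a new user if none exceeds the threshold."""
    best_id, best_similarity = None, -1.0
    for user_id, stored_feature in user_identification_db.items():
        similarity = cosine(face_feature, stored_feature)
        if similarity > best_similarity:
            best_id, best_similarity = user_id, similarity
    if best_id is None or best_similarity <= threshold:
        # No sufficiently similar face: assign a new ID and add a record.
        new_id = max(user_identification_db, default=0) + 1
        user_identification_db[new_id] = face_feature
        return new_id
    return best_id

user_identification_db = {1: [1.0, 0.0], 2: [0.0, 1.0]}
known_user = identify_user([0.9, 0.1], user_identification_db, threshold=0.8)
new_user = identify_user([-1.0, -1.0], user_identification_db, threshold=0.8)
```

The first call matches the stored feature of user 1; the second finds no face above the threshold, so a new record with ID 3 is added.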

In FIG. 12, after executing the search image extraction subroutine of step S105, the user information of the record corresponding to the selected user ID is acquired from the user database 156 (step S116).

FIG. 14 is an explanatory diagram showing an example of the data structure of the user database 156 and the contents of stored data.

Each record in the user database 156 associates a user ID (shown on the left side of FIG. 14) with user information. The user information is a list of feature pairs (shown on the right side of FIG. 14), each of which pairs a vocabulary item with a default feature. The search image group is narrowed down by the default feature that matches the vocabulary, and then the processing from step S106 onward is performed.

When a narrowing-down request is received from the user, the pair of the vocabulary and the specified feature is added to, or updated in, the user information of the record in the user database 156 that corresponds to the selected user ID.
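The use and update of default features can be sketched as follows. The nested dictionary layout of the user database, the representation of each search image as a set of feature labels, and the function names are assumptions introduced only for illustration.

```python
def narrow_by_default(user_db, user_id, vocabulary, search_images):
    """Keep only images having the user's default feature for the vocabulary, if one is stored."""
    default_feature = user_db.get(user_id, {}).get(vocabulary)
    if default_feature is None:
        return search_images  # nothing stored yet: no narrowing
    return [img for img in search_images if default_feature in img["features"]]

def remember_feature(user_db, user_id, vocabulary, feature):
    """Add or update the (vocabulary, feature) pair in the user's record."""
    user_db.setdefault(user_id, {})[vocabulary] = feature

user_db = {}
search_images = [
    {"id": "a", "features": {"red", "small"}},
    {"id": "b", "features": {"blue", "small"}},
]

before = narrow_by_default(user_db, 7, "cup", search_images)
remember_feature(user_db, 7, "cup", "red")  # the user answered "the red one"
after = narrow_by_default(user_db, 7, "cup", search_images)
```

On the first request both images remain and the user would be asked a narrowing-down question; once the answer is stored, the same request is narrowed to the red cup without asking again.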

Thereby, in the second embodiment, the name of the object that each user wants to find and its features are stored for each user, so that the object can be presented without asking about its features, and user requests can be answered in a short time.

The invention made by the present inventors has been specifically described above based on the embodiments. However, the present invention is not limited to these embodiments, and needless to say, various modifications can be made without departing from the scope of the invention.

The present invention is suitable for a technology that allows a user to easily search for an object to be searched by voice dialogue.

DESCRIPTION OF SYMBOLS
1 Object search system
10 Dialog interface apparatus
30 Network
11 Bus
13 Storage apparatus
14 Network interface
15 Microphone part
16 Speaker part
17 Camera part
18 Mobile device
20a Environmental installation type camera
20b Environmental installation type camera
20c Environmental installation type camera
131 Dialogue control program
132 Speech recognition program
133 Dictionary creation program
134 Environmental image acquisition program
136 Image database
137 Environment-installed camera image database
138 Speech recognition dictionary
139 Speech recognition acoustic model
155 User identification database
156 User database

Claims

An object search system comprising a first camera for acquiring an image and an interactive interface for searching for an object by voice interaction,
The interactive interface is
A first database for storing images acquired by the first camera;
A second database in which images of objects and keyword lists related to the images are stored;
A control unit that, based on an object name input by voice, extracts an image related to the object name from the second database, and searches for and extracts, from the first database, an image having a high similarity to the image extracted from the second database.
The object search system according to claim 1,
The interactive interface is
A microphone to capture audio,
A voice recognition unit that recognizes the voice acquired by the microphone;
The controller is
The object search system, wherein the voice recognition unit recognizes the voice acquired by the microphone to obtain the object name.
The object search system according to claim 2, wherein
The language model for speech recognition in the speech recognition unit is:
An object search system using an object name stored in the second database.
The object search system according to claim 1,
The first database is
An image acquired by the first camera and image information related to the image are stored;
The controller is
An object search system that outputs information on a photographing position included in image information of the first database when an image having a high similarity is searched and extracted from the first database.
The object search system according to claim 1,
The interactive interface is
A third database for storing a user identifier for identifying a user, a vocabulary associated with the user identifier, and user information including a list of characteristics of the vocabulary;
A user recognition unit that extracts user information associated with the corresponding user identifier from the third database;
The controller is
An object search system characterized by determining a similarity according to a feature included in the user information extracted by the user recognition unit, and extracting an image from the first database based on the similarity.
The object search system according to claim 5, wherein
The interactive interface is
A second camera for acquiring an image;
A fourth database storing a face image and a user identifier associated with the face image;
The user recognition unit
An object search system, wherein a face region is detected from an image acquired by the second camera, the fourth database is searched, and the user identifier associated with the face image is extracted.
The object search system according to claim 1,
The controller is
An object search system characterized in that an html page is obtained by crawling the Internet, an object image and a keyword of the image are obtained from the html page, and stored in the second database.
The object search system according to claim 1,
The controller is
An object search system for extracting images with high similarity from the plurality of images based on feature representations of objects to be searched when there are a plurality of images with high similarity extracted from the first database.
The object search system according to claim 8,
The feature expression used by the control unit is at least one of a color, a size, and a shape of an object.
The object search system according to claim 8,
The controller is
An object search system characterized in that, when there are a plurality of images with high similarity extracted from the first database, the plurality of images are arranged in a feature space based on the feature expression, and the type of feature of the object is determined according to the shape of the distribution in the feature space.
The object search system according to claim 1,
The interactive interface is
An object search system comprising moving means for moving the dialog interface, wherein the dialog interface can be moved to an arbitrary position.
The object search system according to claim 11, wherein
The moving means is
An object search system characterized in that, when the image having a high similarity is searched for and extracted from the first database, the dialogue interface is moved based on information on a photographing position included in the image information of the first database.
The object search system according to claim 11, wherein
The interactive interface is
An object search system comprising the first camera.
An object search method for searching for an object by an object search system comprising a first camera for acquiring an image and an interactive interface for searching for an object by voice interaction,
Storing an image acquired by the first camera in a first database;
Storing an image of the object and a keyword list associated with the image in a second database;
A step in which, based on the object name obtained by recognizing the input voice, the dialog interface extracts an image related to the object name from the second database, and searches for and extracts, from the first database, an image having a high similarity to the image extracted from the second database.
The object search method according to claim 13.
When storing the image acquired by the first camera in the first database, the image information related to the image is associated and stored,
When searching for and extracting an image having a high degree of similarity to the image from the first database, information on the photographing position included in the image information of the first database is output,
The language model of speech recognition when recognizing the speech is:
An object search method using an object name stored in the second database.