WO2024161991A1 - Information processing device, information processing method, and program

Info

Publication number: WO2024161991A1
Application number: PCT/JP2024/001049
Authority: WO
Inventors: 隆木下; 真山田; 隆俊中村
Original assignee: ソニーグループ株式会社
Priority date: 2023-01-31
Filing date: 2024-01-17
Publication date: 2024-08-08

Abstract

The present technology relates to an information processing device, an information processing method and a program that make it possible for content suited to a surrounding environment to be provided to a user. An information processing device according to the present technology comprises: a similarity evaluating unit for evaluating a similarity between first text linked to an item of content, and surrounding data relating to the surrounding environment of the user, input by the user; and a selecting unit for selecting content corresponding to the surrounding environment from among a plurality of items of content on the basis of the similarity. The present technology is applicable to equipment employed in services for providing spatial content, for example.

Description

Information processing device, information processing method, and program

This technology relates to an information processing device, an information processing method, and a program, and in particular to an information processing device, an information processing method, and a program that can provide a user with content suited to the surrounding environment.

In recent years, services have become popular that provide users visiting tourist spots or the real-world settings of works such as animations, manga, movies, and dramas with content, such as sounds and images, that fits the worldview of those works. These services are configured as systems that place content in specific locations such as actual locations depicted in works or tourist spots.

Patent Document 1: International Publication No. 2020/071216

Users of the services that provide the above-mentioned content want to have a similar experience in their own residential areas, facilities, and so on. As a result, systems have been proposed that allow users to place content created by creators in various areas on a map.

It is difficult for users to recall from memory what each area on the map looks like and decide in which area content should be placed. In order to select content appropriate for a certain area, a function is required that allows users to actually go to the area and select content while observing the surrounding environment.

One technology that supports user content selection is one that provides sounds that match the surrounding environment by inputting images of the surrounding environment into a recognizer that has learned to combine images and sounds. With this technology, image and sound combinations are learned based on video images that are publicly available on the Internet, for example.

Also, as a technology to support user content selection, there is a technology that compares the features of a query image generated based on text describing the contents of the desired image with the features of searched images that are the subject of the search, and outputs searched images that are similar to the query image as the desired image (see, for example, Patent Document 1).

However, it is difficult for a general-purpose recognizer obtained by learning from the large amount of data available on the Internet to select content that matches the surrounding environment from the wide variety of content that corresponds to the worldview of a particular work.

This technology was developed in light of these circumstances, making it possible to provide users with content that suits their surroundings.

An information processing device according to one aspect of the present technology includes a similarity evaluation unit that evaluates the similarity between a first text linked to content and surrounding data, input by the user, relating to the user's surrounding environment, and a selection unit that selects, based on the similarity, the content corresponding to the surrounding environment from among a plurality of pieces of content.

An information processing method according to one aspect of the present technology evaluates the similarity between text associated with content and surrounding data about the user's surrounding environment input by the user, and selects the content corresponding to the surrounding environment from among a plurality of pieces of content based on the similarity.

A program according to one aspect of the present technology causes a computer to execute a process of evaluating the similarity between text associated with content and surrounding data, input by the user, about the user's surrounding environment, and selecting the content corresponding to the surrounding environment from among a plurality of pieces of content based on the similarity.

In one aspect of this technology, the similarity between text associated with a content and surrounding data about the user's surrounding environment input by the user is evaluated, and the content corresponding to the surrounding environment is selected from among the multiple pieces of content based on the similarity.

FIG. 1 is a diagram showing a configuration example of an embodiment of a content providing system to which the present technology is applied.
FIG. 2 is a diagram showing an example of an application screen when spatial content is provided.
FIG. 3 is a diagram illustrating an example of matching between a surrounding image and element content in the present technology.
FIG. 4 is a diagram showing a flow of element content selection in the present technology.
FIG. 5 is a diagram showing an example of a creator tool screen when registering element content.
FIG. 6 is a diagram illustrating a first flow in which a user prepares to play back spatial content.
FIG. 7 is a diagram illustrating a first flow in which a user prepares to play back spatial content.
FIG. 8 is a diagram illustrating a second flow in which a user prepares to play back spatial content.
FIG. 9 is a diagram illustrating a second flow in which a user prepares to play back spatial content.
FIG. 10 is a diagram illustrating a third flow in which a user prepares to play back spatial content.
FIG. 11 is a diagram illustrating a third flow in which a user prepares to play back spatial content.
FIG. 12 is a diagram showing an example of a first flow in which subjective text is transmitted without selecting element content.
FIG. 13 is a diagram showing an example of a second flow in which subjective text is transmitted without selecting element content.
FIG. 14 is a block diagram showing an example of the configuration of a user terminal.
FIG. 15 is a diagram illustrating an example of the configuration of a creator terminal.
FIG. 16 is a flowchart illustrating a process performed by a user terminal.
FIG. 17 is a flowchart illustrating a process performed by a creator terminal.
FIG. 18 is a diagram illustrating a fourth flow in which a user prepares to play back spatial content.
FIG. 19 is a diagram illustrating a fourth flow in which a user prepares to play back spatial content.
FIG. 20 is a diagram illustrating a fifth flow in which a user prepares to play back spatial content.
FIG. 21 is a flowchart illustrating a process performed by a user terminal in the second embodiment.
FIG. 22 is a diagram showing an example of data acquired by the content providing system.
FIG. 23 is a diagram illustrating an example of data used for learning.
FIG. 24 is a block diagram showing an example of the hardware configuration of a computer.

Hereinafter, an embodiment of the present technology will be described in the following order.
1. First embodiment
2. Second embodiment
3. Example of using data acquired by a content providing system for learning

1. First embodiment
Configuration of Content Providing System
FIG. 1 is a diagram showing a configuration example of an embodiment of a content providing system to which the present technology is applied.

The content providing system in FIG. 1 is a system that provides spatial content. Spatial content is content that presents a space embodying the worldview (theme) of a work such as an animation, manga, movie, or drama, for example through stereophonic sound. Spatial content includes one or more element contents (sound, image, text, etc.) that correspond to the worldview of the work. In the following, spatial content and element content are collectively referred to as content.

The content provision system shown in FIG. 1 is composed of a user terminal 1, a creator terminal 2, and a server 3, and the user terminal 1, the creator terminal 2, and the server 3 can be connected to each other via a wired or wireless network.

The user terminal 1 is an information processing device owned by a user who is a subscriber of the content provision service provided by the server 3. The user terminal 1 may be a smartphone, a tablet terminal, a wearable device, a wearable camera, a portable music player, a game console, a PC, or the like.

By using the user terminal 1, the user can use the content provision service provided by the server 3. Specifically, the user terminal 1 downloads an application and cooperates with the server 3 to exchange data, thereby preparing to play back spatial content.

When preparing to play back spatial content, element content that corresponds to the worldview of a work is placed in an area on a map. Defining an area on the map and the element content to be placed in that area forms a scape, in which the user can experience, for example, one of the scenes that make up the work. Below, a region of real space in which all or part of a work can be experienced by experiencing the scenes of multiple scapes is referred to as a world.

When playback conditions are met after playback preparation is complete (for example, when the user enters an area in which the element content is located), the user terminal 1 plays the spatial content (element content).

The creator terminal 2 is an information processing device such as a PC operated by a creator who produces the entire work expressed by the spatial content and each element content. The creator terminal 2 executes creator tools, generates content to be provided by the content providing service in response to the creator's operations, and registers the content to the server 3.

The server 3 is an information processing device managed by the operator of the content provision service. The server 3 distributes applications for using the content provision service. The server 3 also records content created by creators, and transmits the content to the user terminal 1.

Note that the content does not necessarily have to be transmitted by the server 3, but may be transmitted to the user terminal 1 by the creator terminal 2 or another server different from the server 3.

Overview of the Present Technology
FIG. 2 is a diagram showing an example of an application screen when spatial content is provided.

When spatial content is provided, a map M1 showing the extent of the world is displayed on the user terminal 1, as shown in Figure 2. In the example of Figure 2, four areas A1 to A4 in which element content is placed are set within the world.

In FIG. 2, the black pin Pi1 indicates the user's current location. As shown in FIG. 2, when the user is within area A1, element content arranged in area A1 is provided to the user by user terminal 1.

Specifically, multiple types of element content are provided that allow the user to experience one of the scenes that make up the work. For example, when the user is in area A1, sound is provided as element content arranged in area A1, and image CP1 and text T1 are displayed at the bottom of the screen as element content arranged in area A1. Below, the sound, image (moving image or still image), and text provided as element content are referred to as content sound, content image, and content text, respectively.

Areas A1 to A4 arranged within the world are set up by the user going to a desired location within the world when preparing to play back spatial content, and repeatedly selecting and placing element content that suits the location where the user is located from among multiple element content that correspond to the worldview of a certain work.

One possible method for supporting the user in selecting element content is for the user to photograph the surrounding environment using the user terminal 1, and for the content providing system to select element content that matches the surrounding environment from among multiple element contents based on the resulting surrounding image (still image or video). This method requires scene-level matching, that is, comprehensive recognition of what the surrounding environment contains as a whole rather than recognition of a combination of individual objects.

One example of a technology that performs this type of matching is one that provides sounds that match the surrounding environment by inputting images of the surrounding environment into a recognizer that has been trained to combine images and sounds. With this technology, image and sound combinations are learned based on video images that are publicly available on the Internet, for example.

However, it is difficult for a general-purpose recognizer obtained by learning from the large amount of data available on the Internet to select content that matches the surrounding environment from the diverse types of content that correspond to the worldview of a particular work. New methods such as transfer learning are required to learn the correlation between images and subsets of content that correspond to the worldview of a work, and it is difficult to collect training data for each of the various works.

Diverse interpretations should be allowed when it comes to determining what kind of real-world environment a scene from a work corresponds to, and from what perspective it should be judged.

Figure 3 shows an example of matching a surrounding image with element content using this technology.

In this technology, as shown in Figure 3, element content corresponding to the worldview of a certain work A is linked to environmental description text, which is a sentence that shows the state of the environment that matches that element content. As shown by the arrow in Figure 3, the content provision system of this technology uses a general-purpose recognizer to evaluate the similarity between the surrounding image and the environmental description text, and can select element content that matches the surrounding environment based on that similarity.

Figure 4 shows the flow of element content selection in this technology.

As shown in FIG. 4, the user photographs the surrounding environment using the user terminal 1 and inputs the surrounding image to the recognizer 11. The recognizer 11 is, for example, a recognizer that has been trained by machine learning to detect correlations between a large number of images and texts published on the Internet. When the surrounding image is input, the recognizer 11 evaluates the similarity between the surrounding image and the environmental description text linked to each of the multiple element contents that correspond to the worldview of a certain work.

The content providing system generates a ranking in which element contents are arranged in order of the similarity between the associated environmental description text and the surrounding image, and presents this to the user. In the example of Figure 4, a group of element contents that have been put together to allow the user to experience scenes A through F that make up a certain work are presented in order.
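The ranking step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the recognizer 11: it assumes the recognizer has already mapped the surrounding image and each environmental description text to embedding vectors in a shared space, and it uses cosine similarity as the similarity measure. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_element_contents(image_embedding, text_embeddings):
    """Return (scene, similarity) pairs sorted from most to least similar.

    text_embeddings maps a scene name to the embedding of the
    environmental description text linked to that scene's element content.
    """
    scored = [
        (scene, cosine_similarity(image_embedding, embedding))
        for scene, embedding in text_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real ones would come from the recognizer.
surrounding_image = [0.9, 0.1, 0.0]
description_texts = {
    "Scene A": [0.0, 1.0, 0.0],
    "Scene B": [1.0, 0.0, 0.0],
    "Scene C": [0.5, 0.5, 0.0],
}
ranking = rank_element_contents(surrounding_image, description_texts)
for scene, score in ranking:
    print(f"{scene}: {score:.3f}")
```

With these toy vectors the scenes are listed in the order Scene B, Scene C, Scene A, which is the order in which the groups would be presented to the user.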

Users can look at the presented ranking and select the element content group they wish to experience at their current location. Because the ranking is a list ordered on the basis of evidence (the similarity), it makes selecting element content easier than a list in which the element content groups for each scene are simply arranged.

When selecting a desired element content, the user can, for example, input subjective text. The subjective text is a sentence that expresses the user's subjective opinion about the surrounding environment itself or the combination of the surrounding environment and the element content, such as why the user thought the element content matched the surrounding environment, or how the user felt when viewing the element content in the user's location.

The subjective text is used for additional training of the recognizer 11 and to support the creator when writing the environmental description text.

Figure 5 shows an example of the creator tool screen when registering element content.

In the example of FIG. 5, a group of element contents is registered for each scene that constitutes the work. Element contents may also be registered one by one.

When registering element content, as shown in FIG. 5, a thumbnail image Th11 of the content image, text T11 indicating the title of the scape (scene), the file name of the element content, etc., and a text box TB1 for inputting environmental description text are displayed on the creator terminal 2.

In the example in Figure 5, "The Town of Beginnings" is displayed as the title of the scape. "town.mp3" is displayed as the file name of the content sound, and "town.jpg" is displayed as the file name of the content image. Additionally, "A station concourse with many people coming and going" has been entered in text box TB1 as the environmental description text.

In this way, the environmental description text is entered, for example, by the creator when registering element content, and is registered in association with that element content.
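The association created at registration time can be pictured as a simple record. The sketch below uses the values shown in FIG. 5; the class names, the field names, and the in-memory registry standing in for the server 3 are hypothetical, and rejecting an empty description is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ElementContentGroup:
    """Element content for one scape plus its environmental description text."""
    scape_title: str
    content_sound: str            # file name of the content sound
    content_image: str            # file name of the content image
    environment_description: str  # environmental description text


@dataclass
class ContentRegistry:
    """Hypothetical in-memory stand-in for the storage on the server 3."""
    groups: dict = field(default_factory=dict)

    def register(self, group):
        # Assumption for this sketch: a group without description text is rejected.
        if not group.environment_description.strip():
            raise ValueError("environmental description text is required")
        self.groups[group.scape_title] = group


registry = ContentRegistry()
registry.register(
    ElementContentGroup(
        scape_title="The Town of Beginnings",
        content_sound="town.mp3",
        content_image="town.jpg",
        environment_description=(
            "A station concourse with many people coming and going"
        ),
    )
)
```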

First flow of preparation for playing back spatial content (example of changing already placed element content)
A first flow in which the user prepares to play back spatial content will be described with reference to FIG. 6 and FIG. 7.

As shown in FIG. 6A, for example, an edit button B1 for starting preparation for playback is displayed in the upper right part of the screen displayed on the user terminal 1. Note that, for simplification, the screen in FIG. 6A only illustrates areas and pins, and the map showing the range of the world is omitted. In FIG. 6A, the white pin indicates the user's destination. In reality, the edit button B1, area, pin, etc. are displayed superimposed on the map. The same applies to the other figures.

As shown in A of FIG. 6, the user can start preparing to play back spatial content by pressing edit button B1 while in area A1 in real space.

When the edit button B1 is pressed, for example, a save button B2 for completing the changes to the element content is displayed in place of the edit button B1, as shown in FIG. 6B. Also, for example, a change button B3 for changing the element content already placed in area A1 is displayed to the right of the content text provided in area A1.

When the user presses the change button B3, a list is displayed in which multiple element contents corresponding to the worldview of a certain work are arranged, as shown in FIG. 6C. In the list in FIG. 6C, combinations of thumbnail images of content images and content text are arranged according to the scenes that make up the work.

In the example of FIG. 6C, selection buttons B5a to B5d are displayed to the right of each content text to select each element content group as the element content group to be placed in area A1. Above the list, button B4 is displayed to present a ranking based on the similarity with surrounding images.

When the list is displayed, thumbnail images and content text of the content images already placed in area A1 are displayed at the bottom of the screen, and a cancel button B6 for canceling changes to the element content is displayed to the right of the content text.

When the user presses button B4, the user terminal 1 starts capturing an image of the surrounding environment (the user's first-person perspective image), and as shown in FIG. 7D, a surrounding image P1 is displayed on the screen of the user terminal 1. A capture button B7 for acquiring the surrounding image is displayed below the surrounding image P1.

When the user presses the capture button B7, a surrounding image P1 of the current surrounding environment is acquired, and the similarity between the surrounding image P1 and the environmental description text linked to each element content group is evaluated. After that, a ranking of each element content group based on the similarity is generated, and the ranking is displayed as shown in FIG. 7E.

In this ranking, similar to the list, combinations of thumbnail images of content images and content text are arranged by scene. In the example of FIG. 7E, selection buttons B11a to B11d are displayed to the right of each content text to select each element content group as the element content group to be arranged in area A1.

The user can select a desired element content group from among the multiple element content groups displayed in the ranking. As shown in FIG. 7E, the user can place in area A1 the element content group linked to the environmental description text that is most similar to the surrounding image, for example, by pressing selection button B11a. In other words, the user can change the element content group already placed in area A1 to the element content group selected from the ranking.

Users find it easier to select element content that suits an area by looking at the area (surrounding environment) on-site and selecting element content, rather than by looking at a map and imagining what the area looks like.

If the user selects, as the element content group to be placed in area A1, a group other than the one linked to the text with the highest similarity (the highest-ranking element content group), it can be said that, for the user, the environmental description text linked to the selected element content group is the one most similar to the surrounding image P1. Therefore, the accuracy of the recognizer 11 can be improved by retraining the recognizer 11 using that environmental description text and the surrounding image P1 as training data.
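One way to collect such training data is sketched below. It is a sketch under assumptions: the function name, the use of an image identifier in place of the image itself, and the sample description texts are all hypothetical, and treating a top-ranked selection as yielding no new pair is a choice made for this example rather than something the description specifies.

```python
def collect_training_pair(ranking, selected_scene, description_texts, image_id):
    """Return an (image, text) pair when the user overrides the ranking.

    ranking: scene names ordered from most to least similar.
    description_texts: scene name -> environmental description text.
    Returns None when the user picked the top-ranked scene, because that
    choice agrees with the recognizer's current output (assumption).
    """
    if not ranking or selected_scene == ranking[0]:
        return None
    return (image_id, description_texts[selected_scene])


# Hypothetical sample data.
sample_ranking = ["Scene B", "Scene C", "Scene A"]
sample_texts = {
    "Scene A": "A quiet path along a river",
    "Scene B": "A station concourse with many people coming and going",
    "Scene C": "A shopping street at dusk",
}
pair = collect_training_pair(sample_ranking, "Scene C", sample_texts, "P1")
```

Here the user chose the second-ranked group, so `pair` holds the identifier of the surrounding image and the description text of the chosen group, ready to be added to a retraining set.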

If the user cannot find the desired element content group in the ranking, he or she can press button B4 again to capture another surrounding image.

In addition, the ranking may take the shooting conditions of the surrounding image into account, and element content suited to the surrounding environment may be selected from that ranking. For example, if the surrounding image was taken at night, a ranking of element contents provided only at night may be presented, or element contents provided only at night may be presented at the top of the ranking. The shooting conditions of the surrounding image may be identified by the user terminal 1 based on, for example, the shooting time of the surrounding image or the surrounding image itself. Some of the playback conditions of the element content, such as being provided only at night, may be determined by the creator when the element content is registered. A sentence regarding the playback conditions of the element content may be included in the environmental description text.
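The second variant, in which night-only element content is moved to the top of the ranking, might look like the sketch below. The 18:00 to 06:00 definition of night, the function names, and the decision to drop night-only content from daytime rankings are assumptions made for illustration.

```python
from datetime import time


def is_night(shot_time, night_start=time(18, 0), night_end=time(6, 0)):
    """Assumed rule: night runs from 18:00 until 06:00."""
    return shot_time >= night_start or shot_time < night_end


def rank_with_conditions(scored, night_only, shot_time):
    """Order scenes by similarity, promoting night-only scenes at night.

    scored: list of (scene, similarity) pairs.
    night_only: set of scenes the creator marked as provided only at night.
    During the day, night-only scenes are left out of the ranking (assumption).
    """
    by_similarity = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if is_night(shot_time):
        top = [pair for pair in by_similarity if pair[0] in night_only]
        rest = [pair for pair in by_similarity if pair[0] not in night_only]
        return top + rest
    return [pair for pair in by_similarity if pair[0] not in night_only]


scored_scenes = [("Scene A", 0.4), ("Scene B", 0.9), ("Scene C", 0.6)]
night_ranking = rank_with_conditions(scored_scenes, {"Scene A"}, time(22, 0))
day_ranking = rank_with_conditions(scored_scenes, {"Scene A"}, time(12, 0))
```

At 22:00 the night-only Scene A is listed first even though its similarity is lowest; at noon it does not appear at all.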

Second flow of preparation for playing spatial content (example of newly arranging element content)
A second flow in which the user prepares to play back spatial content will be described with reference to FIG. 8 and FIG. 9.

As shown in A of FIG. 8, when the user is outside an area in real space where element content has already been placed, the user can start preparations for playing back spatial content by pressing the edit button B1.

When the edit button B1 is pressed, a list is displayed in which multiple groups of element content according to the worldview of a certain work are arranged, as shown in Figure 8B.

When the user presses button B4 displayed at the top of the list, the user terminal 1 starts capturing images of the surrounding environment, and a surrounding image P1 is displayed on the screen of the user terminal 1, as shown in FIG. 8C.

When the user presses the capture button B7 displayed below the surrounding image P1, the surrounding image P1 is acquired and the similarity between the environmental description text linked to each element content group and the surrounding image P1 is evaluated. After that, a ranking of each element content group based on the similarity is generated and the ranking is displayed as shown in Figure 9D.

As shown in FIG. 9D, by pressing, for example, the selection button B11a, the user can place a group of element contents linked to the environmental description text most similar to the surrounding image in a new area including the user's current location. For example, the element contents are placed in a circular area of a predetermined size centered on the user's current location.
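Placing element content in a circular area around the current location, and later testing whether the user is inside that area, can be sketched with a great-circle distance check. The 50 m default radius, the coordinates, and all names are hypothetical; the description only states that the area is circular and of a predetermined size.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class CircularArea:
    lat: float       # centre latitude in degrees
    lon: float       # centre longitude in degrees
    radius_m: float  # predetermined radius (the value is an assumption)

    def contains(self, lat, lon):
        """True if the point lies within radius_m of the centre."""
        return haversine_m(self.lat, self.lon, lat, lon) <= self.radius_m


def place_at_current_location(lat, lon, radius_m=50.0):
    """Create a new area centred on the user's current location."""
    return CircularArea(lat, lon, radius_m)


# Arbitrary example coordinates.
area = place_at_current_location(35.6812, 139.7671)
```

A point about 11 m north of the centre falls inside the area, while a point about 1.1 km away does not, so the same `contains` check could serve as the playback condition when the user later walks into the area.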

It is easier for users to place element content that suits a given area by viewing the area (surrounding environment) on-site and placing the element content, rather than by looking at a map and imagining what the area looks like.

When the user presses, for example, the selection button B11a, the area A11 in which the element content group has been newly placed is displayed on the map as shown in FIG. 9E, and thumbnail images of the content images placed in area A11 and the content text are displayed below the map. Note that, as described with reference to FIG. 6 and FIG. 7, the user can change the element content group placed in area A11 by pressing the change button B3 located to the right of the content text.

Third flow of preparation for playing spatial content (example of using surrounding images captured in the past)
A third flow in which the user prepares to play back spatial content will be described with reference to FIG. 10 and FIG. 11.

As described above, when changing an already placed element content group or placing a new one, a list is displayed of multiple element content groups that correspond to the worldview of the work that can be experienced in a certain world, as shown in A of FIG. 10.

When the user presses button B4 displayed at the top of the list, the user terminal 1 starts capturing images of the surrounding environment, and a surrounding image P1 is displayed on the screen of the user terminal 1, as shown in FIG. 10B.

In the example of FIG. 10B, a button B21 is displayed to the right of the capture button B7 displayed below the surrounding image P1. When the user presses the button B21, a list of surrounding images previously captured within the area where the user is currently located is displayed, as shown in FIG. 10C.

In the example of FIG. 10C, area #11 enclosed by a dashed line displays surrounding images that the user has taken in the past using the user terminal 1 within the area in which the user is currently located. Area #12 enclosed by a dashed line displays surrounding images that other users have taken in the past within the area in which the user is currently located.

In addition, surrounding images previously taken by other users are accumulated on server 3 and may include images posted on SNS (Social Network Service) and images obtained from other systems.

The user can select the desired surrounding image from multiple surrounding images displayed on the screen. For example, even if it is raining in the real world, the user can select an image taken on a sunny day as the surrounding image.

When a surrounding image is selected by the user, the surrounding image P11 selected by the user is displayed at the top of the screen, as shown in FIG. 11D, and subjective text T21 entered by another user for the surrounding image P11 is displayed below the surrounding image P11.

Below the subjective text T21, a text box TB11 is displayed for inputting the user's subjective text about the surrounding image P11. The user can input their impressions of the surrounding image P11 in the text box TB11.

Below the text box TB11, a send button B31 is displayed for sending the subjective text entered by the user to the server 3.

When the send button B31 is pressed, the text entered in the text box TB11 is sent to the server 3 as subjective text for the surrounding image P11. Environmental description text may be generated based on the subjective text sent to the server 3. For example, the subjective text sent to the server 3 may be linked directly as environmental description text to the element content ultimately selected by the user.

After the subjective text is transmitted, the similarity between the environmental description text linked to each element content group and the surrounding image P11 is evaluated. A ranking of the element content groups based on the similarity is then generated and displayed as shown in FIG. 11E.

As shown in FIG. 11E, by pressing, for example, a selection button B11a, the user can place a group of element contents linked to the environmental description text that is most similar to the surrounding image in the area where the user is currently located, or in a new area that includes the user's current location.

Example of Transmitting Subjective Text
FIG. 12 is a diagram showing an example of a first flow in which subjective text is transmitted without selecting element content.

As shown in A of FIG. 12, when the user is in area A1, a group of element contents arranged in area A1 is provided to the user. In the example of A of FIG. 12, a share button B41 is displayed to the right of the content text provided to the user on the screen of the user terminal 1.

When the user presses the share button B41 while element content is being provided, the user terminal 1 starts capturing images of the surrounding environment, and a surrounding image P1 is displayed on the screen of the user terminal 1, as shown in FIG. 12B.

When the user presses the capture button B7 displayed below the surrounding image P1, the surrounding image P1 is acquired, and at least a part of the acquired surrounding image P1 is displayed at the top of the screen, as shown in FIG. 12C. A text box TB21 is displayed below the surrounding image P1 for inputting the user's subjective text regarding the combination of area A1 and element content. For example, the user can input their impressions of viewing the element content in area A1 in text box TB21.

Below the text box TB21, a send button B42 is displayed for sending the subjective text entered by the user to the server 3.

When the send button B42 is pressed, the subjective text entered in the text box TB21 is sent to the server 3 together with the surrounding image P1, location information of area A1 (information on the user's current location), information indicating the element content provided in area A1, and the like.
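The data sent together with the subjective text can be pictured as a single record. The sketch below is hypothetical throughout: the description does not define a wire format, so the use of JSON, the field names, and the sample values are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SubjectiveTextSubmission:
    subjective_text: str
    surrounding_image: str   # reference to the captured image, e.g. a file name
    latitude: float          # user's current location
    longitude: float
    element_content_id: str  # identifies the element content provided in the area


def to_json(submission):
    """Serialize the submission for sending to the server (JSON is assumed)."""
    return json.dumps(asdict(submission), ensure_ascii=False)


submission = SubjectiveTextSubmission(
    subjective_text="The crowd noise fits this concourse well.",
    surrounding_image="P1.jpg",
    latitude=35.6812,
    longitude=139.7671,
    element_content_id="the-town-of-beginnings",
)
payload = to_json(submission)
```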

If the element content group includes a content sound, a video image of the user's surrounding environment that includes the content sound as audio may be transmitted to the server 3 instead of the surrounding image P1.

In addition, when the send button B42 is pressed, the subjective text entered in the text box TB21 may be posted to an SNS. In this case, a hashtag or URL indicating the work itself (world) or a scene that constitutes the work may be automatically entered in the text box TB21. Even if the subjective text is posted to an SNS, it is possible to track the subjective text posted by users of the content providing service based on the hashtag or URL.

FIG. 13 shows an example of a second flow for sending subjective text without selecting element content.

As shown in A of FIG. 13, when the user is in area A1, the group of element contents arranged in area A1 is provided to the user as described with reference to A of FIG. 12.

When a group of element contents is provided and the user presses the share button B41, a text box TB22 is displayed in which the user can enter subjective text about the combination of area A1 and element content, as shown in FIG. 13B. For example, the user can enter their impressions of viewing the element content in area A1 in text box TB22.

Below the text box TB22, a send button B43 is displayed for sending the subjective text entered by the user to the server 3.

When the send button B43 is pressed, the subjective text entered in the text box TB22 is sent to the server 3 along with location information in area A1 (information about the user's current location), information such as the current time, and information indicating the element content provided in area A1.
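The two sharing flows differ mainly in whether a surrounding image accompanies the subjective text. The following is a minimal sketch of a payload that covers both cases. The class name `SharePayload`, its field names, and the use of JSON are assumptions made for illustration; the specification does not define a transmission format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class SharePayload:
    """Hypothetical message sent from the user terminal 1 to the server 3."""

    subjective_text: str
    area_location: Tuple[float, float]  # (latitude, longitude) of the user's current location
    element_content_id: str             # identifies the element content provided in the area
    sent_at: Optional[str] = None       # current time as an ISO 8601 string (second flow, FIG. 13)
    surrounding_image_id: Optional[str] = None  # set only in the first flow (FIG. 12)

    def to_json(self) -> str:
        # Drop unset optional fields so that both flows can share one schema.
        body = {key: value for key, value in asdict(self).items() if value is not None}
        return json.dumps(body, sort_keys=True)


first_flow = SharePayload(
    "Feels like the opening scene.", (35.0, 139.0), "content-1",
    surrounding_image_id="P1",
)
second_flow = SharePayload(
    "Feels like the opening scene.", (35.0, 139.0), "content-1",
    sent_at="2024-01-17T09:00:00+09:00",
)
```

Keeping the image reference optional lets the server treat both flows uniformly and decide later whether image-based matching data is available for a given post.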

As described above, in the content provision system of this technology, the similarity between the environmental description text linked to the element content and the surrounding data about the user's surrounding environment input by the user is evaluated, and an element content corresponding to the surrounding environment is selected from among multiple element contents based on the similarity. The surrounding data about the user's surrounding environment includes surrounding images captured of the surrounding environment and surrounding audio data captured of sounds in the surrounding environment.

The selection of element content based on similarity is performed, for example, in response to a user's operation after viewing a ranking of element content based on similarity.

In this technology, the general-purpose recognizer only needs to evaluate the similarity between the surrounding image (surrounding sound data) input by the user and the environmental description text linked to the element content. Even if multiple element contents corresponding to the worldview of a certain work contain various types of content, the types of data input to the general-purpose recognizer are images and text. Therefore, the content provision system of this technology uses the general-purpose recognizer to accurately select element content that matches the user's surrounding environment from multiple element contents containing various types of content, and can present it to the user.

When a creator registers environmental description text, feedback is obtained about the correlation between the element content and the environmental description text, and when a user prepares for playback, feedback is obtained about the correlation between the surrounding images and the environmental description text. By performing transfer learning using this feedback as training data, it is also possible to obtain a recognizer that can select element content from multiple element contents that best suits the user's surrounding environment.

<Configuration of Each Device>
FIG. 14 is a block diagram showing an example of the configuration of the user terminal 1.

As shown in FIG. 14, the user terminal 1 is composed of an input unit 51, a camera 52, a position detection unit 53, a control unit 54, a communication unit 55, a display unit 56, and a speaker 57.

The input unit 51 is composed of a touch panel superimposed on the display unit 56, switches, buttons, sensors, etc. The input unit 51 accepts input of user operations and supplies signals corresponding to the user operations to the control unit 54. When user operations are input by voice command, or when surrounding audio data is used as the surrounding data, the input unit 51 may include a microphone that collects the user's voice and sounds from the surrounding environment.

The camera 52 captures the user's surrounding environment and acquires surrounding images. The camera 52 supplies the acquired surrounding images to the control unit 54.

The position detection unit 53 is composed of a positioning device using any positioning method, such as GNSS (Global Navigation Satellite System). The position detection unit 53 detects (measures) the current position of the user (user terminal 1) and supplies the detection result to the control unit 54.

The current location of the user terminal 1 may be detected by the position detection unit 53, or may be detected by a device other than the user terminal 1 that is carried by the user. In the latter case, the communication unit 55 of the user terminal 1 receives (acquires) from that device the detection result of the current location.

The control unit 54 is composed of an image acquisition unit 71, a similarity evaluation unit 72, a display control unit 73, a setting unit 74, a playback control unit 75, and a subjective text acquisition unit 76.

The image acquisition unit 71 acquires surrounding images captured by the camera 52. Furthermore, based on the user's current location detected by the position detection unit 53, the image acquisition unit 71 acquires surrounding images previously captured by the user at the current location from the storage unit (not shown) of the user terminal 1. Based on the user's current location, the image acquisition unit 71 acquires surrounding images previously captured by other users at the current location from the server 3 via the communication unit 55.

The image acquisition unit 71 supplies the acquired surrounding images to the similarity evaluation unit 72.

The similarity evaluation unit 72 has, for example, the recognizer 11 (Figure 4). The surrounding image supplied from the image acquisition unit 71 and the environmental description text linked to each element content are input to the recognizer 11, thereby evaluating the similarity between the surrounding image and the environmental description text. The similarity evaluation unit 72 generates a ranking of the element content based on the similarity, and supplies it to the display control unit 73 and the setting unit 74.
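The ranking step can be sketched as follows. This sketch assumes that the recognizer maps the surrounding image and each environmental description text into a shared embedding space and that similarity is the cosine similarity of those embeddings. The specification only states that the recognizer 11 receives both inputs and yields a similarity, so the embedding approach, the function names, and the toy vectors below are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_element_contents(
    image_vec: Sequence[float],
    text_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (content_id, similarity) pairs, highest similarity first."""
    scored = [(cid, cosine_similarity(image_vec, vec)) for cid, vec in text_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for recognizer output (hypothetical values).
text_vecs = {
    "forest_scene": [0.9, 0.1, 0.0],
    "harbor_scene": [0.1, 0.9, 0.2],
    "city_scene": [0.0, 0.2, 0.9],
}
image_vec = [0.8, 0.3, 0.1]
ranking = rank_element_contents(image_vec, text_vecs)
```

The first entry of `ranking` corresponds to the element content that the setting unit 74 would choose in automatic mode, and the full list corresponds to what the display control unit 73 would show.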

The display control unit 73 controls the display unit 56 to display the rankings supplied from the similarity evaluation unit 72. In addition, when the user enters an area in which element content is arranged, the display control unit 73 causes the display unit 56 to display a content image.

The setting unit 74, in response to user operations, places element content in a new area including the user's current location or changes already placed element content, and sets the element content so that it is provided when the user enters the area. The setting unit 74 can also automatically select element content linked to the environmental description text with the highest similarity based on the ranking supplied from the similarity evaluation unit 72, and set the element content.

Information about the scape formed by arranging element content (information about the location of the area, the element content to be provided, etc.) is stored, for example, in the storage unit of the user terminal 1 or in the server 3.

When the user enters an area in which element content is placed, the playback control unit 75 outputs the content sound from the speaker 57. The display control unit 73 and the playback control unit 75 function as a content providing unit that provides the element content to the user when the user enters an area in which element content is placed.

The subjective text acquisition unit 76 acquires the subjective text input by the user in response to the user's operation, and transmits the acquired subjective text to the server 3 via the communication unit 55.

The communication unit 55 communicates with external devices such as the server 3 via the network. That is, the communication unit 55 transmits information supplied from the control unit 54 to the external device, and receives information transmitted from the external device and supplies it to the control unit 54.

The display unit 56 is composed of, for example, an organic EL (Electro Luminescence) panel or a liquid crystal panel, and displays various screens and content images according to the control of the display control unit 73.

The speaker 57 outputs the content sound according to the control of the playback control unit 75. The content sound may instead be output from an external device connected to the user terminal 1 by wire or wirelessly, such as in-ear earphones or headphones worn by the user, or a speaker unit provided on a wearable device. In such a case, the playback control unit 75 supplies sound data representing the content sound to the external device, which outputs it.

Note that a part of the configuration of the user terminal 1 may be provided in an external device such as the server 3 or a cloud. For example, the similarity evaluation unit 72 and the setting unit 74 may be provided in the server 3. In this case, the user terminal 1, for example, transmits a surrounding image to the server 3, and obtains and displays a ranking of element contents based on the similarity between the surrounding image and the environmental description text from the server 3. In addition, the user terminal 1, for example, transmits information indicating a user operation to the server 3, and the server 3, for example, performs settings in response to the user operation so that element contents are provided when the user enters an area.

FIG. 15 shows an example of the configuration of a creator terminal 2.

As shown in FIG. 15, the creator terminal 2 is composed of a production unit 81, a text acquisition unit 82, and a registration unit 83.

The production unit 81 produces spatial content and element content in response to the creator's operations and supplies them to the registration unit 83.

The text acquisition unit 82 acquires the environment description text entered by the creator and supplies it to the registration unit 83.

The registration unit 83 links the element content provided by the production unit 81 with the environmental description text provided by the text acquisition unit 82 and registers them on the server 3.

<Operation of Each Device>
Next, a process performed by the user terminal 1 having the above-described configuration will be described with reference to the flowchart in FIG. 16. The process in FIG. 16 is executed, for example, when the user prepares for playback.

In step S1, the control unit 54 accepts the selection of an edit mode by the user. For example, the control unit 54 accepts pressing of the edit button B1 (A in FIG. 6) or the change button B3 (B in FIG. 6) as the selection of an edit mode.

In step S2, the control unit 54 determines whether or not the user has selected the surrounding image capture mode. For example, if the capture button B7 (D in FIG. 7) is pressed after the button B4 (C in FIG. 6) is pressed, it is determined that the surrounding image capture mode has been selected.

If it is determined in step S2 that the surrounding image capture mode has been selected, in step S3, the camera 52 captures a surrounding image.

On the other hand, if it is determined in step S2 that the surrounding image capture mode has not been selected, then in step S4, the control unit 54 determines whether or not the surrounding image selection mode has been selected by the user. For example, if button B4 (A in FIG. 10) is pressed and then button B21 (B in FIG. 10) is pressed, it is determined that the surrounding image selection mode has been selected.

If it is determined in step S4 that the surrounding image selection mode has been selected, in step S5, the image acquisition unit 71 acquires previously captured surrounding images, and the display control unit 73 causes the display unit 56 to display a list of those surrounding images.

In step S6, the control unit 54 accepts the user's selection of a desired surrounding image from a list of surrounding images captured in the past.

On the other hand, if it is determined in step S4 that the surrounding image selection mode has not been selected, in step S7, the display control unit 73 causes the display unit 56 to display a list of multiple element contents (multiple groups of element contents) that correspond to the worldview of a certain work.

In step S8, the setting unit 74 accepts the user's selection of a desired element content from the list of element contents. The setting unit 74 functions as a selection unit that selects the element content selected by the user from among multiple element contents, and places the selected element content in an area that includes the user's current location. After the element content has been placed, the process ends.

After the surrounding image is captured in step S3, or after the surrounding image is selected in step S6, the process proceeds to step S9. In step S9, the control unit 54 determines whether the user is within the area based on the user's current position detected by the position detection unit 53.

If it is determined in step S9 that the user is not within the area, in step S10, the setting unit 74 generates a new area that includes the user's current location based on the user's current location.

On the other hand, if it is determined in step S9 that the user is within the area, step S10 is skipped and processing proceeds to step S11.
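The decision in steps S9 and S10 can be sketched as below. The specification does not state the shape of an area, so this sketch assumes circular areas defined by a center and a radius, and measures distance with the haversine formula. The names `Area`, `find_or_create_area`, and the 50 m default radius are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass
class Area:
    area_id: str
    lat: float
    lon: float
    radius_m: float


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_or_create_area(lat: float, lon: float, areas: List[Area],
                        default_radius_m: float = 50.0) -> Area:
    """Return the existing area containing the user, or generate a new one."""
    for area in areas:
        if haversine_m(lat, lon, area.lat, area.lon) <= area.radius_m:
            return area  # S9: user is within an area, so S10 is skipped
    new_area = Area(f"A{len(areas) + 1}", lat, lon, default_radius_m)  # S10
    areas.append(new_area)
    return new_area


areas = [Area("A1", 35.0, 139.0, 50.0)]
inside = find_or_create_area(35.0001, 139.0, areas)   # roughly 11 m from the center of A1
outside = find_or_create_area(35.01, 139.0, areas)    # roughly 1.1 km away
```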

In step S11, the similarity evaluation unit 72 evaluates the similarity between the surrounding image and the environmental description text associated with each element content.

In step S12, the similarity evaluation unit 72 generates a ranking of the element contents based on the similarity.

In step S13, the setting unit 74 determines whether or not to automatically set the element content. Whether or not to automatically set the element content is determined in advance by the user, for example, before starting preparations for playback.

If it is determined in step S13 that the element content is to be set automatically, in step S14, the setting unit 74 places the element content associated with the environmental description text that has the highest similarity to the surrounding image in the area including the user's current location (setting is performed so that the element content is provided when the user enters that area). After the element content is set in step S14, the process ends.

On the other hand, if it is determined in step S13 that the element content is not to be set automatically, in step S15, the display control unit 73 causes the display unit 56 to display the rankings of the element content.

In step S16, the setting unit 74 accepts the user's selection of a desired element content from the element content rankings. The setting unit 74 places the element content selected by the user in an area including the user's current location, and the process ends.
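The branching in FIG. 16 reduces to three ways of arriving at an element content: manual choice from the full list, automatic choice of the top-ranked item, or manual choice from the ranking. A minimal sketch of that decision follows. The enum `EditMode` and the function `choose_element_content` are hypothetical names, and the ranking is assumed to have been produced already (steps S11 and S12).

```python
from enum import Enum
from typing import List, Optional


class EditMode(Enum):
    CAPTURE = "capture"          # S2 yes: capture a new surrounding image (S3)
    SELECT_PAST = "select_past"  # S4 yes: pick a past surrounding image (S5, S6)
    MANUAL_LIST = "manual_list"  # S4 no: pick from the full content list (S7, S8)


def choose_element_content(
    mode: EditMode,
    ranked_ids: List[str],
    all_ids: List[str],
    auto_set: bool,
    user_pick: Optional[str] = None,
) -> str:
    """Return the ID of the element content to place in the user's area."""
    if mode is EditMode.MANUAL_LIST:
        # S7-S8: no similarity evaluation; the user chooses from the full list.
        if user_pick not in all_ids:
            raise ValueError("user must pick an element content from the list")
        return user_pick
    if auto_set:
        # S13 yes -> S14: take the element content with the highest similarity.
        return ranked_ids[0]
    # S13 no -> S15-S16: the user chooses from the displayed ranking.
    if user_pick not in ranked_ids:
        raise ValueError("user must pick an element content from the ranking")
    return user_pick


ranked = ["forest_scene", "harbor_scene"]
everything = ["forest_scene", "harbor_scene", "city_scene"]
auto_choice = choose_element_content(EditMode.CAPTURE, ranked, everything, auto_set=True)
```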

Next, the processing performed by the creator terminal 2 will be explained with reference to the flowchart in FIG. 17.

In step S21, the production unit 81 produces element content in response to the creator's operations.

In step S22, the registration unit 83 registers the created element content on the server 3.

In step S23, the text acquisition unit 82 accepts the input of the environment description text by the creator, and the registration unit 83 links the environment description text input by the creator with the element content and registers it.
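Steps S22 and S23 amount to a two-stage registration: the element content is stored first, and the environmental description text is linked to it afterwards. The following sketch uses an in-memory stand-in for the server 3. `ElementContent`, `ContentRegistry`, and all field names are assumptions for illustration and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ElementContent:
    content_id: str
    world_id: str   # the work (world) that the content belongs to
    media_uri: str  # hypothetical pointer to the image, sound, or text asset


class ContentRegistry:
    """In-memory stand-in for the registration target on the server 3."""

    def __init__(self) -> None:
        self._contents: Dict[str, ElementContent] = {}
        self._descriptions: Dict[str, str] = {}

    def register_content(self, content: ElementContent) -> None:
        # Corresponds to step S22.
        self._contents[content.content_id] = content

    def link_description(self, content_id: str, text: str) -> None:
        # Corresponds to step S23: link the text to already registered content.
        if content_id not in self._contents:
            raise KeyError(f"unknown element content: {content_id}")
        self._descriptions[content_id] = text

    def description_for(self, content_id: str) -> Optional[str]:
        return self._descriptions.get(content_id)


registry = ContentRegistry()
registry.register_content(ElementContent("forest_scene", "world-1", "media/forest.mp3"))
registry.link_description("forest_scene", "A quiet path lined with tall trees")
```

Storing the description separately from the content leaves room for content that has no description yet.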

By using the above process, the content provision system of this technology can present to the user, in the form of a ranking, element content that matches the contents of the environmental description text and the user's surrounding environment from among multiple element content containing a wide variety of content. By checking the element content that is ranked at the top of the ranking, the user can easily find and place the desired element content without having to check all the element content.

Therefore, the content provision system of this technology can provide users with element content that matches the surrounding environment.

2. Second embodiment
In the second embodiment, the user terminal 1 automatically places the element contents without the user selecting the element contents. In the second embodiment, the user can place element contents in multiple areas while freely wandering around the world.

The fourth flow in which a user prepares to play spatial content will be described with reference to Figs. 18 and 19. Figs. 18 and 19 describe an example in which a user prepares to play spatial content using, for example, a smartphone as a user terminal 1.

The preparation for playing spatial content begins, for example, when no element content has yet been placed in the world. As shown in A of FIG. 18, for example, an automatic generation button B101 for placing element content is displayed in the upper right portion of the screen displayed on the user terminal 1.

As shown in A of FIG. 18, when the user is exploring the world and finds a place where they want to place element content, they can press the auto-generate button B101.

When the automatic generation button B101 is pressed, the user terminal 1 starts capturing images of the surrounding environment, and a surrounding image P1 is displayed on the screen of the user terminal 1, as shown in FIG. 18B.

When the user presses the capture button B7 displayed below the surrounding image P1, the surrounding image P1 is acquired and the similarity between the environmental description text linked to each element content and the surrounding image P1 is evaluated. After that, a ranking of each element content based on the similarity is generated, and an area A101 in which the element content with the highest similarity to the surrounding image P1 has been newly placed is displayed on the map, as shown in FIG. 18C.

After the element content is placed in area A101, the element content is provided to the user by the user terminal 1. The content image and content text provided to the user are displayed below the map.

After the element content has been placed in area A101, as shown in FIG. 19D, if the user moves outside area A101 and finds another location where they want to place the element content, they can press the auto-generate button B101 and take a picture of the surrounding area.

As the user repeatedly presses the automatic generation button B101 and takes surrounding images, element contents are arranged one after another, and as shown in FIG. 19E, element contents are arranged in, for example, four areas A101 to A104.

FIG. 20 describes a fifth flow in which a user prepares to play spatial content. In FIG. 20, an example is described in which a user prepares to play spatial content using, for example, a wearable camera 101 as a user terminal 1.

When a user wearing the wearable camera 101 is walking around the world and finds a place where they want to place element content, they can perform operations such as hand gestures, as shown on the left side of A in Figure 20.

When a hand gesture is made, the wearable camera 101 starts capturing images of the surrounding environment in response to the user's hand gesture, and acquires a surrounding image. The wearable camera 101 then evaluates the similarity between the surrounding image and the environment description text associated with each element content, and generates a ranking of each element content based on the similarity.

The wearable camera 101 automatically places the element content that has the highest similarity to the surrounding image P1 in an area A101 that includes the user's current location, as shown by the white arrow in A of Figure 20.

After the element content is placed in area A101, the element content is provided to the user by user terminal 1.

After element content has been placed in area A101, if the user moves outside area A101 and finds another place where he or she wants to place element content, the user can perform further operations such as hand gestures, as shown on the left side of B in Figure 20.

As the user repeats movements and hand gestures, element content is placed one after another, and as shown by the white arrows in Figure 20, element content is placed in, for example, four areas A101 to A104.

The process performed by the user terminal 1 in the second embodiment will be described with reference to the flowchart in FIG. 21. The process in FIG. 21 is executed, for example, when the user prepares for playback.

In step S101, the control unit 54 accepts an instruction to generate a scape from the user. If the user terminal 1 is, for example, a smartphone, the instruction to generate a scape is input using a button such as the automatic generation button B101 (A in FIG. 18) or the capture button B7 (B in FIG. 18), or a voice command. If the user terminal 1 is, for example, a wearable device, the instruction to generate a scape is input using a tap on the wearable device or a voice command. If the user terminal 1 is, for example, a wearable camera, the instruction to generate a scape is input using a hand gesture or a voice command.

In step S102, the camera 52 captures an image of the surroundings.

In step S103, the control unit 54 determines whether the user is within the area based on the user's current location detected by the position detection unit 53.

If it is determined in step S103 that the user is not within the area, in step S104, the setting unit 74 generates a new area that includes the user's current location based on the user's current location.

On the other hand, if it is determined in step S103 that the user is within the area, the process of step S104 is skipped and the process proceeds to step S105.

In step S105, the similarity evaluation unit 72 evaluates the similarity between the surrounding image and the environmental description text associated with each element content.

In step S106, the similarity evaluation unit 72 generates a ranking of the element contents based on the similarity.

In step S107, the setting unit 74 places the element content associated with the environmental description text that has the highest similarity to the surrounding image in the area including the user's current location (setting is performed so that the element content is provided when the user enters that area).
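Steps S102 to S107 form a fixed pipeline with no user choice, which can be sketched as a single function that receives its collaborators as callables. The function `generate_scape`, its parameters, and the stand-in callables are hypothetical; real capture, area resolution, and ranking would be supplied by the camera 52, the setting unit 74, and the similarity evaluation unit 72.

```python
from typing import Callable, Dict, List, Tuple

Location = Tuple[float, float]  # (latitude, longitude)


def generate_scape(
    location: Location,
    capture: Callable[[], bytes],
    resolve_area: Callable[[Location], str],
    rank: Callable[[bytes], List[str]],
    placements: Dict[str, str],
) -> str:
    """Run one pass of the automatic placement flow and return the area ID."""
    image = capture()                  # S102: capture a surrounding image
    area_id = resolve_area(location)   # S103-S104: existing area or newly generated one
    ranking = rank(image)              # S105-S106: content IDs, highest similarity first
    if not ranking:
        raise ValueError("no element content is registered")
    placements[area_id] = ranking[0]   # S107: place the top-ranked element content
    return area_id


def grid_area(loc: Location) -> str:
    # Stand-in area resolver: buckets locations on a coarse grid (illustration only).
    return f"area-{round(loc[0], 3)}-{round(loc[1], 3)}"


placements: Dict[str, str] = {}
generate_scape((35.0, 139.0), lambda: b"img-1", grid_area,
               lambda img: ["forest_scene", "city_scene"], placements)
generate_scape((35.01, 139.0), lambda: b"img-2", grid_area,
               lambda img: ["harbor_scene"], placements)
```

Because `placements` is keyed by area, running the flow again inside an existing area replaces the element content already placed there, which matches the behavior described for a user who is within the area.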

By using the above process, the content providing system of this technology can select and place element content that matches the contents of the environmental description text and the user's surrounding environment from among multiple element content containing a wide variety of content. The user can place element content simply by inputting a command to generate the scape into the user terminal 1, and can easily place element content while walking around freely.

The content provision system of this technology uses a general-purpose recognizer to accurately evaluate the similarity between the environmental description text and the surrounding image, so the user can experience each scene that makes up the work without feeling out of place, even if they do not make many changes to the element content placed by the user terminal 1.

3. Example of using data acquired from the content provision system for learning
FIG. 22 is a diagram showing an example of data acquired by the content providing system.

In a series of processes for providing a content provision service, the content provision system can acquire, for example, user information, information about element content selected by the user, user location information, map information, information about surrounding images, user biometric information, user behavior information, user entry history into an area, and a 3rd Party DB, as shown in FIG. 22.

User information includes demographic information such as age, gender, place of residence, and occupation, as well as the account ID for the content providing service. Information about the element content selected by the user includes the world ID, the scape (scene and element content) ID, and the environmental description text (matching text) associated with the element content (scene).

The user's location information includes the coordinates of the area, the coordinates of the object on which element content is placed in place of the area, and the viewpoint direction. The map information includes information indicating map POIs (Points of Interest) such as buildings around the user, as well as floor information and store names associated with the map POIs.

Information about the surrounding images includes the ID of the image input by the user as the surrounding image, the time the image was taken, the time the image was acquired, the source from which the image was acquired, the subjective text entered by the user, and the time the subjective text was entered. The 3rd Party DB includes, for example, information about the user's membership in a fan club, purchasing information indicating the purchase history of items such as content and merchandise, and information about targeted advertising.

Figure 23 shows an example of data used for learning.

As shown in FIG. 23, for example, user information, consumer activity information including fan club membership information and purchasing information, user location information, and information regarding element content selected by the user are used to learn the correlation between the user's content selection based on a specific location and the user's characteristics.

For example, user information, consumer activity information, map information, and information about element content selected by the user are used to learn the correlation between the user's content selection based on specific map POIs and the user's characteristics.

For example, user information, consumer activity information, information about surrounding images, and information about element content selected by the user are used to learn the correlation between the user's content selection based on specific surrounding images and the user's characteristics.

For example, information about the surrounding images, the matching results between the surrounding images and the environmental description text, the subjective text entered by the user, and information about the element content selected by the user are used to learn the correlation between the surrounding images and the subjective text.

For example, the results of learning the correlation between surrounding images and subjective text, and information about element content selected by the user, are used to learn the correlation between subjective text and content.

For example, the learning results of the correlation between a user's content selection based on a specific surrounding image and user characteristics, and the learning results of the correlation between subjective text and content, are used to learn text (environmental description text) that links content and surrounding images. The results of learning the environmental description text from these data may be used to support creators when they input environmental description text. For example, even if the creator does not input environmental description text, inputting element content into a learning model obtained by learning the environmental description text makes it possible to obtain environmental description text that describes the surrounding environment in which that element content should be provided (that is, text matching the element content).

<About computers>
The above-mentioned series of processes can be executed by hardware or software. When the series of processes is executed by software, the program constituting the software is installed from a program recording medium into a computer incorporated in dedicated hardware or a general-purpose personal computer.

FIG. 24 is a block diagram showing an example of the hardware configuration of a computer that executes the above-mentioned series of processes using a program.

CPU (Central Processing Unit) 501, ROM (Read Only Memory) 502, and RAM (Random Access Memory) 503 are interconnected by a bus 504.

Further connected to the bus 504 is an input/output interface 505. Connected to the input/output interface 505 are an input unit 506 consisting of a keyboard, mouse, etc., and an output unit 507 consisting of a display, speakers, etc. Also connected to the input/output interface 505 are a storage unit 508 consisting of a hard disk or non-volatile memory, a communication unit 509 consisting of a network interface, etc., and a drive 510 that drives removable media 511.

In a computer configured as described above, the CPU 501, for example, loads a program stored in the storage unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes the program, thereby performing the above-mentioned series of processes.

The programs executed by the CPU 501 are provided, for example, by being recorded on removable media 511, or via a wired or wireless transmission medium such as a local area network, the Internet, or digital broadcasting, and are installed in the storage unit 508.

The program executed by the computer may be a program in which processing is performed chronologically in the order described in this specification, or it may be a program in which processing is performed in parallel or at the required timing, such as when called.

In this specification, a system refers to a collection of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in a single housing, are both systems.

The effects described in this specification are merely examples and are not limiting, and other effects may also exist.

The embodiment of this technology is not limited to the above-mentioned embodiment, and various modifications are possible without departing from the gist of this technology.

For example, this technology can be configured as cloud computing, in which a single function is shared and processed collaboratively by multiple devices over a network.

In addition, each step described in the above flowchart can be executed by a single device, or can be shared and executed by multiple devices.

Furthermore, when one step includes multiple processes, the multiple processes included in that one step can be executed by one device, or can be shared and executed by multiple devices.

<Examples of configuration combinations>
The present technology can also be configured as follows.

(1)
An information processing device comprising:
a similarity evaluation unit that evaluates a similarity between a first text associated with content and surrounding data related to a user's surrounding environment input by the user; and
a selection unit that selects the content corresponding to the surrounding environment from among a plurality of the contents based on the similarity.
(2)
The information processing device according to (1), wherein the similarity evaluation unit evaluates the similarity using a recognizer that receives the first text and the surrounding data as input and outputs the similarity.
(3)
The information processing device according to (1) or (2), further comprising a display control unit that displays a ranking in which the contents are arranged in descending order of the similarity between the associated first text and the surrounding data,
wherein the selection unit selects the content corresponding to the surrounding environment from among a plurality of the contents in response to an operation by the user.
(4)
The information processing device according to (1) or (2), wherein the selection unit selects, from among the plurality of pieces of content, the piece of content associated with the first text having the highest similarity as the piece of content corresponding to the surrounding environment.
(5)
The information processing device according to any one of (1) to (4), wherein the surrounding data is data acquired in the current surrounding environment by the user using a predetermined device.
(6)
The information processing device according to any one of (1) to (4), wherein the surrounding data is data relating to the surrounding environment in the past.
(7)
The information processing device according to (6), wherein the surrounding data relating to the surrounding environment in the past is data acquired in the surrounding environment in the past by the user or another user using a predetermined device.
(8)
The information processing device according to any one of (1) to (7), further comprising a content providing unit that provides the content to the user when the user enters an area in which the content is located.
(9)
The information processing device according to (8), wherein, when the user is within the area, the selection unit changes the content already arranged in the area to the content corresponding to the surrounding environment.
(10)
The information processing device according to (8) or (9), wherein the selection unit, when the user is not within the area, generates a new area including the user's current location and places the content corresponding to the surrounding environment in the new area.
(11)
The information processing device according to any one of (1) to (10), wherein the surrounding data includes images of the surrounding environment and audio data of sounds collected from the surrounding environment.
(12)
The information processing device according to any one of (1) to (11), wherein the content is composed of at least one of a moving image and a sound.
(13)
The information processing device according to any one of (1) to (12), wherein the selection unit selects the content based on at least one of a time and a situation in which the surrounding data was acquired.
(14)
The information processing device according to any one of (1) to (13), wherein the first text is input by a creator who produced the content.
(15)
The information processing device according to any one of (1) to (13), further comprising a subjective text acquisition unit that acquires a second text indicating a subjective sentence of the user with respect to at least the surrounding environment.
(16)
The information processing device according to (15), wherein the first text is generated based on the second text.
(17)
An information processing method comprising:
evaluating a similarity between a text associated with content and surrounding data, input by a user, relating to the user's surrounding environment; and
selecting the content corresponding to the surrounding environment from among a plurality of items of content based on the similarity.
(18)
The information processing method according to (17), wherein the text is input by a creator who created the content,
the method further comprising obtaining information indicating a correlation between the content and the text when the creator registers the text in association with the content.
(19)
A program for causing a computer to execute a process comprising:
evaluating a similarity between a text associated with content and surrounding data, input by a user, relating to the user's surrounding environment; and
selecting the content corresponding to the surrounding environment from among a plurality of items of content based on the similarity.
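
As an illustration only, the ranking of configuration (3) and the highest-similarity selection of configuration (4) can be sketched in Python as follows. The names `Content`, `rank_contents`, `select_content`, and `toy_recognizer` are hypothetical and do not appear in the present disclosure. The recognizer is passed in as a function; `toy_recognizer` is merely a word-overlap stand-in that treats the surrounding data as a set of labels, and is not the trained recognizer that receives images or audio data described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Content:
    """An item of content and the first text linked to it (hypothetical structure)."""
    name: str
    first_text: str


# Assumed recognizer interface: (first text, surrounding data) -> similarity.
Recognizer = Callable[[str, Set[str]], float]


def toy_recognizer(first_text: str, surrounding_labels: Set[str]) -> float:
    """Stand-in recognizer: Jaccard overlap between the words of the first text
    and labels describing the surroundings. An assumption for this sketch only."""
    words = set(first_text.lower().split())
    union = words | surrounding_labels
    return len(words & surrounding_labels) / len(union) if union else 0.0


def rank_contents(
    contents: Sequence[Content],
    surrounding: Set[str],
    recognizer: Recognizer,
) -> List[Tuple[Content, float]]:
    """Arrange the contents in descending order of similarity (configuration (3))."""
    scored = [(c, recognizer(c.first_text, surrounding)) for c in contents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_content(
    contents: Sequence[Content],
    surrounding: Set[str],
    recognizer: Recognizer,
) -> Content:
    """Select the content whose first text has the highest similarity (configuration (4))."""
    if not contents:
        raise ValueError("no contents to select from")
    return rank_contents(contents, surrounding, recognizer)[0][0]


if __name__ == "__main__":
    catalog = [
        Content("forest_bgm", "quiet forest birds"),
        Content("city_pop", "busy city street"),
    ]
    # Labels standing in for the images and audio data acquired by the user.
    print(select_content(catalog, {"city", "street", "cars"}, toy_recognizer).name)
```

In the user-operated variant of configuration (3), the list returned by `rank_contents` would be displayed and the selection left to the user instead of taking the first entry.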

1 User terminal, 2 Creator terminal, 3 Server, 11 Recognizer, 51 Input unit, 52 Camera, 53 Position detection unit, 54 Control unit, 55 Communication unit, 56 Display unit, 57 Speaker, 71 Image acquisition unit, 72 Similarity evaluation unit, 73 Display control unit, 74 Setting unit, 75 Playback control unit, 76 Subjective text acquisition unit, 81 Production unit, 82 Text acquisition unit, 83 Registration unit

Claims

1. An information processing device comprising:
a similarity evaluation unit that evaluates a similarity between a first text associated with content and surrounding data, input by a user, relating to the user's surrounding environment; and
a selection unit that selects the content corresponding to the surrounding environment from among a plurality of items of content based on the similarity.
2. The information processing device according to claim 1, wherein the similarity evaluation unit evaluates the similarity using a recognizer that receives the first text and the surrounding data as input and outputs the similarity.
3. The information processing device according to claim 1, further comprising a display control unit that displays a ranking in which the contents are arranged in descending order of the similarity between the first text associated with each content and the surrounding data,
wherein the selection unit selects the content corresponding to the surrounding environment from among the plurality of contents in response to an operation by the user.
4. The information processing device according to claim 1, wherein the selection unit selects, from among the plurality of pieces of content, the piece of content associated with the first text having the highest similarity as the piece of content corresponding to the surrounding environment.
5. The information processing device according to claim 1, wherein the surrounding data is data acquired by the user in the current surrounding environment using a predetermined device.
6. The information processing device according to claim 1, wherein the surrounding data is data relating to the surrounding environment in the past.
7. The information processing device according to claim 6, wherein the surrounding data relating to the surrounding environment in the past is data acquired in the surrounding environment in the past by the user or another user using a predetermined device.
8. The information processing device according to claim 1, further comprising a content providing unit that provides the content to the user when the user enters an area in which the content is located.
9. The information processing device according to claim 8, wherein, when the user is in the area, the selection unit changes the content already arranged in the area to the content corresponding to the surrounding environment.
10. The information processing device according to claim 8, wherein the selection unit, when the user is not within the area, generates a new area including a current position of the user, and places the content corresponding to the surrounding environment in the new area.
11. The information processing device according to claim 1, wherein the surrounding data includes images of the surrounding environment and audio data of sounds of the surrounding environment.
12. The information processing device according to claim 1, wherein the content is composed of at least one of a moving image and a sound.
13. The information processing device according to claim 1, wherein the selection unit selects the content based on at least one of a time and a situation in which the surrounding data was acquired.
14. The information processing device according to claim 1, wherein the first text is input by a creator who created the content.
15. The information processing device according to claim 1, further comprising a subjective text acquisition unit that acquires a second text indicating a subjective sentence of the user with respect to at least the surrounding environment.
16. The information processing device according to claim 15, wherein the first text is generated based on the second text.
17. An information processing method comprising:
evaluating a similarity between a text associated with content and surrounding data, input by a user, relating to the user's surrounding environment; and
selecting the content corresponding to the surrounding environment from among a plurality of items of content based on the similarity.
18. The information processing method according to claim 17, wherein the text is input by a creator who created the content,
the method further comprising obtaining information indicating a correlation between the content and the text when the creator registers the text in association with the content.
19. A program for causing a computer to execute a process comprising:
evaluating a similarity between a text associated with content and surrounding data, input by a user, relating to the user's surrounding environment; and
selecting the content corresponding to the surrounding environment from among a plurality of items of content based on the similarity.