CN113498538A - Receiving device, server, and voice information processing system


Info

Publication number
CN113498538A
Authority
CN
China
Prior art keywords
scene
information
voice
instruction
specifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180001659.7A
Other languages
Chinese (zh)
Inventor
山本澄彦
村上雅俊
岡野和幸
加藤雅也
堤竹秀行
辻雅史
西口友美
内野聡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Original Assignee
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP2020019997A (granted as JP7181907B2)
Priority claimed from JP2020019686A (granted as JP7272976B2)
Priority claimed from JP2020155675A (granted as JP7463242B2)
Application filed by Hisense Visual Technology Co Ltd and Toshiba Visual Solutions Corp
Publication of CN113498538A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Provided are a receiving device, a server, and a voice information processing system that execute processing of a voice instruction for an image scene designated by a user. The receiving device includes: a control signal receiving unit that receives a scene specifying signal, which is a control signal for specifying one scene of image content while that image content is being output from a display unit; and a control unit that generates a start command for causing a voice instruction acquiring unit, which receives voice, performs voice recognition on the voice, and acquires an instruction, to start acquiring the instruction.

Description

Receiving device, server, and voice information processing system
Cross Reference to Related Applications
The present application claims priority to Japanese Patent Application No. 2020-155675, entitled "Receiving apparatus, server, and voice information processing system", filed with the Japan Patent Office on September 16, 2020; to Japanese Patent Application No. 2020-019686, entitled "Receiving apparatus, scene information providing system, program, providing method of scene information, scene display control apparatus, and scene information providing apparatus", filed with the Japan Patent Office on February 7, 2020; and to Japanese Patent Application No. 2020-019997, filed with the Japan Patent Office on February 7, 2020.
Technical Field
The present embodiments relate to a receiving apparatus, a server, and a voice information processing system.
Background
A television receiver can be controlled by voice commands (voice instructions) from a smart speaker or the like using voice recognition technology. Typically, a trigger word must be spoken to activate the smart speaker before a voice instruction can be used.
In addition, CMs (commercials) for products and services are broadcast within programs such as television broadcasts, and a program may include a segment that introduces a product or the like. Although a user (viewer) can record programs on a television receiving apparatus, a video recorder, or the like, it is very troublesome to find, among the recorded programs, the scene of a CM for a specific product or of a segment introducing a specific product, and to play it back. In recent years, systems have been proposed that provide information such as products introduced in a program to a portable electronic terminal, but it is cumbersome for the user to operate the portable terminal and display the product information on its screen.
In recent years, recommendation information notification services have come into use that notify a terminal device, such as a smartphone or tablet computer associated with a receiver, of recommendation information on products, services, and the like, in accordance with the state of the receiver, such as a television receiver or video recorder. In such a service, the receiver transmits its state to a management server via the internet, and recommendation information corresponding to that state is transmitted from the management server to the terminal device via the internet. There is therefore a time lag before the terminal device receives the recommendation information corresponding to the state of the receiver.
However, the user frequently changes the state of the receiver, for example by changing the channel being watched or by stopping playback of a recorded program to watch a television program live. Consequently, by the time the terminal device receives the recommendation information corresponding to the state of the receiver, the user may already have changed that state. That is, in the conventional recommendation information notification service, when the state of the receiver changes frequently, recommendation information for products or the like entirely unrelated to the current state of the receiver may be transmitted and displayed on the terminal device.
Prior art documents
Patent document
Patent document 1: Japanese Patent Laid-Open No. 2019-207286
Patent document 2: Japanese Patent Laid-Open No. 2020-122819
Patent document 3: Japanese Patent Laid-Open No. 2012-92443
Patent document 4: Japanese Patent No. 5668013
Disclosure of Invention
However, when a user (viewer) watching the image of a broadcast program displayed on a television receiving apparatus tries to operate on an image scene of interest by a voice instruction, the scene may already have passed by the time the smart speaker finishes processing the trigger word spoken before the voice instruction. The voice instruction issued by the user may therefore no longer be a voice instruction for the image scene that interested the viewer.
Accordingly, an object of the present application is to provide a receiving apparatus, a server, and a voice information processing system that execute processing of a voice instruction for an image scene designated by a user.
Another object of the present invention is to provide a recorded scene display device that allows a user who is a viewer of a program to easily view a scene that introduces a desired product or the like from among a plurality of recorded programs.
It is another object of the present invention to provide an information communication system, a receiving apparatus, a terminal apparatus, a display control method, and a display control program capable of displaying appropriate recommendation information in accordance with the state of the receiving apparatus. The display control program may take the form of a set of computer instructions that run on a computer device and cause it to execute a predetermined method or function.
The receiving apparatus of the present application includes: a control signal receiving unit that receives a scene specifying signal, which is a control signal for specifying one scene of image content while that image content is being output from a display unit; and a control unit that generates a start command for causing a voice instruction acquiring unit, which receives voice, performs voice recognition on the voice, and acquires an instruction, to start acquiring the instruction.
The video scene display device of the present application has: a display information generating unit that generates display information of scene information related to a recorded program, the scene information being related to a scene including at least one of a product and a service; and a scene playback processing unit that generates playback information of a scene selected from the scene information, based on a playback instruction.
An information communication system includes a receiving device, a management server, and a terminal device. The receiving device generates first state information indicating its current state. The management server generates, based on the first state information, recommendation information associated with the first state information. The terminal device acquires the first state information and the associated recommendation information, acquires second state information indicating the state of the receiving device at a time corresponding to the acquisition of at least one of the first state information and the recommendation information, compares the first state information with the second state information, and executes display processing of the recommendation information when at least a part of the first state information and the second state information match each other.
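As a concrete illustration, the following minimal Python sketch shows the terminal-side decision described above; the StateInfo structure, its fields, and the choice of channel as the compared part are assumptions for illustration, not the claimed implementation.

    from dataclasses import dataclass

    @dataclass
    class StateInfo:
        channel: str  # channel currently tuned, e.g. "CH1"
        mode: str     # e.g. "live" or "playback"

    def should_display(first: StateInfo, second: StateInfo) -> bool:
        # Display the recommendation only if at least part of the two
        # state snapshots matches; here the channel is taken as that part.
        return first.channel == second.channel

    # The recommendation generated for `first` is shown only when the
    # receiving device is still in a matching state at display time.
    if should_display(StateInfo("CH1", "live"), StateInfo("CH1", "live")):
        print("display recommendation")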
Drawings
Fig. 1 is a diagram showing a configuration example of a system according to an embodiment;
fig. 2 is a diagram schematically showing the structure of a receiving apparatus;
FIG. 3 is a block diagram showing a configuration example of the smart device;
FIG. 4 is a block diagram showing a configuration example of a server;
fig. 5 is a sequence diagram showing an example of the operation of the system according to embodiment 1;
fig. 6 is a flowchart showing an example of the operation of the system according to the embodiment;
fig. 7 is a diagram showing an example of the operation of the system according to the embodiment;
fig. 8 is a sequence diagram of the system according to embodiment 2;
fig. 9 is a diagram showing an example of data flow in the system according to the embodiment;
fig. 10 is a diagram showing a 1st data flow example of the system according to the modification;
fig. 11 is a diagram showing a 2nd data flow example of the system according to the modification;
fig. 12 is a diagram showing a 3rd data flow example of the system according to the modification;
FIG. 13 is a block diagram showing the structure of a server according to another embodiment;
fig. 14 is a block diagram of program information according to another embodiment;
fig. 15 is a diagram showing a plurality of time periods in which a plurality of scenes introducing a plurality of goods (or services) are included in one program content according to another embodiment;
FIG. 16 is a block diagram of a video content display system according to another embodiment;
fig. 17 is a flowchart for explaining the operation of the terminal device according to another embodiment.
Description of the reference numerals
100: television receiver; 200: remote controller; 300: smart device; 400: server; 500: network.
Detailed Description
Hereinafter, embodiments will be described with reference to the drawings.
Fig. 1 is a diagram showing a configuration example of a system according to an embodiment.
The receiving apparatus 100 is, for example, a receiving apparatus for digital television broadcasting (also referred to as a television receiver or a television receiving apparatus), and receives, via an antenna, cable broadcasting, or the like, broadcast signals of 4K/8K broadcasting such as advanced wideband satellite digital broadcasting, or broadcast signals of 2K broadcasting such as conventional terrestrial digital broadcasting, BS digital broadcasting, and CS digital broadcasting. The receiving apparatus 100 acquires data related to content (referred to as content data), such as image signals, voice signals, and text signals, from the broadcast signal and provides the content to the user. Instead of acquiring a broadcast signal, the receiving apparatus 100 may acquire image data for digital television broadcasting from, for example, a storage medium such as a DVD or hard disk, or from a content server (not shown) on the internet.
The remote controller 200 is the remote control accessory of the reception apparatus 100 and controls it remotely, for example turning the power on and off and switching channels. When the user 5 operates the remote controller 200, a control signal based on infrared rays or the like (referred to as a remote controller control signal) is output from the remote controller 200 to the receiving apparatus 100. The remote controller 200 of the present embodiment is provided with a scene specifying button 201.
When the user 5 presses the scene specifying button 201, a remote controller control signal (referred to as a scene specifying signal) corresponding to the scene specifying button 201 is output. When the receiving apparatus 100 receives the scene specifying signal, it specifies a scene (image frame) of the content (image, voice, text, etc.) being output from the display 170, the speaker 171, etc. at the timing of reception, and acquires viewing content information and scene specifying time data relating to the scene.
A scene is essentially an instantaneous image, that is, one frame of video. However, since the user generally cannot resolve a single frame, a scene as perceived by the user may not be one frame but an image with a time width of several seconds.
The viewing content information is information for specifying what the content is, such as the channel on which the content being output is broadcast. The scene specifying time data is time information such as the broadcast time of the specified scene. Together, the viewing content information and the scene specifying time data are referred to as scene specifying information. The receiving apparatus 100 may store the acquired scene specifying information in a memory or the like. On the remote controller 200, the function of the scene specifying button 201 may instead be assigned to an existing button, such as a "good" button, for example by updating the firmware of the remote controller 200. The scene specifying button 201 need not be on the remote controller 200 attached to the receiving apparatus 100; a button device dedicated to the scene specifying function may be used, or a dedicated button device may be connected to the remote controller 200. The reception apparatus 100 may also store the data of the momentary image (image frame) of the specified scene in a memory or the like. In addition, the smart device 300 may be capable of receiving the scene specifying signal output from the remote controller 200.
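As an illustration of how a receiver might assemble the scene specifying information, a minimal sketch follows; the field and function names are assumptions, and a real receiver would take the time from its internal clock or from time information in the broadcast signal.

    import time
    from dataclasses import dataclass

    @dataclass
    class SceneSpecifyingInfo:
        viewing_content_info: str     # e.g. channel of the content being output
        scene_specifying_time: float  # time of the specified scene (epoch seconds)

    def on_scene_specifying_signal(current_channel: str) -> SceneSpecifyingInfo:
        # Capture the channel and the time at the moment the signal arrives.
        return SceneSpecifyingInfo(current_channel, time.time())

    info = on_scene_specifying_signal("CH1")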
The smart device 300 is a smart speaker that incorporates a speaker, a microphone, a camera, a voice recognition mechanism, and the like; it receives voice through the microphone and can extract a command or the like carried by the received voice using the voice recognition mechanism. The smart device 300 includes interfaces with external devices and can exchange data with them; for example, it includes interfaces for connecting to the receiving device 100, the remote controller 200, and the network 500. In addition, when it receives a "question" by voice, the smart device 300 can obtain the "answer" to the "question" from an artificial intelligence engine (AI engine) or the like on the network 500. The smart device 300 may also have its own AI engine.
The server 400 provides information related to viewing content (also referred to as viewing-content-related information) and may be, for example, a cloud server. The server 400 exchanges data with the receiving device 100 and the smart device 300 via the network 500. When the server 400 of the present embodiment receives scene specifying information and a command from the receiving device 100 or the smart device 300, it performs processing based on the command on the scene specified by the scene specifying information, and outputs the processing result to the receiving device 100 and the smart device 300. For example, the server 400 generates an "answer" to the "question" received from the smart device 300 and outputs it to the smart device 300.
The network 500 is an electrical communication line, such as the internet.
Fig. 2 is a block diagram schematically showing the configuration of the receiving apparatus 100.
The reception device 100 includes a basic function 160 as a function of receiving broadcast waves, a system control unit 161, a communication control unit 162, and an application control unit 163. The receiving apparatus 100 is connected to a display 170 and a speaker 171.
The basic functions 160 include: a broadcast tuner 101, a demultiplexer 102, a descrambler 103, an image decoder 104, a voice decoder 105, a subtitle decoder 106, a buffer data section 107, and a transmission control signal analysis section 111.
The broadcast tuner 101 demodulates a stream (broadcast signal) transmitted via a broadcast wave. The demodulated stream (broadcast signal) is input to the demultiplexer 102. The demultiplexer 102 separates the input multiplexed stream into an image stream, a voice stream, a subtitle stream, application data, and a transport control signal, and the image stream, the voice stream, the subtitle stream, and the application data are input to the descrambler 103, and the transport control signal is input to the transport control signal analysis unit 111.
The descrambler 103 descrambles each stream as necessary, and inputs the image stream to the image decoder 104, the voice stream to the voice decoder 105, the subtitle stream to the subtitle decoder 106, and the application data to the buffer data section 107.
The image stream is decoded by the image decoder 104, the voice stream by the voice decoder 105, and the subtitle stream by the subtitle decoder 106.
The transmission control signal analysis section 111 analyzes various control information included in the transmission control signal, such as SI (Signaling Information). Among the analyzed transmission control signals, the transmission control signal analysis unit 111 sends the MH-AIT, data transmission messages, and other control information related to application data to the application control unit 163 for further analysis. The transmission control signal analysis unit 111 also extracts viewing content information and the like related to the content being broadcast from the various control information, such as the transmission control signal and the SI, and stores it in a memory or the like (not shown).
The application control unit 163 manages and controls the MH-AIT and control information such as data transmission messages, which are control information related to application data sent from the transmission control signal analysis unit 111.
The application control unit 163 controls the browser 164 using the data stored in the buffer data section 107, thereby performing screen display control of the data broadcast. In addition, the browser 164 generates the on-screen overlay data of the subtitles from the output data of the subtitle decoder 106.
The decoded image signal and display content such as subtitles and data broadcasting are synthesized by the synthesizer 165 and output to the display 170.
The voice data decoded by the voice decoder 105 is output to the speaker 171.
The codec of the image decoder 104 is, for example, H.265, but it is not limited to this and may be MPEG-2 or H.264; indeed, the codec type is not limited to these.
The system control unit 161 controls various functions of the reception apparatus 100 based on a control signal from an external apparatus or the like received by the communication control unit 162. For example, when receiving the scene specifying signal from the remote controller I/F162-2 of the communication control unit 162, the system control unit 161 generates a control signal for activating (turning on) the voice detection function or the instruction acquisition function by voice recognition (which may be referred to as a voice instruction acquisition function) of the smart device 300, and transmits the control signal to the smart device 300. When receiving the scene specifying signal, the system control unit 161 specifies the scene of the content being output from the display 170, the speaker 171, or the like at the timing of reception, and acquires the viewing content information and the scene specifying time data related to the scene. The system control unit 161 may determine the scene specifying time data by, for example, a clock, not shown, in the receiving apparatus 100, or may determine the scene specifying time data based on time information included in the broadcast signal.
The communication control unit 162 includes various interfaces.
Network I/F162-1 is an interface to network 500. The communication control unit 162 can be connected to the server 400 via the network I/F162-1 and the network 500. The communication control unit 162 can acquire an application and content managed by a service provider apparatus (not shown) via a network. The acquired application and content are sent from the communication control unit 162 to the browser 164, and are used for display and the like.
The remote controller I/F162-2 is an interface with the remote controller 200, and may have a function of infrared communication, for example. The remote controller I/F162-2 receives a remote controller control signal output from the remote controller 200.
The smart device I/F162-3 is an interface with the smart device 300, and may be, for example, a wired cable or an interface for wireless communication such as Wi-Fi (registered trademark) or Bluetooth (registered trademark). The receiving device 100 can communicate data directly with the smart device 300 through the smart device I/F162-3. The receiving device 100 can also communicate data with the smart device 300 via the network I/F162-1.
Fig. 3 is a block diagram showing a configuration example of the smart device 300.
The smart device 300 is equipped with a voice recognition unit 310, a system controller 301, a ROM302 storing programs and the like, a RAM303 used as a temporary memory, a motor control unit 304, a motor 321 controlled by the motor control unit 304, and a drive mechanism 322 that is driven by the motor 321 and changes the orientation and the like of the smart device 300. Further, the smart device 300 includes a clock 305, a camera 311, a microphone 312, a speaker 313, an interface unit 314, and a battery 333.
The smart device 300 can input the voice received from the microphone 312 to the voice recognition unit 310 and extract a command or the like carried by the voice. The extracted command can be output to an external device from the interface unit 314, for example. The smart device 300 according to the present embodiment activates its voice instruction acquiring function when it receives a control signal for activating the voice instruction receiving function or the instruction acquisition function by voice recognition. Whereas an ordinary smart device must receive a spoken trigger word before its voice instruction acquiring function is activated, the smart device 300 according to the present embodiment starts accepting voice instructions once a scene has been specified by the scene specifying signal output from the remote controller 200.
For example, when the interface unit 314 receives a scene specifying signal from the remote controller 200 (S2b), the system controller 301 turns on the microphone 312 constituting the voice detection function. The system controller 301 temporarily stores "voice signal" picked up with the voice detection function turned on, "voice detection time data" at the time of pickup, and "smart speaker identification information" in the RAM303 as "voice command information (may be simply referred to as a command)". Further, the system controller 301 controls to transmit "voice instruction information" to the server 400 via the interface section 314.
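A hedged sketch of this behavior follows; the mic and server objects and all field names are hypothetical stand-ins, since no API is defined here.

    import time

    def on_scene_signal_smart_speaker(mic, speaker_id, server):
        # Receiving the scene specifying signal turns the voice detection
        # function on immediately, with no trigger word required.
        voice_signal = mic.record()  # pick up the user's utterance
        voice_instruction_info = {
            "voice_signal": voice_signal,
            "voice_detection_time": time.time(),  # time of pickup
            "smart_speaker_id": speaker_id,
        }
        # Held temporarily in RAM, then sent to the server for pairing.
        server.send(voice_instruction_info)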
Fig. 4 is a block diagram showing a configuration example of the server. The server 400 includes an interface section 411, a system controller 422, a storage section 423, and an analysis section 424. Scene specifying data and commands transmitted from a television receiver or a smart speaker are temporarily taken into (buffered in) the storage unit 423 under the control of the system controller 422. The analysis unit 424 analyzes the received data taken into the storage unit 423: it specifies the scene of the broadcast program based on the received scene specifying data and executes the command for the specified scene. For example, suppose the scene specifying data identifies a scene of a car driving across grassland in a travel program, and the instruction asks "where is this place?". When the server 400 receives the scene specifying data and the command, the analysis unit 424 obtains the content-related information, here the name of the place shown in the scene (image) specified by the scene specifying data, from a database or the like, and outputs it to the receiving device 100 or the smart device 300. The content-related information is then provided to the user by the receiving device 100 or the smart device 300.
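The server-side flow can be pictured with the following sketch; content_db and its methods are assumptions standing in for the database lookup described above.

    def handle_instruction(scene_specifying_data, instruction, content_db):
        # Resolve the broadcast scene from the channel and time position.
        scene = content_db.lookup_scene(
            scene_specifying_data["viewing_content_info"],
            scene_specifying_data["scene_specifying_time"],
        )
        # Execute the instruction against the resolved scene, e.g. answer
        # "where is this place?" with the place name shown in the scene.
        answer = content_db.answer(scene, instruction)
        return answer  # returned to the receiving device or smart device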
In addition, when the user operates the remote controller to instruct the reception apparatus to display recommended products, the processor of the reception apparatus acquires the program information from the server through the network I/F. The program information includes product/service information. The processor compares the programs recorded in the storage device with the product/service information included in the program information PI to generate recommended product information, and displays it on the display device via the display I/F.
Fig. 13 is a block diagram showing the configuration of the server 203. The server 203 includes a processor 2014, a storage 2015, and a network I/F2031. The processor 2014 includes a CPU, ROM, RAM, etc. The ROM stores software programs for various functions, and the CPU reads necessary programs from the ROM, expands the programs into the RAM, and executes the programs, thereby realizing various functions of the server 203.
The processor 2014 may communicate with receiving devices via the network I/F2031.
The storage 2015 has a program information storage area 2015a in which program information PI is stored. The program information PI includes product/service information about the products or services of CMs broadcast during a program and about products or services introduced within the program. The server 203 therefore holds program information PI containing scene information. The processor 2014 constitutes a program information management part that associates scene information about scenes including at least one of a product and a service with a program, and manages it as program information.
Fig. 14 is a structural diagram of the program information PI. The program information PI includes the title, date, channel (CH), and product (or service) and time period information of the program. The product/service information consists of a product (or service) paired with a time period. That is, the program information PI includes product/service information, and the product/service information includes information on a product (or service) and information on the time period in which the scene introducing that product (or service) was broadcast. The service information in the product/service information includes, for example, stores related to food (gourmet) information and the names of services provided.
The title is the program name, used to distinguish the program from other programs. The date is information indicating the broadcast date of the program: the year, month, and day on which it was broadcast. The date may also include the broadcast start time and end time of the program. The channel is information indicating the channel on which the program was broadcast.
A product (or service) and a time period form a pair of information; for one program, the program information PI includes information on a plurality of products (or services) and on the time period of the scene introducing each corresponding product (or service). Product (or service) 1 is information indicating the product (or service) related to a scene (for example, CM1) broadcast in the program, and time period 1 indicates the broadcast time period of that scene (e.g., CM1). Product (or service) 2 is information indicating the product (or service) related to another scene (for example, CM2) broadcast in the program, and time period 2 indicates the broadcast time period of that scene (e.g., CM2).
Fig. 15 is a diagram showing the time periods of a plurality of scenes introducing a plurality of products (or services) contained in one program content. The program content C1 contains a plurality of such scenes. The figure shows the following: the program content C1 is broadcast from the start time T0; the broadcast of CM1 of a product (or service; hereinafter sometimes simply referred to as a product) X1 starts at time T1, when time t1 has elapsed from the start time T0, and ends at time T2. Similarly, the broadcast of CM2 of product X2 starts at time T3, when time t2 has elapsed from the start time T0, and ends at time T4.
In addition to the CMs, the program includes a segment in which a product (or service) X3 is introduced; the broadcast of that scene starts at time T5, when time t3 has elapsed from the start time T0, and ends at time T6. That is, each product (or service) and time period in the program information PI indicates a product (or service) related to a CM or the like in the program content and the broadcast time period of the corresponding scene shown in fig. 15. The server 203 therefore holds, for each program broadcast in the past, information on the products (or services) introduced in CMs and the like and their broadcast time periods. Thus, for at least one of the recorded programs, the program information PI contains the start time and the end time of each scene.
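One record of the program information PI might look like the following sketch; the keys and values are illustrative assumptions based on figs. 14 and 15.

    program_info = {
        "title": "Travel Program A",  # distinguishes this program from others
        "date": "2020-09-16",         # may also carry start/end times
        "channel": "CH1",
        "items": [
            # each entry pairs a product (or service) with the time period
            # in which its scene was broadcast
            {"product": "X1", "period": ("T1", "T2")},  # CM1
            {"product": "X2", "period": ("T3", "T4")},  # CM2
            {"product": "X3", "period": ("T5", "T6")},  # in-program segment
        ],
    }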
Although information such as the start time and end time of each scene is included in the program information PI here, it need not be. For example, the program information PI may instead include link address information for each scene, with the information on each scene stored in the storage area indicated by that link address. The program information PI may thus be configured to contain address information indicating where the information on each scene is stored.
As shown in fig. 13, the storage device 2015 includes a product display information storage area 2015b in which display information of images and texts for each product (or service) displayed in the recommended product display image is stored. The image data stored in the product display information storage area 2015b is data representing a representative image or video of each product (or service). The text data stored in the product display information storage area 2015b is the name (i.e., product name) of each product (or service). It should be noted that the text data may also include text describing the characteristics, contents, and the like of each product (or service).
An example of the recommended product display image is given in the priority application JP 2020-019686. Based on the display data generated in the generation processing, the recommended product display image is displayed on the screen of the display device. The recommended product display image includes a title unit and a product display unit. The title unit is a display area for displaying the name of the screen, here the characters "recommended product list".
The product display unit is a display area in which a plurality of products (or services) are arranged in priority order according to predetermined rules. In the product display unit, windows for the individual products (or services) are arranged in tile form according to priority; the windows are displayed in the recommended product display image from top to bottom in descending order of priority.
Each window has an image display unit and a text display unit. The image display unit is a display area for the image in the display information acquired for that window. In addition, each window acts as a button of the GUI (Graphical User Interface), so the user can select a window by, for example, moving a cursor onto the desired window.
It should be noted that the image data of the product (or service) displayed on the image display unit is acquired from the server, but may be extracted from the recorded program content.
The text display unit is a display area of the text of the acquired display information.
Therefore, the recommended product display image displays an image and text for each product (or service). In addition, the recommended product display image displays the products (or services) in the order likely to interest the user, according to the user's taste or the like. It should be noted that the image relating to each product (or service) may also be an image relating to the program.
The image for each product (or service) is a candidate image representing a candidate scene selectable by the user. Accordingly, the scene information display unit displays scene information regarding scenes including at least one of a product and a service related to the recorded program, together with the candidate images, in a selectable array.
When the user selects the window of one product (or service) in the recommended product display image, content playback processing is executed, and the playback image presented by the playback processing is displayed on the display device.
The processor plays the scene of the product (or service) (the CM scene or the in-program scene) from the program content that includes the scene introducing the product (or service) related to the selected window. Since that program content is stored in the program content storage area, the processor reads it from the program content storage area and cues playback to the head of the scene of the product (or service) related to the selected window (cueing being the method of finding the start point when a certain part of a recorded video is to be played). A scene playback processing unit is thus configured that generates playback information of the scene selected from the scene information in accordance with a playback instruction and performs playback.
Playback therefore starts from the head scene of the content (for example, a CM) of the product (or service), cued using the time period information of the product (or service) included in the program information PI.
As a result, the content of the product (or service) (for example, CM1) is played back on the screen of the display device. That is, the selected scene, such as the CM of the product (or service), is played from the start time of its time period. In this way, playback information of the scene selected from the scene information is generated in accordance with the playback instruction for the selected scene information.
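As a worked illustration of the cueing, the sketch below computes the playback start offset within a recording from the time period information; the numeric times and the function name are assumptions.

    def scene_offset_seconds(program_start: float, scene_start: float) -> float:
        # Offset from the head of the recorded content to the scene head.
        return scene_start - program_start

    # If program content C1 starts at T0 = 0 s and CM1 of product X1
    # starts at T1 = 900 s, playback is cued 900 s into the recording.
    offset = scene_offset_seconds(0.0, 900.0)
    print(offset)  # 900.0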
It should be noted that scene playback may also be performed from the beginning of the program content (i.e., the broadcast start time of the program) including the scenes (e.g., CMs) of the products (or services) associated with the selected window.
In the above example, the recommended product display image arranges a plurality of rectangular windows for the products (or services) in tile form on a two-dimensional plane, but it may instead arrange and display the image and text for each product (or service) in list format in descending order of priority.
In addition, although the matching process is performed in the receiver in the above-described embodiment, it may instead be executed in the server by transmitting the program content recorded in the receiver to the server.
Further, although the recommended product display image is displayed on the display device of the receiving device above, it may instead be displayed on the display of a terminal device separate from the receiving device.
Fig. 16 is a block diagram of the recorded content display system 201A. The recorded content display system 201A includes a receiving device 202, a server 203, and a smartphone 206 as a portable terminal device. The smartphone 206 can communicate with the receiving device 202 and the server 203 via a network 204 such as the internet.
In the present embodiment, the recommended product display image is displayed on the display 206a of the smartphone 206. Accordingly, the recommended product display program is stored in the ROM of the processor 206b (indicated by a dotted line) of the smartphone 206. A touch panel device (not shown) is mounted on the display 206a.
The processor 206b of the smartphone also includes a CPU, a ROM, and a RAM, and executes various programs stored in the ROM, thereby implementing various functions of the smartphone.
In some embodiments, the smartphone 206 executes the processing flow of the recommended product display program. The recommended product display processing is substantially the same as the recommended product display processing in the television, but is executed by the processor 206b of the smartphone 206.
Therefore, the processor 206b executes the respective processes to display the recommended product display image on the screen of the display 206a, as described in the priority application JP 2020-019686.
It should be noted that the smartphone communicates with the receiver via the network, and acquires the management information MI in the video recording management information storage area from the receiver.
The processor 206b determines whether the user has selected one of the windows in the recommended product display image shown on the display, for example by a touch on the screen. If no window is selected, no processing is performed.
When the user selects one window (one product in the recommended product display image), the processor 206b transmits to the receiver, via the network, a playback instruction signal instructing playback of the scene of the product (or service) related to the selected window. An instruction transmitting unit thus transmits the instruction for playback of the scene selected from the scene information to the receiver, the device storing the recorded program.
The playback instruction signal also contains content specifying information (contained in the management information MI) that identifies the program content, such as the program name and broadcast date, together with the time period information of the scene (CM or the like) of the product (or service) to be played, so that the content (CM or the like) to be played can be specified.
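A playback instruction payload along these lines might look like the following sketch; the field names and values are illustrative assumptions.

    play_instruction = {
        # content specifying information (from the management information MI)
        "program_name": "Travel Program A",
        "broadcast_date": "2020-09-16",
        # time period of the scene (CM or the like) to be played
        "scene_period": {"start": "19:15:00", "end": "19:15:30"},
    }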
It should be noted that the playback instruction signal may also be transmitted from the smartphone 206 to the receiver via a network other than the network 204 (for example, an in-home LAN), or directly by a short-range wireless signal or the like.
When the processor of the receiver receives the playback instruction signal, the processor plays back a scene (CM, etc.) to be played back from the program content recorded in the storage device in accordance with the playback instruction signal. As a result, the user can view the CM or the like of the selected product (or service).
As described above, the smartphone 206 is a scene display control apparatus. Therefore, using a smartphone as the scene display control device can obtain effects similar to using a smart tv as the scene display control device.
As described above, according to the embodiments, the user can easily view scenes of desired products (or services) (that is, CM scenes or scenes of segments within programs) from among a plurality of recorded program contents, simply by selecting from the recommended product display screen a product (or service) related to a CM broadcast in a program, or a product (for example, a book) or service (for example, a restaurant introduction) introduced in a segment of a program.
In particular, since the products and the like are displayed in the order of the user's preference, based on the user's demographic attributes and the like, the probability that the user views the content of a product (or service) increases.
Each of the processors described above includes a CPU and a ROM, and the CPU reads and executes a software program or a computer instruction stored in the ROM to realize various functions of each device, but each processor may be configured by an electronic circuit or may be configured as a circuit block of an integrated circuit such as an FPGA (Field Programmable Gate Array).
In the above-described embodiment, the program content is recorded in the storage device of the receiver, but it may instead be recorded in the server. For example, the storage device of the server has a program content storage area, and in response to a playback instruction for a program from the receiver, the server plays the program content of the designated program and transmits the video signal to the receiver. The display device of the receiver displays the video of the received video signal. When the smartphone is used as the scene display control device and the program content is recorded in the server, the playback instruction command is transmitted to the server. In this case, since the management information MI is also stored in the server's storage device, part of the processing is executed by the processor of the server. The display data of the recommended product display screen is transmitted to the receiver, and the recommended product display screen is displayed on the display device so that the user can select a product or the like.
In fig. 17, the control unit receives recommendation information, other recommendation information, and state information of the receiving device from the product sales management server (S3036). When it receives these from the product sales management server, the control unit inquires of the receiving device about its current state in the process of S3032, and receives the current state information from the receiving device in the process of S3033.
The management server generates predicted state information, which predicts the state of the receiving apparatus, and table information in which each predicted state of the predicted state information is associated with the recommendation information corresponding to that predicted state.
The control unit determines whether or not the status information from the product sales management server matches the current status information from the receiving apparatus in the processing of S3034. If the control unit determines in the process of S3034 that the status information from the product sales management server matches the current status information from the receiving apparatus, the control unit displays the recommendation information on the display unit in the process of S3035, and ends the process.
On the other hand, if it is determined in the process of S3034 that the status information from the product sales management server does not match the current status information from the receiving apparatus, the control unit determines whether or not there is any other recommended information matching the current status information among the other recommended information (S3037). When it is determined that there is another piece of recommendation information matching the current state information among the other pieces of recommendation information (yes in S3037), the control unit displays the other piece of recommendation information matching the current state information among the other pieces of recommendation information on the display unit (S3038), and ends the processing. On the other hand, if it is determined that there is no other recommended information matching the current state information among the other recommended information (no in S3037), the control unit ends the process.
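The S3034 to S3038 decision flow can be summarized in the following sketch; the data shapes and names are assumptions for illustration.

    def choose_recommendation(server_state, current_state,
                              recommendation, other_recommendations):
        if server_state == current_state:        # S3034: do the states match?
            return recommendation                # S3035: display it
        for other in other_recommendations:      # S3037: any other match?
            if other["state"] == current_state:
                return other["info"]             # S3038: display that one
        return None                              # no match: display nothing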
Through the above processing, according to the information communication system of the present application, even when the user changes the state of the receiving apparatus after the state information has been transmitted from the receiving apparatus to the viewing history management server but before the recommendation information is displayed on the display unit of the terminal apparatus, other recommendation information that matches the current state of the receiving apparatus can be displayed.
In some embodiments, the receiver does not transmit its state information to the management server. In the present embodiment, when the display control program is executed in the terminal device, the fact that it is being executed is notified to the product sales management server of the management server via the internet. When so notified, the product sales management server generates the predicted state information and transmits the table information to the terminal device. Details of the predicted state information and the table information are given in the priority application JP 2020-019997.
In the predicted state information, the channel of the receiver is associated with a time. The association is not limited to channel and time; a specific recorded program may instead be associated with an elapsed time from the beginning of that program. In the table information, each predicted state of the predicted state information is associated with the recommendation information corresponding to that predicted state.
When receiving the predicted state information and the table information from the product sales management server, the terminal device inquires the receiver of the current state. The terminal device determines whether or not the current state information of the receiver matches the predicted state of the predicted state information. When it is determined that at least a part of the current state information and the predicted state information of the receiver match each other, the terminal device acquires recommendation information corresponding to the predicted state from the table information and displays the recommendation information on the display unit.
In this way, the terminal device acquires the predicted state information and the table information, and acquires the state information indicating the state of the receiver at the time corresponding to the acquisition of the predicted state information and the table information. The terminal device compares the predicted state information with the state information, and when at least a part of the predicted state information and the state information match each other, acquires recommendation information corresponding to the predicted state from the table information and executes a display process of the recommendation information.
For example, when the user watches the program of channel CH1 on the receiver during the period from 7:30 to 8:30, the terminal device determines that the current state information of the receiver matches the predicted state. When the terminal device determines that they match, it acquires the recommendation information corresponding to the predicted state from the table information and displays it on the display unit.
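A sketch of this table lookup follows; the table structure, the time window, and the names are assumptions consistent with the example above.

    from datetime import time

    # predicted state (channel + time window) -> recommendation
    table = [
        {"channel": "CH1", "start": time(7, 30), "end": time(8, 30),
         "recommendation": "recommendation for the 7:30 CH1 program"},
    ]

    def lookup(channel: str, now: time):
        for row in table:
            if row["channel"] == channel and row["start"] <= now < row["end"]:
                return row["recommendation"]  # shown on the display unit
        return None  # current state matches no predicted state

    print(lookup("CH1", time(7, 45)))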
The control unit receives the predicted state information and the table information from the product sales management server. Upon receiving them, the control unit inquires of the receiver about its current state and receives the current state information from the receiver.
The control unit determines whether or not the current state information of the receiver matches the predicted state of the predicted state information. When determining that the current state information of the receiver does not match the predicted state of the predicted state information, the control unit ends the processing. That is, when the current state information of the receiver does not match the predicted state of the predicted state information, the control unit ends the process without displaying any information on the display unit.
On the other hand, when it is determined that the current state information of the receiver matches the predicted state of the predicted state information, the control unit acquires recommendation information corresponding to the predicted state from the table information, displays the recommendation information on the display unit, and ends the processing.
Through the above processing, the information communication system according to the present embodiment can display recommendation information that matches the current state of the receiver with high accuracy without transmitting the state information from the receiver to the management server.
The terminal device may periodically inquire of the receiver about the current state, and may display the recommendation information on the display unit when the current state of the receiver matches the predicted state.
For example, when the terminal device periodically inquires of the receiver about its current state and determines that the user is watching the program of channel CH4 on the receiver during the period from 9:30 to 10:30, the terminal device acquires the recommendation information corresponding to the predicted state from the table information and displays it on the display unit.
The terminal device need not periodically inquire of the receiver about its current state; the receiver may instead periodically transmit its current state to the terminal device.
(embodiment 1)
In the present embodiment, an operation example in the case where the smart device 300 receives a scene specifying signal from the remote controller 200 is shown.
In this case, the smart device 300 may have a mechanism that, when the scene specifying button 201 is operated and the smart device 300 receives the scene specifying signal, immediately turns on the voice detection function and stores the picked-up voice signal together with the voice detection time data at the time of pickup in a memory as a "voice instruction", and a mechanism for transmitting the "voice instruction" to the server 400. For its part, when the scene specifying button 201 of the remote controller 200 is operated, the reception device 100 may include at least: a mechanism for recording, in an information recording unit, scene specifying time data indicating the time position of the image scene at the moment the scene specifying signal is received, together with content information (program information or the like) covering that scene, as "scene specifying data"; and a mechanism for transmitting the "scene specifying data" to the server 400.
In other words, as shown in fig. 1, along the system path S0, when the user 5 viewing the program wants to ask something about an image scene, the user 5 presses the scene specifying button 201 of the remote controller 200 (S1). The remote controller 200 then transmits scene specifying signals to the receiving apparatus 100 (S2a) and to the smart device 300 (S2b).
The receiving apparatus 100 receives the scene specifying signal from the remote controller (S2a) and records, as "scene specifying data", at least "scene specifying time data" indicating the time position of the image scene at the time the scene specifying signal was received, "content information (for example, program information)" covering the scene, and "TV identification information". The "scene specifying data" is then transmitted to the server 400.
The smart device 300 receives the scene specifying signal from the remote controller 200 (S2b), turns on its voice detection function, and stores the picked-up "voice signal", the "voice detection time data" at the time of pickup, and the "smart speaker identification information" in the memory as "voice instruction information". The "voice instruction information" is then transmitted to the server 400.
If the data are stored only for a very short time, the "scene specifying time data" and the "voice detection time data" may simply be the "real-time data" at the moment the "scene specifying data" and the "voice instruction information" are transmitted to the server 400. These pieces of time data are referred to here as "combination collation data", used to pair the "scene specifying data" with the "voice instruction information".
In the server 400, the scene specifying data and the voice instruction information are linked and held in a memory as a database, which is analyzed for various purposes. As the reference information for linking, the approximate agreement of the "scene specifying time data" and the "voice detection time data" is used.
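For illustration only, the pairing by approximate time agreement can be sketched as follows; the field names and the two-second tolerance are assumptions of this sketch, not values given in the embodiment.

```python
from dataclasses import dataclass

@dataclass
class SceneSpecifyingData:
    tv_id: str
    scene_time: float        # scene specifying time data (seconds)
    program_info: str        # content information, e.g. program name

@dataclass
class VoiceInstructionInfo:
    speaker_id: str
    detect_time: float       # voice detection time data (seconds)
    voice: bytes

def pair(scenes, voices, tolerance_s=2.0):
    """Link each scene specifying record with the voice instruction record
    whose time data approximately agrees (the combination collation data)."""
    pairs = []
    for s in scenes:
        match = min(voices, key=lambda v: abs(v.detect_time - s.scene_time), default=None)
        if match is not None and abs(match.detect_time - s.scene_time) <= tolerance_s:
            pairs.append((s, match))
    return pairs
```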
Fig. 5 is a sequence diagram showing an operation example of the system according to embodiment 1, showing how the voice information processing system operates over time. In fig. 5, (2a) shows the passage of real time. (2b) shows the passage of the program scene on the screen of the receiving device 100. (2c) shows the passage of time in the remote controller 200: the scene specifying button 201 is operated at time t1. (2d) shows the passage of time in the smart device 300: at time t1 the voice detection function is turned on and voice is picked up from the microphone. The timing at which the voice detection function is turned off after being turned on may be set arbitrarily by the user, for example when the surrounding voice is interrupted beyond a certain level or when a predetermined time (30 seconds, 2 minutes, or the like) elapses.
(2e) shows the period during which the receiving device 100 generates the scene specifying data and transmits it to the server.
(2f) shows the passage of time in the server 400. The server 400 receives the "scene specifying data" and the "voice instruction information" from the receiving device 100 and the smart device 300, links them, and holds them in a memory as a database. The server 400 analyzes the database for various purposes, or returns analysis results to each receiving device 100 and/or smart device 300.
The above example shows the collection of information on a program broadcast in real time, but the same concept can be applied when a broadcast program is first recorded in a recording/reproducing device and later reproduced.
In this case, the elapsed time from the start of the program is used in place of the time t1 above. In addition, when the "scene specifying data" is transmitted to the cloud server, identification information (which may also be called attribute information) of the broadcast program is added to the program information (such as the program name) included in the scene specifying data. Further, when the "scene specifying data" and the "voice instruction information" are transmitted to the cloud server, "real-time information" is attached and transmitted as the reference time information for linking the two.
Fig. 6 is a flowchart showing an operation example of the system according to the embodiment, for the case where the scene specifying mode is activated on the remote controller 200 shown in figs. 1 and 2. On the remote controller 200, the scene specifying button 201 may double as the button that activates the scene specifying mode, or a separate scene specifying mode activation button assigned to this operation mode in advance may be provided.
Assume now that the remote controller 200 has been set to the scene specifying mode (SA1) and that the user 5 is watching a program on the screen. When a scene of interest appears, the user operates the scene specifying button 201 (SA2). In the receiving device 100, the communication control unit 162 and the system control unit 161 then operate together: a scene information storage unit in the system control unit 161 temporarily stores in memory, as "scene specifying data", at least the time data of the current scene and the program information (channel, program name, etc.) (SA3). Next, the "scene specifying data" stored in the scene information storage unit and the "TV identification information" identifying the receiving device 100 are integrated and transmitted to the cloud server via the network I/F 162-1 (SA5). Integration is unnecessary when the TV identification information is already included in the scene specifying data.
On the other hand, in the smart device 300, when the user operates the scene specifying button 201 (SA2), the interface unit 314 receives the scene specifying signal, and the system controller 301 turns on the microphone 312 to enable voice input (SA7).
Under the control of the system controller 301, voice is picked up, and the voice data and the time data at the time of pickup are stored in the memory (RAM 303) as "voice instruction information" (SA8).
Next, the "voice instruction information" is transmitted to the server 400 together with the "TV identification information" and/or the "remote controller identification information" of the corresponding receiving device 100. When the "voice instruction information" is transmitted, the "speaker identification information" of the smart device 300 and/or the "remote controller identification information" may also be transmitted to the server 400. When the "speaker identification information" is already included in the "voice instruction information", it need not be added again at the time of transmission.
Note that the following phrases, for example, can serve as voice commands.
"Where is the shooting location of the current scene?", "Where is this venue?", "Who is the person on screen?", "Who is the manufacturer of this car?", "What is the model of this car?", "Where is this hotel?", "Where is this restaurant?", "Who is the manufacturer?", "abort", "record", "return", "stop", and so on. When instructing a recording/playback device to play back an image, there are also, for example, "pause", "rewind", "fast forward", "skip", "mark", a blackout command that turns the screen completely black, "power off", and the like.
Steps SA3-SA5 and SA6 can be described as functional blocks in the system control unit 161 of the receiving device 100, and steps SA7-SA9 and SA6 as functional blocks in the system controller 301 of the smart device 300. Scene specifying data and voice instructions transmitted from the many television receivers and smart speakers are temporarily taken into (buffered in) the storage unit 423 under the control of the system controller 422. The analysis unit 424 analyzes the received data taken into the storage unit 423, first organizing the data per program. The analysis unit 424 in the present embodiment searches the database based on the scene specifying information and the instruction received from the receiving device 100 and the smart device 300, acquires the related provision information (which is associated with the scene specifying information), and outputs it to the receiving device 100 and the smart device 300.
Fig. 7 is a diagram showing an operation example of the system according to the embodiment, and shows an operation example of the server 400 in the case where the "scene specifying data" is transmitted from the receiving device 100 to the server 400 and the "voice command information" is transmitted from the smart device 300.
The server 400 temporarily stores the aforementioned "scene specifying data" in the buffer 423a and the "voice instruction information" in the buffer 423b. "Scene specifying data" and "voice instruction information" arrive one after another from different television devices and smart speakers.
The combination engine 424a pairs mutually corresponding "scene specifying data" and "voice instruction information" based on the combination collation data, and stores each pair in the pairing storage unit 423c.
The "voice instruction information" stored in the pairing storage unit 423c is analyzed by the command analysis unit 424b, which grasps the content of the voice instruction.
As a result of the instruction analysis, it is determined whether the voice instruction is a TV control instruction (for example, "pause", "rewind", "fast forward", "skip", "mark", a blackout command that turns the screen completely black, "power off", or the like) or an information acquisition instruction relating to an image scene (for example, "Where is the shooting location of the current scene?", "Who is the person on screen?", "Who is the manufacturer of this car?", "Where is this hotel?", and so on).
When the voice instruction is a TV control instruction (423d), a control command is prepared in the buffer 423e and transmitted to the corresponding receiving device 100 as a TV control command.
When the voice instruction is a scene-related information acquisition instruction (423f), the information corresponding to the instruction is read from the program meta information storage unit 423h and prepared in the buffer 423g. Examples of the information corresponding to an instruction include a director's name, a manufacturer's name, an actor introduction screen, and a tourist spot. This information is transmitted to the smart device 300, for example as voice response information. As the response information, image data for Picture-in-Picture (PIP) display may also be transmitted. The program meta information storage unit 423h also accumulates related information that the server 400 itself collects from program information and various media, as well as viewing histories and the like collected from each television receiving device.
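For illustration only, the branch made after instruction analysis can be sketched as follows; meta_lookup stands in for the program meta information storage unit 423h, and send_to_tv and send_to_speaker stand in for the transmit paths through the buffers 423e and 423g. All names are assumptions of the sketch.

```python
TV_CONTROL = {"pause", "rewind", "fast forward", "skip", "power off"}

def dispatch(instruction, pair, meta_lookup, send_to_tv, send_to_speaker):
    """Route one paired record: TV control instructions go back to the
    receiving device; scene-related questions are answered from meta info."""
    scene, voice = pair                                # a pair from the pairing storage unit 423c
    if instruction in TV_CONTROL:
        send_to_tv(scene["tv_id"], instruction)        # TV control instruction (423d)
    else:                                              # scene-related information acquisition (423f)
        answer = meta_lookup(scene["program_info"], instruction)  # e.g. director, maker, location
        send_to_speaker(voice["speaker_id"], answer)   # returned as voice response information
```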
As described above, the present embodiment can provide a receiving device, a server, a voice information processing system, and a method that give immediacy to the input timing of a voice command when a voice command for acquiring information about an image scene is given to a smart speaker.
The above system can be described as follows.
(1) The voice information processing system includes a television device and a smart device. The television device includes: a mechanism that receives a scene specifying signal from a remote controller and records, in an information recording unit as "scene specifying data", at least scene specifying time data indicating the time position of the image scene at the time the scene specifying signal is received and content information covering the content of the scene; and a mechanism that transmits the "scene specifying data" to a cloud server.
The smart device, which has at least a voice pickup function, includes: a mechanism that receives the scene specifying signal from the remote controller and stores, in a memory as "voice instruction information", the voice signal picked up by turning on the voice detection function and the voice detection time data at the time of pickup; and a mechanism that transmits the "voice instruction information" to the cloud server.
(2) The television device described in (1) above includes a mechanism for storing an image of the specified scene. The user can thus look back at the stored scene later and execute a voice command on it.
(3) The television device described in (1) or (2) above includes a mechanism for displaying an image of the specified scene on a small screen for a predetermined time. The user can thus keep the scene of interest in view while issuing a voice instruction.
(4) The television device described in any one of (1) to (3) above includes a control mechanism (system control unit 161) that receives an instruction included in the "voice instruction information" transmitted from the cloud server and controls operation according to the instruction. This enables the user to save a scene of interest, repeat playback (still-image playback), and the like, and makes editing processing such as setting a chapter for the scene easy.
(5) The smart device described in any one of (1) to (4) above receives "voice data" that the cloud server acquires based on the instruction included in the "voice instruction information" transmitted to the cloud server, and outputs the voice corresponding to the "voice data" from a speaker.
An ordinary smart device 300 needs to receive a trigger word before receiving a voice instruction. With such an ordinary smart device 300, the user may be unable to quickly specify the scene at the moment of interest. That is, even if a voice command is issued at the interesting moment, the ordinary smart device 300 must receive the trigger word and the voice command and extract the command by voice recognition, so the scene on which the command is executed lags behind the scene of interest. The smart device 300 according to the present embodiment can accept a voice command generated for a user-specified scene and execute that command on the specified scene.
For example, a user (viewer) watching a television broadcast program may wish to know more about the image scene displayed at some moment: related information such as the names of the actors appearing in the scene or the place being shown (for example, the area name or address). In such a case, according to the present embodiment, the related information can be acquired by a voice command directed at the image scene of interest.
(embodiment 2)
In the present embodiment, an operation example is shown in which the receiving device 100 receives the scene specifying signal output from the remote controller 200 and transmits an activation command for activating the smart device 300 from the receiving device 100 to the smart device 300 via the server 400. In this embodiment, the server 400 can track the state of the smart device 300 and process the instruction from the smart device 300 appropriately.
Fig. 8 is a sequence diagram of the system according to embodiment 2, showing the flow of data and the like between the user 5, the receiving device 100, the server 400, and the smart device 300, and the processing of each function.
While watching a travel program on the receiving device 100, the user 5 sees a scene in which an ordinary car drives across a very beautiful grassland and thinks "I want to know where that place is" and "I want to know the manufacturer of that car". At the instant of seeing the scene, the user 5 presses the scene specifying button 201 of the remote controller 200 (step S51).
In the receiving device 100, when the system control unit 161 receives, via the remote controller I/F 162-2, the scene specifying signal output from the remote controller 200 (step S101), it acquires scene specifying time data for the content scene being output to the display 170 and the speaker 171 at the timing of reception. The scene specifying time data may be, for example, the absolute time at which the scene is displayed, or the count time (relative time) from the start of the content until the scene is displayed. The scene specifying time data may be obtained from a clock or counter provided in the receiving device 100, or from the program information of the broadcast signal or the like.
At the same time, the system control unit 161 acquires viewing content information about the content being output, and generates scene specifying information that includes the viewing content information and the scene specifying time data (step S102). The system control unit 161 transmits the generated scene specifying information from the network I/F 162-1 to the server 400 via the network 500 (step S103). The server 400 receives the scene specifying information and stores it in the storage unit 423 (step S131).
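For illustration only, the two ways of forming the scene specifying time data and the assembly of the scene specifying information (step S102) can be sketched as follows; program_start (an epoch time) is an assumption used for the relative-time variant, not a value defined by the embodiment.

```python
import time

def scene_time_absolute():
    return time.time()                      # absolute time at which the scene is displayed

def scene_time_relative(program_start):
    return time.time() - program_start      # count time (relative) from the start of the content

def make_scene_specifying_info(viewing_content_info, program_start=None):
    """Build the scene specifying information of step S102 from the time data
    and the viewing content information."""
    if program_start is not None:
        t = scene_time_relative(program_start)
    else:
        t = scene_time_absolute()
    return {"scene_time": t,                   # scene specifying time data
            "content": viewing_content_info}   # viewing content information
```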
Further, in the receiving device 100, the system control unit 161 outputs, from the network I/F 162-1 to the network 500, an activation signal for activating the voice instruction acquisition function of the smart device 300 (step S104). The activation signal is first received by the server 400 and then transmitted to the smart device 300 via the network 500 (steps S132 and S141). Through this step, the server 400 can manage the state of the smart device 300. The present embodiment shows an example in which the receiving device 100 explicitly transmits the activation signal to the smart device 300, but the scene specifying information output in step S103 may also serve as the activation signal.
In the server 400, the system controller 422 changes its data processing mode when it receives the activation signal from the receiving device 100 (steps S132 and S133). With this mode change, an instruction received at a later stage is executed with respect to the scene specifying information received in step S131 (step S133). Note that step S133 shows the mode change explicitly for the sake of explanation; if the system controller 422 simply treats an instruction received after the scene specifying information and the activation signal of steps S131 and S132 as an instruction to be executed on that scene specifying information, step S133 is not particularly required.
In the smart device 300, when the system controller 301 receives the activation signal, it activates the voice instruction acquisition function of the voice recognition unit 310 (step S142). A mode change also takes place in step S142, altering the behavior of the smart device 300 from its normal processing. In normal operation (normal mode), the smart device 300 activates the voice instruction acquisition function only after receiving a trigger word, but in the present embodiment the activation signal itself serves as the trigger. Since the system controller 301 can activate the voice instruction acquisition function as soon as the activation signal is received in step S141, the explicit mode change of step S142 is likewise not particularly required.
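For illustration only, the difference between the normal mode and the activated state can be sketched as a small state machine; the class name, the trigger word, and the method names are assumptions of the sketch.

```python
class SmartDeviceMode:
    """Hedged sketch of the mode change in steps S141/S142."""

    def __init__(self):
        self.listening = False              # normal mode: not yet acquiring instructions

    def on_utterance(self, text, trigger_word="hey speaker"):
        if not self.listening:
            if text == trigger_word:        # normal mode requires the trigger word first
                self.listening = True
            return None                     # nothing treated as an instruction yet
        return text                         # listening: the utterance is taken as an instruction

    def on_activation_signal(self):
        self.listening = True               # the activation signal replaces the trigger word
```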
The smart device 300 may notify the user, by voice from the speaker 313, that the voice instruction acquisition function has been activated (step S143). On hearing from the speaker 313 that the voice instruction acquisition function is now valid, the user knows that a voice command can be issued (step S52).
Through the above sequence, the user 5 can issue a voice instruction for a scene designated from the remote controller 200.
Fig. 9 is a diagram showing an example of data flow in the system according to the embodiment, from the point at which the user 5 designates a scene of the content being viewed to the point at which a voice command can be issued for the designated scene.
The user 5 presses the scene specifying button 201 of the remote controller 200 (data line L201, corresponding to step S51 in fig. 8). The remote controller 200 outputs a scene specifying signal, and the receiving device 100 receives it (data line L202, corresponding to step S101 in fig. 8). The receiving device 100 outputs the scene specifying information and the activation signal, and the server 400 receives them via the network 500 (data lines L203 and L204, corresponding to steps S103, S131, S104, and S132 in fig. 8). The server 400 outputs the activation signal, and the smart device 300 receives it via the network 500 (data lines L205 and L206, corresponding to steps S132 and S141 in fig. 8). The smart device 300 outputs a voice notification that the voice instruction acquisition function has been activated (data line L207, corresponding to steps S142 and S52 in fig. 8). On hearing the voice notification that the voice instruction acquisition function is active, the user 5 issues a voice command to the smart device 300 (data line L208, corresponding to step S53 in fig. 8).
Returning to fig. 8, the user 5 issues a voice instruction (step S53), for example the phrase "I want to know where this place is". In the smart device 300, when the microphone 312 receives the voice, the voice recognition unit 310 performs voice recognition on it (steps S144 and S145). The present embodiment shows the case where the voice recognition unit 310 is provided in the smart device 300, but an external voice recognition device or the like on the network 500 may also be used. The voice recognition unit 310 acquires the command (instruction) carried by the voice command from the text data obtained by voice recognition (step S146). Here, the instruction may be acquired by, for example, the smart device 300 transmitting the text data to an external text conversion device (not shown) that converts the text data into an instruction and returns it to the smart device 300. The smart device 300 transmits the acquired instruction to the server 400 (step S147), and the server 400 receives it (step S134). The text conversion device in step S146 may be the receiving device 100, in which case the transmission of the instruction to the server 400 in step S147 is performed by the receiving device 100. The text conversion device in step S146 may also be the server 400 itself, in which case the server 400 can manage the instruction directly.
The server 400 generates content-related information based on the scene specifying information stored in the storage unit 423 in step S131 and the instruction received in step S134 (step S135). Specifically, the server 400 identifies the scene from the scene specifying information and performs the processing associated with the received instruction on that scene to obtain the content-related information. The content-related information is the result of the instruction for the specific scene, that is, the response to the voice instruction issued by the user. For example, in response to the voice command "I want to know where this place is", content-related information giving the name of the place is generated. The content-related information is transmitted to the receiving device 100 and the smart device 300 as necessary. When the receiving device 100 receives the content-related information, it may display it on the screen as, for example, text information. When the smart device 300 receives the content-related information, it may read it out by voice.
When the smart device 300 continues to receive a next voice command, the process returns to step S145 to perform voice recognition, generate content-related information, and transmit it to the receiving device 100 or the smart device 300 (yes in step S149). For example, when a keyword such as "further" or "continue" is obtained by voice recognition after the first voice command, it may be determined that a next voice command is coming, and the processing from step S145 may be repeated.
On the other hand, when no next voice command arrives for a certain time or longer, for example, the smart device 300 ends the acquisition of voice commands for the scene specifying information stored in step S131 and returns to the normal mode (no in step S149, then step S150). When the smart device 300 returns to the normal mode, it notifies the server 400 accordingly. On recognizing that the smart device 300 has returned to the normal mode, the server 400 returns its own mode to the mode in effect before the scene specifying information and the activation signal were received in steps S131 and S132 (step S138).
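For illustration only, the loop around steps S145 to S150 can be sketched as follows; next_utterance and handle_instruction are placeholders, and the 30-second timeout is an assumption rather than a value from the embodiment.

```python
import time

CONTINUE_WORDS = {"further", "continue"}

def instruction_session(next_utterance, handle_instruction, timeout_s=30):
    """Keep accepting voice instructions while the user signals continuation;
    after a quiet period the device falls back to the normal mode (S150)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        text = next_utterance()                  # recognized text, or None if nothing heard
        if text is None:
            time.sleep(0.1)                      # nothing heard yet; keep waiting
            continue
        handle_instruction(text)                 # recognize, execute, respond (S145-S148)
        if any(w in text for w in CONTINUE_WORDS):
            deadline = time.time() + timeout_s   # a further instruction is expected (yes in S149)
    # deadline passed with no new instruction: return to normal mode (S150)
```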
Through the above sequence, the user 5 can execute a voice instruction on a scene designated from the remote controller 200. For example, when the user 5, watching a program on the receiving device 100, thinks "I want to know where that place is" or "I want to know the manufacturer of that car" about an object or scene that appears, the user presses the scene specifying button 201 on the remote controller 200 and then speaks a voice command such as "I want to know where this place is" or "the manufacturer of this car" to the smart device 300. Through this procedure, the information the user wants about the scene of interest, for example the name of the place or the car manufacturer's Web site (WWW, World Wide Web), is displayed on the receiving device 100 or output from the smart device 300 by voice.
(modification example)
The present modification shows examples of the method of transmitting the activation command that activates (changes the mode of) the smart device 300 after the scene specifying signal is output from the remote controller 200. The operation after the activation of the smart device 300 is the same as the flows shown in embodiments 1 and 2.
Fig. 10 is a diagram showing the 1st data flow example of the system according to the modification, from the point at which the user 5 designates a scene of the content being viewed to the point at which a voice command can be issued for the designated scene.
The user 5 presses the scene specifying button 201 of the remote controller 200 (data line L301). The remote controller 200 outputs a scene specifying signal, and the receiving device 100 receives it (data line L302). The receiving device 100 outputs an activation signal from the smart device I/F 162-3, and the smart device 300 receives it via the interface unit 314 (data line L303). The receiving device 100 acquires scene specifying information, triggered by the scene specifying signal from the remote controller 200, and outputs it to the server 400 via the network 500 (data lines L304 and L305). Triggered by the reception of the activation signal, the smart device 300 activates the voice instruction acquisition function and outputs a voice such as "ready to receive a voice instruction" (data line L306). On hearing the voice notification that the voice instruction acquisition function is active, the user 5 issues a voice command to the smart device 300 (data line L307).
Fig. 11 is a diagram showing the 2nd data flow example of the system according to the modification, from the point at which the user 5 designates a scene of the content being viewed to the point at which a voice command can be issued for the designated scene.
The user 5 presses the scene specifying button 201 of the remote controller 200 (data line L401). The remote controller 200 outputs a scene specifying signal, and the remote controller I/F 162-2 of the receiving device 100 receives it (data line L402). The remote controller I/F 162-2 passes the scene specifying signal to the system control unit 161 (data line L403). Based on the scene specifying signal, the system control unit 161 outputs an activation signal to the smart device 300 (data line L404). The receiving device 100 acquires scene specifying information, triggered by the scene specifying signal from the remote controller 200, and outputs it to the server 400 via the network 500 (data lines L405 and L406). Triggered by the reception of the activation signal, the smart device 300 activates the voice instruction acquisition function and outputs a voice such as "ready to receive a voice instruction" (data line L407). On hearing the voice notification that the voice instruction acquisition function is active, the user 5 issues a voice command to the smart device 300 (data line L408).
Fig. 12 is a diagram showing the 3rd data flow example of the system according to the modification, from the point at which the user 5 designates a scene of the content being viewed to the point at which a voice command can be spoken for the designated scene. The data flow of this modification corresponds to that of embodiment 1.
The user 5 presses the scene specifying button 201 of the remote controller 200 (data line L101). The remote controller 200 outputs a scene specifying signal, and the receiving device 100 receives it (data line L102). At the same time, the smart device 300 also receives the scene specifying signal output from the remote controller 200 (data line L103). The receiving device 100 acquires scene specifying information, triggered by the scene specifying signal from the remote controller 200, and outputs it to the server 400 via the network 500 (data lines L104 and L105). Triggered by the reception of the scene specifying signal on data line L103, the smart device 300 activates the voice instruction acquisition function and outputs a voice such as "ready to receive a voice instruction" (data line L106). On hearing the voice notification that the voice instruction acquisition function is active, the user 5 utters a voice command to the smart device 300 (data line L107).
Through the above procedures of the modification, the user 5 can issue a voice command for a scene designated from the remote controller 200.
According to at least one of the embodiments described above, it is possible to provide a receiving device, a server, and a voice information processing system that execute processing of a voice command for an image scene specified by the user.
The present embodiments also provide a non-volatile storage medium storing a program or computer instructions that, when executed by a processor, cause a computer device (for example, the above-described server, receiver, or terminal device) to perform the above-described methods. The non-volatile storage medium may be a specific storage module or device on the server, the receiver (for example, a television or a video recorder) or the terminal device (for example, a smartphone or a tablet computer); these modules or devices implement the methods described in the foregoing embodiments when running the computer instructions or program.
While embodiments of the present invention have been described above, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the spirit of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents. Further, the invention remains within its scope even when the components of the claims are expressed separately, when a plurality of components are expressed in combination, or when these forms of expression are combined. A plurality of embodiments may also be combined, and examples configured by such combinations are also within the scope of the invention.
In addition, to make the description clearer, the drawings may show the width, thickness, shape, and the like of each portion more schematically than they actually are. In the block diagrams, data and signals may be exchanged between blocks that are not connected, or, even where blocks are connected, in directions not shown by the arrows. The functions shown in the block diagrams and the processing shown in the flowcharts and sequence diagrams may be realized by hardware (an IC chip or the like), software (a program or the like), a digital signal processor (DSP), or a combination of hardware and software. The device of the present invention also applies where the claims are expressed as control logic, as a program including instructions to be executed by a computer, or as a computer-readable recording medium on which such instructions are recorded. The names and terms used are not limiting; other expressions are included in the present invention if they have substantially the same content and purport.

Claims (9)

  1. A receiving device, comprising:
    a control signal receiving means for receiving a scene specifying signal while image content is being output from a display means, the scene specifying signal being a control signal for specifying a scene that is one image of the image content; and
    a control means for receiving a voice, performing voice recognition on the voice, and generating an activation command that activates acquisition of an instruction by a voice instruction acquisition means that acquires the instruction.
  2. The receiving device of claim 1,
    the control means specifies a scene being output from the display means at a timing when the scene specifying signal is received, and acquires scene specifying time data indicating a time position of the specified scene and viewing content information relating to image content including the scene.
  3. The receiving device according to claim 1 or 2,
    the control means determines a scene being output from the display means at a timing when the scene specifying signal is received, and stores image data of the determined scene in storage means.
  4. The receiving device of claim 2,
    the receiving device includes:
    a transmission means that outputs the scene specifying time data, the viewing content information, and the instruction to an external server; and
    a receiving means that receives, from the external server, an execution result of the instruction.
  5. The receiving device according to any one of claims 1 to 4,
    the receiving device is provided with a smart speaker including the voice instruction acquisition means.
  6. The receiving device according to any one of claims 1 to 4,
    the voice instruction acquisition means is included in an external smart speaker, and
    the control means transmits the activation command to the voice instruction acquisition means via an electric communication line to start acquisition of the instruction.
  7. The receiving device according to any one of claims 1 to 4,
    the voice instruction acquisition means is included in an external smart speaker, and
    the control means transmits the activation command to the voice instruction acquisition means by cable or short-range wireless to start acquisition of the instruction.
  8. A server, wherein the server is capable of data transmission and reception with a smart speaker,
    the server comprising:
    a receiving means that receives an activation command for activating a function of the smart speaker of receiving a voice, performing voice recognition on the voice, and acquiring an instruction, scene specifying time data indicating a time position of a scene that is one image of image content, viewing content information related to the scene, and an instruction related to the scene;
    a parsing means that determines the scene based on the scene specifying time data and the viewing content information;
    an instruction executing means that executes the instruction on the scene to obtain an execution result; and
    an output means that outputs the execution result.
  9. A voice information processing system, comprising:
    a receiving device, comprising: a control signal receiving means for receiving a scene specifying signal while image content is being output from a display means, the scene specifying signal being a control signal for specifying a scene that is one image of the image content; a control means for generating an activation command for activating instruction acquisition, specifying the scene being output from the display means at the timing when the scene specifying signal is received, and acquiring scene specifying time data indicating the time position of the specified scene together with viewing content information; and an output means for outputting the activation command, the scene specifying time data, and the viewing content information;
    a means for receiving the activation command;
    a voice instruction acquisition means that receives a voice, performs voice recognition on the voice, acquires an instruction from the result of the voice recognition, and outputs the instruction; and
    a server, comprising: a receiving means that receives the activation command, the scene specifying time data, the viewing content information, and the instruction; a parsing means that determines the scene based on the scene specifying time data and the viewing content information; an instruction executing means that executes the instruction on the scene to obtain an execution result; and an output means that outputs the execution result.
CN202180001659.7A 2020-02-07 2021-02-03 Receiving device, server, and voice information processing system Pending CN113498538A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
JP2020-019686 2020-02-07
JP2020-019997 2020-02-07
JP2020019997A JP7181907B2 (en) 2020-02-07 2020-02-07 Information communication system, receiver, terminal device, display control method, display control program
JP2020019686A JP7272976B2 (en) 2020-02-07 2020-02-07 Scene information providing system and receiving device
JP2020155675A JP7463242B2 (en) 2020-09-16 2020-09-16 Receiving device, server and audio information processing system
JP2020-155675 2020-09-16
PCT/CN2021/075126 WO2021155812A1 (en) 2020-02-07 2021-02-03 Receiving device, server, and speech information processing system

Publications (1)

Publication Number Publication Date
CN113498538A true CN113498538A (en) 2021-10-12

Family

ID=77199664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001659.7A Pending CN113498538A (en) 2020-02-07 2021-02-03 Receiving device, server, and voice information processing system

Country Status (2)

Country Link
CN (1) CN113498538A (en)
WO (1) WO2021155812A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1735174A (en) * 2004-08-02 2006-02-15 上海乐金广电电子有限公司 Stunt broadcasting service method and system based on contents
JP2014027635A (en) * 2012-07-30 2014-02-06 Sharp Corp Portable terminal device and information communication system
CN108062212A (en) * 2016-11-08 2018-05-22 沈阳美行科技有限公司 A kind of voice operating method and device based on scene
CN108694943A (en) * 2017-03-30 2018-10-23 Lg电子株式会社 Voice server, speech recognition server system and its method of operating
CN108986803A (en) * 2018-06-26 2018-12-11 北京小米移动软件有限公司 Scenery control method and device, electronic equipment, readable storage medium storing program for executing
CN109089140A (en) * 2017-06-14 2018-12-25 北京优朋普乐科技有限公司 A kind of sound control method and device
US20190147881A1 (en) * 2017-04-21 2019-05-16 Sony Corporation Information processing device, reception device, and information processing method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10440376B2 (en) * 2017-01-05 2019-10-08 Nokia Of America Corporation Compressive sensing with joint signal compression and quality control
CN109727597A (en) * 2019-01-08 2019-05-07 未来电视有限公司 The interaction householder method and device of voice messaging
CN110430465B (en) * 2019-07-15 2021-06-01 深圳创维-Rgb电子有限公司 Learning method based on intelligent voice recognition, terminal and storage medium
CN110719441A (en) * 2019-09-30 2020-01-21 傅程宏 System and method for bank personnel behavior compliance early warning management

Also Published As

Publication number Publication date
WO2021155812A1 (en) 2021-08-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination