CN109740510B - Method and apparatus for outputting information - Google Patents

Method and apparatus for outputting information

Info

Publication number
CN109740510B
Authority
CN
China
Prior art keywords
target image
sentence
determining
objects
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811635301.6A
Other languages
Chinese (zh)
Other versions
CN109740510A (en)
Inventor
赖长铃
谢攀
何健
柳瑞超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center, Samsung Electronics Co Ltd filed Critical Samsung Electronics China R&D Center
Priority to CN201811635301.6A priority Critical patent/CN109740510B/en
Publication of CN109740510A publication Critical patent/CN109740510A/en
Application granted granted Critical
Publication of CN109740510B publication Critical patent/CN109740510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a method and a device for outputting information. One embodiment of the method comprises: acquiring a target image; performing target recognition on the target image and determining the objects included in the target image; and generating, based on the identified objects, a sentence describing the target image. The embodiment can perform face recognition and object recognition on an image and generate a sentence describing the image.

Description

Method and apparatus for outputting information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for outputting information.
Background
In daily life, peepholes (cat eyes) are widely used, and smart cat eyes are becoming increasingly popular. Using the images captured by a cat eye to recognize people and objects, and to understand and analyze those images, is currently a research hotspot.
Disclosure of Invention
The embodiment of the application provides a method and a device for outputting information.
In a first aspect, an embodiment of the present application provides a method for outputting information, including: acquiring a target image; carrying out target recognition on the target image and determining an object included in the target image; and generating a sentence for describing the target image based on the identified object.
In some embodiments, the object comprises a person and/or an object; and generating a sentence describing the target image based on the identified object includes: determining whether the target image meets at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object; in response to determining that the target image satisfies the at least one condition, determining a distance between the objects and a position occupied by the objects in the target image; determining an index of closeness between said objects based on said distance and said location; and generating a sentence for describing the target image according to the object and the intimacy index.
In some embodiments, determining an index of closeness between the objects based on the distance and the location comprises: determining a first weight coefficient according to the distance; determining intersection area and union area between the objects according to the positions; determining a second weight coefficient according to the intersection area and the union area; the intimacy index is determined based on the first weight coefficient and the second weight coefficient.
In some embodiments, the generating a sentence describing the target image according to the object and the intimacy index includes: generating at least two sentences according to the object and the intimacy index; and scoring at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
In some embodiments, the acquiring the target image includes: in response to detecting that an object exists within a preset distance of a preset object, determining the stay time of the detected object; in response to determining that the dwell time is greater than or equal to a preset threshold, acquiring an image set including the detected object with an image acquisition device mounted on the preset object; and determining a target image from the image set.
In some embodiments, the above method further comprises: converting the generated sentence into voice and outputting the voice.
In some embodiments, the above method further comprises: obtaining a question sentence of a user; determining an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtaining an answer sentence according to the target image, the identified object and the answer sentence template; and outputting the answer sentence.
In a second aspect, an embodiment of the present application provides an apparatus for outputting information, including: an image acquisition unit configured to acquire a target image; an object recognition unit configured to perform object recognition on the target image and determine an object included in the target image; and a sentence generating unit configured to generate a sentence describing the target image based on the recognized object.
In some embodiments, the object comprises a person and/or an object; and the sentence generation unit includes: a judging module configured to determine whether the target image satisfies at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object; a determining module configured to determine a distance between the objects and a position occupied by the objects in the target image in response to determining that the target image satisfies the at least one condition; a calculation module configured to determine an index of closeness between the objects according to the distance and the location; and the generating module is configured to generate a sentence for describing the target image according to the object and the intimacy index.
In some embodiments, the computing module is further configured to: determining a first weight coefficient according to the distance; determining intersection area and union area between the objects according to the positions; determining a second weight coefficient according to the intersection area and the union area; the intimacy index is determined based on the first weight coefficient and the second weight coefficient.
In some embodiments, the generating module is further configured to: generating at least two sentences according to the object and the intimacy index; and scoring at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
In some embodiments, the image acquisition unit is further configured to: in response to detecting that an object exists within a preset distance of a preset object, determining the stay time of the detected object; in response to determining that the dwell time is greater than or equal to a preset threshold, acquiring an image set including the detected object with an image acquisition device mounted on the preset object; and determining a target image from the image set.
In some embodiments, the above apparatus further comprises: a conversion unit configured to convert the generated sentence into a voice and output the voice.
In some embodiments, the apparatus further comprises an answering unit configured to: obtaining a question sentence of a user; determining an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtaining an answer sentence according to the target image, the identified object and the answer sentence template; and outputting the answer sentence.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements a method as described in any of the embodiments of the first aspect.
The method and the device for outputting information provided by the above embodiments of the present application may first acquire a target image. Then, face recognition and object recognition are performed on the target image, and the objects included in the target image are determined. Finally, a sentence describing the target image is generated based on the identified objects. The method of the embodiment can perform face recognition and object recognition on an image and generate a sentence for describing the image.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for outputting information, in accordance with the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for outputting information according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for outputting information according to the present application;
FIG. 5 is a schematic illustration of another application scenario of a method for outputting information according to the present application;
FIG. 6 is a schematic block diagram of one embodiment of an apparatus for outputting information in accordance with the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for outputting information or apparatus for outputting information may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various electronic devices, such as a speaker, an image capturing device, and the like, may be connected to the terminal apparatuses 101, 102, and 103. Various communication applications, such as an image display application and a voice playing application, may be installed on the terminal devices 101, 102, and 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with a photographing function, including but not limited to smart cat eyes, smartphones, tablets, smart cameras, laptop and desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. They may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
The server 105 may be a server that provides various services, such as a background server that processes images taken by the terminal devices 101, 102, 103. The backend server may perform processing such as analysis on data such as the received image, and feed back a processing result (e.g., a sentence describing the image) to the terminal apparatuses 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited herein.
It should be noted that the method for outputting information provided in the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for outputting information is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for outputting information in accordance with the present application is shown. The method for outputting information of the embodiment comprises the following steps:
Step 201, acquiring a target image.
In the present embodiment, the execution subject of the method for outputting information (e.g., the server 105 shown in fig. 1) may receive the target image from the terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1) through a wired or wireless connection. The target image may include a person and/or an object.
In some optional implementations of this embodiment, step 201 may be specifically implemented by the following steps, not shown in fig. 2: in response to detecting that an object exists within a preset distance of a preset object, determining the stay time of the detected object; in response to determining that the stay time is greater than or equal to a preset threshold, acquiring, with an image acquisition device mounted on the preset object, a set of images including the detected object; and determining a target image from the set of images.
In this implementation, the preset object may be a door panel. The door panel may have a sensor mounted thereon. When the sensor detects the presence of an object within a preset distance (e.g., 0.5 meters) of the door panel, the dwell time of the detected object may be determined. When the dwell time of the object is determined to be greater than or equal to a preset threshold (e.g., 30 seconds), the image capturing device mounted on the door panel may be controlled to capture the set of images. The set of images includes objects detected by the sensor. The image acquisition device can be installed in the cat eye of the door panel. The image acquisition device may send the image set to the execution subject after acquiring the image set. The executing subject may determine the target image from the set of images. The target image may be any one of a set of images.
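As a rough sketch of this implementation, the dwell-time gating could look like the following Python; the sensor and camera interfaces `read_distance` and `capture_frame` are hypothetical placeholders, not part of the disclosure.

```python
import time

PRESET_DISTANCE_M = 0.5    # preset distance in front of the door panel
STAY_THRESHOLD_S = 30.0    # preset stay-time threshold


def acquire_target_image(read_distance, capture_frame, num_frames=5):
    """Capture an image set once an object has stayed long enough.

    `read_distance` and `capture_frame` stand in for the door-panel sensor
    and the peephole camera; both interfaces are assumed, not disclosed.
    """
    stay_start = None
    while True:
        distance = read_distance()
        if distance is not None and distance <= PRESET_DISTANCE_M:
            if stay_start is None:
                stay_start = time.time()            # object just appeared
            elif time.time() - stay_start >= STAY_THRESHOLD_S:
                # Stay time reached: collect the image set and pick a target.
                images = [capture_frame() for _ in range(num_frames)]
                return images[0], images            # any frame can be the target
        else:
            stay_start = None                       # object left; reset the timer
        time.sleep(0.1)
```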
Step 202, performing target identification on the target image, and determining an object included in the target image.
After the execution subject obtains the target image, the execution subject may perform target recognition on the target image. The execution subject may perform target recognition through a deep-learning-based target detection and recognition algorithm, such as a region-proposal-based target detection and recognition algorithm, a regression-based target detection and recognition algorithm, a search-based target detection and recognition algorithm, and the like. Target recognition may include, but is not limited to, face recognition and object recognition. Through target recognition, the execution subject may determine the objects included in the target image. Such objects may include, but are not limited to, people and objects. The region-proposal-based target detection and recognition algorithm may include a convolutional neural network, and the trained convolutional neural network may recognize people familiar to a given user as well as strangers.
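The disclosure does not fix a particular detector; as one possible stand-in for the region-proposal-based family, the sketch below uses a Faster R-CNN model pretrained on COCO (assuming torchvision 0.13 or later). Distinguishing familiar people from strangers would additionally require a face-recognition model trained on the given user's acquaintances, which is not shown here.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A region-proposal-based detector (Faster R-CNN pretrained on COCO).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()


def detect_objects(image_path, score_threshold=0.6):
    """Return (label_id, score, box) triples for objects found in the image."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    results = []
    for label, score, box in zip(prediction["labels"],
                                 prediction["scores"],
                                 prediction["boxes"]):
        if score >= score_threshold:
            results.append((int(label), float(score), box.tolist()))
    return results
```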
Step 203, generating a sentence for describing the target image based on the identified object.
After determining the objects included in the target image, the execution subject may generate a sentence describing the target image. For example, the execution subject may simply list the objects included in the target image. Alternatively, the execution subject may add some words around the objects to form a sentence. For example, if the execution subject recognizes that the target image includes Xiao A and a bunch of flowers, it may generate the sentence "Xiao A and flowers are recognized", or the sentence "Xiao A is holding a bunch of flowers outside the door".
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for outputting information according to the present embodiment. In the application scenario of fig. 3, the user has installed a smart cat eye on the door panel. The user's friend Xiao A brings a bunch of flowers to visit the user's home. After Xiao A has stayed outside the door for more than 30 seconds, the camera in the smart cat eye captures an image of Xiao A and the flowers. The image is then uploaded to the cloud. After performing face recognition and object recognition on the image, the cloud server generates the sentence "Xiao A has brought a bunch of flowers to visit". The cloud can also convert the sentence into audio and play it through the speaker in the smart cat eye.
The method for outputting information provided by the above embodiment of the present application may first acquire a target image. Then, face recognition and object recognition are performed on the target image, and the objects included in the target image are determined. Finally, a sentence describing the target image is generated based on the identified objects. The method of the embodiment can perform face recognition and object recognition on an image and generate a sentence for describing the image.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for outputting information in accordance with the present application is shown. As shown in fig. 4, the method for outputting information of the present application includes the following steps:
Step 401, acquiring a target image.
Step 402, performing target recognition on the target image, and determining an object included in the target image.
Step 403, determining whether the target image meets at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object.
In this embodiment, after the execution subject determines the objects included in the target image, it may detect whether the target image satisfies at least one of the following conditions: including at least two persons, or including at least one person and at least one object. If at least one of the above conditions is satisfied, step 404 may be performed.
Step 404, in response to determining that the target image satisfies at least one of the above conditions, determining a distance between the objects and a position occupied by the objects in the target image.
If the execution subject determines in step 403 that the target image satisfies at least one condition, it may determine the distance between the objects in the target image, and may also determine the position of the objects in the target image. The execution subject may determine the distance between the objects in various ways. For example, when the target image is a depth image, the execution subject may determine the depth value of each object directly from the depth values of the corresponding pixels, and then determine the distance between the objects according to the depth values of the objects. Alternatively, the execution subject may determine the distance between objects from a plurality of consecutively captured images. The execution subject may recognize the persons or objects in the image using a trained machine learning model and label the recognized persons or objects with boxes. The execution subject may take the area covered by each label box as the position occupied by the corresponding object in the target image.
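A minimal sketch of the depth-image variant described above, assuming the depth map is a NumPy array aligned with the target image and each label box is given as pixel coordinates (x1, y1, x2, y2):

```python
import numpy as np


def object_depth(depth_map, box):
    """Median depth of the pixels inside a label box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return float(np.median(depth_map[y1:y2, x1:x2]))


def object_distance(depth_map, box_a, box_b):
    """Approximate distance between two objects from their depth values.

    Only the depth difference is used here; a lateral term could be added
    if the camera geometry is known.
    """
    return abs(object_depth(depth_map, box_a) - object_depth(depth_map, box_b))
```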
Step 405, determining an index of closeness between the objects based on the distance and the location.
After determining the distance between the objects and the position of the objects in the target image, the execution subject may determine an index of closeness between the objects. It will be appreciated that the closer the distance between two objects, the more intimate the two objects are; and the more the positions occupied by two objects in the target image overlap, the more intimate the two objects are.
In some optional implementations of this embodiment, the step 405 may be specifically implemented by the following steps not shown in fig. 4: from the distance, a first weight coefficient is determined. And determining intersection area and union area between the objects according to the positions. And determining a second weight coefficient according to the intersection area and the union area. And determining the intimacy index based on the first weight coefficient and the second weight coefficient.
In this implementation, the execution subject may determine each first weight coefficient according to each distance. For example, the reciprocal of the distance is used as the first weight coefficient. The execution subject may determine the intersection area and the union area between the objects according to the positions occupied by the objects in the target image, and then calculate a second weight coefficient from the intersection area and the union area. For example, the intersection area is divided by the union area, and the result of the division is used as the second weight coefficient. After determining the first and second weight coefficients, the execution subject may determine the index of closeness between the objects. For example, the execution subject may take a weighted average of the first weight coefficient and the second weight coefficient and use the resulting value as the intimacy index.
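The sketch below follows one possible reading of this implementation: the first weight coefficient is the reciprocal of the distance, the second is the intersection-over-union of the occupied positions, and the mixing factor `alpha` of the weighted average is an assumed parameter not specified in the disclosure.

```python
def intersection_over_union(box_a, box_b):
    """Second weight coefficient: intersection area / union area of two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def intimacy_index(distance, box_a, box_b, alpha=0.5, eps=1e-6):
    """Weighted average of the two coefficients.

    First weight  = reciprocal of the distance between the objects.
    Second weight = intersection area / union area of the occupied positions.
    `alpha` is an assumed mixing factor; the disclosure only says
    "weighted average".
    """
    w1 = 1.0 / (distance + eps)     # closer objects -> larger first weight
    w2 = intersection_over_union(box_a, box_b)
    return alpha * w1 + (1.0 - alpha) * w2
```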
In some optional implementations of this embodiment, the execution subject may determine the intersection area and the union area between the objects in the following manner. First, a trained convolutional neural network is used to recognize the target image, yielding the recognition results (Y_1, Y_2, …, Y_k), where the size of the target image is r_m × r_n × 3 and k is the number of recognized objects. Then, the execution subject may perform an interpolation operation on each feature map (of size m × n × c) output by the last convolutional layer of the convolutional neural network, enlarging it to the size of the target image to obtain the feature maps (F_1, F_2, …, F_k), where m is the length, n is the width, c is the number of channels, and each F has dimensions r_m × r_n × c. The execution subject may then obtain, from the fully connected layer of the convolutional neural network, the weights (w_1, w_2, …, w_k) (each of length c) corresponding to the recognition results (Y_1, Y_2, …, Y_k). Multiplying each feature map (F_1, F_2, …, F_k) by the corresponding weight (w_1, w_2, …, w_k) yields a heat map (H_1, H_2, …, H_k) for each recognition result, each of size r_m × r_n × 1. Each heat map is then binarized: pixels whose values are greater than a threshold are marked 1, and pixels whose values are less than the threshold are marked 0. Holes in each heat map are filled by erosion and dilation, giving a final heat map for each recognition result. Finally, the intersection area and the union area between the objects can be calculated from the pixel values and coordinates in each heat map.
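A sketch of this heat-map computation, assuming NumPy/OpenCV and a class-activation-map style network; the binarization threshold and the 5x5 structuring element are illustrative choices, not taken from the disclosure.

```python
import cv2
import numpy as np


def heatmap_masks(feature_map, fc_weights, image_size, threshold=0.5):
    """Binary heat-map masks, one per recognition result.

    feature_map: (m, n, c) activations of the last convolutional layer.
    fc_weights:  (k, c) fully connected weights, one row per recognition result.
    image_size:  (r_m, r_n) height and width of the target image.
    """
    r_m, r_n = image_size
    kernel = np.ones((5, 5), np.uint8)
    masks = []
    for w in fc_weights:
        # Weighted sum over channels; since both steps are linear, summing
        # first and upsampling afterwards matches upsampling each channel first.
        heat = (feature_map @ w).astype(np.float32)              # (m, n)
        heat = cv2.resize(heat, (r_n, r_m), interpolation=cv2.INTER_LINEAR)
        heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
        mask = (heat > threshold).astype(np.uint8)               # binarize
        # Fill holes with erosion/dilation (morphological closing).
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        masks.append(mask)
    return masks


def intersection_union_areas(mask_a, mask_b):
    """Pixel counts of the intersection and union of two binary masks."""
    inter = int(np.logical_and(mask_a, mask_b).sum())
    union = int(np.logical_or(mask_a, mask_b).sum())
    return inter, union
```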
In some optional implementations of this embodiment, the execution subject may generate the depth map in the following manner. First, a video is captured with a camera, and the first frame of the video is taken as the reference frame. Corner points are detected in each frame of the video and in the reference frame in turn, and corner matching is performed. Using preset camera intrinsic and distortion parameters, the reprojection error is minimized by bundle adjustment, iterating to obtain the camera intrinsic parameters, the extrinsic parameters, and the three-dimensional space points corresponding to the feature points. Dense stereo matching is then performed by a plane-sweep method using the obtained intrinsic and extrinsic parameters to obtain a depth map. Furthermore, the reference frame can be used as a guide image to refine the obtained depth map.
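Of this pipeline, only the corner detection and matching step is easy to show compactly; the sketch below tracks reference-frame corners into another frame with OpenCV's KLT tracker, while the bundle adjustment and plane-sweep stages are omitted.

```python
import cv2
import numpy as np


def match_corners(reference_frame, frame):
    """Detect corners in the reference frame and match them into another frame.

    Only the corner detection/matching stage is sketched; bundle adjustment
    and plane-sweep dense stereo are not shown.
    """
    ref_gray = cv2.cvtColor(reference_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(ref_gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7)
    if corners is None:
        return np.empty((0, 2)), np.empty((0, 2))
    # Track the reference-frame corners into the current frame (KLT matching).
    matched, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, cur_gray, corners, None)
    good = status.ravel() == 1
    return corners[good].reshape(-1, 2), matched[good].reshape(-1, 2)
```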
Step 406, generating at least two sentences according to the object and the intimacy index.
The execution subject may generate sentences for describing the target image according to the recognized objects and the calculated intimacy index between the objects. In this embodiment, the execution subject may generate at least two sentences. It can be understood that, because the intimacy index differs between different objects, the wording of the description differs as well. For example, when two persons are included in the target image and their intimacy index is small, the execution subject may generate the sentence "two people are outside the door". When the intimacy index of the two people is large, the execution subject may generate the sentence "two very close people are outside the door". In some alternative implementations, the execution subject may also generate the at least two sentences using a trained machine learning model, which may be a convolutional neural network.
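As an illustration only, a template-based generator might branch on the intimacy index as follows; the templates and the closeness threshold are assumptions, and a trained sequence model could replace them.

```python
def generate_candidate_sentences(object_names, intimacy, close_threshold=0.5):
    """Generate candidate descriptions from the recognized objects and the
    intimacy index. Templates and threshold are illustrative assumptions."""
    names = " and ".join(object_names)
    if intimacy >= close_threshold:
        return [
            f"{names} are outside the door and look very close.",
            f"Two very close visitors, {names}, are at the door.",
        ]
    return [
        f"{names} are outside the door.",
        f"There are {len(object_names)} visitors at the door: {names}.",
    ]


# Example: two recognized persons with a high intimacy index (values assumed).
print(generate_candidate_sentences(["Xiao A", "a stranger"], intimacy=0.8))
```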
Step 407, scoring the at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
The execution subject may also score the generated at least two sentences and take the sentence with the highest score as the sentence describing the target image. Specifically, the execution subject may score each sentence in various ways. For example, the execution subject may score each sentence according to the degree of word co-occurrence in the sentence, according to the number of words the sentence includes, or according to the BLEU (Bilingual Evaluation Understudy) criterion. BLEU is a machine translation evaluation metric that measures the co-occurrence of n-grams between a candidate translation and reference translations.
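A simplified, BLEU-style scorer (modified n-gram precision without the brevity penalty) could rank the candidates as follows; the reference sentences would come from a corpus of image descriptions, which is assumed here.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cooccurrence_score(candidate, references, max_n=2):
    """Average modified n-gram precision of a candidate sentence against
    reference sentences (a simplified, BLEU-style score without the
    brevity penalty)."""
    cand_tokens = candidate.lower().split()
    ref_token_lists = [r.lower().split() for r in references]
    total_score = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand_tokens, n)
        total = sum(cand_counts.values())
        if total == 0:
            continue
        ref_counts = [ngram_counts(ref, n) for ref in ref_token_lists]
        clipped = sum(min(count, max(rc[gram] for rc in ref_counts))
                      for gram, count in cand_counts.items())
        total_score += clipped / total
    return total_score / max_n


def best_sentence(candidates, references):
    """Take the highest-scoring candidate as the description of the image."""
    return max(candidates, key=lambda s: cooccurrence_score(s, references))
```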
Step 408, converting the generated sentence into voice and outputting the voice.
The execution subject may convert the generated sentence into voice and then output the voice. Specifically, the execution subject may output the voice to a speaker, so that the speaker plays the voice and the user is conveniently alerted.
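As one possible way to realize this step, the sketch below uses the pyttsx3 offline text-to-speech library; the disclosure itself does not name a specific speech engine.

```python
import pyttsx3  # an offline text-to-speech library; the disclosure names no specific engine


def speak(sentence: str) -> None:
    """Convert the generated sentence into voice and play it on the speaker."""
    engine = pyttsx3.init()
    engine.say(sentence)
    engine.runAndWait()


speak("Xiao A has brought a bunch of flowers to visit.")
```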
In some optional implementations of this embodiment, the method may further include the following steps not shown in fig. 4: obtaining a question sentence of a user; determining an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtaining an answer sentence according to the target image, the identified object and the answer sentence template; and outputting the answer sentence.
In this implementation, the execution subject may further be connected with at least one microphone. The microphone may be used to collect a question sentence from the user and send the collected question sentence to the execution subject. After obtaining the question sentence, the execution subject may determine an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library. Then, the execution subject may extract features from the target image and obtain an answer sentence according to the identified objects and the answer sentence template. Finally, the answer sentence is output.
Taking the scene shown in fig. 5 as an example, when a courier knocks on the door, the smart cat eye installed on the door panel captures the target image and sends it to the cloud server. After analyzing the target image, the cloud server generates the sentence "a man is holding a box" and sends it to the smart cat eye, which plays the sentence. The smart cat eye then collects the user's question sentence "Who is it?" and sends it to the cloud server. After obtaining the question sentence, the cloud server extracts an answer sentence template, such as "a XX wearing XX on the head and wearing XX clothes", from the preset dialogue library. The cloud server then extracts features from the target image again and modifies the answer sentence template with the extracted features, obtaining the answer sentence "a stranger wearing a hat and wearing gray clothes". The answer sentence is sent to the smart cat eye for playback.
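A toy illustration of the template-filling step: the dialogue library, the template slots, and the extracted feature keys below are all hypothetical, chosen only to mirror the scenario of fig. 5.

```python
# Hypothetical dialogue library: question patterns mapped to answer templates.
ANSWER_TEMPLATES = {
    "who": "A {familiarity} wearing {headwear} and {clothes} clothes.",
    "what": "The visitor is holding {held_object}.",
}


def answer_question(question, image_features):
    """Pick a template from the dialogue library and fill it with features
    extracted from the target image (all keys here are assumed)."""
    key = "who" if "who" in question.lower() else "what"
    return ANSWER_TEMPLATES[key].format(**image_features)


# Features that the recognition step might extract from the target image.
features = {"familiarity": "stranger", "headwear": "a hat",
            "clothes": "gray", "held_object": "a box"}
print(answer_question("Who is it?", features))
# -> A stranger wearing a hat and gray clothes.
```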
In some optional implementation manners of this embodiment, when the smart cat eye plays the above statement, a target image may also be displayed, so that the user can view the object outside the door more conveniently.
According to the method for outputting information provided by the above embodiment of the application, the descriptive sentence can be generated according to the distance between the objects in the target image and the positions of the objects in the target image, so that the descriptive sentence is more accurate.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for outputting information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the apparatus 600 for outputting information of the present embodiment includes: an image acquisition unit 601, an object recognition unit 602, and a sentence generation unit 603.
An image acquisition unit 601 configured to acquire a target image.
An object recognition unit 602 configured to perform object recognition on the target image and determine an object included in the target image.
A sentence generating unit 603 configured to generate a sentence describing the target image based on the recognized object.
In some alternative implementations of the present embodiment, the object includes a person and/or an object. The sentence generation unit 603 may further include a judgment module, a determination module, a calculation module, and a generation module, which are not shown in fig. 6.
A determination module configured to determine whether the target image satisfies at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object.
A determination module configured to determine a distance between the objects and a position occupied by the objects in the target image in response to determining that the target image satisfies at least one of the above conditions.
A calculation module configured to determine an index of closeness between objects according to the distance and the location.
And the generating module is configured to generate a sentence for describing the target image according to the object and the intimacy index.
In some optional implementations of this embodiment, the computing module may be further configured to: from the distance, a first weight coefficient is determined. And determining intersection area and union area between the objects according to the positions. And determining a second weight coefficient according to the intersection area and the union area. And determining the intimacy index based on the first weight coefficient and the second weight coefficient.
In some optional implementations of this embodiment, the generating module may be further configured to: and generating at least two sentences according to the object and the intimacy index. And scoring at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
In some optional implementations of this embodiment, the image acquisition unit may be further configured to: in response to detecting that an object exists within a preset distance of a preset object, determine the stay time of the detected object; in response to determining that the stay time is greater than or equal to a preset threshold, acquire, with an image acquisition device mounted on the preset object, a set of images including the detected object; and determine a target image from the set of images.
In some optional implementations of this embodiment, the apparatus 600 may further include a conversion unit, not shown in fig. 6, configured to convert the generated sentence into a voice and output the voice.
In some optional implementations of this embodiment, the apparatus 600 may further include an answering unit, not shown in fig. 6, configured to: obtaining a question sentence of a user; determining an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtaining an answer sentence according to the target image, the identified object and the answer sentence template; and outputting the answer sentence.
The apparatus for outputting information provided by the above-described embodiment of the present application may first acquire a target image. Then, face recognition and object recognition are performed on the target image, and the objects included in the target image are determined. Finally, a sentence describing the target image is generated based on the identified objects. The apparatus of the embodiment can perform face recognition and object recognition on an image and generate sentences for describing the image.
It should be understood that units 601 to 603, which are described in the apparatus 600 for outputting information, correspond to respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for outputting information are equally applicable to the apparatus 600 and the units included therein and will not be described in detail here.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing a server according to embodiments of the present application. The server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that the computer program read out therefrom is mounted in the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU) 701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an image acquisition unit, an object recognition unit, and a sentence generation unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the image acquisition unit may also be described as a "unit that acquires a target image".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a target image; carrying out target identification on the target image and determining an object included in the target image; based on the identified object, a sentence is generated that describes the target image.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A method for outputting information, comprising:
determining a stay time of a detected object in response to detecting that an object exists within a preset distance of a preset object, wherein the preset object comprises a door panel, and the object comprises a person and/or an object;
in response to determining that the stay time is greater than or equal to a preset threshold, acquiring, with an image acquisition device mounted on the preset object, an image set including the detected object;
determining a target image from the image set;
carrying out target recognition on the target image, and determining an object included in the target image;
generating a sentence for describing the target image based on the identified object;
obtaining a question sentence of a user; determining an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtaining an answer sentence according to the target image, the identified object and the answer sentence template, wherein the obtaining of the answer sentence comprises: modifying the answer sentence template according to the characteristics of the identified object in the target image to obtain an answer sentence; and outputting the answer sentence.
2. The method of claim 1, wherein the generating, based on the identified object, a sentence describing the target image comprises:
determining whether the target image satisfies at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object;
in response to determining that the target image satisfies at least one of the above conditions, determining a distance between the objects and a position occupied by the objects in the target image;
determining an index of closeness between the objects according to the distance and the location;
and generating a sentence for describing the target image according to the object and the intimacy index.
3. The method of claim 2, wherein determining an index of closeness between the objects from the distance and location comprises:
determining a first weight coefficient according to the distance;
determining an intersection area and a union area between the objects according to the positions;
determining a second weight coefficient according to the intersection area and the union area;
determining the affinity index based on the first weight coefficient and the second weight coefficient.
4. The method of claim 2, wherein the generating a statement describing the target image according to the object and the affinity index comprises:
generating at least two sentences according to the object and the intimacy index;
and scoring at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
5. The method of any of claims 1-4, wherein the method further comprises:
converting the generated sentence into voice and outputting the voice.
6. An apparatus for outputting information, comprising:
an image acquisition unit configured to determine a stay time of a detected object in response to detecting that an object exists within a preset distance of a preset object, wherein the preset object comprises a door panel, and the object comprises a person and/or an object; in response to determining that the stay time is greater than or equal to a preset threshold, acquire, with an image acquisition device mounted on the preset object, an image set including the detected object; and determine a target image from the image set;
an object recognition unit configured to perform object recognition on the target image, and determine an object included in the target image;
a sentence generation unit configured to generate a sentence describing the target image based on the recognized object;
an answering unit configured to acquire a question sentence of a user; determine an answer sentence template for answering the question sentence according to the question sentence and a preset dialogue library; obtain an answer sentence according to the target image, the identified object and the answer sentence template, wherein the obtaining of the answer sentence comprises: modifying the answer sentence template according to the characteristics of the identified object in the target image to obtain the answer sentence; and output the answer sentence.
7. The apparatus of claim 6, wherein the sentence generation unit comprises:
a determination module configured to determine whether the target image satisfies at least one of the following conditions: comprising at least two persons, comprising at least one person and at least one object;
a determination module configured to determine a distance between the objects and a position occupied by the objects in the target image in response to determining that the target image satisfies at least one of the above conditions;
a calculation module configured to determine an affinity index between the objects based on the distance and the location;
a generating module configured to generate a sentence for describing the target image according to the object and the intimacy index.
8. The apparatus of claim 7, wherein the computing module is further configured to:
determining a first weight coefficient according to the distance;
determining an intersection area and a union area between the objects according to the positions;
determining a second weight coefficient according to the intersection area and the union area;
determining the affinity index based on the first weight coefficient and the second weight coefficient.
9. The apparatus of claim 7, wherein the generation module is further configured to:
generating at least two sentences according to the object and the intimacy index;
and scoring at least two generated sentences, and taking the sentence with the highest score as the sentence for describing the target image.
10. The apparatus of any of claims 6-9, wherein the apparatus further comprises:
a conversion unit configured to convert the generated sentence into a voice and output the voice.
11. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201811635301.6A 2018-12-29 2018-12-29 Method and apparatus for outputting information Active CN109740510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811635301.6A CN109740510B (en) 2018-12-29 2018-12-29 Method and apparatus for outputting information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811635301.6A CN109740510B (en) 2018-12-29 2018-12-29 Method and apparatus for outputting information

Publications (2)

Publication Number Publication Date
CN109740510A CN109740510A (en) 2019-05-10
CN109740510B true CN109740510B (en) 2023-03-24

Family

ID=66362291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811635301.6A Active CN109740510B (en) 2018-12-29 2018-12-29 Method and apparatus for outputting information

Country Status (1)

Country Link
CN (1) CN109740510B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275110B (en) * 2020-01-20 2023-06-09 北京百度网讯科技有限公司 Image description method, device, electronic equipment and storage medium
CN112288548A (en) * 2020-11-13 2021-01-29 北京沃东天骏信息技术有限公司 Method, device, medium and electronic equipment for extracting key information of target object

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103697A (en) * 2009-12-22 2011-06-22 索尼公司 Information processing device, method, and program
CN106939731A (en) * 2016-01-04 2017-07-11 杭州海康威视数字技术股份有限公司 A kind of control method for door lock and device
CN108897771A (en) * 2018-05-30 2018-11-27 东软集团股份有限公司 Automatic question-answering method, device, computer readable storage medium and electronic equipment
CN109063662A (en) * 2018-08-09 2018-12-21 腾讯科技(成都)有限公司 Data processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5251547B2 (en) * 2008-06-06 2013-07-31 ソニー株式会社 Image photographing apparatus, image photographing method, and computer program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103697A (en) * 2009-12-22 2011-06-22 索尼公司 Information processing device, method, and program
CN106939731A (en) * 2016-01-04 2017-07-11 杭州海康威视数字技术股份有限公司 A kind of control method for door lock and device
CN108897771A (en) * 2018-05-30 2018-11-27 东软集团股份有限公司 Automatic question-answering method, device, computer readable storage medium and electronic equipment
CN109063662A (en) * 2018-08-09 2018-12-21 腾讯科技(成都)有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109740510A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109726624B (en) Identity authentication method, terminal device and computer readable storage medium
CN108898186B (en) Method and device for extracting image
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN107153496B (en) Method and device for inputting emoticons
CN108197618B (en) Method and device for generating human face detection model
CN109993150B (en) Method and device for identifying age
KR102322773B1 (en) Method and apparatus for detecting burrs of electrode pieces
CN109145828B (en) Method and apparatus for generating video category detection model
WO2022041830A1 (en) Pedestrian re-identification method and device
CN108549848B (en) Method and apparatus for outputting information
CN112598780B (en) Instance object model construction method and device, readable medium and electronic equipment
CN108229375B (en) Method and device for detecting face image
CN108334498A (en) Method and apparatus for handling voice request
CN108388889B (en) Method and device for analyzing face image
CN109214501B (en) Method and apparatus for identifying information
CN107832720B (en) Information processing method and device based on artificial intelligence
CN109740510B (en) Method and apparatus for outputting information
CN110110666A (en) Object detection method and device
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
WO2022193911A1 (en) Instruction information acquisition method and apparatus, readable storage medium, and electronic device
CN110008926B (en) Method and device for identifying age
CN109165572B (en) Method and apparatus for generating information
CN108197563B (en) Method and device for acquiring information
CN112087590A (en) Image processing method, device, system and computer storage medium
CN112070022A (en) Face image recognition method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant