WO2022206605A1 - Method for determining target object, and photographing method and device - Google Patents

Info

Publication number
WO2022206605A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
information
index data
voice
Prior art date
Application number
PCT/CN2022/083080
Other languages
French (fr)
Chinese (zh)
Inventor
徐其超
陈帅
刘蒙
吴虹
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2022206605A1 publication Critical patent/WO2022206605A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M 1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M 1/72439 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for image or video messaging
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules

Definitions

  • the present application relates to the technical field of image processing, and in particular, to a method for determining a target object, a photographing method, and an apparatus.
  • in the related art, a target object is manually selected, and then the target object is shot to obtain a video or picture of the tracked object.
  • however, the actually selected target object may be inconsistent with the target object that the user wants to specify, for example because of mis-taps; that is, the target object cannot always be selected accurately.
  • moreover, when a user is holding an item in his or her hand, the user must first put the item down and then manually select the target object, which is less convenient.
  • the embodiments of the present application provide a method for determining a target object, a shooting method and an apparatus, which can allow a user to conveniently and accurately specify a target object.
  • in a first aspect, an embodiment of the present application provides a method for determining a target object.
  • the method is applied to a terminal device. First, an image to be processed is acquired, where the image to be processed includes at least one object; then, object information of each object in the image to be processed is determined, where the object information includes area information and attribute information, and the area information is used to describe the position of the object in the image to be processed; then, second index data of each piece of attribute information is extracted from a pre-established data set, where the data set includes attribute keywords and attribute indexes; for each object, the second index data is associated with the area information to obtain an association relationship between the index data and the area information; when first voice information is acquired, first index data of the first voice information is extracted from the data set based on the attribute keywords corresponding to the second index data in the data set; finally, target area information corresponding to the first index data is determined based on the association relationship and the second index data, where the object corresponding to the target area information is the target object.
  • the object information of each object in the image is identified in advance, and the association relationship between the index data of each object and the area information is established based on the pre-established data set; then, when a voice command is collected, the object in the image corresponding to the voice command is determined based on the association relationship, so that the user can specify the target object conveniently and quickly.
  • the above-mentioned process of determining the target area information corresponding to the first index data based on the association relationship and the second index data may include: for each piece of second index data, matching the indexes in that second index data against the indexes in the first index data, and determining the number of target indexes, where a target index is an index in the second index data that matches an index in the first index data; and, based on the association relationship, determining the area information corresponding to the second index data with the largest number of target indexes as the target area information.
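  • Rendered as code, this implementation is essentially a vote count: for each object's second index data, count how many of its indexes also appear in the first index data derived from the voice command, and return the area information of the object with the most matches. The Python sketch below illustrates the idea; the function name, index values, and data layout are illustrative stand-ins, not taken from the application.

```python
def select_target_area(first_index_data, associations):
    """Pick the area information whose second index data shares the most
    indexes with the first index data derived from the voice command.

    first_index_data: set of attribute indexes, e.g. {"001A", "003B"}
    associations: list of (second_index_data, area_info) pairs,
                  one per recognized object.
    """
    best_area, best_hits = None, 0
    for second_index_data, area_info in associations:
        # A "target index" is an index present in both index data sets.
        hits = len(set(second_index_data) & set(first_index_data))
        if hits > best_hits:
            best_area, best_hits = area_info, hits
    return best_area

# Illustrative data: two objects with their attribute indexes and regions.
associations = [
    ({"001A", "003B", "001C"}, {"bbox": (120, 80, 380, 640)}),   # boy
    ({"002A", "004B", "003C"}, {"bbox": (420, 90, 700, 650)}),   # woman
]
print(select_target_area({"001A", "003B"}, associations))  # boy's region
```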
  • the above process of extracting the second index data of each piece of attribute information from the pre-established data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information, and using the index data of the first keyword in the data set as the second index data.
  • the process of extracting the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data may include: performing voice recognition on the first voice information to obtain a first voice recognition result; determining, according to the first voice recognition result, that the first voice information is a voice command for specifying a target object; searching the attribute keywords corresponding to the second index data for a second keyword matching the first voice recognition result; and using the index data of the second keyword in the data set as the first index data.
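  • A sketch of this voice-side lookup under invented keyword data: the words of the recognition result are compared against the attribute keywords already backing the second index data, and the index data of every hit becomes the first index data. The keyword table below is a made-up fragment; the application does not specify the data set's contents.

```python
# Invented mapping from attribute keywords to attribute indexes; in the
# application this comes from the pre-established data set.
ATTRIBUTE_KEYWORDS = {
    "boy": "001A",
    "woman": "002A", "girl": "002A",
    "child": "003B",
    "skateboard": "001C", "hula hoop": "003C",
}

def first_index_data(recognition_result: str) -> set:
    """Search the recognized text for matching attribute keywords (the
    'second keywords') and return their index data. Plain substring
    matching is a simplification and can over-match in practice."""
    text = recognition_result.lower()
    return {idx for kw, idx in ATTRIBUTE_KEYWORDS.items() if kw in text}

print(first_index_data("track the boy with the skateboard"))
# -> {'001A', '001C'}
```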
  • the above-mentioned process of performing voice recognition on the first voice information to obtain the first voice recognition result may include: extracting a voiceprint feature of the first voice information; determining the similarity between the voiceprint feature and a pre-stored voiceprint feature; and, when the similarity is greater than or equal to a preset threshold, inputting the voiceprint feature and the first voice information into a semantic understanding model to obtain the first voice recognition result output by the semantic understanding model.
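  • A minimal sketch of this voiceprint gate, assuming voiceprints are embedding vectors compared by cosine similarity; the application names neither a similarity measure, a feature extractor, nor a concrete model, so all three are stand-ins here.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize_first_voice(first_voice, stored_voiceprint,
                          extract_voiceprint, semantic_model,
                          threshold=0.8):
    """Run the semantic understanding model only when the speaker's
    voiceprint is close enough to the pre-stored voiceprint."""
    voiceprint = extract_voiceprint(first_voice)  # e.g. an embedding vector
    if cosine_similarity(voiceprint, stored_voiceprint) >= threshold:
        # Both the voiceprint and the voice are fed to the model.
        return semantic_model(voiceprint, first_voice)
    return None  # similarity below threshold: ignore the command
```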
  • the above process of determining object information of each object in the image to be processed may include: inputting the image to be processed into the target recognition model; and obtaining object information output by the target recognition model.
  • the first voice information is a voice command for specifying the target object for the first time, a voice command for adding or removing target objects, or a voice command for replacing the target object.
  • the user can not only specify the target object by voice for the first time, but can also add or remove target objects and replace the target object by voice on that basis, which further increases the convenience of shooting and improves the user experience.
  • the image to be processed includes N original images with different viewing angles, where N is a positive integer greater than or equal to 1.
  • the method further includes: obtaining a first output image of the target object according to the target area information based on the original images, and displaying the first output image in the viewfinder frame.
  • the above process of obtaining the first output image of the target object based on the original images and the target area information may include: determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; cropping a first sub-image of the target object from the first target image according to the target area information; and generating the first output image based on the first sub-image.
  • the above-mentioned process of obtaining the first output image based on the first sub-image may include: splicing the first sub-images corresponding to each target object to obtain the first output image.
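  • One way to read this crop-and-splice step, sketched with Pillow: for each target, keep the candidate frames whose bounds fully contain the target's bounding box, crop from the one with the smallest field of view, and tile the crops side by side. The per-camera (field of view, image, bbox) triples and the horizontal splice are assumptions for illustration.

```python
from PIL import Image

def crop_target(candidates):
    """candidates: (fov_degrees, image, bbox) triples for one target,
    one per camera, where bbox = (left, top, right, bottom) in that
    image's pixel coordinates. Keep the frames that fully contain the
    bbox, pick the smallest field of view, and crop the sub-image."""
    contained = [(fov, img, bb) for fov, img, bb in candidates
                 if bb[0] >= 0 and bb[1] >= 0
                 and bb[2] <= img.width and bb[3] <= img.height]
    _, img, bb = min(contained, key=lambda c: c[0])
    return img.crop(bb)

def first_output_image(per_target_candidates):
    """Crop one first sub-image per target object and splice them."""
    subs = [crop_target(c) for c in per_target_candidates]
    out = Image.new("RGB", (sum(s.width for s in subs),
                            max(s.height for s in subs)))
    x = 0
    for s in subs:
        out.paste(s, (x, 0))
        x += s.width
    return out
```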
  • the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
  • the user can adjust the display mode of the target object through voice, which further improves the convenience of shooting.
  • the method may further include: determining, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding edge of the image, where the third target image is the image with the largest field of view in the image set; and, if the minimum value among the multiple distances is less than or equal to a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
  • the user experience can be improved by prompting the user to adjust the camera pose.
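  • The check behind this prompt reduces to measuring, in the widest image, how far each side of the target's bounding box is from the corresponding image border. A sketch, with coordinates assumed to be (left, top, right, bottom) pixels and an illustrative margin:

```python
def needs_pose_hint(bbox, image_width, image_height, min_margin=40):
    """True if any edge of the bounding-box selection area is within
    min_margin pixels of the corresponding edge of the image with the
    largest field of view, i.e. the target is about to leave the frame."""
    left, top, right, bottom = bbox
    distances = (left, top, image_width - right, image_height - bottom)
    return min(distances) <= min_margin

if needs_pose_hint((20, 300, 260, 700), 1920, 1080):
    print("Target near the frame edge: prompt the user to adjust the camera pose")
```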
  • the above attribute information includes at least one of the following: type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture.
  • in a second aspect, an embodiment of the present application provides a method for determining a target object, which is applied to a terminal device.
  • a picture captured by a camera is displayed in a viewfinder, and the picture includes at least one object; after first voice information is obtained, the object in the picture corresponding to the first voice information is determined as the target object.
  • the above process of determining the object in the picture corresponding to the first voice information as the target object may include: extracting first index data of the first voice information from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determining, based on the association relationship of each object, the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
  • the association relationship is a mapping relationship between the second index data of the attribute information of an object and the area information, and the area information is used to describe the position of the object in the original image collected by the camera.
  • the above process of extracting the second index data of the attribute information of each object from the data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information, and using the index data of the first keyword in the data set as the second index data.
  • the above process of determining object information of each object in the original image may include: inputting the original image into the target recognition model; and obtaining object information output by the target recognition model.
  • the above-mentioned process of determining the target area information corresponding to the first index data based on the association relationship of each object may include: for each piece of second index data, matching each index in the first index data against each index in that second index data, and determining the number of target indexes, where a target index is an index in the second index data that matches an index in the first index data; and determining the area information corresponding to the second index data with the largest number of target indexes as the target area information.
  • the above process of extracting the first index data of the first voice information from the pre-established data set may include: performing voice recognition on the first voice information to obtain a first voice recognition result; determining, according to the first voice recognition result, that the first voice information is a voice command for specifying the target object; searching the attribute keywords corresponding to the second index data for a second keyword matching the first voice recognition result; and using the index data of the second keyword in the data set as the first index data.
  • the above-mentioned process of performing voice recognition on the first voice information to obtain the first voice recognition result may include: extracting a voiceprint feature of the first voice information; determining the similarity between the voiceprint feature and a pre-stored voiceprint feature; and, when the similarity is greater than or equal to a preset threshold, inputting the voiceprint feature and the first voice information into a semantic understanding model to obtain the first voice recognition result output by the semantic understanding model.
  • the cameras include N cameras with different viewing angles, where N is a positive integer greater than or equal to 1.
  • the method further includes: obtaining a first output image of the target object according to the target area information based on the N original images with different viewing angles, and displaying the first output image in the viewfinder.
  • the above-mentioned process of obtaining the first output image of the target object according to the target area information based on the N original images with different viewing angles may include: determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; cropping a first sub-image of the target object from the first target image according to the target area information; and generating the first output image based on the first sub-image.
  • the above-mentioned process of obtaining the first output image based on the first sub-image may include: splicing the first sub-images corresponding to each target object to obtain the first output image.
  • the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
  • the method further includes: determining, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding edge of the image, where the third target image is the image with the largest field of view in the image set; and, if the minimum value among the multiple distances is less than a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
  • in a third aspect, an embodiment of the present application provides a shooting method, which is applied to a terminal device.
  • a picture captured by a camera is displayed in a viewfinder, and the picture includes at least one object; when a first instruction is received, where the first instruction is used to designate at least two target objects, the at least two objects in the picture corresponding to the first instruction are determined as the at least two target objects; then, the picture of each target object is displayed in a separate sub-area of the viewfinder, where one area displays the picture of one target object.
  • the first instruction is a voice instruction.
  • the above process of determining the at least two objects in the picture corresponding to the first instruction as the at least two target objects may include: extracting first index data of the first instruction from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determining, based on the association relationship of each object, the target area information corresponding to the first index data, where the objects corresponding to the target area information are the target objects.
  • the association relationship is the mapping relationship between the second index data of the attribute information of an object and the area information, and the area information is used to describe the position of the object in the original image collected by the camera.
  • before the target area information corresponding to the first index data is determined based on the association relationship of each object, the method further includes: determining object information of each object in the original image, where the object information includes area information and attribute information; extracting second index data of the attribute information of each object from the data set; and, for each object, associating the second index data with the area information to obtain the association relationship between the index data and the area information.
  • the above process of extracting the second index data of the attribute information of each object from the data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information, and using the index data of the first keyword in the data set as the second index data.
  • the above process of determining object information of each object in the original image may include: inputting the original image into the target recognition model; and obtaining object information output by the target recognition model.
  • the above-mentioned process of determining the target area information corresponding to the first index data based on the association relationship of each object may include: for each piece of second index data, matching each index in the first index data against each index in that second index data, and determining the number of target indexes, where a target index is an index in the second index data that matches an index in the first index data; and determining the area information corresponding to the second index data with the largest number of target indexes as the target area information.
  • the above process of extracting the first index data of the first instruction from the pre-established data set may include: searching the attribute keywords corresponding to the second index data for a second keyword matching the first instruction, and using the index data of the second keyword in the data set as the first index data.
  • the cameras include N cameras with different field of view, where N is a positive integer greater than or equal to 1;
  • the process of displaying the picture of each target object in a separate sub-area of the viewfinder frame may include: for each target object, determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; for each target object, cropping a first sub-image of the target object from the first target image according to the target area information; splicing the first sub-images corresponding to the target objects to obtain a first output image; and displaying the first output image in the viewfinder.
  • the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
  • the method further includes: determining, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding edge of the image, where the third target image is the image with the largest field of view in the image set; and, if the minimum value among the multiple distances is less than a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
  • in a fourth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the method of any one of the first aspect, the second aspect, or the third aspect is implemented.
  • in a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method of any one of the first aspect, the second aspect, or the third aspect is implemented.
  • in a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor, the processor is coupled to a memory, and the processor executes a computer program stored in the memory to implement the method of any one of the first aspect, the second aspect, or the third aspect.
  • the chip system may be a single chip, or a chip module composed of multiple chips.
  • in a seventh aspect, an embodiment of the present application provides a computer program product that, when run on a terminal device, enables the terminal device to execute the method of any one of the first aspect, the second aspect, or the third aspect.
  • FIG. 1 is a schematic diagram of a hardware structure of a terminal device 100 according to an embodiment of the present application
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the application
  • FIG. 3 is a schematic block diagram of a flow of a photographing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a tracking shooting in a video recording scene provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of stable tracking provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a tracking shooting of multi-target tracking provided by an embodiment of the present application.
  • FIG. 7A is a schematic diagram of an image processing process in a tracking shooting scene provided by an embodiment of the present application.
  • FIG. 7B is a schematic diagram of an interface in a tracking shooting scene provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of replacing a target object provided by an embodiment of the present application.
  • FIG. 9A is a schematic diagram of another image processing process in a tracking shooting scene provided by an embodiment of the present application.
  • FIG. 9B is a schematic diagram of another interface in a tracking shooting scene provided by an embodiment of the present application.
  • FIG. 9C is a schematic diagram of an original image set provided by an embodiment of the present application.
  • FIG. 9D is a schematic diagram of a sub-image provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of tracking shooting in a shooting scene provided by an embodiment of the present application.
  • FIG. 11 is a schematic block diagram of a shooting process provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a voice shooting process provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a method for determining a target object provided by an embodiment of the present application.
  • FIG. 14 is another schematic flowchart of the method for determining a target object provided by an embodiment of the present application.
  • FIG. 15 is another schematic flow diagram of the photographing method provided by the embodiment of the present application.
  • FIG. 16 is a schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application.
  • FIG. 17 is another schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application.
  • FIG. 18 is a schematic block diagram of a photographing apparatus provided by an embodiment of the present application.
  • the following first exemplarily introduces terminal devices that may be involved in the embodiments of the present application.
  • the terminal device 100 may include a processor 110, a memory 120, an audio module 130, a speaker 130A, a receiver 130B, a microphone 130C, a camera 140, a display screen 150, and a sensor module 160.
  • the sensor module 160 may include, but is not limited to, a touch sensor 160A and the like.
  • the terminal device 100 may include more or less components than those shown in the drawings, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the terminal device 100 may further include at least one of the following: an earphone interface, buttons, a motor, an indicator, a subscriber identity module (SIM) card interface, an external memory interface, a universal serial bus (USB) interface, a charging management module, a power management module, a battery, an antenna, a mobile communication module, a wireless communication module, and the like.
  • the processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated into one or more processors.
  • the controller may be the nerve center and command center of the terminal device 100 .
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, and the like.
  • the I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL).
  • the processor 110 may contain multiple sets of I2C buses.
  • the processor 110 may be respectively coupled to the touch sensor 160A, the camera 140 and the like through different I2C bus interfaces.
  • the processor 110 may couple the touch sensor 160A through the I2C interface, so that the processor 110 communicates with the touch sensor 160A through the I2C bus interface, so as to realize the touch function of the terminal device 100 .
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 130 through an I2S bus to implement communication between the processor 110 and the audio module 130 .
  • the MIPI interface can be used to connect the processor 110 with the display screen 150, the camera 140 and other peripheral devices. MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 communicates with the camera 140 through a CSI interface to implement the shooting function of the terminal device 100 .
  • the processor 110 communicates with the display screen 150 through the DSI interface to implement the display function of the terminal device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 140, the display screen 150, the audio module 130, the sensor module 160, and the like.
  • the GPIO interface can also be configured as I2C interface, I2S interface, MIPI interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 .
  • the terminal device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the terminal device 100 implements a display function through a GPU, a display screen 150, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 150 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 150 is used to display images, videos, and the like.
  • the display screen 150 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device 100 may include 1 or N display screens 150 , where N is a positive integer greater than 1.
  • the terminal device 100 can realize the shooting function through the ISP, the camera 140, the video codec, the GPU, the display screen 150, and the application processor.
  • the ISP is used to process the data fed back by the camera 140 .
  • when shooting, the shutter is opened, and light is transmitted to the photosensitive element of the camera through the lens; the optical signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 140 .
  • Camera 140 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
  • the terminal device 100 may include 1 or N cameras 140 , where N is a positive integer greater than 1.
  • the at least two cameras 140 may include cameras with different viewing angles.
  • the terminal device 100 includes a wide-angle camera, a main camera, and a telephoto camera.
  • the three cameras have different viewing angles, and images with different viewing angles can be collected through cameras with different viewing angles.
  • the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs.
  • the terminal device 100 can play or record videos in various encoding formats, for example, moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the memory 120 may include internal memory and/or external memory, and the internal memory may be used to store computer-executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing the instructions stored in the internal memory.
  • the internal memory may include a program storage area and a data storage area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data and the like created during the use of the terminal device 100 .
  • the internal memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the terminal device 100 may implement audio functions, such as music playback and recording, through the audio module 130, the speaker 130A, the receiver 130B, the microphone 130C, the application processor, and the like.
  • the audio module 130 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 130 may also be used to encode and decode audio signals. In some embodiments, the audio module 130 may be provided in the processor 110 , or some functional modules of the audio module 130 may be provided in the processor 110 .
  • the speaker 130A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the terminal device 100 can listen to music through the speaker 130A, or listen to a hands-free call, or send out voice prompt information and the like.
  • the receiver 130B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be received by placing the receiver 130B close to the human ear.
  • the microphone 130C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can make a sound with the mouth close to the microphone 130C, thereby inputting the sound signal into the microphone 130C.
  • the terminal device 100 may be provided with at least one microphone 130C.
  • the touch sensor 160A is also called "touch panel”.
  • the touch sensor 160A may be disposed on the display screen 150, and the touch sensor 160A and the display screen 150 form a touchscreen.
  • the touch sensor 160A is used to detect touch operations on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 150 .
  • the touch sensor 160A may also be disposed on the surface of the terminal device 100 , which is different from the position where the display screen 150 is located.
  • the software architecture of the terminal device 100 will be exemplarily introduced below with reference to FIG. 2 .
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and system libraries, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make this data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the telephony manager is used to provide the communication function of the terminal device 100 .
  • the resource manager is used to provide various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a brief pause without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the Android core library.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the software and hardware workflows of the terminal device 100 are exemplarily described below with reference to the capturing and photographing scene.
  • when the touch sensor 160A receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • the application framework layer obtains the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation as a touch click operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures still images or video through the camera 140.
  • the terminal device 100 may be a mobile phone, a tablet computer, or other types of terminal devices, and the specific type of the terminal device 100 is not limited in this embodiment of the present application.
  • FIG. 3 is a schematic flow diagram of a photographing method provided by an embodiment of the present application. The method is applied to the terminal device 100, and the method may include the following steps:
  • Step S301: start the shooting application, enter the video recording mode or the photographing mode, and display the captured image in the viewfinder.
  • a shooting application refers to an application used for shooting, which is usually a camera application.
  • after detecting an operation for starting the shooting application, the terminal device 100 starts the shooting application in response to the operation, and invokes the camera driver through the kernel layer to capture images through the camera and display them in the viewfinder.
  • the main interface 42 of the mobile phone 41 displays applications such as the camera 43, Smart Life, Settings, Calendar, Gallery, and Clock.
  • the terminal device 100 is embodied as a mobile phone 41
  • the photographing application is embodied as a camera 43 .
  • after receiving a click operation on the camera 43, the mobile phone 41 starts the camera 43 in response to the click operation. After the mobile phone 41 enters the video recording mode, the video recording preview interface 44 is displayed, and the object 46, the object 47, the object 48, and the object 49 are displayed in the viewfinder frame of the video recording preview interface 44.
  • the video preview interface 44 also includes an icon 45, which is used to turn the tracking shooting mode on and off; that is, the user can click the icon 45 to enter or exit the tracking shooting mode. At this time, the icon 45 indicates that the mobile phone 41 has not yet entered the tracking shooting mode.
  • tracking shooting may also be referred to as target tracking.
  • tracking shooting can keep the tracked target in the center of the picture as much as possible without the user moving the terminal device or manually adjusting the focal length, avoiding picture shake caused by the user moving the terminal device.
  • Step S302: a trigger operation is detected, and in response to the trigger operation, the voice shooting mode is entered.
  • the trigger operation is used to trigger the voice shooting mode, and the trigger operation may include operations such as single-click, double-click, or long-press.
  • the trigger operation is a click operation.
  • when the mobile phone 41 in FIG. 4 receives the user's click operation on the icon 45, it enters the tracking shooting mode.
  • after entering the tracking shooting mode, the mobile phone 41 can call the microphone to collect voice information in real time, so as to realize functions such as allowing the user to specify the target object by voice; that is, after the mobile phone 41 enters the tracking shooting mode, it can be regarded as having entered the voice shooting mode.
  • the mobile phone 41 can also enter and exit the voice shooting mode through other icons or buttons.
  • the trigger operation may also be a voice command; that is, the user can turn the tracking shooting mode on and off through a voice command.
  • when the terminal device 100 receives the voice "turn on voice photography" or "turn on voice video", it performs voice recognition on the voice to obtain a voice recognition result.
  • if the voice recognition result includes a preset keyword, it can be determined that the voice is a voice command for triggering the voice shooting mode; the voice shooting mode is then entered in response to the voice command.
  • the preset keywords may be set in advance; for example, the preset keywords may include "open", "enter", "photograph", "shoot", and "video".
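  • The trigger check described above can be sketched as a keyword scan over the recognition result; the recognizer itself is out of scope here, and the keyword list mirrors the examples in the text:

```python
PRESET_KEYWORDS = ("open", "enter", "photograph", "shoot", "video")

def is_voice_shooting_trigger(recognition_result: str) -> bool:
    """True if the voice recognition result contains a preset keyword,
    e.g. 'turn on voice video' matches 'video'."""
    text = recognition_result.lower()
    return any(keyword in text for keyword in PRESET_KEYWORDS)

assert is_voice_shooting_trigger("turn on voice video")
assert not is_voice_shooting_trigger("what time is it")
```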
  • after the terminal device 100 enters the voice shooting mode, it performs target recognition on the original images with different viewing angles, and collects voice information in real time through the microphone to recognize the voice information.
  • step S302 is an optional step.
  • Step S303: target recognition is performed on the N original images respectively, and object information of each object in each original image is obtained.
  • the terminal device 100 includes N cameras with different fields of view, and N original images with different fields of view are obtained through the N cameras, where N is a positive integer greater than or equal to 1.
  • the terminal device 100 is a mobile phone
  • the mobile phone may include a wide-angle camera, a telephoto camera, and a main camera, and the three cameras have different field angles. Through the three cameras with different fields of view, the mobile phone can collect three original images with different fields of view.
  • the terminal device 100 may call the N cameras with different fields of view to collect the N original images with different fields of view when starting the shooting application; alternatively, after entering the voice shooting mode, the N cameras may be called to collect the N original images with different fields of view.
  • the terminal device 100 may perform target recognition on each original image to obtain object information of each object in each original image.
  • Each original image may contain one or more objects, which can include people, animals, and text.
  • Object information includes attribute information and area information.
  • the area information is used to represent the area where the object is located in the original image, that is, the area information can be used to represent the location and size of the object in the original image.
  • the area information may be pixel coordinate information of the bounding box selection area of the object, and the bounding box selection area is usually a rectangular area, and of course it may be an area of other shapes.
  • the area information may also be coordinate information of the area where the object is actually located in the image.
  • Attribute information can be used to describe the object, and the attribute information includes, but is not limited to, type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture. Different types of objects may contain different attribute categories. Exemplarily, for a person, the attribute categories include gender, age group, and clothing color; for an animal, the attribute categories include category and color.
  • the type refers to the category of the object, and the category may exemplarily include persons, animals, and text.
  • the activity information in the attribute information is used to describe the current action of the object, which may exemplarily include skateboarding, running, and hula hooping. That is, through the activity information in the attribute information, it can be known whether the current action of the object is running or hula hooping. The posture can be used to describe the current pose of an object; for example, the posture can indicate whether a person is standing or sitting.
  • the terminal device 100 may respectively input N original images into the target recognition model to obtain the object information output by the target recognition model.
  • the target recognition model may be pre-built and pre-trained.
  • the target recognition model may include a target detection network model and a target classification network model, etc., which are used to recognize the image so as to identify the attribute information of the objects in the image, as well as the position and size of each object in the image.
  • the terminal device 100 may also perform target recognition in other ways, for example, by combining image semantic segmentation and instance segmentation, to recognize the object information of each object in the image.
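  • As a concrete illustration of this step, the sketch below runs a generic detector over each original image and packages each detection as object information (area information plus attribute information). The detector interface is hypothetical; any detection and classification network with equivalent outputs would fit.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectInfo:
    bbox: tuple                                      # area information
    attributes: list = field(default_factory=list)   # attribute information

def recognize_objects(original_images, detector):
    """Run the (hypothetical) target recognition model on each of the N
    original images and collect per-image object information."""
    results = []
    for image in original_images:
        # Assumed interface: detector(image) yields (bbox, attributes).
        detections = detector(image)
        results.append([ObjectInfo(bbox, attrs) for bbox, attrs in detections])
    return results
```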
  • the mobile phone 41 can collect the original image corresponding to the video preview interface 44, and input the original image into the target recognition model to obtain the object information of each object output by the target recognition model.
  • for example, the mobile phone 41 can identify the information of the object 47 included in the original image; through the information of the object 47, it can be determined that the object is a woman who is hula hooping, as well as her position and size in the original image.
  • the terminal device 100 may perform target recognition on each frame of the collected original image in real time, or may perform target recognition every preset number of frames.
  • Step S304: extract the second index data of the attribute information of each object from the pre-established data set.
  • the above data set is a pre-established collection of common user information, which may be embodied as a database; the data set presets different attribute indexes for different object types.
  • for the person type, the data set may include, but is not limited to, the following attribute categories: gender, age group, hair length, hair color, clothing type, clothing color, posture, and activity.
  • a corresponding attribute index is set under each attribute type.
  • object types may also include, but are not limited to, animals, text, and the like. For the convenience of description, the following description takes a database as an example.
  • the attribute indexes of the person type in the database may be as shown in Table 1 below.
  • the terminal device 100 extracts the attribute information of each object in each original image, and matches the attribute information of each object with the attribute keywords under the corresponding attribute categories in the database, so as to find the first keyword in the database that matches the attribute information.
  • the first keyword may include words and/or characters. If a first keyword matching the attribute information is found in the database, the index data of the first keyword is used as the second index data.
  • for example, the object 46 is a little boy playing a skateboard, and the object is of the person type.
  • the attribute category indexes corresponding to the person type include A to N.
  • the mobile phone 41 obtains the attribute information and area information of the object 46 through target recognition.
  • the attribute information may include: male, child, skateboard, short hair, black hair, white T-shirt, . . . and standing.
  • the attribute information of the object 46 is matched with the attribute keywords under the corresponding attribute categories in the database. Specifically, for the gender attribute category, "male" in the attribute information is matched against the attribute keyword corresponding to 001A and the attribute keyword corresponding to 002A in Table 1 above, and it is determined that "male" in the attribute information matches the attribute index 001A.
  • the "children" in the attribute information are respectively corresponding to the attribute key words corresponding to 001B, the attribute key words corresponding to 002B, the attribute key words corresponding to 003B, and the attribute key words corresponding to 004B in Table 1 above.
  • the attribute key words are matched, and the attribute index matching "little boy" is determined to be 003B.
• similarly, the remaining attribute information, such as skateboard, short hair, black hair, white T-shirt, ..., and standing, is matched with the attribute keywords of the corresponding attribute categories in Table 1 above to obtain the attribute indexes corresponding to the attribute information of the object 46, that is, the second index data of the object 46.
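• As a minimal sketch of this matching step (assuming the hypothetical DATA_SET above and exact keyword matching, neither of which the embodiment mandates):

```python
def second_index_data(attributes, data_set):
    """Match each piece of attribute information against the attribute
    keywords of the data set and collect the matching attribute indexes."""
    indexes = []
    for attr in attributes:
        for category in data_set.values():
            for index, keywords in category.items():
                if attr in keywords:
                    indexes.append(index)
    return indexes

# For the object 46:
# second_index_data(["male", "child", "skateboard", "standing"], DATA_SET)
# -> ["001A", "003B", "001C", "002N"]
```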
• the attribute indexes of the object 46 are shown in Table 2 below. Similarly, the attribute indexes of the object 47 and of the object 48 can be obtained: the attribute indexes of the object 47 are shown in Table 3 below, and the attribute indexes of the object 48 are shown in Table 4 below.
  • Step S305 for each object, associate the second index data with the area information to obtain an association relationship of each object, where the association relationship is an association relationship between the index data and the area information.
• the terminal device 100 associates the second index data of the object with the region information of the object in each original image, so as to establish, for each object, the relationship between its index data and its region information in each original image. Based on the association relationship, the region information of the object can be found through the index data.
  • the second index data of the object 46 in FIG. 4 may be as shown in Table 2 above.
  • the mobile phone 41 includes a wide-angle camera, a main camera and a telephoto camera, and the wide-angle camera has the largest field of view, the main camera has the second largest field of view, and the telephoto camera has the smallest field of view.
  • the area information of the object refers to the pixel coordinate information of the bounding box selection area of the object.
• the second index data of the object 46 in Table 2 above is associated with the coordinates of the bounding box selection area of the object 46 in the original image collected by the wide-angle camera and with the coordinates of the bounding box selection area of the object 46 in the original image collected by the main camera, thereby establishing the association relationship between the index data of the object 46 and the area information.
• the association relationship of the object 46 is shown in Table 5 below.
• the index data of the object 46 includes 001A, 003B, 001C, 002N, and so on; through the index data, the area information of the object 46 in the original image collected by the wide-angle camera and the area information of the object 46 in the original image collected by the main camera can be found.
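• Since Table 5 is likewise not reproduced here, the association relationship can be sketched as the following structure; the coordinate values are placeholders, not values from the document:

```python
# Association relationship of the object 46: the second index data is tied to
# the object's area information in each camera's original image.
ASSOCIATION = {
    "object_46": {
        "index_data": ["001A", "003B", "001C", "002N"],
        "area_info": {                      # placeholder coordinates
            "wide_angle": (520, 310, 760, 880),
            "main": (180, 120, 640, 960),
        },
    },
}

def area_by_index(index, association):
    """Look up area information through any single index datum."""
    return [obj["area_info"] for obj in association.values()
            if index in obj["index_data"]]
```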
• the processor 110 of the terminal device 100 performs target recognition on the original image captured by the camera 140 to obtain the object information of each object in the original image; the processor 110 reads the database in the memory 120 and extracts from the database the second index data of the attribute information of each object; for each object, the processor 110 associates the second index data with the area information, obtains the association relationship between the index data and the area information, and stores the association relationship in the memory 120.
• after the processor 110 collects the first voice through the microphone 130C and the audio module 130, it performs voice recognition on the first voice to obtain the first voice recognition result; the processor 110 then reads the database from the memory 120, extracts the first index data of the first speech recognition result from the database, reads the pre-established association relationship, and determines the target area information corresponding to the first index data according to the first index data and the association relationship.
• Step S306 Detect the first voice information.
  • the terminal device 100 can call the microphone to collect the first voice information in real time.
  • Step S307 Perform speech recognition on the first speech information to obtain a first speech recognition result.
• the terminal device 100 can extract the first voiceprint feature of the first voice information and then determine the similarity between the first voiceprint feature and the pre-stored voiceprint feature; if the similarity between the first voiceprint feature and the pre-stored voiceprint feature is greater than a preset threshold, the first voice information is considered to be the voice of a specific user, and voice recognition is performed according to the first voiceprint feature and the first voice information to obtain the first voice recognition result.
• otherwise, the terminal device 100 may prompt the user that there is no permission, or may store the first voiceprint feature for subsequent voice command identification.
  • the terminal device 100 may input the first voiceprint feature and the first voice information into the semantic understanding model, and obtain the first voice recognition result output by the semantic understanding model.
  • the semantic understanding model may include a voice feature analysis module and a semantic analysis module, the voice feature analysis module may be used to filter background interference sounds, and the semantic analysis module may be used to extract user voice commands.
• alternatively, the terminal device 100 may directly recognize the first voice information without extracting its voiceprint feature, that is, without distinguishing whether the first voice information is the voice of a specific user, to obtain the first speech recognition result.
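• A minimal sketch of the voiceprint gate on the specific-user path, assuming voiceprints are embedding vectors compared by cosine similarity (the embodiment fixes neither the feature nor the metric, and the threshold value is an assumption):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # assumed value for the preset threshold

def is_specific_user(first_voiceprint: np.ndarray,
                     stored_voiceprint: np.ndarray) -> bool:
    """Return True when the first voiceprint feature is similar enough to the
    pre-stored voiceprint feature to treat the speaker as the specific user."""
    sim = float(np.dot(first_voiceprint, stored_voiceprint)
                / (np.linalg.norm(first_voiceprint) * np.linalg.norm(stored_voiceprint)))
    return sim > SIMILARITY_THRESHOLD
```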
• Step S308 When it is determined according to the first speech recognition result that the first speech information is the speech used to specify the target object, extract the first index data of the first speech recognition result from the data set based on the attribute keywords corresponding to the second index data.
  • the terminal device 100 may further confirm whether the first voice information is a voice command.
• if the first voice recognition result includes a specific keyword, the first voice information may be considered to be the voice for specifying the target object; the specific keyword is preset and may include, for example, "shoot", "track", and "take photo".
• the first voice information can be a voice command for specifying the target object for the first time, a voice command for adding or removing a target object, or a voice command for replacing the target object; all of these can be considered voice commands used to specify the target object. Adding or removing target objects and replacing the target object are relative to the already specified target objects.
  • the voice "follow the little boy playing skateboard" collected in Fig. 4 is the voice command for specifying the target object for the first time.
• subsequently, the target object can be replaced from the object 46 with another object by collecting further first voice information, and other objects can also be added as target objects, or removed, on the basis of the object 46.
• if the first voice information is not a voice used to specify the target object, the terminal device 100 may prompt the user to re-input the voice, or may not perform a prompting operation.
  • the specific voice command may exemplarily include commands to adjust the display size and display area of the tracked object, and the like.
  • the terminal device 100 may perform an operation corresponding to the specific voice command.
• the terminal device 100 may match the first voice recognition result with the attribute keywords corresponding to the second index data, find the second keyword that matches the first voice recognition result, and then use the attribute index corresponding to the second keyword as the first index data of the first speech recognition result.
• alternatively, the first speech recognition result may be matched with every keyword in the database; however, matching only against the attribute keywords corresponding to the second index data narrows the matching range and improves the matching accuracy and matching speed.
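• A sketch of this narrowed matching, reusing the hypothetical DATA_SET above and simple substring matching (an assumption; the embodiment does not specify the match rule):

```python
def first_index_data(recognition_result, per_object_second_index, data_set):
    """Match the speech recognition result only against attribute keywords
    whose indexes already occur in some object's second index data."""
    candidates = set()
    for second in per_object_second_index:   # one index list per object
        candidates.update(second)
    matched = []
    for category in data_set.values():
        for index, keywords in category.items():
            if index in candidates and any(k in recognition_result for k in keywords):
                matched.append(index)
    return matched

# first_index_data("track the little boy playing skateboard",
#                  [["001A", "003B", "001C", "002N"]], DATA_SET)
# -> ["001A", "003B", "001C"]
```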
• the mobile phone 41 displays the video recording preview interface 44 in response to the user operation; after entering the voice shooting mode, the mobile phone 41 performs target recognition on each of the N collected original images, obtains the object information of each object in each original image, extracts the second index data of the attribute information of each object from the database, and establishes the association relationship between the index data and the area information. Then, at a certain moment, the mobile phone 41 collects the first voice information through the microphone, and the first voice information is "tracking the little boy playing skateboard".
• after determining that the first voice information is the voice of a specific user according to the pre-stored voiceprint feature, the mobile phone 41 performs voice recognition on the first voice information and obtains the first voice recognition result "tracking the boy playing skateboard".
• the recognition result contains the keyword "tracking", so it is determined that the voice is a voice command used to specify the target object; then the index data of "skateboard" and "boy" are searched from the attribute keywords corresponding to the second index data to obtain the first index data of the first speech recognition result.
  • the index data of "skateboard” is 001C
  • the index data of "boy” are 001A and 003B, that is, the first index data includes 001C, 001A and 003B.
  • Step S309 Determine the target area information corresponding to the first index data according to the association relationship between the index data and the area information, and the object corresponding to the target area information is the target object.
  • the terminal device 100 searches the index data set for an index matching the first index data, determines the object with the largest number of matching items as the target object, and determines the area information corresponding to the object as the target area information.
• if no matching object is found, the terminal device 100 may prompt the user that no object matching the voice can be found.
  • the prompt mode may include voice prompt and/or text prompt.
  • the terminal device can send out a voice "No matching object can be found, please re-enter the voice", and the text corresponding to the voice can also be displayed in the viewfinder.
• the index data set refers to a collection including the second index data of each object.
  • the index data set may include the second index data of object 46 , the second index data of object 47 , the second index data of object 48 , and the second index data of object 49 .
  • the terminal device 100 may prompt the user for further selection.
  • the prompt information may be text prompt information, voice prompt information, or other types of prompt information.
  • the first speech recognition result is "following the little boy playing skateboard", that is, the first speech recognition result only contains the object "little boy", but two target objects are matched.
  • the mobile phone can let the user execute Click the operation, or let the user re-input the voice to further determine the target object that the user wants to specify from the two target objects.
  • the terminal device 100 uses the area information with the most matching items as the target area information corresponding to the first index data.
  • the first speech recognition result is "tracking a boy playing skateboard”
  • the index data corresponding to "skateboard” is 001C
  • the index data corresponding to "boy” are 001A and 003B. That is, the first index data includes 001C, 001A, and 003B.
• assume the second index data corresponding to the area information of the object 46 includes 001A, 003B, 001C, 002D, 005E, 008F, and 002N, and the second index data corresponding to the area information of the object 48 includes 001A, 003B, 003C, 002D, 005E, 001F, and 002N. In the second index data of the object 46, the target indexes that match the indexes in the first index data include 001C, 001A, and 003B, that is, the number of items matching the first index data is 3; in the second index data of the object 48, the target indexes that match the indexes in the first index data include only 001A and 003B and do not include 001C, that is, the number of matching items is 2. The area information of the object 46 is therefore used as the target area information.
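• A minimal sketch of this majority-match rule; the data layout follows the hypothetical ASSOCIATION structure above:

```python
def select_target(first_index, association):
    """Pick the object whose second index data shares the most items with
    the first index data; its area information is the target area information.
    Returns None when nothing matches, in which case the user is prompted."""
    def match_count(obj):
        return len(set(first_index) & set(obj["index_data"]))
    best = max(association.values(), key=match_count, default=None)
    if best is None or match_count(best) == 0:
        return None
    return best["area_info"]

# With first_index = ["001C", "001A", "003B"], the object 46 (3 matching
# items) wins over the object 48 (2 matching items: 001A and 003B).
```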
• the target area information includes the coordinate information of the bounding box selection area in the original image collected by the wide-angle camera and the coordinate information of the bounding box selection area in the original image collected by the main camera.
  • the terminal device 100 may use a corresponding mark to identify the target object, so as to let the user know the target object.
• the marking method can be, but is not limited to, a rectangular frame. For example, referring to FIG. 4, after the mobile phone 41 collects the voice "tracking the boy playing skateboard", performs voice recognition on the voice, and determines the target area information, it can determine that the target object specified by the user is the object 46; according to the target area information, the area where the object 46 is located in the original image can be determined, and a rectangular frame can be used to mark the object 46, that is, the rectangular frame 411 is used to frame the object 46 in the video preview interface 410, prompting that the target object designated by the user's voice is the object 46.
  • the mobile phone 41 may not use the corresponding mark to identify the target object.
• Step S310 Determine the target image from the N original images, and then crop a sub-image from the target image according to the target area information, where the sub-image includes the target object and the target image refers to the image with the smallest field of view in the image set.
  • the image set includes at least one original image that can completely include the bounding box selection area of the target object, that is, images that can completely include the bounding box selection area of the target object are selected from the N original images, and these images are formed into an image set.
• the terminal device 100 can first select, from the N original images, the images that can completely contain the bounding box selection area of the target object to form an image set, and then select the original image with the smallest field of view from the image set as the target image.
  • the bounding box selection area of the target object is usually a bounding rectangle area.
• assume the target object is the object 46. The original images that can completely include the bounding box selection area of the object 46 are the original image captured by the wide-angle camera and the original image captured by the main camera, while the object 46 is not included in the original image captured by the telephoto camera. The image set therefore includes the original image collected by the wide-angle camera and the original image collected by the main camera. Since the field of view of the wide-angle camera is larger than the field of view of the main camera, the original image collected by the main camera is determined as the target image. Then, according to the target area information of the object 46 in the original image collected by the main camera, a sub-image including the object 46 is obtained by cropping.
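• A sketch of this target-image selection; the camera names, field-of-view values, and the convention that a missing bounding box is None are assumptions for illustration:

```python
def select_target_image(candidates):
    """candidates: camera -> (field_of_view, bbox or None), where bbox is the
    target's bounding box selection area in that camera's original image and
    None means the box is absent or not fully contained. Returns the camera
    whose image has the smallest field of view while still framing the target."""
    image_set = {cam: fov for cam, (fov, bbox) in candidates.items()
                 if bbox is not None}
    return min(image_set, key=image_set.get) if image_set else None

print(select_target_image({
    "wide_angle": (120.0, (520, 310, 760, 880)),
    "main":       (80.0,  (180, 120, 640, 960)),
    "telephoto":  (30.0,  None),   # the object is not in the telephoto frame
}))  # -> "main", mirroring the example above
```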
• after the terminal device 100 determines the target image, it can expand the bounding box selection area of the target object by a certain ratio (for example, 5%), and then crop the sub-image out of the target image according to the target area information and the preset image width-to-height ratio.
  • the preset image aspect ratio can be set according to actual needs.
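• A sketch of the expansion-and-crop computation, assuming a 16:9 preset ratio and the 5% expansion mentioned above (both parameters are configurable, not mandated values):

```python
def crop_sub_image(image_w, image_h, bbox, aspect=16 / 9, expand=0.05):
    """Expand the bounding box selection area by `expand`, then grow the
    shorter side until the crop matches the preset width/height ratio,
    clamped to the target image. Returns (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    w = (x1 - x0) * (1 + expand)
    h = (y1 - y0) * (1 + expand)
    if w / h < aspect:
        w = h * aspect       # too narrow: widen to the preset ratio
    else:
        h = w / aspect       # too flat: heighten to the preset ratio
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    nx0, ny0 = max(0.0, cx - w / 2), max(0.0, cy - h / 2)
    nx1, ny1 = min(float(image_w), cx + w / 2), min(float(image_h), cy + h / 2)
    return int(nx0), int(ny0), int(nx1), int(ny1)
```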
• when the first voice information specifies a single target object, one sub-image including the target object is obtained by cropping; when the first voice information specifies multiple target objects, multiple sub-images are obtained by cropping. For example, the voice collected by the terminal device 100 is "tracking a little boy playing a skateboard and a little boy running"; the target objects include the "little boy playing a skateboard" and the "little boy running", and a sub-image containing each of them is obtained by cropping according to the corresponding target area information.
• Step S311 Generate an output image based on the sub-image, and then display the output image in the viewfinder.
  • the terminal device 100 can reduce or enlarge the single sub-image to a preset resolution to obtain an output image, and then display the output image in the viewfinder according to the expected display mode.
• in the case of multiple sub-images, the terminal device 100 may stitch the multiple sub-images into one image and then reduce or enlarge the stitched image to obtain an output image, or may reduce or enlarge each sub-image and then splice the results to obtain the output image; the output image is then displayed in the viewfinder according to the expected display mode.
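• As a sketch of output-image generation, assuming Pillow for the image operations and an assumed preset resolution of 1280×720 (the document fixes neither):

```python
from PIL import Image

PRESET = (1280, 720)  # assumed preset output resolution

def make_output(sub_images):
    """One sub-image: scale it to the preset resolution. Several sub-images:
    scale each to an equal-width strip and splice them side by side."""
    if len(sub_images) == 1:
        return sub_images[0].resize(PRESET)
    strip_w = PRESET[0] // len(sub_images)
    out = Image.new("RGB", PRESET)
    for i, sub in enumerate(sub_images):
        out.paste(sub.resize((strip_w, PRESET[1])), (i * strip_w, 0))
    return out
```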
  • the mobile phone 41 collects the voice "tracking the little boy playing skateboard", it recognizes the voice information, and according to the voice recognition result, finds the corresponding target area information; The target object is selected, and the target object designated by the user's voice is prompted as the object 46 . At this time, the user can follow the shooting of the object 46 by clicking the shooting button. After receiving the click operation on the shooting button, the mobile phone 41 can crop a sub-image from the target image, and then generate an output image based on the sub-image, and display the output image in the viewfinder frame, that is, the display interface 412 .
• alternatively, after the mobile phone 41 recognizes that the voice contains the keyword "tracking" and determines that the target object is the object 46, the interface 412 can be displayed automatically after the display interface 410.
• alternatively, after the mobile phone 41 collects the voice "tracking the little boy playing skateboard" and determines through voice recognition that the object 46 needs to be tracked and photographed, the mobile phone 41 can directly display the interface 412; that is, the mobile phone 41 may not display the interface 410 but, after recognizing the target object and the shooting action, directly track and shoot the object 46.
• in the tracking shooting mode, when the target object moves, the terminal device 100 stably tracks the target object.
  • FIG. 5 is a schematic diagram of stable tracking provided by an embodiment of the present application, and the scenarios in FIG. 4 and FIG. 5 are the same.
• the mobile phone 41 collects an image 51 through the wide-angle camera, the main camera, or the telephoto camera, where the image 51 includes a plurality of objects; target recognition is performed on the image 51 to obtain the object information of each object in the image 51; the second index data of the attribute information of each object is extracted from the database, and the association relationship between the index data of each object and the area information is established.
• when detecting the voice 52 input by the user, the mobile phone 41 performs voice recognition on the voice 52 in combination with the object information and obtains the voice recognition result "tracking the boy playing skateboard"; then, according to the association relationship between the index data and the area information, the target object corresponding to the speech recognition result and the target area information corresponding to the target object are determined.
  • the mobile phone 41 determines that the target object specified by the voice 52 is the object 53 , and according to the region information of the object 53 , a sub-image including the object 53 is obtained by cropping, and the output image 54 is generated based on the sub-image.
• at the next moment, the mobile phone 41 captures an image 55; compared with the image 51, the position of the target object 53 has changed in the image 55.
  • the mobile phone 41 performs target recognition on the image 55 to obtain the object information, and then according to the area information of the object 53 , a sub-image including the object 53 is obtained by cropping, and an output image 56 is generated based on the sub-image.
• the terminal device first performs target recognition on the images of each field of view to obtain the object information of each object, then extracts the index data corresponding to the attribute information of each object from the database, and establishes the association relationship between the index data and the area information.
• in this way, the user does not need to manually select an object in the viewfinder to designate the target object, but can conveniently and accurately designate the tracking target through voice, which improves the convenience of shooting.
  • the embodiments of the present application can implement single-target tracking, and can also implement multi-target tracking.
  • FIG. 6 is a schematic diagram of tracking and shooting of multi-target tracking according to an embodiment of the present application.
• after the mobile phone 61 receives the click operation on the camera 62, the mobile phone 61 starts the camera application and displays the captured image in the viewfinder.
  • the mobile phone 61 displays a video recording preview interface 63, and the video recording preview interface 63 includes shooting subjects such as object 64, object 65, object 66, and object 67.
• after the mobile phone 61 enters the voice shooting mode or the tracking shooting mode, it collects original images of different fields of view through cameras with different fields of view and performs target recognition on each original image to obtain the object information of each object; based on the pre-established database, the index data of the attribute information of each object is extracted, and the association relationship between the index data and the area information is established.
• the mobile phone 61 collects the voice "Two screens to track the boy playing skateboard and the girl playing hula hoop" through the microphone and recognizes the voice to obtain the voice recognition result; it then extracts the index data of the voice recognition result from the database and determines the target objects corresponding to the voice recognition result and the target area information based on the association relationship between the index data and the area information.
• when the voice specifies at least two target objects, the first index data of the speech recognition result includes the indexes of the at least two target objects, and for each target object, the target area information is determined based on the association relationship between the index data and the area information.
  • the voice collected by the mobile phone 61 includes two target objects, namely "a boy playing a skateboard” and "a girl playing a hula hoop".
• the first index data corresponding to the speech recognition result includes the index data corresponding to the "boy playing skateboard" and the index data of the "girl playing hula hoop"; then, according to the index data corresponding to the "boy playing skateboard", the target area information of the "boy playing skateboard" is determined based on the association relationship, and according to the index data of the "girl playing hula hoop", the target area information of the "girl playing hula hoop" is determined based on the association relationship.
  • the mobile phone 61 determines that the target objects corresponding to the voice "two-screen tracking of a boy playing a skateboard and a girl playing a hula hoop" are the object 64 and the object 66 .
• the mobile phone 61 uses the circumscribed rectangle 69 to frame the object 66 according to the area information corresponding to the object 66, and uses the circumscribed rectangle 610 to frame the object 64 according to the area information of the object 64. That is, after the mobile phone 61 collects the voice and determines the target objects specified by the voice, the video recording preview interface 68 is displayed, and in the preview interface 68, the target objects specified by the voice are prompted to be the object 64 and the object 66.
• the shooting targets corresponding to the voice "Two-screen tracking of a boy playing a skateboard and a girl playing a hula hoop" are the objects 64 and 66; based on the extracted keywords "two screens" and "tracking", the shooting action can also be determined to be tracking shooting, and the display mode to be two-screen display.
• the mobile phone 61 determines the target image that can completely include the frame selection area of the object 64 and the frame selection area of the object 66 and has the smallest field of view; the target image for the object 64 and the target image for the object 66 may or may not be the same. Specifically, the target image of the object 64 is determined for the object 64, and the target image of the object 66 is determined for the object 66; then, according to the target area information, the first sub-image corresponding to the object 64 is cropped from the target image of the object 64, and the second sub-image corresponding to the object 66 is cropped from the target image of the object 66. Finally, the first sub-image and the second sub-image are spliced to obtain an output image, which is displayed in the viewfinder.
  • the mobile phone 61 displays a tracking shooting interface 611 .
• the left half area of the tracking shooting interface 611 tracks and displays the object 64, and the right half area tracks and displays the object 66.
  • the mobile phone 61 can directly display the tracking shooting interface 611 without displaying the video preview interface 68 .
  • the user's voice may not specify two-screen tracking, and when the mobile phone 61 determines that there are two target objects, it automatically uses two screens to track the two target objects.
  • the collected speech is "tracking a boy on a skateboard and a girl on a hula hoop".
• FIG. 6 illustrates the multi-target tracking scenario by taking two targets as an example; for more targets, the process is similar to that in FIG. 6 and is not repeated here.
• for example, the collected voice is "tracking a boy playing a skateboard, a girl playing a hula hoop, and a boy running"; when the mobile phone 61 determines that there are three target objects specified by the voice, it uses three-screen tracking to display the three target objects.
• during tracking shooting, prompt information may be displayed on the interface to prompt the user to adjust the pose of the camera or the terminal, so as to improve user experience.
• specifically, the terminal device 100 detects the distance between each edge of the bounding box selection area of the target object and the corresponding edge of the image with the largest field of view; if the smallest of these distances is less than a distance threshold, it can be considered that the target object is about to exceed the maximum capture range of the camera.
• in this case, prompt information can be displayed in the viewfinder frame; the prompt information can be used to prompt the user to adjust the camera pose and, further, to indicate the direction in which to adjust the camera. For example, the prompt information can be text information and symbol information; that is, the text "Please adjust the camera direction" is displayed in the viewfinder, and an arrow is used to indicate the direction of adjustment.
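• A minimal sketch of this edge check; the threshold value and the message wording are assumptions:

```python
def adjust_prompt(bbox, frame_w, frame_h, threshold=40):
    """Measure the gap between each edge of the target's bounding box and the
    corresponding edge of the largest-field-of-view image; if the smallest gap
    drops below the threshold, return a prompt naming the direction to turn."""
    x0, y0, x1, y1 = bbox
    gaps = {"left": x0, "top": y0,
            "right": frame_w - x1, "bottom": frame_h - y1}
    edge, gap = min(gaps.items(), key=lambda kv: kv[1])
    return f"Please adjust the camera direction: {edge}" if gap < threshold else None
```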
• after the terminal device 100 determines the target object specified by the voice and tracks and photographs the target object, if second voice information is collected, voice recognition is performed on the second voice information to obtain a second voice recognition result.
• when the second voice information is a voice command for adjusting the display mode of the target object, the cropping position of the sub-image of the target object is adjusted according to the target area information of the target object.
  • the display manner of the target object includes the display position and/or display size of the target object.
• for example, when the second voice information indicates moving the target object to the right, the terminal device 100 adjusts the cropping position of the sub-image so that the target object in the output image deviates from the center of the screen toward the right.
• by default, sub-images are cropped so that the target object remains in the center of the frame as far as possible.
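• A sketch of the crop-position adjustment; mapping a voice command such as "move to the right" to a relative anchor like rel_x = 0.75 is an assumption for illustration:

```python
def crop_with_anchor(image_w, image_h, bbox, crop_w, crop_h,
                     rel_x=0.5, rel_y=0.5):
    """Place the crop window so the target's center lands at the relative
    position (rel_x, rel_y) of the output frame; (0.5, 0.5) keeps the target
    centered, while rel_x = 0.75 shifts it toward the right of the frame."""
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    x0 = min(max(0.0, cx - rel_x * crop_w), image_w - crop_w)
    y0 = min(max(0.0, cy - rel_y * crop_h), image_h - crop_h)
    return int(x0), int(y0), int(x0 + crop_w), int(y0 + crop_h)
```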
• when it is determined according to the second voice recognition result that the second voice information is a voice command for enlarging or reducing the target object, that is, when the second voice information is used to adjust the display size of the target object, an appropriate field-of-view image can be selected from the images of the different fields of view as the target image, and a sub-image is then obtained by cropping the target image.
• for example, when the terminal device 100 determines according to the second voice recognition result that the second voice information is a voice command for zooming in on the target object, it selects the image corresponding to the telephoto camera as the target image and crops the sub-image from the original image collected by the telephoto camera.
  • the user can adjust the display position and/or display size of the target object through voice, which further improves the convenience of shooting.
• the following takes as an example a case in which the terminal device 100 is a mobile phone including a main camera, a wide-angle camera, and a telephoto camera, whose fields of view from large to small are: the wide-angle camera, the main camera, and the telephoto camera.
• FIG. 7A is a schematic diagram of an image processing process in a tracking shooting scene, and FIG. 7B is a schematic interface diagram in a tracking shooting scene.
• the mobile phone collects images of different fields of view through the three cameras and performs target recognition on the image of each field of view to obtain the object information of each object; based on the attribute information and area information of each object, the association relationship between the index data and the area information is established.
  • the mobile phone detects the voice "tracking the boy playing skateboard", it performs voice command recognition in combination with the object information, and determines that the target object specified by the voice is object 72; and it is assumed that the images collected by the main camera and the wide-angle camera can completely contain the object at this time.
  • the image captured by the main camera is used as the target image, and then according to the target area information of the object 72, the image captured by the main camera is cropped according to the default scale and position to obtain the sub-image of the object 72;
  • the sub-image generates an output image 73, and displays the image 73 on the tracking shooting interface 78.
  • FIG. 7B For details, please refer to FIG. 7B.
• the mobile phone continues to track and photograph the object 72. At a certain moment it collects the voice "go to the right"; the mobile phone recognizes the voice and determines that it is a voice command for adjusting the display position of the target object. Then, according to the target area information corresponding to the object 72, the mobile phone adjusts the cropping position and crops the sub-image from the image collected by the main camera so that the object 72 is displayed on the right side of the screen; that is, by adjusting the cropping position, the sub-image still completely includes the frame selection area of the object 72 while the object 72 deviates from the center of the screen toward the right in the output image.
• as shown in FIG. 7A, by adjusting the cropping position, the mobile phone obtains a sub-image by cropping; based on the sub-image, an image 74 is obtained, and the image 74 is displayed on the tracking shooting interface 79 in FIG. 7B.
• when the mobile phone determines that the original image captured by the main camera can no longer completely include the frame selection area of the object 72, it selects the wide-angle image that can completely include the frame selection area of the object 72; that is, the image collected by the wide-angle camera is determined as the target image. Then, according to the target area information corresponding to the object 72, the sub-image of the object 72 is obtained by cropping from the image collected by the wide-angle camera, an image 75 is generated based on the sub-image, and the image 75 is displayed on the tracking capture interface 710 as shown in FIG. 7B.
• as the object 72 continues to move, prompt information can be generated to prompt the user to adjust the camera pose.
• for example, the mobile phone generates an image 76 and displays the image 76 on the tracking shooting interface 711; the image 76 includes the text prompt "Please adjust the camera direction" and an arrow prompting the user to move the mobile phone to the right.
• when the mobile phone detects the voice "zoom in on the upper body", it performs voice recognition on the voice to obtain a voice recognition result. The pre-established database includes keywords of body parts, for example, upper body, lower body, head, eyes, mouth, and upper limbs. When it is determined according to the speech recognition result that the voice is used to enlarge the target object, the image captured by the telephoto camera is determined as the target image from the images of the three different fields of view, a sub-image is obtained by cropping the target image, an image 77 is generated based on the sub-image, and the image 77 is displayed on the tracking capture interface 712 as shown in FIG. 7B.
• when the terminal device 100 is tracking and photographing the target object, the target object can be changed, added, or removed through voice commands.
  • the scene in FIG. 8 is the same as that in FIG. 4 .
• after the mobile phone 41 displays the tracking shooting interface 412, the mobile phone 41 continues to perform tracking shooting of the object 46.
  • the mobile phone 41 collects the voice "change to a running boy", and then the mobile phone 41 performs voice recognition on the voice to determine that the user needs to replace the tracking object from object 46 to object 48; extract the index of the voice After the data, based on the relationship between the index data and the area information, the target area information of the object 48 is determined; A sub-image of the object 48 is cropped out of the target image, and an output image is generated based on the sub-image, and the output image is displayed, that is, the tracking shooting interface 413 is displayed.
  • the tracking object displayed on the tracking shooting interface 413 is the object 48 , that is, the tracking object is changed from the object 46 to the object 48 .
  • the mobile phone 41 continues to track and photograph the object 48 .
  • the mobile phone 41 collects the voice "add another screen to track the girl playing hula hoop", and then the mobile phone 41 performs voice recognition on the voice to obtain a voice recognition result; when the voice is determined to be used according to the voice recognition result
  • the voice command of the tracking object after extracting the index data of this voice recognition result, based on the correlation between the index data and the area information, determine that the object designated by this voice is the target area information of the object 47 and the object 47; Then, according to the target area information corresponding to the object 47, the sub-image of the object 47 is obtained by cropping; finally, the sub-image of the object 48 and the sub-image of the object 47 are spliced to obtain an output image, and the output image is displayed on the tracking shooting interface 414.
• on the tracking shooting interface 414, the object 48 is tracked in the left area and the object 47 is tracked in the right area.
• after adding the tracking object, the mobile phone 41 can continuously track the object 48 and the object 47 through the two screens. At a certain moment, the mobile phone 41 collects the voice "remove the running boy"; the mobile phone 41 recognizes the voice to obtain a voice recognition result. When it is determined according to the voice recognition result that the voice is a voice command for removing a tracking object, the mobile phone 41 determines that the object designated by the voice is the object 48, that is, the object to be removed is the object 48; then, based on the target area information of the object 47, a sub-image of the object 47 is obtained by cropping, an output image is generated based on the sub-image, and the output image is displayed on the tracking shooting interface 415.
• FIG. 9A is a schematic diagram of an image processing process in a tracking shooting scene, and FIG. 9B is a schematic diagram of an interface corresponding to FIG. 9A; the scenarios of FIG. 9A and FIG. 9B are the same as the scenario of FIG. 6.
  • the mobile phone collects images of different field of view through cameras with different field of view, performs target recognition on the images of each field of view, and obtains object information of each object; based on the attribute information and area information of each object , to establish the relationship between index data and regional information.
• after the mobile phone detects the voice "two-screen tracking of a boy playing a skateboard and a girl playing a hula hoop", it performs voice command recognition in combination with the object information and determines that the tracking objects specified by the voice are the object 92 and the object 93; the mobile phone frame-selects the object 92 and the object 93, specifically as shown in the image 91, which also includes the object 94 and the object 95. The mobile phone then determines the target image that can completely include the frame selection areas of the object 92 and the object 93 and has the smallest field of view; according to the target area information of the object 92 and the target area information of the object 93, the sub-image of the object 92 and the sub-image of the object 93 are obtained by cropping from the target image. Finally, the two sub-images are spliced to generate an image 96, and the image 96 is displayed on the tracking shooting interface 99 shown in FIG. 9B. The left region of the tracking shooting interface 99 tracks and displays the object 92, and the right region tracks and displays the object 93.
  • the mobile phone includes three cameras: a main camera, a telephoto camera, and a wide-angle camera, and a set of original images collected by the mobile phone is shown in FIG. 9C .
  • the original image set includes the image 91 collected by the main camera, the image 912 collected by the telephoto camera, and the image 913 collected by the wide-angle camera.
• in the original image set, the image 913 with a larger field of view shows more objects, while the image 912 with a smaller field of view shows fewer objects.
• for the object 92, the images captured by the main camera, the wide-angle camera, and the telephoto camera all include the object, but only the images captured by the main camera and the wide-angle camera can completely include the bounding box selection area of the object 92; that is, the image set of the object 92 includes the image 91 and the image 913. Then, the image with the smallest field of view is determined from the image set as the target image; since the field of view of the image 91 is smaller than that of the image 913, the image 91 is determined to be the target image.
  • the sub-image 914 of the object 92 is cropped from the image 91 according to the predetermined aspect ratio.
• similarly, the sub-image of the object 93 is obtained by cropping; then the sub-image 914 of the object 92 and the sub-image of the object 93 are zoomed, and the output image 96 is obtained by splicing.
  • the mobile phone continuously tracks and shoots object 92 and object 93, and at a certain moment collects the voice "remove the boy playing skateboard and replace it with a boy who runs".
  • the mobile phone recognizes the voice and determines that the voice is used to replace the tracking object.
• after the replacement, the left region of the tracking shooting interface 910 tracks and displays the object 94, and the right region tracks and displays the object 93.
• the mobile phone then continuously tracked and photographed the object 93 and the object 94, and at a certain moment collected the voice "add another screen to track the little girl in red"; the mobile phone recognized the voice and determined that the voice was used to add a tracking object.
• after the addition, the left region of the tracking shooting interface 911 tracks and displays the object 94, the middle region tracks and displays the object 93, and the right region tracks and displays the object 95.
• in the above tracking shooting scene, the terminal device 100 can perform target recognition on the images of each field of view collected in real time, obtain the object information of each object, and establish the association relationship between the region information and the index data of each object. On this basis, after detecting the user's voice, the terminal device recognizes the voice to obtain a voice recognition result and, based on the association relationship between the region information and the index data, determines the information contained in the voice, for example, the target object corresponding to the voice and the actions contained in the voice; finally, the terminal device performs the actions corresponding to the voice commands.
• in the process of tracking and photographing, the user can change, add, or remove the tracking object through voice, and can also adjust the display size and display position of the tracking object through voice, which further improves the convenience of tracking shooting. The above exemplarily introduces tracking shooting in a video recording scene; the technical solutions provided in the embodiments of the present application can also be applied to photographing scenes.
  • the main interface 102 of the mobile phone 101 includes multiple applications such as the camera 103 , smart life, settings, calendar, clock, and gallery.
• after the mobile phone 101 receives the click operation on the camera 103, the mobile phone 101 responds to the click operation and displays the photo preview interface 104; the viewfinder of the photo preview interface 104 includes the objects 105, 106, 107, 108, and the like.
• after the mobile phone 101 enters the voice shooting mode, it performs target recognition on images of different fields of view to obtain the object information, extracts the index data of the attribute information of each object from the database, and establishes the association relationship between the index data and the area information.
• after the mobile phone 101 collects the voice "following the running boy", the mobile phone 101 displays the photo preview interface 109.
• specifically, the mobile phone 101 performs speech recognition on the voice, obtains the speech recognition result, and extracts the index data of the speech recognition result; then, according to the association relationship between the index data and the area information, it finds the target object corresponding to the speech recognition result and the target area information of that target object, and determines that the target object corresponding to the voice is the object 107. Next, from the images with different fields of view, the target image that can completely include the frame selection area of the object 107 and has the smallest field of view is determined; a sub-image is cropped from the target image according to the target area information of the object 107, the sub-image is zoomed, an output image is obtained, and the photo preview interface 109 is displayed.
  • the mobile phone 101 continuously tracks the object 107 and displays a photo preview interface 1010 .
  • the user can click the shooting button 1011 to take a picture of the object 107 .
  • the mobile phone 101 captures a photo of the object 107 in response to the click operation.
  • the mobile phone 101 displays the photo preview interface 1013 in response to the click operation, and the photo preview interface 1013 displays the photo of the object 107 .
• in the photographing scene, the user can likewise change, add, or remove the tracking object through voice commands, and can also adjust the display position of the tracking object and zoom in or out on it through voice commands; when the tracking object is about to leave the capture range, the user is prompted to adjust the camera pose.
  • the terminal device 100 is a mobile phone, and the mobile phone includes a wide-angle camera, a main camera, and a telephoto camera, and the three cameras have different fields of view.
• a photo or video of the tracking target can be obtained by shooting in two ways: one is the traditional way, and the other is the way provided by the embodiments of the present application.
• in the traditional way, the user aims at the target direction and manually moves the mobile phone to find the target object; after finding the target object, the user manually zooms in or out to adjust the display ratio and focal length of the target object in the image; then, when the target object moves, the user manually tracks the moving target, that is, manually moves the mobile phone so that the target object stays in the center of the screen as far as possible; finally, a photo or video of the target object is generated.
  • the user needs to manually operate to determine the target object, manually adjust the display scale of the target object, and, when the target object moves, the mobile phone needs to be moved manually.
• in the way provided by the embodiments of the present application, the mobile phone enters the camera photo preview or video recording, and after aiming at the target direction, the wide-angle camera, the main camera, and the telephoto camera are used to collect the wide-angle image, the main-camera image, and the telephoto image, respectively; target recognition is performed on each field-of-view image to obtain the object information of each object, the index data of the attribute information of each object is extracted from the database, and the association relationship between the index data and the area information is established.
• the user can send tracking commands by voice; the mobile phone can accurately identify the voice command according to the association relationship between the index data and the area information and the pre-stored voiceprint features, and determine the target object corresponding to the voice command and the target area information of the target object. Then, according to the target area information of the target object and the expected display mode, a sub-image is obtained by cropping from the field-of-view image, that is, a single target area or multiple target areas are generated.
  • the intended display manner may include the image aspect ratio.
  • the focal length is adjusted, that is, the cropped image is zoomed and displayed to achieve the target display ratio, and a photo or video of the tracked object is generated.
• for the way provided by the embodiments of the present application, reference may also be made to the schematic flowchart of voice shooting shown in FIG. 12.
• after the mobile phone enters the camera photo preview or video recording scene, the user can manually adjust the camera pose to aim at the target direction; the mobile phone can then collect wide-angle images, main-camera images, and telephoto images in real time, and perform target recognition on each image to obtain the object information.
• the mobile phone also collects the user's voice command in real time through the microphone and, based on the object information and the pre-stored voiceprint features, accurately identifies the user's voice command, determining the target object corresponding to the voice command and the information contained in the voice command; then, according to the target area information of the target object and the expected display mode, a sub-image is obtained by cropping from the field-of-view image, that is, a single target area or multiple target areas are generated.
  • the intended display manner may include the image aspect ratio.
  • adjust the focal length to achieve the target display ratio, and generate a photo or video of the tracked object.
  • the user can be prompted to adjust the pose on the preview interface.
• in this way, the user can specify the tracking object by voice, and the mobile phone continuously tracks the tracking object without manual operation by the user.
  • the method provided by the embodiment of the present application can allow the user to conveniently and accurately specify the tracking object, thereby improving the convenience of tracking and shooting.
  • the embodiment of the present application also provides a method for determining a target object, so that the user can specify the target object by voice, which improves the convenience and accuracy of determining the target object.
  • FIG. 13 is a schematic flowchart of a method for determining a target object provided by an embodiment of the present application.
  • the method can be applied to the terminal device 100, and the method can include the following steps:
  • Step S1301 Acquire a to-be-processed image, where the to-be-processed image includes at least one object.
  • the above-mentioned images to be processed may include at least two original images with different viewing angles.
• for example, the images to be processed include an original image captured by a telephoto camera and an original image captured by a wide-angle camera; alternatively, the image to be processed may include an original image of a single field of view.
  • Step S1302 Determine the object information of each object in the image to be processed, the object information includes area information and attribute information, and the area information is used to describe the position of the object in the image to be processed.
• object recognition may be performed on the image to be processed to determine the object information of each object in the image to be processed.
  • Step S1303 Extract second index data of each attribute information from a pre-established data set, where the data set includes attribute key words and attribute indexes.
  • Step S1304 For each object, associate the second index data with the area information to obtain an association relationship between the index data and the area information.
  • Step S1305 acquiring first voice information.
  • Step S1306 Extract the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set.
  • Step S1307 Determine target area information corresponding to the first index data based on the association relationship and the second index data, wherein the object corresponding to the target area information is the target object.
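• Read end to end, steps S1302 to S1307 can be sketched as a single pass; as before, the keyword-to-index mapping, substring matching, and majority-match rule are illustrative assumptions rather than the mandated implementation:

```python
def determine_target_object(objects, data_set, speech_text):
    """objects: list of (attributes, area_info) pairs from target recognition.
    Returns the target area information of the object that best matches the
    first voice information, or None when no object matches."""
    keyword_to_index = {kw: idx
                        for category in data_set.values()
                        for idx, keywords in category.items()
                        for kw in keywords}
    # S1303/S1304: second index data, associated with area information
    associations = [({keyword_to_index[a] for a in attrs if a in keyword_to_index}, area)
                    for attrs, area in objects]
    candidates = set().union(*(s for s, _ in associations)) if associations else set()
    # S1306: first index data, matched only against candidate keywords
    first = {idx for kw, idx in keyword_to_index.items()
             if idx in candidates and kw in speech_text}
    if not first:
        return None
    # S1307: the object sharing the most indexes supplies the target area
    second, area = max(associations, key=lambda sa: len(sa[0] & first))
    return area if second & first else None
```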
• the method for determining a target object provided by the embodiments of the present application may be applied to the above-mentioned tracking video scene and photographing scene, but is not limited to the above-mentioned scenes.
• for the convenience of description, for the points in FIG. 13 that are the same as in the above embodiments, reference may be made to the corresponding contents of the above embodiments, which are not repeated here.
• in the embodiments of the present application, the object information of each object in the image is identified in advance, and the association relationship between the index data of each object and the area information is established based on the pre-established data set; when a voice command is collected, the object in the image corresponding to the voice command is identified based on the association relationship, allowing the user to conveniently and quickly specify the target object.
  • FIG. 14 is another schematic flowchart of the method for determining a target object provided by an embodiment of the present application.
  • the method can be applied to the terminal device 100, and the method can include the following steps:
• Step S1401 Display a picture captured by the camera in the viewfinder, the picture including at least one object.
  • the terminal device 100 may include one or at least two cameras with different viewing angles.
  • the terminal device 100 may display the output image in the viewfinder based on the original image captured by one of the cameras.
  • Step S1402 Acquire first voice information, where the first voice information is used to specify a target object.
  • Step S1403 Determine the object corresponding to the first voice information in the screen as the target object.
• the terminal device 100 may, after collecting the original image, perform target recognition on the original image to obtain the object information of each object in the original image, and then establish the association relationship between the index data and the area information based on the object information and the pre-established data set. After collecting the first voice information, the terminal device can recognize the first voice information and, based on the voice recognition result and the association relationship, determine the object corresponding to the first voice information and determine that object as the target object.
• after the target object is determined, the output image of the target object can be displayed; for example, referring to FIG. 4, the interface 412 is displayed.
• further, the user can adjust the display mode of the target object by voice, add or remove target objects, replace the target object, and the like.
  • the user can conveniently and quickly specify the target object through voice.
  • FIG. 15 is another schematic flow diagram of the photographing method provided by the embodiment of the present application.
  • the method may be applied to the terminal device 100, and the method may include the following steps:
  • Step S1501 Display a picture captured by the camera in the viewfinder frame, and the picture includes at least one object.
  • Step S1502 Receive a first instruction, where the first instruction is used to specify at least two target objects.
  • the above-mentioned first instruction may be a voice instruction or a non-voice instruction.
  • the user may input the above-mentioned first instruction by finger-pointing or touching.
  • Step S1503 Determine at least two objects in the screen corresponding to the first instruction as at least two target objects.
• when the first instruction is a voice instruction, the terminal device can first establish the association relationship between the index data of each object and the region information based on the object information of each object and the pre-established data set, and then determine, based on the association relationship and the voice recognition result, the object in the picture corresponding to the first instruction.
• when the first instruction is a touch instruction, the terminal device can determine the target object designated by the user according to the user's touch position. For example, the user can input the first instruction by successively clicking the object 64 and the object 66 in FIG. 6, and the terminal device determines the objects corresponding to the clicked positions as the target objects according to the display position of each object in the image.
  • Step S1504: Display the picture of each target object in sub-regions in the viewfinder frame, where one region displays the picture of one target object.
  • the terminal device can cut out sub-images based on the target area information of each target object, and then splice the sub-images so that they are displayed in the expected display manner.
  • the terminal device may also cut out the sub-images separately and then display them after splicing.
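The crop-and-splice step might look as follows; this is a simplified sketch that treats an image as a list of pixel rows and shows one possible side-by-side layout, not the device's actual rendering pipeline:

```python
def crop(image, region):
    """Cut the sub-image covered by region = (x, y, width, height)."""
    x, y, w, h = region
    return [row[x:x + w] for row in image[y:y + h]]

def splice_side_by_side(sub_images):
    """Splice equally sized sub-images left to right, one area per target."""
    return [sum(rows, []) for rows in zip(*sub_images)]

# Example with a tiny 4x4 "image" and two 2x2 target regions.
img = [[p + 4 * r for p in range(4)] for r in range(4)]
left = crop(img, (0, 0, 2, 2))
right = crop(img, (2, 2, 2, 2))
print(splice_side_by_side([left, right]))  # [[0, 1, 10, 11], [4, 5, 14, 15]]
```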
  • the first instruction is a voice instruction.
  • the collected voice "Two-screen tracking of the boy playing a skateboard and the girl playing a hula hoop" specifies two target objects; after recognizing the voice and determining the target objects based on the association relationship, the terminal device displays the interface 611.
  • the left area of the interface 611 displays the picture of the object 64, and the right area displays the picture of the object 66.
  • the display result of the terminal device may refer to the interface 911 in FIG. 9B.
  • the terminal device displays the at least two target objects in different regions, which improves the convenience of tracking and shooting.
  • FIG. 16 shows a schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application. For convenience of description, only parts related to the embodiment of the present application are shown.
  • the apparatus may include:
  • an image acquisition module 161 configured to acquire a to-be-processed image, where the to-be-processed image includes at least one object;
  • the first object information determination module 162 is used to determine the object information of each object in the image to be processed, the object information includes area information and attribute information, and the area information is used to describe the position of the object in the image to be processed;
  • the first extraction module 163 is configured to extract the second index data of each attribute information from a pre-established data set, where the data set includes attribute keywords and attribute indexes;
  • the first establishment module 164 is used for associating the second index data with the area information for each object, so as to obtain the association relationship between the index data and the area information;
  • a first voice information obtaining module 165 configured to obtain the first voice information;
  • the second extraction module 166 is configured to extract the first index data of the first speech information from the data set based on the attribute keywords corresponding to the second index data in the data set;
  • the first target object determination module 167 is configured to determine target area information corresponding to the first index data based on the association relationship and the second index data, wherein the object corresponding to the target area information is the target object.
  • the above-mentioned first target object determination module is specifically configured to: for each second index data, match each index in the second index data with each index in the first index data, and determine the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determine the area information corresponding to the second index data with the largest number of matched items as the target area information.
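Read literally, this matching rule is an overlap count between the first index data and each object's second index data, with the largest count winning. A hedged sketch, reusing the set-of-integers representation assumed in the earlier example:

```python
def match_target_region(first_index_data, association):
    """Pick the region info whose second index data shares the most
    items (the target indexes) with the first index data."""
    best_region, best_count = None, 0
    for second_index_data, region_info in association:
        count = len(second_index_data & first_index_data)  # matched items
        if count > best_count:
            best_region, best_count = region_info, count
    return best_region

association = [({1, 5}, (40, 60, 220, 400)), ({2, 7}, (300, 80, 180, 380))]
print(match_target_region({2, 7}, association))  # -> (300, 80, 180, 380)
```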
  • the above-mentioned first extraction module is specifically configured to: for each attribute information, find the first keyword matching the attribute information from the data set; and use the index data of the first keyword in the data set as the second index data.
  • the above-mentioned second extraction module is specifically configured to: perform speech recognition on the first speech information to obtain a first speech recognition result; determine, according to the first speech recognition result, that the first speech information is a voice command for specifying the target object; search the attribute keywords corresponding to the second index data for a second keyword matching the first speech recognition result; and use the index data of the second keyword in the data set as the first index data.
  • the above-mentioned second extraction module is specifically configured to: extract the voiceprint feature of the first voice information; determine the similarity between the voiceprint feature and the pre-stored voiceprint feature; and, when the similarity is greater than or equal to a preset threshold, input the voiceprint feature and the first speech information into the semantic understanding model to obtain the first speech recognition result output by the semantic understanding model.
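A sketch of this voiceprint gate, assuming voiceprint features are fixed-length vectors compared by cosine similarity (the embodiment does not name a similarity metric or threshold value) and modeling the semantic understanding model as a plain callable:

```python
import math

PRESET_THRESHOLD = 0.8  # assumed value; the embodiment leaves it unspecified

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recognize_first_voice(voiceprint, stored_voiceprint, voice_info, semantic_model):
    """Only a voice whose voiceprint is close enough to the pre-stored one
    is forwarded to the semantic understanding model."""
    if cosine_similarity(voiceprint, stored_voiceprint) >= PRESET_THRESHOLD:
        return semantic_model(voiceprint, voice_info)  # first speech recognition result
    return None  # unenrolled speaker: no recognition result
```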
  • the above-mentioned first object information determination module is specifically configured to: input the image to be processed into the target recognition model; and obtain the object information output by the target recognition model.
  • the first voice information is a voice command for specifying the target object for the first time, or a voice command for increasing or decreasing the target object, or a voice command for replacing the target object.
  • the image to be processed includes N original images with different viewing angles, where N is a positive integer greater than or equal to 1.
  • the above device may also include:
  • the first display module is configured to obtain the first output image of the target object according to the target area information based on the original image, and display the first output image in the viewfinder frame.
  • the above-mentioned first display module is specifically configured to: determine a first target image from the N original images, where the first target image is the image with the smallest field of view in the image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; crop the first sub-image of the target object from the first target image according to the target area information; and generate the first output image based on the first sub-image.
  • when there are at least two target objects, the above-mentioned first display module is specifically configured to: stitch the first sub-images corresponding to the target objects to obtain the first output image.
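Selecting the first target image can be sketched as filtering the originals whose frame fully contains the target's bounding box (the image set) and taking the one with the smallest field of view; the 'fov', 'size' and 'box' fields below are assumed inputs, not an actual camera API:

```python
def box_inside(box, image_size):
    """True if the bounding box (x, y, w, h) lies fully inside the image."""
    x, y, w, h = box
    img_w, img_h = image_size
    return x >= 0 and y >= 0 and x + w <= img_w and y + h <= img_h

def pick_first_target_image(originals):
    """Each original is assumed to carry its field of view ('fov') and the
    target's bounding box ('box') mapped into its own coordinates."""
    image_set = [img for img in originals if box_inside(img["box"], img["size"])]
    return min(image_set, key=lambda img: img["fov"]) if image_set else None

# Example: the wide camera (large FOV) and the main camera (smaller FOV)
# both contain the box, so the main camera's frame is chosen.
originals = [
    {"fov": 120, "size": (4000, 3000), "box": (1500, 900, 400, 800)},
    {"fov": 80,  "size": (4000, 3000), "box": (900, 300, 1200, 2400)},
]
print(pick_first_target_image(originals)["fov"])  # -> 80
```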
  • the above-mentioned apparatus further comprises:
  • a second voice information acquisition module configured to acquire second voice information;
  • a first recognition module configured to perform speech recognition on the second speech information to obtain a second speech recognition result;
  • a first determining module configured to determine, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
  • a second determining module configured to determine the second target image from the original image;
  • a first cropping module for cropping out the second sub-image of the target object from the second target image;
  • the second display module is configured to generate a second output image based on the second sub-image, and display the second output image in the viewfinder.
  • the above-mentioned apparatus may further include:
  • the first detection module is used to determine the distance between each edge of the bounding box selection area of the target object and the corresponding image edge in the third target image, where the third target image is the image with the largest field of view in the image set;
  • the first prompt module is configured to display prompt information in the viewfinder if the minimum value of the multiple distances is less than or equal to the preset distance threshold, and the prompt information is used to prompt adjustment of the camera pose.
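A sketch of this edge-distance check on the largest-field-of-view (third target) image, assuming an axis-aligned bounding box given as (x, y, w, h):

```python
def should_prompt_pose_adjustment(box, image_size, distance_threshold):
    """Prompt when the target's bounding box gets too close to any edge of
    the third target image (the largest-FOV frame)."""
    x, y, w, h = box
    img_w, img_h = image_size
    # Distance from each box edge to the corresponding image edge.
    distances = [x, y, img_w - (x + w), img_h - (y + h)]
    return min(distances) <= distance_threshold

print(should_prompt_pose_adjustment((3900, 500, 80, 200), (4000, 3000), 50))  # True: right gap is 20
```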
  • the attribute information includes at least one of: type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture.
  • the above-mentioned device for determining the target object has the function of realizing the above-mentioned method for determining the target object. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions, and the modules can be software and/or hardware.
  • FIG. 17 shows a schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application. For convenience of description, only parts related to the embodiment of the present application are shown.
  • the apparatus may include:
  • the first picture display module 171 is used to display the picture captured by the camera in the viewfinder frame, and the picture includes at least one object;
  • a first obtaining module 172 configured to obtain first voice information, and the first voice information is used to specify a target object;
  • the second target object determination module 173 is configured to determine the object corresponding to the first voice information in the picture as the target object.
  • the second target object determination module is specifically configured to: extract the first index data of the first voice information from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and based on the association relationship of each object, determine the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object;
  • the association relationship is a mapping relationship between the second index data of the attribute information of the object and the area information, and the area information is used to describe the position of the object in the original image collected by the camera.
  • the above-mentioned apparatus may further include:
  • the second object information determination module is used to determine the object information of each object in the original image, and the object information includes area information and attribute information;
  • the third extraction module is used to extract the second index data of the attribute information of each object from the data set;
  • the second establishment module is used for associating the second index data with the area information for each object, so as to obtain the association relationship between the index data and the area information.
  • the above-mentioned third extraction module is specifically configured to: for each attribute information, find the first keyword matching the attribute information from the data set; and use the index data of the first keyword in the data set as the second index data.
  • the second object information determination module is specifically configured to: input the original image into the target recognition model; and obtain the object information output by the target recognition model.
  • the second target object determination module is specifically configured to: for each second index data, match each index in the first index data with each index in the second index data, and determine the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determine the area information corresponding to the second index data with the largest number of matched items as the target area information.
  • the second target object determination module is specifically configured to: perform voice recognition on the first voice information to obtain a first voice recognition result; determine, according to the first voice recognition result, that the first voice information is a voice command for specifying the target object; search the attribute keywords corresponding to the second index data for a second keyword matching the first voice recognition result; and use the index data of the second keyword in the data set as the first index data.
  • the second target object determination module is specifically configured to: extract the voiceprint feature of the first voice information; determine the similarity between the voiceprint feature and the pre-stored voiceprint feature; and, when the similarity is greater than or equal to a preset threshold, input the voiceprint feature and the first voice information into the semantic understanding model to obtain the first voice recognition result output by the semantic understanding model.
  • the cameras include N cameras with different viewing angles, where N is a positive integer greater than or equal to 1.
  • the above device also includes:
  • the third display module is configured to obtain the first output image of the target object according to the target area information based on the original images of N different viewing angles, and display the first output image in the viewfinder frame.
  • the above-mentioned third display module is used to: determine a first target image from the N original images, where the first target image is the image with the smallest field of view in the image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; crop the first sub-image of the target object from the first target image according to the target area information; and generate the first output image based on the first sub-image.
  • when there are at least two target objects, the above-mentioned third display module is specifically configured to: stitch the first sub-images corresponding to the target objects to obtain the first output image.
  • the above-mentioned apparatus further comprises:
  • a second obtaining module configured to obtain second voice information;
  • a second recognition module configured to perform speech recognition on the second voice information to obtain a second voice recognition result;
  • a third determining module configured to determine, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
  • a fourth determining module used for determining the second target image from the original image;
  • the second cropping module is used for cropping out the second sub-image of the target object from the second target image;
  • the fourth display module is configured to generate a second output image based on the second sub-image, and display the second output image in the viewfinder.
  • the above-mentioned apparatus may further include:
  • the second detection module is used to determine the distance between each edge of the bounding box selection area of the target object and the corresponding image edge in the third target image, where the third target image is the image with the largest field of view in the image set;
  • the second prompt module is configured to display prompt information in the viewfinder if the minimum value among the multiple distances is smaller than the preset distance threshold, and the prompt information is used to prompt adjustment of the camera pose.
  • the above-mentioned device for determining the target object has the function of realizing the above-mentioned method for determining the target object. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions, and the modules can be software and/or hardware.
  • FIG. 18 shows a schematic block diagram of the photographing apparatus provided by the embodiment of the present application. For the convenience of description, only the part related to the embodiment of the present application is shown.
  • the apparatus may include:
  • the second picture display module 181 is configured to display the picture captured by the camera in the viewfinder, and the picture includes at least one object;
  • a receiving module 182 configured to receive a first instruction, where the first instruction is used to specify at least two target objects;
  • a third target object determination module 183 configured to determine at least two objects in the picture corresponding to the first instruction as at least two target objects;
  • the sub-area display module 184 is configured to display the picture of each target object in sub-areas in the viewfinder, wherein one area displays the picture of one target object.
  • the first instruction is a voice instruction.
  • the third target object determination module is specifically configured to: extract the first index data of the first instruction from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and based on the association relationship of each object, determine the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object;
  • the association relationship is the mapping relationship between the second index data of the attribute information of the object and the region information.
  • the region information is used to describe the position of the object in the original image collected by the camera.
  • the above-mentioned apparatus may further include:
  • the third object information determination module is used to determine the object information of each object in the original image, and the object information includes area information and attribute information;
  • a fourth extraction module used for extracting the second index data of the attribute information of each object from the data set
  • the third establishing module is used for associating the second index data with the area information for each object, so as to obtain the association relationship between the index data and the area information.
  • the fourth extraction module is specifically configured to: for each attribute information, find the first keyword matching the attribute information from the data set; and use the index data of the first keyword in the data set as the second index data.
  • the third object information determination module is specifically configured to: input the original image into the target recognition model; and obtain the object information output by the target recognition model.
  • the third target object determination module is specifically configured to: for each second index data, match each index in the first index data with each index in the second index data, and determine the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determine the area information corresponding to the second index data with the largest number of matched items as the target area information.
  • the third target object determination module is specifically configured to: search the attribute keywords corresponding to the second index data for a second keyword matching the first instruction; and use the index data of the second keyword in the data set as the first index data.
  • the cameras include N cameras with different fields of view, where N is a positive integer greater than or equal to 1; the above-mentioned sub-area display module is specifically configured to: for each target object, determine a first target image from the N original images, where the first target image is the image with the smallest field of view in the image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; for each target object, crop the first sub-image of the target object from the first target image according to the target area information; stitch the first sub-images corresponding to the target objects to obtain a first output image; and display the first output image in the viewfinder.
  • the above-mentioned apparatus further comprises:
  • a third acquiring module configured to acquire the second voice information;
  • a third recognition module configured to perform speech recognition on the second speech information to obtain a second speech recognition result;
  • a fifth determining module configured to determine, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
  • a sixth determining module configured to determine the second target image from the original image;
  • the third cropping module is used for cropping out the second sub-image of the target object from the second target image;
  • the fifth display module is configured to generate a second output image based on the second sub-image, and display the second output image in the viewfinder frame.
  • the above-mentioned apparatus may further include:
  • the third detection module is used to determine the distance between each edge of the bounding box selection area of the target object and the corresponding image edge in the third target image, where the third target image is the image with the largest field of view in the image set;
  • the third prompt module is configured to display prompt information in the viewfinder if the minimum value of the multiple distances is smaller than the preset distance threshold, and the prompt information is used to prompt adjustment of the camera pose.
  • the above-mentioned photographing device has the function of realizing the above-mentioned photographing method. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-mentioned functions, and the modules can be software and/or hardware.
  • the terminal device provided by the embodiments of the present application may include a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method in any of the foregoing method embodiments is implemented.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • the embodiments of the present application provide a computer program product that, when run on a terminal device, causes the terminal device to implement the steps in the foregoing method embodiments.
  • An embodiment of the present application further provides a chip system, where the chip system includes a processor coupled to a memory, and the processor executes a computer program stored in the memory to implement the method described in the foregoing method embodiments.
  • the chip system may be a single chip, or a chip module composed of multiple chips.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.

Abstract

Disclosed in embodiments of the present application are a method for determining a target object, and a photographing method and device, for use in conveniently and accurately determining a target object and improving the convenience of photographing. The method for determining the target object comprises: obtaining an image to be processed; determining object information of each object in said image, the object information comprising area information and attribute information; extracting second index data of each piece of attribute information from a pre-established data set, the data set comprising attribute key words and attribute indexes; for each object, associating the second index data with the area information to obtain an association between the index data and the area information; obtaining first voice information; extracting first index data of the first voice information from the data set on the basis of an attribute key word corresponding to the second index data in the data set; and determining, on the basis of the association and the second index data, target area information corresponding to the first index data, an object corresponding to the target area information being a target object.

Description

Method for determining a target object, photographing method and apparatus
This application claims priority to Chinese patent application No. 202110336707.X, filed with the State Intellectual Property Office on March 29, 2021 and entitled "Method for Determining Target Object, Shooting Method and Device", the entire content of which is incorporated in this application by reference.
Technical Field
The present application relates to the technical field of image processing, and in particular, to a method for determining a target object, a photographing method, and an apparatus.
Background
With the continuous development of terminal technology and image processing technology, the shooting functions of terminal devices are becoming more and more powerful.
At present, when a terminal device shoots, a target object is manually selected, and the target object is then shot to obtain a video or picture of the tracked object.
When the user manually selects the target object, the actually selected target object may be inconsistent with the target object that the user wants to specify due to reasons such as tapping errors; that is, the target object cannot be accurately selected. In addition, when the user is holding an item, the user needs to put down the item first and then manually select the target object, which is less convenient.
Summary of the Invention
The embodiments of the present application provide a method for determining a target object, a photographing method and an apparatus, which allow a user to conveniently and accurately specify a target object.
In a first aspect, an embodiment of the present application provides a method for determining a target object, applied to a terminal device. The method includes: first acquiring an image to be processed, where the image to be processed includes at least one object; determining object information of each object in the image to be processed, where the object information includes area information and attribute information, and the area information is used to describe the position of the object in the image to be processed; then extracting second index data of each piece of attribute information from a pre-established data set, where the data set includes attribute keywords and attribute indexes; for each object, associating the second index data with the area information to obtain an association relationship between the index data and the area information; when first voice information is acquired, extracting first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set; and finally, based on the association relationship and the second index data, determining target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
In the embodiments of the present application, the object information of each object in an image is recognized in advance, and the association relationship between the index data of each object and the area information is established based on a pre-established data set; then, when a voice command is collected, the object in the image corresponding to the voice command is determined based on the association relationship, so that the user can conveniently and quickly specify the target object.
In some possible implementations of the first aspect, the process of determining the target area information corresponding to the first index data based on the association relationship and the second index data may include: for each second index data, matching each index in the second index data with each index in the first index data, and determining the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determining the area information corresponding to the second index data with the largest number of matched items as the target area information.
In some possible implementations of the first aspect, the process of extracting the second index data of each piece of attribute information from the pre-established data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information; and using the index data of the first keyword in the data set as the second index data.
In some possible implementations of the first aspect, the process of extracting the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set may include: performing voice recognition on the first voice information to obtain a first voice recognition result; determining, according to the first voice recognition result, that the first voice information is a voice command for specifying the target object; searching the attribute keywords corresponding to the second index data for a second keyword matching the first voice recognition result; and using the index data of the second keyword in the data set as the first index data.
In some possible implementations of the first aspect, the process of performing voice recognition on the first voice information to obtain the first voice recognition result may include: extracting a voiceprint feature of the first voice information; determining the similarity between the voiceprint feature and a pre-stored voiceprint feature; and when the similarity is greater than or equal to a preset threshold, inputting the voiceprint feature and the first voice information into a semantic understanding model to obtain the first voice recognition result output by the semantic understanding model.
In some possible implementations of the first aspect, the process of determining the object information of each object in the image to be processed may include: inputting the image to be processed into a target recognition model; and obtaining the object information output by the target recognition model.
In some possible implementations of the first aspect, the first voice information is a voice command for specifying the target object for the first time, a voice command for adding or removing a target object, or a voice command for replacing the target object. In this implementation, the user can not only specify the target object for the first time by voice, but can also, on that basis, add or remove target objects and replace the target object by voice, which further increases shooting convenience and improves the user experience.
In some possible implementations of the first aspect, the image to be processed includes N original images with different fields of view, where N is a positive integer greater than or equal to 1. After the target area information corresponding to the first index data is determined, the method further includes: based on the original images, obtaining a first output image of the target object according to the target area information, and displaying the first output image in the viewfinder.
In some possible implementations of the first aspect, the process of obtaining the first output image of the target object according to the target area information based on the original images may include: determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; cropping a first sub-image of the target object from the first target image according to the target area information; and generating the first output image based on the first sub-image.
In some possible implementations of the first aspect, when there are at least two target objects and each target object corresponds to one first sub-image, the process of obtaining the first output image based on the first sub-images may include: stitching the first sub-images corresponding to the target objects to obtain the first output image.
In some possible implementations of the first aspect, after the first output image is displayed in the viewfinder, the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
In this implementation, the user can adjust the display mode of the target object by voice, which further improves shooting convenience.
In some possible implementations of the first aspect, the method may further include: in a third target image, determining the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set; and if the minimum of the multiple distances is less than or equal to a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
In this implementation, when the target object is about to move beyond the maximum capture range of the camera, prompting the user to adjust the camera pose can improve the user experience.
In some possible implementations of the first aspect, the attribute information includes at least one of the following: type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture.
In a second aspect, an embodiment of the present application provides a method for determining a target object, applied to a terminal device. The method includes: displaying, in a viewfinder, a picture captured by a camera, where the picture includes at least one object; and when first voice information is acquired, where the first voice information is used to specify a target object, determining the object in the picture corresponding to the first voice information as the target object.
In some possible implementations of the second aspect, the process of determining the object in the picture corresponding to the first voice information as the target object may include: extracting first index data of the first voice information from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determining, based on the association relationship of each object, target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
The association relationship is a mapping relationship between the second index data of the attribute information of an object and the area information, and the area information is used to describe the position of the object in the original image collected by the camera.
In some possible implementations of the second aspect, before the target area information corresponding to the first index data is determined based on the association relationship of each object, the method further includes: determining object information of each object in the original image, where the object information includes area information and attribute information; extracting second index data of the attribute information of each object from the data set; and for each object, associating the second index data with the area information to obtain the association relationship between the index data and the area information.
In some possible implementations of the second aspect, the process of extracting the second index data of the attribute information of each object from the data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information; and using the index data of the first keyword in the data set as the second index data.
In some possible implementations of the second aspect, the process of determining the object information of each object in the original image may include: inputting the original image into a target recognition model; and obtaining the object information output by the target recognition model.
In some possible implementations of the second aspect, the process of determining the target area information corresponding to the first index data based on the association relationship of each object may include: for each second index data, matching each index in the first index data with each index in the second index data, and determining the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determining the area information corresponding to the second index data with the largest number of matched items as the target area information.
In some possible implementations of the second aspect, the process of extracting the first index data of the first voice information from the pre-established data set may include: performing voice recognition on the first voice information to obtain a first voice recognition result; determining, according to the first voice recognition result, that the first voice information is a voice command for specifying the target object; searching the attribute keywords corresponding to the second index data for a second keyword matching the first voice recognition result; and using the index data of the second keyword in the data set as the first index data.
In some possible implementations of the second aspect, the process of performing voice recognition on the first voice information to obtain the first voice recognition result may include: extracting a voiceprint feature of the first voice information; determining the similarity between the voiceprint feature and a pre-stored voiceprint feature; and when the similarity is greater than or equal to a preset threshold, inputting the voiceprint feature and the first voice information into a semantic understanding model to obtain the first voice recognition result output by the semantic understanding model.
In some possible implementations of the second aspect, the camera includes N cameras with different fields of view, where N is a positive integer greater than or equal to 1. After the target area information corresponding to the first index data is determined based on the association relationship of each object, the method further includes: based on the N original images with different fields of view, obtaining a first output image of the target object according to the target area information, and displaying the first output image in the viewfinder.
In some possible implementations of the second aspect, the process of obtaining the first output image of the target object according to the target area information based on the N original images with different fields of view may include: determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; cropping a first sub-image of the target object from the first target image according to the target area information; and generating the first output image based on the first sub-image.
In some possible implementations of the second aspect, when there are at least two target objects and each target object corresponds to one first sub-image, the process of obtaining the first output image based on the first sub-images may include: stitching the first sub-images corresponding to the target objects to obtain the first output image.
In some possible implementations of the second aspect, after the first output image is displayed in the viewfinder, the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
In some possible implementations of the second aspect, the method further includes: in a third target image, determining the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set; and if the minimum of the multiple distances is less than a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
In a third aspect, an embodiment of the present application provides a photographing method, applied to a terminal device. The method includes: displaying, in a viewfinder, a picture captured by a camera, where the picture includes at least one object; when a first instruction is received, where the first instruction is used to specify at least two target objects, determining at least two objects in the picture corresponding to the first instruction as the at least two target objects; and then displaying the picture of each target object in sub-areas in the viewfinder, where one area displays the picture of one target object.
In some possible implementations of the third aspect, the first instruction is a voice instruction.
In some possible implementations of the third aspect, the process of determining the at least two objects in the picture corresponding to the first instruction as the at least two target objects may include: extracting first index data of the first instruction from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determining, based on the association relationship of each object, target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
The association relationship is a mapping relationship between the second index data of the attribute information of an object and the area information, and the area information is used to describe the position of the object in the original image collected by the camera.
In some possible implementations of the third aspect, before the target area information corresponding to the first index data is determined based on the association relationship of each object, the method further includes: determining object information of each object in the original image, where the object information includes area information and attribute information; extracting second index data of the attribute information of each object from the data set; and for each object, associating the second index data with the area information to obtain the association relationship between the index data and the area information.
In some possible implementations of the third aspect, the process of extracting the second index data of the attribute information of each object from the data set may include: for each piece of attribute information, searching the data set for a first keyword matching the attribute information; and using the index data of the first keyword in the data set as the second index data.
In some possible implementations of the third aspect, the process of determining the object information of each object in the original image may include: inputting the original image into a target recognition model; and obtaining the object information output by the target recognition model.
In some possible implementations of the third aspect, the process of determining the target area information corresponding to the first index data based on the association relationship of each object may include: for each second index data, matching each index in the first index data with each index in the second index data, and determining the number of items of the target index, where the target index is an index in the second index data that matches an index in the first index data; and based on the association relationship, determining the area information corresponding to the second index data with the largest number of matched items as the target area information.
In some possible implementations of the third aspect, the process of extracting the first index data of the first instruction from the pre-established data set may include: searching the attribute keywords corresponding to the second index data for a second keyword matching the first instruction; and using the index data of the second keyword in the data set as the first index data.
In some possible implementations of the third aspect, the camera includes N cameras with different fields of view, where N is a positive integer greater than or equal to 1. The process of displaying the picture of each target object in sub-areas in the viewfinder may include: for each target object, determining a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding box selection area of the target object; for each target object, cropping a first sub-image of the target object from the first target image according to the target area information; stitching the first sub-images corresponding to the target objects to obtain a first output image; and displaying the first output image in the viewfinder.
In some possible implementations of the third aspect, after the picture of each target object is displayed in sub-areas in the viewfinder, the method further includes: acquiring second voice information; performing voice recognition on the second voice information to obtain a second voice recognition result; determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object; determining a second target image from the original images; cropping a second sub-image of the target object from the second target image; and generating a second output image based on the second sub-image and displaying the second output image in the viewfinder.
In some possible implementations of the third aspect, the method further includes: in a third target image, determining the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set; and if the minimum of the multiple distances is less than a preset distance threshold, displaying prompt information in the viewfinder, where the prompt information is used to prompt adjustment of the camera pose.
In a fourth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method of any one of the first aspect, the second aspect, or the third aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the method of any one of the first aspect, the second aspect, or the third aspect.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor coupled to a memory, and the processor executes a computer program stored in the memory to implement the method of any one of the first aspect, the second aspect, or the third aspect. The chip system may be a single chip, or a chip module composed of multiple chips.
In a seventh aspect, an embodiment of the present application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the method of any one of the first aspect, the second aspect, or the third aspect.
It can be understood that, for the beneficial effects of the second aspect to the seventh aspect, reference may be made to the relevant descriptions in the first aspect, and details are not repeated here.
Description of Drawings
FIG. 1 is a schematic diagram of the hardware structure of a terminal device 100 according to an embodiment of the present application;
FIG. 2 is a block diagram of the software structure of the terminal device 100 according to an embodiment of the present application;
FIG. 3 is a schematic flow diagram of a photographing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of tracking shooting in a video recording scene according to an embodiment of the present application;
FIG. 5 is a schematic diagram of stable tracking according to an embodiment of the present application;
FIG. 6 is a schematic diagram of tracking shooting with multiple tracked targets according to an embodiment of the present application;
FIG. 7A is a schematic diagram of an image processing procedure in a tracking shooting scene according to an embodiment of the present application;
FIG. 7B is a schematic diagram of an interface in a tracking shooting scene according to an embodiment of the present application;
FIG. 8 is a schematic diagram of replacing a target object according to an embodiment of the present application;
FIG. 9A is a schematic diagram of another image processing procedure in a tracking shooting scene according to an embodiment of the present application;
FIG. 9B is a schematic diagram of another interface in a tracking shooting scene according to an embodiment of the present application;
FIG. 9C is a schematic diagram of a set of original images according to an embodiment of the present application;
FIG. 9D is a schematic diagram of sub-images according to an embodiment of the present application;
FIG. 10 is a schematic diagram of tracking shooting in a photo-taking scene according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of a shooting process according to an embodiment of the present application;
FIG. 12 is a schematic flowchart of a voice shooting process according to an embodiment of the present application;
FIG. 13 is a schematic flow diagram of a method for determining a target object according to an embodiment of the present application;
FIG. 14 is another schematic flow diagram of the method for determining a target object according to an embodiment of the present application;
FIG. 15 is another schematic flow diagram of the photographing method according to an embodiment of the present application;
FIG. 16 is a schematic block diagram of an apparatus for determining a target object according to an embodiment of the present application;
FIG. 17 is another schematic block diagram of the apparatus for determining a target object according to an embodiment of the present application;
FIG. 18 is a schematic block diagram of a photographing apparatus according to an embodiment of the present application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application.
The following first exemplarily introduces the terminal device that may be involved in the embodiments of the present application.
Refer to FIG. 1, which is a schematic block diagram of the hardware structure of a terminal device 100 according to an embodiment of the present application. As shown in FIG. 1, the terminal device 100 may include a processor 110, a memory 120, an audio module 130, a speaker 130A, a receiver 130B, a microphone 130C, a camera 140, a display screen 150, and a sensor module 160. The sensor module 160 may include, but is not limited to, a touch sensor 160A and the like.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the terminal device 100. In other embodiments of the present application, the terminal device 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
For example, in some embodiments, the terminal device 100 may further include at least one of the following: an earphone interface, buttons, a motor, an indicator, a subscriber identification module (SIM) card interface, an external memory interface, a universal serial bus (USB) interface, a charging management module, a power management module, a battery, an antenna, a mobile communication module, a wireless communication module, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
The controller may be the nerve center and command center of the terminal device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal, and complete the control of instruction fetching and instruction execution. A memory may also be provided in the processor 110 for storing instructions and data.
In some embodiments, the processor 110 may include one or more interfaces. For example, the interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a mobile industry processor interface (MIPI), and a general-purpose input/output (GPIO) interface.
The I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may include multiple sets of I2C buses. The processor 110 may be separately coupled to the touch sensor 160A, the camera 140, and the like through different I2C bus interfaces. For example, the processor 110 may be coupled to the touch sensor 160A through the I2C interface, so that the processor 110 communicates with the touch sensor 160A through the I2C bus interface to implement the touch function of the terminal device 100.
The I2S interface can be used for audio communication. In some embodiments, the processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 130 through an I2S bus to implement communication between the processor 110 and the audio module 130.
The MIPI interface can be used to connect the processor 110 to peripheral devices such as the display screen 150 and the camera 140. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 communicates with the camera 140 through the CSI interface to implement the shooting function of the terminal device 100, and communicates with the display screen 150 through the DSI interface to implement the display function of the terminal device 100.
The GPIO interface can be configured by software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 to the camera 140, the display screen 150, the audio module 130, the sensor module 160, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a MIPI interface, and the like.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic illustrations and do not constitute a structural limitation on the terminal device 100. In other embodiments of the present application, the terminal device 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The terminal device 100 implements the display function through the GPU, the display screen 150, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 150 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 150 is used to display images, videos, and the like. The display screen 150 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 150, where N is a positive integer greater than 1.
The terminal device 100 can implement the shooting function through the ISP, the camera 140, the video codec, the GPU, the display screen 150, the application processor, and the like.
The ISP is used to process the data fed back by the camera 140. For example, when a photo is taken, the shutter is opened and light is transmitted through the lens to the photosensitive element of the camera; the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, converting it into an image visible to the naked eye. The ISP can also algorithmically optimize the noise, brightness, and skin tone of the image, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 140.
The camera 140 is used to capture still images or videos. An optical image of an object is generated through the lens and projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then transmits the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the terminal device 100 may include 1 or N cameras 140, where N is a positive integer greater than 1.
When the terminal device 100 includes at least two cameras 140, the at least two cameras 140 may include cameras with different fields of view. For example, the terminal device 100 includes a wide-angle camera, a main camera, and a telephoto camera; these three cameras have different fields of view, and images with different fields of view can be collected through them.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals.
The video codec is used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, such as the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the terminal device 100, for example, image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The memory 120 may include an internal memory and/or an external memory. The internal memory may be used to store computer-executable program code, and the executable program code includes instructions. The processor 110 executes various functional applications and data processing of the terminal device 100 by running the instructions stored in the internal memory. The internal memory may include a program storage area and a data storage area. The program storage area can store the operating system, applications required for at least one function (such as a sound playback function or an image playback function), and the like. The data storage area can store data created during the use of the terminal device 100, and the like. In addition, the internal memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
The terminal device 100 may implement audio functions, such as music playback and recording, through the audio module 130, the speaker 130A, the receiver 130B, the microphone 130C, the application processor, and the like.
The audio module 130 is used to convert digital audio information into an analog audio signal output, and to convert an analog audio input into a digital audio signal. The audio module 130 may also be used to encode and decode audio signals. In some embodiments, the audio module 130 may be provided in the processor 110, or some functional modules of the audio module 130 may be provided in the processor 110.
The speaker 130A, also referred to as a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The terminal device 100 can play music, conduct a hands-free call, or output voice prompt information through the speaker 130A.
The receiver 130B, also referred to as an "earpiece", is used to convert an audio electrical signal into a sound signal. When the terminal device 100 answers a call or a voice message, the voice can be heard by placing the receiver 130B close to the ear.
The microphone 130C, also referred to as a "mic" or "sound transducer", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak close to the microphone 130C to input the sound signal into the microphone 130C. The terminal device 100 may be provided with at least one microphone 130C.
The touch sensor 160A is also referred to as a "touch panel". The touch sensor 160A may be disposed on the display screen 150, and the touch sensor 160A and the display screen 150 form a touch screen, also referred to as a "touchscreen". The touch sensor 160A is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of the touch event. Visual output related to the touch operation may be provided through the display screen 150. In other embodiments, the touch sensor 160A may also be disposed on the surface of the terminal device 100, at a position different from that of the display screen 150.
After the hardware architecture of the terminal device 100 has been introduced, the software architecture of the terminal device 100 is exemplarily introduced below with reference to FIG. 2.
The software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100.
FIG. 2 is a block diagram of the software structure of the terminal device 100 according to an embodiment of the present application.
The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, and Messages.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on. The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface including a short message notification icon may include a view for displaying text and a view for displaying pictures.
The telephony manager is used to provide the communication functions of the terminal device 100, for example, management of the call state (including connecting, hanging up, and the like). The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files. The notification manager enables applications to display notification information in the status bar; it can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of charts or scroll bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt tone is issued, the electronic device vibrates, or the indicator light flashes.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the Android core libraries.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include multiple functional modules, for example, a surface manager, media libraries, a 3D graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL). The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications. The media libraries support playback and recording in a variety of commonly used audio and video formats, as well as still image files, and can support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG. The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing. The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The software and hardware workflow of the terminal device 100 is exemplarily described below with reference to a photo capture scene.
When the touch sensor 160A receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the timestamp of the touch operation). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a tap operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures still images or videos through the camera 140.
In specific applications, the terminal device 100 may be a mobile phone, a tablet computer, or another type of terminal device; the embodiments of the present application do not limit the specific type of the terminal device 100.
After the hardware architecture and software architecture of the terminal device 100 have been exemplarily introduced, the technical solutions provided by the embodiments of the present application are described in detail below by taking the terminal device 100 as an example.
Refer to FIG. 3, which is a schematic flow diagram of a photographing method according to an embodiment of the present application. The method is applied to the terminal device 100 and may include the following steps:
Step S301: Start the shooting application, enter the video recording mode or the photographing mode, and display the captured picture in the viewfinder frame.
It can be understood that the above shooting application refers to a shooting-type application, usually a camera application.
After detecting an operation for starting the shooting application, the terminal device 100, in response to the operation, starts the shooting application and calls the camera driver through the kernel layer, so that an image is captured through the camera and displayed in the viewfinder frame.
For example, refer to the schematic diagram of tracking shooting in a video recording scene shown in FIG. 4. As shown in FIG. 4, the main interface 42 of the mobile phone 41 displays applications such as a camera 43, Smart Life, Settings, Calendar, Gallery, and Clock. Here, the terminal device 100 is embodied as the mobile phone 41, and the shooting application is embodied as the camera 43.
After receiving a tap operation on the camera 43, the mobile phone 41 starts the camera 43 in response to the tap operation. After the mobile phone 41 enters the video recording mode, it displays a video recording preview interface 44, and the viewfinder frame of the video recording preview interface 44 displays shooting subjects such as an object 46, an object 47, an object 48, and an object 49. In addition, the video recording preview interface 44 includes an icon 45, which is used to turn the tracking shooting mode on and off; that is, the user can tap the icon 45 to enter or exit the tracking shooting mode. At this point, the icon 45 indicates that the mobile phone 41 has not yet entered the tracking shooting mode.
In the embodiments of the present application, tracking shooting may also be referred to as follow shooting or target tracking. When the tracked target moves, tracking shooting can keep the tracked target as close to the center of the picture as possible without moving the terminal device or manually adjusting the focal length, which avoids the picture shake caused by the user moving the terminal device.
Step S302: Detect a trigger operation, and enter the voice shooting mode in response to the trigger operation.
It should be noted that the above trigger operation is used to trigger the voice shooting mode, and the trigger operation may include operations such as a single tap, a double tap, or a long press. For example, if the trigger operation is a tap operation, the mobile phone 41 in FIG. 4 enters the tracking shooting mode after receiving the user's tap operation on the icon 45. In the tracking shooting mode, the mobile phone 41 can call the microphone to collect voice information in real time, so as to implement functions such as allowing the user to specify the target object by voice; that is, once the mobile phone 41 enters the tracking shooting mode, it can be considered to have entered the voice shooting mode. Of course, the mobile phone 41 can also enter and exit the voice shooting mode through other icons or buttons.
The trigger operation may also be a voice command; that is, the user can turn the tracking shooting mode on and off through a voice command. For example, after receiving the voice "start voice photographing" or "start voice video", the terminal device 100 performs voice recognition on the voice to obtain a voice recognition result. When the voice recognition result includes a preset keyword, it can be determined that the voice is a voice command for triggering the voice shooting mode, and the voice shooting mode is then entered in response to the voice command. The preset keywords may be set in advance; for example, the preset keywords may include "start", "enter", "photograph", "shoot", and "video".
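As a rough illustration (not the device's actual recognizer), the trigger check could reduce to keyword spotting over the transcript; the keyword list below is the one given in the example above:

```python
# Keywords taken from the example in the text; a real command set would come
# from the product's speech-recognition vocabulary.
TRIGGER_KEYWORDS = ("start", "enter", "photograph", "shoot", "video")


def is_trigger_command(recognition_result: str) -> bool:
    """Naive keyword spotting over the speech-recognition transcript."""
    text = recognition_result.lower()
    return any(keyword in text for keyword in TRIGGER_KEYWORDS)


assert is_trigger_command("start voice video")
```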
After entering the voice shooting mode, the terminal device 100 performs target recognition on the original images with different fields of view, collects voice information in real time through the microphone, and recognizes the voice information.
In some other embodiments, the terminal device 100 may also enter the voice shooting mode automatically after starting the shooting application, without needing a trigger operation. In other words, the above step S302 is optional.
Step S303: Perform target recognition on the N original images respectively to obtain object information of each object in each original image.
The terminal device 100 includes N cameras with different fields of view, and image collection is performed separately through the N cameras to obtain N original images with different fields of view, where N is a positive integer greater than or equal to 1. For example, when the terminal device 100 is a mobile phone, the mobile phone may include a wide-angle camera, a telephoto camera, and a main camera, and these three cameras have different fields of view. Through these three cameras, the mobile phone can collect three original images with different fields of view.
When the terminal device 100 includes at least two cameras with different fields of view, the terminal device 100 may call the N cameras with different fields of view when starting the shooting application, so as to collect N original images with different fields of view. Alternatively, when starting the shooting application, it may call only one camera for image collection and display the collected image in the viewfinder frame of the preview interface, and then, after entering the voice shooting mode, call the N cameras to collect N original images with different fields of view.
After collecting the N original images, the terminal device 100 may perform target recognition on each original image to obtain the object information of each object in each original image. Each original image may contain one or more objects, and the objects may include people, animals, text, and the like.
The object information includes attribute information and region information. The region information is used to represent the region where the object is located in the original image; that is, the region information can represent the position and size of the object in the original image. In general, the region information may be the pixel coordinate information of the bounding-box selection region of the object; the bounding-box selection region is usually a rectangular region, but may also be a region of another shape. The region information may also be the coordinate information of the region actually occupied by the object in the image.
The attribute information can be used to describe the object and includes, but is not limited to, type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, posture, and the like. Different objects may contain different kinds of attributes. Exemplarily, for a person, the kinds of attribute information include gender, age group, clothing color, and the like, while for an animal, the kinds of attribute information include category, color, and the like.
The type refers to the category of the object, and the categories may exemplarily include person, animal, text, and the like. The activity in the attribute information is used to describe the current action of the object, which may exemplarily include skateboarding, running, hula hooping, and the like; that is, through the activity information it can be known whether the object is currently running or hula hooping. The posture can be used to describe the current pose of the object; for example, through the posture it can be known whether a person is standing or sitting.
In some embodiments, the terminal device 100 may input the N original images into a target recognition model respectively to obtain the object information output by the target recognition model. The target recognition model may be pre-built and pre-trained. The target recognition model may include a target detection network model, a target classification network model, and the like, and is used to recognize images so as to identify the attribute information of the objects in an image, as well as the position and the area occupied by each object in the image. Of course, the terminal device 100 may also perform target recognition in other ways; for example, the object information of each object in an image may be recognized by combining image semantic segmentation and instance segmentation.
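A minimal sketch of this per-image recognition step, assuming some pre-trained model object with a detect method (a hypothetical API standing in for the detection and classification networks described above):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np

BoundingBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), pixel coordinates


@dataclass
class ObjectInfo:
    attributes: Dict[str, str]  # e.g. {"gender": "male", "activity": "skateboard"}
    region: BoundingBox         # bounding-box selection region in this image


def recognize_objects(original_images: List[np.ndarray], model) -> List[List[ObjectInfo]]:
    """Run the (assumed) pre-trained target recognition model on each of the
    N original images; model.detect is a stand-in for whatever detection plus
    classification pipeline the device actually uses."""
    per_image = []
    for image in original_images:
        detections = model.detect(image)  # hypothetical API: returns boxes + labels
        per_image.append([ObjectInfo(attributes=d.labels, region=d.box)
                          for d in detections])
    return per_image
```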
Taking FIG. 4 as an example, after the mobile phone 41 collects the original image corresponding to the video recording preview interface 44, the original image can be input into the target recognition model to obtain the object information of each object output by the model. Exemplarily, the mobile phone 41 can identify the information of the object 47 included in the original image; through the information of the object 47, it can be determined that the object is a woman who is hula hooping, as well as her position and size in the original image.
The terminal device 100 may perform target recognition on each collected frame of the original image in real time, or may perform target recognition every preset number of frames.
Step S304: Extract the second index data of the attribute information of each object from the pre-established data set.
It should be noted that the above data set is a pre-established database of commonly used user information, which may be embodied as a database. For different object types, the data set is preset with different attribute indexes. For example, for the person type, it may include, but is not limited to, the following attribute kinds: gender, age group, hair length, hair color, clothing type, clothing color, posture, activity, and the like. A corresponding attribute index is set under each attribute kind. In addition to persons, the object types may also include, but are not limited to, animals, text, and the like. For ease of description, the following takes the database as an example.
By way of example and not limitation, the attribute indexes of the person type in the database may be as shown in Table 1 below.
Table 1
[Table 1, published as an image in the original document, lists attribute keywords and their attribute indexes for the person type.]
After extracting the attribute information of each object in each original image, the terminal device 100 matches the attribute information of each object against the attribute keywords under the corresponding attribute kinds in the database, so as to find the first keywords in the database that match the attribute information. The first keywords may include single characters and/or words. If a first keyword matching the attribute information is found in the database, the index data of the first keyword is used as the second index data. Taking the scene shown in FIG. 4 as an example, the object 46 is a little boy playing a skateboard, and this object is of the person type. In Table 1 above, the attribute kind indexes corresponding to the person type include A to N.
First, the mobile phone 41 obtains the attribute information and region information of the object 46 through target recognition. The attribute information may include: male, child, skateboard, short hair, black hair, white T-shirt, …, and standing.
The attribute information of the object 46 is matched against the attribute keywords under the corresponding attribute kinds in the database. Specifically, for the gender attribute kind, "male" in the attribute information is matched against the attribute keyword corresponding to 001A and the keyword corresponding to 002A in Table 1 above, and the attribute index matching "male" is determined to be 001A.
For the age group attribute kind, "child" in the attribute information is matched against the attribute keywords corresponding to 001B, 002B, 003B, and 004B in Table 1 above, and the attribute index matching "child" is determined to be 003B.
By analogy, the attribute values skateboard, short hair, black hair, white T-shirt, …, and standing in the attribute information are respectively matched against the attribute keywords of the corresponding attribute kinds in Table 1 above, so as to obtain the attribute indexes corresponding to the attribute information of the object 46, that is, the second index data of the object 46.
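The matching walk-through above can be summarized in code. Only the index entries actually named in this document (001A, 003B, 001C, 002N) are included; the keyword lists are illustrative, and the substring matching is deliberately naive (for instance, "female" would also hit the "male" keyword, so a real matcher would be stricter):

```python
# A fragment of the attribute index, limited to entries recoverable from the
# worked example: male = 001A, child = 003B, skateboard = 001C, standing = 002N.
ATTRIBUTE_INDEX = {
    "001A": ("male", "boy", "man"),
    "003B": ("child", "kid", "little boy"),
    "001C": ("skateboard", "skateboarding"),
    "002N": ("standing",),
}


def index_attributes(attributes: dict) -> set:
    """Map each recognized attribute value to its attribute index, producing
    the object's second index data."""
    indexes = set()
    for value in attributes.values():
        for index, keywords in ATTRIBUTE_INDEX.items():
            if any(kw in value.lower() for kw in keywords):
                indexes.add(index)
    return indexes


# e.g. index_attributes({"gender": "male", "age": "child", "activity": "skateboard"})
# returns {"001A", "003B", "001C"}
```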
By way of example and not limitation, the attribute indexes of the object 46 are shown in Table 2 below.
Table 2
[Table 2, published as an image in the original document, lists the attribute indexes of object 46.]
Similarly, the attribute indexes of the object 47 and of the object 48 can be obtained.
The attribute indexes of the object 47 are shown in Table 3 below.
Table 3
[Table 3, published as an image in the original document, lists the attribute indexes of object 47.]
The attribute indexes of the object 48 are shown in Table 4 below.
Table 4
[Table 4, published as an image in the original document, lists the attribute indexes of object 48.]
Step S305: For each object, associate the second index data with the region information to obtain the association relationship of each object, where the association relationship is the association relationship between the index data and the region information.
For each object, after extracting the second index data of the attribute information of the object, the terminal device 100 associates the second index data of the object with the region information of the object in each original image, so as to establish the association relationship between the index data of each object and the region information in each original image. Based on the association relationship, the region information of the object can be found through the index data. Taking the scene of FIG. 4 as an example, the second index data of the object 46 in FIG. 4 may be as shown in Table 2 above. The mobile phone 41 includes a wide-angle camera, a main camera, and a telephoto camera; the wide-angle camera has the largest field of view, the main camera the second largest, and the telephoto camera the smallest. It is assumed that the original images collected by the wide-angle camera and the main camera include the object 46, while the original image collected by the telephoto camera does not. Here, the region information of an object refers to the pixel coordinate information of the bounding-box selection region of the object.
The second index data of the object 46 in Table 2 above is associated with the coordinates of the bounding-box selection region of the object 46 in the original image collected by the wide-angle camera, and with the coordinates of the bounding-box selection region of the object 46 in the original image collected by the main camera, so as to establish the association relationship between the index data of the object 46 and the region information.
By way of example and not limitation, the association relationship of the object 46 is shown in Table 5 below.
Table 5
[Table 5, published as an image in the original document, associates the index data of object 46 with its region information in the wide-angle and main-camera images.]
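Because Table 5 is published as an image, the following sketch mirrors the association it describes; the structure follows the surrounding text, but all pixel coordinates are invented placeholders:

```python
# Association between an object's index data and its region information in
# each original image, mirroring Table 5 for object 46.
association = {
    "object_46": {
        "index_data": {"001A", "003B", "001C", "002N"},
        "regions": {
            "wide_angle": (220, 140, 360, 420),  # bounding box in the wide-angle image
            "main":       (480, 300, 760, 880),  # bounding box in the main-camera image
            # no telephoto entry: object 46 is absent from that original image
        },
    },
}
```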
As can be seen from Table 5 above, the index data of the object 46 includes 001A, 003B, 001C, 002N, and so on. Through the index data, the region information of the object 46 in the original image collected by the wide-angle camera and the region information of the object 46 in the original image collected by the main camera can be found.
Exemplarily, the processor 110 of the terminal device 100 performs target recognition on the original images captured by the camera 140 to obtain the object information of each object in the original images; the processor 110 reads the database in the memory 120 and extracts the second index data of the attribute information of each object from the database; for each object, the processor 110 associates the second index data with the region information to obtain the association relationship between the index data and the region information, and stores the association relationship in the memory 120.
At a certain moment, after collecting the first voice through the microphone 130C and the audio module 130, the processor 110 performs voice recognition on the first voice to obtain a first voice recognition result; the processor 110 then reads the database from the memory 120, extracts the first index data of the first voice recognition result from the database, reads the pre-established association relationship, and determines the target region information corresponding to the first index data according to the first index data and the association relationship.
Step S306: Detect the first voice information.
Specifically, after entering the voice shooting mode, the terminal device 100 can call the microphone to collect the first voice information in real time.
Step S307: Perform voice recognition on the first voice information to obtain a first voice recognition result.
After collecting the first voice information, the terminal device 100 can extract the first voiceprint feature of the first voice information, and then determine the similarity between the first voiceprint feature and a pre-stored voiceprint feature. When the similarity between the first voiceprint feature and the pre-stored voiceprint feature is greater than a preset threshold, the first voice information is considered to be the voice of a specific user, and voice recognition is performed according to the first voiceprint feature and the first voice information to obtain the first voice recognition result.
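One common way to implement such a similarity check is cosine similarity between voiceprint embeddings; the sketch below assumes fixed-length embedding vectors and uses a placeholder threshold:

```python
import numpy as np


def is_enrolled_speaker(voiceprint: np.ndarray, enrolled: np.ndarray,
                        threshold: float = 0.8) -> bool:
    """Compare the first voiceprint feature with the pre-stored one via cosine
    similarity; 0.8 is a placeholder for the preset threshold."""
    denom = np.linalg.norm(voiceprint) * np.linalg.norm(enrolled) + 1e-9
    return float(np.dot(voiceprint, enrolled)) / denom > threshold
```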
When the similarity between the first voiceprint feature and the pre-stored voiceprint feature is less than or equal to the preset threshold, the terminal device 100 may prompt the user that there is no permission, or may store the first voiceprint feature for the recognition of the next voice command.
In specific applications, the terminal device 100 may input the first voiceprint feature and the first voice information into a semantic understanding model to obtain the first voice recognition result output by the semantic understanding model. The semantic understanding model may include a voice feature analysis module and a semantic analysis module; the voice feature analysis module can be used to filter out background interference sounds, and the semantic analysis module can be used to extract the user's voice command.
In other embodiments, the terminal device 100 may also skip extracting the voiceprint feature of the first voice information, that is, without distinguishing whether the first voice information is the voice of a specific user, and directly recognize the first voice information to obtain the first voice recognition result.
Step S308: When it is determined according to the first voice recognition result that the first voice information is a voice for specifying a target object, extract the first index data of the first voice recognition result from the data set based on the attribute keywords corresponding to the second index data.
After obtaining the first voice recognition result, the terminal device 100 can further confirm whether the first voice information is a voice command. When the first voice recognition result includes a specific keyword, the first voice information can be considered to be a voice for specifying the target object. The specific keywords are preset; for example, the specific keywords may include "shoot", "track", "photograph", and the like.
The first voice information may be a voice command that specifies a target object for the first time, a voice command for adding or removing target objects, or a voice command for replacing a target object; all of these can be regarded as voice commands for specifying a target object. Adding or removing target objects, as well as replacing a target object, are relative to the already specified target objects.
例如,图4中采集的语音“跟踪玩滑板的小男孩”则为初次指定目标对象的语音命令。在显示界面412之后,可以通过采集第一语音信息,将目标对象从对象46替换为其它对象,也可以在对象46的基础上,增加或减少其它对象作为目标对象。For example, the voice "follow the little boy playing skateboard" collected in Fig. 4 is the voice command for specifying the target object for the first time. After the interface 412 is displayed, the target object can be replaced from the object 46 to other objects by collecting the first voice information, and other objects can also be added or subtracted as the target object based on the object 46 .
当第一语音信息不是用于指定目标对象的语音,且该第一语音信息不是特定语音命令,终端设备100可以提示用户重新输入语音,或者也可以不进行提示操作。特定语音命令可以示例性包括调整跟踪对象的显示大小和显示区域的命令等。When the first voice information is not the voice for specifying the target object, and the first voice information is not a specific voice command, the terminal device 100 may prompt the user to re-input the voice, or may not perform a prompting operation. The specific voice command may exemplarily include commands to adjust the display size and display area of the tracked object, and the like.
当第一语音信息不是用于指定目标对象的语音,而是特定语音命令时,终端设备100可以执行该特定语音命令对应的操作。When the first voice information is not a voice for specifying the target object, but a specific voice command, the terminal device 100 may perform an operation corresponding to the specific voice command.
当第一语音信息为用于指定目标对象的语音时,终端设备100可以将第一语音识别结果与第二索引数据对应的属性关键字词进行匹配,查找出与第一语音识别结果相匹配的第二关键字词,然后再将与第一语音识别结果相匹配的第二关键字词对应的属性索引,作为第一语音识别结果的索引数据。When the first voice information is the voice used to specify the target object, the terminal device 100 may match the first voice recognition result with the attribute keywords corresponding to the second index data, and find out the voice that matches the first voice recognition result. The second key word, and then the attribute index corresponding to the second key word that matches the first speech recognition result is used as the index data of the first speech recognition result.
需要说明的是,在其它一些实施例中,也可以将第一语音识别结果与数据库中各个关键字词进行匹配。但是,与第二索引数据对应的属性关键字词进行匹配相较于与数据库中所有的属性关键字词进行匹配,前者缩小了匹配范围,提高了匹配精度和匹配速度。It should be noted that, in some other embodiments, the first speech recognition result may also be matched with each key word in the database. However, compared with matching with all the attribute keywords in the database, the matching with the attribute keywords corresponding to the second index data reduces the matching range and improves the matching accuracy and matching speed.
例如,参见图4,手机41响应于用户操作,显示录像预览界面44,并进入语音拍摄模式之后,手机41分别对采集到的N张原图像进行目标识别,得到各张原图像中 各个对象的对象信息,并从数据库提取各个对象的属性信息的第二索引数据,建立索引数据和区域信息之间的关联关系。然后,在某个时刻,手机41通过麦克风采集到第一语音信息,第一语音信息为“跟踪玩滑板的小男孩”。在根据预存储声纹特征确定第一语音信息为特定用户的语音之后,手机41对该第一语音信息进行语音识别,得到第一语音识别结果为“跟踪玩滑板的男孩”,由于第一语音识别结果中包含关键字“跟踪”,故确定该语音为用于指定目标对象的语音命令,再从第二索引数据对应的属性关键字词中搜索“滑板”和“男孩”的索引数据,得到第一语音识别结果的第一索引数据。如上表1所示,可以得到“滑板”的索引数据为001C,“男孩”的索引数据为001A和003B,即第一索引数据包括001C、001A和003B。For example, referring to FIG. 4 , the mobile phone 41 displays the video recording preview interface 44 in response to the user operation, and after entering the voice shooting mode, the mobile phone 41 respectively performs target recognition on the N original images collected, and obtains the information of each object in each original image. object information, and extract the second index data of the attribute information of each object from the database, and establish an association relationship between the index data and the area information. Then, at a certain moment, the mobile phone 41 collects the first voice information through the microphone, and the first voice information is "tracking the little boy playing skateboard". After determining that the first voice information is the voice of a specific user according to the pre-stored voiceprint feature, the mobile phone 41 performs voice recognition on the first voice information, and obtains the first voice recognition result as "tracking the boy playing skateboard". The recognition result contains the keyword "tracking", so it is determined that the voice is a voice command used to specify the target object, and then the index data of "skateboard" and "boy" are searched from the attribute keywords corresponding to the second index data to obtain The first index data of the first speech recognition result. As shown in Table 1 above, it can be obtained that the index data of "skateboard" is 001C, and the index data of "boy" are 001A and 003B, that is, the first index data includes 001C, 001A and 003B.
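The lookup in this example can be sketched as follows, assuming the Table 1 excerpt is held in a simple in-memory mapping (the data structure and names are assumptions):

```python
# Hypothetical in-memory excerpt of the database; index values follow Table 1 above.
ATTRIBUTE_INDEX = {
    "skateboard": ["001C"],
    "boy": ["001A", "003B"],
}

def extract_first_index_data(recognition_result: str) -> set:
    """Collect the attribute indexes of every attribute keyword found in the result."""
    indexes = set()
    for keyword, attr_indexes in ATTRIBUTE_INDEX.items():
        if keyword in recognition_result:
            indexes.update(attr_indexes)
    return indexes

# "track the boy playing skateboard" -> {"001C", "001A", "003B"}
print(extract_first_index_data("track the boy playing skateboard"))
```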
Step S309: determine, according to the association between the index data and the area information, the target area information corresponding to the first index data; the object corresponding to the target area information is the target object.
Based on the association, the terminal device 100 searches the index data set for indexes that match the first index data, determines the object with the largest number of matching items as the target object, and uses the area information corresponding to that object as the target area information.
If no index matching the first index data is found in the index data set, the terminal device 100 may prompt the user that no object matching the voice was found. The prompt may be a voice prompt and/or a text prompt. For example, the terminal device may play the voice "No matching object found, please re-enter the voice" and also display the corresponding text in the viewfinder.
The index data set is the collection of the second index data of all objects. Taking the scene in FIG. 4 as an example, the index data set may include the second index data of object 46, object 47, object 48, and object 49.
If the first voice recognition result contains only one object but at least two target objects are matched, the terminal device 100 may prompt the user to make a further selection. The prompt information may be text, voice, or another type of prompt. For example, if the first voice recognition result is "track the little boy playing skateboard", that is, the result contains only the single object "little boy" but two target objects are matched, the mobile phone may ask the user to perform a tap operation or to re-input the voice, so as to further determine which of the two target objects the user intends to specify.
In the process of searching for the target area information corresponding to the first index data, the terminal device 100 uses the area information with the most matching items as the target area information corresponding to the first index data.
For example, taking the scene in FIG. 4, the first voice recognition result is "track the boy playing skateboard"; the index data corresponding to "skateboard" is 001C, and the index data corresponding to "boy" are 001A and 003B. That is, the first index data includes 001C, 001A, and 003B.
From the index data set, it can be seen that the second index data corresponding to the area information of object 46 includes 001A, 003B, 001C, 002D, 005E, 008F, and 002N, while the second index data corresponding to the area information of object 48 includes 001A, 003B, 003C, 002D, 005E, 001F, and 002N.
By comparison, among the second index data of object 46, the target indexes that match the items in the first index data are 001C, 001A, and 003B, that is, the number of items matching the first index data is 3 (the target index has 3 items); whereas among the second index data of object 48, the target indexes that match the first index data include only 001A and 003B and not 001C, that is, the number of matching items is 2.
Similarly, the numbers of matching items of object 47 and object 49 are both 0.
That is, object 46 has the largest number of target index items, so the area information of object 46 is used as the target area information, and object 46 is used as the target object. As shown in Table 5 above, the target area information includes the coordinate information of the bounding box selection area in the original image collected by the wide-angle camera and the coordinate information of the bounding box selection area in the original image collected by the main camera.
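The selection rule in this example can be sketched as follows; the sets for objects 46 and 48 mirror the index values above, while the entries for objects 47 and 49 are placeholders, since the text only states that their match counts are 0:

```python
# Index values for objects 46 and 48 follow the example above; the entries for
# objects 47 and 49 are disjoint placeholders (their match count is stated to be 0).
first_index_data = {"001C", "001A", "003B"}
index_data_set = {
    "object 46": {"001A", "003B", "001C", "002D", "005E", "008F", "002N"},
    "object 48": {"001A", "003B", "003C", "002D", "005E", "001F", "002N"},
    "object 47": {"010X"},
    "object 49": {"011Y"},
}

def find_target_object(first_index, dataset):
    """Return the (object, match_count) pair with the largest index intersection."""
    return max(
        ((obj, len(first_index & second)) for obj, second in dataset.items()),
        key=lambda pair: pair[1],
    )

print(find_target_object(first_index_data, index_data_set))  # ('object 46', 3)
```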
In some embodiments, after determining the target object specified by the user's voice, the terminal device 100 may identify the target object with a corresponding mark so that the user knows which object it is. The marking method may be, but is not limited to, a rectangular frame. For example, referring to FIG. 4, after the mobile phone 41 collects the voice "track the boy playing skateboard", performs voice recognition on it, and determines the target area information, it can determine that the user-specified target object is object 46 and, from the target area information, the area in which object 46 is located in the original image. It may then mark object 46 with a rectangular frame, that is, frame object 46 with the rectangular frame 411 in the video recording preview interface 410, to indicate that the voice-specified target object is object 46.
Of course, in some other embodiments, after determining the target object and the target area information, the mobile phone 41 may also leave the target object unmarked.
Step S310: determine a target image from the N original images, and then crop a sub-image from the target image according to the target area information. The sub-image includes the target object, and the target image is the image with the smallest field of view in the image set.
The image set includes at least one original image that can completely contain the bounding box selection area of the target object; that is, the images that can completely contain that area are selected from the N original images and grouped into an image set.
The terminal device 100 may first select, from the N original images, the images that can completely contain the bounding box selection area of the target object to form the image set, and then select from the image set the original image with the smallest field of view as the target image. The bounding box selection area of the target object is usually a bounding rectangle.
For example, as shown in Table 5 above, the target object is object 46. The original images that can completely contain the bounding box selection area of object 46 are the original image collected by the wide-angle camera and the original image collected by the main camera, while the original image collected by the telephoto camera does not contain object 46. The image set therefore includes the wide-angle image and the main-camera image. Since the field of view of the wide-angle camera is larger than that of the main camera, the original image collected by the main camera is determined as the target image. Then, according to the target area information of object 46 in the main-camera image, a sub-image containing object 46 is cropped out.
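A minimal sketch of this selection rule (frames, field-of-view values, and names are illustrative assumptions, not values from the embodiment):

```python
# Select the smallest-field-of-view original image whose frame fully contains
# the target object's bounding box. Coordinates are (left, top, right, bottom).
def contains(frame, box):
    """True if box lies entirely inside frame."""
    return (frame[0] <= box[0] and frame[1] <= box[1]
            and box[2] <= frame[2] and box[3] <= frame[3])

def select_target_image(originals, box):
    """originals: list of (name, fov_degrees, frame). Return smallest-FOV match."""
    candidates = [img for img in originals if contains(img[2], box)]
    if not candidates:
        return None  # no camera fully captures the object
    return min(candidates, key=lambda img: img[1])

originals = [
    ("wide", 120, (0, 0, 4000, 3000)),      # hypothetical frames and FOVs
    ("main", 80, (500, 400, 3500, 2600)),
    ("tele", 30, (1500, 1100, 2500, 1900)),
]
print(select_target_image(originals, (800, 700, 1800, 2200)))  # -> main camera
```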
After determining the target image, the terminal device 100 may enlarge the bounding box selection area of the target object by a certain ratio (for example, 5%) and then, following a preset image aspect ratio, crop the sub-image from the target image according to the target area information. The preset aspect ratio can be set according to actual needs.
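The cropping rule can be sketched as follows; the 5% expansion follows the example above, while the 16:9 aspect ratio and the helper name are assumptions:

```python
# Expand the bounding box around its center, grow the crop to a preset aspect
# ratio, and clamp the result to the image bounds.
def crop_region(box, image_w, image_h, expand=0.05, aspect=16 / 9):
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    cx, cy = (left + right) / 2, (top + bottom) / 2
    w, h = w * (1 + expand), h * (1 + expand)   # 5% expansion
    if w / h < aspect:                          # widen or heighten to the preset ratio
        w = h * aspect
    else:
        h = w / aspect
    l = max(0, min(cx - w / 2, image_w - w))    # keep the crop inside the image
    t = max(0, min(cy - h / 2, image_h - h))
    return (l, t, min(l + w, image_w), min(t + h, image_h))

print(crop_region((800, 700, 1800, 2200), 4000, 3000))
```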
It should be noted that one sub-image contains one target object. When there are multiple target objects, multiple sub-images are obtained by cropping. For example, if the voice collected by the terminal device 100 is "track the little boy playing skateboard and the little boy running", the target objects include the "little boy playing skateboard" and the "little boy running"; a sub-image containing the skateboarding boy is cropped according to his target area information, and a sub-image containing the running boy is cropped according to his target area information.
Step S311: generate an output image based on the sub-image(s), and display the output image in the viewfinder.
It should be noted that, when there is only one sub-image, the terminal device 100 may scale that single sub-image down or up to a preset resolution to obtain the output image, and then display it in the viewfinder in the expected display mode. When there are multiple sub-images, the terminal device 100 may stitch them into one image and then scale the stitched image down or up to obtain the output image, or it may scale each sub-image first and then stitch them into the output image, before displaying it in the viewfinder in the expected display mode. For example, referring to FIG. 4, after the mobile phone 41 collects the voice "track the little boy playing skateboard", it recognizes the voice information and, based on the recognition result, finds the corresponding target area information; it then frames the target object with the bounding rectangle 411 to indicate that the voice-specified target object is object 46. At this point the user can tap the shooting button to track and shoot object 46. After receiving the tap on the shooting button, the mobile phone 41 can crop the sub-image from the target image, generate the output image based on the sub-image, and display the output image in the viewfinder, that is, display interface 412.
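A minimal sketch of this step, using the Pillow library for illustration (the library choice, the preset resolution, and equal-width horizontal tiling are assumptions):

```python
from PIL import Image

PRESET_SIZE = (1920, 1080)  # assumed output resolution

def build_output(sub_images):
    """One sub-image: scale it. Several: scale each to a tile, then stitch."""
    if len(sub_images) == 1:
        return sub_images[0].resize(PRESET_SIZE)
    tile_w = PRESET_SIZE[0] // len(sub_images)
    out = Image.new("RGB", PRESET_SIZE)
    for i, sub in enumerate(sub_images):
        out.paste(sub.resize((tile_w, PRESET_SIZE[1])), (i * tile_w, 0))
    return out
```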
Of course, the user may also skip tapping the shooting button to trigger tracking shooting. In that case, once the mobile phone 41 recognizes that the voice contains the keyword "track" and determines that the target object is object 46, it can automatically display interface 412 after interface 410.
In other embodiments, after the mobile phone 41 collects the voice "track the little boy playing skateboard" and determines through voice recognition that object 46 needs to be tracked and shot, it may directly display interface 412; that is, the mobile phone 41 may skip interface 410 and, after recognizing the target object and the shooting action, directly track and shoot object 46.
In tracking shooting mode, when the target object moves, the terminal device 100 keeps tracking it stably. For example, see FIG. 5, a schematic diagram of stable tracking provided by an embodiment of this application; the scenes in FIG. 4 and FIG. 5 are the same. As shown in FIG. 5, the mobile phone 41 collects image 51 through the wide-angle camera, the main camera, or the telephoto camera; image 51 includes multiple objects. The mobile phone performs target recognition on image 51 to obtain the object information of each object, extracts the second index data of each object's attribute information from the database, and establishes the association between each object's index data and area information.
When the voice 52 input by the user is detected, the mobile phone 41 performs voice recognition on voice 52 in combination with the object information and obtains the voice recognition result "track the boy playing skateboard"; then, according to the association between the index data and the area information, it determines the target object corresponding to the recognition result and that object's target area information. Here, the mobile phone 41 determines that the target object specified by voice 52 is object 53; according to the area information of object 53, it crops out a sub-image containing object 53 and generates the output image 54 based on the sub-image.
When object 53 moves, the mobile phone 41 collects image 55. Compared with image 51, the position of target object 53 in image 55 has changed. The mobile phone 41 performs target recognition on image 55 to obtain the object information, crops out a sub-image containing object 53 according to the area information of object 53, and generates the output image 56 based on the sub-image.
It should be noted that the sub-image cropped according to the area information of object 53 may contain other background information besides object 53. For ease of description, some background information may be omitted in FIG. 4, FIG. 5, and subsequent drawings, and only the information related to the target object is shown. In the embodiments of this application, the terminal device first performs target recognition on the images of each field of view to obtain the object information of each object, then extracts from the database the index data corresponding to each object's attribute information and associates the index data with the area information to obtain each object's association relationship. After voice information is detected, the voice information is recognized; the target object specified by the voice information and that object's target area information are determined according to the association relationship; finally, the target object is shot according to the target area information. In this way, instead of manually tapping an object in the viewfinder to specify the target object, the user can conveniently and accurately specify the tracking target by voice, which improves shooting convenience.
The embodiments of this application can implement both single-target tracking and multi-target tracking.
Referring to FIG. 6, FIG. 6 is a schematic diagram of tracking shooting for multi-target tracking according to an embodiment of this application. As shown in FIG. 6, after the mobile phone 61 receives a tap on the camera 62, it starts the camera application and displays the captured picture in the viewfinder. After entering the video recording mode, the mobile phone 61 displays the video recording preview interface 63, which includes shooting subjects such as object 64, object 65, object 66, and object 67.
After entering the voice shooting mode or the tracking shooting mode, the mobile phone 61 collects original images with different fields of view through cameras with different fields of view and performs target recognition on each original image to obtain the object information of each object. In addition, based on the pre-established database, it extracts the index data of each object's attribute information and establishes the association between the index data and the area information.
The mobile phone 61 collects the voice "track the boy playing skateboard and the girl playing hula hoop on two screens" through the microphone and recognizes it to obtain the voice recognition result; it then extracts the index data of the recognition result from the database and, based on the association between the index data and the area information, determines the target objects corresponding to the recognition result and the target area information.
It should be noted that, when the voice specifies at least two target objects, the first index data of the voice recognition result includes the indexes of the at least two target objects. In that case, for the index data of each target object, the target area information is determined based on the association between the index data and the area information.
For example, taking the scene in FIG. 6, the voice collected by the mobile phone 61 includes two target objects, "the boy playing skateboard" and "the girl playing hula hoop". The first index data corresponding to the voice recognition result includes the index data of both. Then, based on the association relationship, the target area information of "the boy playing skateboard" is determined from his index data, and the target area information of "the girl playing hula hoop" is determined from hers.
At this point, the mobile phone 61 determines that the target objects corresponding to the voice "track the boy playing skateboard and the girl playing hula hoop on two screens" are object 64 and object 66. According to the area information of object 64 and the area information corresponding to object 66, the mobile phone 61 frames object 66 with the bounding rectangle 69 and frames object 64 with the bounding rectangle 610. That is, after collecting the voice and determining the voice-specified target objects, the mobile phone 61 displays the video recording preview interface 68, in which dashed bounding rectangles indicate that the voice-specified target objects are object 64 and object 66.
Through voice recognition, the mobile phone 61 can determine that the shooting targets of the voice "track the boy playing skateboard and the girl playing hula hoop on two screens" are object 64 and object 66, and from the extracted keywords "two screens" and "track", it can determine that the shooting action is tracking shooting and the display mode is two-screen display.
At this point, after determining the voice-specified target objects, the mobile phone 61 determines, for each object, the target image that can completely contain that object's bounding box selection area and has the smallest field of view; the target images of object 64 and object 66 may be the same or different. Specifically, the target image of object 64 is determined for object 64, and the target image of object 66 is determined for object 66; then, according to the target area information, the first sub-image corresponding to object 64 is cropped from object 64's target image, and the second sub-image corresponding to object 66 is cropped from object 66's target image. Finally, the first sub-image and the second sub-image are stitched to obtain the output image, which is displayed in the viewfinder.
As shown in FIG. 6, the mobile phone 61 displays the tracking shooting interface 611, in which the left half area tracks and displays object 64 and the right half area tracks and displays object 66.
In other embodiments, after collecting the voice and determining the corresponding target objects, the mobile phone 61 may directly display the tracking shooting interface 611 without displaying the video recording preview interface 68.
It should be noted that, when there are at least two sub-images, they may be displayed in the order in which the target objects appear in the user's voice; for example, from left to right when the terminal device is in landscape orientation, and from top to bottom in portrait orientation. As shown in FIG. 6, in the user's voice "track the boy playing skateboard and the girl playing hula hoop on two screens", object 64 appears before object 66 and the mobile phone 61 is in landscape orientation, so object 64 is displayed in the left area and object 66 in the right area.
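The ordering rule can be sketched as follows, assuming each matched object is associated with the phrase that matched it in the recognition result (the mapping and names are hypothetical):

```python
# Sort matched objects by the position at which their describing phrase first
# appears in the recognition result; earlier phrases go left (landscape) or top.
def display_order(recognition_result: str, objects: dict) -> list:
    """objects maps an object id to the phrase that matched it."""
    return sorted(objects, key=lambda obj: recognition_result.find(objects[obj]))

voice = "track the boy playing skateboard and the girl playing hula hoop on two screens"
order = display_order(voice, {"object 66": "girl playing hula hoop",
                              "object 64": "boy playing skateboard"})
print(order)  # ['object 64', 'object 66']: object 64 is shown in the left area
```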
In addition, the user's voice does not have to specify two-screen tracking; when the mobile phone 61 determines that there are two target objects, it automatically uses two screens to track them. For example, the collected voice may be "track the boy playing skateboard and the girl playing hula hoop".
It can be understood that FIG. 6 illustrates the multi-target tracking scenario with two targets as an example. When three or more target objects are included, the process is similar to that in FIG. 6 and is not repeated here. For example, if the collected voice is "track the boy playing skateboard, the girl playing hula hoop, and the boy running" and the mobile phone 61 determines that the voice specifies three target objects, it uses three-screen tracking to display the three target objects.
In some embodiments, when the terminal device 100 has determined the voice-specified target object and is tracking and shooting it, if it detects that the target object is approaching the edge of the original image with the largest field of view, it may display prompt information on the interface to prompt the user to adjust the pose of the camera or the terminal, which improves the user experience.
Specifically, in the image with the largest field of view, the terminal device 100 measures the distances between the edges of the target object's bounding box selection area and the edges of that image. When the smallest of these distances falls below a certain distance threshold, the target object can be considered about to move beyond the camera's maximum capture range. At this point, in order to keep tracking and shooting the target object, prompt information may be displayed in the viewfinder to prompt the user to adjust the camera pose and, further, to indicate the direction of adjustment. For example, the prompt information may be text and symbols, that is, the text "Please adjust the camera direction" displayed in the viewfinder together with an arrow indicating the adjustment direction.
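A minimal sketch of this edge check (the pixel threshold and coordinate convention are assumptions):

```python
# Measure the gap between each side of the bounding box and the corresponding
# side of the widest image; the side with the smallest gap also gives the
# direction in which to prompt the user to move the phone.
EDGE_THRESHOLD = 50  # pixels, assumed

def near_edge(box, image_w, image_h, threshold=EDGE_THRESHOLD):
    left, top, right, bottom = box
    gaps = {"left": left, "top": top,
            "right": image_w - right, "bottom": image_h - bottom}
    side, gap = min(gaps.items(), key=lambda kv: kv[1])
    return (gap < threshold, side)

print(near_edge((3920, 1200, 3990, 1500), 4000, 3000))  # (True, 'right')
```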
In some embodiments, when the terminal device 100 has determined the voice-specified target object and is tracking and shooting it, if second voice information is collected, voice recognition is performed on it to obtain a second voice recognition result. When it is determined from the second voice recognition result that the second voice information is a voice command for adjusting the display mode of the target object, the cropping position of the target object's sub-image is adjusted according to the target object's target area information. The display mode of the target object includes its display position and/or display size.
For example, when the second voice information is "a bit to the right", the terminal device 100 adjusts the cropping position of the sub-image so that the target object in the output image deviates from the center of the picture toward the right. Normally, when cropping the sub-image, the target object is kept as close to the center of the picture as possible.
When it is determined from the second voice recognition result that the second voice information is a voice command for enlarging or reducing the target object, that is, for adjusting the target object's display size, a suitable field-of-view image may be selected from the different field-of-view images as the target image, and the sub-image is then cropped from that target image. For example, if the second voice information is "zoom in to the upper body" and the terminal device 100 determines from the second voice recognition result that it is a voice command for enlarging the target object, it selects the image corresponding to the telephoto camera as the target image and crops the sub-image from the original image collected by the telephoto camera.
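One way to realize "a bit to the right" is to shift the crop window leftward so that the object lands right of center in the output; the following is a sketch under that interpretation, with an assumed shift fraction, operating on a crop like the one returned by the earlier cropping sketch:

```python
# Shift the crop window left by a fraction of its width (clamped at the image
# border), which moves the object toward the right of the output picture.
def shift_crop(crop, image_w, fraction=0.15):
    left, top, right, bottom = crop
    dx = (right - left) * fraction
    dx = min(dx, left)  # do not move past the left image border
    return (left - dx, top, right - dx, bottom)
```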
During tracking shooting, the user can adjust the display position and/or display size of the target object by voice, which further improves shooting convenience.
Exemplarily, the terminal device 100 is a mobile phone that includes a main camera, a wide-angle camera, and a telephoto camera; in descending order of field of view they are: wide-angle camera, main camera, telephoto camera. FIG. 7A is a schematic diagram of an image processing process in a tracking shooting scene, and FIG. 7B is a schematic diagram of an interface in a tracking shooting scene.
As shown in FIG. 7A, the mobile phone collects images with different fields of view through the three cameras and performs target recognition on each field-of-view image to obtain the object information of each object; based on each object's attribute information and area information, it establishes the association between the index data and the area information. After detecting the voice "track the boy playing skateboard", the mobile phone performs voice command recognition in combination with the object information and determines that the voice-specified target object is object 72. Assuming that the images collected by both the main camera and the wide-angle camera can completely contain the bounding box selection area of object 72 at this moment, the image collected by the main camera is used as the target image; then, according to the target area information of object 72, a sub-image of object 72 is cropped from the main-camera image at the default scale and position. Finally, the output image 73 is generated based on the sub-image and displayed on the tracking shooting interface 78; see FIG. 7B for details.
The mobile phone keeps tracking and shooting object 72. At a certain moment it collects the voice "a bit to the right"; the mobile phone recognizes this voice and determines that it is a voice command for adjusting the target object's display position. Then, according to the target area information corresponding to object 72, the mobile phone adjusts the cropping position and crops the sub-image from the image collected by the main camera so that object 72 is displayed on the right side of the picture; that is, by adjusting the cropping position, the sub-image still completely contains the bounding box selection area of object 72, while object 72 deviates from the center of the output image toward the right. As shown in FIG. 7A, the mobile phone obtains the sub-image by adjusting the cropping position, obtains image 74 based on the sub-image, and displays image 74 on the tracking shooting interface 79 in FIG. 7B.
During tracking shooting, when the target object moves to the edge of the original image collected by the main camera and the mobile phone determines that this image can no longer completely contain the bounding box selection area of object 72, it selects the wide-angle camera image, which can completely contain that area; that is, the image collected by the wide-angle camera is determined as the target image. Then, according to the target area information corresponding to object 72, the sub-image of object 72 is cropped from the wide-angle image, image 75 is generated based on the sub-image, and image 75 is displayed on the tracking shooting interface 710 shown in FIG. 7B.
Object 72 keeps moving. When the mobile phone detects that target object 72 is approaching the edge of the wide-angle camera image, that is, it determines that the tracking target is about to move beyond the camera's maximum capture range, it can generate prompt information to prompt the user to adjust the camera pose. As shown in FIG. 7A and FIG. 7B, the mobile phone generates image 76 and displays it on the tracking shooting interface 711. Image 76 includes the text prompt "Please adjust the camera direction" and an arrow prompting the user to move the mobile phone to the right.
At a certain moment, after the mobile phone detects the voice "zoom in to the upper body", it performs voice recognition on it to obtain the recognition result. The pre-established database includes keywords for body parts, for example, upper body, lower body, head, eyes, mouth, and upper limbs. Having determined from the recognition result that the voice is a command for enlarging the target object, the mobile phone determines, from the images of the three different fields of view, the image collected by the telephoto camera as the target image, crops the sub-image from it, and generates image 77 based on the sub-image. Finally, image 77 is displayed on the tracking shooting interface 712 shown in FIG. 7B.
In some embodiments, while tracking and shooting the target object, the terminal device 100 can change, add, and remove target objects through voice commands.
For example, see the schematic diagram of replacing the target object shown in FIG. 8; the scene in FIG. 8 is the same as in FIG. 4. As shown in FIG. 4 and FIG. 8, after displaying the tracking shooting interface 412, the mobile phone 41 keeps tracking and shooting object 46. At a certain moment, the mobile phone 41 collects the voice "change to the boy running"; it then performs voice recognition and determines that the user wants to change the tracking object from object 46 to object 48. After extracting the index data of the voice, it determines the target area information of object 48 based on the association between the index data and the area information; then, according to that target area information, it crops the sub-image of object 48 from the target image that can completely contain object 48's bounding box selection area and has the smallest field of view, generates the output image based on the sub-image, and displays the output image, that is, displays the tracking shooting interface 413. The tracking object shown in interface 413 is object 48; the tracking object has been changed from object 46 to object 48.
After the tracking object is changed, the mobile phone 41 keeps tracking and shooting object 48. At a certain moment, the mobile phone 41 collects the voice "add a screen to track the girl playing hula hoop" and performs voice recognition to obtain the recognition result. When it determines from the recognition result that this voice is a command for adding a tracking object, after extracting the index data of the recognition result, it determines, based on the association between the index data and the area information, that the voice-specified object is object 47, together with object 47's target area information. Next, the sub-image of object 47 is cropped according to that target area information. Finally, the sub-image of object 48 and the sub-image of object 47 are stitched to obtain the output image, which is displayed on the tracking shooting interface 414; the left area of interface 414 tracks object 48 and the right area tracks object 47.
After adding the tracking object, the mobile phone 41 can continuously track object 48 and object 47 on two screens. At a certain moment, it collects the voice "remove the boy running" and recognizes it to obtain the recognition result. When it determines from the recognition result that this voice is a command for removing a tracking object, after extracting the index data of the recognition result, it determines, based on the association between the index data and the area information, that the voice-specified object is object 48, that is, the object to be removed is object 48. It then crops the sub-image of object 47 based on object 47's target area information, generates the output image based on that sub-image, and displays the output image on the tracking shooting interface 415.
For another example, FIG. 9A is a schematic diagram of an image processing process in a tracking shooting scene, and FIG. 9B is a schematic diagram of the interface corresponding to FIG. 9A. FIG. 9A and FIG. 9B show the same scene as FIG. 6.
As shown in FIG. 9A, the mobile phone collects images with different fields of view through cameras with different fields of view, performs target recognition on each field-of-view image to obtain the object information of each object, and establishes the association between the index data and the area information based on each object's attribute information and area information.
After detecting the voice "track the boy playing skateboard and the girl playing hula hoop on two screens", the mobile phone performs voice command recognition in combination with the object information and determines that the voice-specified tracking objects are object 92 and object 93. Object 92 and object 93 may then be framed with bounding rectangles, as shown in image 91, which also includes object 94 and object 95. The mobile phone then determines the target image that can completely contain the bounding box selection areas of both object 92 and object 93 and has the smallest field of view. Next, according to the target area information of object 92 and the target area information of object 93, the sub-image of object 92 and the sub-image of object 93 are cropped from the target image. Finally, the two sub-images are stitched to generate image 96, which is displayed on the tracking shooting interface 99 shown in FIG. 9B; the left area of interface 99 tracks and displays object 92, and the right area tracks and displays object 93.
Exemplarily, the mobile phone includes three cameras (main, telephoto, and wide-angle), and the set of original images it collects is shown in FIG. 9C. The set includes image 91 collected by the main camera, image 912 collected by the telephoto camera, and image 913 collected by the wide-angle camera. By comparison, relative to the main camera's image 91, the wide-angle image 913 with its larger field of view shows more objects, while the telephoto image 912 with its smaller field of view shows fewer objects.
Based on the original image set shown in FIG. 9C, for object 92, the images collected by the main, wide-angle, and telephoto cameras all contain the object, but only the main-camera and wide-angle images can completely contain object 92's bounding box selection area; that is, the image set of object 92 includes image 91 and image 913. The image with the smallest field of view is then selected from the image set as the target image; since the field of view of image 91 is smaller than that of image 913, image 91 is determined to be the target image.
As can be seen from the sub-image schematic shown in FIG. 9D, after determining that the target image of object 92 is image 91, the mobile phone crops the sub-image 914 of object 92 from image 91 according to the preset aspect ratio.
Similarly, the sub-image of object 93 is obtained by cropping; then sub-image 914 of object 92 and the sub-image of object 93 are scaled for display and stitched to obtain the output image 96.
The mobile phone continuously tracks and shoots object 92 and object 93. At a certain moment it collects the voice "remove the boy playing skateboard and replace him with the boy running"; the mobile phone recognizes this voice and determines that it is a command for replacing a tracking object. It then determines the target area information of object 93 and object 94, crops sub-images from the target image according to that information, and stitches the sub-images to obtain image 97, which is displayed on the tracking shooting interface 910 shown in FIG. 9B; the left area of interface 910 tracks and displays object 94, and the right area tracks and displays object 93.
The mobile phone continuously tracks and shoots object 93 and object 94. At a certain moment it collects the voice "add a screen to track the little girl in red"; the mobile phone recognizes this voice, determines that it is a command for adding a tracking object, and determines that the corresponding target object is object 95. It then determines the target area information corresponding to objects 93, 94, and 95, crops sub-images from the target image according to that information, and stitches the three sub-images to obtain image 98, which is displayed on the tracking shooting interface 911 shown in FIG. 9B; the left area of interface 911 tracks and displays object 94, the middle area object 93, and the right area object 95.
It should be noted that, while tracking and shooting the tracking objects, the terminal device 100 can perform target recognition on the images of each field of view collected in real time, obtain the object information of each object, and establish the association between each object's area information and index data. On this basis, after detecting the user's voice, it recognizes the voice to obtain the recognition result and, based on the established association between the area information and the index data, determines the information contained in the voice, for example, the target object corresponding to the voice and the action the voice contains; finally, it performs the action corresponding to the voice command.
In the embodiments of this application, during tracking shooting, the user can change, add, and remove tracking objects by voice, and can also adjust the display size and display position of the tracking objects by voice, which further improves the convenience and accuracy of tracking shooting.
The above embodiments all use video recording scenes to exemplify tracking shooting. In specific applications, the technical solutions provided in the embodiments of this application can also be applied to photographing scenes.
For example, see the schematic diagram of tracking shooting in a photographing scene shown in FIG. 10. As shown in FIG. 10, the home screen 102 of the mobile phone 101 includes multiple applications such as the camera 103, Smart Life, Settings, Calendar, Clock, and Gallery. After the mobile phone 101 receives a tap on the camera 103, it responds by displaying the photo preview interface 104, whose viewfinder includes multiple shooting subjects such as object 105, object 106, object 107, and object 108.
After entering the voice shooting mode, the mobile phone 101 performs target recognition on the images with different fields of view to obtain the object information, extracts the index data of the objects' attribute information from the database, and establishes the association between the index data and the area information.
After collecting the voice "track the boy running", the mobile phone 101 displays the photo preview interface 109. In this process, the mobile phone 101 performs voice recognition on the voice, obtains the recognition result, and extracts the recognition result's index data; then, according to the association between the index data and the area information, it looks up the target object corresponding to the recognition result and that object's target area information, determining in this case that the voice-specified target object is object 107. Next, it determines, from the images with different fields of view, the target image that can completely contain object 107's bounding box selection area and has the smallest field of view; then, according to object 107's target area information, it crops the sub-image from the target image, scales the sub-image for display to obtain the output image, and displays the photo preview interface 109.
While object 107 moves, the mobile phone 101 keeps tracking it and displays the photo preview interface 1010. During tracking shooting, the user can tap the shooting button 1011 to photograph object 107. After receiving the tap on the shooting button 1011, the mobile phone 101 responds by capturing a photo of object 107. When the mobile phone 101 receives a tap on the button 1012, it responds by displaying the photo preview interface 1013, which shows the photo of object 107.
It should be noted that, in the photographing scene, the user can change, add, and remove tracking objects through voice commands, can adjust the tracking object's display position and enlarge or reduce the tracking object through voice commands, and is prompted to adjust the camera pose when the tracking object reaches the edge of the image with the largest field of view.
为了更好地介绍本申请实施例提供的方案,下面将结合具体示例进行介绍说明。此时,终端设备100为手机,该手机包括广角摄像头、主摄像头和长焦摄像头,这三个摄像头的视场角不同。In order to better introduce the solutions provided by the embodiments of the present application, the following description will be given with reference to specific examples. At this time, the terminal device 100 is a mobile phone, and the mobile phone includes a wide-angle camera, a main camera, and a telephoto camera, and the three cameras have different fields of view.
参见图11示出的拍摄流程示意框图,如图11所示,可以通过两种方式拍摄得到跟踪目标的照片或者视频,其中一种是传统方式,另一种是本申请实施例提供的方式。Referring to the schematic block diagram of the shooting process shown in FIG. 11 , as shown in FIG. 11 , a photo or video of the tracking target can be obtained by shooting in two ways, one of which is the traditional way, and the other is the way provided by the embodiment of the present application.
传统方式下,手机进入相机拍照预览或者录像模式之后,对照目标方向;用户手动移动手机,以寻找目标对象;找到目标对象之后,手动进行缩放,以调整目标对象在图像中的显示比例和焦距等;然后,当目标对象移动时,手动跟踪运动目标,即手动移动手机,以让目标对象尽可能地位于画面中心;最终,生成目标对象的照片或者视频。In the traditional way, after the mobile phone enters the camera's photo preview or video recording mode, the target direction is checked; the user manually moves the mobile phone to find the target object; after finding the target object, manually zoom in and out to adjust the display ratio and focal length of the target object in the image, etc. ; Then, when the target object moves, manually track the moving target, that is, manually move the mobile phone, so that the target object is located in the center of the screen as much as possible; finally, a photo or video of the target object is generated.
在传统方式下,用户需要手动操作确定目标对象,手动调整目标对象的显示比例,并且,在目标对象移动时,需要手动移动手机。In the conventional manner, the user needs to manually operate to determine the target object, manually adjust the display scale of the target object, and, when the target object moves, the mobile phone needs to be moved manually.
本申请实施例提供的方式中,手机在进入相机拍照预览或者录像,对照目标方向之后,通过广角摄像头、主摄像头以及长焦摄像头,分别采集广角图像、主摄图像以及长焦图像,并对各个视场角图像进行目标识别,得到各个对象的对象信息,并且,从数据库中提取对象的属性信息的索引数据,建立索引数据和区域信息之间的关联关系。In the method provided by the embodiment of the present application, the mobile phone enters the camera to take a photo preview or video, and after comparing the target direction, the wide-angle camera, the main camera, and the telephoto camera are used to collect the wide-angle image, the main camera image, and the telephoto image, respectively. The field of view image is used for target recognition to obtain the object information of each object, and the index data of the attribute information of the object is extracted from the database, and the association relationship between the index data and the area information is established.
用户可以通过语音发送跟踪命令,手机根据索引数据和区域信息之间的关联关系以及预存储声纹特征,准确识别语音命令,确定出语音命令对应的目标对象,以及该目标对象的目标区域信息;然后,根据目标对象的目标区域信息,以及预期显示方式,从视场角图像中裁剪得到子图像,即生成单一目标区域,或者多目标区域。该预期显示方式可以包括图像宽高比。最后,根据目标对象的大小,调整焦距,即对裁剪出来的图像进行缩放显示,以达到目标显示比例,生成跟踪对象的照片或视频。Users can send tracking commands by voice, and the mobile phone can accurately identify the voice command according to the relationship between the index data and the area information and the pre-stored voiceprint features, and determine the target object corresponding to the voice command and the target area information of the target object; Then, according to the target area information of the target object and the expected display mode, a sub-image is obtained by cropping from the field of view image, that is, a single target area or multiple target areas are generated. The intended display manner may include the image aspect ratio. Finally, according to the size of the target object, the focal length is adjusted, that is, the cropped image is zoomed and displayed to achieve the target display ratio, and a photo or video of the tracked object is generated.
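As a rough sketch of the crop-and-scale step, assuming the target area is a pixel bounding box (x, y, w, h), the expected display mode is an aspect ratio, and OpenCV is available:

```python
import cv2  # illustrative choice for resizing; not mandated by the disclosure

def crop_and_scale(image, area, aspect_ratio, out_size):
    """Crop the target's sub-image, widen the crop window to the expected
    aspect ratio (width / height), and scale it to out_size for display.
    Boundary clamping is simplified in this sketch."""
    x, y, w, h = area
    if w / h < aspect_ratio:           # window too narrow: pad the width
        pad = int(h * aspect_ratio - w) // 2
        x, w = x - pad, w + 2 * pad
    else:                              # window too short: pad the height
        pad = int(w / aspect_ratio - h) // 2
        y, h = y - pad, h + 2 * pad
    sub = image[max(y, 0):y + h, max(x, 0):x + w]
    # Scaling the crop to the output size plays the role of adjusting the
    # focal length to reach the target display scale.
    return cv2.resize(sub, out_size)
```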
For the way provided by the embodiments of the present application, reference may also be made to the schematic flowchart of voice shooting shown in FIG. 12. As shown in FIG. 12, after the mobile phone enters the camera photo preview or video recording scenario, the user can manually adjust the camera pose to aim at the target direction. The mobile phone can then capture wide-angle, main-camera, and telephoto images in real time and perform target recognition on each image to obtain object information. Meanwhile, the mobile phone also captures the user's voice commands in real time through the microphone and, based on the object information and the pre-stored voiceprint features, accurately recognizes the voice command and determines the target object corresponding to the voice command and the information contained in the command. Then, based on the target area information of the target object and the expected display mode, a sub-image is cropped from the field-of-view image, that is, a single target area or multiple target areas are generated. The expected display mode may include the image aspect ratio. Finally, based on the size of the target object, the focal length is adjusted to reach the target display scale, and a photo or video of the tracked object is generated.
When a deviation of the target object is detected, that is, when the target object is detected to be close to the edge of the image with the largest field of view, the user can be prompted on the preview interface to adjust the camera pose.
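A minimal sketch of this deviation check, assuming pixel coordinates and an implementer-chosen distance threshold:

```python
def needs_pose_prompt(box, image_size, threshold):
    """True if the target's bounding box (x, y, w, h) comes within `threshold`
    pixels of any edge of the largest-field-of-view image (width, height)."""
    x, y, w, h = box
    img_w, img_h = image_size
    distances = (x, y, img_w - (x + w), img_h - (y + h))
    return min(distances) <= threshold

# Example: prompt once the box is within 20 px of the 1920x1080 frame edge.
if needs_pose_prompt((1850, 400, 60, 150), (1920, 1080), 20):
    print("Target near the frame edge: please adjust the camera pose")
```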
In the way provided by the embodiments of the present application, the user can specify the tracking object by voice, and the tracking object is continuously tracked without any manual operation by the user. By comparison, the way provided by the embodiments of the present application allows the user to specify the tracking object conveniently and accurately, which improves the convenience of tracking shooting.
An embodiment of the present application further provides a method for determining a target object, so that the user can specify the target object by voice, which improves the convenience and accuracy of determining the target object.
Refer to FIG. 13, a schematic flowchart of the method for determining a target object provided by an embodiment of the present application. The method may be applied to the terminal device 100 and may include the following steps:
Step S1301: Acquire an image to be processed, where the image to be processed includes at least one object.
The image to be processed may include at least two original images with different fields of view; for example, it may include an original image captured by a telephoto camera and an original image captured by a wide-angle camera. It may also include an original image with a single field of view.
Step S1302: Determine the object information of each object in the image to be processed, where the object information includes area information and attribute information, and the area information describes the position of the object in the image to be processed.
In some embodiments, target recognition may be performed on the image to be processed to determine the object information of each object in it.
Step S1303: Extract the second index data of each piece of attribute information from a pre-established data set, where the data set includes attribute keywords and attribute indexes.
Step S1304: For each object, associate the second index data with the area information to obtain the association between the index data and the area information.
Step S1305: Acquire first voice information.
Step S1306: Extract the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set.
Step S1307: Determine, based on the association and the second index data, the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
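As a hypothetical illustration of step S1306, the first index data could be obtained by keeping only the words of the recognition result that appear as attribute keywords in the data set:

```python
def first_index_data(recognition_text, attribute_index_set):
    """Keep only the recognized words that are attribute keywords of the
    pre-established data set and return their attribute indexes."""
    words = recognition_text.lower().split()
    return {attribute_index_set[w] for w in words if w in attribute_index_set}

# "track the running boy" -> the indexes of "running" and "boy".
voice_indexes = first_index_data("track the running boy",
                                 {"running": 10, "boy": 2, "girl": 3})
```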
It should be noted that the method for determining a target object provided by the embodiments of the present application can be applied to the tracking video recording scenario and the photographing scenario mentioned above, but is not limited to those scenarios. For brevity, where FIG. 13 is the same as the foregoing embodiments, reference may be made to the corresponding content of those embodiments, and details are not repeated here.
In the embodiments of the present application, the object information of each object in the image is recognized in advance, and the association between each object's index data and area information is established based on the pre-established data set. When a voice command is captured, the object in the image corresponding to the voice command is determined based on the association, so that the user can specify the target object conveniently and quickly.
Refer to FIG. 14, another schematic flowchart of the method for determining a target object provided by an embodiment of the present application. The method may be applied to the terminal device 100 and may include the following steps:
Step S1401: Display, in the viewfinder, a picture captured by the camera, where the picture includes at least one object.
The terminal device 100 may include one camera, or at least two cameras with different fields of view. The terminal device 100 may display the output picture in the viewfinder based on the original image captured by one of the cameras.
Step S1402: Acquire first voice information, where the first voice information is used to specify a target object.
Step S1403: Determine the object in the picture corresponding to the first voice information as the target object.
After capturing the original image, the terminal device 100 may perform target recognition on it to obtain the object information of each object in the original image, and then establish the association between the index data and the area information based on the object information and the pre-established data set. After the first voice information is captured, it can be recognized, and based on the speech recognition result and the association, the object corresponding to the first voice information is determined and taken as the target object.
For the process of establishing the association and the process of determining the object corresponding to the first voice information, reference may be made to the foregoing embodiments; details are not repeated here.
In the tracking shooting scenario, after the object in the picture corresponding to the first voice information is determined as the target object, the output image of the target object can be displayed. For example, in the scenario of FIG. 4, after the tracking object is determined to be the object 46, the interface 412 is displayed.
During tracking shooting, the user can adjust the display mode of the target object, add or remove target objects, replace the target object, and so on by voice. For details, refer to the foregoing embodiments; they are not repeated here.
In the embodiments of the present application, the user can specify the target object conveniently and quickly by voice.
Refer to FIG. 15, another schematic flowchart of the photographing method provided by an embodiment of the present application. The method may be applied to the terminal device 100 and may include the following steps:
Step S1501: Display, in the viewfinder, a picture captured by the camera, where the picture includes at least one object.
Step S1502: Receive a first instruction, where the first instruction is used to specify at least two target objects.
The first instruction may be a voice instruction or a non-voice instruction; for example, the user may input the first instruction by tapping or touching with a finger.
Step S1503: Determine at least two objects in the picture corresponding to the first instruction as the at least two target objects.
When the first instruction is a voice instruction, the terminal device may first establish the association between each object's index data and area information based on each object's object information and the pre-established data set, and then determine the objects in the image corresponding to the first instruction based on the association and the speech recognition result. For the specific process, refer to the foregoing embodiments; details are not repeated here.
When the first instruction is a non-voice instruction, the terminal device may determine the target objects specified by the user according to the user's touch positions. For example, the user may input the first instruction by tapping the object 64 and the object 66 in FIG. 6 in sequence, and the terminal device determines, according to the display position of each object in the image, the objects at the tapped positions as the target objects.
Step S1504: Display the picture of each target object in a separate region of the viewfinder, where each region displays the picture of one target object.
When the first instruction is a voice instruction, the terminal device may crop out sub-images based on the target area information of each target object, stitch the sub-images together, and display them in the expected display mode.
When the first instruction is a non-voice instruction, after determining the display region of each target object, the terminal device may likewise crop out the sub-images, stitch them, and then display them.
For example, in the scenario shown in FIG. 6, the first instruction is a voice instruction. The captured voice "track the boy playing skateboard and the girl playing hula hoop on two screens" specifies two target objects. After recognizing the voice and determining the target objects based on the association, the terminal device displays the interface 611. The left region of the interface 611 displays the picture of the object 64, and the right region displays the picture of the object 66. When the first instruction specifies three target objects, the display result of the terminal device may be as shown in the interface 911 in FIG. 9B.
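As one illustration of the split-region display, the sketch below scales each target's sub-image to a common height and places them left to right; the layout and the 720-pixel height are arbitrary choices made for this sketch.

```python
import cv2
import numpy as np

def stitch_side_by_side(sub_images, height=720):
    """Scale each target's sub-image to a common height and stitch them left
    to right: two regions for two targets, three for three, and so on."""
    resized = [
        cv2.resize(img, (int(img.shape[1] * height / img.shape[0]), height))
        for img in sub_images
    ]
    return np.hstack(resized)  # the stitched output image
```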
In the embodiments of the present application, when the user specifies at least two target objects, the terminal device displays the at least two target objects in separate regions, which improves the convenience of tracking shooting.
Corresponding to the method for determining a target object in the foregoing embodiments, FIG. 16 shows a schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 16, the apparatus may include:
an image acquisition module 161, configured to acquire an image to be processed, where the image to be processed includes at least one object;
a first object information determination module 162, configured to determine the object information of each object in the image to be processed, where the object information includes area information and attribute information, and the area information describes the position of the object in the image to be processed;
a first extraction module 163, configured to extract the second index data of each piece of attribute information from a pre-established data set, where the data set includes attribute keywords and attribute indexes;
a first establishment module 164, configured to associate, for each object, the second index data with the area information to obtain the association between the index data and the area information;
a first voice information acquisition module 165, configured to acquire first voice information;
a second extraction module 166, configured to extract the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set;
a first target object determination module 167, configured to determine, based on the association and the second index data, the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object.
In some embodiments, the first target object determination module is specifically configured to: for each piece of second index data, match each index item in the second index data against each index item in the first index data, and determine the number of target index items, where a target index item is an index item in the second index data that matches an index item in the first index data; and based on the association, determine the area information corresponding to the second index data with the largest number of matching items as the target area information.
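Continuing the earlier hypothetical sketch, the match-count selection described here might look like:

```python
def match_target(association, first_index_data):
    """For each object's second index data, count the index items that also
    appear in the first index data, and return the area information of the
    object with the most matching items."""
    best_area, best_count = None, 0
    for second_index_data, area in association:
        count = len(second_index_data & first_index_data)
        if count > best_count:
            best_area, best_count = area, count
    return best_area  # target area information, or None if nothing matched

association = [({1, 2, 10}, (120, 80, 60, 150)),
               ({1, 3, 11}, (300, 90, 70, 160))]
target_area = match_target(association, {2, 10})  # -> (120, 80, 60, 150)
```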
In some embodiments, the first extraction module is specifically configured to: for each piece of attribute information, look up, in the data set, the first keyword matching the attribute information; and use the index data of the first keyword in the data set as the second index data.
In some embodiments, the second extraction module is specifically configured to: perform speech recognition on the first voice information to obtain a first speech recognition result; determine, according to the first speech recognition result, that the first voice information is a voice command for specifying a target object; look up, among the attribute keywords corresponding to the second index data, the second keyword matching the first speech recognition result; and use the index data of the second keyword in the data set as the first index data.
In some embodiments, the second extraction module is specifically configured to: extract the voiceprint features of the first voice information; determine the similarity between the voiceprint features and the pre-stored voiceprint features; and when the similarity is greater than or equal to a preset threshold, input the voiceprint features and the first voice information into a semantic understanding model to obtain the first speech recognition result output by the semantic understanding model.
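One way to realize this gate is sketched below; cosine similarity, the 0.8 threshold, and the semantic_model callable are assumptions of this sketch rather than the disclosed design.

```python
import numpy as np

def gated_recognition(voice_audio, voiceprint, stored_voiceprint,
                      semantic_model, threshold=0.8):
    """Run the semantic understanding model only if the speaker's voiceprint
    is similar enough to the pre-stored voiceprint."""
    similarity = float(
        np.dot(voiceprint, stored_voiceprint)
        / (np.linalg.norm(voiceprint) * np.linalg.norm(stored_voiceprint))
    )
    if similarity < threshold:
        return None  # not the enrolled user: ignore the command
    return semantic_model(voiceprint, voice_audio)
```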
In some embodiments, the first object information determination module is specifically configured to: input the image to be processed into a target recognition model; and obtain the object information output by the target recognition model.
In some embodiments, the first voice information is a voice command for specifying a target object for the first time, a voice command for adding or removing target objects, or a voice command for replacing a target object.
In some embodiments, the image to be processed includes N original images with different fields of view, where N is a positive integer greater than or equal to 1. The apparatus may further include:
a first display module, configured to obtain, based on the original images and the target area information, the first output image of the target object, and display the first output image in the viewfinder.
In some embodiments, the first display module is specifically configured to: determine a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; crop the first sub-image of the target object from the first target image according to the target area information; and generate the first output image based on the first sub-image.
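The selection of the first target image could be sketched as follows, under two assumptions made for illustration: the original images are ordered from the smallest to the largest field of view, and the target's bounding box has already been mapped into each image's coordinates (None if it falls outside):

```python
def pick_first_target_image(originals, boxes):
    """originals: images ordered from smallest to largest field of view
    (e.g. telephoto, main, wide), as numpy arrays; boxes: the target's
    bounding box (x, y, w, h) in each image, or None.
    Returns the first image that fully contains the box, i.e. the
    smallest-field-of-view member of the image set."""
    for img, box in zip(originals, boxes):
        if box is None:
            continue
        x, y, w, h = box
        img_h, img_w = img.shape[:2]
        if x >= 0 and y >= 0 and x + w <= img_w and y + h <= img_h:
            return img, box
    return None, None  # no original fully contains the target
```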
In some embodiments, when there are at least two target objects and each target object corresponds to one first sub-image, the first display module is specifically configured to: stitch the first sub-images corresponding to the target objects to obtain the first output image.
In some embodiments, the apparatus further includes:
a second voice information acquisition module, configured to acquire second voice information;
a first recognition module, configured to perform speech recognition on the second voice information to obtain a second speech recognition result;
a first determination module, configured to determine, according to the second speech recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
a second determination module, configured to determine a second target image from the original images;
a first cropping module, configured to crop a second sub-image of the target object from the second target image;
a second display module, configured to generate a second output image based on the second sub-image and display the second output image in the viewfinder.
In some embodiments, the apparatus may further include:
a first detection module, configured to determine, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set;
a first prompt module, configured to display prompt information in the viewfinder if the minimum of the distances is less than or equal to a preset distance threshold, where the prompt information prompts the user to adjust the camera pose.
In some embodiments, the attribute information includes at least one of the following: type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture.
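As a rough illustration only, the attribute information could be carried in a structure like the following; the field set mirrors the list above, while the class and field names are invented here:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttributeInfo:
    """One possible shape for an object's attribute information; every field
    is optional because target recognition may not report all of them."""
    type: Optional[str] = None            # e.g. "person", "dog"
    gender: Optional[str] = None
    age_group: Optional[str] = None       # e.g. "child", "adult"
    activity: Optional[str] = None        # e.g. "running", "skateboarding"
    hair_length: Optional[str] = None
    hair_color: Optional[str] = None
    clothing_type: Optional[str] = None
    clothing_color: Optional[str] = None
    posture: Optional[str] = None         # e.g. "standing", "sitting"
```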
The above apparatus for determining a target object has the function of implementing the above method for determining a target object. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, and a module may be software and/or hardware.
It should be noted that, since the information exchange and execution processes between the above apparatuses/modules are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section; they are not repeated here.
Corresponding to the method for determining a target object in the foregoing embodiments, FIG. 17 shows a schematic block diagram of an apparatus for determining a target object provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 17, the apparatus may include:
a first picture display module 171, configured to display, in the viewfinder, a picture captured by the camera, where the picture includes at least one object;
a first acquisition module 172, configured to acquire first voice information, where the first voice information is used to specify a target object;
a second target object determination module 173, configured to determine the object in the picture corresponding to the first voice information as the target object.
In some embodiments, the second target object determination module is specifically configured to: extract the first index data of the first voice information from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determine, based on the association of each object, the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object;
where the association is the mapping between the second index data of the object's attribute information and the area information, and the area information describes the position of the object in the original image captured by the camera.
In some embodiments, the apparatus may further include:
a second object information determination module, configured to determine the object information of each object in the original image, where the object information includes area information and attribute information;
a third extraction module, configured to extract the second index data of each object's attribute information from the data set;
a second establishment module, configured to associate, for each object, the second index data with the area information to obtain the association between the index data and the area information.
In some embodiments, the third extraction module is specifically configured to: for each piece of attribute information, look up, in the data set, the first keyword matching the attribute information; and use the index data of the first keyword in the data set as the second index data.
In some embodiments, the second object information determination module is specifically configured to: input the original image into a target recognition model; and obtain the object information output by the target recognition model.
In some embodiments, the second target object determination module is specifically configured to: for each piece of second index data, match each index item in the first index data against each index item in the second index data, and determine the number of target index items, where a target index item is an index item in the second index data that matches an index item in the first index data; and based on the association, determine the area information corresponding to the second index data with the largest number of matching items as the target area information.
In some embodiments, the second target object determination module is specifically configured to: perform speech recognition on the first voice information to obtain a first speech recognition result; determine, according to the first speech recognition result, that the first voice information is a voice command for specifying a target object; look up, among the attribute keywords corresponding to the second index data, the second keyword matching the first speech recognition result; and use the index data of the second keyword in the data set as the first index data.
In some embodiments, the second target object determination module is specifically configured to: extract the voiceprint features of the first voice information; determine the similarity between the voiceprint features and the pre-stored voiceprint features; and when the similarity is greater than or equal to a preset threshold, input the voiceprint features and the first voice information into a semantic understanding model to obtain the first speech recognition result output by the semantic understanding model.
In some embodiments, the camera includes N cameras with different fields of view, where N is a positive integer greater than or equal to 1. The apparatus further includes:
a third display module, configured to obtain, based on the N original images with different fields of view and the target area information, the first output image of the target object, and display the first output image in the viewfinder.
In some embodiments, the third display module is configured to: determine a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; crop the first sub-image of the target object from the first target image according to the target area information; and generate the first output image based on the first sub-image.
In some embodiments, when there are at least two target objects and each target object corresponds to one first sub-image, the third display module is specifically configured to: stitch the first sub-images corresponding to the target objects to obtain the first output image.
In some embodiments, the apparatus further includes:
a second acquisition module, configured to acquire second voice information;
a second recognition module, configured to perform speech recognition on the second voice information to obtain a second speech recognition result;
a third determination module, configured to determine, according to the second speech recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
a fourth determination module, configured to determine a second target image from the original images;
a second cropping module, configured to crop a second sub-image of the target object from the second target image;
a fourth display module, configured to generate a second output image based on the second sub-image and display the second output image in the viewfinder.
In some embodiments, the apparatus may further include:
a second detection module, configured to determine, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set;
a second prompt module, configured to display prompt information in the viewfinder if the minimum of the distances is less than a preset distance threshold, where the prompt information prompts the user to adjust the camera pose.
The above apparatus for determining a target object has the function of implementing the above method for determining a target object. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, and a module may be software and/or hardware.
It should be noted that, since the information exchange and execution processes between the above apparatuses/modules are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section; they are not repeated here.
Corresponding to the photographing method of the foregoing embodiments, FIG. 18 shows a schematic block diagram of the photographing apparatus provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 18, the apparatus may include:
a second picture display module 181, configured to display, in the viewfinder, a picture captured by the camera, where the picture includes at least one object;
a receiving module 182, configured to receive a first instruction, where the first instruction is used to specify at least two target objects;
a third target object determination module 183, configured to determine at least two objects in the picture corresponding to the first instruction as the at least two target objects;
a sub-region display module 184, configured to display the picture of each target object in a separate region of the viewfinder, where each region displays the picture of one target object.
In some embodiments, the first instruction is a voice instruction.
In some embodiments, the third target object determination module is specifically configured to: extract the first index data of the first instruction from a pre-established data set, where the data set includes attribute keywords and attribute indexes; and determine, based on the association of each object, the target area information corresponding to the first index data, where the object corresponding to the target area information is the target object;
where the association is the mapping between the second index data of the object's attribute information and the area information, and the area information describes the position of the object in the original image captured by the camera.
In some embodiments, the apparatus may further include:
a third object information determination module, configured to determine the object information of each object in the original image, where the object information includes area information and attribute information;
a fourth extraction module, configured to extract the second index data of each object's attribute information from the data set;
a third establishment module, configured to associate, for each object, the second index data with the area information to obtain the association between the index data and the area information.
In some embodiments, the fourth extraction module is specifically configured to: for each piece of attribute information, look up, in the data set, the first keyword matching the attribute information; and use the index data of the first keyword in the data set as the second index data.
In some embodiments, the third object information determination module is specifically configured to: input the original image into a target recognition model; and obtain the object information output by the target recognition model.
In some embodiments, the third target object determination module is specifically configured to: for each piece of second index data, match each index item in the first index data against each index item in the second index data, and determine the number of target index items, where a target index item is an index item in the second index data that matches an index item in the first index data; and based on the association, determine the area information corresponding to the second index data with the largest number of matching items as the target area information.
In some embodiments, the third target object determination module is specifically configured to: look up, among the attribute keywords corresponding to the second index data, the second keyword matching the first instruction; and use the index data of the second keyword in the data set as the first index data.
In some embodiments, the camera includes N cameras with different fields of view, where N is a positive integer greater than or equal to 1. The sub-region display module is specifically configured to: for each target object, determine a first target image from the N original images, where the first target image is the image with the smallest field of view in an image set, and the image set includes at least one original image that can completely contain the bounding-box selection area of the target object; for each target object, crop the first sub-image of the target object from the first target image according to the target area information; stitch the first sub-images corresponding to the target objects to obtain the first output image; and display the first output image in the viewfinder.
In some embodiments, the apparatus further includes:
a third acquisition module, configured to acquire second voice information;
a third recognition module, configured to perform speech recognition on the second voice information to obtain a second speech recognition result;
a fifth determination module, configured to determine, according to the second speech recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
a sixth determination module, configured to determine a second target image from the original images;
a third cropping module, configured to crop a second sub-image of the target object from the second target image;
a fifth display module, configured to generate a second output image based on the second sub-image and display the second output image in the viewfinder.
In some embodiments, the apparatus may further include:
a third detection module, configured to determine, in a third target image, the distance between each edge of the bounding-box selection area of the target object and the corresponding image edge, where the third target image is the image with the largest field of view in the image set;
a third prompt module, configured to display prompt information in the viewfinder if the minimum of the distances is less than a preset distance threshold, where the prompt information prompts the user to adjust the camera pose.
The above photographing apparatus has the function of implementing the above photographing method. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, and a module may be software and/or hardware.
It should be noted that, since the information exchange and execution processes between the above apparatuses/modules are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section; they are not repeated here.
The terminal device provided by the embodiments of the present application may include a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method of any one of the above method embodiments.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
An embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps of the above method embodiments.
An embodiment of the present application further provides a chip system. The chip system includes a processor coupled to a memory, and the processor executes a computer program stored in the memory to implement the methods described in the above method embodiments. The chip system may be a single chip, or a chip module composed of multiple chips.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the relevant descriptions of other embodiments. It should be understood that the numbering of the steps in the above embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. In addition, in the description of this specification and the appended claims, the terms "first", "second", "third", and so on are used only to distinguish between descriptions and cannot be understood as indicating or implying relative importance. References in this specification to "one embodiment" or "some embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments", unless otherwise specifically emphasized.
Finally, it should be noted that the above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (39)

1. A method for determining a target object, characterized in that the method comprises:
    acquiring an image to be processed, wherein the image to be processed comprises at least one object;
    determining object information of each of the objects in the image to be processed, wherein the object information comprises area information and attribute information, and the area information is used to describe a position of the object in the image to be processed;
    extracting second index data of each piece of the attribute information from a pre-established data set, wherein the data set comprises attribute keywords and attribute indexes;
    for each of the objects, associating the second index data with the area information to obtain an association between the index data and the area information;
    acquiring first voice information;
    extracting first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set; and
    determining, based on the association and the second index data, target area information corresponding to the first index data, wherein an object corresponding to the target area information is the target object.
2. The method according to claim 1, characterized in that determining, based on the association and the second index data, the target area information corresponding to the first index data comprises:
    for each piece of the second index data, matching each index item in the second index data against each index item in the first index data, and determining a number of target index items, wherein a target index item is an index item in the second index data that matches an index item in the first index data; and
    determining, based on the association, the area information corresponding to the second index data with the largest number of target index items as the target area information.
3. The method according to claim 1, characterized in that extracting the second index data of each piece of the attribute information from the pre-established data set comprises:
    for each piece of the attribute information, looking up, in the data set, a first keyword matching the attribute information; and
    using index data of the first keyword in the data set as the second index data.
4. The method according to claim 1, characterized in that extracting the first index data of the first voice information from the data set based on the attribute keywords corresponding to the second index data in the data set comprises:
    performing speech recognition on the first voice information to obtain a first speech recognition result;
    determining, according to the first speech recognition result, that the first voice information is a voice command for specifying a target object;
    looking up, among the attribute keywords corresponding to the second index data, a second keyword matching the first speech recognition result; and
    using index data of the second keyword in the data set as the first index data.
5. The method according to claim 4, characterized in that performing speech recognition on the first voice information to obtain the first speech recognition result comprises:
    extracting voiceprint features of the first voice information;
    determining a similarity between the voiceprint features and pre-stored voiceprint features; and
    when the similarity is greater than or equal to a preset threshold, inputting the voiceprint features and the first voice information into a semantic understanding model, and obtaining the first speech recognition result output by the semantic understanding model.
6. The method according to claim 1, characterized in that determining the object information of each of the objects in the image to be processed comprises:
    inputting the image to be processed into a target recognition model; and
    obtaining the object information output by the target recognition model.
7. The method according to any one of claims 1 to 6, characterized in that the first voice information is a voice command for specifying a target object for the first time, a voice command for adding or removing target objects, or a voice command for replacing a target object.
8. The method according to any one of claims 1 to 7, characterized in that the image to be processed comprises N original images with different fields of view, N being a positive integer greater than or equal to 1; and
    after determining the target area information corresponding to the first index data, the method further comprises:
    obtaining, based on the original images and the target area information, a first output image of the target object, and displaying the first output image in a viewfinder.
9. The method according to claim 8, characterized in that obtaining, based on the original images and the target area information, the first output image of the target object comprises:
    determining a first target image from the N original images, wherein the first target image is an image with the smallest field of view in an image set, and the image set comprises at least one original image capable of completely containing a bounding-box selection area of the target object;
    cropping a first sub-image of the target object from the first target image according to the target area information; and
    generating the first output image based on the first sub-image.
10. The method according to claim 9, characterized in that, when there are at least two target objects and each target object corresponds to one first sub-image, generating the first output image based on the first sub-image comprises:
    stitching the first sub-images corresponding to the target objects to obtain the first output image.
  11. The method according to claim 9, wherein, after displaying the first output image in the viewfinder, the method further comprises:
    obtaining second voice information;
    performing voice recognition on the second voice information to obtain a second voice recognition result;
    determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
    determining a second target image from the original images;
    cropping a second sub-image of the target object out of the second target image; and
    generating a second output image based on the second sub-image, and displaying the second output image in the viewfinder.
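A hypothetical sketch of how such a display-mode command might be dispatched; the command vocabulary and the rule "zoom out means pick the widest field of view" are invented for illustration and not taken from the claims:

    DISPLAY_COMMANDS = {"zoom in": "narrow", "zoom out": "wide"}  # hypothetical vocabulary

    def handle_display_command(recognition_result, originals, target_box):
        # originals: list of (fov_degrees, image) pairs; target_box: (x, y, w, h).
        for phrase, mode in DISPLAY_COMMANDS.items():
            if phrase in recognition_result:
                # Choose a second target image with a different field of view,
                # then re-crop the target to build the second output image.
                pick = min if mode == "narrow" else max
                second_target = pick(originals, key=lambda c: c[0])[1]
                x, y, w, h = target_box
                return second_target[y:y + h, x:x + w]
        return None  # not a display-mode command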
  12. The method according to any one of claims 8 to 11, further comprising:
    determining, in a third target image, the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, the third target image being the image with the largest field of view in the image set; and
    if the minimum of the distances is less than or equal to a preset distance threshold, displaying prompt information in the viewfinder, the prompt information being used to prompt adjustment of the camera pose.
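The edge-distance test reduces to four subtractions; this sketch (with hypothetical names and a made-up 20-pixel threshold) shows the check in the widest-FOV image, where a small minimum gap means the target is about to leave even the widest frame:

    def needs_pose_prompt(box, image_size, threshold=20):
        # Gaps between the target's bounding box and the left, top, right, and
        # bottom edges of the third (largest-FOV) target image.
        x, y, w, h = box
        img_w, img_h = image_size
        distances = [x, y, img_w - (x + w), img_h - (y + h)]
        return min(distances) <= threshold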
  13. The method according to claim 1, wherein the attribute information comprises at least one of the following: type, gender, age group, activity information, hair length, hair color, clothing type, clothing color, and posture.
  14. A method for determining a target object, applied to a terminal device, the method comprising:
    displaying, in a viewfinder, a picture captured by a camera, the picture comprising at least one object;
    obtaining first voice information, the first voice information being used to specify a target object; and
    determining an object in the picture corresponding to the first voice information as the target object.
  15. The method according to claim 14, wherein determining the object in the picture corresponding to the first voice information as the target object comprises:
    extracting first index data of the first voice information from a pre-established data set, the data set comprising attribute keywords and attribute indexes; and
    determining, based on an association relationship of each of the objects, target area information corresponding to the first index data, the object corresponding to the target area information being the target object;
    wherein the association relationship is a mapping relationship between second index data of the attribute information of the object and area information, the area information being used to describe the position of the object in the original image captured by the camera.
  16. The method according to claim 15, wherein, before determining the target area information corresponding to the first index data based on the association relationship of each of the objects, the method further comprises:
    determining object information of each of the objects in the original image, the object information comprising the area information and the attribute information;
    extracting the second index data of the attribute information of each object from the data set; and
    associating, for each of the objects, the second index data with the area information to obtain the association relationship between the index data and the area information.
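Treating the pre-established data set as a keyword-to-index dictionary, the association can be sketched as below; the pair-based object representation and all names are assumptions for the example:

    def build_associations(objects, keyword_index):
        # objects: list of (region, attributes) pairs from the recognition model.
        # keyword_index: the data set, mapping attribute keyword -> attribute index.
        associations = []
        for region, attributes in objects:
            # The object's second index data: indexes of all attribute values
            # that have a matching keyword in the data set.
            second_index = {keyword_index[v] for v in attributes.values() if v in keyword_index}
            # Associate the second index data with the region information.
            associations.append((second_index, region))
        return associations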
  17. The method according to claim 16, wherein extracting the second index data of the attribute information of each object from the data set comprises:
    searching the data set, for each piece of the attribute information, for a first keyword matching the attribute information; and
    using the index data of the first keyword in the data set as the second index data.
  18. The method according to claim 16, wherein determining the object information of each of the objects in the original image comprises:
    inputting the original image into a target recognition model; and
    obtaining the object information output by the target recognition model.
  19. The method according to any one of claims 15 to 18, wherein determining the target area information corresponding to the first index data based on the association relationship of each of the objects comprises:
    matching, for each piece of the second index data, each index in the first index data against each index in the second index data, and determining the number of items of a target index, the target index being an index in the second index data that matches an index in the first index data; and
    determining, based on the association relationship, the area information corresponding to the second index data with the largest number of items as the target area information.
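With index data held as Python sets, the matching step is a set intersection per object; this sketch (hypothetical names, building on the associations structure shown earlier) returns the region of the best-matching object:

    def resolve_target_region(first_index, associations):
        # first_index: set of indexes extracted from the voice command.
        # associations: list of (second_index_set, region) pairs, one per object.
        best_region, best_count = None, 0
        for second_index, region in associations:
            matched = len(first_index & second_index)  # number of target-index items
            if matched > best_count:
                best_region, best_count = region, matched
        return best_region  # area information of the most-matched object, or None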
  20. The method according to claim 15, wherein extracting the first index data of the first voice information from the pre-established data set comprises:
    performing voice recognition on the first voice information to obtain a first voice recognition result;
    determining, according to the first voice recognition result, that the first voice information is a voice command for specifying a target object;
    searching, among the attribute keywords corresponding to the second index data, for a second keyword matching the first voice recognition result; and
    using the index data of the second keyword in the data set as the first index data.
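A rough sketch of that keyword lookup, using simple substring matching (a stand-in for whatever matching the implementation actually uses) and restricting the search to keywords that occur in some object's second index data:

    def extract_first_index_data(recognition_result, associations, keyword_index):
        # Indexes present in at least one object's second index data.
        known = set().union(*[idx for idx, _ in associations]) if associations else set()
        first_index = set()
        for keyword, index in keyword_index.items():
            # Only keywords that both appear in the recognized command text and
            # correspond to some detected object contribute to the first index data.
            if keyword in recognition_result and index in known:
                first_index.add(index)
        return first_index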
  21. The method according to claim 15, wherein performing voice recognition on the first voice information to obtain the first voice recognition result comprises:
    extracting a voiceprint feature of the first voice information;
    determining a similarity between the voiceprint feature and a pre-stored voiceprint feature; and
    when the similarity is greater than or equal to a preset threshold, inputting the voiceprint feature and the first voice information into a semantic understanding model, and obtaining the first voice recognition result output by the semantic understanding model.
  22. The method according to any one of claims 15 to 21, wherein the camera comprises N cameras with different fields of view, N being a positive integer greater than or equal to 1;
    after determining the target area information corresponding to the first index data based on the association relationship of each of the objects, the method further comprises:
    obtaining, based on the N original images with different fields of view, a first output image of the target object according to the target area information, and displaying the first output image in the viewfinder.
  23. The method according to claim 22, wherein obtaining, based on the N original images with different fields of view, the first output image of the target object according to the target area information comprises:
    determining a first target image from the N original images, the first target image being the image with the smallest field of view in an image set, and the image set comprising at least one original image that completely contains a bounding box selection area of the target object;
    cropping a first sub-image of the target object out of the first target image according to the target area information; and
    generating the first output image based on the first sub-image.
  24. The method according to claim 23, wherein, when there are at least two target objects and each target object corresponds to one first sub-image, obtaining the first output image based on the first sub-images comprises:
    stitching the first sub-images corresponding to each of the target objects to obtain the first output image.
  25. The method according to claim 22, wherein, after displaying the first output image in the viewfinder, the method further comprises:
    obtaining second voice information;
    performing voice recognition on the second voice information to obtain a second voice recognition result;
    determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
    determining a second target image from the original images;
    cropping a second sub-image of the target object out of the second target image; and
    generating a second output image based on the second sub-image, and displaying the second output image in the viewfinder.
  26. The method according to any one of claims 22 to 25, further comprising:
    determining, in a third target image, the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, the third target image being the image with the largest field of view in the image set; and
    if the minimum of the distances is less than a preset distance threshold, displaying prompt information in the viewfinder, the prompt information being used to prompt adjustment of the camera pose.
  27. A photographing method, applied to a terminal device, the method comprising:
    displaying, in a viewfinder, a picture captured by a camera, the picture comprising at least one object;
    receiving a first instruction, the first instruction being used to specify at least two target objects;
    determining at least two objects in the picture corresponding to the first instruction as the at least two target objects; and
    displaying the picture of each of the target objects in separate areas of the viewfinder, wherein one area displays the picture of one target object.
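The per-area display can be pictured as tiling one frame per target into a grid; this sketch assumes numpy frames and OpenCV, with the cell size and two-column layout chosen arbitrarily for the example:

    import numpy as np
    import cv2  # assumed available for resizing

    def grid_viewfinder(frames, cols=2, cell=(640, 360)):
        # frames: one image per target object. Resize each to the cell size,
        # pad with black cells so the last row is full, then tile into a grid
        # in which each area shows exactly one target object.
        w, h = cell
        cells = [cv2.resize(f, (w, h)) for f in frames]
        while len(cells) % cols:
            cells.append(np.zeros((h, w, 3), dtype=cells[0].dtype))
        rows = [np.hstack(cells[i:i + cols]) for i in range(0, len(cells), cols)]
        return np.vstack(rows)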
  28. The method according to claim 27, wherein the first instruction is a voice instruction.
  29. The method according to claim 28, wherein determining the at least two objects in the picture corresponding to the first instruction as the at least two target objects comprises:
    extracting first index data of the first instruction from a pre-established data set, the data set comprising attribute keywords and attribute indexes; and
    determining, based on an association relationship of each of the objects, target area information corresponding to the first index data, the object corresponding to the target area information being the target object;
    wherein the association relationship is a mapping relationship between second index data of the attribute information of the object and area information, the area information being used to describe the position of the object in the original image captured by the camera.
  30. The method according to claim 29, wherein, before determining the target area information corresponding to the first index data based on the association relationship of each of the objects, the method further comprises:
    determining object information of each of the objects in the original image, the object information comprising the area information and the attribute information;
    extracting the second index data of the attribute information of each object from the data set; and
    associating, for each of the objects, the second index data with the area information to obtain the association relationship between the index data and the area information.
  31. The method according to claim 30, wherein extracting the second index data of the attribute information of each object from the data set comprises:
    searching the data set, for each piece of the attribute information, for a first keyword matching the attribute information; and
    using the index data of the first keyword in the data set as the second index data.
  32. The method according to claim 30, wherein determining the object information of each of the objects in the original image comprises:
    inputting the original image into a target recognition model; and
    obtaining the object information output by the target recognition model.
  33. The method according to any one of claims 29 to 32, wherein determining the target area information corresponding to the first index data based on the association relationship of each of the objects comprises:
    matching, for each piece of the second index data, each index in the first index data against each index in the second index data, and determining the number of items of a target index, the target index being an index in the second index data that matches an index in the first index data; and
    determining, based on the association relationship, the area information corresponding to the second index data with the largest number of items as the target area information.
  34. The method according to claim 29, wherein extracting the first index data of the first instruction from the pre-established data set comprises:
    searching, among the attribute keywords corresponding to the second index data, for a second keyword matching the first instruction; and
    using the index data of the second keyword in the data set as the first index data.
  35. The method according to any one of claims 29 to 34, wherein the camera comprises N cameras with different fields of view, N being a positive integer greater than or equal to 1;
    displaying the picture of each of the target objects in separate areas of the viewfinder comprises:
    determining, for each of the target objects, a first target image from the N original images, the first target image being the image with the smallest field of view in an image set, and the image set comprising at least one original image that completely contains a bounding box selection area of the target object;
    cropping, for each of the target objects, a first sub-image of the target object out of the first target image according to the target area information;
    stitching the first sub-images corresponding to each of the target objects to obtain a first output image; and
    displaying the first output image in the viewfinder.
  36. The method according to any one of claims 29 to 35, wherein, after displaying the picture of each of the target objects in separate areas of the viewfinder, the method further comprises:
    obtaining second voice information;
    performing voice recognition on the second voice information to obtain a second voice recognition result;
    determining, according to the second voice recognition result, that the second voice information is a voice command for adjusting the display mode of the target object;
    determining a second target image from the original images;
    cropping a second sub-image of the target object out of the second target image; and
    generating a second output image based on the second sub-image, and displaying the second output image in the viewfinder.
  37. The method according to claim 35 or 36, further comprising:
    determining, in a third target image, the distance between each edge of the bounding box selection area of the target object and the corresponding image edge, the third target image being the image with the largest field of view in the image set; and
    if the minimum of the distances is less than a preset distance threshold, displaying prompt information in the viewfinder, the prompt information being used to prompt adjustment of the camera pose.
  38. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 13, 14 to 26, or 27 to 37.
  39. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 13, 14 to 26, or 27 to 37.
PCT/CN2022/083080 2021-03-29 2022-03-25 Method for determining target object, and photographing method and device WO2022206605A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110336707.XA CN115225756A (en) 2021-03-29 2021-03-29 Method for determining target object, shooting method and device
CN202110336707.X 2021-03-29

Publications (1)

Publication Number Publication Date
WO2022206605A1 true WO2022206605A1 (en) 2022-10-06

Family

ID=83458051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083080 WO2022206605A1 (en) 2021-03-29 2022-03-25 Method for determining target object, and photographing method and device

Country Status (2)

Country Link
CN (1) CN115225756A (en)
WO (1) WO2022206605A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361503A (en) * 2022-10-17 2022-11-18 天津大学四川创新研究院 Cross-camera scheduling method based on characteristic value topological network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000101901A (en) * 1998-09-25 2000-04-07 Canon Inc Image pickup device and method therefor, image pickup device control system and storage medium
CN105704389A (en) * 2016-04-12 2016-06-22 上海斐讯数据通信技术有限公司 Intelligent photo taking method and device
CN105812650A (en) * 2015-06-29 2016-07-27 维沃移动通信有限公司 Image obtaining method and electronic device
CN106385537A (en) * 2016-09-19 2017-02-08 深圳市金立通信设备有限公司 Photographing method and terminal
CN106803886A (en) * 2017-02-28 2017-06-06 深圳天珑无线科技有限公司 A kind of method and device taken pictures
US20170374273A1 (en) * 2016-06-22 2017-12-28 International Business Machines Corporation Controlling a camera using a voice command and image recognition
CN108093167A (en) * 2016-11-22 2018-05-29 谷歌有限责任公司 Use the operable camera of natural language instructions
CN111182204A (en) * 2019-11-26 2020-05-19 广东小天才科技有限公司 Shooting method based on wearable device and wearable device
CN111263072A (en) * 2020-02-26 2020-06-09 Oppo广东移动通信有限公司 Shooting control method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN115225756A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
WO2021121236A1 (en) Control method, electronic device, computer-readable storage medium, and chip
CN113810587B (en) Image processing method and device
WO2021147482A1 (en) Telephoto photographing method and electronic device
JP5872142B2 (en) UI providing method and display device using the same
WO2021104485A1 (en) Photographing method and electronic device
CN111726536A (en) Video generation method and device, storage medium and computer equipment
US20230421900A1 (en) Target User Focus Tracking Photographing Method, Electronic Device, and Storage Medium
US20230224574A1 (en) Photographing method and apparatus
WO2024007715A1 (en) Photographing method and related device
KR20150011742A (en) User terminal device and the control method thereof
CN115525188A (en) Shooting method and electronic equipment
CN115689963A (en) Image processing method and electronic equipment
WO2022143311A1 (en) Photographing method and apparatus for intelligent view-finding recommendation
WO2022206605A1 (en) Method for determining target object, and photographing method and device
WO2024179100A1 (en) Photographing method
WO2024179101A1 (en) Photographing method
WO2024169338A1 (en) Photographing method and electronic device
WO2023231595A1 (en) Filming method and electronic device
WO2022228010A1 (en) Method for generating cover, and electronic device
CN117097985B (en) Focusing method, electronic device and computer readable storage medium
CN116055861B (en) Video editing method and electronic equipment
WO2022194084A1 (en) Video playing method, terminal device, apparatus, system and storage medium
WO2024169363A1 (en) Focus tracking method, focusing method, photographing method, and electronic device
WO2023246666A1 (en) Search method and electronic device
CN117692762A (en) Shooting method and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778787

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22778787

Country of ref document: EP

Kind code of ref document: A1