WO2020221121A1 - Video query method, apparatus, device, and storage medium - Google Patents

Video query method, apparatus, device, and storage medium

Info

Publication number: WO2020221121A1
Application number: PCT/CN2020/086670 (CN2020086670W)
Authority: WO (WIPO PCT)
Prior art keywords: video, feature, moving object, media, target
Other languages: English (en), French (fr)
Inventors: 冯洋, 马林, 刘威, 罗杰波
Original assignee: 腾讯科技(深圳)有限公司
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2020221121A1
Priority to US 17/333,001, published as US11755644B2

Classifications

    • G06F16/73: Information retrieval of video data; Querying
    • G06F16/7328: Query formulation; Query by example, e.g. a complete video frame or video sequence
    • G06F16/783: Retrieval of video data characterised by using metadata automatically derived from the content
    • G06F16/7837: Retrieval using metadata automatically derived from objects detected or recognised in the video content
    • G06T7/20: Image analysis; Analysis of motion
    • G06V10/764: Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82: Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06V20/41: Scene-specific elements in video content; Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06T2207/10016: Image acquisition modality; Video; Image sequence
    • G06V2201/07: Target detection

Definitions

  • This application relates to the field of media processing, and in particular to a video query method, apparatus, device, and storage medium.
  • With media such as pictures, audio, and video, users can query for videos related to that media. For example, a user can take a query video that includes a target object and query which candidate videos also include that target object.
  • In the related art, the accuracy of such video query technology is not high: when a candidate video includes the target object but the target object is moving, the candidate video is often not found.
  • This application provides a video query method, apparatus, device, and storage medium that can, based on the media feature and the video feature, accurately determine whether the moving object in a candidate video is related to the target object in the query media, improving query accuracy.
  • According to one aspect, an embodiment of this application provides a video query method, the method including:
  • acquiring the media feature of the query media and the static image features corresponding to a candidate video, where the query media includes a target object and the candidate video includes a moving object;
  • determining the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video;
  • determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • According to another aspect, an embodiment of this application provides a video query apparatus, the apparatus including an acquiring unit, a first determining unit, and a second determining unit:
  • the acquiring unit is configured to acquire the media feature of the query media and the static image features corresponding to a candidate video, where the query media includes a target object and the candidate video includes a moving object;
  • the first determining unit is configured to determine the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video;
  • the second determining unit is configured to determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • According to another aspect, an embodiment of this application provides a device for video query, the device including a processor and a memory:
  • the memory is configured to store program code and transmit the program code to the processor;
  • the processor is configured to execute the video query method described in the above aspect according to instructions in the program code.
  • According to another aspect, an embodiment of this application provides a computer-readable storage medium that stores program code, the program code being used to execute the video query method described in the above aspect.
  • In the above solutions, the query media includes a target object and the candidate video includes a moving object; because the video feature of the candidate video is determined from the motion timing information of the moving object, it accurately describes the moving object, so whether the moving object is related to the target object can be determined accurately from the media feature and the video feature.
  • FIG. 1 is a schematic diagram of an application scenario of a video query method provided by an embodiment of this application;
  • FIG. 2 is a flowchart of a video query method provided by an embodiment of this application;
  • FIG. 3 is an example diagram of a conventional convolutional long short-term memory neural network;
  • FIG. 4 is an example diagram of a warp long short-term memory neural network provided by an embodiment of this application;
  • FIG. 5 is a structural diagram of the processing flow of a video query method provided by an embodiment of this application;
  • FIG. 6 is a structural diagram of a video query apparatus provided by an embodiment of this application;
  • FIG. 7 is a structural diagram of a device for video query provided by an embodiment of this application;
  • FIG. 8 is a structural diagram of a terminal device provided by an embodiment of this application;
  • FIG. 9 is a structural diagram of a server provided by an embodiment of this application.
  • In video query, when a candidate video contains a moving target object, the video features obtained by traditional video query technology cannot accurately reflect the actual content of the candidate video. That is, the moving object in the candidate video is difficult to capture accurately and reflect in the corresponding video features, so even when the query media includes the target object, the candidate video is not easily identified as related to the query media, and query accuracy is low.
  • The embodiments of this application therefore provide a video query method that bases the video feature of a candidate video on the motion timing information of the moving object in that video, effectively avoiding the recognition difficulty that a moving object otherwise causes in video query.
  • The video query method provided by the embodiments of this application can be applied to various video processing scenarios, for example, recognizing people in videos, tracking of objects or people by smart devices, classifying video programs, and so on.
  • The method can be applied to an electronic device with media processing functions. The electronic device can be a terminal device, such as a smart terminal, a computer, a personal digital assistant (PDA), or a tablet computer.
  • The electronic device can also be a server that provides media processing services to terminal devices: a terminal device uploads the query media and the candidate videos to the server, the server uses the video query method provided in the embodiments of this application to determine whether the moving object in a candidate video is related to the target object in the query media, and returns the result to the terminal device.
  • The server can be an independent server or a server in a cluster.
  • The following describes the video query method provided in the embodiments of this application by taking a terminal device as an example, in combination with an actual application scenario.
  • FIG. 1 shows an example application scenario of a video query method. The scenario includes a terminal device 101.
  • When a user performs a video query with query media, the terminal device 101 determines the media feature of the query media from the acquired query media. The terminal device 101 can also determine the corresponding static image features from the candidate videos corresponding to the query media.
  • In the embodiments of this application, media features, static image features, and the video features and sub-features mentioned later are all features of one kind: each reflects the content information carried by the identified object (for example, an image or a video).
  • For example, the static image feature of an image reflects the image information displayed in that image. The static image features of a video reflect the image information in the video, that is, they focus on the static information of each video frame; the video feature of that video reflects the video information, that is, it focuses on the dynamic information embodied in consecutive video frames.
  • A candidate video is any video within the range of videos queried with the query media.
  • The candidate video includes a moving object, referred to in this application simply as the "moving object". Optionally, the moving object moves actively or passively; for example, the moving object is a person, an animal, an object, and so on.
  • A candidate video may include one or more moving objects. For ease of description, the following mainly describes the processing flow for one moving object in the candidate video, but the embodiments of this application do not limit the number of moving objects in a candidate video.
  • The media format of the query media is not limited in the embodiments of this application: it may be an image, a video, or the like. Whatever the format, the query media includes a target object, which can be a person, an animal, an object, or any other possible object. In the query media, the target object may be moving or stationary.
  • The terminal device 101 determines the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video. Because the motion timing information accurately captures the information embodied by the object while it moves, the video feature determined this way accurately describes the moving object, effectively avoiding the recognition difficulty that a moving object otherwise causes in video query. The terminal device 101 can therefore accurately determine, from the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving query accuracy and the user's query experience.
  • Next, the video query method provided by the embodiments of this application is introduced in detail, taking a terminal device as the execution subject and with reference to the accompanying drawings. Referring to FIG. 2, the method includes:
  • S201: Acquire the media feature of the query media and the static image features corresponding to the candidate video.
  • Optionally, convolutional neural networks are used to acquire the media feature of the query media and the static image features corresponding to the candidate video. Because a video may include moving objects, a three-dimensional convolutional neural network may be used for video: the static image features corresponding to the candidate video may be obtained through a three-dimensional convolutional neural network.
  • The format of the query media includes at least one of image and video. If the query media is a video, the video feature corresponding to the query media can be obtained through the three-dimensional convolutional neural network; if the query media is an image, the static image feature corresponding to the query media can be obtained through another convolutional neural network. In this way the media feature corresponding to the query media is obtained.
  • A static image feature can be understood as a feature extracted from the static information of a single image, a single video frame, or a single video segment.
  • For example, a three-dimensional convolutional neural network (I3D) model pre-trained on the Kinetics dataset can be used directly to obtain the media feature of the query media and the static image features corresponding to the candidate video.
  • The Kinetics dataset, released by the DeepMind team, is a large-scale, high-quality dataset of approximately 650,000 video clips covering 700 human action classes, including human-object interactions such as playing musical instruments and human-to-human interactions such as shaking hands and hugging. Each action class has at least 600 video clips; each clip is annotated with a single action class, possibly manually, and lasts about 10 seconds.
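  • As a concrete illustration of this feature-extraction step, a minimal PyTorch sketch follows. The patent only names an I3D model pre-trained on Kinetics; the tiny 3D-CNN backbone, channel sizes, and segment length below are illustrative assumptions standing in for that network, not the patent's exact architecture.

```python
# Sketch of S201: per-segment static image features from a 3D CNN.
# Tiny3DBackbone is a stand-in for the I3D feature extractor named in the text.
import torch
import torch.nn as nn

class Tiny3DBackbone(nn.Module):
    def __init__(self, in_ch=3, feat_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 16, 16)),  # collapse time within a segment
        )

    def forward(self, clip):               # clip: (B, 3, T, H, W)
        return self.net(clip).squeeze(2)   # (B, C, 16, 16): static feature map x_t

video = torch.randn(1, 3, 64, 128, 128)             # a candidate video of 64 frames
segments = torch.chunk(video, chunks=8, dim=2)      # n = 8 video segments in time order
backbone = Tiny3DBackbone()
static_feats = [backbone(seg) for seg in segments]  # x_1 ... x_n, one per segment
```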
  • The media feature of the query media and the static image features corresponding to the candidate video can be acquired at the same time or at different times. For example, the static image features corresponding to the candidate video may be obtained in advance, and the media feature of the query media obtained after the terminal device acquires the query media; or, after the terminal device acquires the query media, the terminal device obtains the media feature of the query media and the static image features corresponding to the candidate video at the same time.
  • S202: Determine the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video.
  • The motion timing information of the moving object identifies how the movement trend of the moving object changes over time in the candidate video; for example, it includes the changes in the object's position under the movement trend between adjacent time nodes.
  • The motion timing information is obtained by learning the candidate video through a neural network, or through other methods, such as annotation.
  • Because the motion timing information is introduced when determining the video feature of the candidate video, the information embodied by the moving object can be captured accurately, so the video feature accurately describes the movement of the moving object in the candidate video; that is, the video feature clearly identifies the moving object and provides high-quality information related to it, which improves the accuracy of the subsequent judgment.
  • S203: Determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • According to the determination result, the terminal device places the candidate videos whose moving object is related to the target object into the query result and displays it to the user after the query is completed, helping the user find candidate videos related to the query video.
  • Optionally, the relevant interval information within each queried candidate video, i.e., the interval of interest to the user, is also provided. This saves the user from having to check, from beginning to end, which parts of a returned video are related to the query requirement (such as the target object), reducing the time the user spends viewing the query result.
  • The embodiment of this application also provides a way of marking candidate videos in which the appearance region of the target object is identified, effectively guiding the user. For example, the target region can be marked in the t-th video segment of the candidate video, intuitively pointing out to the user where the target object is in the current video display interface. When viewing a video in the query result, the user can play it directly from the playback progress indicated by the time interval information provided with the query result.
  • Optionally, the appearance region of the target object is given a distinctive marking effect, such as a bold colored frame, so that the user, guided by the marking, can quickly focus on the target object and achieve the purpose of the query.
  • As noted above, the query media includes a target object and the candidate video includes a moving object.
  • The moving object in the candidate video has a certain movement pattern: it is generally not fixed at one position in the video but may appear at different positions at different times according to that pattern. Therefore, when determining the video feature of the candidate video, the candidate video can be divided into segments and a corresponding sub-feature determined for each segment, so that the movement pattern of the moving object can be determined more accurately through the sub-features, improving the accuracy of subsequent video queries.
  • The terminal device divides the candidate video in time order into multiple video segments, for example n video segments, where n is an integer greater than 1. A video segment includes at least one video frame, and different segments may include different numbers of frames.
  • The t-th video segment and the (t-1)-th video segment are adjacent in time order, with the time interval of the t-th segment later than that of the (t-1)-th segment; t is a positive integer greater than 1 and not greater than n.
  • The sub-feature corresponding to the t-th video segment carries features reflecting the information in the candidate video from the first video segment up to the t-th video segment; the sub-feature corresponding to the last segment is thus equivalent to the video feature of the candidate video.
  • Between the (t-1)-th and t-th segments, the moving object may, for example, have moved from position a to position b in the video frame. With the traditional method, which does not consider the movement trend, the object's information in the sub-feature corresponding to the t-th segment is not concentrated but may be scattered between positions a and b on the feature plane. In other words, sub-features determined the traditional way do not clearly reflect the relevant information of the moving object, so even when the moving object is actually related to the target object, it is difficult to reach an accurate result during video query. The embodiments of this application therefore provide a method of determining the sub-features of video segments according to the motion timing information, which strengthens the relevant information of moving objects in the sub-features and improves query accuracy.
  • The candidate video may include multiple video segments, each corresponding to one sub-feature; the t-th video segment corresponds to the t-th sub-feature, and each sub-feature is determined based on the motion timing information. Since the determination method is similar for every segment, the t-th sub-feature is taken as an example below.
  • The method includes: determining the target motion trend, in the t-th video segment, of the moving object in the (t-1)-th sub-feature. The motion timing information of the moving object in the candidate video reflects its motion trend between adjacent video segments, and the t-th and (t-1)-th segments are adjacent in time order, so the target motion trend can be determined from the motion timing information.
  • The region corresponding to the moving object in the (t-1)-th sub-feature is then adjusted according to the target motion trend, and the t-th sub-feature is determined from the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th video segment.
  • Because the information of the (t-1)-th video segment is superimposed onto the information of the t-th video segment, and the (t-1)-th segment in turn carries the information of the segments before it, and so on, the determined t-th sub-feature carries features reflecting the information in the candidate video from the first segment to the t-th segment, thereby strengthening the relevant information of the moving object in the sub-feature and helping improve query accuracy.
  • In the traditional way, a convolutional long short-term memory neural network (ConvLSTM) is used directly to determine the t-th sub-feature. Its principle is shown in FIG. 3, where x_t represents the static image feature corresponding to the t-th video segment, h_{t-1} represents the (t-1)-th sub-feature, and h_t represents the t-th sub-feature. The traditional ConvLSTM determines the t-th sub-feature h_t directly from the static image feature x_t and the (t-1)-th sub-feature h_{t-1}.
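  • For reference, a minimal sketch of the traditional ConvLSTM cell of FIG. 3 follows; the kernel and channel sizes are illustrative assumptions. Note how h_t is computed directly from x_t and the unadjusted h_{t-1}.

```python
# Standard ConvLSTM cell: the t-th sub-feature h_t comes straight from
# x_t and h_{t-1}, with no adjustment for the moving object's position.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                  # "new input"
        c_t = f * c_prev + i * g           # cell state
        h_t = o * torch.tanh(c_t)          # t-th sub-feature
        return h_t, c_t
```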
  • In the embodiment of this application, a warp long short-term memory neural network (WarpLSTM) is used instead; it is a modification of the traditional ConvLSTM. Its principle is shown in FIG. 4, where x_t represents the static image feature corresponding to the t-th video segment, h_{t-1} represents the (t-1)-th sub-feature, h_t represents the t-th sub-feature, and h'_{t-1} represents the (t-1)-th sub-feature after adjustment according to the target motion trend. The WarpLSTM determines the t-th sub-feature h_t from the static image feature x_t and the adjusted (t-1)-th sub-feature h'_{t-1}.
  • To perform the adjustment, control points can be defined on the feature plane, evenly distributed over it; for example, 9 control points located at the intersections of three horizontal lines and three vertical lines. In the interpolation that relates the feature plane to the control points, S(x, y) represents the feature plane, w_i, v_1, v_2, and v_3 are interpolation parameters, (x_i, y_i) are the coordinates of the i-th control point, n is the number of control points, and i is a positive integer not greater than n.
  • After defining the control points, a convolutional layer of the WarpLSTM predicts an offset value (dx_i, dy_i) for each control point; the i-th control point (x_i, y_i) then moves to (x_i + dx_i, y_i + dy_i). As a control point moves, the area near it also moves, similar to the weighted movement in skeletal animation. Moving a target region is therefore equivalent to moving the control points within that region.
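  • The following sketch shows one way the control-point adjustment could be realized. Densifying the nine sparse control-point offsets into a per-pixel flow with bilinear upsampling, and resampling with grid_sample, are assumptions standing in for the patent's interpolation with parameters w_i, v_1, v_2, v_3, whose exact form the text does not give.

```python
# Warp a feature plane by offsetting a 3x3 grid of control points.
import torch
import torch.nn.functional as F

def warp(h_prev, ctrl_offsets):
    """h_prev: (B, C, H, W) feature plane; ctrl_offsets: (B, 2, 3, 3)
    offsets (dx_i, dy_i) at the 9 control points, in [-1, 1] grid units."""
    B, C, H, W = h_prev.shape
    # Spread the sparse control-point offsets over every pixel (assumption:
    # bilinear interpolation, so the area near a control point moves with it).
    flow = F.interpolate(ctrl_offsets, size=(H, W),
                         mode="bilinear", align_corners=True)
    # Base sampling grid in the [-1, 1] coordinates used by grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
    grid = base + flow.permute(0, 2, 3, 1)   # shift the sampling positions
    return F.grid_sample(h_prev, grid, mode="bilinear", align_corners=True)
```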
  • With the motion timing information, the WarpLSTM update can be written as:

    d_{t-1} = w_{xd} * x_t + w_{hd} * h_{t-1} + b_d
    h'_{t-1} = warp(h_{t-1}, d_{t-1})
    c'_{t-1} = warp(c_{t-1}, d_{t-1})
    i_t = σ(w_{xi} * x_t + w_{hi} * h'_{t-1} + b_i)
    g_t = tanh(w_{xg} * x_t + w_{hg} * h'_{t-1} + b_g)
    f_t = σ(w_{xf} * x_t + w_{hf} * h'_{t-1} + b_f)
    o_t = σ(w_{xo} * x_t + w_{ho} * h'_{t-1} + b_o)
    c_t = f_t ⊙ c'_{t-1} + i_t ⊙ g_t
    h_t = o_t ⊙ tanh(c_t)

  • Here d_{t-1} represents the offset values (dx_i, dy_i) for the control points in the target region; x_t is the static image feature corresponding to the t-th video segment, which serves as the input of the WarpLSTM; h_{t-1} is the (t-1)-th sub-feature, and h'_{t-1} is the (t-1)-th sub-feature obtained after moving the control points according to the offset values d_{t-1} (c_{t-1} and c'_{t-1} are the corresponding cell states); warp() is the offset (warping) function; σ() is the sigmoid activation function; i_t, g_t, f_t, and o_t are the input gate, new input, forget gate, and output gate of the WarpLSTM; w_{xd}, w_{hd}, w_{xi}, and the other w terms are the convolution kernels applied to the image feature and the (adjusted) sub-feature, b are the biases, * denotes convolution, and ⊙ denotes element-wise multiplication.
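  • Putting the pieces together, a WarpLSTM cell following the equations above might look as follows. It predicts the control-point offsets d_{t-1} from x_t and h_{t-1}, warps h_{t-1} (and, as an assumption consistent with c'_{t-1} above, the cell state c_{t-1}), then runs the ordinary ConvLSTM update on x_t and h'_{t-1}. It reuses ConvLSTMCell and warp() from the sketches above.

```python
# WarpLSTM cell: offset prediction + warp + standard ConvLSTM update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # Plays the role of w_xd, w_hd, b_d: one (dx, dy) field from x_t and h_{t-1}.
        self.offset = nn.Conv2d(in_ch + hid_ch, 2, k, padding=k // 2)
        self.cell = ConvLSTMCell(in_ch, hid_ch, k)

    def forward(self, x_t, h_prev, c_prev):
        dense = self.offset(torch.cat([x_t, h_prev], dim=1))
        d = F.adaptive_avg_pool2d(dense, (3, 3))  # offsets at the 9 control points
        h_shift = warp(h_prev, d)                 # h'_{t-1}
        c_shift = warp(c_prev, d)                 # c'_{t-1} (assumption)
        return self.cell(x_t, h_shift, c_shift)   # h_t, c_t
```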
  • A sub-feature contains not only information reflecting the moving object but also other information, such as the video background, that is clearly unrelated to it. To reduce the amount of computation when deciding relevance to the target object, the information of such regions can be removed in advance and only the regions that may be related to the target object retained, improving the efficiency of video query.
  • One possible implementation of S203 is therefore to determine, on the feature plane of the t-th sub-feature, the target regions that have an association relationship with the target object, removing information clearly irrelevant to the target object.
  • The target regions and their region features can be determined in the following manner. First, boxes are placed evenly so that they cover all positions in the video segment, with a certain overlap between boxes. Then a Region Proposal Network (RPN) determines whether the region corresponding to each box has an association relationship with the target object, thereby determining the target regions:

    p_k = RPN(h_t)

  • where p_k is the k-th target region, h_t is the t-th sub-feature, and RPN() is the region proposal function.
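  • A minimal sketch of this proposal step is given below. The anchor layout, the head architecture, and the top-k selection are illustrative assumptions; the only point carried over from the text is that uniformly placed, overlapping boxes are scored for a possible association with the target object.

```python
# Score uniformly placed boxes over the sub-feature h_t and keep the best
# ones as target regions p_k.
import torch
import torch.nn as nn

class TinyRPN(nn.Module):
    def __init__(self, feat_ch, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(feat_ch, 256, 3, padding=1)
        self.score = nn.Conv2d(256, num_anchors, 1)  # objectness per anchor box

    def forward(self, h_t, anchors, top_k=16):
        """h_t: (1, C, H, W); anchors: (num_anchors*H*W, 4) boxes whose
        ordering must match the flattened score layout."""
        s = self.score(torch.relu(self.conv(h_t))).flatten()
        keep = s.topk(top_k).indices
        return anchors[keep]                          # proposals p_k
```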
  • Next, the region feature corresponding to each target region is determined from the t-th sub-feature, so that whether the moving object in the t-th video segment is related to the target object can be decided from the region feature and the media feature. The formula for determining the region feature corresponding to a target region is:

    r_k = ROI(h_t, p_k)

  • where r_k is the region feature of the k-th target region, p_k is the k-th target region, h_t is the t-th sub-feature, and ROI() is the region pooling function.
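  • In a PyTorch implementation, torchvision's roi_align can serve as a concrete stand-in for the region pooling function ROI(); the feature and box sizes below are illustrative.

```python
# r_k = ROI(h_t, p_k): pool a fixed-size region feature from the sub-feature.
import torch
from torchvision.ops import roi_align

h_t = torch.randn(1, 256, 16, 16)             # t-th sub-feature
p_k = torch.tensor([[0., 2., 2., 10., 10.]])  # (batch_idx, x1, y1, x2, y2)
r_k = roi_align(h_t, p_k, output_size=(7, 7), spatial_scale=1.0)  # (1, 256, 7, 7)
```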
  • When the query media is a video, the target object may not appear in every frame of the query video, or may be blurry or incomplete in some frames. If the video feature of the whole query video were used directly as the query basis, the extra information it carries could increase the amount of computation or reduce query accuracy; for example, candidate videos whose background is similar to that of the query video but which contain no target object might be matched.
  • The embodiment of this application therefore adopts attention weighting to determine query-video features that better reflect the relevant information of the target object. Specifically, an attention model determines the weight between each video frame's content in the query video and the target object; normally, the more complete or clearer the target object included in a frame, the greater the weight obtained for that frame.
  • Optionally, the weight is determined according to the correlation between the query video and the target region in the candidate video, where α_{k,j} is the weight between the content of the j-th video frame of the query video and the k-th target region, and the frame features weighted by α_{k,j} are combined into the query-video feature.
  • For example, suppose the query video includes 20 video frames, where the fifth frame includes a complete and clear target object while in the remaining frames the target object does not appear or is blurry or incomplete. The weights determined by the attention model might then be 0.9 for the fifth frame and 0.1 for each remaining frame. The query-video feature obtained in this way mainly reflects the relevant information of the target object in the fifth frame, reducing the influence of information other than the target object and strengthening the target object's relevant information in the query-video feature, which facilitates a more accurate determination of whether the moving object in a video segment is related to the target object.
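  • A sketch of this attention weighting follows. Dot-product correlation followed by a softmax is an assumption; the patent text names the correlation between the query video's frames and the target region but does not reproduce the exact formula.

```python
# Pool the query video's frame features, weighting each frame j by its
# correlation alpha_{k,j} with the region feature of target region p_k.
import torch
import torch.nn.functional as F

def attend_query(query_frame_feats, region_feat):
    """query_frame_feats: (J, D), one vector per query-video frame;
    region_feat: (D,), pooled feature of one target region."""
    corr = query_frame_feats @ region_feat        # correlation per frame
    alpha = F.softmax(corr, dim=0)                # weights alpha_{k,j}
    return (alpha.unsqueeze(1) * query_frame_feats).sum(0)  # weighted query feature

q = torch.randn(20, 256)   # 20 frames; e.g. only frame 5 shows the target clearly
r_vec = torch.randn(256)
q_weighted = attend_query(q, r_vec)   # dominated by the clear frames
```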
  • After the query-video feature and the region feature of the target region are obtained, the two features can be concatenated and then passed through two convolutional layers and two fully connected layers, which respectively output whether the target region is related to the query video and the refined region coordinates of a related target region:

    l = softmax(FC(Conv(f)))
    bb = FC(Conv(f))

  • where f is the concatenation of the two features, l is the relevant-or-irrelevant category obtained by classification, bb is the refined region coordinates, Conv represents the convolutional layers, FC represents the fully connected layers, and softmax is the Softmax activation function.
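  • A sketch of this refinement head follows, matching the two-convolution, two-fully-connected structure described above; the channel sizes and the pooling between the stages are illustrative assumptions.

```python
# Classify a (query feature, region feature) pair as relevant/irrelevant and
# regress refined box coordinates.
import torch
import torch.nn as nn

class RefineHead(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        self.convs = nn.Sequential(                 # the two convolutional layers
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_cls = nn.Sequential(                # classification branch -> l
            nn.Linear(256, 128), nn.ReLU(inplace=True), nn.Linear(128, 2))
        self.fc_box = nn.Sequential(                # box-regression branch -> bb
            nn.Linear(256, 128), nn.ReLU(inplace=True), nn.Linear(128, 4))

    def forward(self, f):        # f: concatenated query + region feature map
        z = self.convs(f)
        l = torch.softmax(self.fc_cls(z), dim=-1)   # relevant / irrelevant
        bb = self.fc_box(z)                         # refined (x1, y1, x2, y2)
        return l, bb
```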
  • The video query method provided in the embodiments of this application is introduced below in combination with an actual application scenario. Suppose the query media is a query video (for example, a posted short video), and the user wants to use it to determine whether a copy of the short video exists among the candidate videos. The video query can be performed through the method provided in the embodiments of this application, which mainly includes five parts: video feature extraction, long-video warp accumulation, region proposal, attention weighting, and region refinement. The processing flow of the video query method is shown in FIG. 5.
  • Part 1, video feature extraction: the candidate video is divided into segments, and static image features are extracted for each segment, so that the segments respectively correspond to their static image features.
  • Part 2, long-video warp accumulation: a warp long short-term memory neural network (WarpLSTM) collects the motion timing information in the candidate video, and the sub-feature of each video segment of the candidate video is warp-adjusted according to the motion timing information, so that when the t-th sub-feature is determined, it can be determined from the adjusted (t-1)-th sub-feature and the static image feature extracted for the t-th video segment.
  • Part 3, region proposal: the region proposal network selects the target regions in the candidate video that may have an association relationship with the target object in the query video, and the region features corresponding to the target regions are determined by region pooling.
  • Part 4, attention weighting: attention weighting is performed on the query video to obtain the weighted query-video feature.
  • Part 5, region refinement: the weighted query-video feature and the region features can be concatenated and then passed through the convolutional layers and two fully connected layers, outputting whether each target region is related to the query video and the refined region coordinates of related target regions. Outputting whether a target region is related to the query video can be called classification; outputting the refined region coordinates of related target regions can be called box regression, and its result can be reflected by marking the target region, for example with a striking color.
  • This application provides an end-to-end video spatio-temporal re-localization method: given a query video, it can find, in other, longer candidate videos, spatio-temporal video clips that contain the same semantic content. "Spatio-temporal video clip" means: in the time dimension, locating a video segment within the longer candidate video; in the spatial dimension, accurately regressing the target region, i.e. the region related to the query video, within the located clip.
  • With this, the part of a long candidate video that interests the user can be located quickly, without the user having to browse from beginning to end, thereby saving time. The technology can also be used for video copy detection, to detect the spread of copyright-infringing videos.
  • Based on the video query method provided by the foregoing embodiments, an embodiment of this application also provides a video query apparatus. Referring to FIG. 6, the apparatus includes an acquiring unit 601, a first determining unit 602, and a second determining unit 603:
  • the acquiring unit 601 is configured to acquire the media feature of the query media and the static image features corresponding to the candidate video, where the query media includes a target object and the candidate video includes a moving object;
  • the first determining unit 602 is configured to determine the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video;
  • the second determining unit 603 is configured to determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • Optionally, the video feature includes sub-features corresponding to the n video segments of the candidate video; the t-th video segment corresponds to the t-th sub-feature, the motion timing information reflects the motion trend of the moving object between adjacent video segments, t is a positive integer greater than 1 and not greater than n, and n is an integer greater than 1.
  • The first determining unit 602 is specifically configured to: determine the target motion trend, in the t-th video segment, of the moving object in the (t-1)-th sub-feature; adjust the region corresponding to the moving object in the (t-1)-th sub-feature according to the target motion trend; and determine the t-th sub-feature according to the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th video segment.
  • Optionally, the first determining unit 602 is further configured to: predict offset values for the control points defined on the feature plane, move the control points according to the offset values, and adjust the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature based on the moved control points.
  • Optionally, the second determining unit 603 is specifically configured to: determine the target region on the feature plane of the t-th sub-feature that has an association relationship with the target object, determine the corresponding region feature, and determine, according to the region feature and the media feature, whether the moving object in the t-th video segment is related to the target object.
  • Optionally, the second determining unit 603 is further configured to perform the attention weighting on the query video described in the method embodiments.
  • Optionally, the apparatus further includes an identifying unit 604, configured to mark the target region in the t-th video segment of the candidate video.
  • With this apparatus, the query media includes a target object and the candidate video includes a moving object; because the motion timing information accurately captures the information embodied by the moving object, the determined video feature accurately describes it, avoiding the recognition difficulty a moving object otherwise causes. Whether the moving object in the candidate video is related to the target object in the query media can therefore be determined accurately from the media feature and the video feature, improving the user's query experience.
  • An embodiment of this application also provides a device for video query, introduced below with reference to the accompanying drawings. Referring to FIG. 7, an embodiment of this application provides a device 700 for video query. The device 700 may be a terminal device, which may be a smart terminal such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, or an in-vehicle computer. The terminal device being a mobile phone is taken as an example:
  • FIG. 8 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided in an embodiment of this application. Referring to FIG. 8, the mobile phone includes: a radio frequency (RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (WiFi) module 770, a processor 780, and a power supply 790. Those skilled in the art can understand that the structure shown in FIG. 8 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine some components, or arrange the components differently.
  • The RF circuit 710 can be used to receive and send signals during information transmission or calls; in particular, after receiving downlink information from a base station, it delivers the information to the processor 780 for processing, and it sends designed uplink data to the base station. Generally, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. The RF circuit 710 can also communicate with networks and other devices through wireless communication, which can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
  • The memory 720 can be used to store software programs and modules; the processor 780 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through the use of the mobile phone (such as audio data or a phone book). In addition, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The input unit 730 can be used to receive input digital or character information and to generate key signal input related to user settings and function control of the mobile phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 731) and drive the corresponding connection apparatus according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch position and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information, converts it into contact coordinates, sends them to the processor 780, and can receive and execute commands sent by the processor 780. The touch panel 731 can be implemented as a resistive, capacitive, infrared, or surface-acoustic-wave panel, among other types. Besides the touch panel 731, the input unit 730 may also include other input devices 732, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.
  • The display unit 740 can be used to display information input by or provided to the user, as well as the various menus of the mobile phone. The display unit 740 may include a display panel 741, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 731 can cover the display panel 741: when the touch panel 731 detects a touch operation on or near it, it transmits the operation to the processor 780 to determine the type of touch event, and the processor 780 then provides corresponding visual output on the display panel 741 according to the type of touch event. Although in FIG. 8 the touch panel 731 and the display panel 741 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 731 and the display panel 741 may be integrated to implement those functions.
  • The mobile phone may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which can adjust the brightness of the display panel 741 according to the ambient light, and a proximity sensor, which can turn off the display panel 741 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (usually three axes) and, when stationary, the magnitude and direction of gravity; it can be used in applications that recognize the phone's posture (such as landscape/portrait switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that can be configured in the mobile phone, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described here again.
  • The audio circuit 760, a speaker 761, and a microphone 762 can provide an audio interface between the user and the mobile phone. The audio circuit 760 can transmit the electrical signal converted from received audio data to the speaker 761, which converts it into a sound signal for output; conversely, the microphone 762 converts a collected sound signal into an electrical signal, which the audio circuit 760 receives and converts into audio data for further processing.
  • WiFi is a short-distance wireless transmission technology. Through the WiFi module 770, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and so on, providing wireless broadband Internet access. Although FIG. 8 shows the WiFi module 770, it is understandable that it is not a necessary component of the mobile phone and can be omitted as needed without changing the essence of the invention.
  • The processor 780 is the control center of the mobile phone: it connects the various parts of the entire phone through various interfaces and lines, and performs the phone's various functions and processes data by running or executing the software programs and/or modules stored in the memory 720 and calling the data stored in the memory 720, thereby monitoring the phone as a whole. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which mainly handles the operating system, user interface, applications, and so on, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 780.
  • The mobile phone also includes a power supply 790 (such as a battery) for supplying power to the components. Preferably, the power supply can be logically connected to the processor 780 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
  • Although not shown, the mobile phone may also include a camera, a Bluetooth module, and so on, which are not described here again.
  • In the embodiment of this application, the processor 780 included in the terminal device also has the following functions:
  • acquiring the media feature of the query media and the static image features corresponding to the candidate video, where the query media includes a target object and the candidate video includes a moving object;
  • determining the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video;
  • determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • An embodiment of this application also provides a server. FIG. 9 is a structural diagram of a server 800 provided by an embodiment of this application. The server 800 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 822 (for example, one or more processors), a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage media 830 may provide short-term storage or persistent storage. A program stored in a storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
  • The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
  • The steps performed by the server in the foregoing embodiments may be based on the server structure shown in FIG. 9, where the CPU 822 is configured to perform at least the following steps:
  • acquiring the media feature of the query media and the static image features corresponding to the candidate video, where the query media includes a target object and the candidate video includes a moving object;
  • determining the video feature of the candidate video according to the static image features and the motion timing information of the moving object in the candidate video;
  • determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  • The CPU 822 is further configured to execute the video query method described in the foregoing method embodiments.
  • An embodiment of this application provides a computer-readable storage medium that stores program code, the program code being used to execute the video query method described in the foregoing embodiments.
  • In this application, "at least one (item)" refers to one or more, and "multiple" refers to two or more. "And/or" describes an association relationship between associated objects and indicates three possible relationships; for example, "A and/or B" can mean: only A, only B, or both A and B, where A and B can be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects.
  • "At least one of the following items" or similar expressions refer to any combination of those items, including a single item or any combination of multiple items. For example, "at least one of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are only illustrative: the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this application disclose a video query method, apparatus, device, and storage medium. When a user needs to query for a video, the method obtains a media feature of the query media and a static image feature corresponding to a candidate video. The query media includes a target object, and the candidate video includes a moving object. A video feature of the candidate video is determined according to the static image feature and temporal motion information of the moving object in the candidate video. Because the temporal motion information accurately captures the information conveyed by the moving object while it moves, the video feature determined in this way can accurately describe the moving object, effectively avoiding the adverse effect that a moving object would otherwise have on video query. It can therefore be accurately determined, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving query accuracy.

Description

Video query method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 201910355782.3, entitled "Video query method and apparatus", filed on April 29, 2019, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of media processing, and in particular, to a video query method, apparatus, device, and storage medium.
Background
With the development of search technology, a user can use a piece of media to query for videos related to that media; the media includes pictures, audio, video, and the like. For example, with a query video that includes a target object, the user can find out which candidate videos include that target object.
In the related art, content-based video retrieval is used to provide this service.
However, the accuracy of this retrieval technique is limited. When a video query is performed with media that includes a target object, a candidate video that contains the target object is often missed if the target object is in motion.
Summary
This application provides a video query method, apparatus, device, and storage medium, which can accurately determine, according to a media feature and a video feature, whether a moving object in a candidate video is related to a target object in query media, improving query accuracy.
The embodiments of this application disclose the following technical solutions:
According to one aspect of this application, an embodiment of this application provides a video query method, the method including:
obtaining a media feature of query media and a static image feature corresponding to a candidate video, the query media including a target object and the candidate video including a moving object;
determining a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
According to another aspect of this application, an embodiment of this application provides a video query apparatus, the apparatus including an obtaining unit, a first determining unit, and a second determining unit:
the obtaining unit being configured to obtain a media feature of query media and a static image feature corresponding to a candidate video, the query media including a target object and the candidate video including a moving object;
the first determining unit being configured to determine a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
the second determining unit being configured to determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
According to another aspect of this application, an embodiment of this application provides a device for video query, the device including a processor and a memory:
the memory being configured to store program code and transmit the program code to the processor; and
the processor being configured to execute the video query method described in the foregoing aspect according to instructions in the program code.
According to another aspect of this application, an embodiment of this application provides a computer-readable storage medium configured to store program code, the program code being used to execute the video query method described in the foregoing aspect.
It can be seen from the foregoing technical solutions that a media feature of query media and a static image feature corresponding to a candidate video are obtained. The query media includes a target object, and the candidate video includes a moving object. A video feature of the candidate video is determined according to the static image feature and the temporal motion information of the moving object in the candidate video. Because the temporal motion information accurately captures the information conveyed by the moving object while it moves, the video feature determined in this way can accurately describe the moving object, effectively avoiding the adverse effect that a moving object would otherwise have on video query. It can therefore be accurately determined, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving the user's query experience.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the existing technology more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the existing technology. Apparently, the accompanying drawings in the following description show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of a video query method according to an embodiment of this application;
FIG. 2 is a flowchart of a video query method according to an embodiment of this application;
FIG. 3 is a diagram of a conventional convolutional long short-term memory (ConvLSTM) network;
FIG. 4 is a diagram of a warp long short-term memory (WarpLSTM) network according to an embodiment of this application;
FIG. 5 is a structural diagram of the processing flow of a video query method according to an embodiment of this application;
FIG. 6 is a structural diagram of a video query apparatus according to an embodiment of this application;
FIG. 7 is a structural diagram of a video query apparatus according to an embodiment of this application;
FIG. 8 is a structural diagram of a terminal device according to an embodiment of this application;
FIG. 9 is a structural diagram of a server according to an embodiment of this application.
Detailed Description
The following describes the embodiments of this application with reference to the accompanying drawings.
In video query, when a candidate video contains a moving target object, the video feature of the candidate video obtained with conventional video query techniques cannot accurately reflect the actual content of the candidate video. That is, a moving object in the candidate video is difficult to capture accurately and reflect in the corresponding video feature, so that even when the query media includes the target object, it is hard to recognize that the candidate video is related to the query media, and query accuracy is low.
An embodiment of this application provides a video query method that relies on the temporal motion information of the moving object in the candidate video when determining the video feature of the candidate video, effectively avoiding the recognition difficulty that a moving object would otherwise cause in video query.
The video query method provided in the embodiments of this application can be applied to various video processing scenarios, for example, recognizing persons in videos, object tracking by smart devices, person tracking by smart devices, classification of video programs, and the like.
The video query method provided in the embodiments of this application can be applied to an electronic device with a media processing function. The electronic device may be a terminal device, such as a smart terminal, a computer, a personal digital assistant (PDA), or a tablet computer.
The electronic device may also be a server that provides a media processing service to terminal devices. A terminal device may upload the query media and candidate videos to the server; the server uses the video query method provided in the embodiments of this application to determine whether a moving object in a candidate video is related to the target object in the query media, and returns the result to the terminal device. The server may be an independent server or a server in a cluster.
To facilitate understanding of the technical solution of this application, the following introduces the video query method provided in the embodiments of this application with reference to an actual application scenario, taking a terminal device as an example.
Referring to FIG. 1, FIG. 1 shows an example application scenario of a video query method. The scenario includes a terminal device 101. When a user performs a video query with query media, the terminal device 101 can determine a media feature of the query media according to the obtained query media. Moreover, the terminal device 101 can also determine the corresponding static image feature according to the candidate video corresponding to the query media. In the embodiments of this application, the media feature, the static image feature, and the video features and sub-features mentioned later all belong to one type of feature that reflects the information carried in the identified object (for example, an image or a video). For example, the static image feature of an image can reflect the image information presented in that image; the static image feature of a video can reflect the image information in the video, focusing more on the static information of each individual video frame, whereas the video feature of the video reflects the video information, focusing more on the dynamic information conveyed by consecutive video frames.
In the embodiments of this application, a candidate video is any video within the range of videos to be searched with the query media. The candidate video includes an object in motion, referred to in this application simply as the "moving object". Optionally, the moving object moves actively or is moved passively; for example, the moving object is a person, an animal, or a thing. There may be one or more moving objects in a candidate video. For ease of description, the subsequent embodiments mainly describe the processing flow for one moving object in the candidate video, but the embodiments of this application do not limit the number of moving objects in the candidate video.
The media format of the query media is not limited in the embodiments of this application; it may be an image, a video, or the like. Whatever its format, the query media includes a target object, which may be any possible object such as a person, an animal, or a thing. In the query media, the target object may be moving or stationary.
The terminal device 101 determines the video feature of the candidate video according to the static image feature and the temporal motion information of the moving object in the candidate video. Because the temporal motion information accurately captures the information conveyed by the moving object while it moves, the video feature determined in this way can accurately describe the moving object, effectively avoiding the recognition difficulty that a moving object would otherwise cause in video query. The terminal device 101 can therefore accurately determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving query accuracy and the user's query experience.
Next, the video query method provided in the embodiments of this application is described in detail with reference to the accompanying drawings, taking a terminal device as the execution body. Referring to FIG. 2, the method includes:
S201: Obtain a media feature of the query media and a static image feature corresponding to a candidate video.
In this embodiment, a convolutional neural network is used to obtain the media feature of the query media and the static image feature corresponding to the candidate video.
However, a video may include a moving object, and the motion information in the video should be captured as well as possible during feature extraction. Therefore, in this embodiment, a three-dimensional convolutional neural network may be used to obtain the media feature of the query media and the static image feature corresponding to the candidate video.
Optionally, when S201 is performed, the static image feature corresponding to the candidate video may be obtained through a three-dimensional convolutional neural network. The format of the query media includes at least one of an image and a video. If the query media is a video, a video feature corresponding to the query media may be obtained through the three-dimensional convolutional neural network; if the query media is an image, a static image feature corresponding to the query media may be obtained through the three-dimensional convolutional neural network, or the media feature corresponding to the query media may be obtained through another convolutional neural network.
That is, when the query media is of the video type, the media feature is a video feature; when the query media is of the image type, the media feature is a static image feature. A static image feature can be understood as a feature extracted from the static information of a single image, of a single video frame, or of a single video segment.
It should be noted that training a three-dimensional convolutional neural network requires a large number of annotated videos and consumes substantial computing resources. To reduce the demand on computing resources, in one possible implementation, a three-dimensional convolutional neural network (I3D) model pre-trained on the Kinetics dataset may be used directly to obtain the media feature of the query media and the static image feature corresponding to the candidate video.
The Kinetics dataset, released by the DeepMind team, is a large-scale, high-quality dataset of approximately 650,000 video clips covering 700 human action classes, including human-object interactions such as playing a musical instrument and human-human interactions such as shaking hands and hugging. Each action class has at least 600 video clips. Each clip is annotated with a single action class and lasts around 10 seconds.
It can be understood that the media feature of the query media and the static image feature corresponding to the candidate video may or may not be obtained at the same time. For example, the static image feature corresponding to the candidate video may be obtained in advance; when the user performs a video query with the query media, the terminal device obtains the query media and then obtains its media feature. Alternatively, when the user performs a video query with the query media, the terminal device obtains the query media and then obtains the media feature of the query media and the static image feature corresponding to the candidate video at the same time.
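To make this feature-extraction step concrete, the following is a minimal sketch of per-segment feature extraction with a pre-trained 3D convolutional network. The text names an I3D model pre-trained on Kinetics; since no specific I3D checkpoint is given, torchvision's r3d_18 (Kinetics-400 weights) stands in for it here as an assumption, and all shapes are illustrative. Note that this sketch returns pooled feature vectors, whereas the pipeline described below keeps spatial feature maps for the later warping step.

```python
# Sketch only: r3d_18 stands in for the I3D backbone described in the text.
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(weights="KINETICS400_V1")
backbone.fc = torch.nn.Identity()  # drop the action classifier, keep features
backbone.eval()

def segment_features(clips: torch.Tensor) -> torch.Tensor:
    """clips: (N, 3, T, H, W) float tensor holding N video segments."""
    with torch.no_grad():
        return backbone(clips)     # (N, 512) static feature per segment

# Example: static features for 4 segments of 16 frames at 112x112.
feats = segment_features(torch.rand(4, 3, 16, 112, 112))
```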
S202: Determine a video feature of the candidate video according to the static image feature and the temporal motion information of the moving object in the candidate video.
In the embodiments of this application, the temporal motion information of the moving object identifies how the motion trend of the moving object changes over time in the candidate video. For example, the temporal motion information includes how, between adjacent time nodes, the position of the moving object changes under its motion trend. Optionally, the temporal motion information is learned from the candidate video by a neural network, or obtained in other ways, for example by annotation.
When the video feature of the candidate video is determined, in addition to the static image feature that reflects the static information in the candidate video, the temporal motion information of the candidate video is further introduced. Therefore, in the process of determining the video feature of the candidate video, the information conveyed by the moving object while it moves can be captured accurately through the temporal motion information, so that the information carried by the video feature accurately describes the motion of the moving object in the candidate video; that is, the video feature clearly identifies the moving object.
Thus, when subsequently judging whether the moving object is related to the target object, the video feature provides high-quality information about the moving object, improving the accuracy of the judgment.
S203: Determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
It can be understood that after determining whether the moving object in a candidate video is related to the target object, the terminal device puts the candidate videos whose moving objects are related to the target object into the query result according to the determination, and displays the query result to the user after the query is completed, helping the user find the candidate videos related to the query video.
During the video query, which time intervals of a candidate video are related to the target object is determined according to whether each video segment of the candidate video is related to the target object. When the query result is provided to the user, the information about the relevant intervals in the retrieved candidate video, that is, the user's intervals of interest, is provided as well. This spares the user from having to look through the query result from beginning to end to find which parts are related to the query need (for example, the target object), reducing the time the user spends examining the query result.
It should be noted that, because a video may display a great deal of content, even if the user jumps directly to the relevant part of a candidate video using the relevant interval information, the user may not quickly find where the target object is within the current video display interface. To improve viewing efficiency, an embodiment of this application further provides a way of marking the candidate video: the region where the target object appears is marked in the candidate video, effectively guiding the user.
In this implementation, after it is determined through S203 whether the moving object in the t-th video segment is related to the target object, the target region may further be marked in the t-th video segment of the candidate video, intuitively showing the user where the target object is in the current video display interface. For example, when viewing a video in the query result, the user can start playback directly from the playback position indicated by the time interval information in the query result. During playback, when the target object (or the moving object related to the target object) appears, the region where it appears carries a specific marking effect, such as a border in an eye-catching color, so that guided by this marking, the user can quickly lock eyes on the target object and achieve the query purpose.
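As a rough illustration of the marking just described, the sketch below draws a box over the region returned for a relevant segment. OpenCV, the BGR color, and the coordinate format are illustrative choices, not mandated by the text.

```python
# Sketch only: draw the returned target region on a frame during playback.
import cv2

def mark_target_region(frame, box, color=(0, 0, 255), thickness=2):
    """frame: HxWx3 BGR image; box: (x1, y1, x2, y2) region coordinates."""
    x1, y1, x2, y2 = map(int, box)
    return cv2.rectangle(frame.copy(), (x1, y1), (x2, y2), color, thickness)
```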
It can be seen from the foregoing technical solutions that a media feature of the query media and a static image feature corresponding to a candidate video are obtained. The query media includes a target object, and the candidate video includes a moving object. A video feature of the candidate video is determined according to the static image feature and the temporal motion information of the moving object in the candidate video. Because the temporal motion information accurately captures the information conveyed by the moving object while it moves, the video feature determined in this way can accurately describe the moving object, effectively avoiding the adverse effect that a moving object would otherwise have on video query. It can therefore be accurately determined, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving the user's query experience.
Because the moving object in a candidate video moves in a certain way, it generally does not stay at one fixed position in the video but may appear at different positions at different times according to its motion. Therefore, when determining the video feature of the candidate video, the candidate video can be divided into segments and a corresponding sub-feature determined for each segment, so that the motion described above can be determined more accurately through the sub-features, improving the accuracy of subsequent video queries.
The terminal device divides the candidate video into multiple time-ordered segments, for example n video segments, where n is an integer greater than 1. A video segment includes at least one video frame, and different segments may include different numbers of frames. Among the n segments so obtained, the t-th segment and the (t-1)-th segment are adjacent in time order, with the time interval of the t-th segment later than that of the (t-1)-th segment, and t is a positive integer greater than 1 and not greater than n.
The sub-feature corresponding to the t-th video segment carries features reflecting the information of the candidate video from the 1st segment to the t-th segment. The sub-feature corresponding to the last segment is equivalent to the video feature of the candidate video.
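A minimal sketch of this time-ordered segmentation is given below; the equal-length split is an illustrative assumption, since the text allows segments with different numbers of frames.

```python
# Sketch only: split decoded frames into n consecutive, time-ordered segments.
def split_into_segments(frames, n):
    """frames: list of video frames; returns n consecutive chunks."""
    size = max(1, len(frames) // n)
    segments = [frames[i * size:(i + 1) * size] for i in range(n - 1)]
    segments.append(frames[(n - 1) * size:])  # last segment takes the remainder
    return segments
```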
By the t-th video segment, the moving object in the candidate video may already have moved from position a to position b in the picture. In a conventional approach that ignores the motion trend of the moving object, the information about the moving object reflected in the sub-feature of the t-th segment is not concentrated and may be scattered between positions a and b on the feature plane of the sub-feature. In other words, the sub-feature determined by the conventional approach does not clearly reflect the information about the moving object, so that during video query, even when the moving object is actually related to the target object, it is difficult to determine the correlation accurately. Therefore, an embodiment of this application provides a way of determining the sub-feature of a video segment according to the temporal motion information, which strengthens the information about the moving object in the sub-feature and thereby improves query accuracy.
A candidate video may include multiple video segments, each corresponding to a sub-feature; the t-th segment corresponds to the t-th sub-feature, and every sub-feature is determined based on the temporal motion information. The sub-feature of each segment is determined in a similar way, so for ease of introduction, the following takes the t-th sub-feature as an example to describe how a sub-feature is determined according to the temporal motion information.
In one possible implementation, the method includes: determining the target motion trend, in the t-th video segment, of the moving object in the (t-1)-th sub-feature. Because the temporal motion information of the moving object in the candidate video reflects the motion trend of the moving object between adjacent video segments, and the t-th and (t-1)-th segments are adjacent in time order, the target motion trend of the moving object in the (t-1)-th sub-feature within the t-th segment can be determined from the temporal motion information.
Then, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature is adjusted according to the target motion trend, so that the target region moves to the position of the moving object in the t-th video segment. Next, the t-th sub-feature is determined according to the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th segment. Because the information of the (t-1)-th segment is superimposed on that of the t-th segment, and the (t-1)-th segment in turn carries the information of the segment before it, and so on, the t-th sub-feature so determined carries features reflecting the information of the candidate video from the 1st segment to the t-th segment, strengthening the information about the moving object in the sub-feature and helping improve query accuracy.
It should be noted that the conventional approach ignores the motion trend of the moving object and directly uses a convolutional long short-term memory network (ConvLSTM) to determine the t-th sub-feature. The principle of the ConvLSTM is illustrated in FIG. 3, where x_t denotes the static image feature corresponding to the t-th video segment, h_{t-1} denotes the (t-1)-th sub-feature, and h_t denotes the t-th sub-feature. That is, the ConvLSTM used in the conventional approach determines the t-th sub-feature h_t directly from the static image feature x_t of the t-th segment and the (t-1)-th sub-feature h_{t-1}.
By contrast, the way of determining sub-features according to temporal motion information provided in the embodiments of this application uses a warp long short-term memory network (WarpLSTM), a modification of the conventional ConvLSTM, whose principle is illustrated in FIG. 4. Here x_t denotes the static image feature corresponding to the t-th video segment, h_{t-1} denotes the (t-1)-th sub-feature, h_t denotes the t-th sub-feature, and h'_{t-1} denotes the (t-1)-th sub-feature adjusted according to the target motion trend. That is, the WarpLSTM used in the embodiments of this application determines the t-th sub-feature h_t from the static image feature x_t of the t-th segment and the adjusted (t-1)-th sub-feature h'_{t-1}.
Next, how to adjust, according to the target motion trend, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature is described in detail.
In one possible implementation, control points can be defined on the feature plane, distributed uniformly over it. For example, in FIG. 4 there are 9 control points, located at the intersections of three horizontal lines and three vertical lines. Denoting the n defined control points by {(x_1, y_1), ..., (x_n, y_n)}, the feature plane is interpolated as:

    S(x, y) = v_1 + v_2·x + v_3·y + Σ_{i=1}^{n} w_i · φ(‖(x_i, y_i) − (x, y)‖)

where S(x, y) denotes the feature plane,

    φ(r) = r² log r

is the radial basis function (the standard thin-plate-spline kernel), w_i, v_1, v_2, and v_3 are interpolation parameters, (x_i, y_i) are the coordinates of the i-th control point, n is the number of control points, and i is a positive integer not greater than n.
After the control points are defined, one convolutional layer of the WarpLSTM predicts an offset (dx_i, dy_i) for every control point. After the offset, the i-th control point (x_i, y_i) moves to (x_i + dx_i, y_i + dy_i). When a control point moves, the region around it moves with it, similar to the weighted movement in skeletal animation. Moving the target region is therefore equivalent to moving the control points within it: to adjust the target region, the offsets corresponding to the control points in the target region are first determined according to the target motion trend, the control points are then moved according to the offsets, and the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature is adjusted with the moved control points as the reference.
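The following sketch shows the control-point mechanism in code: a small head predicts a 2-D offset for each of the 3×3 control points, the sparse offsets are expanded to a dense displacement field, and the previous sub-feature is warped with it. For brevity, the thin-plate-spline interpolation S(x, y) is approximated here by bilinear upsampling of the offset grid, and all shapes are illustrative assumptions.

```python
# Sketch only: predict control-point offsets and warp the previous sub-feature.
import torch
import torch.nn.functional as F

C = 256                              # feature channels (illustrative)
offset_head = torch.nn.Sequential(   # predicts (dx, dy) per control point
    torch.nn.AdaptiveAvgPool2d((3, 3)),
    torch.nn.Conv2d(2 * C, 2, kernel_size=1),
)

def warp_previous_subfeature(x_t, h_prev):
    """x_t, h_prev: (N, C, H, W); returns h_prev warped toward x_t."""
    n, _, hgt, wid = h_prev.shape
    ctrl = offset_head(torch.cat([x_t, h_prev], dim=1))       # (N, 2, 3, 3)
    flow = F.interpolate(ctrl, size=(hgt, wid), mode="bilinear",
                         align_corners=True)                  # dense offsets
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, hgt),
                            torch.linspace(-1, 1, wid), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(n, hgt, wid, 2)
    grid = base + flow.permute(0, 2, 3, 1)                    # shifted sampling grid
    return F.grid_sample(h_prev, grid, align_corners=True)
```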
In this case, the WarpLSTM is formulated as follows:

    d_{t-1} = w_xd * x_t + w_hd * h_{t-1} + b_d
    h'_{t-1} = warp(h_{t-1}, d_{t-1})
    c'_{t-1} = warp(c_{t-1}, d_{t-1})
    i_t = σ(w_xi * x_t + w_hi * h'_{t-1} + b_i)
    g_t = tanh(w_xg * x_t + w_hg * h'_{t-1} + b_g)
    f_t = σ(w_xf * x_t + w_hf * h'_{t-1} + b_f)
    o_t = σ(w_xo * x_t + w_ho * h'_{t-1} + b_o)
    c_t = f_t ⊙ c'_{t-1} + i_t ⊙ g_t
    h_t = o_t ⊙ tanh(c_t)

where d_{t-1} denotes the offsets (dx_i, dy_i) corresponding to the control points in the target region; x_t is the static image feature corresponding to the t-th video segment and serves as the input of the WarpLSTM; h_{t-1} is the (t-1)-th sub-feature; h'_{t-1} and c'_{t-1} are the hidden state and cell state obtained by moving the control points according to the offsets d_{t-1}, and together constitute the adjusted (t-1)-th sub-feature; warp() is the warping function; σ() denotes the sigmoid activation function; i_t, g_t, f_t, and o_t are the input gate, new input, forget gate, and output gate of the WarpLSTM, respectively; w_xd, w_hd, w_xi, w_hi, w_xg, w_hg, w_xf, w_hf, w_xo, w_ho, b_d, b_i, b_g, b_f, and b_o are model parameters; h_t is the resulting t-th sub-feature and serves as the output of the WarpLSTM model; and ⊙ denotes the element-wise (Hadamard) product.
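Put together, one WarpLSTM step can be sketched as below: an offset field d_{t-1} is predicted from x_t and h_{t-1}, the previous hidden and cell states are warped with it, and a standard ConvLSTM update runs on the warped states. A dense per-pixel offset field stands in for the control-point offsets here, the warping is passed in as a callable (for example, one built like the grid-sampling sketch above), and channel sizes are illustrative assumptions.

```python
# Sketch only: one step of a warp LSTM cell following the equations above.
import torch

class WarpLSTMCell(torch.nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.offsets = torch.nn.Conv2d(in_ch + hid_ch, 2, 3, padding=1)   # d_{t-1}
        self.gates = torch.nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, x_t, h_prev, c_prev, warp):
        d = self.offsets(torch.cat([x_t, h_prev], dim=1))  # offset field d_{t-1}
        h_w = warp(h_prev, d)                              # h'_{t-1}
        c_w = warp(c_prev, d)                              # c'_{t-1}
        i, g, f, o = self.gates(torch.cat([x_t, h_w], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_w + i * torch.tanh(g)                  # element-wise products
        h_t = o * torch.tanh(c_t)                          # t-th sub-feature
        return h_t, c_t
```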
In a video segment, besides information reflecting the moving object, a sub-feature also includes other information, for example information clearly unrelated to the moving object, such as the video background. To reduce the computation involved in determining relevance to the target object, the information of such regions can be removed in advance and only the information of regions possibly related to the target object retained, improving video query efficiency.
For the t-th sub-feature, one possible implementation of S203 is to determine, by removing information clearly unrelated to the target object, a target region that is associated with the target object on the feature plane of the t-th sub-feature.
In this embodiment, the target region and its region feature can be determined as follows. Boxes are placed uniformly over a video segment so that they cover all positions in the segment and overlap to a certain extent. A region proposal network (RPN) then judges whether the region corresponding to each box is associated with the target object, thereby determining the target region. The target region is determined by the following formula:

    p_k = RPN(h_i)

where p_k is the k-th target region, h_i is the i-th sub-feature, and RPN is the region proposal function.
After the target region is determined, the region feature corresponding to the target region is determined according to the t-th sub-feature, so that whether the moving object in the t-th video segment is related to the target object is determined according to the region feature and the media feature. The region feature corresponding to the target region is computed as:

    r_k = ROI(h_i, p_k)

where r_k is the region feature of the target region, p_k is the k-th target region, h_i is the i-th sub-feature, and ROI is the region pooling function.
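The sketch below illustrates the proposal-and-pooling step: overlapping boxes are laid over the sub-feature plane and a region feature is pooled for each with roi_align. A fixed grid of boxes replaces the learned RPN scoring here, which is a deliberate simplification; shapes are illustrative.

```python
# Sketch only: grid proposals over a sub-feature map plus ROI pooling.
import torch
from torchvision.ops import roi_align

def grid_proposals(hgt, wid, box=8, stride=4):
    """Overlapping box proposals covering an (hgt, wid) feature plane."""
    boxes = [[x, y, min(x + box, wid), min(y + box, hgt)]
             for y in range(0, hgt, stride) for x in range(0, wid, stride)]
    return torch.tensor(boxes, dtype=torch.float)

h_i = torch.rand(1, 256, 32, 32)             # i-th sub-feature map
p = grid_proposals(32, 32)                   # (K, 4) candidate regions p_k
r = roi_align(h_i, [p], output_size=(7, 7))  # (K, 256, 7, 7) region features r_k
```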
When the query media is in video format, that is, when it is a query video, the target object may not appear in every frame of the query video, or may be blurry or incomplete in some frames. In that case, using the video feature of the query video directly as the query basis may increase computation or reduce query accuracy because the video feature carries much other information, for example matching candidate videos whose background content resembles that of the query video but which do not contain the target object. To avoid such situations and reduce computation, the embodiments of this application use attention weighting to determine a query video feature that better reflects the information about the target object.
In the process of determining whether the moving object in the t-th video segment is related to the target object, the weights between the frame content of the query video and the target object are first determined through an attention model. Generally, the more complete or clearer the target object in the frame content of the query video, the larger the resulting weight. For example, the weights can be determined according to the correlation between the query video and the target regions in the candidate video, computed as:

    e_{k,j} = ω^T tanh(W_q · q_j + W_r · avg(r_k) + b_p) + b_s
    α_{k,j} = exp(e_{k,j}) / Σ_{j'} exp(e_{k,j'})

where q_j denotes the video feature of the query video (the feature of its j-th frame), r_k denotes the region feature of the target region, e_{k,j} is the correlation score between q_j and r_k, avg is the averaging function, α_{k,j} is the weight between the frame content of the query video and the target object, W_q, W_r, ω, b_p, and b_s are model parameters, and ω^T is the transpose of ω.
In this way, when the query video feature of the query video is determined according to these weights, the influence of information other than the target object on the query video feature is reduced and the information about the target object in the query video feature is strengthened, making it easier to accurately determine, according to the region feature and the query video feature, whether the moving object in the t-th video segment is related to the target object.
The query video feature is computed as:

    q̃ = Σ_j α_{k,j} · q_j

where q̃ is the weighted query video feature, α_{k,j} is the weight between the frame content of the query video and the target object, and q_j is the video feature of the query video.
For example, suppose the query video includes 20 frames, of which the 5th contains the target object completely and clearly while the target object does not appear, or is blurry or incomplete, in the remaining frames. The weights determined by the attention model might then be 0.9 for the 5th frame and 0.1 for each of the other frames. When the query video feature is determined from these weights, because the weight of the 5th frame is clearly higher than those of the other frames, the resulting query video feature mainly reflects the information about the target object in the 5th frame, reducing the influence of other information, strengthening the information about the target object in the query video feature, and making it easier to accurately determine whether the moving object in a video segment is related to the target object.
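The sketch below mirrors this attention weighting: an additive score e_{k,j} between each query-frame feature q_j and one region feature r_k is normalized with a softmax into weights α_{k,j}, whose weighted sum gives the attended query feature. The bias terms b_p and b_s live inside the linear layers; the feature dimension is an illustrative assumption.

```python
# Sketch only: attention weights over query-video frames for one region.
import torch

D = 256                        # feature dimension (illustrative)
W_q = torch.nn.Linear(D, D)    # applied to query-frame features
W_r = torch.nn.Linear(D, D)    # applied to the region feature
w = torch.nn.Linear(D, 1)      # omega^T plus the scalar bias b_s

def attended_query_feature(q, r_k):
    """q: (J, D) per-frame query features; r_k: (D,) one region feature."""
    e = w(torch.tanh(W_q(q) + W_r(r_k))).squeeze(-1)  # (J,) scores e_{k,j}
    alpha = torch.softmax(e, dim=0)                   # weights alpha_{k,j}
    return (alpha.unsqueeze(-1) * q).sum(dim=0)       # (D,) weighted feature
```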
It should be noted that after the query video feature q̃ and the region feature r_k of the target region are obtained, the two features can be concatenated and passed through two convolutional layers and two fully connected layers, which respectively output whether the target region is related to the query video and the refined region coordinates of the target region related to the query video.
The formulas for determining, according to the region feature and the query video feature, whether the moving object in the t-th video segment is related to the target object, and for the refined region coordinates of the target region related to the query video, are:

    f = concat(q̃, r_k)
    l = softmax[FC(Conv(f))]
    bb = FC(Conv(f))

where f is the feature obtained by concatenating q̃ and r_k, l is the classified category (related or unrelated), bb denotes the refined region coordinates, Conv denotes a convolutional layer, FC denotes a fully connected layer, and softmax is the Softmax activation function.
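Finally, the matching step can be sketched as follows: the attended query feature and the region feature (kept as small maps here) are concatenated, passed through two convolutional layers, and two fully connected heads output the related/unrelated classification l and the refined box coordinates bb. Layer widths and the pooling are illustrative assumptions.

```python
# Sketch only: concatenation + two conv layers + two fully connected heads.
import torch

class MatchHead(torch.nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.convs = torch.nn.Sequential(
            torch.nn.Conv2d(2 * ch, ch, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(ch, ch, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        )
        self.cls = torch.nn.Linear(ch, 2)  # l: related vs. unrelated
        self.box = torch.nn.Linear(ch, 4)  # bb: refined region coordinates

    def forward(self, q_feat, r_feat):
        f = self.convs(torch.cat([q_feat, r_feat], dim=1))  # concatenated feature
        return torch.softmax(self.cls(f), dim=-1), self.box(f)
```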
Next, the video query method provided in the embodiments of this application is introduced with reference to an actual application scenario. In this scenario, the query media is a query video (for example, a published short video), and the user wants to determine, by querying with this video, whether any candidate video copies the short video. To this end, the video query can be performed with the method provided in the embodiments of this application.
A video query performed with the method provided in the embodiments of this application mainly includes five parts: video feature extraction, long-video warp accumulation, region proposal, attention weighting, and region refinement. The flow of the video query method is shown in FIG. 5.
Part 1: video feature extraction.
Given a query video and a candidate video, features are first extracted from each. For example, the query video and the candidate video can each be divided into multiple video segments, and a three-dimensional convolutional network used to extract features from each segment of the candidate video and the query video, obtaining the static image feature corresponding to each segment.
Part 2: long-video warp accumulation.
Because the candidate video includes a moving object, to accurately capture the information conveyed by the moving object while it moves, a warp long short-term memory network (WarpLSTM) can be used to accumulate the temporal motion information within the candidate video and to warp-adjust the sub-features of the candidate video's segments according to that information, so that when the t-th sub-feature is determined, it can be determined according to the adjusted (t-1)-th sub-feature and the extracted static image feature corresponding to the t-th segment.
Part 3: region proposal.
A region proposal network (RPN) selects target regions in the candidate video that may be associated with the target object in the query video, and the region feature corresponding to each target region is determined through region pooling.
Part 4: attention weighting.
For the query video, to determine a query video feature that better reflects the information about the target object, attention weighting can be applied to the query video, obtaining the weighted query video feature.
Part 5: region refinement.
After the query video feature and the region feature of the target region are obtained, the two features can be concatenated, then compared with the candidate video through convolutional layers and two fully connected layers, which respectively output the category of whether the target region is related to the query video and the refined region coordinates of the related target region, thereby precisely locating which regions of which video segments in the candidate video are related to the query video. The process of outputting whether the target region is related to the query video can be called classification; outputting the refined coordinates of the related target region can be embodied by marking the target region, for example with a border in an eye-catching color, in which case this process can be called box regression.
This application provides an end-to-end spatio-temporal video re-localization method. Given a short query video, this application can find, in other longer candidate videos, the spatio-temporal video segments that contain the same semantic content as the query video. In the temporal dimension, "spatio-temporal video segment" means locating a video interval within the longer candidate video; in the spatial dimension, it means precisely returning, within the located segment, the target region, that is, the region related to the query video. In one example, given a short query video, this application can quickly locate the parts of a long candidate video that interest the user, so that the user does not have to browse it from beginning to end, saving time. In addition, this technology can also be used for video copy detection, to detect the dissemination of copyright-infringing videos.
Based on the video query method provided in the foregoing embodiments, an embodiment of this application further provides a video query apparatus. Referring to FIG. 6, the apparatus includes an obtaining unit 601, a first determining unit 602, and a second determining unit 603:
the obtaining unit 601 is configured to obtain a media feature of query media and a static image feature corresponding to a candidate video, the query media including a target object and the candidate video including a moving object;
the first determining unit 602 is configured to determine a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
the second determining unit 603 is configured to determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
In one possible implementation, the video feature includes sub-features corresponding to n video segments of the candidate video; the t-th video segment corresponds to the t-th sub-feature; the temporal motion information reflects the motion trend of the moving object between adjacent video segments; t is a positive integer greater than 1 and not greater than n; and n is an integer greater than 1.
The first determining unit 602 is specifically configured to:
determine the target motion trend, in the t-th video segment, of the moving object in the (t-1)-th sub-feature;
adjust, according to the target motion trend, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature; and
determine the t-th sub-feature according to the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th video segment.
In one possible implementation, the first determining unit 602 is further configured to:
determine, according to the target motion trend, the offsets corresponding to the control points in the target region; and
move the control points according to the offsets, and adjust, with the moved control points as the reference, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature.
In one possible implementation, the second determining unit 603 is specifically configured to:
determine a target region associated with the target object on the feature plane of the t-th sub-feature;
determine, according to the t-th sub-feature, the region feature corresponding to the target region; and
determine, according to the region feature and the media feature, whether the moving object in the t-th video segment is related to the target object.
In one possible implementation, the second determining unit 603 is further configured to:
determine, through an attention model, the weights between the frame content of the query video and the target object;
determine the query video feature of the query video according to the weights; and
determine, according to the region feature and the query video feature, whether the moving object in the t-th video segment is related to the target object.
In one possible implementation, if the moving object in the t-th video segment is related to the target object, referring to FIG. 7, the apparatus further includes a marking unit 604:
the marking unit 604 is configured to mark the target region in the t-th video segment of the candidate video.
It can be seen from the foregoing technical solutions that a media feature of the query media and a static image feature corresponding to a candidate video are obtained. The query media includes a target object, and the candidate video includes a moving object. A video feature of the candidate video is determined according to the static image feature and the temporal motion information of the moving object in the candidate video. Because the temporal motion information accurately captures the information conveyed by the moving object while it moves, the video feature determined in this way can accurately describe the moving object, effectively avoiding the recognition difficulty that a moving object would otherwise cause in video query. It can therefore be accurately determined, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object in the query media, improving the user's query experience.
An embodiment of this application further provides a device for video query, which is introduced below with reference to the accompanying drawings. Referring to FIG. 8, an embodiment of this application provides a device 700 for video query. The device 700 may be a terminal device, which may be any smart terminal including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS) terminal, an in-vehicle computer, or the like. The following takes a mobile phone as an example:
FIG. 8 is a block diagram of part of the structure of a mobile phone related to the terminal device provided in this embodiment of this application. Referring to FIG. 8, the mobile phone includes components such as a radio frequency (RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (WiFi) module 770, a processor 780, and a power supply 790. A person skilled in the art will understand that the phone structure shown in FIG. 8 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes each component of the mobile phone in detail with reference to FIG. 8:
The RF circuit 710 can be used to receive and send signals during the receiving and sending of information or during a call. In particular, after receiving downlink information from a base station, it passes the information to the processor 780 for processing; it also sends designed uplink data to the base station. Generally, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 710 can also communicate with networks and other devices through wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and so on.
The memory 720 can be used to store software programs and modules. By running the software programs and modules stored in the memory 720, the processor 780 executes the various functional applications and data processing of the mobile phone. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the phone (such as audio data and a phone book). In addition, the memory 720 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 730 can be used to receive input digital or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 731 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection apparatus according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 780, and can receive and execute commands sent by the processor 780. In addition, the touch panel 731 may be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 731, the input unit 730 may also include other input devices 732. Specifically, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick.
The display unit 740 can be used to display information input by the user, information provided to the user, and the various menus of the mobile phone. The display unit 740 may include a display panel 741, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 731 may cover the display panel 741. When the touch panel 731 detects a touch operation on or near it, it transmits the operation to the processor 780 to determine the type of the touch event, and the processor 780 then provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in FIG. 8 the touch panel 731 and the display panel 741 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the phone.
The mobile phone may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 741 according to the ambient light, and the proximity sensor can turn off the display panel 741 and/or the backlight when the phone is moved to the ear. As one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the phone's posture (such as portrait/landscape switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as pedometer and tapping). Other sensors that may also be configured on the phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
The audio circuit 760, a speaker 761, and a microphone 762 can provide an audio interface between the user and the mobile phone. The audio circuit 760 can transmit the electrical signal converted from received audio data to the speaker 761, which converts it into a sound signal for output. On the other hand, the microphone 762 converts the collected sound signal into an electrical signal, which the audio circuit 760 receives and converts into audio data; the audio data is then output to the processor 780 for processing and sent via the RF circuit 710 to, for example, another mobile phone, or output to the memory 720 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 770, the mobile phone can help the user send and receive emails, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although FIG. 8 shows the WiFi module 770, it can be understood that it is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 780 is the control center of the mobile phone. It connects all parts of the phone using various interfaces and lines, and executes the various functions of the phone and processes data by running or executing the software programs and/or modules stored in the memory 720 and invoking the data stored in the memory 720, thereby monitoring the phone as a whole. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 780.
The mobile phone also includes a power supply 790 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 780 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, which are not described here.
In this embodiment, the processor 780 included in the terminal device also has the following functions:
obtaining a media feature of query media and a static image feature corresponding to a candidate video, the query media including a target object and the candidate video including a moving object;
determining a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
The device for video query provided in this embodiment of this application may be a server. Referring to FIG. 9, FIG. 9 is a structural diagram of a server 800 provided in an embodiment of this application. The server 800 may vary greatly with configuration or performance and may include one or more central processing units (CPU) 822 (for example, one or more processors), a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may provide transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
The steps performed by the server in the foregoing embodiments may be based on the server structure shown in FIG. 9.
The CPU 822 is configured to perform at least the following steps:
obtaining a media feature of query media and a static image feature corresponding to a candidate video, the query media including a target object and the candidate video including a moving object;
determining a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
Optionally, the CPU 822 is further configured to execute the video query method described in the foregoing method embodiments.
An embodiment of this application provides a computer-readable storage medium configured to store program code, the program code being used to execute the video query method described in the foregoing embodiments.
The terms "first", "second", "third", "fourth", and so on (if any) in the specification of this application and the foregoing drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described here can, for example, be implemented in orders other than those illustrated or described here. Moreover, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (14)

  1. A video query method, applied to an electronic device, the method comprising:
    obtaining a media feature of query media and a static image feature corresponding to a candidate video, the query media comprising a target object and the candidate video comprising a moving object;
    determining a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
    determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  2. The method according to claim 1, wherein the video feature comprises sub-features corresponding to n video segments of the candidate video; a t-th video segment corresponds to a t-th sub-feature; the temporal motion information reflects a motion trend of the moving object between adjacent video segments; t is a positive integer greater than 1 and not greater than n; and n is an integer greater than 1; and
    the determining the video feature of the candidate video according to the static image feature and the temporal motion information of the moving object in the candidate video comprises:
    determining a target motion trend, in the t-th video segment, of the moving object in a (t-1)-th sub-feature;
    adjusting, according to the target motion trend, a target region corresponding to the moving object in a feature plane of the (t-1)-th sub-feature; and
    determining the t-th sub-feature according to the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th video segment.
  3. The method according to claim 2, wherein the adjusting, according to the target motion trend, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature comprises:
    determining, according to the target motion trend, offsets corresponding to control points in the target region; and
    moving the control points according to the offsets, and adjusting, with the moved control points as a reference, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature.
  4. The method according to claim 2, wherein the determining, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object comprises:
    determining a target region associated with the target object on the feature plane of the t-th sub-feature;
    determining, according to the t-th sub-feature, a region feature corresponding to the target region; and
    determining, according to the region feature and the media feature, whether the moving object in the t-th video segment is related to the target object.
  5. The method according to claim 4, wherein the query media is a query video, and the determining, according to the region feature and the media feature, whether the moving object in the t-th video segment is related to the target object comprises:
    determining, through an attention model, weights between frame content of the query video and the target object;
    determining a query video feature of the query video according to the weights; and
    determining, according to the region feature and the query video feature, whether the moving object in the t-th video segment is related to the target object.
  6. The method according to claim 4 or 5, wherein if the moving object in the t-th video segment is related to the target object, the method further comprises:
    marking the target region in the t-th video segment of the candidate video.
  7. A video query apparatus, comprising an obtaining unit, a first determining unit, and a second determining unit:
    the obtaining unit being configured to obtain a media feature of query media and a static image feature corresponding to a candidate video, the query media comprising a target object and the candidate video comprising a moving object;
    the first determining unit being configured to determine a video feature of the candidate video according to the static image feature and temporal motion information of the moving object in the candidate video; and
    the second determining unit being configured to determine, according to the media feature and the video feature, whether the moving object in the candidate video is related to the target object.
  8. The apparatus according to claim 7, wherein the video feature comprises sub-features corresponding to n video segments of the candidate video; a t-th video segment corresponds to a t-th sub-feature; the temporal motion information reflects a motion trend of the moving object between adjacent video segments; t is a positive integer greater than 1 and not greater than n; and n is an integer greater than 1; and
    the first determining unit is specifically configured to:
    determine a target motion trend, in the t-th video segment, of the moving object in a (t-1)-th sub-feature;
    adjust, according to the target motion trend, a target region corresponding to the moving object in a feature plane of the (t-1)-th sub-feature; and
    determine the t-th sub-feature according to the adjusted (t-1)-th sub-feature and the static image feature corresponding to the t-th video segment.
  9. The apparatus according to claim 8, wherein the first determining unit is further configured to:
    determine, according to the target motion trend, offsets corresponding to control points in the target region; and
    move the control points according to the offsets, and adjust, with the moved control points as a reference, the target region corresponding to the moving object in the feature plane of the (t-1)-th sub-feature.
  10. The apparatus according to claim 8, wherein the second determining unit is specifically configured to:
    determine a target region associated with the target object on the feature plane of the t-th sub-feature;
    determine, according to the t-th sub-feature, a region feature corresponding to the target region; and
    determine, according to the region feature and the media feature, whether the moving object in the t-th video segment is related to the target object.
  11. The apparatus according to claim 10, wherein the second determining unit is further configured to:
    determine, through an attention model, weights between frame content of the query video and the target object;
    determine a query video feature of the query video according to the weights; and
    determine, according to the region feature and the query video feature, whether the moving object in the t-th video segment is related to the target object.
  12. The apparatus according to claim 10 or 11, wherein if the moving object in the t-th video segment is related to the target object, the apparatus further comprises a marking unit:
    the marking unit being configured to mark the target region in the t-th video segment of the candidate video.
  13. A device for video query, comprising a processor and a memory:
    the memory being configured to store program code and transmit the program code to the processor; and
    the processor being configured to execute, according to instructions in the program code, the video query method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, configured to store program code, the program code being used to execute the video query method according to any one of claims 1 to 6.
PCT/CN2020/086670 2019-04-29 2020-04-24 Video query method, apparatus, device, and storage medium WO2020221121A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/333,001 US11755644B2 (en) 2019-04-29 2021-05-27 Video query method, apparatus, and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910355782.3A CN110083742B (zh) 2019-04-29 2022-12-06 Video query method and apparatus
CN201910355782.3 2019-04-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/333,001 Continuation US11755644B2 (en) 2019-04-29 2021-05-27 Video query method, apparatus, and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020221121A1 true WO2020221121A1 (zh) 2020-11-05

Family

ID=67417598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/086670 WO2020221121A1 (zh) 2019-04-29 2020-04-24 视频查询方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US11755644B2 (zh)
CN (1) CN110083742B (zh)
WO (1) WO2020221121A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083742B (zh) 2019-04-29 2022-12-06 腾讯科技(深圳)有限公司 Video query method and apparatus
US11270147B1 (en) 2020-10-05 2022-03-08 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
US11423252B1 (en) * 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239566A (zh) * 2014-09-28 2014-12-24 小米科技有限责任公司 Method and device for video search
CN106570165A (zh) * 2016-11-07 2017-04-19 北京航空航天大学 Content-based video retrieval method and device
CN107609513A (zh) * 2017-09-12 2018-01-19 北京小米移动软件有限公司 Video type determination method and device
CN109086709A (zh) * 2018-07-27 2018-12-25 腾讯科技(深圳)有限公司 Feature extraction model training method and device, and storage medium
CN110083742A (zh) * 2019-04-29 2019-08-02 腾讯科技(深圳)有限公司 Video query method and apparatus

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299812B (zh) * 2008-06-25 2012-12-05 北京中星微电子有限公司 Video analysis and storage method and system, and video retrieval method and system
CN102567380A (zh) * 2010-12-28 2012-07-11 沈阳聚德视频技术有限公司 Method for retrieving vehicle information in video images
CN102609548A (zh) * 2012-04-19 2012-07-25 李俊 Moving-object-based video content retrieval method and system
CN104125430B (zh) * 2013-04-28 2017-09-12 华为技术有限公司 Video moving object detection method and apparatus, and video surveillance system
CN103347167B (zh) * 2013-06-20 2018-04-17 上海交通大学 Segment-based surveillance video content description method
CN104166685B (zh) * 2014-07-24 2017-07-11 北京捷成世纪科技股份有限公司 Method and apparatus for detecting video segments
CN104200218B (zh) * 2014-08-18 2018-02-06 中国科学院计算技术研究所 Cross-view action recognition method and system based on temporal information
US10319412B2 (zh) * 2016-11-16 2019-06-11 Adobe Inc. Robust tracking of objects in videos
US10515117B2 (zh) * 2017-02-14 2019-12-24 Cisco Technology, Inc. Generating and reviewing motion metadata
CN108399381B (zh) * 2018-02-12 2020-10-30 北京市商汤科技开发有限公司 Pedestrian re-identification method and apparatus, electronic device, and storage medium
CN109101653A (zh) * 2018-08-27 2018-12-28 国网天津市电力公司 Video file retrieval method, system, and application

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239566A (zh) * 2014-09-28 2014-12-24 小米科技有限责任公司 Method and device for video search
CN106570165A (zh) * 2016-11-07 2017-04-19 北京航空航天大学 Content-based video retrieval method and device
CN107609513A (zh) * 2017-09-12 2018-01-19 北京小米移动软件有限公司 Video type determination method and device
CN109086709A (zh) * 2018-07-27 2018-12-25 腾讯科技(深圳)有限公司 Feature extraction model training method and device, and storage medium
CN110083742A (zh) * 2019-04-29 2019-08-02 腾讯科技(深圳)有限公司 Video query method and apparatus

Also Published As

Publication number Publication date
CN110083742A (zh) 2019-08-02
US11755644B2 (en) 2023-09-12
CN110083742B (zh) 2022-12-06
US20210287006A1 (en) 2021-09-16

Similar Documents

Publication Publication Date Title
CN108304441B (zh) Network resource recommendation method and apparatus, electronic device, server, and storage medium
WO2020253657A1 (zh) Video segment positioning method and apparatus, computer device, and storage medium
WO2020177582A1 (zh) Video synthesis method, model training method, device, and storage medium
WO2020215962A1 (zh) Video recommendation method and apparatus, computer device, and storage medium
TW202011268A (zh) Target tracking method and apparatus, medium, and device
WO2020221121A1 (zh) Video query method, apparatus, device, and storage medium
TWI489397B (zh) Method, apparatus, and computer program product for providing adaptive gesture analysis
CN110263213B (zh) Video push method and apparatus, computer device, and storage medium
WO2019233216A1 (zh) Gesture action recognition method, apparatus, and device
CN111582116B (zh) Video erasure trace detection method and apparatus, device, and storage medium
CN111552888A (zh) Content recommendation method and apparatus, device, and storage medium
CN110163066B (zh) Multimedia data recommendation method and apparatus, and storage medium
WO2018133681A1 (zh) Search result ranking method and apparatus, server, and storage medium
CN111629247B (zh) Information display method and apparatus, and electronic device
CN109495616B (zh) Photographing method and terminal device
CN110933468A (zh) Playback method and apparatus, electronic device, and medium
WO2020259522A1 (zh) Content search method, related device, and computer-readable storage medium
CN107728877B (zh) Application recommendation method and mobile terminal
CN111738100A (zh) Lip-shape-based speech recognition method and terminal device
CN109361864B (zh) Shooting parameter setting method and terminal device
CN112001442B (zh) Feature detection method and apparatus, computer device, and storage medium
CN109739414A (zh) Picture processing method, mobile terminal, and computer-readable storage medium
CN110958487B (zh) Video playback progress positioning method and electronic device
CN112766406A (zh) Item image processing method and apparatus, computer device, and storage medium
CN110213307B (zh) Multimedia data push method and apparatus, storage medium, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20798061

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20798061

Country of ref document: EP

Kind code of ref document: A1