CN112040325B - Video playing method and device, electronic equipment and storage medium

Info

Publication number: CN112040325B
Application number: CN202011200283.6A
Authority: CN
Inventors: 付培; 罗振波; 朱翔宇; 吉翔
Original assignee: Chengdu Ruiyan Technology Co ltd
Current assignee: Chengdu Ruiyan Technology Co ltd
Priority date: 2020-11-02
Filing date: 2020-11-02
Publication date: 2021-01-29
Anticipated expiration: 2040-11-02
Also published as: CN112040325A

Abstract

The application provides a video playing method, a video playing apparatus, an electronic device and a storage medium. The method comprises: screening out a target object image from a target image library and acquiring index information corresponding to the target object image, wherein the target image library stores a mapping relation between target object images cropped from video image frames in video stream data and the index information of those target object images; screening out a positioning image frame from the video stream data according to the index information corresponding to the target object image; and playing the video stream data starting from the positioning image frame. Because the positioning image frame is screened out of the video stream data by means of a target image library established in advance, the operation error that arises when the video playing progress bar is dragged manually is avoided, and the accuracy of locating the image frame in which the target object appears in the video stream data is improved.

Description

Video playing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the technical field of video processing and image processing, and in particular, to a video playing method, an apparatus, an electronic device, and a storage medium.

Background

After video stream data is captured by a surveillance camera, a party presenting evidence can use the video stream data as evidence. In presenting the evidence, that party needs to locate, within a certain time period of the video stream data, the image frame in which a target object appears, and to play the video stream data from that image frame.

At present, most methods for obtaining the image frames in which a target object appears in video stream data and playing the video rely on manual searching, that is, on gradually narrowing the range based on provided positioning information until the image frames in which the target object appears are found. For example, if the suspect appears in the video stream data captured by the surveillance camera at about 9 pm, an operator can manually seek to about 9 pm, start playing the video, and then locate the image frame of the suspect in the video stream data. In practice, it has been found that, because of factors such as the operation error introduced by manually dragging the video playing progress bar and inaccurate video image frame times, it is difficult to drag the progress bar directly to the position where the suspect appears; that is, it is difficult to accurately locate, in the video stream data, the image frame in which the target object appears.

Disclosure of Invention

An object of the embodiments of the present application is to provide a video playing method, apparatus, electronic device and storage medium that alleviate the problem that it is difficult to accurately locate, in video stream data, the image frame in which a target object appears.

An embodiment of the application provides a video playing method, comprising: screening out a target object image from a target image library and acquiring index information corresponding to the target object image, wherein the target image library stores a mapping relation between target object images cropped from video image frames in video stream data and the index information of those target object images; screening out a positioning image frame from the video stream data according to the index information corresponding to the target object image; and playing the video stream data starting from the positioning image frame. In the implementation process, the positioning image frame is screened out of the video stream data by means of a target image library established in advance, so that the operation error that arises when the video playing progress bar is dragged manually is avoided, and the accuracy of locating the image frame in which the target object appears in the video stream data is improved.

Optionally, in this embodiment of the present application, the screening out the positioning image frame from the video stream data according to the index information corresponding to the target object image includes: screening a plurality of candidate image frames from video stream data according to the index information of the target object image; and screening out positioning image frames from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image. In the implementation process, the positioning image frames are screened out from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image, so that the accuracy of the image frames positioned to the target object from the video stream data is effectively improved.

Optionally, in this embodiment of the present application, the index information is an acquisition time, and screening out a plurality of candidate image frames from the video stream data according to the index information of the target object image includes: acquiring a first acquisition time of the image frame corresponding to the target object image; and determining the video image frames whose acquisition time lies within a preset time range around the first acquisition time as candidate image frames.

Optionally, in an embodiment of the present application, screening out the positioning image frame from the plurality of candidate image frames according to the similarity between each candidate image frame and the target object image includes: determining, as the positioning image frame, the candidate image frame among the plurality of candidate image frames that meets a preset condition, wherein the preset condition is either that the similarity between the candidate image frame and the target object image is the maximum, or that the similarity between the candidate image frame and the target object image is greater than a preset similarity threshold and the candidate image frame appears earliest. In the implementation process, selecting the candidate image frame that meets this preset condition improves the accuracy of screening out the positioning image frame from the plurality of candidate image frames.

Optionally, in an embodiment of the present application, screening out the positioning image frame from the plurality of candidate image frames according to the similarity between each candidate image frame and the target object image includes: screening out, from the plurality of candidate image frames, a plurality of similar image frames whose target object type is the same as that of the target object image; screening out, from the plurality of similar image frames, the similar image frame with the maximum similarity to the target object image; and determining the similar image frame with the maximum similarity as the positioning image frame. In the implementation process, first restricting the candidates to frames of the same target object type and then taking the one with the maximum similarity improves the accuracy of screening out the positioning image frame from the plurality of candidate image frames.

Optionally, in this embodiment of the present application, screening out the target object image from the target image library includes: acquiring an image to be queried; extracting feature information of the image to be queried, the feature information including image features or face features; and screening out, from the target image library, the target object image with the maximum similarity to the image to be queried according to the feature information of the image to be queried. In the implementation process, the feature information (for example, the face features) of the image to be queried is extracted and used to screen out the most similar target object image from the target image library, which effectively improves the accuracy of this screening.

Optionally, in this embodiment of the present application, screening out the target object image from the target image library includes: in response to an image selection operation on the target image library, obtaining the target object image selected by the image selection operation.

Optionally, in this embodiment of the application, before the step of screening out the target object image from the target image library, the method further includes: acquiring a mapping relation between a target object image and index information of the target object image, correcting the mapping relation to obtain a corrected mapping relation, and storing the corrected mapping relation into a target image library; or acquiring a mapping relation between the target object image and the index information of the target object image, storing the mapping relation into a target image library, and then correcting the mapping relation. In the implementation process, the mapping relation between the target object image and the index information of the target object image is obtained, the mapping relation is corrected to obtain the corrected mapping relation, and then the corrected mapping relation is stored in a target image library; or acquiring a mapping relation between the target object image and the index information of the target object image, storing the mapping relation into a target image library, and then correcting the mapping relation; thereby effectively improving the accuracy of locating the image frame appearing to the target object from the video stream data.

Optionally, in this embodiment of the present application, storing the mapping relation into the target image library includes: using a target detection model to predict a candidate region where a target object is located in a video image frame in the video stream data and a prediction probability that the target object exists in the candidate region; and, if the prediction probability is greater than a preset probability threshold, cropping the target object image from the video image frame according to the candidate region and storing the mapping relation between the target object image and the index information of the target object image into the target image library. In the implementation process, only candidate regions whose prediction probability exceeds the preset probability threshold are cropped and stored, which improves the accuracy of locating, in the video stream data, the image frame in which the target object appears.

Optionally, in this embodiment of the present application, correcting the mapping relation includes: screening out a corrected positioning frame from the video stream data according to the index information corresponding to the target object image; and replacing the mapping relation in the target image library with the mapping relation between the corrected positioning frame and the index information of the corrected positioning frame. In the implementation process, replacing the stored mapping relation with that of the corrected positioning frame effectively improves the accuracy of locating, in the video stream data, the image frame in which the target object appears.

An embodiment of the present application further provides a video playing device, including: the index information acquisition module is used for screening a target object image from a target image library and acquiring index information corresponding to the target object image, and the target image library stores the mapping relation between the target object image intercepted by a video image frame in video stream data and the index information of the target object image; the positioning image screening module is used for screening positioning image frames from the video stream data according to the index information corresponding to the target object image; and the video positioning playing module is used for playing the image frames positioned from the video stream data according to the positioning image frames.

Optionally, in an embodiment of the present application, the positioning image filtering module includes: the candidate image screening module is used for screening a plurality of candidate image frames from the video stream data according to the index information of the target object image; and the positioning frame screening submodule is used for screening the positioning image frames from the candidate image frames according to the similarity between each candidate image frame in the candidate image frames and the target object image.

Optionally, in this embodiment of the present application, the index information is an acquisition time; a candidate image screening module comprising: the acquisition time acquisition module is used for acquiring a first acquisition time of an image frame corresponding to the target object image; and the first candidate determining module is used for determining the video image frame with the acquisition time in a preset time range near the first acquisition time as a candidate image frame.

Optionally, in this embodiment of the present application, the index information obtaining module includes: the query image acquisition module is used for acquiring an image to be queried; the characteristic information extraction module is used for extracting the characteristic information of the image to be inquired, and the characteristic information comprises: image features or face features; and the target image screening module is used for screening out a target object image with the maximum similarity with the image to be inquired from the target image library according to the characteristic information of the image to be inquired.

Optionally, in this embodiment of the present application, the index information obtaining module includes: a target image obtaining module, configured to obtain, in response to an image selection operation on the target image library, the target object image selected by the image selection operation.

Optionally, in this embodiment of the present application, the video playing apparatus further includes: the mapping relation correction module is used for acquiring the mapping relation between the target object image and the index information of the target object image, correcting the mapping relation to acquire a corrected mapping relation, and then storing the corrected mapping relation into a target image library; or acquiring a mapping relation between the target object image and the index information of the target object image, storing the mapping relation into a target image library, and then correcting the mapping relation.

Optionally, in this embodiment of the present application, the mapping relation correction module is specifically configured to: predict, using a target detection model, a candidate region where a target object is located in a video image frame in the video stream data and a prediction probability that the target object exists in the candidate region; and, if the prediction probability is greater than a preset probability threshold, crop the target object image from the video image frame according to the candidate region and store the mapping relation between the target object image and the index information of the target object image into the target image library.

Optionally, in this embodiment of the present application, the mapping relation correction module is specifically configured to: screen out a corrected positioning frame from the video stream data according to the index information corresponding to the target object image; and replace the mapping relation in the target image library with the mapping relation between the corrected positioning frame and the index information of the corrected positioning frame.

An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.

Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of a video playing method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram illustrating the relationship between similarity and frame identifier provided in an embodiment of the present application;

Fig. 3 is a schematic flow chart of correcting the mapping relation provided in an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a video playing device according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

Before introducing the video playing method provided in the embodiment of the present application, some concepts related in the embodiment of the present application are introduced:

The target detection network model, also called the target detection model for short, refers to a neural network model obtained by training a target detection network with training data. By number of stages, target detection networks can be roughly divided into single-stage detection models and two-stage detection models. A single-stage detection model is a network model that directly outputs the region and category information of a target without separately searching for candidate regions; a two-stage detection model is a network model whose detection algorithm proceeds in two steps, first obtaining candidate regions and then classifying them.

A server refers to a device that provides computing services over a network, for example an x86 server or a non-x86 server; non-x86 servers include mainframes, minicomputers and UNIX servers.

It should be noted that the video playing method provided in the embodiment of the present application may be executed by an electronic device, where the electronic device refers to a device terminal having a function of executing a computer program or the server described above, and the device terminal includes, for example: a smart phone, a Personal Computer (PC), a tablet computer, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), a network switch or a network router, and the like.

Before introducing the video playing method provided by the embodiment of the present application, the application scenarios to which it applies are introduced. These include, but are not limited to, fields such as public security and safe cities. A specific example: after video stream data is captured and recorded using a Digital Video Recorder (DVR) or a Network Video Recorder (NVR), a forensic investigator who uses an image frame of the suspect appearing in the video stream data as evidence needs to find the position of that image frame in the video stream data.

The electronic device that processes the video (for example, a local server) uses its own local time, whereas the video stream data it uses is received from other devices over protocols such as the Real Time Streaming Protocol (RTSP). Because the time of the local server differs from the time of the video stream, and because decoding differences and frame loss may occur during video stream transmission, the target object images in the target image library and their index information (for example, the acquisition time) may be inaccurate. The video playing method can therefore be used to query and replay video stream data, that is, to retrieve the image frames in which a suspect appears within a certain past time period, improve the accuracy of locating the image frames in which the target suspect appears, and play the video from the located image frame. In other words, the video playing method can quickly and accurately locate, from an image of a target object (such as a pedestrian or a vehicle), the position at which the target object appears in the source video stream data, and play the video stream back from that position, which greatly facilitates subsequent analysis.

Please refer to Fig. 1, which is a schematic flow chart of the video playing method provided in the embodiment of the present application. The main idea of the video playing method is to screen out a positioning image frame from the video stream data using a target image library established in advance, which avoids the operation error that arises when the video playing progress bar is dragged manually and improves the accuracy of locating the image frame in which the target object appears in the video stream data. The video playing method may include:

Step S110: screening out a target object image from the target image library, and acquiring the index information corresponding to the target object image.

The target object image refers to a region image cropped from a video image frame in the video stream data according to the region that the target object occupies in that frame. The target object may be a pedestrian (such as a suspect), a vehicle, an animal, or the like. The target object image may be obtained, for example, as follows: a target detection model is used to predict a candidate region where the target object is located in a video image frame in the video stream data, and the target object image is cropped from the video image frame according to the candidate region. Target detection models that may be used include, but are not limited to, a Region-based Convolutional Neural Network (RCNN).
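The cropping step described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the frame is modelled as a list of pixel rows, each detection is a hypothetical ((x1, y1, x2, y2), probability) pair standing in for the output of a target detection model, and the default threshold of 0.5 is an illustrative value for the preset probability threshold mentioned elsewhere in this application. The name crop_detections does not come from the application.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end-exclusive; assumed box format


def crop_detections(
    frame: Sequence[Sequence[int]],
    detections: Sequence[Tuple[Box, float]],
    prob_threshold: float = 0.5,  # illustrative preset probability threshold
) -> List[List[List[int]]]:
    """Crop one region image per detection whose probability exceeds the threshold."""
    crops = []
    for (x1, y1, x2, y2), prob in detections:
        if prob > prob_threshold:
            # Keep rows y1..y2-1 and, within each row, columns x1..x2-1.
            crops.append([list(row[x1:x2]) for row in frame[y1:y2]])
    return crops


# A tiny 3x4 "frame" whose pixel value encodes its row and column.
frame = [[r * 10 + c for c in range(4)] for r in range(3)]
detections = [((1, 0, 3, 2), 0.9), ((0, 0, 1, 1), 0.2)]
print(crop_detections(frame, detections))  # [[[1, 2], [11, 12]]]
```

Only the first detection is cropped here, because the second falls below the threshold.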

The target image library is a store of mapping relations: it stores the mapping relation between target object images cropped from video image frames in the video stream data and the index information of those target object images. The target image library may be a relational database or a non-relational database. The target image library is obtained, for example, by storing the mapping relation between each acquired target object image and its index information into the target image library.

The index information refers to information on the position, within the video stream data, of a target object image in the target image library. Specifically, the acquisition time or the frame identifier of the target object image may be chosen as the index information; that is, any information that uniquely identifies the image and can be sorted may serve as the index information.
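As a rough illustration of such a mapping relation, the following Python sketch keeps the library in memory as a dictionary. The application only says the library may be a relational or non-relational database, so the in-memory store, the class names LibraryEntry and TargetImageLibrary, and the fields image_id and features are all assumptions made for this example; the acquisition time is used as the index information.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LibraryEntry:
    image_id: str                # hypothetical key for a cropped target object image
    features: Tuple[float, ...]  # hypothetical feature vector of the cropped image
    acquisition_time: float      # index information: seconds on the stream's time axis


class TargetImageLibrary:
    """In-memory stand-in for the target image library (assumption, not a database)."""

    def __init__(self) -> None:
        self._entries: Dict[str, LibraryEntry] = {}

    def store(self, entry: LibraryEntry) -> None:
        # Store the mapping: target object image -> index information.
        self._entries[entry.image_id] = entry

    def index_info(self, image_id: str) -> float:
        # Look up the index information for a target object image.
        return self._entries[image_id].acquisition_time


library = TargetImageLibrary()
library.store(LibraryEntry("person_001", (0.1, 0.9), 21 * 3600.0))  # captured at 9 pm
print(library.index_info("person_001"))  # 75600.0
```

A frame identifier could be stored in place of, or alongside, the acquisition time, since either is sortable and identifies the image's position in the stream.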

The implementation of step S110 may include two cases:

In the first case, a user (for example, a video retriever) provides an image to be queried that includes the target object, and the image to be queried is compared with the images in the target image library to determine the target object image. This case may include:

Step S111: acquiring an image to be queried.

The embodiment of step S111 includes the following. In a first obtaining mode, a forensic investigator provides the image to be queried; the investigator may obtain the image by photographing the target object with a device such as a video camera, or may obtain the image to be queried from a file system or a database. In a second obtaining mode, software such as a browser is used to obtain the image to be queried from the internet, or another application program is used to access the internet and obtain the image to be queried. The image to be queried may be an image of the suspect taken in the past, or an image drawn by an artist according to a witness's description.

Step S112: extracting the feature information of the image to be queried.

The embodiment of step S112 is, for example: extracting the image features or face features of the image to be queried using a convolutional neural network model. Convolutional neural network models that may be used here include, but are not limited to, LeNet, AlexNet, VGG, GoogLeNet and ResNet. The feature information includes image features or features of the target object; if the target object is a person, the feature information includes appearance features, face features, or the like.

Step S113: screening out, from the target image library, the target object image with the maximum similarity to the image to be queried according to the feature information of the image to be queried.

The embodiment of step S113 includes: screening out, from the target image library, the target object image whose feature information has the maximum similarity to that of the image to be queried. For example, if the target object being searched for is a suspect, the target object image with the maximum similarity to the image to be queried is screened out from the target image library according to the appearance features, face features, or the like of the target object. The similarity is calculated, for example, as follows: first feature information of the target object in the image to be queried and second feature information of the target object in the target object image are extracted respectively, and the similarity between the first feature information and the second feature information is calculated. The similarity can be quantified by cosine distance, Euclidean distance, Hamming distance, information entropy, or the like.
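The similarity-based screening of step S113 can be sketched as follows, using cosine similarity as the measure (one of the options named above). The sketch assumes that feature vectors have already been extracted and are held in a plain dictionary keyed by hypothetical image identifiers; the function names cosine_similarity and most_similar are illustrative and do not come from the application.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float], library: Dict[str, Sequence[float]]) -> str:
    """Return the id of the library image whose features are most similar to the query."""
    return max(library, key=lambda image_id: cosine_similarity(query, library[image_id]))


# Hypothetical two-dimensional features for three target object images.
library_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(most_similar([0.9, 0.1], library_features))  # a
```

Euclidean or Hamming distance could be substituted by replacing cosine_similarity and selecting the minimum distance instead of the maximum similarity.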

In the second case, if the user cannot provide an image to be queried, the user may directly use an image in the target image library as the target object image. This case may include:

Step S114: in response to an image selection operation on the target image library, obtaining the target object image selected by the image selection operation.

The embodiment of step S114 is, for example: a forensic investigator browses the target object images in the target image library on the electronic device and, if a target object image shows the suspect, performs a selection operation on that target object image on the electronic device, whereby the target object image selected by the image selection operation is obtained.

After step S110, step S120 is performed: screening out a positioning image frame from the video stream data according to the index information corresponding to the target object image.

The implementation of step S120 may include:

Step S121: screening out a plurality of candidate image frames from the video stream data according to the index information of the target object image.

There are many embodiments of the above step S121, including but not limited to:

In the first embodiment, the acquisition time of the target object image is used as the index information, and the candidate image frames are screened out from the video stream data according to the acquisition time. For example: a first acquisition time of the image frame corresponding to the target object image is acquired; this first acquisition time can be understood as a moment on the time axis of the video stream and is denoted T. The video image frames whose acquisition time lies within a preset time range around the first acquisition time are determined as candidate image frames; that is, the video stream in the time range [T-N, T+N] centered on T is taken, where [T-N, T+N] is the preset time range and N is a preset deviation duration, which may for example be set to 10 minutes. The preset time range may be set according to the specific situation. For example, assuming that the first acquisition time of the image frame corresponding to the target object image is 9 pm and the preset time range is ten minutes before and after, all video image frames from 8:50 pm to 9:10 pm can be taken as candidate image frames.

In the second embodiment, the frame identifier of the target object image is used as the index information, and the candidate image frames are screened out from the video stream data according to the frame identifier. The implementation principle of the second embodiment is similar to that of the first embodiment, except that the frame identifier is used as the index information, and is therefore not described in detail here.
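The time-window screening of the first embodiment of step S121 can be sketched as follows. The sketch assumes acquisition times are expressed in seconds since midnight on the video stream's time axis and that frames are referred to by their position in a list; the function name candidate_frame_indices is illustrative only.

```python
from typing import List


def candidate_frame_indices(frame_times: List[float], t: float, n: float) -> List[int]:
    """Return indices of frames whose acquisition time lies in [t - n, t + n]."""
    return [i for i, ft in enumerate(frame_times) if t - n <= ft <= t + n]


T = 21 * 3600  # first acquisition time: 9 pm, in seconds since midnight
N = 10 * 60    # preset deviation duration: 10 minutes
frame_times = [74900, 75000, 75600, 76200, 76300]  # hypothetical frame timestamps
print(candidate_frame_indices(frame_times, T, N))  # [1, 2, 3]
```

When a frame identifier is used as the index information instead, the same window test could be applied to frame numbers rather than timestamps.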

Step S122: screening out the positioning image frame from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image.

There are many embodiments of the above step S122, including but not limited to:

In the first embodiment, the positioning image frame is screened from the plurality of candidate image frames according to the similarity only. For example: a candidate image frame that meets a preset condition is screened out from the plurality of candidate image frames and determined as the positioning image frame, where the preset condition is either that the similarity between the candidate image frame and the target object image is the maximum, or that the similarity between the candidate image frame and the target object image is greater than a preset similarity threshold and the candidate image frame appears earliest. The preset similarity threshold may be set according to specific situations, for example to 70%, 80%, or 90%. Assuming that the preset condition is the second one, and that the candidate image frames whose similarity to the target object image is greater than the preset similarity threshold are the video image frames from 8:50 pm to 9:10 pm, the candidate image frame at 8:50 pm, which appears earliest, can be determined as the positioning image frame.

In the second embodiment, the positioning image frame is screened out from the plurality of candidate image frames according to both the similarity and the type of the target object. For example: a plurality of similar image frames whose target object type (such as pedestrians, vehicles, human faces, motor vehicles and the like) is the same as that of the target object image are screened from the plurality of candidate image frames, the similar image frame with the maximum similarity to the target object image is then screened from the plurality of similar image frames, and that similar image frame is determined as the positioning image frame. A specific example: there are three candidate image frames, and the target object type of the three candidate image frames and of the target object image is pedestrian in every case; the three candidate image frames are a first candidate image frame, a second candidate image frame, and a third candidate image frame, whose similarities to the target object image are 70%, 80%, and 90% respectively, so the third candidate image frame should be determined as the positioning image frame.
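The two embodiments of step S122 can be illustrated with the following sketch. It assumes the similarity of each candidate has already been computed and that frame identifiers increase with time; `Candidate` and the three helper functions are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    frame_id: int       # assumed to increase with time
    similarity: float   # 0.0 (completely dissimilar) to 1.0 (completely similar)
    object_type: str    # e.g. "pedestrian", "vehicle"


def pick_max_similarity(cands: Sequence[Candidate]) -> Optional[Candidate]:
    """Preset condition 1: the candidate with the maximum similarity."""
    return max(cands, key=lambda c: c.similarity) if cands else None


def pick_earliest_above(cands: Sequence[Candidate], threshold: float) -> Optional[Candidate]:
    """Preset condition 2: the earliest candidate whose similarity exceeds the threshold."""
    for c in sorted(cands, key=lambda c: c.frame_id):
        if c.similarity > threshold:
            return c
    return None


def pick_by_type(cands: Sequence[Candidate], target_type: str) -> Optional[Candidate]:
    """Second embodiment: keep candidates of the same object type, then take the maximum."""
    return pick_max_similarity([c for c in cands if c.object_type == target_type])


# The three-pedestrian example from the text.
cands = [
    Candidate(1, 0.70, "pedestrian"),
    Candidate(2, 0.80, "pedestrian"),
    Candidate(3, 0.90, "pedestrian"),
]
positioning = pick_by_type(cands, "pedestrian")
```

Here `positioning` is the third candidate, matching the example above; `pick_earliest_above(cands, 0.75)` would instead select the second candidate.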

Please refer to fig. 2, which illustrates the relationship between similarity and frame identifier provided in the embodiment of the present application. The ordinate in the figure represents the similarity, which ranges from 0% (completely dissimilar) to 100% (completely similar); the abscissa represents the time axis of the video stream (which may also be understood as the video frames in the time direction), and the numbers on the abscissa represent video image frame identifiers or frame numbers. It is understood that, theoretically, for a real target the similarity should be very close to 100%. Specifically: assuming that the acquisition time of the image frame corresponding to the target object image is denoted T on the time axis of the video stream, the video stream within the time range [T-N, T+N] is intercepted with T as the center, where [T-N, T+N] is the preset time range and N is the preset deviation duration. The value of N may be set according to specific conditions; the larger N is, the longer the image search takes, so the specific value of N needs to be adjusted according to the performance of the actual network, hardware, and the like. If the similarities of the frames within [T-N, T+N] are arranged along the time line (by frame identifier or frame number), they should form a sequence that first increases and then decreases; the image frame with the maximum similarity lies at the position of the maximum, that is, its frame identifier is the frame identifier at the position of the maximum (the point in the dashed circle).

In the process of screening out the image frame with the maximum similarity to the target object image from the plurality of image frames, the idea of block sorting may be adopted: each time, a batch consisting of a preset number of image frames is taken for analysis. Within each batch, the target image frame with the maximum similarity to the query image and the frame identifier corresponding to that target image frame are found, the similarity of that target image frame is taken as the similarity of the batch to which it belongs, and the batch similarities are arranged along the time line. When the batch similarity, having gradually increased, begins to decrease, the frame identifier corresponding to the batch with the maximum similarity can be directly judged to be the final target frame identifier; at this point the search can be ended in advance, the subsequent frame sequence does not need to be analyzed, and the speed of positioning the target image frame can be greatly increased.
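The batch-wise search with early termination can be sketched as below. This is only one possible reading of the paragraph above, assuming the similarities along the time line rise to a single peak and then fall; `find_peak_frame` is a hypothetical name, and the function also reports how many frames were examined so the early stop is visible.

```python
def find_peak_frame(similarities, batch_size):
    """Scan per-frame similarities (ordered by frame number) batch by batch.

    Returns (frame_index, similarity, frames_examined), or None for empty input.
    The scan stops as soon as the best similarity of a batch drops below the
    best similarity of the previous batches.
    """
    best = None       # (frame_index, similarity) of the best batch so far
    examined = 0
    for start in range(0, len(similarities), batch_size):
        batch = similarities[start:start + batch_size]
        examined += len(batch)
        offset = max(range(len(batch)), key=batch.__getitem__)
        batch_best = (start + offset, batch[offset])
        if best is not None and batch_best[1] < best[1]:
            break     # batch similarity started to decrease: the peak has been passed
        best = batch_best
    if best is None:
        return None
    return best[0], best[1], examined


# Similarities for frames 0..11 inside [T-N, T+N]; the peak is at frame 5.
sims = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05]
result = find_peak_frame(sims, batch_size=3)
```

In this example the scan stops after the third batch, so only 9 of the 12 frames are analyzed. If the similarity curve were not single-peaked, an early stop of this kind could miss a later, higher maximum.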

After step S120, step S130 is performed: playing, according to the positioning image frame, the image frames located in the video stream data.

The embodiment of step S130 described above is, for example: playing the video stream data from the image frame located by the obtained positioning image frame. Specifically, for example: assuming that the positioning image frame is the image frame at 8:50 pm, 10 seconds and 200 milliseconds, the video stream data can be played starting from 8:50 pm, 10 seconds and 200 milliseconds. Of course, the user may also set which frame in the video stream data to start playing from according to specific situations. For example, assuming the same positioning image frame, the playing of the video stream data can be started 100 milliseconds or 1 second in advance: if the user sets playback to start 100 milliseconds in advance, the video stream data is played from 8:50 pm, 10 seconds and 100 milliseconds; if the user sets playback to start 1 second in advance, the video stream data is played from 8:50 pm, 9 seconds and 200 milliseconds.
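The start-time arithmetic in the example above can be checked with a short sketch. The function name `playback_start` and the optional clamping to the beginning of the stream are assumptions added for illustration; the application itself only describes subtracting a user-set lead time.

```python
from datetime import datetime, timedelta


def playback_start(positioning_time, lead=timedelta(0), stream_start=None):
    """Return the time at which playback begins.

    `lead` is the user-set advance; `stream_start`, if given, is a lower bound
    (an assumption here, so that playback never starts before the stream does).
    """
    start = positioning_time - lead
    if stream_start is not None and start < stream_start:
        start = stream_start
    return start


# Positioning image frame at 8:50 pm, 10 seconds and 200 milliseconds.
located = datetime(2020, 11, 2, 20, 50, 10, 200_000)
exact = playback_start(located)
early_100ms = playback_start(located, lead=timedelta(milliseconds=100))
early_1s = playback_start(located, lead=timedelta(seconds=1))
```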

In the implementation process, a target object image is screened from the target image library, a positioning image frame is screened from the video stream data according to the index information corresponding to the target object image, and the video stream data is played from the image frame located according to the positioning image frame; that is to say, the positioning image frame is screened out from the video stream data through the target image library established in advance, which avoids the operation errors that arise when the video playing progress bar is dragged manually and improves the accuracy of locating the image frame of the target object in the video stream data.

Optionally, in order to improve the accuracy of positioning the image frame, the mapping relationship between the image and the index information of the image may be corrected each time the target image library is used; of course, the mapping relationship may also be corrected when the target image library is not in use, which may include the following two cases:

In the first case, please refer to fig. 3, which is a schematic flow chart of correcting the mapping relationship provided in the embodiment of the present application; the mapping relationship is corrected first and then stored, which may include:

Step S210: acquiring a mapping relationship between the target object image and the index information of the target object image.

The embodiment of step S210 described above is, for example: using a target detection model to predict a candidate region where the target object is located in a video image frame in the video stream data, together with the prediction probability that the target object exists in the candidate region. If the prediction probability is greater than a preset probability threshold, the target object image is intercepted from the video image frame according to the candidate region and associated with the index information of the target object image, thereby obtaining the mapping relationship between the target object image and its index information. Target detection models that may be used include, but are not limited to, network models such as RCNN, Fast RCNN, Feature Fusion Single Shot MultiBox Detector (FSSD), and YOLO.
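The thresholding and cropping described for step S210 can be sketched as follows. The detector itself is left out: the sketch assumes its output is already available as a list of candidate regions with prediction probabilities. `Detection`, `crop` and `build_mappings` are hypothetical names, the image is modeled as a plain nested list of pixel values, and the threshold value 0.5 is an arbitrary illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple          # candidate region as (x1, y1, x2, y2), end-exclusive
    probability: float  # predicted probability that the target object is present


def crop(image, box):
    """Intercept the candidate region from an image stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def build_mappings(image, detections, index_info, threshold=0.5):
    """Pair each sufficiently confident crop with the frame's index information."""
    return [
        (crop(image, d.box), index_info)
        for d in detections
        if d.probability > threshold
    ]


# A 4x4 "frame" with pixel values 0..15 and two hypothetical detections.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
detections = [Detection((1, 1, 3, 3), 0.92), Detection((0, 0, 2, 2), 0.30)]
index_info = {"frame_id": 1200, "captured_at": "2020-11-02T20:50:10.200"}
mappings = build_mappings(image, detections, index_info)
```

Only the first detection exceeds the threshold, so a single mapping between the cropped target object image and its index information is produced.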

Step S220: correcting the mapping relationship to obtain the corrected mapping relationship.

The embodiment of step S220 described above is, for example: screening a corrected positioning frame from the video stream data according to the index information corresponding to the target object image, where the corrected positioning frame is a positioning image frame used for correcting the mapping relationship; the corrected positioning frame is obtained in a manner similar to step S120 and differs only in its purpose, so the specific manner is not described again here; and then replacing the mapping relationship in the target image library with the mapping relationship between the corrected positioning frame and the index information of the corrected positioning frame.
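One way to picture the correction in step S220 is the sketch below, which uses a frame identifier as the index information and an in-memory dictionary as a stand-in for the target image library. The names `correct_mapping`, `similarity_by_frame` and `window` are hypothetical, and the per-frame similarities are assumed to have been computed already.

```python
def correct_mapping(library, image_id, similarity_by_frame, window):
    """Replace the stored frame identifier with that of the corrected positioning frame.

    library: dict mapping image_id -> stored frame identifier (index information)
    similarity_by_frame: dict mapping frame identifier -> similarity to the target image
    window: preset deviation, in frames, around the stored frame identifier
    """
    stored = library[image_id]
    # Candidate frames: those within the preset range around the stored index.
    candidates = {
        fid: sim
        for fid, sim in similarity_by_frame.items()
        if abs(fid - stored) <= window
    }
    if candidates:
        # Corrected positioning frame: the candidate with the maximum similarity.
        library[image_id] = max(candidates, key=candidates.get)
    return library[image_id]


library = {"img-1": 100}
similarity_by_frame = {95: 0.60, 98: 0.80, 100: 0.85, 103: 0.95, 120: 0.99}
corrected = correct_mapping(library, "img-1", similarity_by_frame, window=5)
```

Frame 120 has the highest similarity overall but lies outside the preset range, so the stored index is corrected from 100 to 103.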

Step S230: storing the corrected mapping relationship into the target image library.

The embodiment of step S230 described above is, for example: storing the corrected mapping relationship between the target object image and the index information of the target object image into the target image library, where the target image library can be a relational database or a non-relational database. Relational databases that can be used are, for example: MySQL, PostgreSQL, Oracle, SQL Server, etc.; non-relational databases that may be used include: the Grakn database, the Neo4j database, HBase (a Hadoop subsystem), MongoDB, CouchDB, etc.
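As an illustration of storing the mapping relationship in a relational database, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is not among the databases named above and is used here only because it needs no server; the table name, column names and helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    """
    CREATE TABLE target_image (
        image_id    TEXT PRIMARY KEY,
        image       BLOB NOT NULL,     -- the cropped target object image
        frame_id    INTEGER NOT NULL,  -- index information: frame identifier
        captured_at TEXT NOT NULL      -- index information: acquisition time
    )
    """
)


def store_mapping(conn, image_id, image_bytes, frame_id, captured_at):
    """Insert the mapping, or overwrite it when a corrected mapping is stored."""
    conn.execute(
        "INSERT OR REPLACE INTO target_image VALUES (?, ?, ?, ?)",
        (image_id, image_bytes, frame_id, captured_at),
    )
    conn.commit()


def lookup_index(conn, image_id):
    """Return (frame_id, captured_at) for a target object image, or None."""
    return conn.execute(
        "SELECT frame_id, captured_at FROM target_image WHERE image_id = ?",
        (image_id,),
    ).fetchone()


store_mapping(conn, "img-1", b"\x89PNG...", 100, "2020-11-02T20:50:10.000")
store_mapping(conn, "img-1", b"\x89PNG...", 103, "2020-11-02T20:50:10.200")  # corrected
index_info = lookup_index(conn, "img-1")
```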

In the second case, the mapping relationship is stored first and then corrected. The correction may be triggered immediately after the mapping relationship is stored, triggered once at a scheduled time (without repetition) after the mapping relationship is stored, or triggered periodically after the mapping relationship is stored. This case may include:

Step S240: acquiring a mapping relationship between the target object image and the index information of the target object image.

The implementation principle and implementation manner of step S240 are similar to those of step S210 and are therefore not described again here; if anything is unclear, reference may be made to the description of step S210.

Step S250: storing the mapping relationship into the target image library.

The implementation principle and implementation manner of step S250 are similar to those of step S230 and are therefore not described again here; if anything is unclear, reference may be made to the description of step S230.

Step S260: correcting the mapping relationship stored in the target image library.

The implementation principle and implementation manner of step S260 are similar to those of step S220 and are therefore not described again here; if anything is unclear, reference may be made to the description of step S220.

Please refer to fig. 4, which is a schematic structural diagram of a video playing apparatus according to an embodiment of the present application; the embodiment of the present application provides a video playing apparatus 300, including:

The index information obtaining module 310 is configured to screen a target object image from a target image library and obtain index information corresponding to the target object image, where the target image library stores a mapping relationship between the target object image intercepted from a video image frame in video stream data and the index information of the target object image.

The positioning image screening module 320 is configured to screen a positioning image frame from the video stream data according to the index information corresponding to the target object image.

The video positioning playing module 330 is configured to play the image frames positioned from the video stream data according to the positioning image frames.

Optionally, in an embodiment of the present application, the positioning image screening module includes:

The candidate image screening module is configured to screen a plurality of candidate image frames from the video stream data according to the index information of the target object image.

The positioning frame screening sub-module is configured to screen the positioning image frame from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image.

Optionally, in this embodiment of the present application, the index information is an acquisition time; the candidate image screening module includes:

The acquisition time obtaining module is configured to acquire the first acquisition time of the image frame corresponding to the target object image.

The first candidate determining module is configured to determine the video image frames whose acquisition time falls within a preset time range around the first acquisition time as candidate image frames.

Optionally, in this embodiment of the present application, the positioning frame screening sub-module is specifically configured to: determining candidate image frames which are screened out from the plurality of candidate image frames and meet preset conditions as positioning image frames, wherein the preset conditions comprise: the similarity between the candidate image frame and the target object image is the maximum, or the similarity between the candidate image frame and the target object image is larger than a preset similarity threshold and appears earliest.

Optionally, in this embodiment of the present application, the positioning frame screening sub-module is specifically configured to: screen out, from the plurality of candidate image frames, a plurality of similar image frames whose target object type is the same as that of the target object image, screen out from the similar image frames the similar image frame with the maximum similarity to the target object image, and determine that similar image frame as the positioning image frame.

Optionally, in this embodiment of the present application, the index information obtaining module includes:

The query image acquisition module is configured to acquire the image to be queried.

The feature information extraction module is configured to extract the feature information of the image to be queried, where the feature information includes: image features or human face features.

The target image screening module is configured to screen out, from the target image library according to the feature information of the image to be queried, the target object image with the maximum similarity to the image to be queried.
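The retrieval performed by these three modules can be illustrated with the sketch below. The application does not specify a similarity measure; cosine similarity between feature vectors is assumed here purely for illustration, the feature extraction step is omitted, and `cosine_similarity` and `most_similar` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_features, library):
    """Return the id of the library image whose features are most similar to the query.

    library: non-empty dict mapping image_id -> feature vector (assumed precomputed).
    """
    return max(
        library,
        key=lambda image_id: cosine_similarity(query_features, library[image_id]),
    )


# Hypothetical three-dimensional feature vectors for three stored target object images.
library = {
    "img-a": [1.0, 0.0, 0.0],
    "img-b": [0.6, 0.8, 0.0],
    "img-c": [0.0, 0.0, 1.0],
}
target_id = most_similar([0.5, 0.9, 0.0], library)
```

The index information of the selected target object image would then be looked up in the target image library by `target_id`.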

Optionally, in this embodiment of the present application, the index information obtaining module may include:

The target image obtaining module is configured to obtain, in response to an image selection operation on the target image library, the target object image selected by the image selection operation.

Optionally, in this embodiment of the present application, the video playing apparatus further includes:

The mapping relationship correction module is configured to acquire the mapping relationship between the target object image and the index information of the target object image, correct the mapping relationship to obtain a corrected mapping relationship, and then store the corrected mapping relationship into the target image library; or to acquire the mapping relationship between the target object image and the index information of the target object image, store the mapping relationship into the target image library, and then correct the mapping relationship.

Optionally, in this embodiment of the present application, the mapping relationship correcting module is specifically configured to:

Predicting, by using the target detection model, a candidate region where the target object is located in a video image frame in the video stream data and the prediction probability that the target object exists in the candidate region.

If the prediction probability is greater than a preset probability threshold, intercepting the target object image from the video image frame according to the candidate region, and storing the mapping relationship between the target object image and the index information of the target object image into the target image library.

Optionally, in this embodiment of the present application, the mapping relationship correcting module may be specifically configured to:

Screening out the corrected positioning frame from the video stream data according to the index information corresponding to the target object image.

Replacing the mapping relationship in the target image library with the mapping relationship between the corrected positioning frame and the index information of the corrected positioning frame.

It should be understood that the apparatus corresponds to the above-mentioned video playing method embodiment, and can perform the steps related to the above-mentioned method embodiment, and the specific functions of the apparatus can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy. The device includes at least one software function that can be stored in memory in the form of software or firmware (firmware) or solidified in the Operating System (OS) of the device.

Please refer to fig. 5, which illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, the memory 420 storing machine-readable instructions executable by the processor 410, the machine-readable instructions when executed by the processor 410 performing the method as above.

The embodiment of the present application also provides a storage medium 430, where the storage medium 430 stores a computer program, and the computer program is executed by the processor 410 to perform the method as above.

The storage medium 430 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims

1. A video playback method, comprising:

screening a target object image from a target image library, and acquiring index information corresponding to the target object image, wherein the target image library stores a mapping relation between the target object image intercepted from a video image frame in video stream data and the index information of the target object image;

screening a positioning image frame from the video stream data according to the index information corresponding to the target object image;

playing the positioned image frames from the video stream data according to the positioned image frames;

wherein, the screening out positioning image frames from the video stream data according to the index information corresponding to the target object image comprises: screening a plurality of candidate image frames from the video stream data according to the index information of the target object image; and screening out positioning image frames from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image.

2. The method of claim 1, wherein the index information is an acquisition time; the screening out a plurality of candidate image frames from the video stream data according to the index information of the target object image comprises the following steps:

acquiring a first acquisition moment of an image frame corresponding to the target object image;

and determining the video image frame with the acquisition time in a preset time range near the first acquisition time as the candidate image frame.

3. The method of claim 1, wherein the screening out positioning image frames from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image comprises:

determining candidate image frames which are screened out from the plurality of candidate image frames and meet preset conditions to be the positioning image frames, wherein the preset conditions comprise: the similarity between the candidate image frame and the target object image is the maximum, or the similarity between the candidate image frame and the target object image is greater than a preset similarity threshold and occurs earliest.

4. The method of claim 1, wherein the screening out the target object image from the target image library comprises:

acquiring an image to be queried;

extracting feature information of the image to be queried, wherein the feature information comprises: image features or face features;

and screening out the target object image with the maximum similarity with the image to be queried from the target image library according to the feature information of the image to be queried.

5. The method of claim 1, wherein the screening out the target object image from the target image library comprises:

in response to an image selection operation for the target image library, obtaining the target object image selected by the image selection operation.

6. The method according to any one of claims 1-5, further comprising, prior to said screening out the target object image from the target image library:

acquiring a mapping relation between the target object image and the index information of the target object image, correcting the mapping relation to obtain a corrected mapping relation, and then storing the corrected mapping relation to the target image library;

or,

after storing the mapping relationship into the target image library, correcting the mapping relationship in the target image library, including: screening a correction positioning frame from the video stream data according to index information corresponding to a target object image, and replacing a mapping relation in the target image library with a mapping relation between the correction positioning frame and the index information of the correction positioning frame; wherein, the screening of the corrected positioning frame from the video stream data according to the index information corresponding to the target object image comprises: screening a plurality of candidate image frames from the video stream data according to the index information of the target object image; and screening out a correction positioning frame from the plurality of candidate image frames according to the similarity between each candidate image frame in the plurality of candidate image frames and the target object image.

7. A video playback apparatus, comprising:

the index information acquisition module is used for screening a target object image from a target image library and acquiring index information corresponding to the target object image, wherein the target image library stores a mapping relation between the target object image intercepted from a video image frame in video stream data and the index information of the target object image;

the positioning image screening module is used for screening positioning image frames from the video stream data according to the index information corresponding to the target object image;

the video positioning playing module is used for playing the positioned image frames from the video stream data according to the positioned image frames;

8. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 6.

9. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the method according to any one of claims 1 to 6.