CN109389088B - Video recognition method, device, machine equipment and computer readable storage medium


Info

Publication number
CN109389088B
Authority
CN
China
Prior art keywords
video
information
client
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811191485.1A
Other languages
Chinese (zh)
Other versions
CN109389088A (en)
Inventor
陆康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811191485.1A
Publication of CN109389088A
Application granted
Publication of CN109389088B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video identification method, a video identification device, and machine equipment. The method comprises the following steps: while a video playing client plays a video, a video identification client receives a user instruction for identifying the played video; in response to the user instruction, at least one frame of video image is acquired from the played video; features are extracted from the video image to obtain feature information; and retrieval is performed according to the feature information to obtain source video information of the played video, the source video information describing the origin of the played video. For the user, the source video of whatever video is being watched can thus be searched in real time, achieving efficient and fast source video identification for video playing wherever it occurs. Because only a video clip needs to be acquired from the played video, the method applies to source video search in every scenario in which a user watches video; it therefore has strong universality, enhances the interactivity of video services, and realizes recognition of what the user sees as it is seen.

Description

Video recognition method, device, machine equipment and computer readable storage medium
Technical Field
The present invention relates to the field of internet application technologies, and in particular, to a video recognition method, apparatus, machine device, and computer-readable storage medium.
Background
With the development of video application technology in the internet, more and more video services play videos in many scenarios, and correspondingly, users watch those videos wherever they happen to be. Examples include a video played in an application running on a terminal device held by the user, or a video played by a display device in the physical scene where the user is present.
A user can thus see videos presented by all kinds of video services in many scenarios. Whether in a running application or in a real-world setting, a user who stops somewhere will inevitably end up watching some video clip. For such a clip, existing video identification cannot reveal the source; that is, it cannot be learned which complete video the clip comes from, nor any other related source video information.
Existing implementations of video identification are usually limited to identifying video content. For example, a frame of video content is identified to obtain a corresponding content label, but nothing more: it remains unknown which complete video the currently played video comes from, and the source video information stays unknown. For a video they have seen, users can only post a question on the internet, via a text description or a screenshot, and gather replies from other internet users to obtain source video information such as the video's provenance. That process cannot predict whether anyone will reply, cannot assess the accuracy of replies, and is uncontrollable throughout.
Therefore, a video identification implementation that works for video playing anywhere is highly desirable, to resolve the dilemma that the source video of the currently watched video cannot be identified efficiently and quickly.
Disclosure of Invention
In order to solve the technical problem that the provenance of a played video cannot be identified in the related art, the invention provides a video identification method, a video identification device, machine equipment and a computer readable storage medium, which can realize efficient and rapid source video identification for video playing.
A method of video recognition, the method comprising:
in the process of playing a video by a video playing client, a video identification client receives a user instruction for identifying the video;
responding to the user instruction, and acquiring at least one frame of video image in the played video by the video identification client;
extracting the characteristics of the video image to obtain characteristic information;
and retrieving according to the characteristic information to obtain source video information of the played video, wherein the source video information is used for describing the origin of the played video.
A method of video recognition, the method comprising:
the method comprises the steps that a server obtains characteristic information of at least one frame of video image in a video according to a user instruction for identifying the video by a video identification client, wherein the user instruction is generated by triggering the video in video playing;
in the inverted index data with the characteristic information as an index item, retrieving the characteristic information to obtain source video information;
and the server feeds back the source video information to the video identification client, so that the video identification client obtains the video source.
A video recognition device, the device comprising:
the instruction receiving module is used for receiving a user instruction for identifying the video in the process of playing the video by the video playing client;
the image acquisition module is used for responding to the user instruction and acquiring at least one frame of video image in the played video;
the extraction module is used for extracting the characteristics of the video image to obtain characteristic information;
and the retrieval module is used for retrieving according to the characteristic information to obtain source video information of the played video, wherein the source video information is used for describing the origin of the played video.
A video identification device, the device comprising:
the characteristic acquisition module is used for acquiring characteristic information of at least one frame of video image in a video according to a user instruction for identifying the video by a video identification client, wherein the user instruction is generated by triggering the video during video playing performed by a video playing client;
the data retrieval module is used for retrieving the characteristic information in the inverted index data taking the characteristic information as an index item to obtain source video information;
and the feedback module is used for feeding back the source video information to the video identification client, so that the video identification client obtains the provenance of the played video.
A machine device, comprising:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method as described above.
A computer-readable storage medium having stored thereon a computer program executable by a processor for performing the video recognition method as set forth above.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
for the video being played, when the user chooses to identify which complete video the content played by the video playing client comes from, the video identification client implemented by the exemplary embodiments of the present invention receives the user instruction for identifying the video. At least one frame of video image is then acquired from the played video in response to the user instruction, feature extraction is performed on the acquired video image to obtain feature information, and finally the source video information of the played video is obtained by retrieval according to the feature information. The source video information describes the origin of the played video, that is, it indicates the complete video from which the played video comes. For the user, a search for the source video can therefore be carried out instantly for whatever video is being watched, achieving efficient and fast source video identification wherever video playing is visible. Because only at least one frame of video image needs to be acquired from the played video, the approach is applicable to source video search in every scenario in which a user watches video; it has strong universality, enhances the interactivity of video services, and realizes recognition of what the user sees as it is seen.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention;
FIG. 2 is a block diagram illustrating an apparatus according to an example embodiment;
FIG. 3 is a flow diagram illustrating a method of video recognition in accordance with an exemplary embodiment;
FIG. 4 is a flowchart illustrating a description of step 350 according to an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a video recognition method in accordance with an exemplary embodiment;
FIG. 6 is a flowchart illustrating steps for pre-processing video identification by a server for a user terminal and constructing inverted index data with feature information as an index entry according to an exemplary embodiment;
FIG. 7 is a diagram illustrating a first level of video recognition implementation for a caption feature according to an example embodiment;
FIG. 8 is a schematic diagram of an application for implementing video recognition using first-level and second-level features according to the corresponding embodiment of FIG. 7;
FIG. 9 is a block diagram illustrating a video recognition device in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating a video recognition device configured with a server in accordance with an exemplary embodiment;
FIG. 11 is a block diagram illustrating a description of a pre-processing module in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention. In an exemplary embodiment, the implementation environment includes a video recognition client 110, a video playback client 130, and a server 150 that provides video recognition services.
The video recognition client 110 may be operated in a terminal device, such as a smart phone, held by a user. The video recognition client 110 and the video playing client 130 may be integrated or may be separately disposed.
In an exemplary embodiment, the video recognition client 110 and the video playing client 130 integrated together may run on a terminal device, and the two cooperate with each other to implement an application on the terminal device that can perform video playing and recognize the played video.
In another exemplary embodiment, the video recognition client 110 and the video playing client 130 are separate from each other and are not configured in the same application. The video identification client 110 and the video playing client 130 which are separately arranged may be respectively arranged on the same terminal device, or may be arranged on different terminal devices.
The user initiates video identification on the video played by the video playing client 130 through the video identification client 110 running on the terminal device, and the video identification client 110 acquires at least one frame of video image from the video played by the video playing client 130, and acquires the source video information from the server 150 based on the video image.
Therefore, for the user, the provenance of what is being watched is obtained in real time, the interactivity between the user and the played video is enhanced, and a search tool is provided for video playing encountered anywhere.
FIG. 2 is a block diagram illustrating an apparatus according to an example embodiment. For example, the apparatus 200 may be a terminal device in the implementation environment shown in FIG. 1. For example, the terminal device is a terminal device held by a user, such as a smartphone or a tablet computer, or various cameras.
Referring to fig. 2, the apparatus 200 includes at least the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations, among others. The processing components 202 include at least one or more processors 218 to execute instructions to perform all or a portion of the steps of the methods described below. Further, the processing component 202 includes at least one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the apparatus 200. The memory 204 is implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. Also stored in the memory 204 are one or more modules configured to be executed by the one or more processors 218 to perform all or a portion of the steps of any of the methods illustrated in figs. 3, 4, 5, 6, 7, 8, and 9, described below.
The power supply component 206 provides power to the various components of the device 200. The power components 206 include at least a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 200.
The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. The screen further includes an Organic Light Emitting Display (OLED for short).
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 may include a Microphone (MIC) configured to receive external audio signals when the device 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The sensor component 214 includes one or more sensors for providing status assessments of various aspects of the apparatus 200. For example, the sensor component 214 detects the open/closed status of the apparatus 200 and the relative positioning of components; it also detects a change in position of the apparatus 200 or of one of its components, and a change in temperature of the apparatus 200. In some embodiments, the sensor component 214 also includes a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The apparatus 200 accesses a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth technology, and other technologies.
In an exemplary embodiment, the apparatus 200 is implemented by one or more Application Specific Integrated Circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors or other electronic components for performing the methods described below.
Fig. 3 is a flow chart illustrating a video recognition method according to an example embodiment. In an exemplary embodiment, the video recognition method, as shown in fig. 3, includes at least the following steps.
In step 310, during the process of playing the video by the video playing client, the video identifying client receives a user instruction for identifying the video.
In the process of playing a video by a video playing client, a user may choose to perform video identification on it, so as to obtain information related to the source video, such as the provenance of the currently watched video. It should be understood that, for a user, what is watched may be only a small part of a complete video, a short video intercepted from the complete video, or even a single frame of video image; it is therefore difficult to learn information related to the video from the watched content alone.
It should be noted here that, for a video played by a video playing client, the source video is the corresponding complete video. In one case, the played video is itself the complete video, i.e., the source video; in another case, for example that of an intercepted short video, the played video is only a short segment cut from the complete video it comes from. This is not limited herein: the played video may be part or all of the source video.
For video playing performed by the video playing client, the video identification client running on the terminal device triggers video identification in response to user operation; correspondingly, the video identification client senses the user's selection and receives a user instruction for identifying the played video.
The device playing the video may be a device other than the terminal device running the video identification client, for example a large screen in front of which the user stays; of course, it may also be the terminal device itself running the video identification client, so that identification is initiated on video played by that same terminal device. This is not limited herein.
The user instruction for identifying the played video is obtained along with the user operation. In one exemplary embodiment, step 310 includes: in the process of playing the video by the video playing client, the video identification client initiates video identification on the video played by the video playing client through triggering of user operation and control, and generates a user instruction for identifying the video. In an exemplary embodiment, the video playing client and the video recognition client are disposed in the same terminal device or different terminal devices.
As the user initiates identification of a video played by the video playing client, whether in the terminal device or in another device, the video identification client generates a user instruction for identifying the video, and the video is captured under the control of that instruction. For the user, video identification can therefore be achieved through video capture performed at will at the video identification client. This is simple and efficient: it no longer depends on posing questions to other internet users, so the dilemma that the time to obtain an answer that way is unguaranteed is avoided; and because the video identification client initiates video capture at will to perform identification, processing takes little time and feedback is fast.
The user instruction for identifying the played video directs the acquisition of at least one frame of video image during video playing. There are many ways to acquire at least one frame of video image, and different video playing situations suit different acquisition methods, so the acquisition directed by the user instruction varies accordingly.
In one specific implementation of the exemplary embodiment, the video playing client performs video playing, for example, the aforementioned large screen performs video playing through the video playing client.
In step 330, in response to a user instruction, the video recognition client acquires at least one frame of video image from the played video.
The user instruction by which the video identification client identifies the video played by the video playing client directs the acquisition of at least one frame of video image during video playing, that is, acquiring at least one frame of video image from the played video. The number of frames is not limited; correspondingly, the acquired at least one frame of video image may take the form of a single video image or of a video clip formed of multiple frames. The length of a video clip is likewise not limited, as long as the clip comes from the video to be identified. Of course, the longer the acquired clip, the more it helps identify the video accurately.
In response to the user instruction, the acquisition of at least one frame of video image may, in an exemplary embodiment, be carried out by shooting the played video, for example shooting the video played on a large screen, to obtain a single video image or a video clip for identifying the video played on that screen. In another exemplary embodiment, the acquisition may be implemented by video recording: for example, for a video played by the terminal device itself running the video playing client, the video can be recorded to obtain a clip for identifying it, or a single frame can be captured as a screenshot.
Therefore, different at least one frame of video image acquisition process is initiated under the control of a user instruction according to different conditions of video playing so as to adapt to video identification in different video playing scenes. Here, for a user instruction, the at least one frame of video image acquisition instructed by the user instruction is related to a user manipulation that triggers generation of the user instruction. For example, when the user raises his hand to identify a video played on the large screen, the user triggers the video to be shot and identified, and correspondingly, the generated user instruction also controls the video clip to be shot.
For another example, for video playing performed by the terminal device itself, if video identification is required, the video is triggered to be recorded and identified under the control of the user; in response, the generated user command will also control the recording of the video segment.
By responding to the user instruction during video playing, at least one frame of video image, possibly including a video clip, is obtained, so that the user can initiate identification of the played video at any time; this guarantees that the played video can be identified and provides an entry point for obtaining information related to the currently played video.
The acquired at least one frame of video image corresponds to the currently played video; that is, it is obtained from the video being played. It may exist as a single video image or as a video clip, i.e., a video image sequence composed of multiple video images.
The acquired at least one frame of video image serves as the basis of video identification and is the form in which the input data of the identification exists; it is upon this basis that the required data processing is carried out.
In one exemplary embodiment, step 330 includes: and controlling the video identification client to shoot the image of the played video according to the user instruction to obtain at least one frame of video image corresponding to the video.
In the exemplary embodiment, at least one frame of video image of a video watched by a user is obtained by calling a camera by a video identification client to shoot, so as to identify the video for the user.
In this process, as the video identification client is triggered to shoot, the played video is shot for a certain length of time under the user's control, and the resulting single video image or video clip corresponds to the played video, i.e., is obtained from it. The length of time may be a specified duration, or any duration no shorter than a specified minimum, and is not limited herein.
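Purely as an illustrative sketch (the patent prescribes no implementation; OpenCV, the default camera index, and the clip length here are all assumptions), the capture step on the video identification client could look like the following Python fragment:

```python
import cv2  # OpenCV, assumed available on the client device

def capture_frames(duration_s: float = 2.0, fps_hint: int = 10) -> list:
    """Capture at least one frame of the played video with the device camera.

    A single frame or a short clip both satisfy the method; the clip
    length used here is illustrative only.
    """
    cap = cv2.VideoCapture(0)  # default camera; the index is device-specific
    if not cap.isOpened():
        raise RuntimeError("camera unavailable")
    frames = []
    for _ in range(max(1, int(duration_s * fps_hint))):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # BGR image as a numpy array
    cap.release()
    return frames
```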
In step 350, the features of the video image are extracted to obtain feature information.
After at least one frame of video image is obtained from the played video through the steps, feature extraction can be performed on the obtained at least one frame of video image, so that the obtained at least one frame of video image is represented through the extracted feature information.
The extracted feature information includes various types of features, such as subtitle features and/or image features. Whatever its type, a feature matches the acquired at least one frame of video image. If the acquired video images exist in the form of a video clip, the clip consists of multiple frames of video images, so it corresponds to an image sequence formed by arranging those frames in order.
Feature information obtained by feature extraction on a video clip describes the image sequence corresponding to the clip, characterizing the video so as to distinguish it from other videos and represent it accurately. Feature information extracted from a single video image likewise describes the content of that image, and so characterizes the video in the identification currently performed by the video identification client. Whatever the type of feature, the feature information describes the acquired at least one frame of video image along multiple dimensions, and thereby distinguishes videos from one another.
In an exemplary embodiment, the feature extraction may extract one specified feature or several kinds of features according to the configuration of the video recognition being performed, so as to obtain feature information containing those features; in any case, every feature matches the acquired at least one frame of video image.
In another exemplary embodiment, the feature extraction may first extract the kind of feature that is least time-consuming or least complex, and extract other kinds of features only when that extraction fails, so as to guarantee the processing efficiency and response speed of the video recognition.
For example, consider subtitle features and image features. If subtitles can be recognized in the acquired video images, subtitle features are extracted first to obtain the feature information; if no subtitles can be recognized, image feature extraction is performed on the video images instead. Image feature extraction is more complex, but it guarantees that the video can still be recognized; this approach thus balances processing speed against reliable recognition.
In step 370, a search is performed according to the feature information to obtain source video information of the played video, where the source video information is used to describe the origin of the played video.
The feature information obtained from at least one frame of video image describes those images in specified dimensions, so that the images can be characterized by the feature information and retrieval of the played video can be performed on behalf of the characterized images.
Note that retrieval based on the feature information is performed in inverted index data constructed with features as index items; in essence, it is an index lookup of the feature information. The inverted index data is constructed over massive numbers of videos, with features as index items and video information as index values.
In the retrieval according to the feature information, the video information indexed by the matched features is the source video information of the currently played video.
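As a minimal sketch of that lookup, assuming the inverted index is a simple in-memory mapping from feature keys to video information records with a hypothetical video_id field (a deployment would query the server's stored inverted index data instead):

```python
from collections import defaultdict
from typing import Dict, List

# Inverted index: feature (index item) -> video information records (index values).
inverted_index: Dict[str, List[dict]] = defaultdict(list)

def lookup(features: List[str]) -> List[dict]:
    """Rank candidate videos by how many query features index them.

    The top-ranked record is taken as the source video information
    of the played video.
    """
    votes: Dict[str, int] = {}
    candidates: Dict[str, dict] = {}
    for feat in features:
        for info in inverted_index.get(feat, []):
            vid = info["video_id"]
            votes[vid] = votes.get(vid, 0) + 1
            candidates[vid] = info
    return sorted(candidates.values(),
                  key=lambda info: votes[info["video_id"]],
                  reverse=True)
```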
In an exemplary embodiment, the video information, and hence the source video information identified for the played video, records information related to the complete video, such as the video name and attributes, which are not limited herein. For a video the user has watched, video identification thus yields the complete video it comes from; there is no longer any need to seek the provenance by posing questions to other internet users via a text description or a screenshot, and the whole process is controllable, convenient, and fast.
In one exemplary embodiment, step 370 includes: and the video identification client transmits the extracted characteristic information to the server, and searches the characteristic information through the server to obtain the source video information of the played video.
For video identification initiated by the user at any time, the video identification client running on the terminal device accesses the server and obtains the source video information the server feeds back, so that the provenance of the video played by the video playing client can be obtained conveniently and quickly.
The server constructs and stores inverted index data for the massive videos that already exist; the constructed inverted index data takes features as index items, serves retrieval by feature information for the video identification client, and is continuously updated as videos are added. In the server's construction and updating of the inverted index data, each video existing in the internet exists as a complete video relative to the identification being performed, so the server constructs inverted index data for each video to facilitate feature information retrieval during video identification.
Therefore, the server continuously updates the inverted index data along with the increase of the videos in the internet.
The video identification client on the terminal device acquires at least one frame of video image by capturing a video clip or a single video image and performs feature extraction on that basis to obtain feature information; at this point, the video identification client can initiate server access, and the inverted index data stored in the server is searched to obtain the source video information.
Of course, it should be understood that the video identification client is not limited to retrieval in the server; it may also retrieve the feature information locally by storing inverted index data locally. Naturally, only the inverted index data corresponding to common or popular videos may be stored locally, so as to bound the load on the video identification client and improve processing efficiency; this is not limited herein.
In an exemplary embodiment, the feature information includes a caption feature, and step 350 includes: and extracting the caption features of at least one frame of video image to obtain the caption features in at least one frame of video image.
The caption feature refers to the subtitles of a video image. Caption features describe the acquired at least one frame of video image in the subtitle dimension, so the acquired images can be accurately characterized by caption features, and the features can be extracted quickly.
In an exemplary embodiment, many videos, such as those from TV dramas and movies, carry subtitles in their video segments, so caption features can be extracted from them. Caption feature extraction can be implemented by text recognition on each frame of the acquired video images.
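A hedged sketch of such per-frame text recognition follows; pytesseract as the OCR engine, the chi_sim language pack, and the bottom-of-frame crop are assumptions, not part of the disclosure:

```python
import cv2
import pytesseract  # assumed OCR backend; requires the chi_sim traineddata

def extract_subtitle_features(frames: list) -> list:
    """Recognize subtitle text frame by frame; frames without text are skipped.

    Cropping to the bottom quarter of each frame is a heuristic:
    subtitles in TV dramas and movies usually sit near the lower edge.
    """
    subtitles = []
    for frame in frames:
        h = frame.shape[0]
        band = frame[int(h * 0.75):, :]  # bottom quarter of the image
        gray = cv2.cvtColor(band, cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
        if text:
            subtitles.append(text)
    return subtitles
```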
In another exemplary embodiment, the feature information includes image features, and step 350 includes: and performing image feature extraction on at least one frame of video image to obtain an image feature sequence in at least one frame of video image.
Both caption features and image features describe the video content. However, not all videos have subtitles: a large number of videos that are not TV dramas or movies carry no subtitles, so caption feature processing alone cannot yield the required feature information in every case, and image features need to be introduced.
Through image feature extraction, the acquired video images as a whole are abstracted into an image feature sequence, on the basis of which the complete video is searched for the video the user saw. Further, the image feature sequence is obtained by extracting a feature from the content of each frame and arranging the extracted features in order.
In an exemplary embodiment, the image features serve as content features, and the image feature extraction performed can describe the video content in multiple respects; the image features may therefore include content features covering the color, texture, shape, and motion of the video content, and even spatial relations between objects in the content and semantic features such as scenes, behaviors, and emotions, none of which is limited herein.
In addition, feature extraction for a video clip can be performed on key frames; since key frames reflect the main content of a section of video, extracting features from key frames is appropriate when the video is complex or the clip is long.
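One possible sketch of such an image feature sequence, using a normalized HSV color histogram as the per-frame content feature (color being one of the content features named above); the histogram choice and the fixed frame-sampling step, a crude stand-in for key-frame selection, are assumptions:

```python
import cv2
import numpy as np

def frame_feature(frame) -> np.ndarray:
    """One illustrative content feature: a normalized HSV color histogram."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [16, 8], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def image_feature_sequence(frames: list, step: int = 5) -> np.ndarray:
    """Abstract a clip into a sequence of feature vectors, one per sampled frame."""
    sampled = frames[::step] or frames[:1]  # keep at least one frame of a clip
    return np.stack([frame_feature(f) for f in sampled])
```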
In another exemplary embodiment, the foregoing subtitle feature extraction and image feature extraction are performed by the following steps further included in step 350.
Fig. 4 is a flowchart illustrating a description of step 350, according to an example embodiment. In one exemplary embodiment, as shown in FIG. 4, step 350 includes:
in step 351, control jumps to step 353 or step 355 according to whether the video identification client can recognize subtitles in the video image: when subtitles can be recognized, control jumps to step 353; otherwise, control jumps to step 355.
In step 353, extracting the caption feature of at least one frame of video image to obtain the caption feature of at least one frame of video image.
In step 355, image feature extraction is performed on at least one frame of video image to obtain an image feature sequence of at least one frame of video image.
If the caption can be identified from the obtained at least one frame of video image, executing step 353, wherein the executed feature extraction is caption feature extraction, and the obtained feature information is caption features; if the subtitles cannot be recognized, step 355 is performed, and the feature extraction performed is image feature extraction, so that the obtained feature information is the image feature.
Thus, under the control of step 351, caption features are extracted first, and retrieval by the feature information preferentially uses caption features. Caption features have obvious advantages, both for storing the inverted index data and for transmitting feature information from the terminal device to the server: the amount of data transferred between front end and back end is small, storage pressure is low, and index retrieval takes very little time.
However, not all videos have subtitles, and therefore, the subtitle feature alone cannot fully address the need for video recognition, and other features need to be introduced.
Thus, feature extraction is performed under the control of step 351, thereby flexibly adapting to the recognition of various videos and enhancing the adaptability.
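Reusing the helpers sketched above, the branching of steps 351, 353 and 355 can be illustrated as follows; the tagged return value is an assumption for clarity, not a claimed interface:

```python
def extract_feature_info(frames: list) -> dict:
    """Step 351: branch on whether subtitles can be recognized.

    Returns a tagged payload so the retrieval side knows which index
    (subtitle or image-feature) to query.
    """
    subtitles = extract_subtitle_features(frames)  # step 353, first level
    if subtitles:
        return {"type": "subtitle", "features": subtitles}
    sequence = image_feature_sequence(frames)      # step 355, second level
    return {"type": "image", "features": sequence}
```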
Through the exemplary embodiments described above, an application for recognizing whatever video the user can see is implemented: simply by capturing a video clip in the application, the origin of a visible video, i.e., which complete video it comes from, can be recognized, so the user can conveniently and quickly find out from which source video the currently watched video comes.
The following is the implementation of the method of the present invention in a server, i.e. a video identification method applied to the server in the implementation environment shown in fig. 1. Fig. 5 is a flow chart illustrating a video recognition method according to an example embodiment. In an exemplary embodiment, as shown in fig. 5, the video recognition method includes at least the following steps.
In step 410, the server obtains feature information of at least one frame of video image in the video according to a user instruction for identifying the video by the video identification client, where the user instruction is generated by triggering the video during video playing performed by the video playing client.
The server is used as the back end of the video identification and used for realizing retrieval for the video identification client at the front end and further feeding back the source video information of the requested identification video to the video identification client. The server obtains the characteristic information corresponding to the video identification of the video identification client along with the access of the video identification client.
The feature information corresponds to the at least one frame of video image acquired by the video identification client requesting identification, and it is a description of the content of those images; hence, after the server obtains the feature information according to the user instruction for identifying the video, the retrieval performed is a content retrieval of the at least one frame of video image, i.e., content-based video retrieval.
The server faces a massive number of video identification clients. After any video identification client initiates video identification for video playing in response to user operation, the generated user instruction is responded to so as to obtain the feature information, and the feature information is transmitted to the server to request video retrieval from it.
In step 430, in the inverted index data with the feature information as the index item, the feature information is retrieved to obtain the source video information.
The server stores the inverted index data so as to provide the video retrieval service for video identification clients. The inverted index data is constructed by the server over the available videos; each piece of inverted index data corresponds to a video, and the constructed index maps to that video's information. To support the video identification performed by the video identification client, the inverted index data is constructed with features as index items and video information as index values, the features being matched against the feature information extracted by the video identification client.
The video information retrieved by the server for the video identification client is the source video information fed back to the video identification client.
In step 450, the server feeds back the source video information to the video identification client, so that the video identification client obtains the provenance of the played video.
In this way, video retrieval on the server side is achieved, providing support for the video identification application running on the user side.
In another exemplary embodiment, the video recognition method further includes: the server carries out video-related preprocessing for video identification of the video identification client, and constructs inverted index data with the characteristic information as an index item for the video.
The server needs to implement video retrieval for the video identification client, and therefore needs to preprocess the videos existing in the internet in order to construct the inverted index data. Specifically, corresponding inverted index data is constructed for each video available from the internet. For each video, feature extraction is performed in a manner consistent with the video identification performed by the video identification client, so as to obtain the feature information of the video; it should be understood that this feature information is a feature of the video and characterizes it in some dimension.
In this way, the feature information and the video information of each video are obtained, and the inverted index data corresponding to the video is then constructed with the feature information as the index item and the video information as the index value.
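A sketch of this preprocessing pass, reusing the inverted_index and extract_feature_info sketches above. The quantize helper is hypothetical, and for brevity subtitle and image-feature entries share one map here, whereas the embodiment of fig. 8 keeps subtitle inverted index data and image feature sequence index data separate:

```python
import numpy as np

def quantize(vec: np.ndarray, bins: int = 8) -> str:
    """Bucket a real-valued feature vector into a coarse string key so it
    can serve as an inverted-index item (a stand-in for true vector indexing)."""
    return np.floor(vec * bins).astype(int).tobytes().hex()

def build_inverted_index(videos) -> None:
    """Server-side preprocessing: index every crawled video.

    `videos` is assumed to be an iterable of (video_info, frames) pairs,
    where video_info is a dict carrying video_id, name, attributes, etc.
    """
    for info, frames in videos:
        payload = extract_feature_info(frames)  # same first/second-level split
        if payload["type"] == "subtitle":
            for line in payload["features"]:
                inverted_index[line].append(info)            # subtitle index
        else:
            for vec in payload["features"]:
                inverted_index[quantize(vec)].append(info)   # image-feature index
```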
For a video clip captured by a video identification client requesting identification, its feature information is matched against the feature information corresponding to the indexed videos; the index value corresponding to the matched feature information is the source video information of the identified video.
Fig. 6 is a flowchart illustrating video-related preprocessing for video recognition by a server for a video recognition client, and describing steps of constructing inverted index data with feature information as an index item for the video according to an exemplary embodiment. In an exemplary embodiment, as shown in fig. 6, the server performs video-related preprocessing for video identification of the video identification client, and the step of constructing the inverted index data with the feature information as an index item for the video at least includes:
in step 501, the server crawls the video information to obtain video information corresponding to videos in the internet.
In step 503, for each video, an inverted index to its video information is constructed according to the corresponding feature information, forming inverted index data that takes feature information as index items and serves retrieval across all videos.
And the video information retrieved from the inverted index data is the source video information of the video identified by the video identification client.
The server faces to the internet, crawls video information in the internet, constructs inverted index data for videos existing in the internet, and accurately provides video information serving as source video information for identifying any video through the constructed inverted index data.
Identification of a video played on a large screen where the user stays is taken below as an example to explain the method.
When a user stays in front of a large screen on which a video is playing and watches that video, then, to learn its origin, for example which complete video it comes from, what that video is called, and other video information, the user previously could only pose a question with a text description or a video screenshot and wait for a reply.
Even today, with the internet developing rapidly, there is no guarantee that replies from other internet users will arrive quickly; neither the timing nor the result is controllable. Faced with the many available video services, the difficulty of learning the provenance of a video played in them becomes a bottleneck on the user side after playback, since the user lacks the means and channels to search for the provenance of a played video.
The method provides the user with an application running on the terminal side, on mobile terminals such as smartphones, namely the video identification client implemented by the exemplary embodiments of the invention. By running this application, the user can capture and identify any video being watched, and thereby obtain the video's provenance, that is, the source video information corresponding to it, conveniently, quickly, and accurately, with both accuracy and timeliness effectively guaranteed.
As a large-screen video plays, the user can use the video identification client running on a carried smartphone to trigger acquisition of at least one frame of video image for the content of interest, for example by shooting a video clip; features are then extracted from the captured clip, and the extracted features are submitted to the background through the application, which returns the source video information.
It should be understood that identification of the source video is realized from the at least one frame of video image, with features extracted from what was shot; these features characterize the source video so as to distinguish it from other videos.
Fig. 7 is a schematic diagram illustrating a video recognition implementation corresponding to the first-level caption feature, according to an example embodiment. As indicated above, the video recognition is carried out by a front end and a back end: the front end is an application implemented by the present invention, i.e., the user app (application) side; the back end is a server implemented by the present invention, which on one hand performs server preprocessing 710 and on the other hand serves background index queries 730 from the user app side.
For the user app side, it is only necessary to shoot a video clip through the running application and recognize a subtitle from the video clip, as described in steps 810 to 830, so that the source video information returned by the server can be obtained, and the origin of the viewed video can be known.
For the video identification, if the caption features can be obtained, only index query needs to be performed according to the caption features, and extraction of the second-level features is not needed, so that a higher processing speed is ensured.
The first-level caption features are the features used preferentially. Caption features bring many advantages for the server, for the video identification client, and for the interaction between the two: for example, less data is transmitted between front end and back end, and the background storage performed by the server is under little storage pressure and quick to retrieve, since only subtitle and video information need to be stored.
However, in the case of video without subtitles, a second level of features, i.e., image features, has to be introduced. The image features for video identification are extracted from each frame of content of the video segment, so that an image feature sequence corresponding to the image sequence in the video segment is obtained along with the extraction of the image features.
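One way such an image feature sequence could be matched against a stored sequence, sketched under the assumption of cosine similarity over a sliding window; the patent itself does not specify a matching metric:

```python
import numpy as np

def sequence_similarity(query: np.ndarray, stored: np.ndarray) -> float:
    """Best sliding-window alignment of the query sequence inside a stored one.

    Both arguments are 2-D arrays of per-frame feature vectors; each
    alignment is scored by cosine similarity averaged over the window.
    """
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-9)
    s = stored / (np.linalg.norm(stored, axis=1, keepdims=True) + 1e-9)
    n, m = len(q), len(s)
    if n > m:
        return 0.0
    best = 0.0
    for start in range(m - n + 1):
        window = s[start:start + n]
        best = max(best, float(np.mean(np.sum(q * window, axis=1))))
    return best
```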
Fig. 8 is a schematic diagram of an application of the first-level and second-level features to implement video recognition according to the corresponding embodiment in fig. 7. Based on the embodiment shown in fig. 7, for a video segment with no subtitle recognition, as shown in step 910, the image feature sequence of the video segment is extracted to be sent to the background.
At this point, the server in the background retrieves against the image feature sequence index data. Likewise, subtitle inverted index data is stored so that retrieval can also be performed for the first-level caption features.
Both the image feature sequence index data and the subtitle inverted index data are inverted index data stored by the server, constructed and continuously updated through the server's preprocessing process.
As shown in fig. 8, in the server-side preprocessing, the server crawls video information in step 1010. For the video corresponding to the crawled video information, step 1030 is executed to judge whether the video has subtitles; if it does, step 1040 is executed to build an inverted index from the caption features to the video information, yielding the subtitle inverted index data stored on the server.
If, however, the video has no subtitles, step 1050 is executed to extract the image feature sequence of the video corresponding to the video information, and step 1070 is then executed to generate an index, building an inverted index from the image feature sequence to the video information and yielding the image feature sequence index data.
Therefore, for a user, the source video from which the video currently seen comes can be conveniently and quickly inquired, and a very quick feedback speed can be obtained.
The following are apparatus embodiments of the present invention, which are used to implement the above embodiments of the video recognition method of the present invention. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the video recognition method of the present invention.
Fig. 9 is a block diagram illustrating a video recognition device according to an example embodiment. In an exemplary embodiment, as shown in fig. 9, the video recognition device is used to implement a video identification client, and the device includes but is not limited to: an instruction receiving module 1110, an image acquisition module 1130, an extraction module 1150, and a retrieval module 1170.
The instruction receiving module 1110 is configured to receive a user instruction for identifying a video during a process of playing the video by the video playing client.
The image acquisition module 1130 is configured to acquire at least one frame of video image from the played video in response to a user instruction.
An extracting module 1150 is configured to extract features of the video image to obtain feature information.
The retrieving module 1170 is configured to perform retrieval according to the feature information to obtain source video information of the played video, where the source video information is used to describe the source of the played video.
In an exemplary embodiment, the instruction receiving module 1110 is further configured to, while the video playing client plays a video, have the video identification client initiate video identification on the played video upon being triggered by a user operation, and generate the user instruction for identifying the video. The video playing client and the video identification client are disposed in the same terminal device or in different terminal devices.
In an exemplary embodiment, the image obtaining module 1130 is further configured to control the video recognition client to perform image capturing of a played video according to the user instruction, and obtain at least one frame of video image corresponding to the video.
In another exemplary embodiment, the feature information includes subtitle features, and the extraction module 1150 is further configured to perform subtitle feature extraction on the at least one frame of video image to obtain the subtitle features in the at least one frame of video image.
Further, the feature information includes image features, and the extraction module 1150 is further configured to perform image feature extraction on the at least one frame of video image to obtain an image feature sequence in the at least one frame of video image.
Further, the feature information obtained by feature extraction includes either subtitle features or image features. The extraction module 1150 is further configured to select between subtitle feature extraction and image feature extraction according to whether the video identification client can recognize subtitles in the video images: when subtitles can be recognized, it jumps to subtitle feature extraction; otherwise, it jumps to image feature extraction.
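A minimal sketch of this selection logic follows, assuming pytesseract as the OCR backend (the embodiments do not name one) and reusing frame_ahash from the earlier sketch; the returned dictionary shape is likewise an assumption.

import cv2
import pytesseract

def extract_feature_information(frames):
    # Try to recognize subtitles (OCR) first; fall back to image features.
    text = " ".join(
        pytesseract.image_to_string(cv2.cvtColor(f, cv2.COLOR_BGR2RGB))
        for f in frames
    ).strip()
    if text:
        return {"type": "subtitle", "value": text}
    return {"type": "image", "value": [frame_ahash(f) for f in frames]}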
In another exemplary embodiment, the retrieving module 1170 is further configured to transmit the extracted feature information to a server, and perform the feature information retrieval through the server to obtain source video information of the played video.
Fig. 10 is a block diagram illustrating a video recognition device configured as a server according to an exemplary embodiment. In an exemplary embodiment, as shown in fig. 10, the video recognition device includes but is not limited to: a feature acquisition module 1210, a data retrieval module 1230, and a feedback module 1250.
The feature acquisition module 1210 is configured to obtain feature information of at least one frame of video image in a video according to a user instruction for identifying the video from the video identification client, where the user instruction is generated by a trigger while the video playing client plays the video;
the data retrieval module 1230 is configured to retrieve the feature information from the inverted index data using the feature information as an index item to obtain source video information;
and the feedback module 1250 is configured to feed back the source video information to the video identification client, so that the video identification client obtains the source of the played video.
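For illustration, one possible form of the server-side lookup is sketched below; scoring candidate videos by simple vote counting over matched index items is an assumption of this sketch, not a limitation of the embodiments.

from collections import Counter

def retrieve_source(feature_items, index):
    # Use every extracted feature as an index item and return the video id
    # that accumulates the most matches in the inverted index.
    votes = Counter()
    for item in feature_items:
        for video_id in index.get(item, ()):
            votes[video_id] += 1
    return votes.most_common(1)[0][0] if votes else None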
In another exemplary embodiment, the video recognition device further comprises a preprocessing module, which performs video-related preprocessing for the video identification of the video identification client and constructs, for the videos, inverted index data with feature information as index items.
FIG. 11 is a block diagram of a preprocessing module according to an exemplary embodiment. In an exemplary embodiment, the preprocessing module 1310 includes an information crawling unit 1311 and an index generation unit 1313.
The information crawling unit 1311 is configured to crawl video information to obtain video information corresponding to videos on the Internet.
The index generation unit 1313 is configured to construct, for each video, an inverted index from the corresponding feature information to the video information, forming inverted index data that takes the feature information as index items and supports retrieval across all videos.
The retrieved video information in the inverted index data is the source video information identified by the video identification client.
Optionally, the present invention further provides an electronic device, which may be used in the implementation environment shown in fig. 1 to execute all or part of the steps of the method shown in any one of fig. 3, fig. 4 and fig. 5. The device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the foregoing method.
The specific manner in which the processor of the apparatus in this embodiment performs operations has been described in detail in relation to the foregoing embodiments and will not be elaborated upon here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, which may be a transitory or a non-transitory computer-readable storage medium including instructions, for example the memory 204 containing instructions executable by the processor 218 of the device 200 to perform the methods described above.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (15)

1. A method for video recognition, the method comprising:
in the process of playing a video by a video playing client, the video identification client receives a user instruction for identifying the video, wherein the user instruction is used for instructing acquisition of at least one frame of video image in the playing process;
in response to the user instruction, the video identification client shoots or records the played video to acquire the at least one frame of video image;
extracting features from the acquired at least one frame of video image to obtain feature information, wherein the feature information is used for representing a video segment description of the at least one frame of video image in a specified dimension;
the video identification client retrieves inverted index data according to the feature information to obtain source video information of the video played by the video playing client, wherein the source video information is used for describing the origin of the played video; inverted index data is constructed for each video, with a feature that can match the feature information extracted by the video identification client serving as an index item and the video information serving as an index value, so that an index constructed in the inverted index data maps to the video information of its video.
2. The method according to claim 1, wherein, during the playing of the video by the video playing client, the video identification client receiving the user instruction for identifying the video comprises:
in the process of playing the video by the video playing client, the video identification client initiating video identification on the video played by the video playing client upon being triggered by a user operation, and generating the user instruction for identifying the video.
3. The method according to claim 1 or 2, wherein the video playing client and the video recognition client are provided in the same terminal device or different terminal devices.
4. The method of claim 1, wherein the video identification client, in response to the user instruction, shooting or recording the played video to acquire the at least one frame of video image comprises:
controlling the video identification client to capture images of the played video according to the user instruction to obtain the at least one frame of video image corresponding to the video.
5. The method according to claim 1, wherein the feature information includes subtitle features, and the extracting features from the acquired at least one frame of video image to obtain the feature information includes:
performing subtitle feature extraction on the at least one frame of video image to obtain the subtitle features in the at least one frame of video image.
6. The method according to claim 1, wherein the feature information includes image features, and the extracting features from the acquired at least one frame of video image to obtain the feature information includes:
and performing image feature extraction on the at least one frame of video image to obtain an image feature sequence in the at least one frame of video image.
7. The method according to claim 5 or 6, wherein the extracting features from the at least one frame of video image to obtain feature information further comprises:
controlling a jump to subtitle feature extraction or image feature extraction according to whether the video identification client can recognize subtitles in the video images, jumping to subtitle feature extraction when subtitles can be recognized, and otherwise jumping to image feature extraction.
8. The method according to claim 1, wherein the video identification client retrieves the inverted index data according to the feature information to obtain source video information of the video played by the video playing client, including:
and the video identification client transmits the extracted characteristic information to a server, and retrieves the characteristic information from the inverted index data through the server to obtain the source video information of the played video.
9. A method for video recognition, the method comprising:
a server obtains feature information of at least one frame of video image in a video according to a user instruction for identifying the video from a video identification client, wherein the user instruction is generated by a trigger from the video identification client during video playing performed by a video playing client and is used for instructing acquisition of the at least one frame of video image in the playing process, and the feature information is used for representing a video segment description of the at least one frame of video image in a specified dimension;
retrieving according to the feature information in inverted index data that takes features as index items to obtain source video information, wherein inverted index data is constructed for each video, with a feature that can match the feature information extracted by the video identification client serving as an index item and the video information serving as an index value, so that an index constructed in the inverted index data maps to the video information of its video;
and the server feeds back the source video information to the video identification client, so that the video identification client obtains the source of the video played by the video playing client.
10. The method of claim 9, further comprising:
and the server performs video-related preprocessing on the video identification of the video identification client, and constructs inverted index data with the characteristic information as an index item for the video.
11. The method according to claim 10, wherein the server performs video-related preprocessing for video identification of the video identification client, constructs inverted index data with feature information as an index for the video, and includes:
the server crawls video information to obtain video information respectively corresponding to videos in the Internet;
constructing an inverted index of video information for each video according to the corresponding characteristic information to form inverted index data which takes the characteristic information as an index item and is searched facing all videos;
and the retrieved video information in the inverted index data is the source video information of the video identified by the video identification client.
12. A video recognition apparatus, wherein the apparatus is configured to implement a video recognition client, and the apparatus comprises:
the instruction receiving module is used for receiving a user instruction for identifying the video in the process of playing the video by the video playing client, wherein the user instruction is used for instructing the acquisition of at least one frame of video image in the playing process;
the image acquisition module is used for, in response to the user instruction, shooting or recording the played video to acquire the at least one frame of video image;
the extraction module is used for extracting features from the acquired at least one frame of video image to obtain feature information, wherein the feature information is used for representing a video segment description of the at least one frame of video image in a specified dimension;
and the retrieval module is used for retrieving inverted index data according to the feature information to obtain source video information of the video played by the video playing client, wherein the source video information is used for describing the origin of the played video; inverted index data is constructed for each video, with a feature that can match the feature information extracted by the video identification client serving as an index item and the video information serving as an index value, so that an index constructed in the inverted index data maps to the video information of its video.
13. A video recognition apparatus, wherein the apparatus is configured to implement a server, and the apparatus comprises:
the feature acquisition module is used for acquiring feature information of at least one frame of video image in a video according to a user instruction for identifying the video from a video identification client, wherein the user instruction is generated by a trigger from the video identification client during video playing performed by a video playing client and is used for instructing acquisition of the at least one frame of video image in the playing process, and the feature information is used for representing a video segment description of the at least one frame of video image in a specified dimension;
the data retrieval module is used for retrieving according to the feature information in inverted index data that takes features as index items to obtain source video information, wherein inverted index data is constructed for each video, with a feature that can match the feature information extracted by the video identification client serving as an index item and the video information serving as an index value, so that an index constructed in the inverted index data maps to the video information of its video;
and the feedback module is used for feeding back the source video information to the video identification client, so that the video identification client obtains the source of the video played by the video playing client.
14. A machine device, comprising:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of claims 1 to 11.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program executable by a processor to perform the video recognition method according to any one of claims 1 to 11.
CN201811191485.1A 2018-10-12 2018-10-12 Video recognition method, device, machine equipment and computer readable storage medium Active CN109389088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811191485.1A CN109389088B (en) 2018-10-12 2018-10-12 Video recognition method, device, machine equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109389088A CN109389088A (en) 2019-02-26
CN109389088B true CN109389088B (en) 2022-05-24

Family

ID=65427486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811191485.1A Active CN109389088B (en) 2018-10-12 2018-10-12 Video recognition method, device, machine equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109389088B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110806988A (en) * 2019-10-25 2020-02-18 北京淳中科技股份有限公司 Display equipment type identification method and device
CN110913256A (en) * 2019-11-28 2020-03-24 维沃移动通信有限公司 Video searching method and electronic equipment
CN111274449B (en) * 2020-02-18 2023-08-29 腾讯科技(深圳)有限公司 Video playing method, device, electronic equipment and storage medium
CN113766283A (en) * 2020-06-02 2021-12-07 云米互联科技(广东)有限公司 Video synchronous playing method and device and computer readable storage medium
CN112203108A (en) * 2020-10-12 2021-01-08 广州欢网科技有限责任公司 Method and equipment for identifying and switching to live video stream according to short video stream

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942275A (en) * 2014-03-27 2014-07-23 百度在线网络技术(北京)有限公司 Video identification method and device
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
EP1357745B1 (en) * 2002-04-15 2016-03-09 Microsoft Technology Licensing, LLC Method and apparatus for processing of interlaced video images for progressive video displays
CN105898495A (en) * 2016-05-26 2016-08-24 维沃移动通信有限公司 Method for pushing mobile terminal recommended information and mobile terminal
CN106339659A (en) * 2015-07-10 2017-01-18 株式会社理光 Road segment detecting method and device
CN108024145A (en) * 2017-12-07 2018-05-11 北京百度网讯科技有限公司 Video recommendation method, device, computer equipment and storage medium
CN108446669A (en) * 2018-04-10 2018-08-24 腾讯科技(深圳)有限公司 motion recognition method, device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626567B2 (en) * 2013-03-13 2017-04-18 Visible Measures Corp. Automated video campaign building
EP3475880A4 (en) * 2016-06-23 2020-02-26 Capital One Services, LLC Systems and methods for automated object recognition
CN108268644B (en) * 2018-01-22 2023-08-18 上海哔哩哔哩科技有限公司 Video searching method, server and video searching system

Also Published As

Publication number Publication date
CN109389088A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN109389088B (en) Video recognition method, device, machine equipment and computer readable storage medium
US11272248B2 (en) Methods for identifying video segments and displaying contextually targeted content on a connected television
US10820048B2 (en) Methods for identifying video segments and displaying contextually targeted content on a connected television
RU2614137C2 (en) Method and apparatus for obtaining information
CN111083512A (en) Switching method and device of live broadcast room, electronic equipment and storage medium
US10320876B2 (en) Media production system with location-based feature
CN107818180B (en) Video association method, video display device and storage medium
CN111615003B (en) Video playing control method, device, equipment and storage medium
CN112118395B (en) Video processing method, terminal and computer readable storage medium
WO2016029561A1 (en) Display terminal-based data processing method
US11630862B2 (en) Multimedia focalization
CN109829064B (en) Media resource sharing and playing method and device, storage medium and electronic device
CN112752121B (en) Video cover generation method and device
CN108712667B (en) Smart television, screen capture application method and device thereof, and readable storage medium
CN106453528A (en) Method and device for pushing message
US20170374426A1 (en) Binary tv
CN104933071A (en) Information retrieval method and corresponding device
CN112464031A (en) Interaction method, interaction device, electronic equipment and storage medium
CN104901939B (en) Method for broadcasting multimedia file and terminal and server
CN105681455A (en) Method, device and system for acquiring images
CN111274449A (en) Video playing method and device, electronic equipment and storage medium
US20170171462A1 (en) Image Collection Method, Information Push Method and Electronic Device, and Mobile Phone
US20140003656A1 (en) System of a data transmission and electrical apparatus
CN105095213A (en) Information correlation method and device
CN112866762A (en) Processing method and device for acquiring video associated information, electronic equipment and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant