CN113254712A - Video matching method, video processing device, electronic equipment and medium - Google Patents


Info

Publication number
CN113254712A
Authority
CN
China
Prior art keywords
video
text
data
image
feature data
Prior art date
Legal status
Granted
Application number
CN202110520028.8A
Other languages
Chinese (zh)
Other versions
CN113254712B (en)
Inventor
刘俊启
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110520028.8A
Publication of CN113254712A
Application granted
Publication of CN113254712B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a video matching method, a video processing method, an apparatus, a device, a medium and a program product, relating to the fields of image processing, natural language processing and intelligent search. The video matching method comprises the following steps: receiving first feature data for a reference video; comparing the first feature data with respective second feature data of at least one candidate video to obtain a comparison result, wherein the second feature data is obtained by identifying text change information in a target display area, and the target display area comprises a partial display area of the candidate video; and determining a target video matched with the reference video from the at least one candidate video based on the comparison result, wherein the second feature data of the target video is matched with the first feature data.

Description

Video matching method, video processing device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the fields of image processing, natural language processing, and intelligent search, and more particularly, to a video matching method, a video processing method, an apparatus, an electronic device, a medium, and a program product.
Background
With the popularity of the internet, more and more users search for videos online. When searching for videos, related videos are matched based on search terms input by a user, and the matched videos are recommended to the user. However, matching videos by search terms suffers from low matching accuracy, and the matched videos often fail to meet users' needs.
Disclosure of Invention
The present disclosure provides a video matching method, a video processing method, an apparatus, an electronic device, a storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a video matching method, including: receiving first feature data for a reference video; comparing the first characteristic data with second characteristic data of at least one candidate video to obtain a comparison result, wherein the second characteristic data is obtained by identifying text change information in a target display area, and the target display area comprises a partial display area of the candidate video; and determining a target video matched with the reference video from the at least one candidate video based on the comparison result, wherein the second characteristic data of the target video is matched with the first characteristic data.
According to another aspect of the present disclosure, there is provided a video processing method including: aiming at a target display area in a reference video, identifying text change information in the target display area; extracting first feature data from the reference video in response to identifying text change information; and sending the first characteristic data.
According to another aspect of the present disclosure, there is provided a video matching apparatus including: the device comprises a receiving module, a comparing module and a first determining module. The receiving module is used for receiving first characteristic data aiming at a reference video. And the comparison module is used for comparing the first characteristic data with second characteristic data of at least one candidate video to obtain a comparison result, wherein the second characteristic data is obtained by identifying text change information in a target display area, and the target display area comprises a partial display area of the candidate video. A first determining module, configured to determine, based on the comparison result, a target video matching the reference video from the at least one candidate video, where second feature data of the target video matches the first feature data.
According to another aspect of the present disclosure, there is provided a video processing apparatus including: the device comprises a second identification module, a second extraction module and a sending module. The second identification module is used for identifying the text change information in the target display area in the reference video aiming at the target display area. And the second extraction module is used for responding to the recognition of the text change information and extracting the first characteristic data from the reference video. And the sending module is used for sending the first characteristic data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video matching method as described above.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the video matching method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the video processing method as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the video matching method as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the video processing method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically shows an application scenario of a video matching method according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a video matching method according to an embodiment of the present disclosure;
fig. 3 schematically shows a schematic diagram of a video matching method according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flow diagram of a video processing method according to an embodiment of the present disclosure;
fig. 5 schematically shows a schematic diagram of a video matching method and a video processing method according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of a video matching apparatus according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure; and
fig. 8 is a block diagram of an electronic device for implementing a video matching method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Fig. 1 schematically shows an application scenario of a video matching method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 of an embodiment of the present disclosure includes, for example, a candidate video and a reference video.
For example, a plurality of candidate videos are stored in a server and a reference video is stored in a client. When the user needs to search for the video matching with the reference video, the server may receive the reference video from the client, and match the reference video with each candidate video, so as to determine a target video matching with the reference video from the plurality of candidate videos, where the target video is the video needed by the user.
The embodiment of the present disclosure takes one candidate video 110 as an example to illustrate the matching situation between the candidate video and the reference video.
Illustratively, the reference videos 121, 122 are a subset of the candidate video 110. When the candidate video 110 is matched based on the reference videos 121, 122, at least part of the content of the candidate video 110 matches the entire content of the reference videos 121, 122.
Illustratively, part of the content of the reference videos 123, 124 is a subset of the candidate video 110. When the candidate video 110 is matched based on the reference videos 123, 124, part of the content of the candidate video 110 matches part of the content of the reference videos 123, 124.
Illustratively, the candidate video 110 is a subset of the reference videos 125, 126. When the candidate video 110 is matched based on the reference videos 125, 126, the entire content of the candidate video 110 matches part of the content of the reference videos 125, 126.
Illustratively, the candidate video 110 is disjoint from the reference videos 127, 128. When the candidate video 110 is matched based on the reference videos 127, 128, the content of the candidate video 110 does not match the content of the reference videos 127, 128.
For example, videos may be matched using an image-similarity algorithm or an algorithm that extracts and matches features. Specifically, a plurality of images may be extracted from the reference video and from each candidate video, and the images of the reference video may be matched against the images of the candidate video to determine the target video matching the reference video from the candidate videos. Alternatively, features may be extracted from each of the images of the reference video and of the candidate video, and the image features may be matched to determine the target video matching the reference video from the candidate videos. Since a video is continuous content, matching the reference video and a candidate video through many images in this way is computationally expensive.
Illustratively, 25 images (frames) are extracted from each second of content in a video, so if the duration of the video is 16 seconds, the number of images to be extracted is 16 × 25 = 400. With so many extracted images, the computation required for matching is large. Which images to extract from the video is also a concern, because inconsistent extraction rules can prevent two identical videos from being matched.
For example, suppose the duration of the reference video is 16 seconds and the duration of the candidate video is 32 seconds. In one case, 25 images are extracted from each second of the reference video, 16 × 25 = 400 images in total, and 25 images from each second of the candidate video, 32 × 25 = 800 images in total. Matching the reference video against the candidate video then requires at most (16 × 25) × (32 × 25) = 320,000 comparisons when one image of the reference video is compared with one image of the candidate video at a time; the computation required for video matching is evidently large.
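To make the scale concrete, the following minimal sketch reproduces the arithmetic above and contrasts it with per-text-change sampling; the assumed rate of one caption change every 4 seconds is an illustrative assumption, not a figure from the disclosure.

```python
# Illustrative cost comparison; 25 fps and the 16 s / 32 s durations follow the
# example above, while the "one caption change every 4 seconds" rate is assumed.
FPS = 25
REF_SECONDS, CAND_SECONDS = 16, 32

ref_frames = REF_SECONDS * FPS                  # 400 frames in the reference video
cand_frames = CAND_SECONDS * FPS                # 800 frames in the candidate video
frame_pairs = ref_frames * cand_frames          # 320,000 frame-to-frame comparisons

CAPTION_EVERY_SECONDS = 4                       # assumed caption change rate
ref_items = REF_SECONDS // CAPTION_EVERY_SECONDS    # 4 feature items
cand_items = CAND_SECONDS // CAPTION_EVERY_SECONDS  # 8 feature items
feature_pairs = ref_items * cand_items          # 32 comparisons

print(frame_pairs, feature_pairs)               # 320000 32
```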
In view of this, an embodiment of the present disclosure provides a video matching method, including: first feature data for a reference video is received. Then, the first characteristic data and second characteristic data of at least one candidate video are compared to obtain a comparison result, wherein the second characteristic data is obtained by identifying text change information in a target display area, and the target display area comprises a partial display area of the candidate video. Next, based on the comparison result, a target video matching the reference video is determined from the at least one candidate video, wherein the second feature data of the target video matches the first feature data.
An embodiment of the present disclosure further provides a video processing method, including: and aiming at the target display area in the reference video, identifying the text change information in the target display area. Then, in response to recognizing the text change information, first feature data is extracted from the reference video. Next, the first characteristic data is transmitted.
A video matching method and a video processing method according to an exemplary embodiment of the present disclosure are described below with reference to fig. 2 to 5 in conjunction with an application scenario of fig. 1.
Fig. 2 schematically shows a flow chart of a video matching method according to an embodiment of the present disclosure.
As shown in fig. 2, the video matching method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230. The method of the disclosed embodiments may be performed by a server, for example.
In operation S210, first feature data for a reference video is received.
In operation S220, the first feature data and second feature data of each of the at least one candidate video are compared to obtain a comparison result, where the second feature data is obtained by identifying text change information in a target display area, and the target display area includes a partial display area of the candidate video.
In operation S230, a target video matching the reference video is determined from the at least one candidate video based on the comparison result.
According to an embodiment of the present disclosure, the second feature data is obtained by identifying text change information in the target display area of the candidate video. When the change of the text in the target display area is recognized, second feature data is extracted from the candidate video. Second feature data of the target video determined from the at least one candidate video is matched with the first feature data. The target display area includes a partial display area of the candidate video, for example, the target display area is a partial area of the candidate video for displaying text.
Illustratively, when it is recognized that the text of the target display region in the candidate video changes, an image is extracted from the candidate video, and the extracted image is taken as the second feature data. Alternatively, after the image is extracted, the image may be further processed to obtain second feature data.
Illustratively, the first feature data is obtained by, for example, recognizing text change information of the target display area in the reference video, and the first feature data is extracted in a similar manner to the second feature data.
For example, the first characteristic data is sent by the client to the server. The server stores a plurality of candidate videos, each having second feature data. After the server receives the first feature data, the first feature data and the second feature data of each candidate video are compared to obtain a comparison result, and then a candidate video matched with the reference video is determined from the candidate videos as a target video based on the comparison result.
In the embodiment of the disclosure, the feature data are extracted from the video and the video matching is performed through the comparison of the feature data, so that the calculation amount of the video matching is greatly reduced, and the efficiency of the video matching is improved. In addition, the embodiment of the disclosure extracts the feature data by identifying the text change information in the target display area in the video, so that the extracted first feature data is for text change, and the second feature data is for text change in the candidate video, thereby improving the matching probability between the first feature data and the second feature data, and thus improving the success rate of video matching.
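As a minimal sketch only (the function and structure names below are illustrative assumptions, not part of the disclosure), the server-side flow of operations S210 to S230 can be outlined as follows:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    video_id: str
    second_feature_data: List[object]   # extracted when text in the target area changes

def match_reference(first_feature_data: List[object],
                    candidates: List[Candidate],
                    compare: Callable[[List[object], List[object]], bool]
                    ) -> Optional[Candidate]:
    """S210: first feature data has been received; S220: compare it with the
    second feature data of each candidate; S230: return a matching candidate."""
    for candidate in candidates:
        if compare(first_feature_data, candidate.second_feature_data):
            return candidate
    return None
```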
Fig. 3 schematically shows a schematic diagram of a video matching method according to an embodiment of the present disclosure.
As shown in fig. 3, each candidate video is composed of a plurality of (frame) images. Taking the candidate video 300 as an example, a target display area for the candidate video 300 is determined, text change information in the target display area is identified, and when the text change information is identified, second feature data is extracted from the candidate video 300.
Illustratively, the text is, for example, subtitles of the video, and the target display area is, for example, a display area of the subtitles.
For example, video data 310 for a first text and video data 320 for a second text are included in the candidate video 300. The first text is for example "AAA" and the second text is for example "BB". The subtitle corresponding to each image in the video data 310 is, for example, "AAA", and the subtitle corresponding to each image in the video data 320 is, for example, "BB". The candidate video is displayed in the display area 300A, the first text and the second text are displayed in the target display area 300B, for example, and the target display area 300B is a part of the display area 300A, for example.
When it is recognized that the target display area 300B is switched from the first text to the second text, the first video clip 330 is determined from the candidate video 300.
Illustratively, the first video segment 330 includes video data for the first text and/or video data for the second text. For example, the first video segment 330 includes a portion of video data in the video data 310, and text corresponding to each image in the video data 310 is first text. Alternatively, the first video segment 330 includes a portion of the video data in the video data 320, and the text corresponding to each image in the video data 320 is the second text. Alternatively, the first video segment 330 includes a portion of video data in the video data 310 and a portion of video data in the video data 320. The embodiment of the present disclosure takes the example that the first video segment 330 includes a part of the video data in the video data 310 and a part of the video data in the video data 320.
After determining the first video segment 330, second feature data for text changes is extracted from the first video segment 330.
In an example, one first image 340 may be extracted from the first video segment 330, then text information in the first image 340 may be identified, and a partial image 350 in the first image 340 may be extracted based on the identified text information, for example, the partial image 350 may include the text information therein, and the partial image 350 may be taken as the second feature data. Alternatively, the partial image 350 may be subjected to preprocessing such as compression, cropping, changing the image size, and grayscaling, and the preprocessed partial image 350 may be used as the second feature data.
In another example, the first image 340 may be extracted from the first video segment 330, the local image 350 may then be extracted from the first image 340, the local image 350 may be preprocessed, and the preprocessed local image 350 may then be further processed to extract image features of the local image 350, and the image features may be used as the second feature data. The image feature may be a feature vector.
In another example, the first image 340 may be extracted from the first video segment 330, and then the partial image 350 may be extracted from the first image 340, and text recognition may be performed on the partial image 350 to extract text information in the partial image 350, and the text information may be a subtitle, as the second feature data.
According to the embodiment of the disclosure, when extracting the second feature data of the candidate video, a first video segment associated with a text switch is determined by recognizing the text switch, and the second feature data is then extracted from that first video segment. Compared with comparing every image (frame) in the video, matching videos through the extracted second feature data greatly reduces the cost of video matching. In addition, the video content corresponding to the second feature data in the candidate video is tied to a text change, so performing video matching based on the second feature data improves the success rate of video matching.
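A minimal sketch of this extraction step is given below; it assumes OpenCV for frame access and Tesseract (via pytesseract) for reading the subtitle region, neither of which is named in the disclosure, and the sampling stride and region coordinates are illustrative.

```python
import cv2                 # assumed: OpenCV for reading video frames
import pytesseract         # assumed: Tesseract OCR for the subtitle region

def extract_feature_data(video_path, target_area, sample_stride=5):
    """Extract feature data each time the text in the target display area
    changes. target_area = (x, y, w, h) is the subtitle region."""
    x, y, w, h = target_area
    capture = cv2.VideoCapture(video_path)
    previous_text = None
    feature_data = []
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % sample_stride == 0:
            local_image = frame[y:y + h, x:x + w]                  # local image containing the text
            gray = cv2.cvtColor(local_image, cv2.COLOR_BGR2GRAY)   # preprocessing (grayscale)
            text = pytesseract.image_to_string(gray).strip()
            if text and text != previous_text:
                # a text switch was recognized: keep the local image and/or its text
                feature_data.append({"frame": frame_index, "text": text, "image": gray})
                previous_text = text
        frame_index += 1
    capture.release()
    return feature_data
```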
In an embodiment of the present disclosure, the first feature data includes, for example, a first data sequence including a plurality of sub data. The second characteristic data includes, for example, a second data sequence including a plurality of sub data. A subdata is for example an image feature of a partial image or a text message in a partial image.
For example, the first data sequence is [a1, a2, a3]. Sub-data a1 is the feature data extracted when the target display area of the reference video is identified as switching from text A1 to text A2, text A1 being the third text and text A2 being the fourth text. Sub-data a2 is the feature data extracted when the target display area is identified as switching from text A2 to text A3, text A2 being the third text and text A3 being the fourth text. Sub-data a3 is the feature data extracted when the target display area is identified as switching from text A3 to text A4, text A3 being the third text and text A4 being the fourth text. Texts A1, A2, A3 and A4 appear in the reference video in sequence. The first data sequence [a1, a2, a3] is stored in association with the reference video.
For example, the second data sequence is [b1, b2, b3, b4]. Sub-data b1 is the feature data extracted when the target display area of the candidate video is identified as switching from text B1 to text B2, text B1 being the first text and text B2 being the second text. Sub-data b2 is the feature data extracted when the target display area is identified as switching from text B2 to text B3, text B2 being the first text and text B3 being the second text. Sub-data b3 is the feature data extracted when the target display area is identified as switching from text B3 to text B4, text B3 being the first text and text B4 being the second text. Sub-data b4 is the feature data extracted when the target display area is identified as switching from text B4 to text B5, text B4 being the first text and text B5 being the second text. Texts B1, B2, B3, B4 and B5 appear in the candidate video in sequence. The second data sequence [b1, b2, b3, b4] is stored in association with the candidate video.
Then, a plurality of adjacent sub-data are determined from the first data sequence [a1, a2, a3] as a first subsequence, and a plurality of adjacent sub-data are determined from the second data sequence [b1, b2, b3, b4] as a second subsequence. The number of sub-data in the second subsequence is the same as the number of sub-data in the first subsequence and may be set according to the actual application, for example to 2. That is, the first subsequence is, for example, [a1, a2] or [a2, a3], and the second subsequence is [b1, b2], [b2, b3] or [b3, b4]. Each first subsequence is compared with each second subsequence, and if any first subsequence matches any second subsequence, the candidate video corresponding to that second subsequence is determined as the target video. Taking the first subsequence [a1, a2] and the second subsequence [b2, b3] as an example, when sub-data a1 matches sub-data b2 and sub-data a2 matches sub-data b3, the first subsequence [a1, a2] is determined to match the second subsequence [b2, b3]. Two sub-data match when their similarity is high; a sub-data item may be an image or a feature vector of an image.
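The subsequence comparison described above can be sketched as follows; the sub-data are compared with a caller-supplied predicate, since the exact similarity measure is not fixed by the disclosure.

```python
def adjacent_subsequences(sequence, length):
    """All runs of `length` adjacent sub-data, e.g. [a1, a2] and [a2, a3]."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def sequences_match(first_sequence, second_sequence, sub_match, length=2):
    """True if any first subsequence matches any second subsequence
    element-wise, as in the [a1, a2] vs [b2, b3] example above."""
    for first_sub in adjacent_subsequences(first_sequence, length):
        for second_sub in adjacent_subsequences(second_sequence, length):
            if all(sub_match(a, b) for a, b in zip(first_sub, second_sub)):
                return True
    return False

# Example with caption text as the sub-data and exact equality as the predicate:
first = ["AAA", "BB", "CC"]             # first data sequence  [a1, a2, a3]
second = ["XX", "AAA", "BB", "DD"]      # second data sequence [b1, b2, b3, b4]
print(sequences_match(first, second, lambda a, b: a == b))   # True: [a1, a2] matches [b2, b3]
```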
In an embodiment of the disclosure, a comparison result obtained by comparing the first feature data with the second feature data includes a similarity between the first feature data and the second feature data, and a target video is determined from the plurality of candidate videos based on the comparison result, where the similarity between the second feature data of the target video and the first feature data satisfies a similarity condition, where the similarity condition includes that a preset number of sub-data are matched, that is, the similarity condition includes that a second sub-sequence having the preset number of sub-data is matched with a first sub-sequence having the preset number of sub-data.
In an embodiment, the similarity condition is associated with an image size of the partial image.
When the image size of the local image is larger, it indicates that the number of the characters in the local image is larger, and the matching between the sub-data corresponding to the local image indicates that the videos are matched at a higher probability, at this time, a second sub-sequence having a first preset number of sub-data may be set to match with a first sub-sequence having a first preset number of sub-data as a similarity condition, at this time, the first preset number is smaller, and the first preset number is, for example, 2.
When the image size of the local image is smaller, it indicates that the number of the characters in the local image is smaller, and the matching between the sub-data corresponding to the local image indicates that the videos are matched at a lower probability, at this time, a second sub-sequence having a second preset number of sub-data may be set to match with a first sub-sequence having a second preset number of sub-data as a similarity condition, where the second preset number is greater than the first preset number, and the second preset number is, for example, 4.
In one embodiment, the similarity condition is associated with the number of words contained in the text information in the local image.
When the text information contains a large number of characters, the matching between the sub-data corresponding to the text information indicates that the videos are matched at a high probability, at this time, a second sub-sequence with a third preset number of sub-data is matched with a first sub-sequence with a third preset number of sub-data to serve as a similarity condition, and at this time, the third preset number is small.
When the number of the characters included in the text information is small, matching between the sub-data corresponding to the text information indicates that matching between videos is performed with a small probability, at this time, matching between the second sub-sequence having the fourth preset number of sub-data and the first sub-sequence having the fourth preset number of sub-data may be set as a similarity condition, and at this time, the fourth preset number is larger than the third preset number.
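A minimal sketch of how the preset number could be tied to the amount of text is given below; the concrete thresholds and counts are assumptions for illustration, not values given in the disclosure.

```python
def required_adjacent_matches(text_length, min_length=5,
                              few_matches=2, many_matches=4):
    """Longer captions make a single match more discriminative, so fewer
    adjacent matching sub-data are demanded; short captions require a
    longer matching run. All numeric values here are illustrative."""
    return few_matches if text_length >= min_length else many_matches

print(required_adjacent_matches(len("a fairly long caption")))  # 2
print(required_adjacent_matches(len("ok")))                     # 4
```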
In one embodiment, the first data sequence [a1, a2, a3] is, for example, sent by the client to the server. After receiving the first data sequence [a1, a2, a3], the server determines a first subsequence from it, determines a second subsequence from the second data sequence [b1, b2, b3, b4], and matches the first subsequence with the second subsequence.
In another example, the client may send one piece of sub-data for the reference video to the server at a time. After the server receives a piece of sub-data for the reference video, it compares the received sub-data with the sub-data of the candidate videos, until it determines that a plurality of adjacent sub-data of the reference video match a plurality of adjacent sub-data of a candidate video. Once such a match is determined, the matching process may end, and the matched candidate video may be recommended to the client as the target video.
For example, the client sends sub-data a1 to the server, and the server compares a1 with each sub-data item of the second data sequence [b1, b2, b3, b4]. When it determines that sub-data a1 matches sub-data b2, the candidate video corresponding to the second data sequence [b1, b2, b3, b4] is stored in a queue. The client then sends sub-data a2 to the server, and the server compares a2 with each sub-data item of the second data sequence [b1, b2, b3, b4]. When it determines that sub-data a2 matches sub-data b3, the first data sequence and the second data sequence have 2 adjacent matching sub-data, and the candidate video in the queue corresponding to the second data sequence [b1, b2, b3, b4] is taken as the target video. The server may recommend the target video to the client. If sub-data a2 does not match sub-data b3, the candidate video corresponding to the second data sequence [b1, b2, b3, b4] may be removed from the queue.
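The queue-based incremental matching described above can be sketched as follows; the class name, the per-candidate bookkeeping and the run-restart behaviour are assumptions made for illustration.

```python
class IncrementalMatcher:
    """Server-side sketch: the client streams one sub-data item of the reference
    video at a time; for each candidate the server records the position after
    the last matched sub-data item and the length of the current matching run."""

    def __init__(self, candidates, sub_match, required=2):
        self.candidates = candidates            # {video_id: second data sequence}
        self.sub_match = sub_match              # similarity predicate for sub-data
        self.required = required                # adjacent matches needed
        self.state = {vid: (0, 0) for vid in candidates}   # (next_index, run_length)

    def feed(self, sub_data_item):
        for video_id, sequence in self.candidates.items():
            next_index, run = self.state[video_id]
            if run and next_index < len(sequence) and self.sub_match(sub_data_item, sequence[next_index]):
                next_index, run = next_index + 1, run + 1      # extend the current run
            else:
                next_index, run = 0, 0                          # drop from the "queue"
                for j, item in enumerate(sequence):             # start a new run if possible
                    if self.sub_match(sub_data_item, item):
                        next_index, run = j + 1, 1
                        break
            self.state[video_id] = (next_index, run)
            if run >= self.required:
                return video_id                                 # target video found
        return None

matcher = IncrementalMatcher({"candidate": ["B1", "AAA", "BB", "B4"]},
                             sub_match=lambda a, b: a == b)
print(matcher.feed("AAA"))   # None         (one adjacent match so far)
print(matcher.feed("BB"))    # 'candidate'  (two adjacent sub-data matched)
```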
Fig. 4 schematically shows a flow chart of a video processing method according to an embodiment of the present disclosure.
As shown in fig. 4, the video processing method 400 of the embodiment of the present disclosure may include, for example, operations S410 to S430. The method of the disclosed embodiments may be performed by a client, for example.
In operation S410, text change information in a target display area in a reference video is identified with respect to the target display area.
In operation S420, first feature data is extracted from the reference video in response to recognition of the text change information.
In operation S430, first feature data is transmitted.
In the embodiment of the disclosure, the process of extracting the first feature data by performing text change information recognition on the target display area in the reference video by the client is the same as or similar to the process of extracting the second feature data by performing text change information recognition on the target display area in the candidate video by the server, and is not repeated herein. After the client extracts the first feature data, the client may send the first feature data to the server, so that the server can match the first feature data with the second feature data conveniently.
In the embodiment of the disclosure, the first feature data is extracted from the reference video and the video matching is performed through the first feature data, so that the calculation amount of the video matching is greatly reduced, and the efficiency of the video matching is improved. In addition, the embodiment of the disclosure extracts the feature data by identifying the text change information of the target display area in the video, so that the extracted first feature data is specific to the text change, the matching probability between the first feature data and the second feature data is improved, and the success rate of video matching is improved.
In the embodiment of the disclosure, when the client identifies that the target display area in the reference video is switched from the third text to the fourth text, a second video segment is determined from the reference video, and then the first feature data is extracted from the second video segment, wherein the second video segment comprises video data for the third text and/or video data for the fourth text. The process of determining the second video segment is similar to the process of determining the first video segment, and is not repeated herein.
Illustratively, when the client identifies that the target display area in the reference video is switched from the third text to the fourth text, a second video segment is determined from the reference video, then a second image is extracted from the second video segment, a local image is extracted from the second image by identifying text information in the second image, and the local image is taken as the first feature data.
Alternatively, the partial image may be subjected to preprocessing such as compression, cropping, changing the image size, or graying the image, and the preprocessed partial image may be used as the first feature data.
Alternatively, a second image may be extracted from the second video segment, a partial image may be extracted from the second image, the partial image may be preprocessed, the preprocessed partial image may be further processed to extract image features of the partial image, and the image features may be used as the first feature data. The image feature may be a feature vector.
Or, for the preprocessed local image, recognizing text information in the local image, and taking the text information as the first feature data.
According to the embodiment of the present disclosure, when extracting the first feature data of the reference video, a second video segment for text switching is determined by recognizing the text switching, and then the first feature data is extracted from the second video segment. Compared with the mode of comparing each image in the video, the video matching is carried out by extracting the first characteristic data, so that the video matching cost is greatly reduced. In addition, the video content corresponding to the first characteristic data in the reference video is changed according to the text, and video matching is performed based on the first characteristic data, so that the success rate of video matching is improved.
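On the client side, a minimal sketch of operation S430 could look like the following; the endpoint URL and the JSON payload layout are hypothetical assumptions, and each feature item could come from an extraction routine like the one sketched for the candidate videos above.

```python
import json
import urllib.request

def send_first_feature_data(server_url, feature_item):
    """Upload one piece of first feature data (here the recognized caption
    text and its frame index) to the server for matching."""
    payload = json.dumps({"frame": feature_item["frame"],
                          "text": feature_item["text"]}).encode("utf-8")
    request = urllib.request.Request(server_url, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.status    # HTTP status returned by the server

# Example call (hypothetical endpoint):
# send_first_feature_data("http://example.com/match", {"frame": 120, "text": "AAA"})
```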
Fig. 5 schematically shows a schematic diagram of a video matching method and a video processing method according to an embodiment of the present disclosure.
As shown in fig. 5, the video matching method performed by the server 510 includes operations S510A through S560A, and the video processing method performed by the client 520 includes operations S510B through S540B.
In operation S510A, for each candidate video stored by the server 510, text change information in a target display area is identified for the target display area of each candidate video.
In operation S520A, the server 510 extracts second feature data from the candidate video after recognizing the text change information.
In operation S510B, the client 520 identifies text change information in a target display area in the reference video.
In operation S520B, the client 520 extracts first feature data from the reference video after recognizing the text change information.
In operation S530B, the client 520 transmits the first feature data to the server 510.
In operation S530A, the server 510 receives first feature data.
In operation S540A, the server 510 compares the first feature data with the second feature data of each candidate video to obtain a comparison result.
In operation S550A, the server 510 determines a target video matching the reference video from among the plurality of candidate videos based on the comparison result.
In operation S560A, the server 510 recommends the target video to the client 520.
In operation S540B, the client 520 presents the target video to the user.
In the embodiment of the present disclosure, if multiple frames of images in a video are compared to perform video matching, the cost of matching calculation is too high. The feature data in the video is extracted based on the text switching of the video, and the client and the server (cloud) cooperate to realize the video search with lower cost.
For the client, a complete reference video does not need to be uploaded to the server, a second image is extracted by identifying text switching of a target display area in the reference video, the second image is preprocessed and processed to obtain first characteristic data, and the first characteristic data is uploaded to the server for video matching.
For the server, the first feature data uploaded by the client is compared with the second feature data extracted in advance. The server does not limit the number of first feature data items uploaded by the client, the client does not need to upload a complete reference video, the client can extract features in real time, and the server performs video matching in real time. Therefore, according to the technical scheme of the embodiment of the disclosure, even when feature extraction is performed on videos with different lengths or different starting points, the extracted first feature data and second feature data are both tied to text switches, so they have a higher probability of being similar, and video matching and video search can be realized with a lower amount of computation.
According to the embodiment of the disclosure, the image extraction, the image preprocessing and the feature data extraction are completed on the client, the data transmission quantity of the client and the server is reduced, and the matching speed of the server is greatly improved.
Fig. 6 schematically shows a block diagram of a video matching apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the video matching apparatus 600 of the embodiment of the present disclosure includes, for example, a receiving module 610, a comparing module 620, and a first determining module 630.
The receiving module 610 may be configured to receive first feature data for a reference video. According to the embodiment of the present disclosure, the receiving module 610 may perform, for example, the operation S210 described above with reference to fig. 2, which is not described herein again.
The comparing module 620 may be configured to compare the first feature data with second feature data of each of the at least one candidate video to obtain a comparison result, where the second feature data is obtained by identifying text change information in a target display area, and the target display area includes a partial display area of the candidate video. According to the embodiment of the present disclosure, the comparing module 620 may perform, for example, the operation S220 described above with reference to fig. 2, which is not described herein again.
The first determining module 630 may be configured to determine a target video matching the reference video from the at least one candidate video based on the comparison result, wherein the second feature data of the target video matches the first feature data. According to the embodiment of the present disclosure, the first determining module 630 may, for example, perform operation S230 described above with reference to fig. 2, which is not described herein again.
According to an embodiment of the present disclosure, the apparatus 600 may further include: the device comprises a second determining module, a first identifying module and a first extracting module. A second determination module to determine, for each of the at least one candidate video, a target display region for the candidate video. And the first identification module is used for identifying the text change information in the target display area. And the first extraction module is used for responding to the identification of the text change information and extracting second characteristic data from the candidate video.
According to an embodiment of the present disclosure, the first extraction module includes: a first determination submodule and a first extraction submodule. And the first determining sub-module is used for determining a first video segment from the candidate videos in response to the fact that the target display area is switched from the first text to the second text is identified, wherein the first video segment comprises video data aiming at the first text and/or video data aiming at the second text. And the first extraction submodule is used for extracting second characteristic data from the first video segment.
According to an embodiment of the present disclosure, the first extraction module includes: a second extraction sub-module, a third extraction sub-module and a first processing sub-module. And the second extraction sub-module is used for extracting the first image from the candidate video. And the third extraction submodule is used for extracting a local image comprising the text information from the first image. And the first processing sub-module is used for processing the local images to obtain second characteristic data aiming at the candidate video.
According to an embodiment of the disclosure, the first processing submodule is configured to perform at least one of: extracting image features of the local image, and taking the image features as second feature data; and recognizing text information in the local image, and using the text information as second feature data.
According to an embodiment of the present disclosure, the comparison result includes a similarity between the first feature data and the second feature data; the similarity between the second characteristic data and the first characteristic data of the target video meets a similarity condition.
According to an embodiment of the present disclosure, the similarity condition is associated with an image size of the partial image, or the similarity condition is associated with the number of characters contained in the text information.
According to an embodiment of the present disclosure, the target display region includes a display region of subtitles.
Fig. 7 schematically shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the video processing apparatus 700 of the embodiment of the present disclosure includes, for example, a second identifying module 710, a second extracting module 720, and a sending module 730.
The second identifying module 710 may be configured to identify text change information in a target display area in a reference video. According to the embodiment of the present disclosure, the second identifying module 710 may, for example, perform operation S410 described above with reference to fig. 4, which is not described herein again.
The second extraction module 720 may be configured to extract first feature data from the reference video in response to identifying the text change information. According to the embodiment of the present disclosure, the second extracting module 720 may, for example, perform operation S420 described above with reference to fig. 4, which is not described herein again.
The sending module 730 may be configured to send the first characteristic data. According to the embodiment of the present disclosure, the sending module 730 may, for example, perform the operation S430 described above with reference to fig. 4, which is not described herein again.
According to an embodiment of the present disclosure, the second extraction module 720 includes: a second determination submodule and a fourth extraction submodule. And the second determining sub-module is used for determining a second video segment from the reference video in response to the fact that the target display area is switched from the third text to the fourth text is identified, wherein the second video segment comprises video data aiming at the third text and/or video data aiming at the fourth text. And the fourth extraction submodule is used for extracting the first characteristic data from the second video clip.
According to an embodiment of the present disclosure, the second extraction module 720 includes: a fifth extraction sub-module, a sixth extraction sub-module and a second processing sub-module. And the fifth extraction sub-module is used for extracting the second image from the reference video. And the sixth extraction submodule is used for extracting the local image comprising the text information from the second image. And the second processing sub-module is used for processing the local image to obtain first characteristic data aiming at the reference video.
According to an embodiment of the disclosure, the second processing submodule is configured to perform at least one of: extracting image features of the local image, and taking the image features as first feature data; and identifying text information in the local image, and taking the text information as first characteristic data.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 8 is a block diagram of an electronic device for implementing a video matching method of an embodiment of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 performs the respective methods and processes described above, such as the video matching method. For example, in some embodiments, the video matching method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the video matching method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the video matching method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
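As a hedged illustration of this client-server split applied to the present disclosure, a client that has extracted first feature data for a reference video might send it to a matching server over a network, as in the minimal sketch below. The endpoint URL, the JSON payload layout, and the function name are assumptions made for illustration only and are not prescribed by the disclosure.

import json
from urllib import request


def send_first_feature_data(texts, server_url="http://localhost:8080/match"):
    """Post first feature data (here, recognized subtitle texts) to a
    hypothetical matching server and return its reply.  The URL and the
    JSON schema are illustrative assumptions, not part of the disclosure."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    req = request.Request(server_url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # server is assumed to reply with a matched video id
        return json.loads(resp.read().decode("utf-8"))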
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
The electronic device may be configured to perform a video processing method. The electronic device may comprise, for example, a computing unit, a ROM, a RAM, an I/O interface, an input unit, an output unit, a storage unit, and a communication unit. These components have the same or similar functions as the corresponding components of the electronic device 800 shown in Fig. 8, and are not described again here.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (30)

1. A video matching method, comprising:
receiving first feature data for a reference video;
comparing the first feature data with second feature data of at least one candidate video to obtain a comparison result, wherein the second feature data is obtained by identifying text change information in a target display area, and the target display area comprises a partial display area of the candidate video; and
determining, based on the comparison result, a target video matching the reference video from the at least one candidate video, wherein the second feature data of the target video matches the first feature data.
2. The method of claim 1, further comprising:
for each candidate video of the at least one candidate video, determining a target display area for the candidate video;
identifying text change information in the target display area; and
extracting second feature data from the candidate video in response to identifying the text change information.
3. The method of claim 2, wherein said extracting second feature data from the candidate video in response to identifying text change information comprises:
in response to identifying that the target display area is switched from a first text to a second text, determining a first video segment from the candidate video, wherein the first video segment comprises video data for the first text and/or video data for the second text; and
extracting the second feature data from the first video segment.
4. The method of claim 2 or 3, wherein said extracting second feature data from the candidate video comprises:
extracting a first image from the candidate video;
extracting a partial image including text information from the first image; and
processing the partial image to obtain second feature data for the candidate video.
5. The method of claim 4, wherein the processing of the partial image to obtain the second feature data for the candidate video comprises at least one of:
extracting image features of the partial image, and using the image features as the second feature data; and
identifying text information in the partial image, and using the text information as the second feature data.
6. The method according to claim 4 or 5, wherein the comparison result comprises a similarity between the first feature data and the second feature data, and the similarity between the second feature data of the target video and the first feature data meets a similarity condition.
7. The method according to claim 6, wherein the similarity condition is associated with an image size of the partial image, or the similarity condition is associated with a number of characters included in the text information.
8. The method of any of claims 1-7, wherein the target display area comprises a display area of subtitles.
9. A video processing method, comprising:
identifying, for a target display area in a reference video, text change information in the target display area;
extracting first feature data from the reference video in response to identifying text change information; and
sending the first feature data.
10. The method of claim 9, wherein said extracting first feature data from the reference video in response to identifying text change information comprises:
in response to identifying that the target display area is switched from a third text to a fourth text, determining a second video segment from the reference video, wherein the second video segment comprises video data for the third text and/or video data for the fourth text; and
extracting the first feature data from the second video segment.
11. The method of claim 9 or 10, wherein said extracting first feature data from the reference video comprises:
extracting a second image from the reference video;
extracting a partial image including text information from the second image; and
processing the partial image to obtain first feature data for the reference video.
12. The method of claim 11, wherein the processing of the partial image to obtain the first feature data for the reference video comprises at least one of:
extracting image features of the partial image, and using the image features as the first feature data; and
identifying text information in the partial image, and using the text information as the first feature data.
13. A video matching apparatus, comprising:
a receiving module for receiving first feature data for a reference video;
a comparison module, configured to compare the first feature data with second feature data of at least one candidate video to obtain a comparison result, where the second feature data is obtained by identifying text change information in a target display area, and the target display area includes a partial display area of the candidate video; and
a first determining module, configured to determine, based on the comparison result, a target video matching the reference video from the at least one candidate video, where second feature data of the target video matches the first feature data.
14. The apparatus of claim 13, further comprising:
a second determination module, configured to determine, for each candidate video of the at least one candidate video, a target display area for the candidate video;
a first identification module, configured to identify text change information in the target display area; and
a first extraction module, configured to extract second feature data from the candidate video in response to identifying the text change information.
15. The apparatus of claim 14, wherein the first extraction module comprises:
a first determining sub-module, configured to determine a first video segment from the candidate video in response to identifying that the target display area is switched from a first text to a second text, wherein the first video segment includes video data for the first text and/or video data for the second text; and
a first extraction sub-module for extracting the second feature data from the first video segment.
16. The apparatus of claim 14 or 15, wherein the first extraction module comprises:
a second extraction sub-module, configured to extract a first image from the candidate video;
a third extraction sub-module, configured to extract a partial image including text information from the first image; and
a first processing sub-module, configured to process the partial image to obtain second feature data for the candidate video.
17. The apparatus of claim 16, wherein the first processing submodule is to perform at least one of:
extracting image features of the partial image, and using the image features as the second feature data; and
identifying text information in the partial image, and using the text information as the second feature data.
18. The apparatus according to claim 16 or 17, wherein the comparison result comprises a similarity between the first feature data and the second feature data, and the similarity between the second feature data of the target video and the first feature data meets a similarity condition.
19. The apparatus according to claim 18, wherein the similarity condition is associated with an image size of the partial image, or the similarity condition is associated with a number of characters included in the text information.
20. The apparatus of any of claims 13-19, wherein the target display area comprises a display area of subtitles.
21. A video processing apparatus comprising:
a second identification module, configured to identify, for a target display area in a reference video, text change information in the target display area;
a second extraction module, configured to extract first feature data from the reference video in response to identifying the text change information; and
a sending module, configured to send the first feature data.
22. The apparatus of claim 21, wherein the second extraction module comprises:
a second determining sub-module, configured to determine a second video segment from the reference video in response to identifying that the target display area is switched from a third text to a fourth text, wherein the second video segment includes video data for the third text and/or video data for the fourth text; and
a fourth extraction sub-module for extracting the first feature data from the second video segment.
23. The apparatus of claim 21 or 22, wherein the second extraction module comprises:
a fifth extraction sub-module, configured to extract a second image from the reference video;
a sixth extraction sub-module, configured to extract a partial image including text information from the second image; and
a second processing sub-module, configured to process the partial image to obtain first feature data for the reference video.
24. The apparatus of claim 23, wherein the second processing submodule is to perform at least one of:
extracting image features of the partial image, and using the image features as the first feature data; and
identifying text information in the partial image, and using the text information as the first feature data.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
26. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 9-12.
27. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
28. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 9-12.
29. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
30. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 9-12.
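For orientation only, the following non-normative sketch illustrates one way the matching flow recited in claims 1-12 could be realized. It is not part of the claims or the disclosure; the class and function names (FeatureData, recognize_text, extract_feature_data, match_video), the use of recognized subtitle text as the feature data, the SequenceMatcher similarity, and the threshold values are all assumptions made for illustration.

from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class FeatureData:
    video_id: str
    texts: List[str]  # text recognized in the target display area at each text change


def recognize_text(frame, region) -> str:
    """Stand-in for a text recognizer (e.g. OCR) applied to the target display
    area, such as the subtitle region; the disclosure does not prescribe one."""
    raise NotImplementedError


def extract_feature_data(frames, region, video_id: str) -> FeatureData:
    """Extract feature data only when the text in the target display area
    changes (cf. claims 2-3 and 9-10): each switch to a new text triggers
    extraction, and the recognized text serves as the feature (cf. claims 5 and 12)."""
    texts: List[str] = []
    previous: Optional[str] = None
    for frame in frames:
        current = recognize_text(frame, region)
        if current and current != previous:  # text change identified
            texts.append(current)
        previous = current
    return FeatureData(video_id=video_id, texts=texts)


def similarity(a: FeatureData, b: FeatureData) -> float:
    """One possible similarity between first and second feature data; any
    text-based or image-feature-based metric could be substituted here."""
    return SequenceMatcher(None, "\n".join(a.texts), "\n".join(b.texts)).ratio()


def match_video(first: FeatureData,
                candidates: List[FeatureData],
                base_threshold: float = 0.8) -> Optional[FeatureData]:
    """Cf. claims 1 and 6-7: compare the first feature data with the second
    feature data of each candidate and return the candidate whose similarity
    meets a similarity condition; the condition below is loosened slightly for
    short texts, echoing the character-count association of claim 7."""
    best: Optional[FeatureData] = None
    best_score = 0.0
    for candidate in candidates:
        score = similarity(first, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best is None:
        return None
    char_count = sum(len(t) for t in first.texts)
    threshold = base_threshold if char_count >= 20 else base_threshold - 0.1
    return best if best_score >= threshold else None

An implementation could equally use image features of the partial image as the feature data (claims 5, 12, 17, and 24) and a similarity condition tied instead to the image size of the partial image (claims 7 and 19).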
CN202110520028.8A 2021-05-12 2021-05-12 Video matching method, video processing device, electronic equipment and medium Active CN113254712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110520028.8A CN113254712B (en) 2021-05-12 2021-05-12 Video matching method, video processing device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110520028.8A CN113254712B (en) 2021-05-12 2021-05-12 Video matching method, video processing device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113254712A true CN113254712A (en) 2021-08-13
CN113254712B CN113254712B (en) 2024-04-26

Family

ID=77181577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110520028.8A Active CN113254712B (en) 2021-05-12 2021-05-12 Video matching method, video processing device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113254712B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901302A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN114866818A (en) * 2022-06-17 2022-08-05 深圳壹账通智能科技有限公司 Video recommendation method and device, computer equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6937766B1 (en) * 1999-04-15 2005-08-30 MATE—Media Access Technologies Ltd. Method of indexing and searching images of text in video
CN101021857A (en) * 2006-10-20 2007-08-22 鲍东山 Video searching system based on content analysis
CN101719144A (en) * 2009-11-04 2010-06-02 中国科学院声学研究所 Method for segmenting and indexing scenes by combining captions and video image information
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN102254006A (en) * 2011-07-15 2011-11-23 上海交通大学 Method for retrieving Internet video based on contents
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device
CN103593464A (en) * 2013-11-25 2014-02-19 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
CN111813998A (en) * 2020-09-10 2020-10-23 北京易真学思教育科技有限公司 Video data processing method, device, equipment and storage medium
US20210120282A1 (en) * 2018-12-31 2021-04-22 Sling Media Pvt Ltd Systems, methods, and devices supporting scene change-based smart search functionalities

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6937766B1 (en) * 1999-04-15 2005-08-30 MATE—Media Access Technologies Ltd. Method of indexing and searching images of text in video
CN101021857A (en) * 2006-10-20 2007-08-22 鲍东山 Video searching system based on content analysis
CN101719144A (en) * 2009-11-04 2010-06-02 中国科学院声学研究所 Method for segmenting and indexing scenes by combining captions and video image information
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN102254006A (en) * 2011-07-15 2011-11-23 上海交通大学 Method for retrieving Internet video based on contents
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device
CN103593464A (en) * 2013-11-25 2014-02-19 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
US20210120282A1 (en) * 2018-12-31 2021-04-22 Sling Media Pvt Ltd Systems, methods, and devices supporting scene change-based smart search functionalities
CN111813998A (en) * 2020-09-10 2020-10-23 北京易真学思教育科技有限公司 Video data processing method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901302A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
EP4145306A1 (en) * 2021-09-29 2023-03-08 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus of processing data, electronic device, and medium
CN114866818A (en) * 2022-06-17 2022-08-05 深圳壹账通智能科技有限公司 Video recommendation method and device, computer equipment and storage medium
CN114866818B (en) * 2022-06-17 2024-04-26 深圳壹账通智能科技有限公司 Video recommendation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113254712B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN113360700B (en) Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN113254712B (en) Video matching method, video processing device, electronic equipment and medium
CN115345968B (en) Virtual object driving method, deep learning network training method and device
CN113657289A (en) Training method and device of threshold estimation model and electronic equipment
CN113657395B (en) Text recognition method, training method and device for visual feature extraction model
CN114724168A (en) Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
CN116343233B (en) Text recognition method and training method and device of text recognition model
CN113255484B (en) Video matching method, video processing device, electronic equipment and medium
CN116761020A (en) Video processing method, device, equipment and medium
CN114549904B (en) Visual processing and model training method, device, storage medium and program product
US20220207286A1 (en) Logo picture processing method, apparatus, device and medium
CN112651449B (en) Method, device, electronic equipment and storage medium for determining content characteristics of video
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN113254703A (en) Video matching method, video processing device, electronic equipment and medium
CN113947195A (en) Model determination method and device, electronic equipment and memory
CN113901302A (en) Data processing method, device, electronic equipment and medium
CN113379592A (en) Method and device for processing sensitive area in picture and electronic equipment
CN113920306A (en) Target re-identification method and device and electronic equipment
CN113254706A (en) Video matching method, video processing device, electronic equipment and medium
CN113076932A (en) Method for training audio language recognition model, video detection method and device thereof
CN113033373A (en) Method and related device for training face recognition model and recognizing face
CN113408530B (en) Image identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant