CN109429084B - Video processing method and device for video processing - Google Patents


Info

Publication number
CN109429084B
CN109429084B (application CN201710737845.2A)
Authority
CN
China
Prior art keywords
target
information
video frame
video
target information
Prior art date
Legal status
Active
Application number
CN201710737845.2A
Other languages
Chinese (zh)
Other versions
CN109429084A (en)
Inventor
张�杰
卜海亮
靳一笑
邢真臻
蒋品
冯新强
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201710737845.2A priority Critical patent/CN109429084B/en
Publication of CN109429084A publication Critical patent/CN109429084A/en
Application granted granted Critical
Publication of CN109429084B publication Critical patent/CN109429084B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2668Creating a channel for a dedicated end-user group, e.g. insertion of targeted commercials based on end-user profiles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Circuits (AREA)

Abstract

An embodiment of the present invention provides a video processing method, a video processing apparatus, and a device for video processing. The method comprises: acquiring text information in a video frame; acquiring, from a preset article library, a target article matched with the text information; and adding target information corresponding to the target article into the video frame. The embodiment of the invention can shorten the processing time of a video, improve video processing efficiency, and increase the video coverage of the target information.

Description

Video processing method and device for video processing
Technical Field
The present invention relates to the field of video technologies, and in particular, to a video processing method and apparatus, and an apparatus for video processing.
Background
With the development of internet technology, more and more users are accustomed to watching videos on terminals such as computers and mobile phones; specifically, a user can watch videos of interest through a locally installed client player or through a player embedded in a web page.
Currently, information can be added to a video through video processing. In existing schemes, this is done manually: after watching a video, an operator extracts a video frame suitable for carrying the information, obtains the information corresponding to that video frame, and then inserts the information into the video frame using an editing system.
However, adding information manually in this way incurs considerable time and labor costs, resulting in low video processing efficiency.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a video processing method, a video processing apparatus, and a device for video processing that overcome, or at least partially solve, the above problems. They can shorten the processing time of a video, improve video processing efficiency, and increase the video coverage of target information.
In order to solve the above problem, the present invention discloses a video processing method, comprising:
acquiring text information in a video frame;
acquiring a target article matched with the text information from a preset article library;
and adding target information corresponding to the target article into the video frame.
In another aspect, the present invention discloses a video processing apparatus, comprising:
the text information acquisition module is used for acquiring text information in the video frame;
the target article acquisition module is used for acquiring a target article matched with the text information from a preset article library; and
and the target information adding module is used for adding the target information corresponding to the target article into the video frame.
Optionally, the text information obtaining module includes:
and the recognition submodule is used for performing text recognition and/or subtitle recognition on a video frame included in the video to obtain text information in the video frame.
Optionally, the target item acquisition module includes:
and the judging sub-module is used for judging whether the text information includes information matched with feature information corresponding to a first article in the preset article library, or to an article of the same kind as the first article; if so, the first article is taken as the target article matched with the text information.
Optionally, the target information adding module includes:
the target position determining submodule is used for determining a target position for adding target information in the video frame;
and the adding sub-module is used for adding the target information at the target position in the video frame.
Optionally, the target position determination sub-module includes:
a first target position determining unit, configured to determine a degree of conformity between an existing article in the video frame and the target article, and to take, as the target position, the position of an existing article whose conformity meets a preset condition; and/or
and the second target position determining unit is used for identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
Optionally, the target position is a subtitle-related position;
the adding submodule comprises:
the subtitle adding unit is used for modifying the subtitles included in the video frame according to the target information, so as to add the target information to the subtitles included in the video frame; and/or
a subtitle appending unit, used for adding the target information around the subtitles in the video frame as additional information of the subtitles, so as to add the target information to the video frame.
Optionally, the target information adding module includes:
the video frame information modification submodule is used for modifying the information corresponding to the target position in the video frame according to the target information so as to obtain a modified video frame comprising the target information; or
and the appending sub-module is used for adding the target information into the video frame as additional information at the corresponding target position in the video frame.
Optionally, the video frame information modification sub-module includes:
the pixel value modifying unit is used for replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, the second pixel value being determined according to the color values of the target information in picture format and/or text format; and/or
a subtitle text modifying unit, used for modifying the text information corresponding to the subtitle position in the video frame into target information in text format.
Optionally, the apparatus further comprises:
the image tracking module is used for carrying out image tracking on image targets in continuous video frames included in the video;
and the target information multiplexing module is used for multiplexing, according to the image tracking result, the target information that corresponds to an image target in a preceding video frame for the same image target in a subsequent video frame.
In yet another aspect, an apparatus for video processing is disclosed that includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for:
acquiring text information in a video frame;
acquiring a target article matched with the text information from a preset article library;
and adding target information corresponding to the target article into the video frame.
In yet another aspect, a machine-readable medium is disclosed, having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform one or more of the video processing methods described above.
The embodiment of the invention has the following advantages:
In the embodiment of the present invention, a machine automatically acquires the text information in a video frame, acquires a target article matched with the text information from a preset article library, and adds the target information corresponding to the target article into the corresponding video frame. Because the target article matched with the text information of a video frame can be acquired quickly and without manual intervention, the processing time of the video can be shortened and the video processing efficiency improved.
Moreover, with the video processing time shortened, the number of videos that can be processed per unit time grows geometrically, and the number of machines processing videos can be scaled out through cluster computing, so the video coverage of the target information can be increased.
Further, because the embodiment of the invention performs video processing by acquiring text information and matching it against the preset article library, when the information in the preset article library changes, the latest target article and its corresponding target information can be obtained from the library. This improves the timeliness of the target information added to video frames and can even, to a certain extent, achieve real-time updating of the target information.
Drawings
FIG. 1 is a block diagram of a video processing system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the steps of an embodiment of a video processing method according to the present invention;
FIG. 3 is a block diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram illustrating a configuration of an apparatus 900 for video processing according to the present invention as a terminal;
fig. 5 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention provides a video processing scheme, which can acquire text information in a video frame, acquire a target object matched with the text information from a preset object library, and add target information corresponding to the target object in the video frame.
In this scheme, a machine automatically acquires the text information in a video frame, acquires a target article matched with the text information from a preset article library, and adds the target information corresponding to the target article into the video frame. Because the target article matched with the text information of a video frame can be acquired quickly and without manual intervention, the processing time of the video can be shortened and the video processing efficiency improved.
Moreover, with the video processing time shortened, the number of videos that can be processed per unit time grows geometrically, and the number of machines processing videos can be scaled out through cluster computing, so the video processing efficiency can be further improved.
Further, because video processing is performed by acquiring text information and matching it against the preset article library, when the information in the preset article library changes, the latest target article and its corresponding target information can be obtained from the library. This shortens the update cycle of the target information and can, for example, achieve real-time updating of the target information to a certain extent.
The video processing scheme provided by the embodiment of the invention can be used for processing videos from any video platform, and can be used for processing offline videos or real-time playing videos. In practical applications, examples of the video platform may include: video websites and/or video APPs, etc.
Referring to fig. 1, there is shown an exemplary block diagram of a video processing system of an embodiment of the invention, which may include: a video server 101, a video client 102, and a video processing apparatus 103; the video server 101 and the video client 102 may be located in a wired or wireless network, and the video server 101 and the video client 102 perform data interaction through the wired or wireless network; the video server 101 and the video processing device 103 can also perform data interaction through a wired or wireless network.
In practical applications, the video server 101 may provide the first video to the video client 102, so that the video client 102 plays the first video provided by the video server 101; for example, the corresponding first video may be provided to the video client 102 according to a play request or a download request of the video client 102.
Meanwhile, the video server 101 may provide a second video, to which information is to be added, to the video processing apparatus 103; the video processing apparatus 103 processes the second video using the video processing scheme of the embodiment of the present invention to obtain a second video with the target information added, and returns that video to the video server 101.
In practical applications, the second video may be an offline video or a real-time playing video.
In the case where the second video is an offline video (for example, a currently popular video), the video server 101 may send the offline video to the video processing apparatus 103, obtain from it the offline video with the target information added, and store that video. Then, upon receiving a play request or a download request from the video client 102, the first video provided to the video client 102 may be the stored second video, with the target information added, that corresponds to the request.
In the case where the second video is played in real time, the video server 101 may receive a play request from the video client 102; the play request may carry information such as the URL (Uniform Resource Locator) of the real-time video. The video server 101 then obtains the real-time video according to the URL, sends it to the video processing apparatus 103, and obtains from it the real-time video with the target information added. The first video provided to the video client 102 may then be that real-time video with the target information added.
It is to be understood that the video processing system shown in fig. 1 is only one example of an application environment for the video processing method of the embodiment of the present invention; the method may be applied in any application environment. For example, it may also run on the client side, where the video client 102 uses the video processing method of the embodiment to process the first video provided by the video server 101 and add target information to it. The embodiment of the present invention is not limited to a specific application environment.
Method embodiment
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a video processing method according to the present invention is shown, which may specifically include the following steps:
step 201, acquiring text information in a video frame;
step 202, obtaining a target article matched with the text information from a preset article library;
step 203, adding the target information corresponding to the target article into the video frame.
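The three steps above can be sketched in code as follows. This is an illustrative stand-in, not the patent's implementation: the item library contents, the keyword matching, and the `logo:` file names are all invented for the example, and text extraction here simply lowercases a subtitle string in place of real recognition.

```python
# Hypothetical preset article library: item name -> (feature keywords, target info)
PRESET_ITEM_LIBRARY = {
    "Three Squirrels": ({"three squirrels", "squirrel"}, "logo:three_squirrels.png"),
    "GAP": ({"gap"}, "logo:gap.png"),
}

def extract_text(frame_subtitle: str) -> str:
    """Step 201: in practice this would be OCR / subtitle recognition."""
    return frame_subtitle.lower()

def match_target_item(text: str):
    """Step 202: return the first library item whose features appear in the text."""
    for item, (keywords, target_info) in PRESET_ITEM_LIBRARY.items():
        if any(kw in text for kw in keywords):
            return item, target_info
    return None

def process_frame(frame_subtitle: str) -> str:
    """Step 203: annotate the frame with target information when a match is found."""
    match = match_target_item(extract_text(frame_subtitle))
    if match is None:
        return frame_subtitle
    item, target_info = match
    return f"{frame_subtitle} [{target_info}]"

print(process_frame("There are three favourite squirrels"))
```

A frame whose text matches no library entry passes through unchanged.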
The embodiment of the present invention does not limit the source of the video in step 201. For example, the video may originate from a video server or from a user. Wherein, in the case that the video originates from the video server, the video can be an offline video or a real-time playing video. In the case where the video originates from the user, for example, an uploading interface may be provided to the user in the form of a website or APP, and the video uploaded by the user through the uploading interface is taken as the video in step 201.
Video is typically composed of still pictures, which are referred to as video frames. In practical application, a plurality of video frames may be extracted from a video at preset time intervals, and the extracted video frames are input to step 201, that is, the extracted video frames may be used as the input data of step 201. It is to be understood that, those skilled in the art may determine the preset time interval according to practical application requirements, for example, the preset time interval may be a playing time length corresponding to N video frames, where N is a positive integer, and it is to be understood that the specific N and the preset time interval are not limited in the embodiments of the present invention.
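Sampling every N-th frame, as described above, amounts to stepping through frame indices at the preset interval. The sketch below is illustrative; the patent leaves the value of N open.

```python
def sample_frame_indices(total_frames: int, n: int) -> list:
    """Return the indices of frames to feed into step 201, one every n frames."""
    if n <= 0:
        raise ValueError("interval must be a positive integer")
    return list(range(0, total_frames, n))

# e.g. a 100-frame clip sampled every 25 frames
print(sample_frame_indices(100, 25))  # [0, 25, 50, 75]
```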
In this embodiment of the present invention, the text information in the video frame may include: text information included in the image, and/or text information in the subtitle. In practical applications, the process of acquiring text information in a video frame may include: and performing text recognition and/or subtitle recognition on video frames included in the video to obtain text information in the video frames.
In practical applications, a text recognition technology may be used to perform text recognition on the video frames included in a video. Such technologies include OCR (Optical Character Recognition), which, after preprocessing the image (for example, noise reduction), segments the characters in the image into single-character images and recognizes the character corresponding to each of them. It is to be understood that the embodiments of the present invention do not impose limitations on the specific text recognition technique.
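The segment-then-recognize pipeline can be illustrated with a deliberately tiny toy recognizer. The 3x3 glyph templates below are invented for the example; a real system would use an OCR engine such as Tesseract rather than exact template matching.

```python
# Invented 3x3 bitmap templates standing in for a trained character model.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
}

def recognize_cell(cell):
    """Classify one single-character image by exact template match."""
    for char, tpl in TEMPLATES.items():
        if tuple(cell) == tpl:
            return char
    return "?"

def recognize_line(rows):
    """Segment a 3-row bitmap into 3-pixel-wide character cells, then classify each."""
    width = len(rows[0])
    out = []
    for x in range(0, width, 3):
        cell = [row[x:x + 3] for row in rows]
        out.append(recognize_cell(cell))
    return "".join(out)

bitmap = ("010100",
          "010100",
          "010111")
print(recognize_line(bitmap))  # "IL"
```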
In practical application, a subtitle file corresponding to a subtitle of a video frame can be obtained, and text information in the subtitle is obtained from the subtitle file; or, a picture corresponding to the video frame may be subjected to screen capture, and text recognition may be performed on the captured image, so as to obtain text information in the subtitle. It can be understood that the embodiment of the present invention does not limit the specific manner of acquiring the text information in the subtitles.
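Reading subtitle text from a subtitle file can be sketched as below. The cue format here is a simplified pipe-separated stand-in (seconds rather than real SRT timestamps); a production system would parse an actual subtitle format.

```python
def parse_cues(subtitle_file_text: str):
    """Parse 'start|end|text' lines into (start_sec, end_sec, text) cues."""
    cues = []
    for line in subtitle_file_text.strip().splitlines():
        start, end, text = line.split("|", 2)
        cues.append((float(start), float(end), text))
    return cues

def subtitle_at(cues, t: float) -> str:
    """Return the subtitle text covering playback time t, or '' if none."""
    for start, end, text in cues:
        if start <= t < end:
            return text
    return ""

cues = parse_cues("0.0|2.5|I want a wonderful life\n2.5|5.0|Let's go")
print(subtitle_at(cues, 1.0))  # I want a wonderful life
```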
After obtaining the text information in the video frame in step 201, step 202 may obtain the target item matching the text information from a preset item library.
The preset article library can be used for storing a first article, and the first article can also correspond to characteristic information and target information. In practical application, the system and the method can cooperate with an operator to obtain the first article and the corresponding characteristic information and target information thereof.
The feature information of the first article is used for representing the article features of the first article, and can be used as a matching basis for matching with the text information.
The target information is information for adding in the video frame; for example, the target information may be a logo, a picture, and the like of the first item, which attract the user, and for another example, the target information may be an access entry such as a link, so that the user enters the page corresponding to the first item through the access entry.
Examples of the first article may include goods such as clothes, shoes, beverages, and wearable articles; the target information may include target information in picture format, such as logos, display diagrams, and posters, and/or target information in text format. It can be understood that an operator may determine the first article to be recommended and its corresponding target information according to actual application requirements; the embodiment of the present invention does not limit the specific first article or its corresponding target information.
In addition, it is understood that having an operator provide the first article and its corresponding feature information and target information is only an optional embodiment; a person skilled in the art may obtain them in other manners according to actual application requirements. For example, the first article may be obtained from a user's historical behavior data: an interest feature of the user is derived from the historical behavior data, and a first article corresponding to that interest feature is obtained. The interest feature may, for instance, be a feature of goods the user has purchased, and the first article may be an article of the same kind. It is understood that the embodiment of the present invention does not limit the specific manner of obtaining the first article and its corresponding target information.
In an optional embodiment of the present invention, the step 202 of obtaining the target article matched with the text information from the preset article library may include: judging whether the text information includes information matched with feature information corresponding to a first article in the preset article library, or to an article of the same kind as the first article; if so, taking the first article as the target article matched with the text information. By matching the text information against the feature information of the first article or of articles of the same kind, the embodiment of the invention can widen the matching range of the target article.
Optionally, the feature information may include: at least one of a name, a brand, a category, and a slogan. Matching the text information with the feature information may include: all or part of the text information is the same as the characters corresponding to the characteristic information, has the same semantics, similar semantics, related semantics and the like. Optionally, text vectors corresponding to the text information and the feature information may be respectively determined, and semantic similarity may be determined according to a similarity between the two text vectors, which may be understood that the matching between the text information and the feature information and the matching process corresponding to the matching are not limited in the embodiments of the present invention.
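One simple way to realize the text-vector similarity mentioned above is bag-of-words cosine similarity. This is an assumed stand-in for illustration: the threshold value is invented, and a production system would more likely use learned embeddings than raw word counts.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

SIMILARITY_THRESHOLD = 0.3  # illustrative value, not specified by the patent

def feature_matches(text_info: str, feature_info: str) -> bool:
    """Treat the feature information as matched when similarity clears the threshold."""
    return cosine_similarity(text_info, feature_info) >= SIMILARITY_THRESHOLD

print(feature_matches("young will be awake today", "young will be awake"))
```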
In an application example 1 of the present invention, it is assumed that a subtitle corresponding to a video frame includes text information "there are three favorite squirrels", the text information may be matched with feature information, such as a name, a brand, and a category, corresponding to a first article in a preset article library, and since the text information includes information matched with the feature information corresponding to the first article, a target article with the brand of "three squirrels" may be obtained, and a target article with the brand of "good product bunkers" may also be obtained, where the category of the "good product bunkers" is the same as the category of the "three squirrels".
In an application example 2 of the present invention, assuming that the subtitles corresponding to the video frames include text information "i want a wonderful life", the text information may be matched with the advertisement language information corresponding to the first item in the preset item library, and it is assumed that the matching result indicates: the text message matches the advertising language "young will be awake" of a certain beverage, and the beverage can be targeted.
In an application example 3 of the present invention, it is assumed that an image corresponding to a video frame includes text information "GAP", that is, a person in the image wears an article (such as a garment, a hat, a bag, etc.) with a logo of "GAP", the text information may be matched with feature information, such as a name, a brand, a category, etc., corresponding to a first article in a preset article library, and since the text information includes information matched with the feature information corresponding to the first article, a target article with the brand of "GAP" may be obtained, and a target article with the brand of "good clothing library" may also be obtained, where the category of "good clothing library" is the same as or similar to the category of "GAP".
After the target item matched with the text information is acquired from the preset item library in step 202, step 203 may add target information corresponding to the target item to the video frame, so that when a subsequent user watches the video, when the progress of the video reaches the video frame, the target information is displayed to the user.
In an optional embodiment of the present invention, the step 203 of adding the target information corresponding to the target item into the video frame may include: determining a target position in the video frame for adding the target information; adding the target information at a target location in the video frame.
In practical applications, the video frame may be analyzed to obtain a target position suitable for adding the target information from the image of the video frame.
In an alternative embodiment of the present invention, the target position may be a subtitle related position. The subtitle related position may include: a position of a subtitle, or a position around a subtitle. When the target position is a subtitle position, modifying subtitles included in the video frame according to target information so as to add the target information to the subtitles included in the video frame. Alternatively, when the target position is a position around a subtitle, the target information may be added around the subtitle as additional information for the subtitle in the video frame.
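The two subtitle strategies just described (rewriting the subtitle versus appending next to it) can be sketched as below; the bracket formatting and the word being replaced are invented for the example.

```python
def add_to_subtitle(subtitle: str, target_info: str, mode: str) -> str:
    """Add target information at a subtitle-related position, by one of two strategies."""
    if mode == "modify":
        # Rewrite the subtitle itself, e.g. swap a generic word for the target info.
        return subtitle.replace("beverage", target_info)
    if mode == "append":
        # Leave the subtitle intact and attach the target info as additional text.
        return f"{subtitle}  [{target_info}]"
    raise ValueError("mode must be 'modify' or 'append'")

print(add_to_subtitle("Have a beverage", "Cola", "modify"))  # Have a Cola
print(add_to_subtitle("Have a beverage", "Cola", "append"))  # Have a beverage  [Cola]
```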
In an alternative embodiment of the invention, the target location may coincide with the target item, so that the naturalness of the video may be improved. Accordingly, the process of determining the target position in the video frame for adding the target information may include: determining the conformity between the existing object and the target object in the video frame; and acquiring the position of the article with the conformity meeting the preset condition from the existing articles in the video frame as a target position.
In practical application, the feature information of an existing article in the video frame (such as its shape, color, name, category, brand, and target information) may be matched with the feature information of the target article to obtain a degree of conformity between the two. If the conformity meets a preset condition, the position of that existing article in the video frame can be used as the target position. Optionally, meeting the preset condition may include the conformity exceeding a preset threshold, and so on. For example, if the target article "cola" is a beverage in the shape of a can, the position of an object shaped like a can or a bottle in the video frame can be obtained through image analysis and used as the target position. As another example, if the target information of the target article is the logo of a certain brand (e.g., "GAP"), the position of an article of clothing, shoes, or hats in the video frame that matches the logo may be used as the target position; for instance, the style of the clothing may be the casual style associated with "GAP". It may be understood that any position of an article in the video frame that is suitable for adding the logo falls within the scope of the target position in the embodiment of the present invention.
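The conformity computation admits many concrete formulas, and the patent does not fix one; the sketch below assumes Jaccard overlap over feature tags, with the tags, positions, and threshold all invented for illustration.

```python
def conformity(existing_features: set, target_features: set) -> float:
    """Jaccard overlap as an assumed stand-in for the degree of conformity."""
    union = existing_features | target_features
    return len(existing_features & target_features) / len(union) if union else 0.0

def pick_target_position(frame_items: dict, target_features: set, threshold: float = 0.5):
    """frame_items maps an existing article's (x, y) position to its feature tags.
    Return the first position whose conformity meets the preset condition."""
    for position, features in frame_items.items():
        if conformity(features, target_features) >= threshold:
            return position
    return None

frame_items = {
    (120, 340): {"can", "beverage", "red"},
    (40, 80): {"hat", "clothing"},
}
print(pick_target_position(frame_items, {"can", "beverage", "cola"}))  # (120, 340)
```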
In another optional embodiment of the present invention, the target position may be a position corresponding to a preset image target area, where a preset image target is an image target that does not affect the user's viewing. Besides a person and articles worn by the person, the preset image target may be a space such as a wall, the ground, an elevator, or a blue sky, or an article such as furniture. Accordingly, the process of determining the target position in the video frame for adding the target information may include: identifying, in the video frame, a preset image target area suitable for adding the target information, and taking the preset image target area as the target position.
In an application example of the present invention, assuming that a large preset image target area (such as a wall area, a floor area, an elevator area, or a wardrobe area) exists in a certain video frame, the preset image target area may be identified by an image recognition technique, and target information (such as poster information or a display picture) may be inserted into the preset image target area. Generally, a user watching the video barely perceives that the content in the preset image target area is not part of the original video, so the target information can be recommended while reducing both its influence on the video and the user's aversion to it.
Image recognition refers to a technique of processing, analyzing, and understanding an image with a machine in order to recognize image targets of various patterns. In particular, in the embodiments of the present invention, a machine may be used to process, analyze, and understand a video frame to identify image targets of various patterns in it. An image target in a video frame may correspond to a certain image area of the frame, and may include: a person, an article, or a space. For example, the person may be a person appearing in the video frame, the article may be an article worn by that person, and the space may be the environmental space in which the person is located, such as an outdoor environment or an indoor environment.
In an alternative embodiment of the present invention, the process of performing image recognition on the video frames included in the video may include: detecting image targets in the video frames, and analyzing the detected image targets by a deep learning method to obtain corresponding image target information. Accordingly, the image recognition result of the embodiment of the present invention may include: image target information corresponding to the video frame. The image target information may include: the picture of the image target (i.e., its picture in the video frame; an image target usually corresponds to a certain closed region of the frame) and the recognition result for the image target (such as the name and category of the recognized image target). For example, a face detection technique may be used to detect a face in a video frame, and a deep learning method may be used to analyze the face to obtain information such as the gender and age of the person, even the source of the person, such as which movie or television show the person comes from, or which celebrity the person is. Furthermore, articles worn by the person, such as clothes, shoes, watches, and jewelry, may be detected, and the spatial information of the environment in which the person is located may also be detected.
In practical applications, the adding manner adopted by the step 203 to add the target information corresponding to the target item into the video frame may include:
Adding mode 1: modifying, according to the target information, the information corresponding to the target position in the video frame to obtain a modified video frame including the target information; or
Adding mode 2: adding the target information into the video frame as additional information of the corresponding target position in the video frame.
The adding mode 1 can add the target information to the video frame by modifying the information of the corresponding target position in the video frame, so that the information in the video frame can be changed.
According to an embodiment, the modifying of the information of the corresponding target position in the video frame may include: modifying a pixel value corresponding to the target position in the video frame; specifically, replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, where the second pixel value may be determined according to a color value (such as an RGB (Red, Green, Blue) value) of the target information in picture format and/or text format.
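Adding mode 1 can be sketched as a direct pixel replacement. The frame is modeled here as a nested list of RGB tuples purely for illustration; a real implementation would operate on decoded frame buffers.

```python
# Sketch of replacing the first pixel values at the target position with the
# second pixel values corresponding to the target information (e.g., a logo
# rendered as a small RGB patch).

def replace_region(frame, top, left, patch):
    """Overwrite the rectangle of `frame` starting at (top, left) with the
    RGB values of `patch`, returning a modified copy of the frame."""
    out = [row[:] for row in frame]          # copy so the original frame is kept
    for dy, patch_row in enumerate(patch):
        for dx, pixel in enumerate(patch_row):
            out[top + dy][left + dx] = pixel
    return out
```

The modified copy is the "modified video frame including the target information" of adding mode 1.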
According to another embodiment, the modifying the information of the corresponding target position in the video frame may include: and modifying the text information corresponding to the subtitle position in the video frame to modify the text information corresponding to the subtitle position into target information in a text format.
The adding mode 2 may add the target information into the video frame as additional information of a corresponding target position in the video frame, where the additional information may include subtitle information or cover layer information.
The target information in text format may be used as subtitle information corresponding to a target position in the video frame. For example, if a person in the video frame wears a garment, the target information corresponding to a target article (e.g., garment brand A) may be used as subtitle information corresponding to the position of the garment, so as to recommend garment brand A. If the clothing worn by the person in the video frame already carries a brand logo, that logo can be removed through an image processing technique, so as to avoid duplicate brands.
The masking layer refers to a layer with a certain transparency value, and the parameters of the masking layer may include a size, a display position, and a transparency value. The masking layer in the embodiment of the invention can be covered on the video frame, so that the simultaneous display of the masking layer and the video frame can be realized through the parameters of the masking layer. For example, the target information may be displayed by masking at a target location in the video frame while the video frame is displayed. And, in order to reduce the influence of the mask layer on the video frame, the mask layer may be located in the position area where the aforementioned preset image object is located.
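The mask-layer parameters described above (size, display position, transparency value) can be sketched with simple alpha blending. This is an illustrative realization; the blending formula is standard alpha compositing, not a formula stated in the text.

```python
# Sketch of displaying target information through a mask layer covering the
# video frame: the mask (a small grid of RGB tuples) is composited at its
# display position with the given transparency value.

def blend_pixel(frame_pixel, mask_pixel, alpha):
    """Alpha-blend one mask-layer pixel over one frame pixel.
    alpha = 0.0 shows only the frame; alpha = 1.0 shows only the mask."""
    return tuple(round(alpha * m + (1 - alpha) * f)
                 for f, m in zip(frame_pixel, mask_pixel))

def apply_mask(frame, mask, top, left, alpha):
    """Composite `mask` onto `frame` at display position (top, left)."""
    out = [row[:] for row in frame]
    for dy, mask_row in enumerate(mask):
        for dx, mp in enumerate(mask_row):
            out[top + dy][left + dx] = blend_pixel(out[top + dy][left + dx],
                                                   mp, alpha)
    return out
```

With a partial transparency value, both the underlying video frame and the target information remain visible, which matches the "simultaneous display" described above.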
An application example of adding the target information corresponding to the target item to the video frame in the embodiment of the present invention may include:
Application example 1: assuming that the caption corresponding to a video frame includes the text information "there are favorite three squirrels", and assuming that a target article with the brand "good product buns" is obtained by matching, "three squirrels" in the caption may be replaced by "good product buns", and the modified caption information "there are favorite good product buns" is displayed in the resulting video frame.
Application example 2, assuming that the caption corresponding to the video frame includes the text information "i want a wonderful life", and assuming that the text information matches the advertisement "young is striking" of a certain beverage, the beverage can be used as a target object, and a mask layer is provided in the area around the caption (e.g., the upper area), through which the target information corresponding to the target object (the beverage), such as the logo and the advertisement of the beverage, is loaded, and is displayed in the added video frame.
Application example 3: assuming that a person in the image corresponding to a video frame wears an article (such as clothes, a hat, or a bag) with a "GAP" logo, and assuming that a target article with the brand "UNIQLO" is obtained by matching, the logo of the target article (e.g., the UNIQLO logo) may be added at the corresponding target position in the image of the video frame, or the logo of the second article in the video frame may be replaced with the logo of the target article (e.g., the "GAP" logo on the jacket in the video frame is replaced with the "UNIQLO" logo). The addition or replacement of the logo of the target article may be achieved through modification of pixel values or through a mask layer. Moreover, the target position may conform to the logo of the target article; in particular, a logo may cover article positions of any suitable article type; for example, the article types covered by the UNIQLO logo may include: clothing, hats, and the like.
In some embodiments of the present invention, text information in consecutive video frames of a video may also be tracked, so that, according to the tracking result, the target article obtained for certain text information in an earlier video frame may be reused for the same text information in subsequent video frames. This not only reduces the amount of computation required for obtaining the target article, but the repeated appearance of the target article can also deepen the user's memory of it. For example, suppose video frame i (i being a video frame number, an integer greater than or equal to 0) contains the text information "GAP", and the target article corresponding to "GAP" is an article with the brand "UNIQLO". The text information "GAP" may then be tracked; if it still appears in the subsequent video frames i+1, i+2, …, i+M (where M is a positive integer), the "UNIQLO" article may be reused for the text information "GAP" in those frames, until "GAP" disappears in video frame i+M+1. In this way, when the video frames with embedded target information are played, the user sees the target information of the "UNIQLO" article for as long as the text information "GAP" is displayed.
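The reuse across consecutive frames can be sketched as a cache keyed by text information, so the expensive library match runs once per distinct text rather than once per frame. The `match_fn` stand-in for the preset-item-library lookup is an assumption for illustration.

```python
# Sketch of multiplexing a matched target article across consecutive video
# frames that contain the same text information.

class TargetTracker:
    def __init__(self, match_fn):
        self.match_fn = match_fn   # expensive lookup in the preset item library
        self.cache = {}            # text information -> matched target article
        self.lookups = 0

    def target_for(self, text):
        """Return the target article for `text`, reusing the result obtained
        for an earlier frame while the same text persists."""
        if text not in self.cache:
            self.lookups += 1
            self.cache[text] = self.match_fn(text)
        return self.cache[text]
```

When "GAP" persists over frames i, i+1, …, i+M, only the first frame triggers a library lookup; the rest reuse the cached "UNIQLO" match.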
In some embodiments of the present invention, a real-time playing video may be processed, and accordingly, a corresponding first target object may be obtained for a first video frame corresponding to a current playing time, and target information corresponding to the first target object may be added to a second video frame corresponding to a next playing time, where text information in the second video frame may be matched with the first target object.
It should be noted that, under the condition that the consecutive video frames include the same text information, the target object corresponding to the same text information may correspond to a plurality of target information, so that different target information corresponding to the target object may be added to different video frames of the consecutive video frames, and thus, diversity of target information corresponding to the target object may be achieved. For example, the different target information corresponding to the target item may include: the corresponding logo, display picture, poster, even text information, etc. of the same target object.
It should be noted that, after the target article matched with certain text information is obtained, the mapping relationship between the text information and the target article may be recorded, so that, for the same text information in a later video frame, the matched target article can be obtained directly through the mapping relationship. This reduces the amount of computation required for obtaining the target article, and the repeated appearance of the target article can deepen the user's memory of it. For example, if "three squirrels" appears many times in the lines (subtitles) corresponding to the video frames, then after the target article "good product buns" corresponding to "three squirrels" is obtained for the first time, a mapping relationship between "three squirrels" and "good product buns" can be established; the target article "good product buns" matched with subsequent occurrences of "three squirrels" can then be obtained through this mapping relationship.
In other embodiments of the present invention, image recognition may also be performed on a video stream corresponding to a video to obtain corresponding image target information; and/or performing voice recognition on the audio stream corresponding to the video to obtain corresponding text information. And further, a target object matched with the image target information and/or the text information can be obtained from a preset object library, and the target information corresponding to the target object is added into the video frame corresponding to the video stream.
The process of performing image recognition on the video stream corresponding to the video may include: detecting image targets in a video frame, and analyzing the detected image targets by a deep learning method to obtain corresponding image target information; accordingly, the recognition result of the embodiment of the present invention may include: image target information corresponding to the video frame. The image target information may include: the picture of the image target (i.e., its picture in the video frame; an image target usually corresponds to a certain closed region of the frame) and the recognition result for the image target (such as the name and category of the recognized image target). For example, a face detection technique may be used to detect a face in a video frame, and a deep learning method may be used to analyze the face to obtain information such as the gender and age of the person, even the source of the person, such as which movie or television show the person comes from, or which celebrity the person is. Furthermore, articles worn by the person, such as clothes, shoes, watches, and jewelry, may be detected, and the spatial information of the environment in which the person is located may also be detected.
Video is typically composed of still pictures, which are referred to as video frames. The audio stream corresponding to the video can be used to represent a continuous audio signal, and the audio stream and the video frame corresponding to the audio stream can have synchronicity to realize the synchronous playing effect of the video picture and the audio.
In practical applications, the audio stream corresponding to the video may correspond to video content such as a speech, a score, and the score may include: theme music, inserting music, tail music, background music corresponding to the lines and the like. It is understood that the embodiment of the present invention does not impose a limitation on the specific video content corresponding to the audio stream.
In practical applications, the video stream and the Audio stream corresponding to the video may be located in the same file, in this case, the Audio may be extracted from the video file, and specifically, the video file may be converted into an Audio file, for example, the video file in MP4 (Moving Picture Experts Group Audio Layer 4) format may be converted into an Audio file in MP3 (Moving Picture Experts Group Audio Layer III) format, and the like. Or, the video stream and the audio stream corresponding to the video may be located in separate files, that is, the video file and the audio file may be separate, and in this case, the audio file may be directly obtained. The audio file may include an audio stream corresponding to a video, so that the audio stream corresponding to the video may be read from the audio file.
The embodiment of the invention can adopt a voice recognition technology to convert the audio stream corresponding to the video into text information. If the audio stream corresponding to the video is denoted S, a series of processing is performed on S to obtain a corresponding speech feature sequence O, denoted O = {O1, O2, …, Oi, …, OT}, where Oi is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the audio stream S can be regarded as a word string composed of many words, denoted W = {w1, w2, …, wn}. The process of speech recognition is to find the most likely word string W given the known speech feature sequence O.
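The search for the most likely word string given the feature sequence can be written compactly as the standard maximum a posteriori formulation (a conventional restatement using the symbols above, not a formula quoted from the text):

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
      = \arg\max_{W} P(O \mid W)\, P(W),
```

where $P(O \mid W)$ is given by the acoustic model, $P(W)$ by the language model, and $P(O)$ is constant over $W$ and can therefore be dropped from the maximization.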
Specifically, the speech recognition is a model matching process, in which a speech model is first established according to the speech characteristics of a person, and a template required for the speech recognition is established by extracting required features through analysis of an input speech signal; the process of recognizing the voice input by the user is a process of comparing the characteristics of the voice input by the user with the template, and finally determining the best template matched with the voice input by the user so as to obtain a voice recognition result. The specific speech recognition algorithm may adopt a training and recognition algorithm based on a statistical hidden markov model, or may adopt other algorithms such as a training and recognition algorithm based on a neural network, a recognition algorithm based on dynamic time warping matching, and the like.
If the identification result includes image target information, whether a second item which is the same as, similar to or of the same category as a first item in the preset item library is included in the image target information or not can be judged, and if yes, the first item is used as a target item matched with the identification result; and/or under the condition that the identification result comprises text information, judging whether the text information comprises information matched with the first article in the preset article library or the characteristic information corresponding to the same kind of article of the first article, and if so, taking the first article as a target article matched with the text information.
According to the embodiment of the invention, a first article that is the same as, similar to, or of the same category as a second article included in the image target information can be used as the target article, so that the video coverage rate of the target information can be improved. For example, "hat 1" included in the image target information is the same as "hat 2" included in the preset article library; as another example, "suit 1" included in the image target information is similar to "suit 2" included in the preset article library; as yet another example, the preset article library includes "cola" and the image target information includes "sprite", and "cola" and "sprite" both belong to the category of beverages in the shape of a pop can, and so on.
Specifically, the above-mentioned process of determining whether the image target information includes a second item that is the same as, similar to, or the same as the first item in the preset item library may include: matching the characteristic information of the second article included in the image target information with the characteristic information of the first article in the preset article library to obtain a corresponding matching result; if the matching result is successful, determining that the image target information comprises target objects which are the same as, similar to or of the same type as the first object in the preset object library; wherein the feature information may include: at least one of a shape, a color, and a category.
In practical application, the shape of the second object can be determined according to the outline of the second object included in the image target information; and/or, the color of the second article may be determined according to the color value (e.g., RGB (Red, Green, Blue) value) of the second article; and/or analyzing the second object by using a deep learning method to obtain the category of the second object.
Optionally, the process of matching the feature information of the second item included in the image target information with the feature information of the first item in the preset item library may include: and determining the similarity between the characteristic information of the second article included in the image target information and the characteristic information of the first article in the preset article library, and judging whether the similarity meets a preset similarity condition, wherein if so, the corresponding matching result can be that the matching is successful.
For example, the shape and color of the second item included in the image target information may be matched with the shape and color of the first item in the preset item library, and if the matching is successful, the first item may be considered to be matched with the second item. For example, if the shape and color of the clothing included in the image target information corresponding to the video frame of a certain drama are "suit shape 1" and "wine red", respectively, and the shape and color of the first item included in a certain preset item library are "suit shape 2" and "date red", respectively, it can be considered that the clothing included in the image target information is successfully matched with the first item. It is to be understood that, in the embodiment of the present invention, the specific preset similarity condition is not limited, for example, the preset similarity condition may include: the similarity exceeds a similarity threshold, which may be 0.8 or a positive number not exceeding 1.
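The shape-and-color similarity test above can be sketched as follows. The equal weighting of shape and color, the RGB-distance color metric, and the 0.8 threshold mentioned in the text are illustrative choices, not a specification from the patent.

```python
# Sketch of matching the feature information of a second article from the
# image target information against a first article in the preset item
# library, using a preset similarity condition (similarity >= threshold).

def color_similarity(rgb_a, rgb_b):
    """1.0 for identical colors, decreasing toward 0.0 with RGB distance."""
    dist = sum(abs(a - b) for a, b in zip(rgb_a, rgb_b))
    return 1.0 - dist / (3 * 255)

def feature_similarity(item_a, item_b):
    """Average of shape agreement (exact match) and color similarity."""
    shape = 1.0 if item_a["shape"] == item_b["shape"] else 0.0
    return (shape + color_similarity(item_a["color"], item_b["color"])) / 2

def matches(item_a, item_b, threshold=0.8):
    """Preset similarity condition: similarity meets or exceeds threshold."""
    return feature_similarity(item_a, item_b) >= threshold
```

A "wine red" suit and a "date red" suit would score high here (same shape, nearby colors) and be treated as a successful match.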
In an example of the embodiment of the present invention, a first pixel value corresponding to a target position in the video frame may be replaced with a second pixel value corresponding to target information. For example, a first pixel value included in a first image corresponding to a second item in a video frame may be replaced with a second pixel value included in a second image corresponding to a target item of the same category as the second item. Examples of the second article may include: the first beverage in the shape of a can or a bottle, and the target items in the same category as the second product may include: a second beverage in the shape of a can or a bottle, so that the picture of the first beverage in the video frame can be replaced by the picture of the second beverage.
In another example of the embodiment of the present invention, a logo of the target item may be added to a corresponding target position in the video frame, or a logo of the second item in the video frame may be replaced with a logo of the target item. Wherein the addition or replacement of the logo of the target item may be achieved by modification of pixel values or masking. Also, the target location may coincide with the logo of the target item, for example, if the logo of the target item is a logo of a certain brand, then the target location may be a location suitable for adding the logo, specifically, the logo may cover the item location of any item type, etc., for example, the type of item covered by the logo "GAP" may include: clothing, hats, etc., the type of article covered by the logo "NIKE" may include: clothing, shoes, hats, bags, etc.
In yet another example of the embodiment of the present invention, target information corresponding to the target item, such as target information in picture format of a logo, a display image, a poster, and/or target information in text format, may be displayed in the corresponding target position in the video frame through a mask, and the target information displayed through the mask may have a link, so that the user enters a page corresponding to the target item through the link.
In the embodiment of the present application, the number of video frames corresponding to an audio stream may be one or more. In practical applications, the target information corresponding to the target article may be added to all video frames corresponding to the audio stream, or only the target information corresponding to the target article may be added to a part of the video frames corresponding to the audio stream. Optionally, a target video frame suitable for adding target information may be first selected from video frames corresponding to the audio stream, and then target information corresponding to the target item may be added to the target video frame. Optionally, a video frame corresponding to the text information matched with the target object may be used as the target video frame, so that synchronization between the video frame and the target information may be achieved. For example, if the text information matched with the target object is information of a certain passage word in the video, the video frame corresponding to the passage word may be used as the target video frame suitable for adding the target information. Of course, the specific target video frame is not limited in the embodiment of the present invention, for example, it may also be a video frame located after a video frame corresponding to text information matching the target object, and assuming that the text information matching the target object is located at the end of a certain term in the video, a next video frame corresponding to the certain term may be used as the target video frame.
In an optional embodiment of the present invention, the adding target information corresponding to the target item into the video frame corresponding to the audio stream may include: selecting a target video frame suitable for adding target information from video frames corresponding to the audio stream; determining a target position for adding the target information in the target video frame; and adding the target information at a target position in the target video frame.
Wherein the target video frame may include: and video frames corresponding to the text information matched with the target object. Specifically, the selecting a target video frame suitable for adding the target information from video frames corresponding to the audio stream may include: acquiring information matched with the characteristic information of the target object in the identification result as a target identification result; extracting a part corresponding to the target recognition result in the audio stream as target audio; taking a video frame corresponding to the target audio as the target video frame; the recognition result is the text information obtained by voice recognition on the audio stream. In practical application, the audio stream may have a certain length, and the text information as the recognition result may also have a certain length, so that the target recognition result, such as the target text information in the text information, may be first obtained according to the feature information of the target object, and then the target audio in the audio stream is extracted, and then the target video frame corresponding to the target audio is located, wherein the target video frame corresponding to the target audio may be located according to the synchronization between the video stream and the audio stream.
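Locating the target video frames from the target audio relies on the synchronization between the video stream and the audio stream, and can be sketched as a direct time-to-frame-index conversion. The 25 fps default is an assumption for illustration.

```python
import math

# Sketch of taking the video frames corresponding to the target audio as the
# target video frames: given the start/end time of the target audio within
# the audio stream and the frame rate, the synchronized frame indices follow
# directly from the shared time axis.

def target_frames(audio_start_s, audio_end_s, fps=25.0):
    """Indices of the video frames that overlap the target audio span."""
    first = math.floor(audio_start_s * fps)
    last = math.ceil(audio_end_s * fps) - 1
    return list(range(first, last + 1))
```

Target information corresponding to the target article would then be added to each of the returned frames (or a subset of them).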
It should be noted that, when a plurality of target video frames are provided, a target position for adding the target information may be determined for each target video frame; therefore, the problem that the target information is missed by a user due to the fact that the duration corresponding to one target video frame is short can be avoided to a certain extent.
In an optional embodiment of the present invention, the method of the embodiment of the present invention may further include: and modifying the audio stream according to the target information to obtain a modified audio stream matched with the target information. For example, if a certain speech of the video is "three favorite squirrels", and if the target item is "good product shop", the audio corresponding to the speech may be modified to "good product shop" that is favorite.
According to one embodiment, the target information may be speech synthesized to obtain target audio; and replacing the audio matched with the target object in the audio stream by using the target audio, wherein the replaced audio stream is used as a modified audio stream.
The speech synthesis technology is also called Text-to-Speech (TTS) technology, i.e., technology for converting text into speech. Examples of speech synthesis techniques may include: speech synthesis based on Hidden Markov Models (HMMs), known as HTS (HMM-based Speech Synthesis System). The basic idea of HTS is: parametrically decompose the speech signal and establish an HMM corresponding to each acoustic parameter; during synthesis, use the HMMs obtained by training to predict the acoustic parameters of the text to be synthesized, input those acoustic parameters into a parameter synthesizer, and finally obtain the synthesized speech. The acoustic parameters may include: at least one of a spectral parameter and a fundamental frequency parameter.
According to another embodiment, the modifying the audio stream may include: acquiring voice characteristics corresponding to the audio stream; performing voice synthesis on the target information by using the voice characteristics to obtain a target audio; and replacing the audio matched with the target object in the audio stream by using the target audio, wherein the replaced audio stream is used as a modified audio stream. In this embodiment, the voice feature may be utilized to determine an acoustic parameter corresponding to voice synthesis, so that consistency between the audio that is not replaced in the audio stream and the audio that is replaced in the aspect of the voice feature may be achieved.
Optionally, the voice features may include a voiceprint feature, where the voiceprint feature is a sound wave spectrum carrying speech information displayed by an electro-acoustic instrument, and the voiceprint feature has characteristics of specificity and relative stability. The embodiment of the invention carries out voice synthesis of the target information by utilizing the voiceprint characteristics corresponding to the audio stream, can match the synthesized target audio with the original sound corresponding to the audio stream, and realizes the integrity of the video content.
In an optional embodiment of the present invention, time-axis alignment may be performed between the modified audio stream and the audio stream before modification (the original audio stream for short). Time-axis alignment keeps the modified audio stream consistent with the original audio stream along the time axis, so that the modification of the audio stream does not affect audio/video synchronization. Suppose the audio in the original audio stream corresponding to the text information "there are favorite three squirrels" is the first audio, and the audio in the modified audio stream corresponding to the modified text information "there are favorite good product shop" is the second audio; then the time information of the first audio in the original audio stream is consistent with the time information of the second audio in the modified audio stream. Specifically, the durations of the first audio and the second audio may be consistent, and the start time and end time of the first audio in the original audio stream may be consistent with the start time and end time of the second audio in the modified audio stream.
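The alignment constraint can be sketched in two pieces: a time-scale factor that makes the synthesized replacement audio occupy exactly the span of the replaced audio, and a predicate checking the start/end consistency described above. The stretch-factor approach is an assumed, simple realization.

```python
# Sketch of time-axis alignment between the original and modified audio
# streams. Spans are (start_s, end_s) pairs on the stream's time axis.

def stretch_factor(original_duration_s, synthesized_duration_s):
    """Time-scale factor to apply to the synthesized target audio so its
    duration matches the audio segment it replaces."""
    return original_duration_s / synthesized_duration_s

def aligned(first_span, second_span, tolerance=1e-6):
    """True when the replaced (first) and replacement (second) audio
    segments share start and end times on the time axis."""
    return (abs(first_span[0] - second_span[0]) <= tolerance and
            abs(first_span[1] - second_span[1]) <= tolerance)
```

If the synthesized "there are favorite good product shop" runs longer than the original line, a stretch factor below 1.0 compresses it into the original span, preserving audio/video synchronization.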
In an optional embodiment of the present invention, the method may further include: acquiring the geographic region where the device is located and a target language corresponding to the geographic region; translating the text information corresponding to the audio stream into target text information in the target language; and adding the target text information into the video frames corresponding to the audio stream. The embodiment of the invention can machine-translate the text information corresponding to the audio stream (such as lines, lyrics, and the like) according to the geographic region where the user is located, so that users speaking different languages can understand the video content. The granularity of the geographic region may be a country; for example, text information corresponding to the audio stream may be translated from one language (e.g., Chinese) into English for users in European and American regions. Of course, the granularity of the geographic region may also be a province; for example, text information corresponding to an audio stream may be translated from one language (e.g., Chinese) into the dialect of a certain region (e.g., the Northeast dialect, the Sichuan dialect, the Guangdong dialect, etc.).
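A minimal sketch of the region-to-language lookup and translation step described above. The region table, the region codes, and the `translate` stub are illustrative assumptions, not part of the patent:

```python
REGION_LANGUAGE = {
    "US": "en", "GB": "en",      # country-level granularity
    "CN-Sichuan": "zh-sichuan",  # province-level granularity: a dialect
}

def translate(text, target_language):
    # Stand-in for a machine-translation service; a real system would call
    # an MT engine here.
    return f"[{target_language}] {text}"

def localize_subtitle(text, region, default_language="zh"):
    """Pick the target language for the device's region and translate the
    audio stream's text information into it."""
    target = REGION_LANGUAGE.get(region, default_language)
    return translate(text, target)

assert localize_subtitle("你好", "US") == "[en] 你好"
assert localize_subtitle("你好", "CN-Sichuan") == "[zh-sichuan] 你好"
```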
To sum up, the video processing method of the embodiment of the present invention automatically obtains text information in a video frame by machine, obtains a target object matched with the text information from a preset object library, and adds target information corresponding to the target object to the corresponding video frame. The embodiment of the invention can quickly acquire the target object matched with the text information of the video frame without manual intervention, thereby shortening the processing time of the video and improving video processing efficiency.
Moreover, as the video processing time is shortened, the number of videos that can be processed per unit time can grow geometrically, and the machine scale for processing videos can be expanded virtually without limit by means of cluster computing, so that the video coverage rate of the target information can be improved.
Further, the embodiment of the invention performs video processing by adopting a mode of acquiring text information and matching the preset article library, so that under the condition that the information in the preset article library changes, the latest target article and the corresponding target information thereof can be acquired based on the preset article library matching, thereby improving the timeliness of the target information added in the video frame, and even realizing real-time update of the target information to a certain extent.
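The pipeline summarized above can be sketched as follows. The item library contents and the recognition stub are illustrative assumptions; in the actual method the text would come from text/subtitle recognition on the frame image:

```python
ITEM_LIBRARY = {
    "sneaker": "Target info: sports-shoe promotion",
    "cola":    "Target info: beverage promotion",
}

def get_text(frame):
    # Stand-in for OCR / subtitle recognition on the frame image.
    return frame["recognized_text"]

def match_item(text):
    # Return the first library item whose name appears in the frame text.
    return next((item for item in ITEM_LIBRARY if item in text), None)

def process_frame(frame):
    """Extract text, match it against the preset item library, and attach
    the matched item's target information to the frame."""
    item = match_item(get_text(frame))
    if item is not None:
        frame["target_info"] = ITEM_LIBRARY[item]
    return frame

frame = process_frame({"recognized_text": "new cola advert"})
assert frame["target_info"] == "Target info: beverage promotion"
```

Because the library is consulted at processing time, updating `ITEM_LIBRARY` is enough for later frames to pick up the latest target information, which is the timeliness point made above.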
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequences, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 3, a block diagram of a video processing apparatus according to an embodiment of the present invention is shown, which may specifically include: a text information acquisition module 301, a target item acquisition module 302, and a target information adding module 303.
The text information acquiring module 301 is configured to acquire text information in a video frame;
a target item obtaining module 302, configured to obtain a target item matched with the text information from a preset item library;
and a target information adding module 303, configured to add target information corresponding to the target item in the video frame.
Optionally, the text information obtaining module 301 may include:
and the recognition submodule is used for performing text recognition and/or subtitle recognition on video frames included in the video to obtain text information in the video frames.
Optionally, the target item obtaining module 302 may include:
and the judging sub-module is used for judging whether the text information comprises information matched with the characteristic information corresponding to a first article in the preset article library or a similar article of the first article, and if so, taking the first article as the target article matched with the text information.
Optionally, the target information adding module 303 may include:
the target position determining submodule is used for determining a target position for adding target information in the video frame;
and the adding sub-module is used for adding the target information at the target position in the video frame.
Optionally, the target position determination sub-module may include:
a first target position determining unit, configured to determine a degree of conformity between an existing article in the video frame and the target article, and acquire, from the existing articles in the video frame, the position of an article whose degree of conformity meets a preset condition as the target position; and/or,
and the second target position determining unit is used for identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
Optionally, the target position is a subtitle-related position;
the adding sub-module may include:
a subtitle modifying unit, configured to modify the subtitles included in the video frame according to the target information, so as to add the target information to the subtitles included in the video frame; and/or,
and a subtitle adding unit, configured to add the target information around the subtitle in the video frame as additional information of the subtitle, so as to add the target information in the video frame.
Optionally, the target information adding module 303 may include:
the video frame information modification submodule is used for modifying information corresponding to a target position in the video frame according to the target information, so as to obtain a modified video frame including the target information; or
the additional sub-module is used for adding the target information into the video frame as additional information of a corresponding target position in the video frame.
Optionally, the video frame information modification sub-module may include:
the pixel value modifying unit is used for replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, where the second pixel value corresponding to the target information is determined according to the color value of the target information in picture format and/or the target information in text format; and/or,
and the caption text modifying unit is used for modifying the text information corresponding to the caption position in the video frame so as to modify the text information corresponding to the caption position into target information in a text format.
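The pixel-value replacement performed by the modifying unit can be sketched as follows, assuming the frame is represented as a plain 2-D array of pixel values and the second pixel value is a flat color derived from text-format target information:

```python
def replace_pixels(frame, region, second_value):
    """Overwrite every pixel of `frame` (a 2-D list of values) inside
    `region` = (top, left, bottom, right), with bottom/right exclusive,
    with the second pixel value derived from the target information."""
    top, left, bottom, right = region
    for y in range(top, bottom):
        for x in range(left, right):
            frame[y][x] = second_value
    return frame

frame = [[0] * 4 for _ in range(3)]
replace_pixels(frame, (1, 1, 3, 3), 255)
assert frame == [[0, 0, 0, 0],
                 [0, 255, 255, 0],
                 [0, 255, 255, 0]]
```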
Optionally, the apparatus may further include:
the image tracking module is used for carrying out image tracking on image targets in consecutive video frames included in the video;
and the target information multiplexing module is used for multiplexing, for an image target in a subsequent video frame, the target information corresponding to the same image target in a previous video frame according to the image tracking result.
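The multiplexing behavior of these two modules can be sketched as follows, assuming the tracking result is given as a per-frame track identifier (the representation is an illustrative assumption):

```python
def multiplex_target_info(frames, track_ids):
    """frames[i] optionally holds 'target_info'; track_ids[i] identifies the
    tracked image target in frame i. Later frames inherit the target
    information computed for the first frame of the same track, so the
    matching work is done once per image target rather than once per frame."""
    seen = {}
    for frame, tid in zip(frames, track_ids):
        if tid in seen:
            frame["target_info"] = seen[tid]   # reuse earlier result
        elif "target_info" in frame:
            seen[tid] = frame["target_info"]   # remember first result
    return frames

frames = [{"target_info": "promo-A"}, {}, {}]
multiplex_target_info(frames, ["obj1", "obj1", "obj1"])
assert frames[1]["target_info"] == "promo-A"
assert frames[2]["target_info"] == "promo-A"
```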
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Embodiments of the present invention provide an apparatus for video processing, which may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring text information in a video frame; acquiring a target article matched with the text information from a preset article library; and adding target information corresponding to the target article into the video frame.
Optionally, the acquiring text information in a video frame includes:
and performing text recognition and/or subtitle recognition on video frames included in the video to obtain text information in the video frames.
Optionally, the obtaining the target item matched with the text information from a preset item library includes:
and judging whether the text information comprises information matched with the characteristic information corresponding to the first article in the preset article library or the similar article of the first article, and if so, taking the first article as a target article matched with the text information.
Optionally, the adding the target information corresponding to the target item into the video frame includes:
determining a target position for adding target information in the video frame;
adding the target information at the target location in the video frame.
Optionally, the determining a target position in the video frame for adding the target information includes:
determining the conformity between the existing object and the target object in the video frame; acquiring the position of an article with the conformity meeting preset conditions from the existing articles in the video frame as a target position;
and/or,
and identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
Optionally, the target position is a subtitle-related position;
the adding the target information at the target location in the video frame comprises:
modifying the subtitles included in the video frame according to target information so as to add the target information in the subtitles included in the video frame;
and/or,
and adding target information around the subtitle as additional information of the subtitle in the video frame so as to add the target information in the video frame.
Optionally, the adding the target information corresponding to the target item into the video frame includes:
modifying information corresponding to a target position in the video frame according to the target information to obtain a modified video frame comprising the target information; or
And adding the target information into the video frame as additional information of a corresponding target position in the video frame.
Optionally, the modifying information of the corresponding target position in the video frame includes:
replacing a first pixel value corresponding to a target position in the video frame with a second pixel value corresponding to target information, wherein the second pixel value corresponding to the target information is determined according to the color value of the target information in the picture format and/or the target information in the text format;
and/or,
and modifying the text information corresponding to the subtitle position in the video frame to modify the text information corresponding to the subtitle position into target information in a text format.
Optionally, the device is also configured to execute, by the one or more processors, the one or more programs including instructions for: carrying out image tracking on image targets in consecutive video frames included in the video; and multiplexing, for an image target in a subsequent video frame, the target information corresponding to the same image target in a previous video frame according to the image tracking result.
Fig. 4 is a block diagram illustrating an apparatus 900 for video processing as a terminal according to an example embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessment of various aspects of the apparatus 900. For example, the sensor assembly 914 may detect an open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the apparatus 900; the sensor assembly 914 may also detect a change in the position of the apparatus 900 or a component of the apparatus 900, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in the temperature of the apparatus 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communications between the apparatus 900 and other devices in a wired or wireless manner. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium is provided, in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a video processing method, the method comprising: acquiring text information in a video frame; acquiring a target article matched with the text information from a preset article library; and adding target information corresponding to the target article into the video frame.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The video processing method, video processing apparatus, and device for video processing provided by the present invention have been described above in detail. The principles and embodiments of the present invention are explained herein using specific examples, and the descriptions of the above examples are only intended to help understand the method and core idea of the present invention. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (28)

1. A video processing method, comprising:
acquiring text information in a video frame;
acquiring a target article matched with the text information from a preset article library;
adding target information corresponding to the target object into audio and video streams corresponding to the video frames, so as to provide the audio and video streams added with the target information to a video client after receiving a playing request or a downloading request sent by the video client;
adding the target information corresponding to the target object into the audio/video stream corresponding to the video frame comprises the following steps:
and modifying the subtitles and the audio stream included in the audio and video stream according to the target information.
2. The method of claim 1, wherein obtaining the text information in the video frame comprises:
and performing text recognition and/or subtitle recognition on video frames included in the video to obtain text information in the video frames.
3. The method according to claim 1, wherein the obtaining the target item matching the text information from the preset item library comprises:
and judging whether the text information comprises information matched with the characteristic information corresponding to the first article in the preset article library or the similar article of the first article, and if so, taking the first article as a target article matched with the text information.
4. The method according to claim 1, wherein the adding the target information corresponding to the target item to the audio/video stream corresponding to the video frame further comprises:
determining a target position for adding target information in the video frame;
adding the target information at the target location in the video frame.
5. The method of claim 4, wherein the determining the target location in the video frame for adding the target information comprises:
determining the conformity between the existing object and the target object in the video frame; acquiring the position of an article with the conformity meeting preset conditions from the existing articles in the video frame as a target position;
and/or,
and identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
6. The method of claim 4, wherein the target position is a caption-related position;
the adding the target information at the target location in the video frame comprises:
and adding target information around the subtitle as additional information of the subtitle in the video frame so as to add the target information in the video frame.
7. The method according to claim 1, wherein the adding the target information corresponding to the target item to the audio/video stream corresponding to the video frame further comprises:
modifying information corresponding to a target position in the video frame according to the target information to obtain a modified video frame comprising the target information; or
And adding the target information into the video frame as additional information of a corresponding target position in the video frame.
8. The method of claim 7, wherein modifying the information corresponding to the target location in the video frame comprises:
and replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, wherein the second pixel value corresponding to the target information is determined according to the color value of the target information in the picture format and/or the target information in the text format.
9. The method according to any one of claims 1 to 8, further comprising:
carrying out image tracking on image targets in continuous video frames included in the video;
and multiplexing target information corresponding to the same image target in the previous video frame aiming at the image target in the subsequent video frame according to the image tracking result.
10. A video processing apparatus, comprising:
the text information acquisition module is used for acquiring text information in the video frame;
the target article acquisition module is used for acquiring a target article matched with the text information from a preset article library; and
the target information adding module is used for adding target information corresponding to the target object into audio and video streams corresponding to the video frames so as to provide the audio and video streams added with the target information to a video client after receiving a playing request or a downloading request sent by the video client;
the target information adding module comprises:
and the subtitle audio modifying submodule is used for modifying the subtitle and the audio stream included in the audio and video stream according to the target information.
11. The apparatus of claim 10, wherein the text information obtaining module comprises:
and the recognition submodule is used for performing text recognition and/or subtitle recognition on a video frame included in the video to obtain text information in the video frame.
12. The apparatus of claim 10, wherein the target item acquisition module comprises:
and the judging sub-module is used for judging whether the text information comprises information matched with the first article in the preset article library or the characteristic information corresponding to the similar article of the first article, and if so, taking the first article as a target article matched with the text information.
13. The apparatus of claim 10, wherein the target information adding module further comprises:
the target position determining submodule is used for determining a target position for adding target information in the video frame;
and the adding sub-module is used for adding the target information at the target position in the video frame.
14. The apparatus of claim 13, wherein the target location determination sub-module comprises:
a first target position determining unit, configured to determine a degree of conformity between an existing article in the video frame and the target article, and acquire, from the existing articles in the video frame, the position of an article whose degree of conformity meets a preset condition as the target position; and/or,
and the second target position determining unit is used for identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
15. The apparatus of claim 13, wherein the target position is a caption-related position;
the adding submodule comprises:
and a subtitle adding unit for adding target information around the subtitle as additional information of the subtitle in the video frame to add the target information in the video frame.
16. The apparatus of claim 10, wherein the target information adding module further comprises:
the video frame information modification submodule is used for modifying the information corresponding to the target position in the video frame according to the target information so as to obtain a modified video frame comprising the target information; or
And the additional sub-module is used for adding the target information into the video frame as additional information of a corresponding target position in the video frame.
17. The apparatus of claim 16, wherein the video frame information modification sub-module comprises:
and the pixel value modifying unit is used for replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, and the second pixel value corresponding to the target information is determined according to the color value of the target information in the picture format and/or the target information in the text format.
18. The apparatus of any one of claims 10 to 17, further comprising:
the image tracking module is used for carrying out image tracking on image targets in continuous video frames included in the video;
and the target information multiplexing module is used for multiplexing target information corresponding to the same image target in the previous video frame aiming at the image target in the subsequent video frame according to the image tracking result.
19. An apparatus for video processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring text information in a video frame;
acquiring a target item matched with the text information from a preset item library;
adding target information corresponding to the target item into an audio/video stream corresponding to the video frame, so that, after a playing request or a downloading request sent by a video client is received, the audio/video stream to which the target information has been added is provided to the video client;
wherein adding the target information corresponding to the target item into the audio/video stream corresponding to the video frame comprises:
and modifying the subtitles and the audio stream included in the audio and video stream according to the target information.
20. The apparatus of claim 19, wherein the obtaining text information in a video frame comprises:
and performing text recognition and/or subtitle recognition on video frames included in the video to obtain text information in the video frames.
21. The apparatus of claim 19, wherein the obtaining of the target item matching the text information from the preset item library comprises:
and judging whether the text information comprises information matching feature information corresponding to a first item in the preset item library or a similar item of the first item, and if so, taking the first item as the target item matched with the text information.
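As a rough illustration of the matching step described in claim 21, the sketch below matches recognized text against feature keywords of each library item and of its similar items. The library structure, item names, and exact keyword-overlap test are assumptions made for the sketch, not part of the claims:

```python
from typing import Optional

# Hypothetical item library: each entry maps an item name to its feature
# keywords and to the feature keywords of similar items (all illustrative).
ITEM_LIBRARY = {
    "sports shoes": {"features": {"sneaker", "running", "trainers"},
                     "similar": {"sandals"}},
    "smartphone": {"features": {"phone", "mobile", "handset"},
                   "similar": set()},
}


def match_target_item(text_info: str) -> Optional[str]:
    """Return the first item whose feature keywords (or those of a similar
    item) appear in the recognized text; None when nothing matches."""
    words = set(text_info.lower().split())
    for item, entry in ITEM_LIBRARY.items():
        if words & (entry["features"] | entry["similar"]):
            return item
    return None
```

A production matcher would presumably use fuzzier text matching than exact keyword overlap, but the claim only requires that matched feature information identify the target item.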
22. The apparatus according to claim 19, wherein the adding target information corresponding to the target item to the audio/video stream corresponding to the video frame further comprises:
determining a target position for adding target information in the video frame;
adding the target information at the target location in the video frame.
23. The apparatus of claim 22, wherein the determining the target location in the video frame for adding the target information comprises:
determining a degree of conformity between objects existing in the video frame and the target item; acquiring, from the objects existing in the video frame, the position of an object whose degree of conformity meets a preset condition as the target position; and/or
and identifying a preset image target area suitable for adding the target information in the video frame, and taking the preset image target area as the target position.
24. The apparatus of claim 22, wherein the target location is a caption-related location;
the adding the target information at the target location in the video frame comprises:
and adding the target information around the subtitle as additional information of the subtitle, so as to add the target information in the video frame.
25. The apparatus according to claim 19, wherein the adding target information corresponding to the target item to the audio/video stream corresponding to the video frame further comprises:
modifying information corresponding to a target position in the video frame according to the target information to obtain a modified video frame comprising the target information; or
and adding the target information into the video frame as additional information of a corresponding target position in the video frame.
26. The apparatus of claim 25, wherein the modifying the information corresponding to the target location in the video frame comprises:
and replacing a first pixel value corresponding to the target position in the video frame with a second pixel value corresponding to the target information, wherein the second pixel value corresponding to the target information is determined according to the color value of the target information in the picture format and/or the target information in the text format.
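The pixel replacement of claim 26 amounts to a simple overlay: the "first" pixel values at the target position are replaced by the "second" pixel values derived from the target information (a picture, or text rendered to pixels). The row-major list-of-RGB-tuples frame representation below is an assumption chosen to keep the sketch self-contained:

```python
def overlay_target_info(frame, target_pixels, top, left):
    """Replace the first pixel values at the target position with the second
    pixel values derived from the target information. Both frame and
    target_pixels are row-major lists of RGB tuples."""
    out = [row[:] for row in frame]  # copy rows; leave the source frame untouched
    for dy, src_row in enumerate(target_pixels):
        for dx, pixel in enumerate(src_row):
            out[top + dy][left + dx] = pixel
    return out
```

In a real pipeline the same operation would be done on decoded frame buffers (e.g. with NumPy slicing or a video filter) before re-encoding the stream.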
27. The apparatus of any of claims 19-26, wherein the one or more programs further include instructions, to be executed by the one or more processors, for:
carrying out image tracking on image targets in continuous video frames included in the video; and multiplexing target information corresponding to the same image target in the previous video frame aiming at the image target in the subsequent video frame according to the image tracking result.
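Claims 18 and 27 describe reusing target information for an image target that tracking shows to be the same across consecutive frames. A minimal sketch of that multiplexing step, assuming a per-frame dict of tracked target ids (the tracking itself, e.g. an OpenCV tracker, is outside the sketch):

```python
def multiplex_target_info(frames):
    """frames: list of dicts mapping tracked target ids to target info,
    with None where info has not yet been computed for that frame.
    Returns frames with info for a repeated target id carried forward
    from the earlier frame in which it was computed."""
    carried = {}  # target id -> most recent target info
    result = []
    for frame in frames:
        merged = {}
        for target_id, info in frame.items():
            if info is None and target_id in carried:
                info = carried[target_id]  # reuse the earlier frame's info
            merged[target_id] = info
            if info is not None:
                carried[target_id] = info
        result.append(merged)
    return result
```

The point of the multiplexing is that the (comparatively expensive) text recognition and item matching run once per tracked target, not once per frame.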
28. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a video processing method as claimed in one or more of claims 1 to 9.
CN201710737845.2A 2017-08-24 2017-08-24 Video processing method and device for video processing Active CN109429084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710737845.2A CN109429084B (en) 2017-08-24 2017-08-24 Video processing method and device for video processing


Publications (2)

Publication Number Publication Date
CN109429084A CN109429084A (en) 2019-03-05
CN109429084B true CN109429084B (en) 2022-03-29

Family

ID=65500402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710737845.2A Active CN109429084B (en) 2017-08-24 2017-08-24 Video processing method and device for video processing

Country Status (1)

Country Link
CN (1) CN109429084B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111615007A (en) * 2020-05-27 2020-09-01 北京达佳互联信息技术有限公司 Video display method, device and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101072340A (en) * 2007-06-25 2007-11-14 孟智平 Method and system for adding advertising information in flow media
CN103442295A (en) * 2013-08-23 2013-12-11 天脉聚源(北京)传媒科技有限公司 Method and device for playing videos in image
CN105681918A (en) * 2015-09-16 2016-06-15 乐视致新电子科技(天津)有限公司 Method and system for presenting article relevant information in video stream
CN106303621A (en) * 2015-06-01 2017-01-04 北京中投视讯文化传媒股份有限公司 The insertion method of a kind of video ads and device
CN106507175A (en) * 2016-11-09 2017-03-15 北京小米移动软件有限公司 Method of video image processing and device
CN106658196A (en) * 2017-01-11 2017-05-10 北京小度互娱科技有限公司 Method and device for embedding advertisement based on video embedded captions

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283384A1 (en) * 2006-05-31 2007-12-06 Sbc Knowledge Ventures, Lp System and method of providing targeted advertisements
CN103873741A (en) * 2014-04-02 2014-06-18 北京奇艺世纪科技有限公司 Method and device for substituting area of interest in video
CN104883514B (en) * 2015-05-11 2018-11-23 北京金山安全软件有限公司 Video processing method and device
CN104918107B (en) * 2015-05-29 2018-11-02 小米科技有限责任公司 The identification processing method and device of video file


Also Published As

Publication number Publication date
CN109429084A (en) 2019-03-05

Similar Documents

Publication Publication Date Title
CN109429078B (en) Video processing method and device for video processing
WO2019037615A1 (en) Video processing method and device, and device for video processing
CN109637518B (en) Virtual anchor implementation method and device
US8442389B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
CN109429077B (en) Video processing method and device for video processing
CN110517185B (en) Image processing method, device, electronic equipment and storage medium
CN111638832A (en) Information display method, device, system, electronic equipment and storage medium
CN109168062B (en) Video playing display method and device, terminal equipment and storage medium
CN109474850B (en) Motion pixel video special effect adding method and device, terminal equipment and storage medium
CN113099297B (en) Method and device for generating click video, electronic equipment and storage medium
CN111368127B (en) Image processing method, image processing device, computer equipment and storage medium
CN112235635B (en) Animation display method, animation display device, electronic equipment and storage medium
CN110990534B (en) Data processing method and device for data processing
CN111954063A (en) Content display control method and device for video live broadcast room
CN112291614A (en) Video generation method and device
CN109429084B (en) Video processing method and device for video processing
KR20130096983A (en) Method and apparatus for processing video information including face
CN110162710B (en) Information recommendation method and device under input scene
CN116229311B (en) Video processing method, device and storage medium
CN113923517B (en) Background music generation method and device and electronic equipment
CN111524518B (en) Augmented reality processing method and device, storage medium and electronic equipment
CN110662103B (en) Multimedia object reconstruction method and device, electronic equipment and readable storage medium
CN113709548A (en) Image-based multimedia data synthesis method, device, equipment and storage medium
CN111355999A (en) Video playing method and device, terminal equipment and server
CN108334806B (en) Image processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant