WO2019020061A1 - Video line processing method, client, server and storage medium

Video line processing method, client, server and storage medium

Info

Publication number
WO2019020061A1
WO2019020061A1 (application PCT/CN2018/097089)
Authority
WO
WIPO (PCT)
Prior art keywords
video
line
text
speech
interface
Prior art date
Application number
PCT/CN2018/097089
Other languages
English (en)
French (fr)
Inventor
陈姿
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019020061A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Definitions

  • The present application relates to the field of Internet technologies, and in particular to a video line processing method, a video client, a video server, and a computer-readable storage medium.
  • The examples of the present application provide a video line processing method.
  • The method includes: extracting, when a video line processing request sent by a video client is received, the video identifier and time information carried in the request; acquiring, from the video data corresponding to the video identifier, a frame image corresponding to the time information; identifying the line text from the frame image; and
  • sending the recognized line text to the video client.
  • Identifying the line text from the frame image may include: detecting a character region in the frame image; removing the background in the detected character region; extracting a character sequence from the character region after background removal, the character sequence including one or more character images; and performing text recognition on the one or more character images to obtain the line text.
  • Identifying the line text from the frame image may further include:
  • preprocessing the frame image before the character region in it is detected.
  • The preprocessing may include at least one of smoothing, layout analysis, and tilt correction.
  • Removing the background in the detected character region may include performing binarization on the detected character region; and extracting the character sequence from the character region after background removal may include: segmenting the binarized character region into characters according to the pixel value of each pixel in that region, to obtain the character sequence.
  • Identifying the line text from the frame image may further include:
  • post-processing the recognized line text according to language and syntax constraints.
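The binarization and character-segmentation steps above can be sketched in a few lines. This is an illustrative NumPy-only sketch, not the patent's implementation: the function names are invented here, and it assumes bright subtitle glyphs on a darker background with a fixed threshold standing in for an adaptive method such as Otsu's.

```python
import numpy as np

def binarize(region, threshold=128):
    # Map glyph pixels to 1 and background pixels to 0.
    # Assumes bright text on a darker backdrop (fixed threshold).
    return (region > threshold).astype(np.uint8)

def segment_characters(binary):
    # Character segmentation by pixel values, as in the claim:
    # count ink pixels per column and cut the strip at the blank
    # columns separating adjacent glyphs.
    col_ink = binary.sum(axis=0)
    chars, start = [], None
    for x, ink in enumerate(col_ink):
        if ink > 0 and start is None:
            start = x                         # a glyph begins
        elif ink == 0 and start is not None:
            chars.append(binary[:, start:x])  # the glyph ends
            start = None
    if start is not None:                     # glyph touches right edge
        chars.append(binary[:, start:])
    return chars
```

Each returned sub-image would then be passed to the character recognition stage; touching glyphs and proportional fonts need more careful cutting than this gap-based rule.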
  • The examples of the present application further provide a video line processing method.
  • The method includes: sending, in response to an operation on a video line control in a video playing interface, a video line processing request carrying a video identifier and time information to a video server; displaying, upon receiving the line text sent by the video server, a line operation interface containing the line text; and
  • processing the line text correspondingly in response to an operation on the line operation interface.
  • The video line processing request may be a video line sharing request; the line operation interface then further includes information of one or more candidate sharing platforms and/or comment areas.
  • Processing the line text in response to an operation on the line operation interface may include: displaying, in response to a selection of a sharing platform in the line operation interface when the selected platform is in a logged-in state, an information publishing interface of the selected sharing platform containing the line text; and
  • publishing the line text to the selected sharing platform in response to a publishing operation on the information publishing interface of the selected sharing platform.
  • Processing the line text in response to an operation on the line operation interface may further include: displaying, when the selected sharing platform is not logged in, a login interface of the selected sharing platform; and
  • logging in to the selected sharing platform in response to a login operation on that login interface.
  • Processing the line text in response to an operation on the line operation interface may also include:
  • publishing the line text to a selected comment area in response to a selection of the comment area in the line operation interface.
  • The line text may be displayed in an editable text box of the line operation interface, and may be edited in response to operations on the text box.
  • The examples of the present application further provide a video server.
  • The video server includes:
  • an information extraction module configured to extract, when a video line processing request sent by a video client is received, the video identifier and time information carried in the processing request;
  • an image acquisition module configured to acquire, from the video data corresponding to the video identifier, a frame image corresponding to the time information;
  • a line recognition module configured to identify line text from the frame image; and
  • a line sending module configured to send the recognized line text to the video client.
  • The line recognition module may include:
  • a region detection unit configured to detect a character region in the frame image;
  • a background removal unit configured to remove the background in the detected character region;
  • a character extraction unit configured to extract a character sequence from the character region after background removal, the character sequence including one or more character images; and
  • a character recognition unit configured to perform text recognition on the one or more character images in the extracted character sequence to obtain the line text.
  • The line recognition module may further include:
  • a preprocessing unit configured to preprocess the frame image before the region detection unit detects the character region in it.
  • The preprocessing may include at least one of smoothing, layout analysis, and tilt correction.
  • The background removal unit may be specifically configured to binarize the detected character region, and the character extraction unit may be specifically configured to segment the binarized character region into characters according to the pixel value of each pixel in that region, to obtain the character sequence.
  • The line recognition module may further include:
  • a post-processing unit configured to post-process the line text according to language and syntax constraints.
  • The examples of the present application further provide a video client.
  • The video client includes:
  • a request sending module configured to send, in response to an operation on a video line control in a video playing interface, a video line processing request carrying a video identifier and time information to a video server, so that the video server recognizes line text from the frame image corresponding to the video identifier and the time information;
  • an interface display module configured to display, when the line text sent by the video server is received, a line operation interface containing the line text; and
  • a line processing module configured to process the line text correspondingly in response to operations on the line operation interface.
  • The video line processing request may be a video line sharing request; the line operation interface then further includes information of one or more candidate sharing platforms and/or comment areas.
  • The line processing module may be specifically configured to: display, in response to a selection of a sharing platform in the line operation interface when the selected platform is in a logged-in state, an information publishing interface of the selected sharing platform containing the line text; and publish the line text to the selected sharing platform in response to a publishing operation on that interface.
  • The line processing module may be further configured to: display, in response to a selection of a sharing platform in the line operation interface when the selected platform is not logged in, a login interface of the selected sharing platform; and log in to the selected sharing platform in response to a login operation on that login interface.
  • The line processing module may be further configured to publish the line text to a selected comment area in response to a selection of the comment area in the line operation interface.
  • The line text may be displayed in an editable text box of the line operation interface, and the line processing module may be further configured to edit the line text in response to operations on the editable text box.
  • The examples of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above methods.
  • With these solutions, the user only needs to click the video line control in the video playing interface; the video server recognizes the line text from the corresponding frame image and feeds it back to the video client, where the user can operate on it.
  • The corresponding processing of video lines is thus achieved without the user having to type the lines in manually, which is very convenient.
  • FIG. 1 is a system architecture diagram related to an example of the present application;
  • FIG. 2 is a schematic flow chart of a video line processing method in an example of the present application;
  • FIG. 3 is a schematic diagram of a video playing interface in an example of the present application;
  • FIG. 4 is an enlarged schematic view of the video line sharing control 301 of FIG. 3;
  • FIG. 5 is a schematic diagram of a video playing interface in an example of the present application;
  • FIG. 6 is a schematic diagram of a line operation interface in an example of the present application;
  • FIG. 7 is a schematic flow chart of a video line processing method in an example of the present application;
  • FIG. 8 is a schematic diagram of interaction between a user, a video client, and a video server in an example of the present application;
  • FIG. 9 is a structural block diagram of a video client in an example of the present application;
  • FIG. 10 is a structural block diagram of a video server in an example of the present application;
  • FIG. 11 is a schematic structural diagram of a computing device in an example of the present application.
  • The present application proposes a video line processing method; the system architecture to which the method applies is shown in FIG. 1.
  • The system architecture includes a client device 101, a video server 102, and the Internet 103.
  • The client device 101 is connected to the video server 102 via the Internet 103. Specifically:
  • The client device 101 may be a user's smartphone or computer, on which the clients of various application software are installed; the user can log in to and use these clients through the client device.
  • Such a client may be the client of multimedia software, for example a video client.
  • The video server 102 may be a single server or a server cluster, and provides video playback services for the client device.
  • The Internet 103 may include wired and wireless networks.
  • The inventor of the present application observed that while watching a movie on the client device 101, a user may come across lines that he likes or is moved by. The user may then want to share those lines to the comment area of the video client, post them to social platforms such as the WeChat circle of friends, Weibo, QQ space, or QQ friend feeds, or copy and paste them into a document of his choice. In one possible implementation the user types the lines in manually and then shares them, but this way of operating is not very convenient.
  • The method includes:
  • S201: sending, in response to an operation on the video line control in the video playing interface, a video line processing request carrying a video identifier and time information to the video server, so that
  • the video server recognizes the line text from the frame image corresponding to the video identifier and the time information.
  • The video identifier is used to distinguish different video files or video streams and may be allocated by the video server; different video files or video streams correspond to different video identifiers.
  • For example, the video of the movie "The Shawshank Redemption" may be identified as a1,
  • the video of the movie "Dead Poets Society" as b1,
  • the video of episode 12 of the TV series "Parental Love" as c1_12, and
  • the video of episode 20 of the TV series "Hidden" as d1_20.
  • The time information may be the playing time point of the current video (also referred to as the playing progress or playing position). For example, a movie lasting 90 minutes is composed of many frame images, and different playing time points correspond to different frame images.
  • The video client in the client device carries the time information in the video line processing request, so that the video server knows which frame image of the video corresponding to the video identifier the user wants the line processing performed on.
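For a constant-frame-rate video, the mapping from a playing time point to a frame image is a simple calculation. The patent leaves this mapping unspecified, so the constant frame rate below is an assumption for illustration only:

```python
def frame_index_at(position_seconds, fps=25.0):
    # A 90-minute film at 25 fps has 135,000 frames;
    # the playing position selects one of them.
    return int(round(position_seconds * fps))
```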
  • The current video may be the video currently being played by the video client in the client device.
  • The video line control is a UI (User Interface) control in the video playing interface for triggering a line processing request.
  • The video line control can appear in the playing interface as a graphic button, a menu option, and so on.
  • When the user clicks the control, the video client performs the corresponding operation.
  • For example, if the video line control is a video line sharing control, then when the user clicks it, the video client sends
  • a video line sharing request (a form of the video line processing request above) to the video server.
  • That is, the video client sends the video identifier and the time information to the video server; after receiving the processing request, the video server acquires the corresponding image frame according to the video identifier and the time information.
  • Alternatively, the video client may send a video line processing request that directly carries the frame image to be processed, so that the video server performs line recognition on the frame image in the request and obtains the line text.
  • In that case, while the video playing interface of the video client is playing a video,
  • the video client obtains the currently played frame image in response to the operation on the video line control, carries the acquired frame image in the processing request, and sends it to the video server.
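Either request variant can be pictured as a small message. The field names below are illustrative assumptions — the patent specifies only what the request carries (identifier and time information, or the frame itself), not its wire format:

```python
import json

def build_line_request(video_id, position_seconds, frame_bytes=None):
    # Variant 1: carry the video identifier and time information and
    # let the server fetch the frame. Variant 2: carry the currently
    # played frame image itself (hex-encoded here as a placeholder).
    msg = {"type": "line_request",
           "video_id": video_id,          # e.g. "a1" or "c1_12"
           "position": position_seconds}  # playback progress
    if frame_bytes is not None:
        msg["frame"] = frame_bytes.hex()  # server decodes and runs OCR
    return json.dumps(msg)
```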
  • As shown in FIG. 3, a video line sharing control 301 is disposed in the video playing interface of the video client.
  • If the user wants to share the lines in the current video playing interface to a certain social platform (for example, the WeChat circle of friends), the user can click the video line sharing control 301.
  • The video client then sends a video line processing request to the video server 102, so that the video server 102 obtains the video identifier and time information from the request and determines the corresponding frame image accordingly.
  • For an enlarged schematic diagram of the video line sharing control in FIG. 3, reference may be made to FIG. 4; of course, icons of other shapes may also be used as the video line control.
  • The video line processing request is not limited to requests for sharing lines; it may also request other processing of video lines, such as editing them (e.g., copying or modifying).
  • The line operation interface displayed by the video client after receiving the line text sent by the server may take various forms, and different video line processing requests correspond to different line operation interfaces. For example, if the video client sends a video line sharing request in step S201, the line operation interface displayed upon receiving the line text may further include information of one or more candidate sharing platforms and/or comment areas, for the user to choose a sharing platform or comment area.
  • After the video server recognizes the lines in the video playing interface shown in FIG. 3, the recognized line text is fed back to the video client, and after receiving it the video client displays the line operation interface shown in
  • FIG. 5.
  • The line text is displayed in the text box 501.
  • The line operation interface further includes icons of several sharing platforms and/or comment areas: a WeChat icon 502, a Tencent QQ icon 503, a Weibo icon 504,
  • and a comment area icon 505. Each icon is the control corresponding to a sharing platform or comment area. The WeChat icon 502 corresponds to the circle of friends in the WeChat platform; when triggered, it opens the
  • information publishing interface of the WeChat circle of friends with the line text displayed in it. The Tencent QQ icon 503 corresponds to the QQ space or friend feed in the QQ platform; when triggered, it opens the QQ space or friend-feed
  • publishing interface with the line text displayed in it. The Weibo icon 504 corresponds to the publishing interface of the Weibo platform; when triggered, it opens the Weibo information publishing interface.
  • The comment area icon 505 in the line operation interface shown in FIG. 5 corresponds to the area of the current video client where comments are posted; when triggered, it opens the comment area below the video playing interface and displays the line text there.
  • A cancel button 506 is also disposed in the line operation interface shown in FIG. 5, for cancelling the current sharing behavior and returning to the video playing interface.
  • The video client performs different processing in response to different operations on the line operation interface. Taking FIG. 5 as an example again, step S203 is illustrated as follows:
  • If the user wants to post the line text to the WeChat circle of friends, he can click the WeChat icon 502; the video client then displays the information publishing interface of the WeChat circle of friends with the line text shown in it. When the user clicks send, the video client posts the line text to the circle of friends in response, after which the user and the user's friends can see the line text the user published there. If the user wants to publish the line text to the QQ space or friend feed, he can click the Tencent QQ icon 503.
  • The video client then displays the QQ space or friend-feed publishing interface with the line text shown in it; when the user clicks send, the video client publishes the line text there in response, after which the user and the user's friends can see it. Likewise, if the user wants to post the line text to Weibo, he can click the Weibo icon 504: the video client displays the Weibo information publishing interface with the line text shown in it, and posts the line text to Weibo in response when the user clicks send.
  • The user and the user's friends can then see the published line text on Weibo. Similarly, to publish the line text in the comment area of the video client, the user clicks the comment area icon 505, and in response the video client posts the line text in the text box to the comment area below the video playing interface.
  • If, after the video client displays the line operation interface,
  • the user no longer wants to share or publish the line,
  • he can click the cancel button 506 in the line operation interface, and in response the video client returns to
  • the video playing interface so that the user can continue watching the video.
  • The processing of video lines is not limited to sharing: the user may simply edit the line text fed back by the video server without sharing it, or
  • edit the line text returned by the video server first and then share it.
  • To this end, the text box 501 can be configured as an editable text box.
  • The video client edits the line text in response to the user's operations on the editable text box. For example, the user can modify the line text (say, delete the English in FIG. 5 or add an expression), then copy the modified text and paste it into a Word or plain-text document, or share the modified line text on a social platform.
  • FIG. 5 shows only one form of the line operation interface;
  • the line operation interface can also take other forms.
  • For example, several virtual editing keys may be arranged below the text box,
  • such as a copy key, a paste key, an expression add key, and a background setting key; different keys perform different editing operations on the line text.
  • As shown in FIG. 6, in addition to the text box 601, the WeChat icon 602, the Tencent QQ icon 603, the Weibo icon 604, the comment area icon 605, and the cancel key 606, there are a copy key 607, a paste key 608, and an expression add key 609.
  • The video client copies the line text in the text box 601 to other files in response to a click on the copy key 607.
  • In response to a click on the paste key 608, the video client pastes content previously copied from other files into the text box.
  • In response to a click on the expression add key 609, the video client adds expressions to the text box.
  • With the video line processing method provided by this example of the present application, the user only needs to click the video line control in the video playing interface; the video server recognizes the line text from the corresponding frame image and feeds it back to the video client. The user can then operate on the line operation page of the video client to have the video lines processed accordingly, without having to type the lines in manually, which is very convenient.
  • As described above, the line operation interface may include one or more candidate sharing platforms for the user to choose from; two cases then arise.
  • Case 1: the selected sharing platform is in the logged-in state.
  • When the user selects a sharing platform on the line operation interface, the video client responds to the selection and checks whether the user is logged in to that platform. If the platform is logged in, the video client directly displays the line text in the information publishing interface of the selected sharing platform; if the user then performs the publishing operation, the video client posts the line text to the selected sharing platform in response.
  • Case 2: the selected sharing platform is not logged in. When the user selects a sharing platform on the line operation interface, the video client responds to the selection and checks whether the user is logged in to that platform. If the platform is not logged in, the video client displays the login interface of the selected sharing platform; after the user enters correct login information, the video client logs in to the selected sharing platform in response to the login operation, and the information can then be published.
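The two cases reduce to a login check before showing the publishing interface. A minimal sketch — the function name, session structure, and return values are illustrative, not the patent's API:

```python
def on_platform_selected(platform, line_text, logged_in):
    # Case 1: already logged in -> show the information publishing
    # interface pre-filled with the line text.
    if logged_in.get(platform, False):
        return ("publish_interface", line_text)
    # Case 2: not logged in -> show the platform's login interface
    # first; publishing proceeds only after a successful login.
    return ("login_interface", platform)
```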
  • The above is the video line processing method performed by the video client. Correspondingly, the examples of the present application further provide a video line processing method performed by the video server 102; as shown in FIG. 7, the method includes:
  • The video server can receive the video line processing request in various ways, one of which is real-time monitoring:
  • when data whose destination address is the video server is detected, the data is received,
  • and it is then determined from the information it carries that the data is a video line processing request.
  • The video server transmits a video stream to the video client, so that the user can watch, on the video client, the video composed of one frame image after another.
  • One way to obtain, in step S702, the frame image corresponding to the time information from the video data corresponding to the video identifier is for the video server to extract that frame image from the current video stream.
  • The extraction method is not limited to this: the video server may also obtain the frame image from the video file corresponding to the video identifier, the video file in this case being a static video file.
  • The video server then searches its database or the network for the video file corresponding to the video identifier and extracts the frame image corresponding to the time information from that file. Either way works, as long as the frame image corresponding to the video identifier and time information is obtained.
  • Steps S701 and S702 above are described for the case where the video line processing request carries the video identifier and time information. As noted, the processing request sent by the video client may instead directly carry
  • the frame image, in which case the video server extracts the frame image from the processing request sent by the video client.
  • The recognition in step S703 refers to the process of converting the characters, which exist in the frame image in bitmap form, into text.
  • Image recognition technology is used to identify the line text in the frame image;
  • the specific method is not limited, as long as the line text in the frame image can be recognized.
  • In summary, when the video server receives a video line processing request, it extracts the video identifier and time information from the request, obtains the corresponding frame image according to them, recognizes the line text from the frame image, and finally sends the line text to the video client so that the client can process it accordingly. Lines can thus be shared, edited, and otherwise processed without the user manually typing in the video lines, improving the convenience of line processing.
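Steps S701–S703 amount to a short server-side handler. In this sketch, `get_frame` and `recognize` stand in for the frame-extraction and OCR stages; the structure comes from the patent, the names and request format do not:

```python
def handle_line_request(request, get_frame, recognize):
    # S701: extract the video identifier and time information
    video_id = request["video_id"]
    position = request["position"]
    # S702: acquire the frame image corresponding to the time info
    frame = get_frame(video_id, position)
    # S703: identify the line text and return it to the client
    return {"line_text": recognize(frame)}
```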
  • The specific process of identifying the line text from the frame image in step S703 may include the following steps.
  • Lines are generally located near the bottom of the video picture, so the character region can be obtained simply by cropping a rectangular area at the bottom of the picture.
  • This method is simple, but the character region obtained
  • may not be very accurate. The character region can therefore also be obtained from the difference between the characters and the background picture being played.
  • A typical character region is a horizontal rectangular area with sharp edges, and the distribution of pixel values inside the character region differs markedly from the pixel distribution of the background picture; this difference can be used to detect and crop the character region.
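The pixel-distribution difference can be exploited with a per-row statistic: rows containing subtitle glyphs mix bright text with the darker backdrop, so their variance is much higher than that of uniform background rows. A NumPy sketch; the threshold value is an assumed tuning parameter:

```python
import numpy as np

def find_subtitle_band(gray, std_threshold=40.0):
    # Per-row standard deviation of pixel values; subtitle rows
    # alternate between glyph and background intensities, so their
    # standard deviation stands out from uniform background rows.
    row_std = gray.std(axis=1)
    hits = np.where(row_std > std_threshold)[0]
    if hits.size == 0:
        return None                              # no character region
    return int(hits.min()), int(hits.max()) + 1  # [top, bottom) rows
```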
  • the frame image may also be preprocessed in one or more ways before step S7031 is performed, for example by image smoothing, layout analysis, or tilt correction.
  • Image smoothing highlights the wide areas, low-frequency components, and trunk parts of an image, or suppresses image noise and interfering high-frequency components; it makes the brightness of the image change gradually, reduces abrupt gradients, and improves image quality. Through image smoothing preprocessing, the brightness of the frame image changes gradually and its quality is improved.
  • There are various ways to perform image smoothing, such as interpolation, linear smoothing, and convolution. The specific method can be selected according to the type of image noise; for example, when the noise is, or is dominated by, salt-and-pepper noise, a linear smoothing method can be used to smooth the image.
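As one concrete smoothing sketch, the fragment below applies a 3×3 median filter, a common choice against salt-and-pepper noise (the text mentions linear smoothing; the median variant is shown here only as an easily verified illustration, and the function name is hypothetical):

```python
# Minimal sketch: 3x3 median filter over a grayscale image given as a 2D list.
# Isolated "salt" or "pepper" pixels are replaced by the local median;
# border pixels are copied unchanged for simplicity.
def median_smooth(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = sorted(img[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            out[r][c] = window[4]  # median of the 9 window values
    return out
```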
  • Layout analysis divides a digital image into multiple regions and determines the category of each region, such as text, tables, and symbols, so as to locate each region.
  • Layout analysis mainly includes three types of methods: top-down, bottom-up, and hybrid.
  • Top-down methods include projection analysis and the run-length merging algorithm.
  • Projection analysis projects the two-dimensional image in a certain direction and segments regions through histogram analysis combined with a local or global threshold.
  • The run-length merging algorithm merges two adjacent runs in the same row into one run if the distance between them is short.
  • Bottom-up methods include region growing, which starts the analysis from the smallest units of the image to obtain connected components, then merges the connected components with a certain strategy to obtain higher-level structures, acquiring layout structure information during the merging.
  • Bottom-up analysis is highly adaptable and can analyze complex layouts, but is computationally expensive.
  • Top-down and bottom-up methods each have advantages and disadvantages; the hybrid method obtained by combining them is flexible, but different solutions are needed for different situations in practical applications.
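The projection-analysis step described above can be sketched as follows (an illustrative Python fragment, assuming a binary image where 0 marks character pixels; a global threshold of zero stands in for the local/global thresholding mentioned in the text):

```python
# Minimal sketch of projection analysis: project the binary image onto the
# vertical axis, then split it into text bands at empty-projection rows.
def horizontal_projection(binary, ink=0):
    """Count character pixels (value == ink) in each row."""
    return [sum(1 for px in row if px == ink) for row in binary]

def split_text_bands(binary, ink=0):
    """Return (first_row, last_row) pairs for maximal runs of rows whose
    projection value is non-zero."""
    proj = horizontal_projection(binary, ink)
    bands, start = [], None
    for r, v in enumerate(proj):
        if v and start is None:
            start = r
        elif not v and start is not None:
            bands.append((start, r - 1))
            start = None
    if start is not None:
        bands.append((start, len(proj) - 1))
    return bands
```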
  • Tilt correction is the process of correcting the tilt of the image; the tilt angle must first be estimated.
  • Algorithms for estimating the tilt angle of a document image mainly fall into three types: projection-based methods, Hough-transform-based methods, and least-squares-based methods.
  • The projection-based method uses certain features of the projection to make judgments: it performs projection tests on the document image at different angles and extracts the best projection from the resulting series of results, thereby estimating the tilt angle of the document image.
  • The disadvantage of this method is the large amount of computation, and the accuracy of the obtained tilt angle depends on the unit step size used in the different-angle projection tests.
  • The Hough-transform-based method mainly maps each point of the original coordinate plane to all points on the lines through that point in Hough space. Its disadvantages are the high time and space complexity of the computation and, when symbols are dispersed, the difficulty of choosing the mapping angles.
  • The least-squares-based method first selects a group of feature points of the document image, forming a feature set of N feature vectors in which each feature point is an independent sample. Assuming a straight line y = a + bx, it computes the residuals over the feature points, minimizes the residual, and solves for b, from which the tilt angle of the image is obtained.
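The least-squares method just described can be sketched directly (illustrative Python; `points` stands for the selected baseline feature points, which the source does not specify how to pick):

```python
import math

# Minimal sketch: fit y = a + b*x through feature points by least squares;
# the document skew angle is then atan(b).
def estimate_skew_angle(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # least-squares slope
    return math.degrees(math.atan(b))
```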
  • S7032. Remove the background in the detected character region. Removing the background of the character region can be understood as image purification: visible noise in the character region is removed, improving its image quality.
  • One method is to binarize the detected character region, so that each pixel is 1 or 0; that is, each pixel in the character region represents either a character or the background.
  • The so-called characters include text, letters, punctuation, and the like.
  • S7033. Extract a character sequence from the background-removed character region, where the character sequence includes one or more character pictures.
  • Based on the binarization method above, the character sequence can be extracted from the background-removed character region as follows: according to the pixel value of each pixel in the binarized character region, perform character segmentation on the binarized region to obtain the character sequence.
  • Assume black represents characters and white the background. Between adjacent characters in the same row, the pixel values of multiple columns of pixels are all 1, and between adjacent characters in the same column, the pixel values of multiple rows of pixels are all 1. Even if some characters have a left-right or top-bottom structure, the number of all-1 columns between the left and right parts, and the number of all-1 rows between the top and bottom parts, will not be too large; the character region can therefore be segmented on this basis to obtain the character sequence.
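The column-gap segmentation just described can be sketched as follows (illustrative Python, assuming the binarized convention above: 0 = character pixel, 1 = background):

```python
# Minimal sketch: split a binarized character region into per-character
# sub-images at columns that contain no character pixels at all.
def segment_characters(region, ink=0):
    w = len(region[0])
    col_has_ink = [any(row[c] == ink for row in region) for c in range(w)]
    chars, start = [], None
    for c, has_ink in enumerate(col_has_ink):
        if has_ink and start is None:
            start = c                      # a character column run begins
        elif not has_ink and start is not None:
            chars.append([row[start:c] for row in region])  # run ends
            start = None
    if start is not None:
        chars.append([row[start:] for row in region])
    return chars
```

Real subtitles would also need the row/column-count heuristics from the text to avoid splitting left-right or top-bottom structured characters; this sketch splits at every fully empty column.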
  • S7034. Perform text recognition on the one or more character pictures included in the extracted character sequence to obtain the line text.
  • The so-called text recognition refers to converting character bitmap images into text, letters, and punctuation marks for text processing.
  • The recognition may use printed-character recognition technology. Of course, other methods may also be used; for example, the distribution of the pixels representing the character in each character picture is compared with the pixel distributions of the characters in a preset character library, and the character with the highest similarity is selected as the character in the picture. Here, assuming black represents characters and white the background, the distribution of pixels refers to the positions and number of pixels whose value is 0 in each row and each column of the character picture.
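The library-comparison alternative can be sketched as a nearest-template match (illustrative Python; the template dictionary and the mismatch count as a similarity measure are assumptions for the sketch, not details from the source):

```python
# Minimal sketch: pick the library template whose pixel distribution is most
# similar to the candidate character picture (fewest mismatching pixels).
# Assumes all pictures share one fixed size, 0 = character, 1 = background.
def recognize_character(char_img, templates):
    def mismatch(a, b):
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda ch: mismatch(char_img, templates[ch]))
```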
  • The obtained line text can be further post-processed so that it better conforms to the way the language is expressed; for example, the recognized characters are post-processed according to linguistic syntax constraints.
  • Such linguistic syntax includes, for example, subject-predicate, verb-object, modifier-head, and preposition-object relations; using these syntactic constraints makes the recognized Chinese line text better match the features of the Chinese language.
  • the example of the present application further provides a video line processing method performed jointly by the video client and the video server, the method comprising:
  • the video client, in response to an operation on the video line control in the video playing interface, sends a video line processing request carrying the video identifier and time information to the video server;
  • the video server, upon receiving the video line processing request sent by the video client, extracts the video identifier and time information carried by the request, acquires from the video data corresponding to the video identifier the frame image corresponding to the time information, recognizes the line text from the frame image, and sends the recognized line text to the video client;
  • the video client, upon receiving the line text sent by the video server, displays a line operation interface containing the line text and, in response to an operation on that interface, performs corresponding processing on the line text.
  • the video client sends a video line processing request to the video server, where the request includes a video identifier and time information;
  • the video server obtains the video identifier and time information from the request and determines the corresponding frame image accordingly;
  • the video server detects the character region in the frame image;
  • the video server binarizes the character region to remove its background;
  • the video server performs character segmentation on the background-removed character region to obtain a character sequence;
  • the video server recognizes the character sequence to obtain the line text;
  • the video server sends the recognized line text to the video client;
  • the video client displays a line operation interface containing the line text;
  • the user selects a sharing platform on the line operation interface;
  • the video client publishes the line text on the sharing platform selected by the user, completing the sharing or publishing of the information.
  • In this flow the user only needs to: first, click the video line control on the video playing interface; and second, select a sharing platform on the line operation interface. The user never has to type the lines to be shared manually, so the convenience of operation is greatly improved and quick sharing is achieved; if users share the lines of a popular drama, the video's traffic can also increase.
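The request/response flow above can be sketched end to end (a hypothetical illustration in Python; the dictionaries, function names, and the in-memory `frame_store` are stand-ins for the real network protocol and video storage, which the source does not specify):

```python
# Minimal sketch of the share flow: client sends (video_id, time), server
# looks up the frame and returns recognized line text, client publishes it.
def handle_line_request(request, frame_store, recognize):
    """Server side: resolve the frame and recognize its line text."""
    video_id, t = request["video_id"], request["time"]
    frame = frame_store[(video_id, t)]
    return {"line_text": recognize(frame)}

def client_share(video_id, t, server, platform, posts):
    """Client side: request the line text, then publish it to a platform."""
    reply = server({"video_id": video_id, "time": t})
    posts.setdefault(platform, []).append(reply["line_text"])
    return reply["line_text"]
```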
  • the example of the present application further provides a video client.
  • the video client 900 includes:
  • a request sending module 901, configured to, in response to an operation on the video line control in the video playing interface, send a video line processing request carrying a video identifier and time information to the video server, so that the video server recognizes the line text from the frame image corresponding to the video identifier and the time information;
  • an interface display module 902, configured to display a line operation interface containing the line text when the line text sent by the video server is received;
  • a line processing module 903, configured to perform corresponding processing on the line text in response to an operation on the line operation interface.
  • the video line processing request may be a video line sharing request; the line operation interface further includes information of one or more selectable sharing platforms and/or comment areas.
  • the line processing module 903 may be specifically configured to: in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is in a logged-in state, display the information publishing interface of the selected sharing platform containing the line text; and, in response to a publishing operation on that information publishing interface, publish the line text to the selected sharing platform.
  • the line processing module 903 may be further configured to: in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is not logged in, display the login interface of the selected sharing platform; and, in response to a login operation on that login interface, log in to the selected sharing platform.
  • the line processing module 903 is further configured to: publish the line text to the selected comment area in response to a selection operation of a comment area in the line operation interface.
  • the line text may be displayed in an editable text box of the line operation interface, and the line processing module 903 may be further configured to: in response to an operation on the editable text box, edit the line text.
  • Similar to the video line processing method performed by the video client, with the video client provided by the example of the present application the user only needs to click the video line control in the video playing interface: the request sending module 901 sends a video line processing request to the video server, the video server recognizes the line text from the corresponding frame image and feeds it back to the video client, and the interface display module 902 then displays the line operation interface containing the line text. By operating on the line operation page, the user can have the video lines processed accordingly, without manually typing them, which is very convenient.
  • The video client provided by the example of the present application is the functional architecture module of the video line processing method executed by the video client; for the explanations of technical terms, examples, optional implementations, beneficial effects, and the like, refer to the corresponding content of that method, which is not repeated here.
  • the example of the present application further provides a video server.
  • the video server 1000 includes:
  • an information extraction module 1001, configured to, upon receiving a video line processing request sent by the video client, extract the video identifier and time information carried by the request;
  • an image acquisition module 1002, configured to acquire, from the video data corresponding to the video identifier, the frame image corresponding to the time information;
  • a line recognition module 1003, configured to recognize the line text from the frame image;
  • a line sending module 1004, configured to send the recognized line text to the video client.
  • the line recognition module 1003 may specifically include:
  • An area detecting unit configured to detect a character area in the frame image
  • a background removing unit configured to remove a background in the detected character region
  • a character extracting unit configured to extract a sequence of characters from a character region after removing the background; wherein the sequence of characters includes one or more character images;
  • a character recognition unit configured to perform text recognition on the extracted one or more character pictures included in the character sequence to obtain the line text.
  • the line recognition module 1003 can further include:
  • a pre-processing unit configured to pre-process the frame image before the region detecting unit detects the character region in the frame image.
  • the pre-processing may include at least one of smoothing processing, layout analysis, and tilt correction.
  • the background removing unit may be specifically configured to perform binarization processing on the detected character region; correspondingly, the character extracting unit may be specifically configured to perform character segmentation on the binarized character region according to the pixel value of each pixel in it, to obtain the character sequence.
  • the line recognition module 1003 can further include:
  • a post-processing unit, configured to post-process the line text according to linguistic syntax constraints.
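The units composed by the line recognition module can be sketched as a simple pipeline (a hypothetical illustration; the function names are placeholders for the region detection, background removal, character extraction, and character recognition units described above):

```python
# Minimal sketch: compose the four recognition units into one pipeline.
def recognize_line_text(frame, detect, remove_bg, segment, classify):
    region = detect(frame)            # region detection unit
    binary = remove_bg(region)        # background removal (binarization) unit
    char_pictures = segment(binary)   # character extraction unit
    return "".join(classify(ch) for ch in char_pictures)  # recognition unit
```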
  • When receiving a video line processing request, the information extraction module 1001 in the video server extracts the video identifier and time information from the request; the image acquisition module 1002 then obtains the corresponding frame image according to the video identifier and time information; the line recognition module 1003 recognizes the line text from the frame image; and finally the line sending module 1004 sends the line text to the video client so that the client can process the line text accordingly. Users can thus share, edit, and otherwise process lines without manually typing them, which is very convenient.
  • The video server provided by the example of the present application is the functional architecture module of the video line processing method executed by the video server; for the related technical terms, explanations, optional implementations, beneficial effects, and the like, refer to the corresponding content of that method, which is not repeated here.
  • the present application also discloses a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the video line processing method (for example, steps S201 to S203 and steps S701 to S704 described above).
  • The aforementioned storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
  • the present application also discloses a computer device, which may be a client device or a video server.
  • the computer device includes one or more processors (CPUs) 1102, a communication module 1104, a memory 1106, a user interface 1110, and a communication bus 1108 for interconnecting these components, wherein:
  • the processor 1102 can receive and transmit data through the communication module 1104 to effect network communication and/or local communication.
  • User interface 1110 includes one or more output devices 1112 that include one or more speakers and/or one or more visual displays.
  • User interface 1110 also includes one or more input devices 1114 including, for example, a keyboard, a mouse, a voice command input unit or microphone, a touch screen display, a touch-sensitive tablet, a gesture-capture camera, or other input buttons or controls.
  • Memory 1106 can be a high-speed random access memory such as DRAM, SRAM, DDR RAM, or another random-access solid-state storage device; or non-volatile memory such as one or more disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
  • the memory 1106 stores a set of instructions executable by the processor 1102, including:
  • An operating system 1116 including a program for processing various basic system services and for performing hardware related tasks
  • the application 1118 includes various applications for video line processing; such an application can implement the processing flow in each of the above examples, and may include some or all of the instruction modules or units in the video client 900 and/or some or all of the modules or units in the video server 1000.
  • the processor 1102 can implement the functions of at least one of the above-described units or modules by executing machine-executable instructions in at least one of the units in the memory 1106.
  • the hardware modules in the embodiments may be implemented in a hardware manner or a hardware platform plus software.
  • the above software includes machine readable instructions stored in a non-volatile storage medium.
  • embodiments can also be embodied as software products.
  • the hardware may be implemented by specialized hardware or hardware that executes machine readable instructions.
  • the hardware can be a specially designed permanent circuit or logic device (such as a dedicated processor such as an FPGA or ASIC) for performing a particular operation.
  • the hardware may also include programmable logic devices or circuits (such as including general purpose processors or other programmable processors) that are temporarily configured by software for performing particular operations.
  • each instance of the present application can be implemented by a data processing program executed by a data processing device such as a computer.
  • the data processing program constitutes the present application.
  • A data processing program is usually stored in a storage medium and is executed by directly reading the program out of the storage medium or by installing or copying the program to a storage device (such as a hard disk and/or memory) of the data processing device. Therefore, such a storage medium also constitutes the present application. The present application also provides a non-volatile storage medium storing a data processing program that can be used to execute any one of the above method examples of the present application.
  • the machine readable instructions corresponding to the modules of FIG. 11 may cause an operating system or the like operating on a computer to perform some or all of the operations described herein.
  • the non-transitory computer readable storage medium may be inserted into a memory provided in an expansion board within the computer or written to a memory provided in an expansion unit connected to the computer.
  • the CPU or the like installed on the expansion board or the expansion unit can perform part or all of the actual operations according to the instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application provides a video line processing method, a video client, a video server, and a computer-readable storage medium. The method includes: upon receiving a video line processing request sent by a video client, extracting the video identifier and time information carried by the request; acquiring, from the video data corresponding to the video identifier, the frame image corresponding to the time information; recognizing line text from the frame image; and sending the recognized line text to the video client. Based on the present application, the user only needs to click the video line control in the video playing interface; the video server then recognizes the line text from the corresponding frame image and feeds it back to the video client, so that the user can operate on the line operation page of the video client to have the video lines processed accordingly, without manually typing them, which is very convenient.

Description

Video line processing method, client, server, and storage medium
This application claims priority to Chinese Patent Application No. 201710616032.8, entitled "Video line processing method, video client and server", filed with the State Intellectual Property Office of China on July 26, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular to a video line processing method, a video client, a video server, and a computer-readable storage medium.
Background
With the development of computer communication, Internet, and multimedia technologies, watching videos online has become increasingly widespread. At any time, a user can establish a network connection to a video playing server through a client, browse the various videos provided by the server (for example, movies, TV series, or Flash videos), select a favorite video file, and click to download, play, and watch it online, so as to enjoy the various extended video services that digital multimedia operators provide through the video playing server.
Summary
An example of the present application provides a video line processing method. The method includes:
upon receiving a video line processing request sent by a video client, extracting the video identifier and time information carried by the request;
acquiring, from the video data corresponding to the video identifier, the frame image corresponding to the time information;
recognizing line text from the frame image;
sending the recognized line text to the video client.
In some examples, recognizing the line text from the frame image may include:
detecting a character region in the frame image;
removing the background in the detected character region;
extracting a character sequence from the background-removed character region, where the character sequence includes one or more character pictures;
performing text recognition on the one or more character pictures included in the extracted character sequence to obtain the line text.
In some examples, recognizing the line text from the frame image may further include:
preprocessing the frame image before detecting the character region in it.
In some examples, the preprocessing may include at least one of smoothing, layout analysis, and tilt correction.
In some examples, removing the background in the detected character region may include: binarizing the detected character region; and extracting the character sequence from the background-removed character region includes: performing character segmentation on the binarized character region according to the pixel value of each pixel in it, to obtain the character sequence.
In some examples, recognizing the line text from the frame image may further include:
post-processing the recognized line text according to linguistic syntax constraints.
An example of the present application provides a video line processing method. The method includes:
in response to an operation on a video line control in a video playing interface, sending a video line processing request carrying a video identifier and time information to a video server, the request being used to request the video server to recognize line text from the frame image corresponding to the video identifier and the time information;
upon receiving the line text sent by the video server, displaying a line operation interface containing the line text;
in response to an operation on the line operation interface, performing corresponding processing on the line text.
In some examples, the video line processing request may be a video line sharing request, and the line operation interface further includes information of one or more selectable sharing platforms and/or comment areas.
In some examples, performing corresponding processing on the line text in response to an operation on the line operation interface may include:
in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is in a logged-in state, displaying the information publishing interface of the selected sharing platform containing the line text;
in response to a publishing operation on the information publishing interface of the selected sharing platform, publishing the line text to the selected sharing platform.
In some examples, performing corresponding processing on the line text in response to an operation on the line operation interface may further include:
in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is not logged in, displaying the login interface of the selected sharing platform;
in response to a login operation on the login interface of the selected sharing platform, logging in to the selected sharing platform.
In some examples, performing corresponding processing on the line text in response to an operation on the line operation interface may include:
in response to a selection of a comment area in the line operation interface, publishing the line text to the selected comment area.
In some examples, the line text is displayed in an editable text box of the line operation interface;
and performing corresponding processing on the line text in response to an operation on the line operation interface includes: editing the line text in response to an operation on the editable text box.
An example of the present application provides a video server. The video server includes:
an information extraction module, configured to, upon receiving a video line processing request sent by a video client, extract the video identifier and time information carried by the request;
an image acquisition module, configured to acquire, from the video data corresponding to the video identifier, the frame image corresponding to the time information;
a line recognition module, configured to recognize line text from the frame image;
a line sending module, configured to send the recognized line text to the video client.
In some examples, the line recognition module may include:
a region detection unit, configured to detect a character region in the frame image;
a background removal unit, configured to remove the background in the detected character region;
a character extraction unit, configured to extract a character sequence from the background-removed character region, where the character sequence includes one or more character pictures;
a character recognition unit, configured to perform text recognition on the one or more character pictures included in the extracted character sequence to obtain the line text.
In some examples, the line recognition module may further include:
a preprocessing unit, configured to preprocess the frame image before the region detection unit detects the character region in it.
In some examples, the preprocessing may include at least one of smoothing, layout analysis, and tilt correction.
In some examples, the background removal unit may be specifically configured to binarize the detected character region; and the character extraction unit may be specifically configured to perform character segmentation on the binarized character region according to the pixel value of each pixel in it, to obtain the character sequence.
In some examples, the line recognition module may further include:
a post-processing unit, configured to post-process the line text according to linguistic syntax constraints.
An example of the present application provides a video client. The video client includes:
a request sending module, configured to, in response to an operation on a video line control in a video playing interface, send a video line processing request carrying a video identifier and time information to a video server, so that the video server recognizes line text from the frame image corresponding to the video identifier and the time information;
an interface display module, configured to display a line operation interface containing the line text when the line text sent by the video server is received;
a line processing module, configured to perform corresponding processing on the line text in response to an operation on the line operation interface.
In some examples, the video line processing request is a video line sharing request; the line operation interface further includes information of one or more selectable sharing platforms and/or comment areas.
In some examples, the line processing module may be specifically configured to: in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is in a logged-in state, display the information publishing interface of the selected sharing platform containing the line text; and, in response to a publishing operation on that information publishing interface, publish the line text to the selected sharing platform.
In some examples, the line processing module may be further configured to: in response to a selection of a sharing platform in the line operation interface, if the selected sharing platform is not logged in, display the login interface of the selected sharing platform; and, in response to a login operation on that login interface, log in to the selected sharing platform.
In some examples, the line processing module may be further configured to: in response to a selection of a comment area in the line operation interface, publish the line text to the selected comment area.
In some examples, the line text may be displayed in an editable text box of the line operation interface, and the line processing module may be further configured to: in response to an operation on the editable text box, edit the line text.
An example of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the above methods are implemented.
Based on the above technical solutions, the user only needs to click the video line control in the video playing interface; the video server then recognizes the line text from the corresponding frame image and feeds it back to the video client, so that the user can operate on the line operation page of the video client to have the video lines processed accordingly, without manually typing them, which is very convenient.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a system architecture diagram involved in the examples of the present application;
Fig. 2 is a schematic flowchart of a video line processing method in an example of the present application;
Fig. 3 is a schematic diagram of a video playing interface in an example of the present application;
Fig. 4 is an enlarged schematic view of the video line sharing control 301 in Fig. 3;
Fig. 5 is a schematic diagram of a video playing interface in an example of the present application;
Fig. 6 is a schematic diagram of a line operation interface in an example of the present application;
Fig. 7 is a schematic flowchart of a video line processing method in an example of the present application;
Fig. 8 is a schematic diagram of interaction among a user, a video client, and a video server in an example of the present application;
Fig. 9 is a structural block diagram of a video client in an example of the present application;
Fig. 10 is a structural block diagram of a video server in an example of the present application;
Fig. 11 is a schematic structural diagram of a computing device in an example of the present application.
Detailed Description
The present application proposes a video line processing method. The system architecture to which the method applies is shown in Fig. 1. The architecture includes a client device 101, a video server 102, and the Internet 103; the client device 101 and the video server 102 are connected through the Internet 103, wherein:
The client device 101 may be a user's smartphone or computer, on which client software of various applications is installed. Through the client device, the user can log in to and use the clients of various applications; such a client may be the client of multimedia software, for example a video client.
The video server 102 may be a single server or a server cluster, and can provide video playing services for client devices.
The Internet 103 may include wired and wireless networks.
The inventor of the present application found that while watching a film on the client device 101, a user may see lines he or she likes or is moved by. The user may then want to share the lines to the comment area of the video client, to social platforms such as WeChat Moments, Weibo, Qzone, or friends' feeds, or to copy and paste the lines into a text of his or her own choosing. In one possible implementation, the user can manually type the lines and then share them; however, this way of operating is not very convenient.
Given the inconvenience of the above approach of manually typing lines before sharing them, the present application proposes a video line processing method that can be executed by the video client in the client device 101. As shown in Fig. 2, the method includes:
S201. In response to an operation on a video line control in the video playing interface, send a video line processing request carrying a video identifier and time information to the video server, so that the video server recognizes line text from the frame image corresponding to the video identifier and the time information.
The video identifier distinguishes different video files or video streams and can be assigned by the video server; different video files or streams correspond to different identifiers. For example, the video identifier of the movie The Shawshank Redemption may be a1 and that of Dead Poets Society b1; likewise, episode 12 of the TV series 《父母爱情》 may be identified as c1_12 and episode 20 of 《潜伏》 as d1_20.
The time information may be the playing time point of the current video (also called the playing progress or playing position). For example, a movie lasts 90 minutes and its video data consists of many frame images; different playing time points correspond to different frame images. The video client in the client device can carry the time information in the processing request so that the video server knows which frame image, of the video corresponding to the video identifier, the user wants line processing for. The current video may be the video the video client in the client device is currently playing.
The video line control is a UI (User Interface) control displayed in the video playing interface for triggering a line processing request. It may take many forms, such as a graphic button or a menu option in the playing interface. When the user clicks the control, the video client performs the corresponding operation; for example, if the video line control is a video line sharing control, clicking it causes the video client to send a video line sharing request (corresponding to the above video line processing request).
In the above solution, the video line processing request sent by the video client to the video server carries a video identifier and time information, and after receiving the request the video server acquires the corresponding frame image according to them. In another possible implementation, the video client may instead send the video server a processing request containing the frame image to be processed, so that the server performs line recognition directly on the frame image in the request to obtain the line text. For example, while the video playing interface of the video client is playing a video and the user operates the video line control in it, the video client, in response to that operation, captures the frame image currently being played and carries it in the processing request sent to the video server.
As shown in Fig. 3, a video line sharing control 301 is provided in the video playing interface of the video client. If, while watching a video, the user wants to share the lines in the current playing interface to a social platform (for example, WeChat Moments), the user can click the control 301. Because the control 301 is triggered, the video client sends a video line processing request to the video server 102. The video server 102 then obtains the video identifier and time information from the request, determines from the video identifier which video is to be processed, further determines from the time information which frame image of that video is to be processed, extracts that frame image, recognizes the line text from it, and finally sends the line text to the video client in the client device 101.
An enlarged view of the video line sharing control of Fig. 3 is shown in Fig. 4; of course, icons of other shapes may also be used as the video line control.
In fact, the video line processing request is not limited to a request for sharing lines; it may also be a request for other processing of the video lines, for example editing them (e.g., copying or modifying).
S202. Upon receiving the line text sent by the video server, display a line operation interface containing the line text.
The line operation interface displayed after the video client receives the line text can take many forms, and different video line processing requests correspond to different line operation interfaces. For example, if in step S201 the video client sent a video line sharing request, the line operation interface displayed upon receiving the line text may further include information of one or more selectable sharing platforms and/or comment areas, for the user to choose a sharing platform or comment area.
For example, after the video server has recognized the lines in the video playing interface shown in Fig. 3 and fed the resulting line text back to the video client, the client displays the line operation interface shown in Fig. 5. In Fig. 5, the line text is shown in a text box 501, and the interface also contains several selectable sharing platform and/or comment area icons: a WeChat icon 502, a Tencent QQ icon 503, a Weibo icon 504, and a comment area icon 505, each of which can be a control for the corresponding sharing platform or comment area. The WeChat icon 502 corresponds to Moments on the WeChat platform and, when triggered, enters the Moments information publishing interface and displays the line text in it; the Tencent QQ icon 503 corresponds to Qzone or friends' feeds on the QQ platform and, when triggered, enters the corresponding publishing interface and displays the line text in it; the Weibo icon 504 corresponds to the Weibo publishing interface and, when triggered, enters it and displays the line text in it. Since the area below a typical video playing interface is a comment area in which users can post their impressions of the video, the interface of Fig. 5 also provides a comment area icon 505, which corresponds to the commenting area of the current video client and, when triggered, enters the comment area below the video playing interface and displays the line text there. In addition, the interface of Fig. 5 provides a cancel button 506 for canceling the current sharing action and returning to the video playing interface.
S203. In response to an operation on the line operation interface, perform corresponding processing on the line text.
In this step, the video client can execute different processing for different user operations on the line operation interface. Taking Fig. 5 as an example again to illustrate step S203:
If the user wants to publish the line text to Moments, the user can click the WeChat icon 502; the video client then displays the Moments publishing interface with the line text shown in it, and when the user clicks send, the client, in response, publishes the line text to Moments, after which the user or the user's friends can see it there. If the user wants to publish the line text to Qzone or a friends' feed, the user can click the Tencent QQ icon 503; the client displays the corresponding publishing interface with the line text shown in it, and after the user clicks send, publishes the line text to Qzone or the friends' feed, where the user or the user's friends can then see it. If the user wants to publish the line text on Weibo, the user can click the Weibo icon 504; the client displays the Weibo publishing interface with the line text shown in it, and after the user clicks publish, posts the line text to Weibo, where the user or the user's friends can see it. Similarly, if the user wants to publish the line text in the comment area of the video client, the user clicks the comment area icon 505, and the client, in response, publishes the line text in the text box to the comment area below the video playing interface.
If, after the line operation interface is displayed, the user no longer wants to share or publish the lines, the user can click the cancel button 506 in the interface; the video client, in response, returns to the video playing interface so that the user can continue watching the video.
Fig. 5 and the related description take sharing as the example of line processing. Of course, the processing of video lines is not limited to sharing: the user may merely edit the line text fed back by the video server without sharing it, or edit it first and then share it. For these two cases, the text box 501 can be configured as an editable text box; when the user performs an editing operation in it, the video client, in response, edits the line text. For example, the user modifies the line text (e.g., deletes the English in Fig. 5 or adds emoticons), then copies the modified text and pastes it into a Word document or text document, or shares the modified line text to a social platform.
As mentioned above, Fig. 5 is only one form of the line operation interface. In practice the interface can take other forms; for example, several virtual editing buttons, such as copy, paste, emoticon, and background-setting buttons, can be placed below the text box, each used for a different editing operation on the line text. As shown in Fig. 6, in addition to a text box 601, a WeChat icon 602, a Tencent QQ icon 603, a Weibo icon 604, a comment area icon 605, and a cancel button 606, the interface has a copy button 607, a paste button 608, and an emoticon button 609. When the user clicks the copy button 607, the client, in response, can copy the line text in the text box 601 into other files; when the user clicks the paste button 608, the client, in response, can paste content previously copied from other files into the text box; and when the user clicks the emoticon button 609, the client, in response, can add emoticons to the text box.
Based on the above analysis, with the video line processing method provided by the example of the present application, the user only needs to click the video line control in the video playing interface; the video server recognizes the line text from the corresponding frame image and feeds it back to the video client, so that the user can operate on the line operation page of the client to have the video lines processed accordingly, without manually typing them, which is very convenient.
In some examples, the line operation interface may include one or more selectable sharing platforms, making it easy for the user to choose one. Two situations may arise:
(1) The selected sharing platform is in a logged-in state.
When the user selects a sharing platform on the line operation interface, the video client, in response to the selection, checks whether the user is logged in on that platform. If it finds that the platform is logged in, it directly displays the line text on the information publishing interface of the selected platform. If the user then continues with a publishing operation, the client, in response to the publishing operation on the information publishing interface of the selected platform, publishes the line text to the selected sharing platform.
(2) The selected sharing platform is not logged in.
When the user selects a sharing platform on the line operation interface, the video client, in response to the selection, checks whether the user is logged in on that platform. If it finds that the platform is not logged in, it displays the login interface of the selected platform; after the user enters the correct login information on it, the client, in response to the login operation on the login interface of the selected platform, logs in to the selected sharing platform and then proceeds with publishing the information.
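The two branches above can be sketched as a small dispatcher (an illustrative Python fragment; the callback names are placeholders for the client's actual UI routines, which the source does not name):

```python
# Minimal sketch: dispatch on login state when a sharing platform is selected.
def on_platform_selected(platform, is_logged_in, show_publish_ui, show_login_ui):
    if is_logged_in(platform):
        show_publish_ui(platform)   # case (1): go straight to publishing
        return "publish"
    show_login_ui(platform)         # case (2): ask the user to log in first
    return "login"
```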
The above method is the video line processing method executed by the video client. Correspondingly, the example of the present application further provides a video line processing method that can be executed by the video server 102. As shown in Fig. 7, the method includes:
S701. Upon receiving a video line processing request sent by the video client, extract the video identifier and time information carried by the request.
The video identifier, the time information, and the video line processing request have been explained above and are not repeated here.
In this step, the video server can receive the video line processing request in various ways. One way is real-time listening: when data whose destination address is the video server is detected, the server receives it and then determines from the relevant information in the data that it is a video line processing request.
S702. Acquire, from the video data corresponding to the video identifier, the frame image corresponding to the time information.
In a practical scenario, the video server streams video to the video client so that the user can watch a video composed of frame images at the client. In this case, the frame image corresponding to the time information can be obtained in step S702 by extracting it from the current video stream. Of course, the way of extracting the frame image is not limited to this: the video server may also obtain it from the video file corresponding to the video identifier, in which case the video file is a static file. For example, after obtaining the video identifier and time information carried by the processing request, the server searches its database or the network for the video file corresponding to the identifier and extracts from it the frame image corresponding to the time information. Whichever way is used, it suffices that the frame image corresponding to the video identifier and the time information can be obtained.
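Mapping the received time information to a concrete frame can be sketched as follows (an illustrative Python fragment, assuming a constant frame rate; the function name and the millisecond unit are assumptions for the sketch):

```python
# Minimal sketch: convert a playback time point (in milliseconds) into a
# frame index so the server can seek that frame in the stream or file.
def frame_index_for_time(time_ms, fps):
    if time_ms < 0 or fps <= 0:
        raise ValueError("invalid time or frame rate")
    return int(time_ms * fps // 1000)
```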
The above steps S701 and S702 are described by taking the case where the video line processing request carries the video identifier and time information as an example. In another possible implementation, the video line processing request sent by the video client may also directly carry the frame image; that is, the video server extracts the frame image from the processing request sent by the video client.
S703. Recognize line text from the frame image.
In the example of the present application, the above recognition process refers to converting the characters in bitmap image format in the frame image into text.
In this step, there are various ways to recognize the line text from the frame image, for example by using image recognition technology. The example of the present application does not limit the specific method used in practice, as long as the line text in the frame image can be recognized.
S704. Send the recognized line text to the video client.
Based on the video line processing method provided by the example of the present application, when the video server receives a video line processing request, it extracts the video identifier and time information from the request, obtains the corresponding frame image according to them, recognizes the line text from the frame image, and finally sends the line text to the video client so that the client can process the line text accordingly. Lines can thus be shared, edited, and otherwise processed without the user manually typing them, improving the convenience of line processing.
在一些实例中,步骤S703中从所述帧图像中识别出台词文本的具体过程可以包括以下步骤:
S7031、检测所述帧图像中的字符区域。
具体的检测方法有多种,例如,在视频播放界面中,台词一般位于视频画面的下方,因此可以通过截取视频画面下方的矩形区域的方式获取字符区域,这种方法虽简单,但是获取到的字符区域可能不是很精确。因此还可以根据字符与播放的背景图像之间的差异性获取字符区域,比如,一个典型的字符区域为一个水平的矩形区域,有陡峭的边缘,而且字符区域内像素值的分布与播放的背景图像的像素分布有很大的差异,利用这些差异便可以检测并截取到字符区域。
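以上述“截取视频画面下方矩形区域”这一简单方法为例，可以给出如下示意代码（图像以二维像素列表表示，裁剪比例0.2为假设值）：

```python
def crop_bottom_region(image, ratio=0.2):
    """截取图像底部 ratio 比例的矩形区域, 作为候选字符区域(示意实现)。

    image: 二维列表, 每个元素为一行像素。
    """
    height = len(image)
    start = int(height * (1 - ratio))
    return image[start:]

# 一幅 10 行高的"图像", 底部 20% 即最后 2 行
image = [[row] * 4 for row in range(10)]
print(len(crop_bottom_region(image)))  # 2
```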
当然,为了实现更好的识别效果,还可以在执行步骤S7031之前,对帧图像进行一种或多种方式的预处理,预处理的方式有很多,例如:图像平滑、版面分析、倾斜度校正等,其中:
图像平滑，是指用于突出图像的宽大区域、低频成分、主干部分或抑制图像噪声和干扰高频成分的方式，可以达到使图像亮度平缓渐变，减小突变梯度，改善图像质量的效果。可见，通过图像平滑的预处理方式可以使得帧图像的亮度平缓渐变，画质得到改善。具体进行图像平滑的方式有多种，例如，插值方法、线性平滑方法、卷积法等等。具体采用何种图像平滑方式可以根据图像噪声的不同而选择，比如，当图像噪声为椒盐噪声或者以椒盐噪声为主时，可以采用中值滤波等非线性平滑方法对图像进行平滑处理。
版面分析，是指将数字图像分割成多个区域，并且确定每个区域的类别，比如文本、表格、符号等，实现各个区域的定位。版面分析主要包括三类方法：自顶向下方法、自底向上方法、综合方法。自顶向下方法包括投影分析法、游程合并算法。投影分析法是在某个方向上对二维图像进行投影，通过对直方图分析，结合局部或全局阈值法对其进行区域分割。游程合并算法是指如果同一行中两个相邻的游程距离较短，就将这两个游程合并为一个游程。自底向上方法包括区域生长法，是从图像最小单元进行分析，得到连通体，然后对连通体采用一定的策略进行合并得到更高级的结构，同时在合并过程中获取版面结构信息。自底向上的分析方法适应能力强，能够分析比较复杂的版面，但计算量大。自顶向下和自底向上方法各有优缺点，将两者结合得到的综合方法灵活性强，但在实际应用中针对不同的情况需要采用不同的方案。
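以自顶向下方法中的投影分析法为例，对二值图像做水平投影并以全背景行为界切分文本行的过程可以示意如下（约定0代表字符、1代表背景，均为本示例假设）：

```python
def horizontal_projection(binary_image):
    """统计每一行中字符像素(值为 0)的个数, 得到水平投影直方图(示意实现)。"""
    return [row.count(0) for row in binary_image]

def split_text_lines(binary_image):
    """根据投影直方图, 以全背景行(投影为 0)为界切分出文本行的行号区间。"""
    projection = horizontal_projection(binary_image)
    lines, start = [], None
    for i, count in enumerate(projection):
        if count > 0 and start is None:
            start = i                      # 进入一个文本行
        elif count == 0 and start is not None:
            lines.append((start, i - 1))   # 文本行结束
            start = None
    if start is not None:
        lines.append((start, len(projection) - 1))
    return lines

img = [
    [1, 1, 1],
    [1, 0, 1],  # 第一行文本
    [1, 1, 1],
    [0, 0, 1],  # 第二行文本
    [0, 1, 1],
]
print(split_text_lines(img))  # [(1, 1), (3, 4)]
```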
倾斜度校正，是指对图像的倾斜度进行修正的过程，首先要估算出图像的倾斜角度，估算文档图像倾斜角的算法主要包括三类：基于投影的方法、基于霍夫变换的方法和基于最小二乘的方法。基于投影的方法利用投影的某些特征进行判断，对文档图像进行不同角度的投影测试，在得到的系列结果中提取最佳的投影效果，从而估算文档图像的倾斜角。该方法缺点是计算量大，得到的倾斜角角度精度取决于进行不同角度投影测试时的单位步长。基于霍夫变换的方法主要是将原始坐标平面上的每个点映射为霍夫空间中经过该点的所有直线所对应的参数点，其不足之处在于计算的时空复杂度较高，对符号分散的情况，映射角度选择比较困难。基于最小二乘的方法首先选择文档图像的一组特征点，形成包含N个特征向量的特征集，其中每个特征点都是一个独立的样本，假定存在一条直线y=a+bx，对一组特征点计算残差，令残差最小，解出b的值，即可求出图像的倾斜角。
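以基于最小二乘的方法为例，对一组特征点拟合直线y=a+bx并由斜率b求倾斜角的过程可以示意如下（特征点数据为假设）：

```python
import math

def estimate_skew_angle(points):
    """最小二乘拟合 y = a + b*x, 由斜率 b 估算倾斜角(弧度, 示意实现)。"""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    # 由正规方程解出使残差平方和最小的斜率 b
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    return math.atan(b)

# 特征点恰好落在斜率为 1 的直线上, 倾斜角为 45 度
angle = estimate_skew_angle([(0, 0), (1, 1), (2, 2), (3, 3)])
print(round(math.degrees(angle), 1))  # 45.0
```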
S7032、去除所检测出的字符区域中的背景。
在该步骤中,去除字符区域的背景的过程,可以理解为是图像净化的过程,去除掉字符区域中的显见噪声,进而改善字符区域的图像质量。
在具体实施时，去除字符区域中的背景的具体方法有多种，其中一种方法为：对所检测出的字符区域进行二值化处理，所谓的二值化处理即为令字符区域中的每个像素为1或0，也就是说，字符区域中的每个像素要么代表字符，要么代表背景。例如，假设二值化处理后得到的字符区域中的各个像素中，用0代表字符，用1代表背景，也就是说，黑色代表字符，白色代表背景，从而实现去除背景的目的。其中，所谓的字符，包括文字、字母、标点符号等。
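上述二值化处理可以按固定阈值示意如下（阈值128为假设值；按本例约定，0代表字符，1代表背景）：

```python
def binarize(gray_image, threshold=128):
    """按阈值将灰度图二值化(示意实现): 深色像素记为 0(字符), 浅色记为 1(背景)。"""
    return [[0 if pixel < threshold else 1 for pixel in row]
            for row in gray_image]

gray = [
    [250, 30, 250],
    [250, 40, 250],
]
print(binarize(gray))  # [[1, 0, 1], [1, 0, 1]]
```

实际应用中阈值也可以自适应地确定，此处仅为说明二值化的基本含义。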
S7033、从去除背景后的字符区域中提取字符序列。
其中,上述字符序列包括一个或多个字符图片。
基于上述采用二值化进行去除背景的方法，从去除背景后的字符区域中提取字符序列的过程可以采用以下步骤：
根据经过所述二值化处理的字符区域中各像素点的像素值,对经过所述二值化处理的字符区域进行字符分割得到所述字符序列。
假设黑色代表字符，白色代表背景，可以理解的是，同一行中相邻的字符与字符之间有多列像素点的像素值全部是1，同一列中相邻的字符与字符之间有多行像素点的像素值全部是1，即便有的字符是左右结构或上下结构，但是左右结构之间像素值全部为1的像素点的列数不会太大，上下结构之间像素值全部为1的像素点的行数也不会太大，因此可以根据这一点对字符区域进行字符分割，得到字符序列。
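按“字符之间存在整列背景像素”进行字符分割的思路可以示意如下（0代表字符，1代表背景，输入数据为假设）：

```python
def segment_characters(binary_line):
    """按全背景列(整列像素值为 1)切分单行二值图, 得到字符图片序列(示意实现)。"""
    width = len(binary_line[0])
    blank = [all(row[col] == 1 for row in binary_line) for col in range(width)]
    chars, start = [], None
    for col in range(width):
        if not blank[col] and start is None:
            start = col                                    # 进入一个字符
        elif blank[col] and start is not None:
            chars.append([row[start:col] for row in binary_line])
            start = None
    if start is not None:
        chars.append([row[start:] for row in binary_line])
    return chars

line = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
]
print(len(segment_characters(line)))  # 2
```

实际实现中还需如正文所述，容忍左右结构字符内部较窄的空白列，此处从简。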
以上仅仅是其中一种从字符区域中提取字符序列的方式,当然还可以采用其他的方式进行字符序列的提取,对此本申请实例不做限定。
S7034、对提取出的所述字符序列中包括的所述一个或多个字符图片进行文本识别,得到所述台词文本。
所谓的文本识别,是指将字符点阵图像转换为文字、字母和标点符号的过程,以便于进行文本处理。具体的文本识别过程可以采用印刷体字符识别技术进行识别。当然,还可以采用其他的方式进行识别,例如根据每个字符图片中每一行代表字符的像素点的分布情况,与预先设置的字符库中各个字符的像素点的分布情况进行对比,选取相似度最高的字符作为该字符图片中的字符。假设黑色代表字符,白色代表背景,像素点的分布情况是指字符图片中每一行和每一列中像素值为0的像素点的分布位置和个数等。
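以“与预先设置的字符库逐一比对像素分布、选取相似度最高的字符”这一思路为例，可以示意如下（字符库内容与相似度定义均为本示例假设，并非实际的印刷体字符识别实现）：

```python
def similarity(char_image, template):
    """统计两幅同尺寸二值图中像素值相同的位置所占比例, 作为相似度(示意定义)。"""
    total = sum(len(row) for row in char_image)
    same = sum(1 for row_a, row_b in zip(char_image, template)
                 for a, b in zip(row_a, row_b) if a == b)
    return same / total

def recognize_char(char_image, library):
    """在字符库中选取相似度最高的字符作为识别结果(示意实现)。

    library: {字符: 同尺寸二值模板} 的字典, 内容为假设数据。
    """
    return max(library, key=lambda ch: similarity(char_image, library[ch]))

library = {
    "一": [[1, 1], [0, 0]],   # 假设的 2x2 模板
    "丨": [[0, 1], [0, 1]],
}
print(recognize_char([[1, 1], [0, 0]], library))  # 一
```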
当然,在执行完S7034之后,还可以对得到的台词文本进行一定的后处理,以使得到的台词文本更加符合语言的表述方式,例如,根据语言句法约束条件,对识别出的字符进行后处理。
所谓的语言句法，例如状中关系、述宾关系、述补关系、介宾关系等，利用这些语言句法的约束，使识别出的中文台词文本更加符合汉语的语言特征。对于其他语言，也存在一些特定的语言句法，也可以采用相应语言的句法进行约束，使其更符合相应语言的语言特征。
基于以上在视频客户端执行的视频台词处理方法和在视频服务器执行的视频台词处理方法,本申请实例还提供一种由视频客户端和视频服务器共同执行的视频台词处理方法,该方法包括:
1)、视频客户端响应于对视频播放界面中视频台词控件的操作,向视频服务器发送携带视频标识和时间信息的视频台词处理请求;
2)、视频服务器在接收到视频客户端发送来的视频台词处理请求时,提取所述处理请求携带的视频标识和时间信息;从所述视频标识对应的视频数据中获取所述时间信息对应的帧图像;从所述帧图像中识别出台词文本;将识别出的所述台词文本发送至所述视频客户端。
3)、视频客户端在接收到所述视频服务器发送来的所述台词文本时,展示包含所述台词文本的台词操作界面;响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理。
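上述交互过程中，视频台词处理请求所携带的视频标识和时间信息可以示意性地组织为如下请求体（字段名均为假设，并非协议规定）：

```python
import json

def build_line_request(video_id, time_ms):
    """客户端侧: 构造携带视频标识和时间信息的视频台词处理请求体(示意实现)。"""
    return json.dumps({
        "type": "line_request",   # 标识该请求为视频台词处理请求
        "video_id": video_id,     # 视频标识
        "time_ms": time_ms,       # 时间信息(毫秒)
    })

def parse_line_request(payload):
    """服务器侧: 从请求体中提取视频标识和时间信息。"""
    data = json.loads(payload)
    return data["video_id"], data["time_ms"]

payload = build_line_request("v123", 10000)
print(parse_line_request(payload))  # ('v123', 10000)
```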
对以上方法中的各个步骤中有关技术名词的解释说明、一些举例说明、一些实施方式等内容请参考上述视频客户端执行的视频台词处理方法和上述视频服务器执行的视频台词处理方法中的相应内容,在此不再赘述。
下面结合图8对上述过程进行举例说明:
1、用户点击视频客户端的视频播放界面中的视频台词控件;
2、视频客户端向视频服务器发送视频台词处理请求,在该请求中包含视频标识和时间信息;
3、视频服务器从视频台词处理请求中获取视频标识和时间信息,进而根据视频标识和时间信息确定对应的帧图像;
4、视频服务器检测出上述帧图像中的字符区域;
5、视频服务器对字符区域进行二值化,进而去除字符区域中的背景;
6、视频服务器对去除背景后的字符区域进行字符分割,得到字符序列;
7、视频服务器对字符序列进行识别,得到台词文本;
8、视频服务器将识别出的台词文本发送至视频客户端;
9、视频客户端展示台词操作界面，该界面中包括台词文本；
10、用户在台词操作界面上选择分享平台;
11、视频客户端将台词文本发布在用户选择的分享平台上,从而完成信息分享或发布。
在上述过程中，用户需要做的事情是：一、在视频播放界面上点击视频台词控件；二、在台词操作界面上选择分享平台。可见在上述过程中用户不需要手动输入想要分享的台词，因此可以大大地提高用户操作的便捷性，实现快速分享，如果用户分享的是独播剧的台词，还可以带动视频的流量增长。
与上述视频客户端执行的视频台词处理方法相对应的,本申请实例还提供一种视频客户端,如图9所示,该视频客户端900包括:
请求发送模块901,用于响应于对视频播放界面中视频台词控件的操作,向视频服务器发送携带视频标识和时间信息的视频台词处理请求,以使所述视频服务器从所述视频标识和所述时间信息所对应的帧图像中识别出台词文本;
界面展示模块902,用于在接收到所述视频服务器发送来的所述台词文本时,展示包含所述台词文本的台词操作界面;
台词处理模块903,用于响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理。
在一些实例中，所述视频台词处理请求为视频台词分享请求；所述台词操作界面中还包括一个或多个可供选择的分享平台和/或评论区的信息。
在一些实例中,台词处理模块903可以具体用于:响应于对所述台词操作界面中一个分享平台的选择操作,若被选择的分享平台处于登录状态,则展示包含所述台词文本的所述被选择的分享平台的信息发布界面;响应于对所述被选择的分享平台的信息发布界面的发布操作,将所述台词文本发布到所述被选择的分享平台。
在一些实例中,台词处理模块903还可以具体用于:响应于对所述台词操作界面中一个分享平台的选择操作,若所述被选择的分享平台处于未登录状态,则展示所述被选择的分享平台的登录界面;响应于对所述被选择的分享平台的登录界面的登录操作,登录所述被选择的分享平台。
在一些实例中,台词处理模块903还可以具体用于:响应于对所述台词操作界面中一个评论区的选择操作,将所述台词文本发布到所述被选择的评论区。
在一些实例中,所述台词文本可以展示在所述台词操作界面的可编辑文本框内,台词处理模块903还可以具体用于:响应于对所述可编辑文本框的操作,对所述台词文本进行编辑操作。
与视频客户端执行的视频台词处理方法类似的，用户只需要在本申请实例提供的视频客户端的视频播放界面中点击视频台词控件，请求发送模块901便会向视频服务器发送视频台词处理请求，视频服务器便会从对应的帧图像中识别出台词文本，并将台词文本反馈给视频客户端，视频客户端中的界面展示模块902展示包括台词文本的台词操作界面，这样用户便可以在视频客户端的台词操作界面上进行操作，实现对视频台词的相应处理，不需要用户自己手动输入视频台词，非常便捷。
可理解的是，本申请实例提供的视频客户端为上述视频客户端执行的视频台词处理方法的功能架构模块，其有关技术名词的解释、举例说明、可选实施方式、有益效果等内容可以参考上述视频客户端执行的视频台词处理方法的相应内容，此处不再赘述。
与上述视频服务器执行的视频台词处理方法相对应的，本申请实例还提供一种视频服务器，如图10所示，该视频服务器1000包括：
信息提取模块1001,用于在接收到视频客户端发送来的视频台词处理请求时,提取所述处理请求携带的视频标识和时间信息;
图像获取模块1002,用于从所述视频标识对应的视频数据中获取所述时间信息对应的帧图像;
台词识别模块1003,用于从所述帧图像中识别出台词文本;
台词发送模块1004,用于将识别出的所述台词文本发送至所述视频客户端。
在一些实例中,台词识别模块1003可以具体包括:
区域检测单元,用于检测所述帧图像中的字符区域;
背景去除单元,用于去除所检测出的字符区域中的背景;
字符提取单元,用于从去除背景后的字符区域中提取字符序列;其中,所述字符序列包括一个或多个字符图片;
字符识别单元,用于对提取出的所述字符序列中包括的所述一个或多个字符图片进行文本识别,得到所述台词文本。
在一些实例中,台词识别模块1003还可以包括:
预处理单元,用于在区域检测单元检测所述帧图像中的字符区域之前,对所述帧图像进行预处理。其中,所述预处理可以包括平滑处理、版面分析和倾斜度校正中的至少一种。
在一些实例中,背景去除单元可以具体用于:对所检测出的字符区域进行二值化处理;对应的,字符提取单元可以具体用于:根据经过所述二值化处理的字符区域中各像素点的像素值,对经过所述二值化处理的字符区域进行字符分割得到所述字符序列。
在一些实例中,台词识别模块1003还可以包括:
后处理单元,用于根据语言句法约束条件,对台词文本进行后处理。
与上述视频服务器执行的视频台词处理方法类似的，本申请实例提供的视频服务器中的信息提取模块1001在接收到视频台词处理请求时，从该请求中提取出视频标识和时间信息，图像获取模块1002进而根据视频标识和时间信息，获取到对应的帧图像，然后台词识别模块1003从该帧图像中识别出台词文本，最后台词发送模块1004将台词文本发送至视频客户端，以便于视频客户端对台词文本进行相应的处理，以达到不需要用户自己手动输入视频台词便可以对台词进行分享、编辑等处理，非常便捷的效果。
可理解的是,本申请实例提供的视频服务器为上述视频服务器执行的视频台词处理方法的功能架构模块,其有关技术名词的解释、举例说明、可选实施方式、有益效果等内容可以参考上述视频服务器执行的视频台词处理方法的相应内容,此处不再赘述。
本申请还公开一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述视频台词处理方法(例如:上述步骤S201~S203、上述步骤S701~S704)的步骤。
上述存储介质有多种,例如,U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请还公开一种计算机设备,该设备可以为客户端设备,也可以为视频服务器,如图11所示,该计算机设备包括一个或者多个处理器(CPU)1102、通信模块1104、存储器1106、用户接口1110,以及用于互联这些组件的通信总线1108,其中:
处理器1102可通过通信模块1104接收和发送数据以实现网络通信和/或本地通信。
用户接口1110包括一个或多个输出设备1112，其包括一个或多个扬声器和/或一个或多个可视化显示器。用户接口1110也包括一个或多个输入设备1114，其包括诸如键盘、鼠标、声音命令输入单元或麦克风、触屏显示器、触敏输入板、姿势捕获摄像机或其他输入按钮或控件等。
存储器1106可以是高速随机存取存储器,诸如DRAM、SRAM、DDR RAM、或其他随机存取固态存储设备;或者非易失性存储器,诸如一个或多个磁盘存储设备、光盘存储设备、闪存设备,或其他非易失性固态存储设备。
存储器1106存储处理器1102可执行的指令集,包括:
操作系统1116，包括用于处理各种基本系统服务和用于执行硬件相关任务的程序；
应用1118,包括用于视频台词处理的各种应用程序,这种应用程序能够实现上述各实例中的处理流程,比如可以包括视频客户端900中的部分或者全部指令模块或单元,也可以包括视频服务器1000中的部分或全部指令模块或单元。处理器1102通过执行存储器1106中各单元中至少一个单元中的机器可执行指令,进而能够实现上述各单元或模块中的至少一个模块的功能。
需要说明的是,上述各流程和各结构图中不是所有的步骤和模块都是必须的,可以根据实际的需要忽略某些步骤或模块。各步骤的执行顺序不是固定的,可以根据需要进行调整。各模块的划分仅仅是为了便于描述采用的功能上的划分,实际实现时,一个模块可以分由多个模块实现,多个模块的功能也可以由同一个模块实现,这些模块可以位于同一个设备中,也可以位于不同的设备中。
各实施例中的硬件模块可以以硬件方式或硬件平台加软件的方式实现。上述软件包括机器可读指令,存储在非易失性存储介质中。因此,各实施例也可以体现为软件产品。
各例中,硬件可以由专门的硬件或执行机器可读指令的硬件实现。例如,硬件可以为专门设计的永久性电路或逻辑器件(如专用处理器,如FPGA或ASIC)用于完成特定的操作。硬件也可以包括由软件临时配置的可编程逻辑器件或电路(如包括通用处理器或其它可编程处理器)用于执行特定操作。
另外，本申请的每个实例可以通过由数据处理设备如计算机执行的数据处理程序来实现。显然，数据处理程序构成了本申请。此外，通常存储在一个存储介质中的数据处理程序通过直接从存储介质中读取出来，或者通过将程序安装或复制到数据处理设备的存储设备（如硬盘和/或内存）中来执行。因此，这样的存储介质也构成了本申请，本申请还提供了一种非易失性存储介质，其中存储有数据处理程序，这种数据处理程序可用于执行本申请上述方法实例中的任何一种实例。
图11模块对应的机器可读指令可以使计算机上操作的操作系统等来完成这里描述的部分或者全部操作。非易失性计算机可读存储介质可以是插入计算机内的扩展板中所设置的存储器中或者写到与计算机相连接的扩展单元中设置的存储器。安装在扩展板或者扩展单元上的CPU等可以根据指令执行部分和全部实际操作。
以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明保护的范围之内。

Claims (16)

  1. 一种视频台词处理方法,其特征在于,所述方法由视频服务器执行,包括:
    在接收到视频客户端发送来的视频台词处理请求时,提取所述处理请求携带的视频标识和时间信息;
    从所述视频标识对应的视频数据中获取所述时间信息对应的帧图像;
    从所述帧图像中识别出台词文本;
    将识别出的所述台词文本发送至所述视频客户端。
  2. 根据权利要求1所述的方法,其特征在于,所述从所述帧图像中识别出台词文本,包括:
    检测所述帧图像中的字符区域;
    去除所检测出的字符区域中的背景;
    从去除背景后的字符区域中提取字符序列;其中,所述字符序列包括一个或多个字符图片;
    对提取出的所述字符序列中包括的所述一个或多个字符图片进行文本识别,得到所述台词文本。
  3. 根据权利要求2所述的方法,其特征在于,所述从所述帧图像中识别出台词文本,还包括:
    在所述检测所述帧图像中的字符区域之前,对所述帧图像进行预处理。
  4. 根据权利要求3所述的方法,其特征在于,所述预处理包括如下至少一种:平滑处理、版面分析和倾斜度校正。
  5. 根据权利要求2所述的方法,其特征在于,所述去除所检测出的字符区域中的背景,包括:对所检测出的字符区域进行二值化处理;
    其中,所述从去除背景后的字符区域中提取字符序列,包括:
    根据经过所述二值化处理的字符区域中各像素点的像素值,对经过所述二值化处理的字符区域进行字符分割得到所述字符序列。
  6. 根据权利要求2所述的方法,其特征在于,所述从所述帧图像中识别出台词文本,还包括:
    根据语言句法约束条件,对识别出的所述台词文本进行后处理。
  7. 一种视频台词处理方法,其特征在于,所述方法由视频客户端执行,包括:
    响应于对视频播放界面中视频台词控件的操作,向视频服务器发送携带视频标识和时间信息的视频台词处理请求,以使所述视频服务器从所述视频标识和所述时间信息所对应的帧图像中识别出台词文本;
    在接收到所述视频服务器发送来的所述台词文本时,展示包含所述台词文本的台词操作界面;
    响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理。
  8. 根据权利要求7所述的方法，其特征在于，所述视频台词处理请求为视频台词分享请求；所述台词操作界面中还包括一个或多个可供选择的分享平台和/或评论区的信息。
  9. 根据权利要求8所述的方法,其特征在于,所述响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理,包括:
    响应于对所述台词操作界面中一个分享平台的选择操作,若被选择的分享平台处于登录状态,则展示包含所述台词文本的所述被选择的分享平台的信息发布界面;
    响应于对所述被选择的分享平台的信息发布界面的发布操作,将所述台词文本发布到所述被选择的分享平台。
  10. 根据权利要求9所述的方法,其特征在于,所述响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理,还包括:
    响应于对所述台词操作界面中一个分享平台的选择操作,若所述被选择的分享平台处于未登录状态,则展示所述被选择的分享平台的登录界面;
    响应于对所述被选择的分享平台的登录界面的登录操作，登录所述被选择的分享平台。
  11. 根据权利要求8所述的方法,其特征在于,所述响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理,包括:
    响应于对所述台词操作界面中一个评论区的选择操作,将所述台词文本发布到所述被选择的评论区。
  12. 根据权利要求7所述的方法,其特征在于,所述台词文本展示在所述台词操作界面的可编辑文本框内;
    所述响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理,包括:
    响应于对所述可编辑文本框的操作,对所述台词文本进行编辑操作。
  13. 一种视频服务器,其特征在于,包括:
    信息提取模块,用于在接收到视频客户端发送来的视频台词处理请求时,提取所述处理请求携带的视频标识和时间信息;
    图像获取模块,用于从所述视频标识对应的视频数据中获取所述时间信息对应的帧图像;
    台词识别模块,用于从所述帧图像中识别出台词文本;
    台词发送模块,用于将识别出的所述台词文本发送至所述视频客户端。
  14. 一种视频客户端,其特征在于,包括:
    请求发送模块,用于响应于对视频播放界面中视频台词控件的操作,向视频服务器发送携带视频标识和时间信息的视频台词处理请求,以使所述视频服务器从所述视频标识和所述时间信息所对应的帧图像中识别出台词文本;
    界面展示模块,用于在接收到所述视频服务器发送来的所述台词文本时,展示包含所述台词文本的台词操作界面;
    台词处理模块,用于响应于对所述台词操作界面的操作,对所述台词文本进行相应的处理。
  15. 一种计算机可读存储介质，其上存储有计算机程序，其特征在于，所述计算机程序被处理器执行时实现如权利要求1~12任一所述方法的步骤。
  16. 一种计算机设备,其特征在于,所述计算机设备包括处理器,所述处理器中存储有指令集,所述指令集被处理器执行时实现如权利要求1~12任一所述方法的步骤。
PCT/CN2018/097089 2017-07-26 2018-07-25 视频台词处理方法、客户端、服务器及存储介质 WO2019020061A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710616032.8 2017-07-26
CN201710616032.8A CN109309844B (zh) 2017-07-26 2017-07-26 视频台词处理方法、视频客户端及服务器

Publications (1)

Publication Number Publication Date
WO2019020061A1 true WO2019020061A1 (zh) 2019-01-31

Family

ID=65039997

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/097089 WO2019020061A1 (zh) 2017-07-26 2018-07-25 视频台词处理方法、客户端、服务器及存储介质

Country Status (2)

Country Link
CN (1) CN109309844B (zh)
WO (1) WO2019020061A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147467A (zh) * 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 一种文本描述的生成方法、装置、移动终端及存储介质
CN114449133A (zh) * 2021-12-23 2022-05-06 北京达佳互联信息技术有限公司 文件显示方法、装置、设备、存储介质及程序产品
CN115373550B (zh) * 2022-10-24 2022-12-20 中诚华隆计算机技术有限公司 一种获取交互信息的方法、系统及芯片

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1600889A1 (en) * 2004-05-21 2005-11-30 Samsung Electronics Co., Ltd. Apparatus and method for extracting character(s) from image
CN101021903A (zh) * 2006-10-10 2007-08-22 鲍东山 视频字幕内容分析系统
US20120137237A1 (en) * 2010-08-13 2012-05-31 Sony Corporation System and method for digital image and video manipulation and transfer
CN105338419A (zh) * 2015-10-29 2016-02-17 网易传媒科技(北京)有限公司 一种字幕集锦的生成方法和设备
CN105872810A (zh) * 2016-05-26 2016-08-17 网易传媒科技(北京)有限公司 一种媒体内容分享方法和装置
CN106254933A (zh) * 2016-08-08 2016-12-21 腾讯科技(深圳)有限公司 字幕提取方法及装置
CN107862315A (zh) * 2017-11-02 2018-03-30 腾讯科技(深圳)有限公司 字幕提取方法、视频搜索方法、字幕分享方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186780B (zh) * 2011-12-30 2018-01-26 乐金电子(中国)研究开发中心有限公司 视频字幕识别方法及装置
US9021536B2 (en) * 2012-09-06 2015-04-28 Stream Translations, Ltd. Process for subtitling streaming video content
CN102916951A (zh) * 2012-10-11 2013-02-06 北京百度网讯科技有限公司 多媒体信息转换的方法、系统和装置
CN103593142B (zh) * 2013-11-29 2017-10-24 杭州网易云音乐科技有限公司 一种分享歌词的方法及装置
CN104361336A (zh) * 2014-11-26 2015-02-18 河海大学 一种水下视频图像的文字识别方法
CN106162323A (zh) * 2015-03-26 2016-11-23 无锡天脉聚源传媒科技有限公司 一种视频数据处理方法及装置
CN106295628A (zh) * 2015-05-20 2017-01-04 地利控股(西咸新区)网络农业有限公司 一种使视频中出现的文字易于交互的方法
CN105472409B (zh) * 2015-12-01 2018-12-21 康佳集团股份有限公司 一种基于社交朋友圈分享电视节目直播方法及系统


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112968826A (zh) * 2020-02-05 2021-06-15 北京字节跳动网络技术有限公司 语音交互方法、装置和电子设备
CN112968826B (zh) * 2020-02-05 2023-08-08 北京字节跳动网络技术有限公司 语音交互方法、装置和电子设备
CN112752121A (zh) * 2020-05-26 2021-05-04 腾讯科技(深圳)有限公司 一种视频封面生成方法及装置
CN112752121B (zh) * 2020-05-26 2023-06-09 腾讯科技(深圳)有限公司 一种视频封面生成方法及装置
CN111654715A (zh) * 2020-06-08 2020-09-11 腾讯科技(深圳)有限公司 直播的视频处理方法、装置、电子设备及存储介质
CN111654715B (zh) * 2020-06-08 2024-01-09 腾讯科技(深圳)有限公司 直播的视频处理方法、装置、电子设备及存储介质
CN111836061A (zh) * 2020-06-18 2020-10-27 北京嘀嘀无限科技发展有限公司 直播辅助方法、装置、服务器和可读存储介质
CN113552984A (zh) * 2021-08-09 2021-10-26 北京字跳网络技术有限公司 文本提取方法、装置、设备及介质
CN115150649A (zh) * 2022-06-14 2022-10-04 阿里云计算有限公司 一种媒体流的播放方法、设备及存储介质

Also Published As

Publication number Publication date
CN109309844B (zh) 2022-02-22
CN109309844A (zh) 2019-02-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18837476

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18837476

Country of ref document: EP

Kind code of ref document: A1