CN111836062A - Video playing method and device and computer readable storage medium


Info

Publication number: CN111836062A
Authority: CN (China)
Prior art keywords: target, audio track, text information, track data, information
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202010622064.0A
Other languages: Chinese (zh)
Inventor: 张浩波
Current Assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010622064.0A
Publication of CN111836062A

Classifications

    • H04N21/2187 Live feed (selective content distribution; servers; source of audio or video content)
    • G10L15/26 Speech to text systems (speech analysis; speech recognition)
    • H04L65/60 Network streaming of media packets (real-time applications in data packet communication)
    • H04L65/75 Media network packet handling
    • H04N21/434 Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream
    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure relates to a video playing method and apparatus and a computer-readable storage medium. The method includes: caching streaming media data received in real time; acquiring, from the cached streaming media data, target audio track data to which subtitles are to be added; analyzing the target audio track data to obtain target text information corresponding to the target audio track data; performing time axis alignment between the target text information and the target audio track data to obtain time axis information corresponding to the target text information; and playing video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information. In this way, subtitles can be added to the played video in real time, so that users can determine the content of the video more clearly while watching. Moreover, since subtitles are added in real time without affecting the playing progress, the real-time performance of video playing is guaranteed and the user experience is improved.

Description

Video playing method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a video playing method and apparatus, and a computer-readable storage medium.
Background
Today, streaming media such as live broadcast is increasingly popular, but subtitles are usually not displayed in live scenarios. In the related art, subtitles are usually added to the video only after the live broadcast has finished, which is inconvenient for users.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video playing method, apparatus, and computer-readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a video playing method, including:
caching the streaming media data received in real time;
acquiring, from the cached streaming media data, target audio track data to which subtitles are to be added;
analyzing the target audio track data to obtain target text information corresponding to the target audio track data;
carrying out time axis alignment on the target text information and the target audio track data to obtain time axis information corresponding to the target text information;
and playing video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information.
Optionally, the method further comprises:
receiving a language selection instruction set by a user, wherein the language selection instruction is used for indicating a target language set by the user;
the analyzing the target audio track data to obtain target text information corresponding to the target audio track data includes:
performing voice recognition on the target audio track data to obtain first text information corresponding to the target audio track data;
and when the language of the first text information is different from the target language, performing language conversion on the first text information according to the target language to obtain the target text information.
Optionally, the playing a video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information includes:
and playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying the target text information according to the time axis information.
Optionally, the method further comprises:
carrying out sentence segmentation on the target text information, and determining each sentence contained in the target text information;
for each sentence, determining image data corresponding to the sentence from the streaming media data according to time axis information corresponding to the sentence;
determining target person information corresponding to the sentence according to the image data;
the playing the video according to the streaming media data and adding subtitles to the played video based on the target text information and the time axis information includes:
and playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying text information of a sentence corresponding to the target person information at a position corresponding to the target person information according to the time axis information.
Optionally, the determining, according to the image data, target person information corresponding to the sentence includes:
performing face recognition according to the image data, and determining person information corresponding to the image data;
extracting voiceprint features from the audio track data corresponding to the sentence to obtain voiceprint information corresponding to the sentence;
and determining, according to the voiceprint information, the person information matching the voiceprint information among the person information corresponding to the image data as the target person information.
Optionally, the method further comprises:
storing progress information corresponding to the target audio track data to indicate that subtitles have been added to the target audio track data.
According to a second aspect of the embodiments of the present disclosure, there is provided a video playback apparatus including:
the caching module is configured to cache the streaming media data received in real time;
the acquisition module is configured to acquire, from the cached streaming media data, target audio track data to which subtitles are to be added;
the analysis module is configured to analyze the target audio track data to obtain target text information corresponding to the target audio track data;
the processing module is configured to align the target text information with the target audio track data in a time axis manner to obtain time axis information corresponding to the target text information;
and the playing module is configured to play video according to the streaming media data and add subtitles to the played video based on the target text information and the time axis information.
Optionally, the apparatus further comprises:
the receiving module is configured to receive a language selection instruction set by a user, the language selection instruction being used for indicating a target language set by the user;
the analysis module includes:
the recognition sub-module is configured to perform voice recognition on the target audio track data to obtain first text information corresponding to the target audio track data;
and the conversion sub-module is configured to perform language conversion on the first text information according to the target language to obtain the target text information when the language of the first text information is different from the target language.
Optionally, the playing module includes:
and the first playing sub-module is configured to play the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and display the target text information according to the time axis information.
Optionally, the apparatus further comprises:
the segmentation submodule is configured to perform sentence segmentation on the target text information and determine each sentence contained in the target text information;
a first determining sub-module configured to determine, for each sentence, image data corresponding to the sentence from the streaming media data according to the time axis information corresponding to the sentence;
a second determination sub-module configured to determine target person information corresponding to the sentence from the image data;
the playing module comprises:
and a second playing sub-module configured to play the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and display text information of a sentence corresponding to the target person information at a position corresponding to the target person information according to the time axis information.
Optionally, the second determining sub-module includes:
the third determining sub-module is configured to perform face recognition according to the image data and determine person information corresponding to the image data;
the extraction sub-module is configured to extract voiceprint features from the audio track data corresponding to the sentence to obtain voiceprint information corresponding to the sentence;
a fourth determining sub-module configured to determine, as the target person information, person information that matches the voiceprint information among the person information corresponding to the image data, according to the voiceprint information.
Optionally, the apparatus further comprises:
a storage module configured to store progress information corresponding to the target audio track data to indicate that subtitles have been added to the target audio track data.
According to a third aspect of the embodiments of the present disclosure, there is provided a video playback apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
caching the streaming media data received in real time;
acquiring, from the cached streaming media data, target audio track data to which subtitles are to be added;
analyzing the target audio track data to obtain target text information corresponding to the target audio track data;
carrying out time axis alignment on the target text information and the target audio track data to obtain time axis information corresponding to the target text information;
and playing video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the video playback method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
Through the above technical solution, the streaming media data received in real time is cached; target audio track data to which subtitles are to be added is acquired from the cached streaming media data; the target audio track data is analyzed to obtain target text information corresponding to the target audio track data; time axis alignment is performed between the target text information and the target audio track data to obtain time axis information corresponding to the target text information; and video is played according to the streaming media data while subtitles are added to the played video based on the target text information and the time axis information. In this way, the audio track data in the cached streaming media data can be analyzed to generate the corresponding text information, and that text information can be displayed while the video is played, so that subtitles are added to the played video in real time and users can determine the content of the video more clearly while watching. In addition, since the solution operates on audio track data that has already been cached, subtitles are added in real time without affecting the playing progress, and the real-time performance of video playing is guaranteed. The solution can also provide subtitle descriptions for hearing-impaired users, further improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a video playback method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an exemplary implementation of determining target person information corresponding to the sentence from the image data according to an exemplary embodiment.
FIG. 3 is a schematic diagram of a video playback interface shown in accordance with an exemplary embodiment.
Fig. 4 is a block diagram illustrating a video playback device in accordance with an exemplary embodiment.
Fig. 5 is a block diagram illustrating a video playback device according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a video playing method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
In step 11, streaming media data received in real time is buffered.
As an example, in a live scenario, the streaming media data may be video data loaded by the terminal from the server, and the terminal may cache the received streaming media data at the terminal, so that the cached streaming media data may be processed at the terminal.
In step 12, target audio track data to be added with subtitles is obtained from the buffered streaming media data.
The streaming media data may include audio track data and image data, where the audio track data is used for playing audio and the image data is used for playing images, so as to implement playing of video.
Optionally, when the streaming media data starts to be loaded, there is not yet any audio track data to which subtitles have been added. The data amount corresponding to the audio track data in the buffered streaming media data may be determined in real time or periodically, and when the data amount is greater than or equal to a target threshold, the audio track data in the buffered streaming media data is determined as the target audio track data. The target threshold may be set according to the actual usage scenario, which is not limited by this disclosure. Illustratively, the target threshold may be set to 10M.
In a possible embodiment, the data amount corresponding to the track data in the cached streaming media data may be determined in real time, and when the data amount reaches the target threshold, the track data corresponding to that data amount is determined as the target track data. For example, while streaming media data received in real time is buffered, the data amount corresponding to the audio track data in the streaming media data is determined at the same time. When the target threshold is 10M, each time the data amount reaches 10M, the corresponding track data in the buffer is determined as the target track data. The data amount determined in real time may exceed 10M; if it is 11M, the track data portion corresponding to 11M is determined as the target track data. It should also be noted that, when determining the target audio track data to which subtitles are to be added, it must be ensured that the previously subtitled audio track data has not yet finished playing; that is, the target audio track data is determined from the streaming media data that has been cached but not yet played.
In another possible embodiment, the method may further include: storing progress information corresponding to the target audio track data to indicate that subtitles have been added to the target audio track data. For example, the data amount of the determined target track data may be stored as the progress information. At every target time interval, it may then be determined whether the data amount of the track data following the progress information in the cached streaming media data reaches the target threshold; if so, that track data is determined as the target track data. For instance, if the data amount of the initially determined target track data is 10M, the progress information may be set to 10M. Every 5 seconds, whether the data amount of the track data after the progress information reaches 10M may be determined, for example from the difference between the total data amount of the track data and the progress information. If the total data amount of the track data is determined to be 23M, the data amount of the track data following the progress information (13M) has reached the target threshold, and that track data may be determined as the target track data.
Therefore, the target audio track data can be determined quickly by this method, facilitating its subsequent processing. Meanwhile, the completeness and accuracy of processing the audio track data in the streaming media data can be ensured, which in turn ensures the completeness and accuracy of the subsequently added subtitles.
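As a non-authoritative illustration, the threshold-and-progress logic of the two embodiments above can be sketched as follows. This is a minimal Python sketch; reading "10M" as 10 MiB, representing the progress information as a byte offset, and all names are assumptions made for illustration, not details fixed by the patent.

```python
from typing import Optional

TARGET_THRESHOLD = 10 * 1024 * 1024  # assumed: "10M" read as 10 MiB


class TrackBuffer:
    def __init__(self) -> None:
        self.track_data = bytearray()  # cached audio track data
        self.progress = 0  # progress information: bytes already subtitled

    def append(self, chunk: bytes) -> None:
        """Cache audio track data extracted from the streaming media."""
        self.track_data.extend(chunk)

    def next_target(self) -> Optional[bytes]:
        """Return the next target track data once the data following the
        stored progress information reaches the target threshold."""
        pending = len(self.track_data) - self.progress
        if pending < TARGET_THRESHOLD:
            return None
        target = bytes(self.track_data[self.progress:])
        self.progress = len(self.track_data)  # store new progress information
        return target
```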
In step 13, the target audio track data is analyzed to obtain target text information corresponding to the target audio track data.
The target audio track data can be recognized through a speech recognition method, so as to obtain the target text information corresponding to the target audio track data.
In step 14, time axis alignment is performed on the target text information and the target audio track data, and time axis information corresponding to the target text information is obtained.
As an example, time axis alignment between the track data and the sequence corresponding to the target text information may be performed by a sequence-to-sequence (Seq2Seq) model, thereby obtaining the time axis information corresponding to the target text information. The Seq2Seq model is prior art and is not described herein again.
As another example, when subtitles are displayed, a whole sentence is usually shown at once. Based on this, for each sentence in the target text information, the start and end times of the track data corresponding to the sentence can be obtained and determined as the time axis information corresponding to the target text information. For example, for the sentence "welcome people come to XX live broadcasting", the start and end times corresponding to the track data are (1, 3), i.e., the sentence starts at the 1st second and ends at the 3rd second, so that the time axis information corresponding to the target text information is obtained.
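As a hedged sketch of this representation, the per-sentence time axis information can be modeled as start/end times, matching the "(1, 3)" example above. The alignment step itself (a Seq2Seq model or another aligner) is abstracted behind the `align` callable, which is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlignedSentence:
    text: str
    start_s: float  # start time in seconds within the track data
    end_s: float    # end time in seconds within the track data


def timeline_info(sentences: List[str],
                  align: Callable[[str], Tuple[float, float]]) -> List[AlignedSentence]:
    """align(text) is assumed to return (start, end) for each sentence."""
    return [AlignedSentence(s, *align(s)) for s in sentences]
```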
In step 15, video playing is performed according to the streaming media data, and subtitles are added to the played video based on the target text information and the time axis information.
Through the above technical solution, the streaming media data received in real time is cached; target audio track data to which subtitles are to be added is acquired from the cached streaming media data; the target audio track data is analyzed to obtain target text information corresponding to the target audio track data; time axis alignment is performed between the target text information and the target audio track data to obtain time axis information corresponding to the target text information; and video is played according to the streaming media data while subtitles are added to the played video based on the target text information and the time axis information. In this way, the audio track data in the cached streaming media data can be analyzed to generate the corresponding text information, and that text information can be displayed while the video is played, so that subtitles are added to the played video in real time and users can determine the content of the video more clearly while watching. In addition, since the solution operates on audio track data that has already been cached, subtitles are added in real time without affecting the playing progress, and the real-time performance of video playing is guaranteed. The solution can also provide subtitle descriptions for hearing-impaired users, further improving the user experience.
With the development of computer network technology, users may in practice watch live broadcasts in foreign languages, and due to the language barrier they cannot clearly understand the specific content of the live broadcast. Accordingly, the present disclosure also provides the following embodiments.
Optionally, the method may further include:
receiving a language selection instruction set by a user, wherein the language selection instruction is used for indicating a target language set by the user.
Illustratively, the user may make a language selection through a language setting interface. The language setting interface may present options for a plurality of languages, and the user may select one of them as the target language for the subtitles displayed with the streaming media data.
Accordingly, in step 13, an exemplary implementation of analyzing the target audio track data to obtain the target text information corresponding to the target audio track data is as follows, and the step may include:
and performing voice recognition on the target audio track data to obtain first text information corresponding to the target audio track data. For example, the target audio track data may be subjected to Speech Recognition by an Automatic Speech Recognition (ASR) technique.
And when the language of the first text information is different from the target language, performing language conversion on the first text information according to the target language to obtain the target text information.
For example, consider an application scenario in which the user is watching a live Korean program and the target language selected by the user is Chinese. The first text information corresponding to the target track data obtained through speech recognition is Korean text, which differs from the target language. In this case, the first text information may be translated by a translator, so that the Korean text information is converted into Chinese text information, thereby obtaining the target text information.
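The two steps above can be sketched as follows. The three helpers are hypothetical stand-ins (stubbed here) for a real ASR engine, language detector, and translator; the patent does not name any specific API.

```python
def asr_transcribe(track_data: bytes) -> str:
    raise NotImplementedError  # hypothetical: an ASR engine, e.g. producing Korean text


def detect_language(text: str) -> str:
    raise NotImplementedError  # hypothetical: returns a language code such as "ko" or "zh"


def translate(text: str, to: str) -> str:
    raise NotImplementedError  # hypothetical: e.g. Korean-to-Chinese translation


def target_text(track_data: bytes, target_language: str) -> str:
    """Speech recognition, then language conversion only if needed."""
    first_text = asr_transcribe(track_data)
    if detect_language(first_text) != target_language:
        return translate(first_text, to=target_language)
    return first_text
```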
Therefore, with this technical solution, subtitles corresponding to the speech in the video can be displayed in real time, and the subtitles can be displayed in the target language selected by the user, so that the user can understand the specific content of a video even in an unfamiliar language. Moreover, since the target text information is determined based on the cached streaming media data, different target text information can be displayed on the terminals of different users for the same video data, which further widens the application range of the video playing method.
Optionally, in step 15, an exemplary implementation manner of playing a video according to the streaming media data and adding a subtitle to the played video based on the target text information and the time axis information is as follows, and the step may include:
and playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying the target text information according to the time axis information.
For example, with the above embodiment, the target text information corresponding to the target audio track data in the cached streaming media data may be determined, so that when video playing is performed based on the streaming media data in the cache, the target audio track data and the image data corresponding to the target audio track data may be played, thereby realizing audio-video synchronization of the streaming media data playing. The synchronous playing of the audio track data and the image data in the streaming media data is a playing method commonly used in the art, and is not described herein again. In the embodiment of the present disclosure, while playing a video, the target text information may be displayed according to the time axis information, so that a subtitle may be added to the played video in real time, and the subtitle corresponds to audio track data in the video.
For example, following the example above, for the sentence "welcome people come to XX live broadcast" with start and end times (1, 3), displaying the target text information according to the time axis information means displaying "welcome people come to XX live broadcast" from the 1st to the 3rd second of video playing. Meanwhile, the audio track data being played is this sentence and the picture is the one corresponding to it, thereby ensuring synchronous display of audio, picture, and subtitle.
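A minimal sketch of the subtitle lookup during playback: given the current playback position, show the sentence whose time axis information covers it. Representing `aligned` as a list of (text, start_s, end_s) tuples is an assumption for illustration.

```python
def current_subtitle(aligned, position_s):
    """Return the subtitle text covering the given playback position, if any."""
    for text, start_s, end_s in aligned:
        if start_s <= position_s <= end_s:
            return text
    return None


aligned = [("welcome people come to XX live broadcast", 1.0, 3.0)]
print(current_subtitle(aligned, 2.0))  # -> the sentence above, shown at second 2
```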
Therefore, with this technical solution, subtitles can be added to the played video and displayed in real time, so that the content currently being watched is clearer to the user. Meanwhile, the method achieves synchronous display of the audio, picture, and subtitles of the played video, and is also suitable for hearing-impaired users watching the video, which further widens the application range of the method, makes watching videos more convenient, and improves the user experience.
In another application scenario, when there are many people in the picture, it is difficult for a hearing-impaired user to tell which person in the picture speaks each sentence. Based on this, the present disclosure also provides the following embodiments.
Optionally, the method may further include:
and carrying out sentence segmentation on the target text information, and determining each sentence contained in the target text information. Illustratively, sentence segmentation of the target text information may be achieved by training a Natural Language Processing (NLP) classifier, or sentence segmentation may also be achieved based on NLTK (Natural Language processing Toolkit).
Then, for each sentence, image data corresponding to the sentence is determined from the streaming media data according to the time axis information corresponding to the sentence.
The start and end times of the audio track data corresponding to the sentence can be determined as the time axis information corresponding to the sentence, as described in detail above. The image data corresponding to that time axis information in the streaming media data may then be determined as the image data corresponding to the sentence. For example, the image data corresponding to the 1st to 3rd seconds may be determined as the image data corresponding to the sentence "welcome people come to XX live broadcast".
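These two steps can be sketched as follows: sentence segmentation via NLTK (one of the options the text mentions; this assumes the punkt model is available via `nltk.download('punkt')`) and selection of the video frames whose timestamps fall inside a sentence's time axis information. Modeling `frames` as (timestamp_s, frame) pairs is an assumption, not a structure defined by the patent.

```python
from nltk.tokenize import sent_tokenize


def frames_for_sentence(frames, start_s, end_s):
    """Return the frames whose timestamps fall within [start_s, end_s]."""
    return [frame for t, frame in frames if start_s <= t <= end_s]


sentences = sent_tokenize("Welcome to the XX live broadcast. Next question, please.")
# -> ['Welcome to the XX live broadcast.', 'Next question, please.']
```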
Then, target person information corresponding to the sentence may be determined from the image data. The target person information is information on the person in the image data who utters the sentence.
To this end, the present disclosure provides the following embodiments for determining the target person information corresponding to the sentence, based on the assumption that the person in the current frame and the person corresponding to the voiceprint information of the speech are the same person.
Illustratively, the exemplary implementation of determining the target person information corresponding to the sentence according to the image data is as follows, and as shown in fig. 2, the step may include:
in step 21, face recognition is performed based on the image data, and the personal information corresponding to the image data is determined.
As an example, face recognition may be performed on each video frame included in the image data, so that the person information contained in each video frame is determined. In an actual application scenario, the persons included in a number of consecutive frames are generally the same; as another example, the video frames in the image data may therefore be sampled at preset time intervals and face recognition performed only on the sampled frames, so as to determine the person information contained in the video.
For example, each face image in a video frame may be extracted by an existing face detection algorithm, such as SeetaFace or MTCNN. Key points can then be extracted by an existing key point detection algorithm, for example face key point detection methods such as the ERT (Ensemble of Regression Trees) algorithm or MDM (Mnemonic Descent Method), so as to obtain the key points corresponding to each face image. After the key points corresponding to a face image are determined, the state information of the person can be determined according to the position information of the key points.
The person who is speaking usually occupies a large proportion of the current picture. Therefore, in one possible embodiment, when a plurality of persons are determined to be present in the picture, the screen occupation ratio of each person may be determined according to the position information of the key points corresponding to that person, and the information of the persons whose screen occupation ratios rank in the top N is determined as the person information corresponding to the image data. The person information may be, for example, the gender information and position information corresponding to the person.
In another possible embodiment, the position information of the key points may be used to determine whether a person is speaking, so as to determine the person information corresponding to the image data. A detection model may be trained in advance to detect, based on the position information of the key points, whether a person is in a speaking state; for example, the position information of key points corresponding to face images of speaking persons may be used as training data. Then, once the key points corresponding to each person are determined, their position information can be input into the detection model to determine whether the person is in a speaking state, and when the person is determined to be speaking, the information of that person is determined as the person information corresponding to the image.
In another possible embodiment, the two manners described above may be combined to determine the person information corresponding to the image data; for example, the information corresponding to the persons in the speaking state whose screen ratios rank in the top M may be determined as the person information corresponding to the image data, where N and M may be the same or different.
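A hedged sketch of these heuristics: rank detected persons by their estimated screen occupation ratio and keep the top N, optionally filtered by a speaking-state detector. The person structure and the detector are illustrative assumptions, not data structures defined by the patent.

```python
def top_persons(persons, n, is_speaking=None):
    """persons: list of dicts like {'info': ..., 'screen_ratio': 0.4}.
    is_speaking: optional predicate from a pretrained detection model."""
    candidates = [p for p in persons if is_speaking is None or is_speaking(p)]
    candidates.sort(key=lambda p: p["screen_ratio"], reverse=True)
    return [p["info"] for p in candidates[:n]]
```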
In this way, the information of the persons who may be speaking in the corresponding picture can be extracted comprehensively and accurately, providing comprehensive data support for subsequently determining the target person information corresponding to the sentence.
In step 22, voiceprint feature extraction is performed based on the audio track data corresponding to the sentence, and the voiceprint information corresponding to the sentence is obtained. Illustratively, the voiceprint information can be obtained by extracting MFCC (Mel-frequency cepstral coefficient) features as the voiceprint features.
In step 23, the person information matching the voiceprint information among the person information corresponding to the image data is determined as the target person information based on the voiceprint information.
As an example, the matching may be performed based on gender information determined from the voiceprint information. If the extracted voiceprint information is characterized as a female voice, the person information whose gender information is female, among the person information corresponding to the image data, is determined as the target person information. The manner of determining gender from voiceprint information is prior art and is not described herein again.
As another example, the matching may be based on age information determined from the voiceprint information. If the extracted voiceprint information indicates a middle-aged speaker, the person information whose age information is middle-aged, among the person information corresponding to the image data, is determined as the target person information. The manner of determining age from voiceprint information is prior art and is not described herein again.
If a plurality of pieces of matching person information are determined in the above manner, the person information of the matched person with the largest screen ratio in the speaking state may be determined as the target person information.
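Steps 22 and 23 can be sketched as follows: extract MFCC features as the voiceprint (librosa is assumed here as one common library; the patent does not name it) and match persons by a voiceprint-derived attribute such as gender. `predict_gender` is a hypothetical classifier stub.

```python
import librosa
import numpy as np


def voiceprint(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Extract MFCC features and pool them into a fixed-length voiceprint."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
    return mfcc.mean(axis=1)


def predict_gender(voice_vec: np.ndarray) -> str:
    raise NotImplementedError  # hypothetical: a pretrained gender classifier


def match_person(voice_vec, persons):
    """Pick the person matching the voiceprint-derived gender;
    tie-break by the largest screen ratio in the speaking state."""
    gender = predict_gender(voice_vec)
    matched = [p for p in persons if p.get("gender") == gender]
    return max(matched, key=lambda p: p["screen_ratio"], default=None)
```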
Therefore, by this method, the target person information corresponding to each sentence in the video picture can be determined quickly and accurately, so that each sentence can be associated with person information, which facilitates adding subtitles on a per-person basis.
Accordingly, in step 15, an exemplary implementation manner of playing a video according to the streaming media data and adding a subtitle to the played video based on the target text information and the time axis information is as follows, and the step may include:
and playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying text information of a sentence corresponding to the target character information at a position corresponding to the target character information according to the time axis information.
The manner of playing the target audio track data and the corresponding image data in the streaming media data has been described in detail above. Illustratively, the time axis information corresponding to the sentence "welcome people come to XX live broadcast" is (1, 3), and the image data corresponding to the sentence is determined to contain two persons, person A and person B as shown in Fig. 3. If the target person information corresponding to the sentence indicates person A, the text information of the sentence can be displayed at the position corresponding to person A, as shown in Fig. 3. The display position of the text information of the sentence corresponding to the target person information can be determined according to the position information of the key points corresponding to the target person information.
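A minimal sketch of placing a sentence's subtitle near the matched person, using a face bounding box derived from the key points. The (x, y, w, h) box format and the margin are illustrative assumptions.

```python
def subtitle_position(face_box, margin=10):
    """face_box: (x, y, w, h) derived from key point detection."""
    x, y, w, h = face_box
    return (x, y + h + margin)  # draw the text just below the face
```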
Therefore, through the above, the target person information corresponding to each sentence in the target audio track data can be determined, so that when subtitles are added to the played video, the text information of a sentence is displayed at the position corresponding to the person information. A user can thus clearly determine which person in the video is speaking and what is being said, which helps hearing-impaired users understand the video and further improves the user experience.
The present disclosure also provides a video playing apparatus, as shown in fig. 4, the apparatus 10 includes:
a caching module 100 configured to cache streaming media data received in real time;
an obtaining module 200 configured to obtain, from the cached streaming media data, target audio track data to which subtitles are to be added;
the analysis module 300 is configured to analyze the target audio track data to obtain target text information corresponding to the target audio track data;
a processing module 400 configured to perform time axis alignment on the target text information and the target audio track data, and obtain time axis information corresponding to the target text information;
and the playing module 500 is configured to play a video according to the streaming media data, and add subtitles to the played video based on the target text information and the time axis information.
Optionally, the apparatus further comprises:
the receiving module is configured to receive a language selection instruction set by a user, the language selection instruction being used for indicating a target language set by the user;
the analysis module includes:
the recognition sub-module is configured to perform voice recognition on the target audio track data to obtain first text information corresponding to the target audio track data;
and the conversion sub-module is configured to perform language conversion on the first text information according to the target language to obtain the target text information when the language of the first text information is different from the target language.
Optionally, the playing module includes:
and the first playing sub-module is configured to play the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and display the target text information according to the time axis information.
Optionally, the apparatus further comprises:
the segmentation submodule is configured to perform sentence segmentation on the target text information and determine each sentence contained in the target text information;
a first determining sub-module configured to determine, for each sentence, image data corresponding to the sentence from the streaming media data according to the time axis information corresponding to the sentence;
a second determination sub-module configured to determine target person information corresponding to the sentence from the image data;
the playing module comprises:
and a second playing sub-module configured to play the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and display text information of a sentence corresponding to the target person information at a position corresponding to the target person information according to the time axis information.
Optionally, the second determining sub-module includes:
the third determining sub-module is configured to perform face recognition according to the image data and determine person information corresponding to the image data;
the extraction sub-module is configured to extract voiceprint features from the audio track data corresponding to the sentence to obtain voiceprint information corresponding to the sentence;
a fourth determining sub-module configured to determine, as the target person information, person information that matches the voiceprint information among the person information corresponding to the image data, according to the voiceprint information.
Optionally, the apparatus further comprises:
a storage module configured to store progress information corresponding to the target audio track data to indicate that subtitles have been added to the target audio track data.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video playback method provided by the present disclosure.
Fig. 5 is a block diagram illustrating a video playback device 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the video playback method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described video playing method.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the video playback method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the video playback method described above when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A video playback method, comprising:
caching the streaming media data received in real time;
acquiring, from the cached streaming media data, target audio track data to which subtitles are to be added;
analyzing the target audio track data to obtain target text information corresponding to the target audio track data;
carrying out time axis alignment on the target text information and the target audio track data to obtain time axis information corresponding to the target text information;
and playing video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information.
2. The method of claim 1, further comprising:
receiving a language selection instruction set by a user, wherein the language selection instruction is used for indicating a target language set by the user;
the analyzing the target audio track data to obtain target text information corresponding to the target audio track data includes:
performing voice recognition on the target audio track data to obtain first text information corresponding to the target audio track data;
and when the language of the first text information is different from the target language, performing language conversion on the first text information according to the target language to obtain the target text information.
3. The method according to claim 1, wherein the playing video according to the streaming media data and adding subtitles to the played video based on the target text information and the time axis information comprises:
and playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying the target text information according to the time axis information.
4. The method of claim 1, further comprising:
performing sentence segmentation on the target text information to determine each sentence contained in the target text information;
for each sentence, determining image data corresponding to the sentence from the streaming media data according to the time axis information corresponding to the sentence;
determining target person information corresponding to the sentence according to the image data;
wherein the playing the video according to the streaming media data and adding subtitles to the played video based on the target text information and the time axis information comprises:
playing the target audio track data and the image data corresponding to the target audio track data in the streaming media data, and displaying the text information of the sentence corresponding to the target person information at a position corresponding to the target person information according to the time axis information.
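For illustration only: a sketch of the per-sentence flow of claim 4. `split_sentences`, `frames_between`, and `identify_speaker` are hypothetical helpers for sentence segmentation, extraction of the image data spanning a sentence, and speaker identification returning an on-screen position.

```python
def place_subtitles(cues, stream, split_sentences, frames_between, identify_speaker):
    placed = []
    for start, end, sentence in split_sentences(cues):
        frames = frames_between(stream, start, end)          # image data for this sentence
        person, (x, y) = identify_speaker(frames, sentence)  # target person information
        placed.append((start, end, sentence, x, y))          # render the line at the speaker
    return placed
```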
5. The method of claim 4, wherein the determining the target person information corresponding to the sentence according to the image data comprises:
performing face recognition on the image data to determine person information corresponding to the image data;
extracting voiceprint features from the audio track data corresponding to the sentence to obtain voiceprint information corresponding to the sentence;
and determining, from the person information corresponding to the image data, the person information that matches the voiceprint information as the target person information.
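For illustration only: a sketch of the matching step in claim 5. `detect_faces`, `extract_voiceprint`, and `similarity` are placeholders for a face-recognition model, a speaker-embedding (voiceprint) model, and an embedding comparison; `voiceprints` is assumed to map each detected person to a known voiceprint.

```python
def match_speaker(frames, sentence_audio, voiceprints,
                  detect_faces, extract_voiceprint, similarity):
    people = detect_faces(frames)               # person information from the image data
    probe = extract_voiceprint(sentence_audio)  # voiceprint information for the sentence
    # the on-screen person whose known voiceprint best matches the sentence
    return max(people, key=lambda p: similarity(probe, voiceprints[p]))
```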
6. The method of claim 1, further comprising:
storing progress information corresponding to the target audio track data to indicate that subtitles have been added to the target audio track data.
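For illustration only: one way to keep the progress information of claim 6, so that audio track data that has already been subtitled is skipped, for example after a seek or a reconnect.

```python
progress = {}  # track identifier -> timestamp up to which subtitles exist

def mark_subtitled(track_id, end_time):
    progress[track_id] = max(end_time, progress.get(track_id, 0.0))

def needs_subtitles(track_id, start_time):
    return start_time >= progress.get(track_id, 0.0)
```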
7. A video playing apparatus, comprising:
a caching module configured to cache streaming media data received in real time;
an acquisition module configured to acquire, from the cached streaming media data, target audio track data to which subtitles are to be added;
a parsing module configured to parse the target audio track data to obtain target text information corresponding to the target audio track data;
a processing module configured to align the target text information with the target audio track data on a time axis to obtain time axis information corresponding to the target text information;
and a playing module configured to play a video according to the streaming media data and to add subtitles to the played video based on the target text information and the time axis information.
8. The apparatus of claim 7, further comprising:
a receiving module configured to receive a language selection instruction from a user, the language selection instruction indicating a target language set by the user;
wherein the parsing module comprises:
a recognition sub-module configured to perform speech recognition on the target audio track data to obtain first text information corresponding to the target audio track data;
and a conversion sub-module configured to, in a case where a language of the first text information is different from the target language, perform language conversion on the first text information according to the target language to obtain the target text information.
9. A video playing apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
caching streaming media data received in real time;
acquiring, from the cached streaming media data, target audio track data to which subtitles are to be added;
analyzing the target audio track data to obtain target text information corresponding to the target audio track data;
aligning the target text information with the target audio track data on a time axis to obtain time axis information corresponding to the target text information;
and playing a video according to the streaming media data, and adding subtitles to the played video based on the target text information and the time axis information.
10. A computer-readable storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 6.
CN202010622064.0A 2020-06-30 2020-06-30 Video playing method and device and computer readable storage medium Pending CN111836062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010622064.0A CN111836062A (en) 2020-06-30 2020-06-30 Video playing method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010622064.0A CN111836062A (en) 2020-06-30 2020-06-30 Video playing method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111836062A true CN111836062A (en) 2020-10-27

Family

ID=72900989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010622064.0A Pending CN111836062A (en) 2020-06-30 2020-06-30 Video playing method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111836062A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101518055A (en) * 2006-09-21 2009-08-26 松下电器产业株式会社 Subtitle generation device, subtitle generation method, and subtitle generation program
EP2725816A1 (en) * 2009-10-27 2014-04-30 VerbaVoice GmbH A method and system for generating subtitles
CN106504754A (en) * 2016-09-29 2017-03-15 浙江大学 A kind of real-time method for generating captions according to audio output
CN108259971A (en) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 Subtitle adding method, device, server and storage medium
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
US10573312B1 (en) * 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
CN110767226A (en) * 2019-10-30 2020-02-07 山西见声科技有限公司 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112383809A (en) * 2020-11-03 2021-02-19 Tcl海外电子(惠州)有限公司 Subtitle display method, device and storage medium
CN112702659A (en) * 2020-12-24 2021-04-23 成都新希望金融信息有限公司 Video subtitle processing method and device, electronic equipment and readable storage medium
WO2022217944A1 (en) * 2021-04-14 2022-10-20 北京达佳互联信息技术有限公司 Method for binding subtitle with audio source, and apparatus
CN115484477A (en) * 2021-05-31 2022-12-16 上海哔哩哔哩科技有限公司 Subtitle generating method and device
CN114022668A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN114022668B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN115278351A (en) * 2022-05-17 2022-11-01 深圳传音控股股份有限公司 Data processing method, intelligent terminal and storage medium

Similar Documents

Publication Publication Date Title
CN109446876B (en) Sign language information processing method and device, electronic equipment and readable storage medium
CN111836062A (en) Video playing method and device and computer readable storage medium
US9786326B2 (en) Method and device of playing multimedia and medium
CN111107421B (en) Video processing method and device, terminal equipment and storage medium
CN110210310B (en) Video processing method and device for video processing
CN112069952B (en) Video clip extraction method, video clip extraction device and storage medium
CN105828101B (en) Generate the method and device of subtitle file
CN104469437A (en) Advertisement pushing method and device
CN104394265A (en) Automatic session method and device based on mobile intelligent terminal
CN113343675B (en) Subtitle generation method and device and subtitle generation device
CN107945806B (en) User identification method and device based on sound characteristics
CN110730360A (en) Video uploading and playing methods and devices, client equipment and storage medium
CN108538284A (en) Simultaneous interpretation result shows method and device, simultaneous interpreting method and device
CN111835739A (en) Video playing method and device and computer readable storage medium
CN111369978A (en) Data processing method and device and data processing device
CN112312039A (en) Audio and video information acquisition method, device, equipment and storage medium
CN107247794B (en) Topic guiding method in live broadcast, live broadcast device and terminal equipment
CN110019936A (en) A kind of annotation method and apparatus during playback of media files
CN113259754B (en) Video generation method, device, electronic equipment and storage medium
CN112988956B (en) Method and device for automatically generating dialogue, and method and device for detecting information recommendation effect
CN115022654B (en) Video editing method and device in live broadcast scene
CN114464186A (en) Keyword determination method and device
CN113569085B (en) Audio and video data display method, device, equipment, storage medium and program product
CN114022814A (en) Video processing method and apparatus, electronic device, and computer-readable storage medium
CN113409766A (en) Recognition method, device for recognition and voice synthesis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201027)