CN114242058A - Voice subtitle generating method, system, device, storage medium and electronic device - Google Patents

Voice subtitle generating method, system, device, storage medium and electronic device

Info

Publication number
CN114242058A
CN114242058A
Authority
CN
China
Prior art keywords
voice recognition
video data
identification information
sound source
audio
Prior art date
Legal status: Pending
Application number
CN202111585267.8A
Other languages
Chinese (zh)
Inventor
陈文琼
Current Assignee
Guangzhou Fanxing Huyu IT Co Ltd
Original Assignee
Guangzhou Fanxing Huyu IT Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Fanxing Huyu IT Co Ltd filed Critical Guangzhou Fanxing Huyu IT Co Ltd
Priority to CN202111585267.8A
Publication of CN114242058A


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Circuits (AREA)

Abstract

The application discloses a method, a system, a device, a storage medium and an electronic device for generating voice subtitles. The method comprises the following steps: acquiring audio and video data frames generated during a live broadcast; determining position information of at least one sounding body carried in the video data and sound source identification information associated with each sounding body; acquiring a plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result; and determining the video data corresponding to the plurality of voice recognition results and displaying each voice recognition result in a corresponding preset area of the video picture. The application thereby solves the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles.

Description

Voice subtitle generating method, system, device, storage medium and electronic device
Technical Field
The present application relates to the field of voice subtitle technology, and in particular, to a method, a system, an apparatus, a storage medium, and an electronic apparatus for generating a voice subtitle.
Background
In a live broadcast scene, text corresponding to the speech content must be obtained through voice subtitle technology so that the audience can accurately follow what the speakers are saying. Voice subtitle technology in the related art generally performs speech and semantic recognition on the voices of all speakers in the live broadcast scene and displays the recognition results together in the lower area of the live broadcast picture.
The voice subtitle technology in the related art has the following disadvantages. First, when the live broadcast involves multiple speakers, the recognized subtitles are not displayed per speaker, so the audience finds it difficult to tell which speech content belongs to which person. Second, hearing-impaired viewers who cannot tell different people's voices apart by sound have no way, in a multi-speaker scene, to distinguish each person's speech content from the subtitle information. In addition, the subtitle display mode is fixed and can hardly meet users' personalized subtitle display requirements.
No effective solution has yet been proposed for the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles.
Disclosure of Invention
The application provides a method, a system, a device, a storage medium and an electronic device for generating voice subtitles, so as to solve the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles.
According to an aspect of the present application, a method for generating voice subtitles is provided. The method comprises the following steps: acquiring audio and video data frames generated during a live broadcast, wherein each audio and video data frame comprises video data and multiple channels of audio data that are generated synchronously, and each channel of audio data is produced by one sounding body; determining position information of at least one sounding body carried in the video data and sound source identification information associated with each sounding body, wherein the position information is the position of the sounding body in the video picture; acquiring a plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result; and determining the video data corresponding to the plurality of voice recognition results and displaying each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
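To make the data relationships concrete, the following Python sketch models the entities named in these steps: an audio and video data frame containing video data and several audio channels, each tagged with sound source identification information, plus a recognition result type. All class and field names are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakerPosition:
    """Position information of one sounding body in the video picture."""
    source_id: str        # sound source identification information
    x: int                # bounding box of the sounding body (e.g. the face region)
    y: int
    width: int
    height: int

@dataclass
class AudioChannel:
    """One channel of audio data, produced by a single sounding body."""
    source_id: str        # sound source identification information
    timestamp_ms: int     # generation time of the audio data
    payload: bytes        # encoded audio samples

@dataclass
class VideoData:
    """Video data carrying the positions and source IDs of the sounding bodies."""
    timestamp_ms: int
    picture: bytes                                        # encoded video picture
    speaker_positions: List[SpeakerPosition] = field(default_factory=list)

@dataclass
class AVDataFrame:
    """Audio and video data frame: video data plus synchronously generated audio channels."""
    video: VideoData
    audio_channels: List[AudioChannel] = field(default_factory=list)

@dataclass
class RecognitionResult:
    """A voice recognition result with its associated source ID and timestamp."""
    source_id: str
    timestamp_ms: int
    text: str
```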
Optionally, the video data further carries a timestamp of the video data, and acquiring the plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result comprises: acquiring, from a voice recognition server, the plurality of voice recognition results corresponding to the multiple channels of audio data, the sound source identification information associated with each voice recognition result, and the timestamps of the multiple channels of audio data. Determining the video data corresponding to the plurality of voice recognition results comprises: acquiring the video data having the same timestamp as the multiple channels of audio data, and determining the acquired video data as the video data corresponding to the plurality of voice recognition results.
Optionally, acquiring the plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result comprises: determining the voice recognition results carried in the respective channels of audio data and the sound source identification information associated with each voice recognition result. Determining the video data corresponding to the plurality of voice recognition results comprises: determining the audio and video data frames corresponding to the multiple channels of audio data, and determining the video data in those audio and video data frames as the video data corresponding to the plurality of voice recognition results.
Optionally, displaying each voice recognition result in the corresponding preset area of the video picture comprises: determining the sound source identification information associated with each voice recognition result; determining the sounding body whose sound source identification information is the same as that of the voice recognition result, and acquiring the position information associated with that sounding body; determining, based on the acquired position information, a preset range around the position of the sounding body in the video picture; and determining the corresponding preset area within that preset range.
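A minimal sketch of how such a preset area could be derived, continuing the illustrative data structures above and assuming the position information is a face bounding box and the subtitle area is placed directly beneath it; the layout rule is an assumption, not a rule stated in the patent.

```python
def preset_area_for_result(result: "RecognitionResult", video: "VideoData"):
    """Return a display rectangle (x, y, w, h) near the sounding body whose
    sound source ID matches the recognition result, or None if no match."""
    for pos in video.speaker_positions:
        if pos.source_id == result.source_id:
            # Illustrative rule: the preset area is the strip just below the face,
            # with the same width and a fixed maximum height.
            return (pos.x, pos.y + pos.height, pos.width, min(60, pos.height // 2))
    return None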
Optionally, displaying each voice recognition result in the corresponding preset area of the video picture comprises: determining display attribute information of a text box and generating the text box in the video picture based on the display attribute information; and moving the text box to the preset area and adding the voice recognition result to the text box.
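The following sketch illustrates that text-box step under the same assumptions; the TextBox fields (shape, width, height) stand in for the display attribute information and are not an exhaustive list from the patent.

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    """Subtitle text box described by its display attribute information."""
    shape: str = "rectangle"
    width: int = 200
    height: int = 60
    x: int = 0
    y: int = 0
    text: str = ""

def show_result(result: "RecognitionResult", video: "VideoData", attrs: dict) -> TextBox:
    box = TextBox(**attrs)                        # generate the text box from display attributes
    area = preset_area_for_result(result, video)  # locate the preset area (sketch above)
    if area is not None:
        box.x, box.y = area[0], area[1]           # move the text box to the preset area
        box.text = result.text                    # add the voice recognition result
    return box
```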
According to another aspect of the present application, a system for generating voice subtitles is provided. The system comprises a first client, a voice recognition server and a second client. The first client is configured to collect the multiple channels of audio data generated by different sounding bodies during a live broadcast, send the multiple channels of audio data to the voice recognition server, collect the video data generated during the live broadcast, determine the position information of the sounding bodies in the video data and the sound source identification information associated with each sounding body, and generate audio and video data frames from the synchronously generated video data and multiple channels of audio data. The voice recognition server is configured to perform semantic recognition on each channel of audio data to obtain a plurality of voice recognition results, wherein each voice recognition result is associated with a timestamp and sound source identification information. The second client is configured to acquire the audio and video data frames from the first client, acquire the voice recognition results from the voice recognition server, acquire from the audio and video data frames the video data having the same timestamp as the voice recognition results, and display each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
Optionally, the first client stores image feature data of a plurality of target sounding bodies and the sound source identification information associated with each of them, and is further configured to identify image feature data of a current sounding body from the collected video data, acquire the position of the current sounding body, and compare the image feature data of the current sounding body with the image feature data of the plurality of target sounding bodies to obtain the sound source identification information associated with the current sounding body.
Optionally, the first client is further configured to receive audio and video data recorded by the plurality of target sounding bodies before the live broadcast, and to determine, from the recorded audio and video data, the image feature data of each target sounding body and the sound source identification information associated with it.
Optionally, the second client is further configured to: receive a setting instruction for the display attribute information of the text box and generate the text box in the video picture based on the display attribute information; receive a moving instruction for the text box and move the text box within the preset range of the position of the target current sounding body; and receive a text adding instruction and add a target voice recognition result to the text box, wherein the sound source identification information of the target voice recognition result is the same as that of the target current sounding body.
According to another aspect of the present application, another system for generating voice subtitles is provided. The system comprises a first client and a second client. The first client is configured to collect the multiple channels of audio data generated by different sounding bodies during a live broadcast, perform semantic recognition on each channel of audio data to obtain a plurality of voice recognition results and the sound source identification information associated with each voice recognition result, collect the video data generated during the live broadcast, determine the position information of the sounding bodies in the video data and the sound source identification information associated with each sounding body, and generate audio and video data frames from the synchronously generated video data and multiple channels of audio data. The second client is configured to acquire the audio and video data frames and the voice recognition results corresponding to them from the first client, and to display each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
Optionally, the first client stores image feature data of a plurality of target sounding bodies and the sound source identification information associated with each of them, and is further configured to identify image feature data of a current sounding body from the collected video data, acquire the position of the current sounding body, and compare the image feature data of the current sounding body with the image feature data of the plurality of target sounding bodies to obtain the sound source identification information associated with the current sounding body.
Optionally, the first client is further configured to receive audio and video data recorded by the plurality of target sounding bodies before the live broadcast, and to determine, from the recorded audio and video data, the image feature data of each target sounding body and the sound source identification information associated with it.
According to another aspect of the present application, a voice subtitle generating apparatus is provided. The apparatus comprises: a first acquisition unit configured to acquire audio and video data frames generated during a live broadcast, wherein each audio and video data frame comprises video data and multiple channels of audio data that are generated synchronously, and each channel of audio data is produced by one sounding body; a first determining unit configured to determine the position information of at least one sounding body carried in the video data and the sound source identification information associated with each sounding body, wherein the position information is the position of the sounding body in the video picture; a second acquisition unit configured to acquire a plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result; and a second determining unit configured to determine the video data corresponding to the plurality of voice recognition results and display each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
According to another aspect of the embodiments of the present invention, a non-volatile storage medium is further provided, comprising a stored program, wherein, when the program runs, it controls the device on which the non-volatile storage medium resides to execute a method for generating voice subtitles.
According to another aspect of the embodiments of the present invention, an electronic device is further provided, comprising a processor and a memory, wherein the memory stores computer-readable instructions and the processor is configured to run the computer-readable instructions, which, when run, execute a method for generating voice subtitles.
Through the application, the following steps are adopted: acquiring audio and video data frames generated during a live broadcast, wherein each audio and video data frame comprises video data and multiple channels of audio data that are generated synchronously, and each channel of audio data is produced by one sounding body; determining position information of at least one sounding body carried in the video data and sound source identification information associated with each sounding body, wherein the position information is the position of the sounding body in the video picture; acquiring a plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result; and determining the video data corresponding to the plurality of voice recognition results and displaying each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result. This solves the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles. Furthermore, when a plurality of sounding bodies are present in a live broadcast scene, each voice recognition result is displayed in the area of the video picture near its sounding body, which improves the accuracy with which the content expressed by the different sounding bodies can be followed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
fig. 1 is a schematic diagram of a system for generating a voice caption according to an embodiment of the present application;
fig. 2 is a first schematic diagram showing a voice subtitle according to an embodiment of the present application;
fig. 3 is a second schematic diagram showing a voice subtitle according to an embodiment of the present application;
fig. 4 is a third schematic diagram showing a voice subtitle according to an embodiment of the present application;
fig. 5 is a fourth schematic diagram showing a voice subtitle according to an embodiment of the present application;
fig. 6 is a fifth schematic diagram showing a voice subtitle according to an embodiment of the present application;
fig. 7 is a flowchart of a method for generating a voice subtitle according to an embodiment of the present application;
fig. 8 is a schematic diagram of another system for generating a voice caption according to an embodiment of the present application;
fig. 9 is a schematic diagram of another method for generating a voice subtitle according to an embodiment of the present application;
fig. 10 is a schematic diagram of a voice subtitle generating apparatus according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present application, there is provided a system for generating a voice subtitle.
Fig. 1 is a schematic diagram of a system for generating a voice subtitle according to an embodiment of the present application. As shown in fig. 1, the system includes:
The first client 101 is configured to collect the multiple channels of audio data generated by different sounding bodies during a live broadcast, send the multiple channels of audio data to the voice recognition server 103, collect the video data generated during the live broadcast, determine the position information of the sounding bodies in the video data and the sound source identification information associated with each sounding body, and generate audio and video data frames from the synchronously generated video data and multiple channels of audio data.
Specifically, the first client 101 is a stream pushing client. During a live broadcast, the stream pushing client collects the audio data and video data generated in the live broadcast room and combines the synchronously generated video data and multiple channels of audio data into audio and video data frames. The resulting frames form an audio and video code stream, which the stream pushing client pushes to the viewer end.
The voice recognition server 103 is configured to perform semantic recognition on each channel of audio data to obtain a plurality of voice recognition results, where each voice recognition result is associated with a timestamp and sound source identification information.
It should be noted that a sounding body in the live broadcast may be a broadcaster. In a multi-person live broadcast scene within the same live broadcast room, the audio data generated during the broadcast consists of multiple channels: each broadcaster captures audio with an independent device, so the several broadcasters together produce multiple channels of audio data, and each channel is associated with sound source identification information, which may be a sound source sequence ID.
After the multiple channels of audio are collected, in order for a live viewer at the second client 102 to understand the meaning of the audio data intuitively, the first client 101 further sends the collected channels of audio data and the sound source identification information associated with each channel to the voice recognition server 103. Each channel of audio data also carries a timestamp, which indicates the generation time of the audio data.
Correspondingly, after receiving the multiple channels of audio data, the voice recognition server 103 performs voice recognition (including semantic recognition) on each channel and sends the voice recognition result, the sound source identification information associated with it, and the timestamp to the second client 102.
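A minimal server-side sketch of that flow, reusing the illustrative types introduced earlier; the `recognize` callable stands in for whatever speech and semantic recognition engine is used, which the patent does not specify.

```python
from typing import Callable, Iterable

def process_audio_channels(
    channels: Iterable["AudioChannel"],
    recognize: Callable[[bytes], str],
    send_to_viewer: Callable[["RecognitionResult"], None],
) -> None:
    """Recognize each audio channel independently and forward the result together
    with its sound source identification information and timestamp."""
    for channel in channels:
        text = recognize(channel.payload)           # speech (and semantic) recognition
        send_to_viewer(RecognitionResult(
            source_id=channel.source_id,            # keep the link to the sounding body
            timestamp_ms=channel.timestamp_ms,      # keep the generation time for matching
            text=text,
        ))
```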
It should be further noted that, in order to display each voice recognition result near the position of the sounding body that produced the corresponding audio data, the first client 101 must, after collecting the video data, recognize where that sounding body appears in the video picture. Specifically, the first client 101 determines in real time the position information and sound source identification information of each broadcaster in the video picture and stores them in the corresponding video data.
The second client 102 is configured to obtain the audio and video data frames from the first client 101, obtain the voice recognition results from the voice recognition server 103, obtain from the audio and video data frames the video data having the same timestamp as the voice recognition results, and display each voice recognition result in a corresponding preset area of the video picture, where the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
Specifically, the second client 102 is a stream pulling client, which may also be called the viewer end. The stream pulling client obtains the audio and video code stream from the stream pushing end; the code stream consists of multiple audio and video data frames, and the stream pulling client decodes each frame to obtain the audio data and video data and plays them.
It should be noted that, in order for a live viewer to understand intuitively what each broadcaster in the video picture is saying, the second client 102 obtains the voice recognition results from the voice recognition server 103 while playing the audio data and video data, determines which recognition result corresponds to which audio data by comparing timestamps, and displays each recognition result in the video picture corresponding to that audio data.
Specifically, the sound source identification information carried in the video data is used to find the voice recognition content with the same sound source identification information, so that each voice recognition result is displayed in the area of the video picture near the corresponding broadcaster.
In the system for generating voice subtitles provided by this embodiment of the application, the first client 101 collects the multiple channels of audio data generated by different sounding bodies during the live broadcast, sends them to the voice recognition server 103, collects the video data generated during the live broadcast, determines the position information of the sounding bodies in the video data and the sound source identification information associated with each sounding body, and generates audio and video data frames from the synchronously generated video data and multiple channels of audio data. The voice recognition server 103 performs semantic recognition on each channel of audio data to obtain a plurality of voice recognition results, each associated with a timestamp and sound source identification information. The second client 102 acquires the audio and video data frames from the first client 101, acquires the voice recognition results from the voice recognition server 103, acquires from the audio and video data frames the video data having the same timestamp as the voice recognition results, and displays each voice recognition result in a corresponding preset area of the video picture, the preset area being determined by the position information of the sounding body corresponding to the sound source identification information associated with that result. This solves the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles. Furthermore, each voice recognition result is displayed in the area of the video picture near its sounding body, which improves the accuracy with which the content expressed by the different sounding bodies can be followed.
To determine the position and sound source identification information of each broadcaster in the video picture, optionally, in the system for generating voice subtitles provided by this embodiment, the first client 101 stores image feature data of a plurality of target sounding bodies and the sound source identification information associated with each of them. The first client 101 is further configured to identify image feature data of a current sounding body from the collected video data, acquire the position of the current sounding body, and compare the image feature data of the current sounding body with the image feature data of the plurality of target sounding bodies to obtain the sound source identification information associated with the current sounding body.
Specifically, a target sounding body may be a broadcaster who is known, before the broadcast starts, to be going live in the current live broadcast room, and the image feature data may be facial feature point data of that broadcaster. Facial feature point data of a plurality of broadcasters, together with the sound source identification information associated with each broadcaster, is cached in the first client 101 in advance. During the live broadcast, the first client 101 identifies the facial feature point data and face position information of the current broadcaster from the collected video data, compares the facial feature point data with the pre-cached facial feature point data of the several broadcasters, and takes the sound source identification information of the matching broadcaster as the sound source identification information of the current broadcaster, thereby obtaining both the face position information and the sound source identification information of the current broadcaster.
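A minimal sketch of that comparison step, assuming the cached image feature data are numeric facial feature vectors and using a simple Euclidean distance with a threshold; the feature representation, distance metric and threshold are assumptions, since the patent does not prescribe a particular face recognition model.

```python
from math import sqrt
from typing import Dict, Optional, Sequence

def match_source_id(
    current_features: Sequence[float],
    cached_features: Dict[str, Sequence[float]],
    threshold: float = 0.6,
) -> Optional[str]:
    """Compare the current sounding body's facial feature vector with the cached
    vectors of the target sounding bodies and return the sound source ID of the
    closest match, or None if nothing is close enough."""
    best_id, best_dist = None, float("inf")
    for source_id, features in cached_features.items():
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(current_features, features)))
        if dist < best_dist:
            best_id, best_dist = source_id, dist
    return best_id if best_dist <= threshold else None
```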
Optionally, in the system for generating voice subtitles provided by this embodiment, the first client 101 is further configured to receive, before the live broadcast, audio and video data recorded by the plurality of target sounding bodies, and to determine, from the recorded audio and video data, the image feature data of each target sounding body and the sound source identification information associated with it.
Specifically, before the live broadcast, each broadcaster records a short audio and video clip at the first client 101. The first client 101 assigns a sound source serial number to the current recording and uses it as the sound source identification information, caches the image of the broadcaster in the video corresponding to the current sound source, and analyzes that image to obtain image feature data. For example, the broadcaster's portrait may be cached and face recognition performed on it to obtain facial feature point data; the sound source serial number and the broadcaster's facial feature point data are then cached in the first client 101.
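The following sketch illustrates that enrollment step under the same assumptions; `extract_face_features` is a placeholder for the face recognition routine, and the `source-N` identifier format is invented for the example.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

def enroll_broadcasters(
    recordings: Iterable[Tuple[bytes, bytes]],            # (portrait image, audio clip) pairs
    extract_face_features: Callable[[bytes], Sequence[float]],
) -> Dict[str, Sequence[float]]:
    """Assign each broadcaster a sound source serial number before the live broadcast
    and cache their facial feature point data keyed by that identifier."""
    cached_features: Dict[str, Sequence[float]] = {}
    for serial, (portrait, _audio) in enumerate(recordings, start=1):
        source_id = f"source-{serial}"                     # sound source serial number as ID
        cached_features[source_id] = extract_face_features(portrait)
    return cached_features
```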
To display the voice recognition results flexibly, optionally, in the system for generating voice subtitles provided by this embodiment, the second client 102 is further configured to: receive a setting instruction for the display attribute information of the text box and generate the text box in the video picture based on the display attribute information; receive a moving instruction for the text box and move the text box within the preset range of the position of the target current sounding body; and receive a text adding instruction and add a target voice recognition result to the text box, wherein the sound source identification information of the target voice recognition result is the same as that of the target current sounding body.
Specifically, the presentation of each broadcaster's voice subtitle can be customized at the second client 102. This includes adjusting the display attribute information of the subtitle, such as the shape, width and height of the subtitle frame. In an optional embodiment, as shown in fig. 2, the dashed frame marks the preset range around the position of the current sounding body; the shape of the subtitle frame can be chosen from the shapes offered by the second client 102, for example a circle as shown in fig. 3 or a square as shown in fig. 4. The subtitle frame also supports resizing, such as lengthening and widening; as shown in fig. 5, its size can be changed by dragging.
Customizing the presentation of the voice subtitles also includes adjusting the display position of the subtitles. The relative position of a subtitle frame around a broadcaster's face area can be set freely at the second client 102; as shown in fig. 6, the subtitle frames of the different broadcasters in the picture are displayed near their respective faces. The customized display position of each subtitle frame is cached locally at the second client 102 and associated with the sound source identification information of the corresponding broadcaster, so that when an audio and video data frame sent by the first client is received, the voice recognition result of the audio data is shown in that broadcaster's subtitle frame.
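As a sketch of how the second client could apply those instructions and cache the custom layout, continuing the illustrative `TextBox` and result types from earlier; the clamping rule and the in-memory caches are assumptions made for the example.

```python
from typing import Dict, Tuple

class SubtitleBoxController:
    """Viewer-side sketch: per-broadcaster subtitle frames whose shape, size and
    position are user-adjustable and cached locally, keyed by sound source ID."""

    def __init__(self) -> None:
        self.boxes: Dict[str, TextBox] = {}
        self.saved_positions: Dict[str, Tuple[int, int]] = {}   # cached custom positions

    def set_display_attributes(self, source_id: str, shape: str, width: int, height: int) -> None:
        self.boxes[source_id] = TextBox(shape=shape, width=width, height=height)

    def move_box(self, source_id: str, x: int, y: int, preset_range: Tuple[int, int, int, int]) -> None:
        """Move the box, clamped to the preset range around the target sounding body."""
        rx, ry, rw, rh = preset_range
        box = self.boxes[source_id]
        box.x = min(max(x, rx), rx + rw)
        box.y = min(max(y, ry), ry + rh)
        self.saved_positions[source_id] = (box.x, box.y)

    def add_text(self, result: "RecognitionResult") -> None:
        """Add a recognition result to the box with the same sound source ID."""
        if result.source_id in self.boxes:
            self.boxes[result.source_id].text = result.text
```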
With this embodiment, in a multi-person live broadcast scene the real-time voice subtitle of each broadcaster closely follows that broadcaster's face, so each person's speech content is easier to follow; hearing-impaired viewers can also understand what each broadcaster is saying in a multi-person scene; and the viewer end can customize the position and size of each broadcaster's subtitle frame in the picture, making the subtitle presentation more flexible.
Example 2
According to an embodiment of the present application, there is provided a method of generating a voice subtitle.
Fig. 7 is a flowchart of a method for generating voice subtitles according to an embodiment of the present application. As shown in fig. 7, the method is executed at the stream pulling end (the second client) of the voice subtitle generating system of Embodiment 1 and comprises the following steps:
Step S702: audio and video data frames generated during a live broadcast are acquired, wherein each audio and video data frame comprises video data and multiple channels of audio data that are generated synchronously, and each channel of audio data is produced by one sounding body.
Specifically, the stream pulling end acquires the audio and video data frames of the audio and video code stream generated during the live broadcast from the stream pushing end. During the broadcast, the stream pushing client collects the audio data and video data generated in the live broadcast room, combines the synchronously generated video data and multiple channels of audio data into audio and video data frames, and assembles those frames into the audio and video code stream.
It should be noted that a sounding body in the live broadcast may be a broadcaster. In a multi-person live broadcast scene within the same live broadcast room, the audio data generated during the broadcast consists of multiple channels: each broadcaster captures audio with an independent device, so the several broadcasters together produce multiple channels of audio data.
Step S704: position information of at least one sounding body carried in the video data, and the sound source identification information associated with each sounding body, are determined, where the position information is the position of the sounding body in the video picture.
It should be noted that the position information of the sounding bodies and the sound source identification information associated with each sounding body are determined at the stream pushing end: after collecting the video data, the stream pushing end determines in real time the position information and sound source identification information of each broadcaster in the video picture and stores them in the corresponding video data.
Therefore, the stream pulling end can read the position information of the sounding bodies and the sound source identification information associated with each sounding body from the frame data structure of the video data, which provides the data basis for displaying each voice recognition result near the position of the sounding body that produced the corresponding audio data.
Step S706: a plurality of voice recognition results corresponding to the multiple channels of audio data, and the sound source identification information associated with each voice recognition result, are acquired.
Specifically, the stream pulling end may obtain the plurality of voice recognition results corresponding to the multiple channels of audio data, and the sound source identification information associated with each voice recognition result, from the voice recognition server; the multiple channels of audio data and their sound source identification information were previously sent to the voice recognition server by the stream pushing end and processed there into the corresponding voice recognition results.
In an optional embodiment, the voice recognition may instead be performed locally at the stream pushing end, with the recognition results encapsulated in the audio and video frame data; after the stream pulling end acquires the frame data, it decodes it and obtains the voice recognition result corresponding to the audio data.
Step S708: the video data corresponding to the plurality of voice recognition results is determined, and each voice recognition result is displayed in a corresponding preset area of the video picture, where the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
Specifically, the corresponding preset area may be an area near the sounding body corresponding to the voice recognition result; displaying each voice recognition result in its corresponding preset area of the video picture achieves per-speaker subtitle display in a multi-person live broadcast scene.
The method for generating voice subtitles comprises: acquiring audio and video data frames generated during a live broadcast, wherein each audio and video data frame comprises video data and multiple channels of audio data that are generated synchronously, and each channel of audio data is produced by one sounding body; determining position information of at least one sounding body carried in the video data and the sound source identification information associated with each sounding body, wherein the position information is the position of the sounding body in the video picture; acquiring a plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result; and determining the video data corresponding to the plurality of voice recognition results and displaying each voice recognition result in a corresponding preset area of the video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that result. This solves the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles. Furthermore, each voice recognition result is displayed in the area of the video picture near its sounding body, which improves the accuracy with which the content expressed by the different sounding bodies can be followed.
Optionally, in the method for generating voice subtitles provided by this embodiment, displaying each voice recognition result in the corresponding preset area of the video picture comprises: determining the sound source identification information associated with each voice recognition result; determining the sounding body whose sound source identification information is the same as that of the voice recognition result, and acquiring the position information associated with that sounding body; determining, based on the acquired position information, a preset range around the position of the sounding body in the video picture; and determining the corresponding preset area within that preset range.
Specifically, both the voice recognition result and the position information of the sounding body are associated with sound source identification information. That shared identification information links each voice recognition result to the position information of its sounding body, so the result is displayed within the preset range of the sounding body's position. In this way the voice recognition result is shown around the sounding body, and even when the video picture contains several broadcasters, the subtitle information corresponding to the audio data produced by each broadcaster can be obtained accurately.
Optionally, in the method for generating voice subtitles provided by this embodiment, the video data further carries a timestamp of the video data, and acquiring the plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result comprises: acquiring, from the voice recognition server, the plurality of voice recognition results corresponding to the multiple channels of audio data, the sound source identification information associated with each voice recognition result, and the timestamps of the multiple channels of audio data. Determining the video data corresponding to the plurality of voice recognition results comprises: acquiring the video data having the same timestamp as the multiple channels of audio data and determining the acquired video data as the video data corresponding to the plurality of voice recognition results.
Specifically, each voice recognition result produced by the voice recognition server carries a timestamp, and the video data coming from the stream pushing end also carries a timestamp, so the video data corresponding to the plurality of voice recognition results can be determined by matching timestamps.
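A minimal sketch of that timestamp matching at the stream pulling end, reusing the illustrative types from earlier; exact timestamp equality is assumed here for simplicity, whereas a real player would likely need a tolerance window.

```python
from typing import Dict, Iterable, Tuple

def match_results_to_video(
    results: Iterable["RecognitionResult"],
    buffered_frames: Iterable["AVDataFrame"],
) -> Dict[str, Tuple["RecognitionResult", "VideoData"]]:
    """Pair each recognition result with the video data that has the same timestamp."""
    by_timestamp = {frame.video.timestamp_ms: frame.video for frame in buffered_frames}
    matched: Dict[str, Tuple["RecognitionResult", "VideoData"]] = {}
    for result in results:
        video = by_timestamp.get(result.timestamp_ms)
        if video is not None:
            matched[result.source_id] = (result, video)   # ready to render in the preset area
    return matched
```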
It should be noted that, because audio data consumes relatively little bandwidth, the audio data and the voice recognition results can be exchanged quickly with the voice recognition server; moreover, processing the multiple channels of audio data into voice recognition results on the server avoids having the speech processing consume the resources that the stream pulling end needs for handling the video data.
Optionally, in the method for generating voice subtitles provided by this embodiment, acquiring the plurality of voice recognition results corresponding to the multiple channels of audio data and the sound source identification information associated with each voice recognition result comprises: determining the voice recognition results carried in the respective channels of audio data and the sound source identification information associated with each voice recognition result. Determining the video data corresponding to the plurality of voice recognition results comprises: determining the audio and video data frames corresponding to the multiple channels of audio data and determining the video data in those audio and video data frames as the video data corresponding to the plurality of voice recognition results.
Specifically, the multiple channels of audio data can instead be processed into the corresponding voice recognition results at the stream pushing end and the results encapsulated in the audio and video frame data. The stream pulling end then obtains the voice recognition results together with the corresponding video data as soon as it receives the frame data; compared with determining the voice recognition results through the voice recognition server, no timestamp comparison is needed to associate the voice recognition results with the video data when displaying the subtitles.
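A sketch of that variant, extending the illustrative frame type from earlier with an embedded list of recognition results; the extra field and the rendering callback are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class AVDataFrameWithResults(AVDataFrame):
    """Frame variant in which the stream pushing end has already embedded the
    voice recognition results, so the pulling end needs no timestamp comparison."""
    recognition_results: List["RecognitionResult"] = field(default_factory=list)

def display_from_frame(
    frame: AVDataFrameWithResults,
    show: Callable[[str, Tuple[int, int, int, int]], None],
) -> None:
    """Render each embedded result in the preset area of its sounding body."""
    for result in frame.recognition_results:
        area = preset_area_for_result(result, frame.video)   # sketch defined earlier
        if area is not None:
            show(result.text, area)
```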
Optionally, in the method for generating voice subtitles provided by this embodiment, displaying each voice recognition result in the corresponding preset area of the video picture comprises: determining the display attribute information of a text box and generating the text box in the video picture based on the display attribute information; and moving the text box to the preset area and adding the voice recognition result to the text box.
Specifically, the presentation of each broadcaster's voice subtitle can be customized at the stream pulling end. This includes adjusting the display attribute information of the subtitle, such as the shape, width and height of the display frame, as well as adjusting the display position of the subtitle: the relative position of each subtitle display frame can be set freely around the broadcaster's face area, so that the subtitle frames of the different broadcasters in the picture are displayed near their respective faces.
Example 3
According to an embodiment of the present application, another system for generating a voice subtitle is provided.
Fig. 8 is a schematic diagram of a system for generating a voice subtitle according to an embodiment of the present application. As shown in fig. 8, the system includes:
the first client 101 is configured to collect multiple paths of audio data generated by different sound generators in a live broadcast process, perform semantic recognition on the multiple paths of audio data respectively to obtain multiple voice recognition results and sound source identification information associated with the voice recognition results, collect video data generated in the live broadcast process, determine position information of the sound generators in the video data and sound source identification information associated with each sound generator, and generate an audio/video data frame according to the video data and the multiple paths of audio data generated synchronously.
Specifically, the first client 101 is a stream pushing client. During a live broadcast, it collects the audio data and video data generated in the live broadcast room and combines the synchronously generated video data and multiple channels of audio data into audio and video data frames. The resulting frames form an audio and video code stream, which the first client 101 pushes to the viewer end.
It should be noted that a sounding body in the live broadcast may be a broadcaster. In a multi-person live broadcast scene within the same live broadcast room, the audio data generated during the broadcast consists of multiple channels: each broadcaster captures audio with an independent device, so the several broadcasters together produce multiple channels of audio data, and each channel is associated with sound source identification information, which may be a sound source sequence ID.
After the multiple channels of audio are collected, in order for a live viewer at the second client 102 to understand the meaning of the audio data intuitively, the first client 101 processes the collected channels of audio data into voice recognition results itself. To display each voice recognition result near the position of the sounding body that produced the corresponding audio data, the first client 101 must, after collecting the video data, recognize where that sounding body appears in the video picture; specifically, it determines in real time the position and sound source identification information of each broadcaster in the video picture and stores them in the corresponding video data.
The second client 102 is configured to acquire the audio and video data frames, and the voice recognition results corresponding to them, from the first client 101, and to display each voice recognition result in a corresponding preset area of the video picture, where the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with that voice recognition result.
Specifically, the second client 102 is a stream pulling client, which obtains the audio and video code stream from the stream pushing end. The code stream consists of multiple audio and video data frames; the second client 102 decodes each frame to obtain the audio data and video data and plays them.
Specifically, decoding a data frame yields the audio data and video data, the voice recognition results with their sound source identification information, and the broadcasters' position information with its sound source identification information. The position information of the broadcaster corresponding to each piece of recognized content is determined through the sound source identification information, and each voice recognition result is then displayed in the area of the video picture near the corresponding broadcaster.
In the system for generating voice subtitles provided by this embodiment of the application, the first client 101 collects the multiple channels of audio data generated by different sounding bodies during the live broadcast, performs semantic recognition on each channel to obtain a plurality of voice recognition results and the sound source identification information associated with each result, collects the video data generated during the live broadcast, determines the position information of the sounding bodies in the video data and the sound source identification information associated with each sounding body, and generates audio and video data frames from the synchronously generated video data and multiple channels of audio data. The second client 102, acting as the stream pulling client, obtains the audio and video code stream from the stream pushing end, decodes each audio and video data frame to obtain the audio data and video data, plays them, and displays each voice recognition result in the corresponding preset area of the video picture. This solves the problem in the related art that, when a plurality of sounding bodies are present in a live broadcast scene, the content expressed by the different sounding bodies is difficult to distinguish from the voice subtitles. Furthermore, each voice recognition result is displayed in the area of the video picture near its sounding body, which improves the accuracy with which the content expressed by the different sounding bodies can be followed.
To determine the position and sound source identification information of each broadcaster in the video picture, optionally, in the system for generating voice subtitles provided by this embodiment, the first client 101 stores image feature data of a plurality of target sounding bodies and the sound source identification information associated with each of them. The first client 101 is further configured to identify image feature data of a current sounding body from the collected video data, acquire the position of the current sounding body, and compare the image feature data of the current sounding body with the image feature data of the plurality of target sounding bodies to obtain the sound source identification information associated with the current sounding body.
Specifically, the target sounding body may be a live broadcaster who is determined, before the live broadcast, to appear in the current live broadcast room, and the image feature data may be the facial feature point data of that live broadcaster. The facial feature point data of a plurality of live broadcasters and the sound source identification information associated with each of them are cached in the first client 101 in advance. During the live broadcast, the first client 101 identifies the facial feature point data and the facial position data of the current live broadcaster from the collected video data, compares the facial feature point data with the pre-cached facial feature point data of the plurality of live broadcasters, and determines the sound source identification information of the matched live broadcaster as the sound source identification information of the current live broadcaster, thereby obtaining both the facial position information and the sound source identification information of the current live broadcaster.
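As a non-authoritative illustration of this matching step, the following Python sketch compares a facial feature vector extracted from the current video frame against the pre-cached feature vectors of the target broadcasters; the feature representation, the distance metric and the threshold are assumptions for illustration only and are not part of the disclosed implementation.

    import numpy as np

    def match_sound_source(current_face_feature, cached_features, threshold=0.6):
        """Compare the current broadcaster's facial feature vector with the pre-cached
        feature vectors of the target broadcasters and return the sound source
        identification information of the closest match, or None if nothing is close enough."""
        best_id, best_dist = None, float("inf")
        for source_id, cached in cached_features.items():
            dist = float(np.linalg.norm(current_face_feature - cached))  # Euclidean distance as a stand-in metric
            if dist < best_dist:
                best_id, best_dist = source_id, dist
        return best_id if best_dist < threshold else None

    # Example with made-up 128-dimensional feature vectors.
    cache = {"source_01": np.random.rand(128), "source_02": np.random.rand(128)}
    current = cache["source_01"] + 0.01          # a face very close to broadcaster 1
    print(match_sound_source(current, cache, threshold=1.0))  # -> "source_01"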
Optionally, in the system for generating a voice subtitle provided in the embodiment of the present application, the first client 101 is further configured to receive, before live broadcast, audio and video data recorded by a plurality of target sound generators, and determine, based on the recorded audio and video data, image feature data of each target sound generator and sound source identification information associated with the target sound generator, respectively.
Specifically, before the live broadcast, each live broadcaster records a short audio/video clip at the first client 101. The first client 101 marks the serial number of the sound source currently being recorded and uses this serial number as the sound source identification information, caches an image of the live broadcaster in the video corresponding to the current sound source, and performs recognition on that image to obtain image feature data. For example, the live broadcaster's head portrait may be cached and face recognition performed on it to obtain facial feature point data, after which the sound source serial number and the live broadcaster's facial feature point data are cached in the first client 101.
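A minimal sketch of this pre-broadcast enrollment, assuming a generic face-feature extractor (`extract_face_features` is a placeholder, not an interface named in the application):

    import itertools

    _source_counter = itertools.count(1)
    source_cache = {}  # sound source serial number -> cached facial feature point data

    def enroll_broadcaster(recorded_frames, extract_face_features):
        """Assign the next sound source serial number to a broadcaster recorded before the
        live broadcast and cache their facial feature points for matching during the show."""
        source_id = f"source_{next(_source_counter):02d}"   # serial number used as sound source identification info
        head_portrait = recorded_frames[0]                  # e.g. cache the broadcaster's head portrait
        source_cache[source_id] = extract_face_features(head_portrait)
        return source_id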
In order to flexibly display the voice recognition result, optionally, in the system for generating a voice subtitle provided in the embodiment of the present application, the second client 102 is further configured to: receiving a setting instruction of display attribute information of the text box, and generating the text box in the video picture based on the display attribute information; receiving a moving instruction aiming at the text box, and controlling the text box to move to a preset range of the position of the target current sounding body; and receiving a text adding instruction, and adding a target voice recognition result into the text box, wherein the sound source recognition identifier of the target voice recognition result is the same as the sound source recognition identifier of the target current sounding body.
Specifically, the presentation manner of each live broadcaster's voice subtitle can be custom-adjusted at the second client 102. The adjustment covers the display attribute information of the subtitle, such as the shape, width and height of the display frame. In an optional implementation, the shape of the subtitle frame can be chosen from the selectable shapes offered by the second client 102: a circle may be selected as shown in Fig. 3, or an ellipse as shown in Fig. 4. The subtitle frame also supports resizing, such as stretching and widening; as shown in Fig. 5, its size can be changed by dragging.
The adjustment of the presentation manner of the voice subtitles also covers the display position of the subtitles. The relative position of the subtitle display frame around a live broadcaster's face area can be customized at the second client 102, so that the subtitle display frames of different broadcasters in the picture sit at different positions relative to their faces, as shown in Fig. 5. The customized position is cached locally at the second client 102 and associated with the sound source identification information of each live broadcaster; when an audio/video data frame sent by the first client is received, the voice recognition result of the audio data is displayed in the subtitle frame of the corresponding live broadcaster.
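As a rough, hedged illustration of how such viewer-side customization could be kept per broadcaster, the sketch below stores a display frame (shape, size and offset relative to the face) keyed by sound source identification information and draws each incoming recognition result into the matching box; all field names and the drawing callback are assumptions.

    from dataclasses import dataclass

    @dataclass
    class SubtitleBox:
        shape: str    # e.g. "circle" or "ellipse", chosen from the client's selectable shapes
        width: int
        height: int
        dx: int       # offset of the box relative to the broadcaster's face position
        dy: int

    # User-defined box per sound source ID, cached locally at the viewer end (layout is hypothetical).
    box_settings = {"source_01": SubtitleBox("ellipse", 220, 80, 40, -60)}

    def render_subtitle(source_id, text, face_x, face_y, draw_text_box):
        """Draw the recognition result in the viewer-customized box near the matching face."""
        box = box_settings.get(source_id, SubtitleBox("rect", 200, 60, 0, -40))
        draw_text_box(text, face_x + box.dx, face_y + box.dy, box.width, box.height, box.shape)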
With this embodiment, in a multi-person live broadcast scene the real-time voice subtitle of each broadcaster closely follows that broadcaster's face, so everyone's speech content is easier to follow, and hearing-impaired viewers can also understand what each broadcaster says. Moreover, the viewer end can customize the position and size of each broadcaster's subtitle display frame in the picture, making subtitle presentation more flexible.
Example 4
According to an embodiment of the present application, another method for generating a voice subtitle is provided.
Fig. 9 is a flowchart of a method for generating a voice subtitle according to an embodiment of the present application. As shown in Fig. 9, the method is executed by the second client in the voice subtitle generating system in embodiment 3, and includes the following steps:
step S902, acquiring audio and video data frames generated in the live broadcast process, wherein each audio and video data frame comprises video data and multiple paths of audio data which are synchronously generated, and each path of audio data is sent out by a sound-producing body.
Specifically, the stream pulling end acquires the audio/video data frames of the audio/video code stream generated during the live broadcast from the stream pushing end. The stream pushing client collects the audio data and video data generated in the live broadcast room during the live broadcast, generates audio/video data frames from the synchronously generated video data and multi-channel audio data, and the multiple audio/video data frames form the audio/video code stream.
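The following sketch shows one plausible way the push end could bundle synchronously generated video data and multi-channel audio into a single audio/video data frame; the field names and the container layout are assumptions for illustration only, not the format defined by the application.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class AVFrame:
        timestamp_ms: int
        video: bytes                                   # encoded video data for this frame
        audio_channels: Dict[str, bytes]               # sound source ID -> that broadcaster's audio data
        positions: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # sound source ID -> face position
        recognition: Dict[str, str] = field(default_factory=dict)            # sound source ID -> recognition result

    def build_av_frame(timestamp_ms, video, audio_by_source):
        """Bundle the video data and the synchronously captured multi-channel audio into one frame."""
        return AVFrame(timestamp_ms=timestamp_ms, video=video, audio_channels=dict(audio_by_source))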
It should be noted that a sounding body in the live broadcast process may be a live broadcaster. In a multi-person live broadcast scene in the same live broadcast room, the audio data generated during the live broadcast is multi-channel audio data: each live broadcaster inputs audio data through an independent device, so a plurality of live broadcasters produce multiple channels of audio data.
Step S904, determining the position information of at least one sounding body carried in the video data and the sound source identification information associated with each sounding body, and determining a plurality of voice recognition results corresponding to the multi-channel audio data and the sound source identification information associated with each voice recognition result, where the position information is the position information of the sounding body in the video picture.
It should be noted that the position information of the sounding bodies and the sound source identification information associated with each sounding body are determined at the stream pushing end. Specifically, after collecting video data, the stream pushing end determines in real time the position and the sound source identification information of each live broadcaster in the video picture and stores them in the corresponding video data; after collecting audio data, the stream pushing end processes the audio data to obtain the voice recognition results and the associated sound source identification information.
Accordingly, the stream pulling end obtains, from the frame data structure of the audio/video data, the position information of the sounding bodies, the sound source identification information associated with each sounding body, the voice recognition results corresponding to the audio data and the sound source identification information associated with each voice recognition result, which provides the data basis for displaying each voice recognition result near the position of the sounding body that produced the corresponding audio data.
Step S906, determining the position information of the sounding body corresponding to each voice recognition result through the sound source identification information, and displaying each voice recognition result in a preset range of the position of the sounding body in the video picture.
Specifically, both the voice recognition results and the position information of the sounding bodies are associated with sound source identification information, so the correspondence between a voice recognition result and the position of a sounding body is established through that sound source identification information. Each voice recognition result is then displayed within the preset range of the position of its sounding body, so that the recognition result appears around the sounding body; when the video picture contains a plurality of live broadcasters, the subtitle information corresponding to the audio data generated by each live broadcaster can thus be obtained accurately.
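A minimal sketch of step S906 under the hypothetical data layout assumed in the earlier frame sketch: each recognition result is joined to a face position through the shared sound source ID, and the preset range is taken as a small region near that position (the offset is illustrative, not prescribed by the application).

    def place_subtitles(frame):
        """Yield (text, x, y) placements putting each recognition result near its sounding body."""
        for source_id, text in frame.recognition.items():
            position = frame.positions.get(source_id)
            if position is None:
                continue                        # no face found for this sound source in the current picture
            face_x, face_y = position
            yield text, face_x, face_y - 50     # display just above the face, i.e. within the preset range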
Furthermore, the presentation manner of each live broadcaster's voice subtitle can be custom-adjusted at the stream pulling end. This adjustment covers the display attribute information of the subtitle, such as the shape, width and height of the display frame, as well as the display position of the subtitle: the relative position of the subtitle display frame around a live broadcaster's face area can be customized so that the subtitle display frames of different broadcasters in the picture sit at different positions relative to their faces.
The method for generating voice subtitles acquires the audio/video data frames generated in a live broadcast process, where each audio/video data frame includes synchronously generated video data and multi-channel audio data and each channel of audio data is produced by one sounding body; determines the position information of at least one sounding body carried in the video data and the sound source identification information associated with each sounding body, and determines a plurality of voice recognition results corresponding to the multi-channel audio data and the sound source identification information associated with each voice recognition result, where the position information is the position of the sounding body in the video picture; and determines, through the sound source identification information, the position information of the sounding body corresponding to each voice recognition result and displays each voice recognition result within the preset range of that position in the video picture. This solves the problem in the related art that, when a plurality of sounding bodies exist in a live broadcast scene, it is difficult to distinguish the content expressed by different sounding bodies through voice subtitles. Furthermore, when a plurality of sounding bodies exist in a live broadcast scene, each voice recognition result is displayed in the area near its sounding body in the video picture, which improves the accuracy with which the content expressed by different sounding bodies is understood.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides a device for generating voice subtitles. It should be noted that the device for generating voice subtitles according to the embodiment of the present application may be used to execute the method for generating voice subtitles according to the embodiment of the present application. The device for generating voice subtitles according to the embodiment of the present application is described below.
Fig. 10 is a schematic diagram of a voice subtitle generating apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus includes: a first acquisition unit 1001, a first determination unit 1002, a second acquisition unit 1003, and a second determination unit 1004.
Specifically, the first obtaining unit 1001 is configured to obtain audio and video data frames generated in a live broadcast process, where each of the audio and video data frames includes video data and multiple paths of audio data that are generated synchronously, and each of the multiple paths of audio data is sent by a sound generator;
a first determining unit 1002, configured to determine position information of at least one sounding body carried in the video data, and sound source identification information associated with each sounding body, where the position information is position information of the sounding body in a video picture;
a second obtaining unit 1003, configured to obtain multiple voice recognition results corresponding to multiple paths of audio data and sound source identification information associated with each voice recognition result;
the second determining unit 1004 is configured to determine video data corresponding to a plurality of voice recognition results, and display each voice recognition result in a corresponding preset area in the video image, where the corresponding preset area is determined by position information of a sound generating body corresponding to the sound source identification information associated with the voice recognition result.
In the device for generating voice subtitles provided by this embodiment of the present application, the first acquiring unit 1001 acquires the audio/video data frames generated in a live broadcast process, where each audio/video data frame includes synchronously generated video data and multi-channel audio data and each channel of audio data is produced by one sounding body; the first determining unit 1002 determines the position information of at least one sounding body carried in the video data and the sound source identification information associated with each sounding body, where the position information is the position of the sounding body in the video picture; the second acquiring unit 1003 acquires a plurality of voice recognition results corresponding to the multi-channel audio data and the sound source identification information associated with each voice recognition result; and the second determining unit 1004 determines the video data corresponding to the plurality of voice recognition results and displays each voice recognition result in a corresponding preset area in the video picture, where the corresponding preset area is determined by the position information of the sounding body that corresponds to the sound source identification information associated with the voice recognition result. This solves the problem in the related art that, when a plurality of sounding bodies exist in a live broadcast scene, it is difficult to distinguish the content expressed by different sounding bodies through voice subtitles; each voice recognition result is displayed in the area near its sounding body in the video picture, which improves the accuracy with which the content expressed by different sounding bodies is understood.
Optionally, in the apparatus for generating a voice subtitle provided in this embodiment of the present application, the video data further carries a timestamp of the video data, and the second obtaining unit 1003 includes: the first acquisition module is used for acquiring a plurality of voice recognition results corresponding to the multi-channel audio data, sound source identification information associated with each voice recognition result and timestamps of the multi-channel audio data from the voice recognition server; the second determination unit 1004 includes: and the first determining module is used for acquiring video data with the same time stamp as the multi-channel audio data and determining the acquired video data as video data corresponding to a plurality of voice recognition results.
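Since these modules correlate the audio and video by timestamp, a hedged sketch of that matching step follows; the buffer structure and the tolerance value are assumptions for illustration, not details given in the application.

    def find_matching_video(video_buffer, audio_timestamp_ms, tolerance_ms=20):
        """Return the buffered video data whose timestamp matches that of the multi-channel
        audio, i.e. the video data corresponding to the voice recognition results."""
        for video in video_buffer:
            if abs(video.timestamp_ms - audio_timestamp_ms) <= tolerance_ms:
                return video
        return None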
Optionally, in the apparatus for generating a voice subtitle provided in the embodiment of the present application, the second obtaining unit 1003 includes: the second determining module is used for determining the voice recognition results carried in the multi-channel audio data and the sound source identification information associated with each voice recognition result; the second determination unit 1004 includes: and the third determining module is used for determining audio and video data frames corresponding to the multi-channel audio data and determining the video data in the audio and video data frames as the video data corresponding to the plurality of voice recognition results.
Optionally, the second determining unit 1004 includes: the fourth determining module is used for determining sound source identification information associated with each voice recognition result; the fifth determining module is used for determining the sounding body with the same sound source identification information as the voice recognition result and acquiring the position information associated with the sounding body; the sixth determining module is used for determining the preset range of the position of the sounding body in the video picture based on the acquired position information; and the corresponding module is used for determining the corresponding preset area from the preset range.
Optionally, the second determining unit 1004 includes: a seventh determining module, configured to determine display attribute information of the text box, and generate the text box in the video picture based on the display attribute information; and the adding module is used for moving the text box to a preset area and adding the voice recognition result to the text box.
The apparatus for generating voice subtitles includes a processor and a memory, wherein the first acquiring unit 1001, the first determining unit 1002, the second acquiring unit 1003, the second determining unit 1004, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting the kernel parameters, the problem in the related art that it is difficult to distinguish, through voice subtitles, the content expressed by different sounding bodies when a plurality of sounding bodies exist in a live broadcast scene is addressed.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the application also provides a nonvolatile storage medium, wherein the nonvolatile storage medium comprises a stored program, and the program controls the equipment where the nonvolatile storage medium is located to execute a voice subtitle generating method when running.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory; the memory stores computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a method for generating voice subtitles. The electronic device herein may be a server, a PC, a PAD, a mobile phone, etc.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (15)

1. A method for generating a voice subtitle, comprising:
acquiring audio and video data frames generated in a live broadcast process, wherein each audio and video data frame comprises video data and multiple paths of audio data which are synchronously generated, and each path of audio data is sent out by a sound-producing body;
determining position information of at least one sounding body carried in the video data and sound source identification information associated with each sounding body, wherein the position information is position information of the sounding body in a video picture;
acquiring a plurality of voice recognition results corresponding to the multi-channel audio data and sound source identification information associated with each voice recognition result;
and determining video data corresponding to the plurality of voice recognition results, and displaying each voice recognition result in a corresponding preset area in a video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with the voice recognition result.
2. The method according to claim 1, wherein the video data further carries a timestamp of the video data, and the obtaining a plurality of voice recognition results corresponding to the plurality of channels of audio data, and the sound source identification information associated with each voice recognition result comprises: acquiring a plurality of voice recognition results corresponding to the multi-channel audio data, sound source identification information associated with each voice recognition result and time stamps of the multi-channel audio data from a voice recognition server;
determining video data corresponding to the plurality of speech recognition results comprises: and acquiring video data with the same time stamp as the multi-channel audio data, and determining the acquired video data as the video data corresponding to the voice recognition results.
3. The method of claim 1, wherein obtaining a plurality of speech recognition results corresponding to the plurality of channels of audio data, and the sound source identification information associated with each of the speech recognition results comprises: determining voice recognition results carried in the multi-channel audio data respectively and sound source identification information associated with each voice recognition result;
determining video data corresponding to the plurality of speech recognition results comprises: and determining audio and video data frames corresponding to the multi-channel audio data, and determining video data in the audio and video data frames as video data corresponding to the plurality of voice recognition results.
4. The method of claim 1, wherein displaying each of the speech recognition results in a corresponding predetermined area of a video frame comprises:
determining sound source identification information associated with each voice recognition result;
determining a sounding body with the same sound source identification information as the voice recognition result, and acquiring position information associated with the sounding body;
determining a preset range of the position of the sounding body in the video picture based on the acquired position information;
and determining the corresponding preset area from the preset range.
5. The method of claim 1, wherein displaying each of the speech recognition results in a corresponding predetermined area of a video frame comprises:
determining display attribute information of a text box, and generating the text box in the video picture based on the display attribute information;
and moving the text box to the preset area, and adding the voice recognition result to the text box.
6. A system for generating a voice subtitle, comprising:
the system comprises a first client, a voice recognition server and a second client, wherein the first client is used for collecting multi-channel audio data generated by different sound generators in a live broadcast process, sending the multi-channel audio data to the voice recognition server, collecting video data generated in the live broadcast process, determining position information of the sound generators in the video data and sound source identification information associated with each sound generator, and generating audio and video data frames according to the video data and the multi-channel audio data which are synchronously generated;
the voice recognition server is used for performing semantic recognition on the multi-channel audio data respectively to obtain a plurality of voice recognition results, wherein each voice recognition result is associated with a time stamp and voice source identification information;
and the second client is used for acquiring the audio and video data frames from the first client, acquiring voice recognition results from the voice recognition server, acquiring video data with the same time stamp as the voice recognition results from the audio and video data frames, and displaying each voice recognition result in a corresponding preset area in a video picture, wherein the corresponding preset area is determined by the position information of a sounding body corresponding to the sound source identification information associated with the voice recognition results.
7. The system according to claim 6, wherein the first client stores image feature data of a plurality of target sound generators and sound source identification information associated with each sound generator, and the first client is further configured to identify image feature data of a current sound generator from the collected video data, obtain a position of the current sound generator, and compare the image feature data of the current sound generator with the image feature data of the plurality of target sound generators to obtain the sound source identification information associated with the current sound generator.
8. The system according to claim 7, wherein the first client is further configured to receive audio and video data entered by the plurality of target sound generators before live broadcast, and determine image feature data of each target sound generator and sound source identification information associated with the target sound generator based on the entered audio and video data.
9. The system of claim 6, wherein the second client is further configured to:
receiving a setting instruction of display attribute information of a text box, and generating the text box in the video picture based on the display attribute information;
receiving a moving instruction aiming at the text box, and controlling the text box to move to a preset range of the position of the target current sounding body;
and receiving a text adding instruction, and adding a target voice recognition result into the text box, wherein the sound source recognition identifier of the target voice recognition result is the same as the sound source recognition identifier of the target current sounding body.
10. A system for generating a voice subtitle, comprising:
the system comprises a first client, a second client and a third client, wherein the first client is used for collecting multi-channel audio data generated by different sound producing bodies in a live broadcasting process, performing semantic recognition on the multi-channel audio data respectively to obtain a plurality of voice recognition results and sound source identification information associated with the voice recognition results, collecting video data generated in the live broadcasting process, determining position information of the sound producing bodies in the video data and sound source identification information associated with each sound producing body, and generating audio and video data frames according to the video data and the multi-channel audio data which are synchronously generated;
and the second client is used for acquiring the audio and video data frames and the voice recognition results corresponding to the audio and video data frames from the first client, and displaying each voice recognition result in a corresponding preset area in a video picture, wherein the corresponding preset area is determined by the position information of the sounding body corresponding to the sound source identification information associated with the voice recognition results.
11. The system according to claim 10, wherein the first client stores image feature data of a plurality of target sound generators and sound source identification information associated with each sound generator, and the first client is further configured to identify image feature data of a current sound generator from the collected video data, obtain a position of the current sound generator, and compare the image feature data of the current sound generator with the image feature data of the plurality of target sound generators to obtain sound source identification information associated with the current sound generator.
12. The system according to claim 11, wherein the first client is further configured to receive audio and video data entered by the plurality of target sound generators before live broadcast, and determine image feature data of each of the target sound generators and sound source identification information associated with the target sound generator based on the entered audio and video data.
13. An apparatus for generating a voice subtitle, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring audio and video data frames generated in a live broadcast process, each audio and video data frame comprises video data and multiple paths of audio data which are synchronously generated, and each path of audio data is sent out by a sound generator;
the first determining unit is used for determining position information of at least one sounding body carried in the video data and sound source identification information associated with each sounding body, wherein the position information is the position information of the sounding body in a video picture;
the second acquisition unit is used for acquiring a plurality of voice recognition results corresponding to the multi-channel audio data and sound source identification information associated with each voice recognition result;
and the second determining unit is used for determining video data corresponding to the plurality of voice recognition results and displaying each voice recognition result in a corresponding preset area in a video picture, wherein the corresponding preset area is determined by the position information of the sound generating body corresponding to the sound source identification information associated with the voice recognition results.
14. A non-volatile storage medium, comprising a stored program, wherein the program when executed controls a device in which the non-volatile storage medium is located to perform the method of generating voice subtitles of any one of claims 1 to 5.
15. An electronic device comprising a processor and a memory, the memory storing computer readable instructions, the processor being configured to execute the computer readable instructions, wherein the computer readable instructions are configured to execute the method for generating a voice caption according to any one of claims 1 to 5.
CN202111585267.8A 2021-12-22 2021-12-22 Voice subtitle generating method, system, device, storage medium and electronic device Pending CN114242058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111585267.8A CN114242058A (en) 2021-12-22 2021-12-22 Voice subtitle generating method, system, device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111585267.8A CN114242058A (en) 2021-12-22 2021-12-22 Voice subtitle generating method, system, device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114242058A true CN114242058A (en) 2022-03-25

Family

ID=80761690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111585267.8A Pending CN114242058A (en) 2021-12-22 2021-12-22 Voice subtitle generating method, system, device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114242058A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150660A (en) * 2022-06-09 2022-10-04 深圳市大头兄弟科技有限公司 Video editing method based on subtitles and related equipment
CN115150660B (en) * 2022-06-09 2024-05-10 深圳市闪剪智能科技有限公司 Video editing method based on subtitles and related equipment
CN115002502A (en) * 2022-07-29 2022-09-02 广州市千钧网络科技有限公司 Data processing method and server
CN115002502B (en) * 2022-07-29 2023-01-03 广州市千钧网络科技有限公司 Data processing method and server

Similar Documents

Publication Publication Date Title
JP6984596B2 (en) Audiovisual processing equipment and methods, as well as programs
CN103460128B (en) Dubbed by the multilingual cinesync of smart phone and audio frequency watermark
CN114242058A (en) Voice subtitle generating method, system, device, storage medium and electronic device
CN105070304B (en) Realize method and device, the electronic equipment of multi-object audio recording
CN111460219B (en) Video processing method and device and short video platform
EP3659344B1 (en) Calibration system for audience response capture and analysis of media content
KR102551081B1 (en) Method and apparatus for efficient delivery and usage of audio messages for high quality of experience
US10820131B1 (en) Method and system for creating binaural immersive audio for an audiovisual content
Shirley et al. Personalized object-based audio for hearing impaired TV viewers
US20170169827A1 (en) Multimodal speech recognition for real-time video audio-based display indicia application
KR20060027826A (en) Video processing apparatus, ic circuit for video processing apparatus, video processing method, and video processing program
KR20160081043A (en) Method, server and system for controlling play speed of video
JP2004056286A (en) Image display method
KR20180133894A (en) Synchronizing auxiliary data for content containing audio
CN114449252A (en) Method, device, equipment, system and medium for dynamically adjusting live video based on explication audio
CN113542626B (en) Video dubbing method and device, computer equipment and storage medium
KR20200013658A (en) Temporary Placement of Rappering Events
JP2001143451A (en) Automatic index generating device and automatic index applying device
CN110992984B (en) Audio processing method and device and storage medium
US10885893B2 (en) Textual display of aural information broadcast via frequency modulated signals
CN113630620A (en) Multimedia file playing system, related method, device and equipment
EP3662470B1 (en) Audio object classification based on location metadata
CN116233411A (en) Method, device, equipment and computer storage medium for audio and video synchronous test
JP2016144192A (en) Upsurge notification system
EP3321795B1 (en) A method and associated apparatuses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination