WO2019000721A1 - Video file recording method, audio file recording method, and mobile terminal - Google Patents


Info

Publication number
WO2019000721A1
WO2019000721A1 (PCT/CN2017/107014, CN2017107014W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
environment
audio information
mobile terminal
Prior art date
Application number
PCT/CN2017/107014
Other languages
English (en)
French (fr)
Inventor
张雨田
Original Assignee
联想(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 联想(北京)有限公司
Publication of WO2019000721A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/76: Television signal recording
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/43074: Synchronising the rendering of additional data with content streams on the same device, e.g. of EPG data or interactive icon with a TV program
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431: Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N 21/4312: Generation of visual interfaces involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/488: Data services, e.g. news ticker
    • H04N 21/4884: Data services for displaying subtitles

Definitions

  • The present application relates to the field of multimedia technologies, and in particular to a video file recording method, an audio file recording method, and a mobile terminal.
  • Audio and video are usually provided with corresponding subtitles, so that users with hearing impairments, or users in noisy environments, can clearly follow the played content through the subtitles.
  • audio or video is usually produced first, and corresponding subtitles are produced later.
  • the current method of producing subtitles for audio or video is relatively simple.
  • the purpose of the present application is to provide a video file recording method applied to a mobile terminal, so as to more quickly create a video file configured with subtitles.
  • the present application also provides an audio file recording method applied to a mobile terminal, so as to make an audio file configured with subtitles more quickly.
  • the application provides a video file recording method for a mobile terminal, including:
  • image information is obtained by a camera of the mobile terminal, and audio information is obtained by a microphone of the mobile terminal;
  • an image stream composed of the image information, an audio stream composed of the audio information, and a subtitle stream composed of the subtitle information are synthesized into a first video file, so that when the first video file is played, the image stream, the audio stream, and the subtitle stream are output synchronously.
  • the performing real-time processing on the audio information based on the speech recognition engine includes: determining a current recording environment based on parameter information of the audio information; if the current recording environment is the first environment, synchronously converting the current audio information into subtitle information; and if the current recording environment is the second environment, suspending the operation of synchronously converting the audio information into subtitle information until a result indicating that the current recording environment is the first environment is obtained.
  • the first environment includes an environment in which at least one user is performing language output
  • the second environment includes an environment in which only background sound exists.
  • determining the current recording environment based on the parameter information of the audio information includes: determining a signal-to-noise ratio of the current audio information; if the signal-to-noise ratio of the current audio information is greater than a threshold, determining that the current recording environment is the first environment; and if the signal-to-noise ratio of the current audio information is less than the threshold, determining that the current recording environment is the second environment.
  • the mobile terminal includes a microphone array, where the microphone array includes a plurality of microphones at different installation positions, wherein at least one microphone is disposed on the side where the camera is located, and at least one microphone is disposed on at least one other side of the mobile terminal;
  • the obtaining audio information through the microphone of the mobile terminal includes: obtaining audio information of a target user through the microphone array, wherein the target user is a user who can be captured by the camera of the mobile terminal and displayed on the display screen of the mobile terminal.
  • the application provides a mobile terminal, including an input interface, a camera, a microphone, and a processor;
  • the input interface is configured to collect an input instruction
  • the processor is configured to: enter a video recording mode in response to a first instruction indicating that recording of a video is to be started; in the video recording mode, obtain image information through a camera of the mobile terminal and obtain audio information through a microphone of the mobile terminal; call a speech recognition engine and perform real-time processing on the audio information based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information; exit the video recording mode in response to a second instruction indicating that recording of the video is to end; and synthesize the image stream composed of the image information, the audio stream composed of the audio information, and the subtitle stream composed of the subtitle information obtained in the video recording mode into a first video file, so that when the first video file is played, the image stream, the audio stream, and the subtitle stream are output synchronously.
  • in the aspect of performing real-time processing on the audio information based on the speech recognition engine, the processor is configured to: determine the current recording environment based on parameter information of the audio information; if the current recording environment is the first environment, synchronously convert the current audio information into subtitle information; and if the current recording environment is the second environment, suspend the synchronous conversion until a result indicating that the current recording environment is the first environment is obtained.
  • the processor configures the first environment to include an environment in which at least one user is performing language output, and configures the second environment to include an environment in which only background sound exists.
  • in the aspect of determining the current recording environment based on the parameter information of the audio information, the processor is configured to:
  • determine the signal-to-noise ratio of the current audio information; if the signal-to-noise ratio of the current audio information is greater than a threshold, determine that the current recording environment is the first environment; and if the signal-to-noise ratio of the current audio information is less than the threshold, determine that the current recording environment is the second environment.
  • the mobile terminal includes a microphone array, where the microphone array includes a plurality of microphones at different installation positions, wherein at least one microphone is disposed on the side where the camera is located, and at least one microphone is disposed on at least one other side of the mobile terminal; the mobile terminal further includes a display screen;
  • the processor is configured to obtain audio information of a target user through the microphone array, wherein the target user is a user who can be captured by the camera of the mobile terminal and displayed on the display screen of the mobile terminal.
  • the present application provides a method for recording an audio file of a mobile terminal, including:
  • audio information is obtained by a microphone of the mobile terminal
  • an audio stream composed of the audio information and a subtitle stream composed of the subtitle information are synthesized into a first audio file, so that when the first audio file is played, the audio stream and the subtitle stream are output synchronously.
  • the application provides a mobile terminal, including an input interface, a microphone, and a processor;
  • the input interface is configured to collect an input instruction
  • the processor is configured to: enter an audio recording mode in response to a first instruction indicating that recording of audio is to be started; obtain audio information through a microphone of the mobile terminal in the audio recording mode; call a speech recognition engine and process the audio information in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information; exit the audio recording mode in response to a second instruction indicating that recording of the audio is to end; and synthesize the audio stream composed of the audio information and the subtitle stream composed of the subtitle information obtained in the audio recording mode into a first audio file, so that when the first audio file is played, the audio stream and the subtitle stream are output synchronously.
  • According to the video file recording method of the mobile terminal disclosed in the present application, when the mobile terminal is in the video recording mode, it obtains image information through the camera and audio information through the microphone, invokes the speech recognition engine, and processes the obtained audio information in real time based on the speech recognition engine, so as to generate subtitle information synchronously based on the audio information.
  • After the mobile terminal exits the video recording mode, the image stream formed by the image information, the audio stream formed by the audio information, and the subtitle stream formed by the subtitle information obtained during the video recording process are synthesized to obtain the first video file.
  • In other words, the mobile terminal processes the audio information in real time through the speech recognition engine while recording the video, thereby generating subtitle information synchronously based on the audio information; after exiting the video recording mode, a video file can be generated directly from the audio stream, the image stream, and the subtitle stream, so a video file configured with subtitles is produced quickly.
  • FIG. 1 is a flowchart of a video file recording method of a mobile terminal according to the present disclosure
  • FIG. 2 is a flowchart of real-time processing of audio information based on a speech recognition engine disclosed in the present application
  • FIG. 3 is a schematic diagram of a video recording scene disclosed in the present application.
  • FIG. 4 is a structural diagram of a mobile terminal disclosed in the present application.
  • FIG. 5 is a structural diagram of another mobile terminal disclosed in the present application.
  • FIG. 6 is a flowchart of a method for recording an audio file of a mobile terminal according to the present disclosure
  • FIG. 7 is a structural diagram of another mobile terminal disclosed in the present application.
  • In the process of recording audio or video, the video file recording method, the audio file recording method, and the corresponding mobile terminal of the present application synchronously generate corresponding subtitle information by recognizing the audio information, thereby making an audio file or a video file configured with subtitles more quickly.
  • the mobile terminal in the present application may be a mobile phone, a tablet computer, or other terminal having an audio recording function and a video recording function.
  • the techniques of this disclosure may be implemented in the form of hardware and/or software (including firmware, microcode, etc.). Additionally, the techniques of this disclosure may take the form of a computer program product on a computer readable medium storing instructions for use by or in connection with an instruction execution system.
  • a computer readable medium can be any medium that can contain, store, communicate, propagate or transport the instructions.
  • a computer readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium.
  • Examples of the computer readable medium include: a magnetic storage device such as a magnetic tape or a hard disk (HDD); an optical storage device such as a compact disc (CD-ROM); a memory such as a random access memory (RAM) or a flash memory; and/or a wired/wireless communication link.
  • FIG. 1 is a flowchart of a video file recording method of a mobile terminal according to the present disclosure. The method includes:
  • Step S11: Obtain a first instruction indicating to start recording a video.
  • Step S12: Enter the video recording mode in response to the first instruction.
  • The first instruction may be generated by pressing a physical button of the mobile terminal, by pressing a virtual button displayed by the mobile terminal, or by collecting the user's voice input through a voice collection module and recognizing that voice input to generate the trigger instruction.
  • the mobile terminal enters the video recording mode in response to the obtained first instruction.
  • Step S13: In the video recording mode, image information is obtained through a camera of the mobile terminal, and audio information is obtained through a microphone of the mobile terminal.
  • The audio information obtained through the microphone of the mobile terminal may be the audio information of the current recording environment as collected by the microphone, or audio information obtained by processing the collected audio, for example audio information that has undergone noise reduction, or audio information generated by a particular object and extracted from the audio collected by the microphone.
  • Step S14: Call the speech recognition engine and process the audio information in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information.
  • The mobile terminal invokes the speech recognition engine and processes the audio information in real time while the microphone is collecting it, obtaining the corresponding subtitle information; that is, the subtitle information is generated synchronously based on the audio information.
  • In the embodiments of the present disclosure, synchronously generating the subtitle information based on the audio information may include generating the subtitle information by processing the audio information while it is being received; that is, the action of generating the subtitle information proceeds in step with the action of receiving the audio information.
  • the embodiment of the present disclosure does not limit the time at which the subtitle information is generated to be completely synchronized with the audio information. For example, since the audio information needs to be processed in real time, the time at which the subtitle information is generated may be slightly later than the time at which the corresponding audio information is received.
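The timing relationship described above can be sketched in code. This is an illustrative sketch with hypothetical names, not part of the patent: each subtitle cue is stamped with the interval during which its audio was captured, so the small recognition delay never shifts the cue relative to the audio stream.

```python
def make_cue(capture_start, capture_end, text, recognized_at):
    """Build a subtitle cue from one recognized audio segment.

    capture_start / capture_end: seconds into the recording when the
    audio was received; recognized_at: when recognition finished, which
    is at or slightly after capture_end. The cue keeps the capture
    times, so recognition latency does not desynchronize the subtitles.
    """
    assert recognized_at >= capture_end  # recognition can only lag capture
    return {"start": capture_start, "end": capture_end, "text": text}

# Recognition finishes 0.3 s after the audio segment ends, but the cue
# still spans exactly the segment's capture interval.
cue = make_cue(3.0, 5.2, "hello everyone", recognized_at=5.5)
```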
  • Step S15: Obtain a second instruction indicating that the recording of the video is to end.
  • Step S16: Exit the video recording mode in response to the second instruction.
  • The second instruction may be generated by pressing a physical button of the mobile terminal, by pressing a virtual button displayed by the mobile terminal, or by collecting the user's voice input through a voice collection module and recognizing that voice input to generate the trigger instruction.
  • the mobile terminal exits the video recording mode in response to the obtained second instruction, that is, ends the recording of the video.
  • Step S17: Synthesize the image stream composed of the image information, the audio stream composed of the audio information, and the subtitle stream composed of the subtitle information obtained in the video recording mode into the first video file, so that when the first video file is played, the image stream, the audio stream, and the subtitle stream are output synchronously.
  • That is, the three streams are synthesized into one video file (denoted the first video file), and when the first video file is played, the audio stream, the image stream, and the subtitle stream it contains are output synchronously.
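One common, container-friendly way to carry such a subtitle stream is a timed-text track such as SubRip (SRT), which players render in sync with the audio and image streams. The sketch below is illustrative only; the patent does not specify a subtitle format:

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def cues_to_srt(cues):
    """Serialize (start, end, text) cues, with times in seconds, into SRT."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

srt = cues_to_srt([(3.0, 5.2, "hello everyone")])
```

A muxer can then combine the image stream, the audio stream, and this subtitle track into one file whose player outputs all three synchronously.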
  • According to the video file recording method of the mobile terminal disclosed in the present application, when the mobile terminal is in the video recording mode, it obtains image information through the camera and audio information through the microphone, invokes the speech recognition engine, and processes the obtained audio information in real time based on the speech recognition engine, so as to generate subtitle information synchronously based on the audio information.
  • After the mobile terminal exits the video recording mode, the image stream formed by the image information, the audio stream formed by the audio information, and the subtitle stream formed by the subtitle information obtained during the video recording process are synthesized to obtain the first video file.
  • In other words, the mobile terminal processes the audio information in real time through the speech recognition engine while recording the video, thereby generating subtitle information synchronously based on the audio information; after exiting the video recording mode, a video file can be generated directly from the audio stream, the image stream, and the subtitle stream, so a video file configured with subtitles is produced quickly.
  • In some embodiments, the audio information is processed in real time based on the speech recognition engine in the manner shown in FIG. 2, which specifically includes:
  • Step S21: Determine the current recording environment based on the parameter information of the audio information.
  • Users may record video in different environments, and in some circumstances there is no need to generate subtitle information. For example, if no one is talking in the current recording environment, no subtitle information is needed; likewise, if the current recording environment contains noisy voices but the current subject is not speaking, no subtitle information is needed. In addition, in certain circumstances it is difficult for a speech recognition engine to accurately generate caption information based on the audio information.
  • the speech recognition engine determines whether the current recording environment is the first environment or the second environment according to the parameter information of the audio information, to determine whether the audio information is synchronously converted into subtitle information by the speech recognition engine.
  • the first environment can be viewed as an environment in which a valid voice signal is present, and the second environment is considered to be an environment in which no valid voice signal is present.
  • A valid voice signal is a voice signal that satisfies a predetermined requirement; for example, a voice signal generated by a specific user may be treated as a valid voice signal, or a voice signal whose volume reaches a volume threshold may be treated as a valid voice signal.
  • Step S22: If the current recording environment is determined to be the first environment, synchronously convert the current audio information into subtitle information.
  • Step S23: If the current recording environment is determined to be the second environment, suspend the operation of synchronously converting the audio information into subtitle information until a result indicating that the current recording environment is the first environment is obtained.
  • If the current recording environment is the first environment, the current audio information is processed in real time by the speech recognition engine and synchronously converted into caption information. If the current recording environment is the second environment, real-time processing of the current audio information by the speech recognition engine is temporarily suspended until a result indicating that the current recording environment is the first environment is obtained, at which point the speech recognition engine resumes real-time processing of the audio information.
  • a blank corresponding to a time period in which the real-time processing of the audio information by the speech recognition engine is suspended may be inserted in the subtitle stream.
  • For example, if real-time processing of the audio information by the speech recognition engine is suspended during the period from the 10th minute to the 12th minute, a corresponding blank is inserted in the subtitle stream from the 10th minute to the 12th minute.
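This blank can be realized simply by emitting no cues for the suspended interval. A sketch, assuming cues are (start, end, text) tuples and suspensions are (start, end) pairs in seconds (names illustrative):

```python
def drop_suspended(cues, suspensions):
    """Drop subtitle cues that fall entirely inside a suspended interval,
    leaving a blank stretch in the subtitle stream that the user can
    fill in or edit later."""
    def inside(start, end):
        return any(s <= start and end <= e for s, e in suspensions)
    return [c for c in cues if not inside(c[0], c[1])]

# The engine was suspended from the 10th to the 12th minute (600-720 s),
# so the cue recognized in that window is omitted from the subtitle stream.
kept = drop_suspended(
    [(30.0, 33.0, "intro"), (610.0, 615.0, "background chatter")],
    [(600.0, 720.0)],
)
```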
  • the user can edit and modify the subtitle information in the time period in the video file.
  • In summary, in the video recording mode the mobile terminal obtains image information through the camera and audio information through the microphone, and determines the current recording environment based on the parameter information of the audio information. If the current recording environment is the first environment, the current audio information is synchronously converted into subtitle information by the speech recognition engine; if the current recording environment is the second environment, the synchronous conversion is suspended until the recording environment changes back to the first environment. After the mobile terminal exits the video recording mode, the image stream, the audio stream, and the subtitle stream generated during the video recording process are combined into the first video file.
  • It can be seen that, according to the method shown in FIG. 2, suspending the conversion of audio information into subtitle information in the second environment reduces the data processing load of the speech recognition engine on the one hand, and on the other hand avoids generating inaccurate subtitle information from audio that contains no valid voice signal.
  • In some embodiments, the first environment is configured as an environment in which at least one user is performing speech output,
  • the second environment is configured as an environment in which only the background sound exists.
  • the user's speech output means that the user is talking.
  • In one implementation, determining the current recording environment based on the parameter information of the audio information in step S21 includes:
  • The audio information obtained by the microphone is analyzed to determine whether it includes voice information. If it does not, it is determined that no user in the current recording environment is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information, it is further determined whether that voice information is generated by speaking or by singing (or drama). If it is generated by singing (or drama), it is determined that no user is performing speech output and the current recording environment is the second environment; if it is generated by speaking, the current recording environment is the first environment.
  • That is, if the current recording environment has no voice signal, it is determined to be the second environment; if it has a voice signal but that signal is produced by a singing (or drama) process, it is likewise determined to be the second environment.
  • In another implementation, determining the current recording environment based on the parameter information of the audio information in step S21 includes:
  • The audio information obtained by the microphone is analyzed to determine whether it includes voice information. If it does not, it is determined that no user in the current recording environment is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information, the volume of the voice information is further measured. If the volume is lower than a preset volume threshold, it is determined that no user in the current recording environment is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information and its volume reaches the preset volume threshold, it is determined whether the voice information is generated by speaking or by singing (or drama). If it is generated by singing (or drama), the current recording environment is the second environment; if it is generated by speaking, it is determined that a user is performing speech output and the current recording environment is the first environment.
  • That is, if the current recording environment has no voice signal, it is the second environment; if it has a voice signal whose volume is below the preset volume threshold, it is the second environment; and if the volume reaches the threshold but the voice signal is produced by a singing (or drama) process, it is likewise the second environment.
  • The rhythm, melody, or cadence of the voice signal can be analyzed to determine whether the signal is generated by speaking or by singing (or drama).
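As a toy illustration of one such cue: singing tends to hold notes, producing long runs of nearly constant pitch, while spoken pitch drifts continuously. Everything below (the threshold values, the per-frame pitch inputs) is a hypothetical simplification for illustration, not the patent's method:

```python
def looks_like_singing(pitches, hold_tol=0.02, min_run=8):
    """Toy heuristic: report 'singing' when the longest run of frames
    with nearly constant pitch (relative change <= hold_tol) reaches
    min_run frames. A real system would model rhythm and melody properly."""
    longest = run = 1
    for prev, cur in zip(pitches, pitches[1:]):
        if prev and abs(cur - prev) / prev <= hold_tol:
            run += 1
        else:
            run = 1
        longest = max(longest, run)
    return longest >= min_run

sung = [220.0] * 10                            # one held note
spoken = [200.0 + 7.0 * i for i in range(10)]  # continuously gliding pitch
```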
  • In yet another implementation, determining the current recording environment based on the parameter information of the audio information in step S21 includes:
  • The signal-to-noise ratio of the current audio information is determined. If it is greater than a threshold, the current recording environment is the first environment; if it is less than the threshold, the current recording environment is the second environment.
  • A signal-to-noise ratio greater than the threshold indicates that the current recording environment is relatively quiet and the voice of a user speaking in it can be collected clearly, so the current recording environment is determined to be the first environment, and the current audio information is processed in real time by the speech recognition engine and synchronously converted into caption information. A signal-to-noise ratio below the threshold indicates that the current recording environment is relatively noisy and it is difficult to clearly collect the voice of a user speaking in it, so the current recording environment is determined to be the second environment, and real-time processing of the current audio information by the speech recognition engine is temporarily suspended.
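The SNR test above can be sketched as follows. The 15 dB threshold and the power inputs are illustrative assumptions; the patent does not specify a value:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)

def classify_environment(signal_power, noise_power, threshold_db=15.0):
    """First environment (quiet enough to keep converting speech into
    subtitles) when the SNR exceeds the threshold; second environment
    (suspend conversion) otherwise."""
    if snr_db(signal_power, noise_power) > threshold_db:
        return "first"
    return "second"
```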
  • the mobile terminal includes a microphone array including a plurality of microphones with different installation positions, wherein at least one microphone is disposed on a side where the camera is located, and at least one microphone is disposed on at least one other side of the mobile terminal. It should be noted that the positions of the plurality of microphones are different, and correspondingly, the sound pickup areas of the plurality of microphones are also different.
  • the audio information is obtained through the microphone of the mobile terminal, and the following manner may be adopted:
  • The audio information collected by the microphone located on the second side is used to perform noise reduction processing on the audio information collected by the microphone located on the first side, obtaining noise-reduced audio information.
  • The pickup area of the microphone on the first side can cover the shooting area of the camera currently performing image acquisition, while the pickup area of the microphone on the second side does not overlap, or only slightly overlaps, that shooting area.
  • The sound source that the videographer pays attention to is usually the current subject.
  • The microphone on the first side therefore collects sound mainly from the subject, while the microphone on the second side mainly collects ambient noise. Using the audio information collected by the second-side microphone to perform noise reduction on the audio information collected by the first-side microphone thus yields clearer voice information of the subject.
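As a minimal sketch of this two-sided noise reduction, the second-side signal can serve as a noise reference that is scaled and subtracted from the first-side signal. Real implementations typically use adaptive filters (e.g., LMS) or spectral subtraction; the single-tap least-squares canceller below is only a toy illustration under that assumption.

```python
def cancel_noise(primary, reference):
    """One-tap noise canceller: subtract the best scaled copy of the
    reference (second-side) signal from the primary (first-side) signal.
    The scale factor minimizes residual energy (least squares)."""
    num = sum(p * r for p, r in zip(primary, reference))
    den = sum(r * r for r in reference)
    alpha = num / den if den else 0.0
    return [p - alpha * r for p, r in zip(primary, reference)]
```

When the subject's voice is uncorrelated with the ambient noise, the residual approximates the voice alone.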
  • the audio information is obtained through the microphone of the mobile terminal, and the following manner may also be adopted:
  • the audio information of the target user is obtained through the microphone array.
  • The target user is a user whose image can be captured through the camera of the mobile terminal and displayed on the display screen of the mobile terminal.
  • the target user is positioned by the microphone array, and the gain of each microphone is adjusted according to the position of the target user and the installation position of the microphone in the microphone array, so as to track the target user and collect the audio information of the target user.
  • the microphone array of the mobile terminal includes a microphone 102, a microphone 103, a microphone 104, and a microphone 105, wherein the microphone 102 and the microphone 103 are on the same side as the camera 101, and the microphone 104 and the microphone 105 are on the other side.
  • the person A1 makes a speech
  • the mobile terminal performs video recording toward the person A1
  • The camera of the mobile terminal currently in the image capturing state is camera 101, and the shooting area of camera 101 is the area indicated by S1 in the figure.
  • the camera 101 performs image acquisition on the person A1, and the image of the person A1 is displayed in the display screen of the mobile terminal, and the person A1 is the target user.
  • the mobile terminal locates the person A1 through the microphone array to determine the position of the person A1.
  • the mobile terminal adjusts the gain of each microphone according to the position of the person A1 and the installation position of each microphone, realizes the sound source tracking of the person A1, collects the audio information of the person A1, and filters out the audio information generated by other personnel.
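The gain adjustment for tracking person A1 can be illustrated with a simple inverse-distance weighting. Actual microphone-array tracking uses beamforming with time delays and phase alignment, so the function below (the weighting rule and names are assumptions) only conveys the idea that microphones nearer the located target receive larger gains.

```python
import math

def microphone_gains(target_pos, mic_positions):
    """Weight each microphone by the inverse of its distance to the located
    target, then normalize, so nearer microphones contribute more to the
    combined signal and distant ones are attenuated."""
    dists = [math.dist(target_pos, m) for m in mic_positions]
    inv = [1.0 / d for d in dists]
    total = sum(inv)
    return [g / total for g in inv]
```

For the FIG. example, the gains of microphones 102 and 103 (near person A1) would dominate those of microphones 104 and 105.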
  • the subtitle stream may further carry display configuration information of the subtitle information.
  • the display configuration information of the subtitle information includes a display position of the subtitle information and/or a dynamic display mode of the subtitle information.
  • the subtitle stream may include, in addition to the subtitle information generated by the speech recognition engine, auxiliary information determined according to the emotional state of the provider of the speech information.
  • the auxiliary information includes but is not limited to pictures and emoticons.
  • The image obtained by the camera may be analyzed to determine the emotional state of the provider of the voice information from the provider's expression and/or body motion; the emotional state of the provider may also be determined from the voice information itself. Auxiliary information corresponding to the determined emotional state is then obtained.
  • the present application also discloses a mobile terminal having a structure as shown in FIG. 4, including an input interface 10, a camera 20, a microphone 301, and a processor 40.
  • Input interface 10 is used to acquire input commands.
  • The processor 40 is configured to: enter a video recording mode in response to a first instruction instructing to start recording a video; in the video recording mode, obtain image information through the camera 20 and audio information through the microphone 301; invoke a speech recognition engine and process the audio information in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information; exit the video recording mode in response to a second instruction instructing to end recording the video; and synthesize the image stream composed of the image information, the audio stream composed of the audio information, and the subtitle stream composed of the subtitle information obtained in the video recording mode into a first video file, such that when the first video file is played, the image stream, the audio stream, and the subtitle stream are output synchronously.
  • The mobile terminal disclosed in the present application processes the audio information in real time through the speech recognition engine while recording the video, thereby generating subtitle information synchronously based on the audio information; after exiting the video recording mode, it can synthesize the audio stream, the image stream, and the subtitle stream into a video file, thereby quickly producing a video file configured with subtitles.
  • In the aspect of processing the audio information in real time based on the speech recognition engine, the processor 40 is configured to: determine the current recording environment based on parameter information of the audio information; based on a result that the current recording environment is the first environment, synchronously convert the current audio information into subtitle information; and based on a result that the current recording environment is the second environment, suspend the operation of synchronously converting audio information into subtitle information until a result indicating that the current recording environment is the first environment is obtained.
  • The processor 40 configures the first environment to be an environment in which at least one user is performing speech output, and configures the second environment to be an environment in which only background sound exists.
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 40 is configured to: analyze the audio information obtained through the microphone and determine whether the audio information includes voice information. If the audio information does not include voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment. Further, if the audio information includes voice information, it is determined whether the voice information is generated by speaking or by singing (or drama); if it is voice information generated by singing (or drama), it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment; if it is voice information generated by speaking, it is determined that the current recording environment has a user who is performing speech output, and the current recording environment is the first environment.
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 40 is configured to: analyze the audio information obtained through the microphone and determine whether the audio information includes voice information. If the audio information does not include voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment. Further, if the audio information includes voice information, the volume of the voice information is measured; if the volume of the voice information is lower than a preset volume threshold, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information and the volume of the voice information reaches the preset volume threshold, it is determined whether the voice information is generated by speaking or by singing (or drama). If it is voice information generated by singing (or drama), it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment; if it is voice information generated by speaking, it is determined that the current recording environment has a user who is performing speech output, and the current recording environment is the first environment.
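The decision chain described in the preceding paragraphs can be summarized as a small function. The inputs are assumed to come from upstream detectors (voice-activity detection, a volume meter, and a speech-versus-singing classifier), and the volume threshold value is hypothetical.

```python
def determine_environment(has_voice, volume, is_speech, volume_threshold=40.0):
    """Decision chain from the disclosure: no voice -> second environment;
    voice below the volume threshold -> second; voice from singing/drama ->
    second; spoken voice at sufficient volume -> first environment."""
    if not has_voice:
        return "second"
    if volume < volume_threshold:
        return "second"
    return "first" if is_speech else "second"
```

Subtitle conversion runs only while the result is the first environment and is suspended otherwise.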
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 40 is configured to: determine the signal-to-noise ratio of the current audio information; if the signal-to-noise ratio of the current audio information is greater than a threshold, determine that the current recording environment is the first environment; if it is less than the threshold, determine that the current recording environment is the second environment.
  • the mobile terminal includes a microphone array 30 including a plurality of microphones having different installation positions, wherein at least one microphone is disposed on a side where the camera 20 is located, and at least one other side of the mobile terminal is disposed There is a microphone, and the mobile terminal also includes a display screen 50, as shown in FIG.
  • In the aspect of obtaining audio information through the microphone of the mobile terminal, the processor 40 is configured to: obtain the audio information collected by the microphone on the first side and the audio information collected by the microphone on the second side, and perform noise reduction processing on the audio information collected by the first-side microphone using the audio information collected by the second-side microphone, to obtain noise-reduced audio information.
  • the first side is a side where the camera currently performing image acquisition is located
  • The second side is a side, other than the first side, on which a microphone is provided.
  • In the aspect of obtaining audio information through the microphone of the mobile terminal, the processor 40 obtains the audio information of the target user through the microphone array 30, wherein the target user is a user whose image can be captured through the camera 20 of the mobile terminal and displayed on the display screen 50 of the mobile terminal.
  • In accordance with an embodiment of the present disclosure, processor 40 may, for example, comprise a general-purpose microprocessor, an instruction set processor, and/or a related chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)).
  • Processor 40 may also include onboard memory for caching purposes.
  • the processor 40 may be a single processing unit or a plurality of processing units for performing different actions of the method flow according to the embodiments of the present disclosure described with reference to FIGS. 1 through 2.
  • the present invention also discloses an audio file recording method applied to a mobile terminal.
  • FIG. 6 is a flowchart of a method for recording an audio file of a mobile terminal according to the present disclosure. The method includes:
  • Step S61: A first instruction instructing to start recording audio is obtained.
  • Step S62: The audio recording mode is entered in response to the first instruction.
  • The first instruction may be generated by pressing a physical button of the mobile terminal, by pressing a virtual button displayed by the mobile terminal, or by collecting the user's voice input through a voice collection module and recognizing that input to generate the trigger instruction.
  • the mobile terminal enters the audio recording mode in response to the obtained first instruction.
  • Step S63: In the audio recording mode, the audio information is obtained through the microphone of the mobile terminal.
  • The audio information obtained through the microphone of the mobile terminal may be the audio information of the current recording environment as collected directly by the microphone, or audio information obtained by processing the collected audio, such as audio information obtained by noise reduction processing of the collected audio, or the audio information generated by a particular object and extracted from the audio collected by the microphone.
  • Step S64: The speech recognition engine is invoked, and the audio information is processed in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information.
  • The mobile terminal invokes the speech recognition engine and processes the audio information in real time while the microphone is collecting it, obtaining the corresponding subtitle information; that is, the subtitle information is generated synchronously based on the audio information.
  • Generating the subtitle information synchronously based on the audio information may include processing the audio information to generate the subtitle information while the audio information is still being received; that is, the action of generating subtitle information is performed in synchronization with the action of receiving audio information.
  • the embodiment of the present disclosure does not limit the time at which the subtitle information is generated to be completely synchronized with the audio information. For example, since the audio information needs to be processed in real time, the time at which the subtitle information is generated may be slightly later than the time at which the corresponding audio information is received.
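One way to keep the subtitle stream synchronized despite the recognizer's slight lag is to timestamp each cue with the audio capture times rather than the emission times. Below is a sketch using the SubRip (SRT) cue format; SRT is one common choice, the disclosure itself does not name a subtitle format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def make_cue(index, start, end, text):
    """Build one SubRip cue; the recognizer's output lag is absorbed by
    stamping cues with the audio capture times, not the emission times."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```

Cues built this way play back aligned with the audio even if each one was produced slightly after its audio was received.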
  • Step S65: A second instruction instructing to end recording the audio is obtained.
  • Step S66: The audio recording mode is exited in response to the second instruction.
  • The second instruction may be generated by pressing a physical button of the mobile terminal, by pressing a virtual button displayed by the mobile terminal, or by collecting the user's voice input through a voice collection module and recognizing that input to generate the trigger instruction.
  • the mobile terminal exits the audio recording mode in response to the obtained second instruction, that is, ends the recording of the audio.
  • Step S67: In the audio recording mode, the audio stream composed of the audio information and the subtitle stream composed of the subtitle information are synthesized into the first audio file, so that when the first audio file is played, the audio stream and the subtitle stream are synchronously output.
  • the audio stream composed of the audio information obtained by the microphone and the subtitle stream composed of the subtitle information obtained by the speech recognition engine are synthesized into an audio file.
  • the audio stream and the subtitle stream included in the first audio file are output synchronously.
  • In the process of recording audio, the mobile terminal performs real-time processing on the audio information through the speech recognition engine, so that subtitle information is generated synchronously based on the audio information; after exiting the audio recording mode, the mobile terminal can generate an audio file based on the audio stream and the subtitle stream, thereby quickly producing an audio file configured with subtitles.
  • the real-time processing of the audio information by using the voice recognition engine is as follows.
  • the method includes: determining a current recording environment based on the parameter information of the audio information; and using the current recording environment as a result of the first environment, the current audio information. Synchronously converting to subtitle information; suspending the operation of synchronously converting audio information into subtitle information based on the result of the current recording environment being the second environment until a result indicating that the current recording environment is the first environment is obtained.
  • the first environment is configured to include an environment in which at least one user is performing speech output
  • the second environment is configured to include an environment in which only background sounds are present.
  • the user's speech output means that the user is talking.
  • the current recording environment is determined based on the parameter information of the audio information, including:
  • The audio information obtained through the microphone is analyzed to determine whether it includes voice information. If the audio information does not include voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment.
  • Further, if the audio information includes voice information but the voice signal is generated by a singing (or drama) process rather than by speaking, the current recording environment is likewise determined to be the second environment.
  • the current recording environment is determined based on the parameter information of the audio information, including:
  • the audio information obtained by the microphone is analyzed to determine whether the audio information includes the voice information. If the audio information does not include the voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment.
  • Further, if the audio information includes voice information, the volume of the voice information is measured; if the volume of the voice information is lower than a preset volume threshold, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information and the volume of the voice information reaches the preset volume threshold, it is determined whether the voice information is generated by speaking or by singing (or drama). If it is voice information generated by singing (or drama), it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment; if it is voice information generated by speaking, it is determined that the current recording environment has a user who is performing speech output, and the current recording environment is the first environment.
  • If the current recording environment has a voice signal but the volume of the voice signal is lower than the preset volume threshold, the current recording environment is determined to be the second environment. Further, if the volume of the voice signal reaches the preset volume threshold but the voice signal is generated by a singing (or drama) process, the current recording environment is also determined to be the second environment.
  • The rhythm, melody, or cadence of the voice signal can be analyzed to determine whether the signal is generated by speech or by singing (or drama).
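A toy version of such rhythm/melody analysis: singing tends to hold discrete pitches while natural speech glides continuously, so the fraction of near-zero frame-to-frame pitch changes can separate the two. The pitch track is assumed to come from an upstream pitch detector, and both threshold values are illustrative assumptions.

```python
def looks_like_singing(pitch_track, hold_tolerance=1.0, hold_ratio=0.5):
    """Heuristic: count frame-to-frame pitch changes that are near zero
    (held notes); if most changes are holds, classify as singing."""
    deltas = [abs(b - a) for a, b in zip(pitch_track, pitch_track[1:])]
    held = sum(1 for d in deltas if d <= hold_tolerance)
    return held / len(deltas) >= hold_ratio
```

A production classifier would also use rhythm regularity and spectral features; this sketch shows only the pitch-stability idea.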
  • the current recording environment is determined based on the parameter information of the audio information, including:
  • If the signal-to-noise ratio of the current audio information is less than the threshold, it is determined that the current recording environment is the second environment.
  • If the signal-to-noise ratio of the audio information obtained through the microphone is greater than the threshold, it indicates that the current recording environment is relatively quiet, and the sound signal of a user speaking in the recording environment can be clearly collected.
  • In that case, the current recording environment is determined to be the first environment, the current audio information is processed in real time by the speech recognition engine, and the current audio information is synchronously converted into subtitle information. If the signal-to-noise ratio of the audio information obtained through the microphone is less than the threshold, it indicates that the current recording environment is relatively noisy and it is difficult to clearly collect the sound signal of a user speaking in the recording environment, so the current recording environment is determined to be the second environment, and real-time processing of the current audio information by the speech recognition engine is temporarily suspended.
  • the mobile terminal includes an array of microphones including a plurality of microphones, the plurality of microphones being disposed on at least two sides of the mobile terminal.
  • the audio information is obtained through the microphone of the mobile terminal, and the following manner may be adopted:
  • the audio information of the target user is obtained through the microphone array.
  • the target user is the specified user.
  • The target user is located through the microphone array, and the gain of each microphone is adjusted according to the position of the target user and the installation positions of the microphones in the microphone array, so as to track the target user and collect the target user's audio information.
  • the subtitle stream may further carry display configuration information of the subtitle information.
  • the display configuration information of the subtitle information includes a display position of the subtitle information and/or a dynamic display mode of the subtitle information.
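The display configuration carried in the subtitle stream could be modeled as a small record such as the following; the field names and example values are illustrative assumptions, since the disclosure only specifies that a display position and/or a dynamic display mode may be carried.

```python
from dataclasses import dataclass

@dataclass
class SubtitleDisplayConfig:
    """Display configuration carried in the subtitle stream: where the
    caption is drawn and how it animates (field values are illustrative)."""
    position: str = "bottom"       # e.g. "bottom", "top", "beside-speaker"
    dynamic_mode: str = "static"   # e.g. "static", "scroll", "karaoke"
```

The player reads this record alongside each cue to decide placement and animation at render time.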
  • The subtitle stream may include, in addition to the subtitle information generated by the speech recognition engine, auxiliary information determined according to the emotional state of the provider of the voice information.
  • auxiliary information includes but is not limited to pictures and emoticons. In practice, the emotional state of its provider can be determined based on the voice information.
  • the present application also discloses a mobile terminal, which is structured as shown in FIG. 7 and includes an input interface 50, a microphone 601, and a processor 70.
  • Input interface 50 is used to acquire input commands.
  • The processor 70 is configured to: enter an audio recording mode in response to a first instruction instructing to start recording audio; in the audio recording mode, obtain audio information through the microphone 601; invoke a speech recognition engine and process the audio information in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information; exit the audio recording mode in response to a second instruction instructing to end recording the audio; and synthesize the audio stream composed of the audio information and the subtitle stream composed of the subtitle information obtained in the audio recording mode into a first audio file, so that when the first audio file is played, the audio stream and the subtitle stream are output synchronously.
  • The mobile terminal disclosed in the present application processes the audio information in real time through the speech recognition engine while recording audio, so that subtitle information is generated synchronously based on the audio information; after exiting the audio recording mode, the mobile terminal can generate an audio file based on the audio stream and the subtitle stream, thereby quickly producing an audio file configured with subtitles.
  • In the aspect of processing the audio information in real time based on the speech recognition engine, the processor 70 is configured to: determine the current recording environment based on parameter information of the audio information; based on a result that the current recording environment is the first environment, synchronously convert the current audio information into subtitle information; and based on a result that the current recording environment is the second environment, suspend the operation of synchronously converting audio information into subtitle information until a result indicating that the current recording environment is the first environment is obtained.
  • The processor 70 configures the first environment to be an environment in which at least one user is performing speech output, and configures the second environment to be an environment in which only background sound exists.
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 70 is configured to: analyze the audio information obtained through the microphone and determine whether the audio information includes voice information. If the audio information does not include voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment. Further, if the audio information includes voice information, it is determined whether the voice information is generated by speaking or by singing (or drama); if it is voice information generated by singing (or drama), it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment; if it is voice information generated by speaking, it is determined that the current recording environment has a user who is performing speech output, and the current recording environment is the first environment.
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 70 is configured to: analyze the audio information obtained through the microphone and determine whether the audio information includes voice information. If the audio information does not include voice information, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment. Further, if the audio information includes voice information, the volume of the voice information is measured; if the volume of the voice information is lower than a preset volume threshold, it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment.
  • If the audio information includes voice information and the volume of the voice information reaches the preset volume threshold, it is determined whether the voice information is generated by speaking or by singing (or drama). If it is voice information generated by singing (or drama), it is determined that the current recording environment has no user who is performing speech output, and the current recording environment is the second environment; if it is voice information generated by speaking, it is determined that the current recording environment has a user who is performing speech output, and the current recording environment is the first environment.
  • In the aspect of determining the current recording environment based on the parameter information of the audio information, the processor 70 is configured to: determine the signal-to-noise ratio of the current audio information; if the signal-to-noise ratio of the current audio information is greater than a threshold, determine that the current recording environment is the first environment; if it is less than the threshold, determine that the current recording environment is the second environment.
  • the mobile terminal includes an array of microphones including a plurality of microphones, the plurality of microphones being disposed on at least two sides of the mobile terminal.
  • In the aspect of obtaining audio information through the microphone of the mobile terminal, the processor 70 obtains the audio information of the target user through the microphone array.
  • the target user is the specified user.
  • Processor 70 may, for example, comprise a general purpose microprocessor, an instruction set processor, and/or a related chipset and/or a special purpose microprocessor (e.g., an application specific integrated circuit (ASIC)), and the like.
  • Processor 70 may also include onboard memory for caching purposes.
  • The processor 70 may be a single processing unit or a plurality of processing units for performing different actions of the method flow according to the embodiment of the present disclosure described with reference to FIG. 6.
  • Embodiments of the present invention initiate speech recognition at the time of video recording, recognizing the speech in the current environment and converting it into subtitles.
  • The subtitles are saved in synchronization with the images captured by the camera and the voice collected by the microphone to form the final multimedia file.
  • Through multi-microphone acquisition and sound noise reduction technology, embodiments of the present invention can collect voice only from objects in the camera acquisition area and have it synchronously recognized and converted by the speech recognition engine. Further, multi-microphone positioning technology can be used to locate a user who is performing voice output in the camera acquisition area, collect that user's voice in real time, and have the speech recognition engine recognize it and convert it into subtitles.

Abstract

The present application discloses a video file recording method for a mobile terminal. While the mobile terminal is in video recording mode, it obtains image information through a camera and audio information through a microphone, and invokes a speech recognition engine to process the obtained audio information in real time so that subtitle information is generated synchronously based on the audio information. After the mobile terminal exits the video recording mode, it synthesizes the image stream composed of the image information obtained during this recording, the audio stream composed of the audio information obtained during this recording, and the subtitle stream composed of the subtitle information obtained during this recording into a first video file. Based on the method disclosed in the present application, a video file configured with subtitles can be produced quickly. The present application also discloses an audio file recording method for a mobile terminal.

Description

Video File Recording Method, Audio File Recording Method, and Mobile Terminal

Technical Field

The present application belongs to the field of multimedia technology, and in particular relates to a video file recording method, an audio file recording method, and a mobile terminal.

Background

With the development of Internet technology and the growing richness of Internet resources, users can obtain a variety of resources for work, study, and entertainment through the Internet, among which audio and video are important resources.

To give users a richer experience, audio and video are usually provided with corresponding subtitles, so that hearing-impaired users, or users in noisy environments, can clearly understand the played content through the subtitles. At present, the audio or video is usually produced first, and the corresponding subtitles are produced afterwards. However, the existing ways of producing subtitles for audio or video are rather limited.
Summary of the Invention

In view of this, the purpose of the present application is to provide a video file recording method applied to a mobile terminal, so that a video file configured with subtitles can be produced more quickly. The present application also provides an audio file recording method applied to a mobile terminal, so that an audio file configured with subtitles can be produced more quickly.

To achieve the above purpose, the present application provides the following technical solutions:

In one aspect, the present application provides a video file recording method for a mobile terminal, including:

obtaining a first instruction instructing to start recording a video;

entering a video recording mode in response to the first instruction;

in the video recording mode, obtaining image information through a camera of the mobile terminal and obtaining audio information through a microphone of the mobile terminal;

invoking a speech recognition engine and processing the audio information in real time based on the speech recognition engine, so that subtitle information is generated synchronously based on the audio information;

obtaining a second instruction instructing to end recording the video;

exiting the video recording mode in response to the second instruction;

synthesizing the image stream composed of the image information, the audio stream composed of the audio information, and the subtitle stream composed of the subtitle information obtained in the video recording mode into a first video file, so that when the first video file is played, the image stream, the audio stream, and the subtitle stream are output synchronously.
Optionally, in the above method, processing the audio information in real time based on the speech recognition engine includes: determining a current recording environment based on parameter information of the audio information; based on a result that the current recording environment is a first environment, synchronously converting the current audio information into subtitle information; and based on a result that the current recording environment is a second environment, suspending the operation of synchronously converting audio information into subtitle information until a result indicating that the current recording environment is the first environment is obtained.

Optionally, in the above method, the first environment includes an environment in which at least one user is performing speech output, and the second environment includes an environment in which only background sound exists.

Optionally, in the above method, determining the current recording environment based on the parameter information of the audio information includes: determining a signal-to-noise ratio of the current audio information; if the signal-to-noise ratio of the current audio information is greater than a threshold, determining that the current recording environment is the first environment; if the signal-to-noise ratio of the current audio information is less than the threshold, determining that the current recording environment is the second environment.

Optionally, the mobile terminal includes a microphone array including a plurality of microphones with different installation positions, wherein at least one microphone is disposed on the side where the camera is located, and a microphone is disposed on at least one other side of the mobile terminal;

in the above method, obtaining audio information through the microphone of the mobile terminal includes: obtaining audio information of a target user through the microphone array, wherein the target user is a user whose image can be captured through the camera of the mobile terminal and displayed on the display screen of the mobile terminal.
另一方面,本申请提供一种移动终端,包括输入接口、摄像头、麦克风和处理器;
所述输入接口用于采集输入指令;
所述处理器用于:响应指示开始录制视频的第一指令,进入视频录制模式;在所述视频录制模式下,通过所述移动终端的摄像头获得图像信息,通过所述移动终端的麦克风获得音频信息;调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;响应指示结束录制视频的第二指令,退出所述视频录制模式;将在所述视频录制模式下,由所述图像信息构成的图像流、由所述音频信息构成的音频流、以及由所述字幕信息构成的字幕流合成为第一视频文件,以使得在播放所述第一视频文件时,同步输出所述图像流、所述音频流和所述字幕流。
可选的,上述移动终端中,所述处理器在基于所述语音识别引擎对所述音频信息进行实时处理的方面,用于:
基于所述音频信息的参数信息确定当前录制环境;基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为所述第一环境的结果。
可选的,上述移动终端中,所述处理器将所述第一环境配置为包括至少有一个用户在进行语言输出的环境,将所述第二环境配置为包括仅存在背景音的环境。
可选的,上述移动终端中,所述处理器在基于所述音频信息的参数信息确定当前录制环境的方面,用于:
确定当前音频信息的信噪比;如果当前音频信息的信噪比大于阈值,则确定当前录制环境为所述第一环境;如果当前音频信息的信噪比小于所述阈值,则确定当前录制环境为所述第二环境。
可选的,上述移动终端包括麦克风阵列,所述麦克风阵列包括多个安装位置不同的麦克风,其中,所述摄像头所在的侧面上设置有至少一个麦克风,所述移动终端的至少一个其他侧面上设置有麦克风;所述移动终端还包括显示屏;
所述处理器在通过所述移动终端的麦克风获得音频信息的方面,用于:通过所述麦克风阵列获得目标用户的音频信息,其中,所述目标用户为能够通过所述移动终端的摄像头进行图像采集且显示在所述移动终端的显示屏内的用户。
另一方面,本申请提供一种移动终端的音频文件录制方法,包括:
获得指示开始录制音频的第一指令;
响应所述第一指令,进入音频录制模式;
在所述音频录制模式下,通过所述移动终端的麦克风获得音频信息;
调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;
获得指示结束录制音频的第二指令;
响应所述第二指令,退出所述音频录制模式;
将在所述音频录制模式下,由所述音频信息构成的音频流以及由所述字幕信息构成的字幕流合成为第一音频文件,以使得在播放所述第一音频文件时,同步输出所述音频流和所述字幕流。
另一方面,本申请提供一种移动终端,包括输入接口、麦克风和处理器;
所述输入接口用于采集输入指令;
所述处理器用于:响应指示开始录制音频的第一指令,进入音频录制模式;在所述音频录制模式下,通过所述移动终端的麦克风获得音频信息;调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;响应指示结束录制音频的第二指令,退出所述音频录制模式;将在所述音频录制模式下,由所述音频信息构成的音频流以及由所述字幕信息构成的字幕流合成为第一音频文件,以使得在播放所述第一音频文件时,同步输出所述音频流和所述字幕流。
由此可见,本申请的有益效果为:
本申请公开的移动终端的视频文件录制方法,移动终端处于视频录制模式时,通过摄像头获得图像信息、通过麦克风获得音频信息,并且移动终端调用语音识别引擎,基于语音识别引擎对获得的音频信息进行实时处理,以便基于音频信息同步生成字幕信息,移动终端退出视频录制模式后,对本次视频录制过程中获得的图像信息构成的图像流、本次视频录制过程中获得的音频信息构成的音频流、以及本次视频录制过程中获得的字幕信息构成的字幕流进行合成处理,获得第一视频文件。可以看到,本申请公开的视频文件录制方法,移动终端在录制视频的过程中,通过语音识别引擎对音频信息进行实时处理,从而基于音频信息同步生成字幕信息,移动终端在退出视频录制模式后,即可基于音频流、图像流和字幕流生成视频文件,从而快捷地制作完成配置有字幕的视频文件。
附图说明
为了更清楚地说明本申请实施例,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请公开的一种移动终端的视频文件录制方法的流程图;
图2为本申请公开的基于语音识别引擎对音频信息进行实时处理的流程图;
图3为本申请公开的一种视频录制场景的示意图;
图4为本申请公开的一种移动终端的结构图;
图5为本申请公开的另一种移动终端的结构图;
图6为本申请公开的一种移动终端的音频文件录制方法的流程图;
图7为本申请公开的另一种移动终端的结构图。
具体实施方式
本申请公开视频文件录制方法、音频文件录制方法及相应的移动终端,在录制音频或者视频的过程中,通过识别音频信息同步生成对应的字幕信息,从而更加快捷地制作完成配置有字幕的音频文件或者视频文件。本申请中的移动终端可以为手机、平板电脑,或者其他具有音频录制功能和视频录制功能的终端。
下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。此外，在以下说明中，省略了对公知结构和技术的描述，以避免不必要地混淆本公开的概念。
在此使用的术语仅仅是为了描述具体实施例,而并非意在限制本公开。在此使用的术语“包括”、“包含”等表明了所述特征、步骤、操作和/或部件的存在,但是并不排除存在或添加一个或多个其他特征、步骤、操作或部件。
在此使用的所有术语(包括技术和科学术语)具有本领域技术人员通常所理解的含义,除非另外定义。应注意,这里使用的术语应解释为具有与本说明书的上下文相一致的含义,而不应以理想化或过于刻板的方式来解释。
在使用类似于“A、B和C等中至少一个”这样的表述的情况下,一般来说应该按照本领域技术人员通常理解该表述的含义来予以解释(例如,“具有A、B和C中至少一个的系统”应包括但不限于单独具有A、单独具有B、单独具有C、具有A和B、具有A和C、具有B和C、和/或具有A、B、C的系统等)。在使用类似于“A、B或C等中至少一个”这样的表述的情况下,一般来说应该按照本领域技术人员通常理解该表述的含义来予以解释(例如,“具有A、B或C中至少一个的系统”应包括但不限于单独具有A、单独具有B、单独具有C、具有A和B、具有A和C、具有B和C、和/或具有A、B、C的系统等)。本领域技术人员还应理解,实质上任意表示两个或更多可选项目的转折连词和/或短语,无论是在说明书、权利要求书还是附图中,都应被理解为给出了包括这些项目之一、这些项目任一方、或两个项目的可能性。例如,短语“A或B”应当被理解为包括“A”或“B”、或“A和B”的可能性。
附图中示出了一些方框图和/或流程图。应理解,方框图和/或流程图中的一些方框或其组合可以由计算机程序指令来实现。这些计算机程序指令可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器,从而这些指令在由该处理器执行时可以创建用于实现这些方框图和/或流程图中所说明的功能/操作的装置。
因此,本公开的技术可以硬件和/或软件(包括固件、微代码等)的形式来实现。另外,本公开的技术可以采取存储有指令的计算机可读介质上的计算机程序产品的形式,该计算机程序产品可供指令执行系统使用或者结合指令执行系统使用。在本公开的上下文中,计算机可读介质可以是能够包含、存储、传送、传播或传输指令的任意介质。例如,计算机可读介质可以包括但不限于电、磁、光、电磁、红外或半导体系统、装置、器件或传播介质。计算机可读介质的具体示例包括:磁存储装置,如磁带或硬盘(HDD);光存储装置,如光盘(CD-ROM);存储器,如随机存取存储器(RAM)或闪存;和/或有线/无线通信链路。
参见图1,图1为本申请公开的一种移动终端的视频文件录制方法的流程图。该方法包括:
步骤S11:获得指示开始录制视频的第一指令。
步骤S12:响应第一指令,进入视频录制模式。
其中,该第一指令可以通过按下移动终端的物理按键产生,可以通过按下移动终端显示的虚拟按键产生,也可以利用语音采集模块采集用户的语音输入,通过识别用户的语音输入产生触发指令。移动终端响应获得的第一指令进入视频录制模式。
步骤S13:在视频录制模式下,通过移动终端的摄像头获得图像信息,通过移动终端的麦克风获得音频信息。
需要说明的是,通过移动终端的麦克风获得的音频信息可以是麦克风采集到的当前录制环境的音频信息,也可以是对麦克风采集的音频信息进行处理后得到的音频信息,如对麦克风采集到的音频信息进行降噪处理所得到的音频信息,如从麦克风采集到的音频信息中提取出的某对象产生的音频信息。
步骤S14:调用语音识别引擎,基于语音识别引擎对音频信息进行实时处理,以使得基于音频信息同步生成字幕信息。
移动终端调用语音识别引擎，在麦克风采集音频信息的过程中，实时对音频信息进行处理，得到对应的字幕信息，也就是基于音频信息同步生成字幕信息。可以理解，本公开实施例中的基于音频信息同步生成字幕信息可以包括在接收音频信息的同时，对音频信息进行处理同步生成字幕信息，即，生成字幕信息的动作与接收音频信息的动作是同步进行的。但是，本公开实施例不限制字幕信息生成的时间与音频信息完全地同步，例如，由于需要对音频信息进行实时处理，则生成字幕信息的时间可以稍晚于接收到相应音频信息的时间。
步骤S15：获得指示结束录制视频的第二指令。
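上述"边采集、边识别、同步生成字幕"的流水线，可以用下面的 Python 草图示意。需要说明的是，其中的 RecognizerStub 是假设的识别引擎占位，并非真实语音识别接口；该草图仅用于说明字幕的生成与音频采集同步进行，且允许稍晚于对应音频的接收时间：

```python
import queue
import threading

class RecognizerStub:
    """假设的语音识别引擎占位：把一段音频映射为一条字幕文本（仅为演示）。"""
    def transcribe(self, chunk):
        return f"字幕[{chunk}]"

def caption_pipeline(audio_chunks):
    """模拟"边采集边识别"：音频块进入队列后由识别线程依次转换为字幕条目，
    识别动作与采集动作同步进行，但单条字幕可稍晚于对应音频产生。"""
    q = queue.Queue()
    captions = []
    engine = RecognizerStub()

    def worker():
        while True:
            item = q.get()
            if item is None:      # 结束标记：录制退出后停止识别
                break
            idx, chunk = item
            captions.append((idx, engine.transcribe(chunk)))

    t = threading.Thread(target=worker)
    t.start()
    for idx, chunk in enumerate(audio_chunks):   # 模拟麦克风持续送入音频块
        q.put((idx, chunk))
    q.put(None)
    t.join()
    return captions

if __name__ == "__main__":
    print(caption_pipeline(["a0", "a1", "a2"]))
```

由于识别线程按队列先进先出顺序消费，字幕条目与音频块的顺序保持一致，便于后续与音频流按时间对齐。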
步骤S16:响应第二指令,退出视频录制模式。
其中,该第二指令可以通过按下移动终端的物理按键产生,可以通过按下移动终端显示的虚拟按键产生,也可以利用语音采集模块采集用户的语音输入,通过识别用户的语音输入产生触发指令。移动终端响应获得的第二指令退出视频录制模式,也就是结束录制视频。
步骤S17:将在视频录制模式下,由图像信息构成的图像流、由音频信息构成的音频流、以及由字幕信息构成的字幕流合成为第一视频文件,以使得在播放第一视频文件时,同步输出图像流、音频流和字幕流。
也就是,将从获得第一指令开始到获得第二指令结束的过程中,通过摄像头获得的图像信息构成的图像流、通过麦克风获得的音频信息构成的音频流、以及通过语音识别引擎获得的字幕信息构成的字幕流合成为视频文件(记为第一视频文件)。在播放第一视频文件时,该第一视频文件包含的音频流、图像流和字幕流被同步输出。
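合成第一视频文件这一步，在具体实现中通常借助多媒体封装（mux）工具完成。下面用 Python 构造一条 ffmpeg 封装命令作为示意。本申请并未限定具体工具；ffmpeg 及 mov_text（MP4 容器内嵌字幕的常见编码方式）仅为常见用法举例，属于假设性草图：

```python
def build_mux_command(image_stream, audio_stream, subtitle_stream, output="out.mp4"):
    """构造把图像流、音频流、字幕流封装为一个视频文件的 ffmpeg 命令行参数列表。
    三路输入分别对应视频轨、音频轨和字幕轨；此处仅演示命令结构，不实际执行。"""
    return [
        "ffmpeg",
        "-i", image_stream,      # 图像流（视频轨）
        "-i", audio_stream,      # 音频流
        "-i", subtitle_stream,   # 字幕流（如 .srt 文件）
        "-c:v", "copy",          # 视频不重新编码
        "-c:a", "copy",          # 音频不重新编码
        "-c:s", "mov_text",      # 字幕转为 MP4 内嵌字幕轨
        output,
    ]

if __name__ == "__main__":
    print(build_mux_command("video.h264", "audio.aac", "subs.srt"))
```

播放器在解封装该文件时即可同步输出三路流。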
本申请公开的移动终端的视频文件录制方法,移动终端处于视频录制模式时,通过摄像头获得图像信息、通过麦克风获得音频信息,并且移动终端调用语音识别引擎,基于语音识别引擎对获得的音频信息进行实时处理,以便基于音频信息同步生成字幕信息,移动终端退出视频录制模式后,对本次视频录制过程中获得的图像信息构成的图像流、本次视频录制过程中获得的音频信息构成的音频流、以及本次视频录制过程中获得的字幕信息构成的字幕流进行合成处理,获得第一视频文件。可以看到,本申请公开的视频文件录制方法,移动终端在录制视频的过程中,通过语音识别引擎对音频信息进行实时处理,从而基于音频信息同步生成字幕信息,移动终端在退出视频录制模式后,即可基于音频流、图像流和字幕流生成视频文件,从而快捷地制作完成配置有字幕的视频文件。
作为一种实施方式,基于语音识别引擎对音频信息进行实时处理采用如图2所示的方式。具体包括:
步骤S21:基于音频信息的参数信息确定当前录制环境。
用户可能在不同的环境中录制视频，在某些环境下是无需生成字幕信息的。例如：当前录制环境下没有人说话，那么是无需生成字幕信息的。例如：当前录制环境下存在嘈杂的人声，但当前的拍摄对象并未说话，那么是无需生成字幕信息的。另外，在某些环境下，通过语音识别引擎难以准确地基于音频信息同步生成字幕信息。
因此,基于语音识别引擎对音频信息进行实时处理的过程中,根据音频信息的参数信息确定当前录制环境是第一环境还是第二环境,以确定是否通过语音识别引擎将音频信息同步转换为字幕信息。实施中,可以将第一环境视为存在有效语音信号的环境,将第二环境视为不存在有效语音信号的环境。
其中,有效语音信号是指满足预定要求的语音信号,例如:特定用户产生的语音信号作为有效语音信号,或者用户产生的音量达到了音量阈值的语音信号作为有效语音信号。
步骤S22:基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息。
步骤S23:基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为第一环境的结果。
如果当前录制环境为第一环境,那么通过语音识别引擎对当前的音频信息进行实时处理,将当前的音频信息同步转换为字幕信息。如果当前录制环境为第二环境,那么暂停通过语音识别引擎对当前的音频信息进行实时处理,直至获得表明当前录制环境为第一环境的结果,再次启动语音识别引擎对音频信息进行实时处理。
实施中,可以在字幕流中插入与暂停通过语音识别引擎对音频信息进行实时处理的时间段对应的空白。
例如:在录制视频的过程中,从第10分钟进入第二环境、到第12分钟从第二环境进入第一环境,那么在从第10分钟至第12分钟的时间段内,语音识别引擎暂停对音频信息进行实时处理,相应的,在字幕流中从第10分钟至第12分钟的时间段内插入空白。在该时间段内,如果有需要补充的字幕信息,那么用户后期可以在视频文件中对该时间段内的字幕信息进行编辑修改。
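以上"在字幕流中插入空白"的处理，可以用 SRT 字幕时间轴直观示意。下面的 Python 草图（时间以秒计，条目内容均为示例）在生成字幕数据时直接跳过第二环境对应的时间段，该时间段内不产生任何字幕条目：

```python
def format_ts(seconds):
    """把秒数格式化为 SRT 时间戳 HH:MM:SS,mmm。"""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_srt(entries):
    """entries: [(start_sec, end_sec, text), ...]，已按时间排序。
    第二环境（暂停识别）的时间段不出现在 entries 中，即字幕流中的空白。"""
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(f"{i}\n{format_ts(start)} --> {format_ts(end)}\n{text}\n")
    return "\n".join(blocks)

if __name__ == "__main__":
    # 第 600~720 秒（即第 10~12 分钟）处于第二环境，字幕流留白
    print(build_srt([(598.0, 600.0, "进入会议室"), (720.0, 723.5, "继续发言")]))
```

后期若需补充该空白时间段的字幕，只需向 entries 插入相应条目并重新生成即可。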
基于本申请图2所示的方法,移动终端在视频录制模式下,通过摄像头获得图像信息、通过麦克风获得音频信息,并且基于音频信息的参数信息确定当前录制环境,如果当前录制环境为第一环境,则通过语音识别引擎将当前的音频信息同步转换为字幕信息,如果当前录制环境为第二环境,则暂停通过语音识别引擎将音频信息同步转换为字幕信息,直至录制环境变换为第一环境,移动终端退出视频录制模式后,将本次视频录制过程中产生的图像流、音频流和字幕流合成为第一视频文件。可以看到,基于本申请图2所示的方法,如果当前录制环境为第二环境,则暂停通过语音识别引擎将音频信息同步转换为字幕信息,一方面能够降低语音识别引擎的数据处理量,另一方面也能够避免将录制环境中的杂音误处理为字幕信息或者提供错误的字幕信息。
可选的,将第一环境配置为至少有一个用户在进行言语输出的环境,将第二环境配置为仅存在背景音的环境。其中,用户在进行言语输出是指该用户在说话。
作为一种方式,步骤S21中基于音频信息的参数信息确定当前录制环境,包括:
对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。
进一步的,如果音频信息包含语音信息,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
也就是说,如果当前录制环境没有语音信号(没有人发出的声音),那么确定当前录制环境为第二环境,如果当前录制环境有语音信号,但是该语音信号是唱歌(或戏剧)过程所产生的语音信号,那么确定当前录制环境为第二环境。
作为另一种方式,步骤S21中基于音频信息的参数信息确定当前录制环境,包括:
对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。
进一步的,如果音频信息包含语音信息,进一步统计该语音信息的音量,如果该语音信息的音量低于预设的音量阈值,则确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。
进一步的,如果音频信息包含语音信息并且该语音信息的音量达到预设的音量阈值,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
也就是说,如果当前录制环境没有语音信号(没有人发出的声音),那么确定当前录制环境为第二环境,如果当前录制环境有语音信号,但是该语音信号的音量低于预设的音量阈值,则确定当前录制环境为第二环境,进一步的,如果该语音信号的音量达到预设的音量阈值但该语音信号是唱歌(或戏剧)过程所产生的语音信号,那么确定当前录制环境为第二环境。
需要说明的是,可以通过分析语音信号的节奏、旋律或者韵律,以确定语音信号是说话产生的还是唱歌(或戏剧)产生的。
作为另一种方式,步骤S21中基于音频信息的参数信息确定当前录制环境,包括:
确定当前音频信息的信噪比;
如果当前音频信息的信噪比大于阈值,则确定当前录制环境为第一环境;
如果当前音频信息的信噪比小于阈值,则确定当前录制环境为第二环境。
移动终端在视频录制模式下,如果通过麦克风获得的音频信息的信噪比大于阈值,表明当前录制环境较为安静,处于该录制环境中的用户说话时能够清楚地采集到该用户的声音信号,因此将当前录制环境确定为第一环境,通过语音识别引擎对当前的音频信息进行实时处理,将当前的音频信息同步转换为字幕信息。如果通过麦克风获得的音频信息的信噪比小于阈值,表明当前录制环境较为嘈杂,处于该录制环境中的用户说话时很难清楚地采集到该用户的声音信号,因此将当前录制环境确定为第二环境,暂停通过语音识别引擎对当前的音频信息进行实时处理。
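上述基于信噪比的环境判定可概括为如下 Python 草图。其中信噪比的估计方式（由信号功率与噪声功率折算）与 15 dB 阈值均为示例性假设，实际阈值由具体实现确定：

```python
import math

def snr_db(signal_power, noise_power):
    """由信号功率与噪声功率估计信噪比（单位：dB）。"""
    return 10.0 * math.log10(signal_power / noise_power)

def classify_environment(signal_power, noise_power, threshold_db=15.0):
    """信噪比大于阈值 → 第一环境（继续将音频同步转换为字幕）；
    否则 → 第二环境（暂停转换）。threshold_db 为假设的示例阈值。"""
    if snr_db(signal_power, noise_power) > threshold_db:
        return "第一环境"
    return "第二环境"

if __name__ == "__main__":
    print(classify_environment(100.0, 1.0))   # 20 dB，安静环境
    print(classify_environment(2.0, 1.0))     # 约 3 dB，嘈杂环境
```

实际实现中，两个功率量通常按帧统计（例如对每 20 ms 音频帧求平均能量），再对判定结果做平滑以避免环境状态频繁切换。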
作为一种优选方案,移动终端包括麦克风阵列,该麦克风阵列包括多个安装位置不同的麦克风,其中,摄像头所在的侧面上设置至少一个麦克风,移动终端的至少一个其他侧面上设置至少一个麦克风。需要说明的是,多个麦克风的位置是不同的,相应的,多个麦克风的拾音区也是不同的。
本申请上述公开的视频文件录制方法中,通过移动终端的麦克风获得音频信息,可以采用如下方式:
1)、获得第一侧面上麦克风采集的音频信息,获得第二侧面上麦克风采集的音频信息,其中,第一侧面是当前进行图像采集的摄像头所在的侧面,第二侧面是除第一侧面之外设置有麦克风的侧面;
2)、利用位于第二侧面的麦克风采集的音频信息对位于第一侧面的麦克风采集的音频信息进行降噪处理,获得经过降噪处理后的音频信息。
移动终端处于视频录制模式时,位于第一侧面的麦克风的拾音区能够覆盖当前进行图像采集的摄像头的拍摄区域,而位于第二侧面的麦克风的拾音区与当前进行图像采集的摄像头的拍摄区域没有重叠,或者仅有很小的重叠区域。而视频拍摄者关注的声音源通常是当前的拍摄对象,位于第一侧面的麦克风采集的主要是拍摄对象发出的声音,而位于第二侧面的麦克风采集的主要是环境噪音,因此,利用位于第二侧面的麦克风采集的音频信息对位于第一侧面的麦克风采集的音频信息进行降噪处理,能够得到拍摄对象更加清楚的语音信息。
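利用第二侧面麦克风采集的参考噪声对第一侧面麦克风信号进行降噪，一种常见做法是自适应噪声抵消。下面给出一个最小的 LMS（最小均方）自适应滤波草图，纯 Python 实现；滤波器阶数 taps、步长 mu 等参数均为示例，并非本申请限定的实现方式：

```python
import math

def lms_denoise(primary, reference, mu=0.01, taps=4):
    """用参考麦克风信号自适应估计主麦克风信号中的噪声分量并减去。
    primary: 第一侧面（朝向拍摄对象）麦克风信号；
    reference: 第二侧面（主要拾取环境噪声）麦克风信号。"""
    w = [0.0] * taps
    out = []
    for n in range(len(primary)):
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        noise_est = sum(wi * xi for wi, xi in zip(w, x))
        e = primary[n] - noise_est            # e 即降噪后的输出样本
        w = [wi + 2.0 * mu * e * xi for wi, xi in zip(w, x)]
        out.append(e)
    return out

def demo_residual_power():
    """演示：主信号只含噪声（参考信号的延迟缩放副本），理想输出应趋近 0。"""
    ref = [math.sin(0.3 * n) for n in range(2000)]
    noisy = [0.8 * ref[n - 1] if n >= 1 else 0.0 for n in range(2000)]
    out = lms_denoise(noisy, ref)
    tail = out[-200:]
    return sum(e * e for e in tail) / len(tail)

if __name__ == "__main__":
    print(demo_residual_power())   # 收敛后的残余噪声功率，应远小于 0.32
```

由于拍摄对象的语音主要进入第一侧面麦克风、环境噪声同时进入两侧麦克风，滤波器收敛后被减去的主要是噪声分量，语音分量得以保留。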
另外,本申请上述公开的视频文件录制方法中,通过移动终端的麦克风获得音频信息,也可以采用如下方式:
通过麦克风阵列获得目标用户的音频信息。其中,目标用户为能够通过移动终端的摄像头进行图像采集且图像显示在移动终端的显示屏内的用户。
实施中,通过麦克风阵列对目标用户进行定位,根据目标用户的位置以及麦克风阵列中麦克风的安装位置调整各个麦克风的增益,实现对目标用户的追踪,采集该目标用户的音频信息。
以图3所示的办公室录制场景为例:
在办公室中共有10个人员,并且10个人员呈环形围坐。移动终端的麦克风阵列包括麦克风102、麦克风103、麦克风104和麦克风105,其中,麦克风102以及麦克风103与摄像头101处于同一侧面,麦克风104和麦克风105位于其他侧面上。
在当前时刻，人员A1进行发言，移动终端朝向人员A1进行视频录制，并且移动终端中当前处于图像采集状态的摄像头为101，摄像头101的拍摄区域为图中以S1标示的区域。此时，摄像头101对人员A1进行图像采集，并且人员A1的图像显示在移动终端的显示屏内，人员A1即为目标用户。
移动终端通过麦克风阵列对人员A1进行定位,确定人员A1的位置。移动终端根据人员A1的位置以及各麦克风的安装位置,调整各个麦克风的增益,实现对人员A1的音源跟踪,采集人员A1的音频信息,将其他人员产生的音频信息滤除。
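根据目标用户的位置对各麦克风信号进行对齐叠加，最简单的示意是"延迟-求和"波束成形。下面的 Python 草图假设各通道的延迟量已由定位结果折算为整数采样数（仅为示例；实际实现通常使用分数延迟与更复杂的增益控制）：

```python
def delay_and_sum(channels, delays):
    """channels: 各麦克风采集到的信号（等长列表）；
    delays: 通道 k 相对参考通道滞后的采样数，读取 ch[t + d] 即把各路
    对齐到声源时刻。目标方向的声音相干叠加，其余方向被平均抑制。"""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            acc += ch[idx] if 0 <= idx < n else 0.0
        out.append(acc / len(channels))
    return out

if __name__ == "__main__":
    # 同一声源脉冲以不同延迟到达三个麦克风，对齐求和后脉冲恢复
    s = [0.0] * 10
    s[4] = 1.0
    ch1 = [0.0] * 10; ch1[6] = 1.0   # 滞后 2 个采样
    ch2 = [0.0] * 10; ch2[7] = 1.0   # 滞后 3 个采样
    print(delay_and_sum([s, ch1, ch2], [0, 2, 3]))
```

来自其他方向（即延迟关系不匹配）的人员声音在求和中互相错开，幅度被按通道数平均削弱，从而近似实现"采集人员A1的音频信息，滤除其他人员的音频信息"。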
另外,在本申请上述公开的视频文件录制方法中,字幕流还可以携带字幕信息的显示配置信息。其中,字幕信息的显示配置信息包括字幕信息的显示位置和/或字幕信息的动态显示模式。
另外,字幕流中除了通过语音识别引擎产生的字幕信息之外,还可以包括:根据语音信息的提供者的情绪状态确定的辅助信息。其中,辅助信息包括但不限于图片、表情符号。实施中,对通过摄像头获得的图像进行分析,根据语音信息的提供者的表情和/或肢体动作确定该提供者的情绪状态,也可以根据语音信息确定其提供者的情绪状态,获得与该情绪状态对应的辅助信息。
本申请还公开一种移动终端,其结构如图4所示,包括输入接口10、摄像头20、麦克风301和处理器40。
输入接口10用于采集输入指令。
处理器40用于:响应指示开始录制视频的第一指令,进入视频录制模式;在视频录制模式下,通过摄像头20获得图像信息,通过麦克风301获得音频信息;调用语音识别引擎,基于语音识别引擎对音频信息进行实时处理,以使得基于音频信息同步生成字幕信息;响应指示结束录制视频的第二指令,退出视频录制模式;将在视频录制模式下,由图像信息构成的图像流、由音频信息构成的音频流、以及由字幕信息构成的字幕流合成为第一视频文件,以使得在播放第一视频文件时,同步输出图像流、音频流和字幕流。
本申请公开的移动终端在录制视频的过程中,通过语音识别引擎对音频信息进行实时处理,从而基于音频信息同步生成字幕信息,在退出视频录制模式后,即可基于音频流、图像流和字幕流生成视频文件,从而快捷地制作完成配置有字幕的视频文件。
作为一种实施方式,处理器40在基于语音识别引擎对音频信息进行实时处理的方面,用于:
基于音频信息的参数信息确定当前录制环境;基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为第一环境的结果。
可选的,处理器40将第一环境配置为至少有一个用户在进行语言输出的环境,将第二环境配置为仅存在背景音的环境。
作为一种实施方式,处理器40在基于音频信息的参数信息确定当前录制环境的方面,用于:对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
作为一种实施方式,处理器40在基于音频信息的参数信息确定当前录制环境的方面,用于:对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息,进一步统计该语音信息的音量,如果该语音信息的音量低于预设的音量阈值,则确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息并且该语音信息的音量达到预设的音量阈值,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
作为另一种实施方式,处理器40在基于音频信息的参数信息确定当前录制环境的方面,用于:确定当前音频信息的信噪比;如果当前音频信息的信噪比大于阈值,则确定当前录制环境为第一环境;如果当前音频信息的信噪比小于阈值,则确定当前录制环境为第二环境。
作为一种优选实施方式,移动终端包括麦克风阵列30,该麦克风阵列30包括多个安装位置不同的麦克风,其中,摄像头20所在的侧面上设置有至少一个麦克风,移动终端的至少一个其他侧面上设置有麦克风,移动终端还包括显示屏50,如图5所示。
在移动终端包括麦克风阵列30的情况下,作为一种实施方式,处理器40在通过移动终端的麦克风获得音频信息的方面,用于:获得第一侧面上麦克风采集的音频信息,获得第二侧面上麦克风采集的音频信息,利用位于第二侧面的麦克风采集的音频信息对位于第一侧面的麦克风采集的音频信息进行降噪处理,获得经过降噪处理后的音频信息。其中,第一侧面是当前进行图像采集的摄像头所在的侧面,第二侧面是除第一侧面之外设置有麦克风的侧面。
在移动终端包括麦克风阵列30的情况下,作为另一种实施方式,处理器40在通过移动终端的麦克风获得音频信息的方面,用于:通过麦克风阵列30获得目标用户的音频信息,其中,目标用户为能够通过移动终端的摄像头20进行图像采集且图像显示在移动终端的显示屏50内的用户。
根据本公开实施例,处理器40例如可以包括通用微处理器、指令集处理器和/或相关芯片组和/或专用微处理器(例如,专用集成电路(ASIC)),等等。处理器40还可以包括用于缓存用途的板载存储器。处理器40可以是用于执行参考图1~图2描述的根据本公开实施例的方法流程的不同动作的单一处理单元或者是多个处理单元。
本发明还公开应用于移动终端的音频文件录制方法。
参见图6,图6为本申请公开的一种移动终端的音频文件录制方法的流程图。该方法包括:
步骤S61:获得指示开始录制音频的第一指令。
步骤S62:响应第一指令,进入音频录制模式。
其中,该第一指令可以通过按下移动终端的物理按键产生,可以通过按下移动终端显示的虚拟按键产生,也可以利用语音采集模块采集用户的语音输入,通过识别用户的语音输入产生触发指令。移动终端响应获得的第一指令进入音频录制模式。
步骤S63:在音频录制模式下,通过移动终端的麦克风获得音频信息。
需要说明的是,通过移动终端的麦克风获得的音频信息可以是麦克风采集到的当前录制环境的音频信息,也可以是对麦克风采集的音频信息进行处理后得到的音频信息,如对麦克风采集到的音频信息进行降噪处理所得到的音频信息,如从麦克风采集到的音频信息中提取出的某对象产生的音频信息。
步骤S64:调用语音识别引擎,基于语音识别引擎对音频信息进行实时处理,以使得基于音频信息同步生成字幕信息。
移动终端调用语音识别引擎，在麦克风采集音频信息的过程中，实时对音频信息进行处理，得到对应的字幕信息，也就是基于音频信息同步生成字幕信息。可以理解，本公开实施例中的基于音频信息同步生成字幕信息可以包括在接收音频信息的同时，对音频信息进行处理同步生成字幕信息，即，生成字幕信息的动作与接收音频信息的动作是同步进行的。但是，本公开实施例不限制字幕信息生成的时间与音频信息完全地同步，例如，由于需要对音频信息进行实时处理，则生成字幕信息的时间可以稍晚于接收到相应音频信息的时间。
步骤S65：获得指示结束录制音频的第二指令。
步骤S66:响应第二指令,退出音频录制模式。
其中,该第二指令可以通过按下移动终端的物理按键产生,可以通过按下移动终端显示的虚拟按键产生,也可以利用语音采集模块采集用户的语音输入,通过识别用户的语音输入产生触发指令。移动终端响应获得的第二指令退出音频录制模式,也就是结束录制音频。
步骤S67:将在音频录制模式下,由音频信息构成的音频流以及由字幕信息构成的字幕流合成为第一音频文件,以使得在播放第一音频文件时,同步输出音频流和字幕流。
也就是,将从获得第一指令开始到获得第二指令结束的过程中,通过麦克风获得的音频信息构成的音频流、以及通过语音识别引擎获得的字幕信息构成的字幕流合成为音频文件(记为第一音频文件)。在播放第一音频文件时,该第一音频文件包含的音频流和字幕流被同步输出。
本申请公开的音频文件录制方法,移动终端在录制音频的过程中,通过语音识别引擎对音频信息进行实时处理,从而基于音频信息同步生成字幕信息,移动终端在退出音频录制模式后,即可基于音频流和字幕流生成音频文件,从而快捷地制作完成配置有字幕的音频文件。
作为一种实施方式,基于语音识别引擎对音频信息进行实时处理采用如下方式,具体包括:基于音频信息的参数信息确定当前录制环境;基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为第一环境的结果。具体的实施方式可以参见前文中关于图2的说明。
可选的,将第一环境配置为包括至少有一个用户在进行言语输出的环境,将第二环境配置为包括仅存在背景音的环境。其中,用户在进行言语输出是指该用户在说话。
作为一种方式,基于音频信息的参数信息确定当前录制环境,包括:
对通过麦克风获得的音频信息进行分析，确定音频信息中是否包含语音信息，如果音频信息不包含语音信息，那么确定当前录制环境没有正在进行言语输出的用户，当前录制环境为第二环境。
进一步的，如果音频信息包含语音信息，那么判断该语音信息是说话产生的语音信息还是唱歌（或戏剧）产生的语音信息，如果是唱歌（或戏剧）产生的语音信息，那么确定当前录制环境没有正在进行言语输出的用户，当前录制环境为第二环境，如果是说话产生的语音信息，那么确定当前录制环境有正在进行言语输出的用户，当前录制环境为第一环境。
也就是说,如果当前录制环境没有语音信号(没有人发出的声音),那么确定当前录制环境为第二环境,如果当前录制环境有语音信号,但是该语音信号是唱歌(或戏剧)过程所产生的语音信号,那么确定当前录制环境为第二环境。
作为另一种方式,基于音频信息的参数信息确定当前录制环境,包括:
对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。
进一步的,如果音频信息包含语音信息,进一步统计该语音信息的音量,如果该语音信息的音量低于预设的音量阈值,则确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。
进一步的,如果音频信息包含语音信息并且该语音信息的音量达到预设的音量阈值,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
也就是说,如果当前录制环境没有语音信号(没有人发出的声音),那么确定当前录制环境为第二环境,如果当前录制环境有语音信号,但是该语音信号的音量低于预设的音量阈值,则确定当前录制环境为第二环境,进一步的,如果该语音信号的音量达到预设的音量阈值但该语音信号是唱歌(或戏剧)过程所产生的语音信号,那么确定当前录制环境为第二环境。
需要说明的是,可以通过分析语音信号的节奏、旋律或者韵律,以确定语音信号是说话产生的还是唱歌(或戏剧)产生的。
作为另一种方式,基于音频信息的参数信息确定当前录制环境,包括:
确定当前音频信息的信噪比;
如果当前音频信息的信噪比大于阈值,则确定当前录制环境为第一环境;
如果当前音频信息的信噪比小于阈值,则确定当前录制环境为第二环境。
移动终端在音频录制模式下,如果通过麦克风获得的音频信息的信噪比大于阈值,表明当前录制环境较为安静,处于该录制环境中的用户说话时能够清楚地采集到该用户的声音信号,因此将当前录制环境确定为第一环境,通过语音识别引擎对当前的音频信息进行实时处理,将当前的音频信息同步转换为字幕信息。如果通过麦克风获得的音频信息的信噪比小于阈值,表明当前录制环境较为嘈杂,处于该录制环境中的用户说话时很难清楚地采集到该用户的声音信号,因此将当前录制环境确定为第二环境,暂停通过语音识别引擎对当前的音频信息进行实时处理。
作为一种优选方案,移动终端包括麦克风阵列,该麦克风阵列包括多个麦克风,多个麦克风布置于移动终端的至少两个侧面上。
在本申请上述公开的音频文件录制方法中,通过移动终端的麦克风获得音频信息,可以采用如下方式:
通过麦克风阵列获得目标用户的音频信息。其中,目标用户为指定的用户。
实施中,通过麦克风阵列对目标用户进行定位,根据目标用户的位置以及麦克风阵列中麦克风的安装位置调整各个麦克风的增益,实现对目标用户的追踪,以便采集该目标用户的音频信息。
另外,在本申请上述公开的音频文件录制方法中,字幕流还可以携带字幕信息的显示配置信息。其中,字幕信息的显示配置信息包括字幕信息的显示位置和/或字幕信息的动态显示模式。
另外,字幕流中除了通过语音识别引擎产生的字幕信息之外,还可以包括:根据语音信息的提供者的状态确定的辅助信息。其中,辅助信息包括但不限于图片、表情符号。实施中,可以根据语音信息确定其提供者的情绪状态。
本申请还公开一种移动终端,其结构如图7所示,包括输入接口50、麦克风601和处理器70。
输入接口50用于采集输入指令。
处理器70用于：响应指示开始录制音频的第一指令，进入音频录制模式；在音频录制模式下，通过麦克风601获得音频信息；调用语音识别引擎，基于语音识别引擎对音频信息进行实时处理，以使得基于音频信息同步生成字幕信息；响应指示结束录制音频的第二指令，退出音频录制模式；将在音频录制模式下，由音频信息构成的音频流以及由字幕信息构成的字幕流合成为第一音频文件，以使得在播放第一音频文件时，同步输出音频流和字幕流。
本申请公开的移动终端在录制音频的过程中,通过语音识别引擎对音频信息进行实时处理,从而基于音频信息同步生成字幕信息,移动终端在退出音频录制模式后,即可基于音频流和字幕流生成音频文件,从而快捷地制作完成配置有字幕的音频文件。
作为一种实施方式,处理器70在基于语音识别引擎对音频信息进行实时处理的方面,用于:基于音频信息的参数信息确定当前录制环境;基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为第一环境的结果。
可选的,处理器70将第一环境配置为至少有一个用户在进行语言输出的环境,将第二环境配置为仅存在背景音的环境。
作为一种实施方式,处理器70在基于音频信息的参数信息确定当前录制环境的方面,用于:对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
作为一种实施方式,处理器70在基于音频信息的参数信息确定当前录制环境的方面,用于:对通过麦克风获得的音频信息进行分析,确定音频信息中是否包含语音信息,如果音频信息不包含语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息,进一步统计该语音信息的音量,如果该语音信息的音量低于预设的音量阈值,则确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境。进一步的,如果音频信息包含语音信息并且该语音信息的音量达到预设的音量阈值,那么判断该语音信息是说话产生的语音信息还是唱歌(或戏剧)产生的语音信息,如果是唱歌(或戏剧)产生的语音信息,那么确定当前录制环境没有正在进行言语输出的用户,当前录制环境为第二环境,如果是说话产生的语音信息,那么确定当前录制环境有正在进行言语输出的用户,当前录制环境为第一环境。
作为另一种实施方式,处理器70在基于音频信息的参数信息确定当前录制环境的方面,用于:确定当前音频信息的信噪比;如果当前音频信息的信噪比大于阈值,则确定当前录制环境为第一环境;如果当前音频信息的信噪比小于阈值,则确定当前录制环境为第二环境。
作为一种优选实施方式,移动终端包括麦克风阵列,该麦克风阵列包括多个麦克风,多个麦克风布置于移动终端的至少两个侧面上。
在移动终端包括麦克风阵列的情况下,作为一种实施方式,处理器70在通过移动终端的麦克风获得音频信息的方面,用于:通过麦克风阵列获得目标用户的音频信息。其中,目标用户为指定的用户。
处理器70例如可以包括通用微处理器、指令集处理器和/或相关芯片组和/或专用微处理器（例如，专用集成电路（ASIC）），等等。处理器70还可以包括用于缓存用途的板载存储器。处理器70可以是用于执行参考图6描述的根据本公开实施例的方法流程的不同动作的单一处理单元或者是多个处理单元。
本发明的实施例在视频录制的时候启动语音识别，针对当前环境中的语音进行识别并转换成字幕。该字幕与摄像头采集的图像、麦克风采集的语音同步保存，形成最终的多媒体文件。本发明的实施例通过多个麦克风的采集以及声音降噪技术，能够实现仅针对摄像头采集区域中的对象进行语音采集，并通过语音识别引擎进行同步识别和转换。更进一步的，可以通过多麦克风定位技术定位到摄像头采集区域中某一个正在进行语音输出的用户，对其进行实时采集，并通过语音识别引擎针对该用户进行识别和转换成字幕。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的移动终端而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施 例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (12)

  1. 一种移动终端的视频文件录制方法,其特征在于,包括:
    获得指示开始录制视频的第一指令;
    响应所述第一指令,进入视频录制模式;
    在所述视频录制模式下,通过所述移动终端的摄像头获得图像信息,通过所述移动终端的麦克风获得音频信息;
    调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;
    获得指示结束录制视频的第二指令;
    响应所述第二指令,退出所述视频录制模式;
    将在所述视频录制模式下,由所述图像信息构成的图像流、由所述音频信息构成的音频流、以及由所述字幕信息构成的字幕流合成为第一视频文件,以使得在播放所述第一视频文件时,同步输出所述图像流、所述音频流和所述字幕流。
  2. 根据权利要求1所述的方法,其特征在于,所述基于所述语音识别引擎对所述音频信息进行实时处理,包括:
    基于所述音频信息的参数信息确定当前录制环境;
    基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;
    基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为所述第一环境的结果。
  3. 根据权利要求2所述的方法,其特征在于,所述第一环境包括至少有一个用户在进行语言输出的环境,所述第二环境包括仅存在背景音的环境。
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述音频信息的参数信息确定当前录制环境,包括:
    确定当前音频信息的信噪比;
    如果当前音频信息的信噪比大于阈值,则确定当前录制环境为所述第一环境;
    如果当前音频信息的信噪比小于所述阈值,则确定当前录制环境为所述第二环境。
  5. 根据权利要求1所述的方法,其特征在于,所述移动终端包括麦克风阵列,所述麦克风阵列包括多个安装位置不同的麦克风,其中,所述摄像头所在的侧面上设置有至少一个麦克风,所述移动终端的至少一个其他侧面上设置有麦克风;
    所述通过所述移动终端的麦克风获得音频信息，包括：通过所述麦克风阵列获得目标用户的音频信息，其中，所述目标用户为能够通过所述移动终端的摄像头进行图像采集且显示在所述移动终端的显示屏内的用户。
  6. 一种移动终端,其特征在于,包括输入接口、摄像头、麦克风和处理器;
    所述输入接口用于采集输入指令;
    所述处理器用于:响应指示开始录制视频的第一指令,进入视频录制模式;在所述视频录制模式下,通过所述移动终端的摄像头获得图像信息,通过所述移动终端的麦克风获得音频信息;调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;响应指示结束录制视频的第二指令,退出所述视频录制模式;将在所述视频录制模式下,由所述图像信息构成的图像流、由所述音频信息构成的音频流、以及由所述字幕信息构成的字幕流合成为第一视频文件,以使得在播放所述第一视频文件时,同步输出所述图像流、所述音频流和所述字幕流。
  7. 根据权利要求6所述的移动终端,其特征在于,所述处理器在基于所述语音识别引擎对所述音频信息进行实时处理的方面,用于:
    基于所述音频信息的参数信息确定当前录制环境;基于当前录制环境为第一环境的结果,将当前的音频信息同步转换为字幕信息;基于当前录制环境为第二环境的结果,暂停将音频信息同步转换为字幕信息的操作,直至获得表明当前录制环境为所述第一环境的结果。
  8. 根据权利要求7所述的移动终端,其特征在于,所述处理器将所述第一环境配置为包括至少有一个用户在进行语言输出的环境,将所述第二环境配置为包括仅存在背景音的环境。
  9. 根据权利要求8所述的移动终端,其特征在于,所述处理器在基于所述音频信息的参数信息确定当前录制环境的方面,用于:
    确定当前音频信息的信噪比;如果当前音频信息的信噪比大于阈值,则确定当前录制环境为所述第一环境;如果当前音频信息的信噪比小于所述阈值,则确定当前录制环境为所述第二环境。
  10. 根据权利要求6所述的移动终端,其特征在于,所述移动终端包括麦克风阵列,所述麦克风阵列包括多个安装位置不同的麦克风,其中,所述摄像头所在的侧面上设置有至少一个麦克风,所述移动终端的至少一个其他侧面上设置有麦克风;所述移动终端还包括显示屏;
    所述处理器在通过所述移动终端的麦克风获得音频信息的方面,用于:通过所述麦克风阵列获得目标用户的音频信息,其中,所述目标用户为能够通过所述移动终端的摄像头进行图像采集且显示在所述移动终端的显示屏内的用户。
  11. 一种移动终端的音频文件录制方法,其特征在于,包括:
    获得指示开始录制音频的第一指令;
    响应所述第一指令,进入音频录制模式;
    在所述音频录制模式下,通过所述移动终端的麦克风获得音频信息;
    调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;
    获得指示结束录制音频的第二指令;
    响应所述第二指令,退出所述音频录制模式;
    将在所述音频录制模式下,由所述音频信息构成的音频流以及由所述字幕信息构成的字幕流合成为第一音频文件,以使得在播放所述第一音频文件时,同步输出所述音频流和所述字幕流。
  12. 一种移动终端,其特征在于,包括输入接口、麦克风和处理器;
    所述输入接口用于采集输入指令;
    所述处理器用于:响应指示开始录制音频的第一指令,进入音频录制模式;在所述音频录制模式下,通过所述移动终端的麦克风获得音频信息;调用语音识别引擎,基于所述语音识别引擎对所述音频信息进行实时处理,以使得基于所述音频信息同步生成字幕信息;响应指示结束录制音频的第二指令,退出所述音频录制模式;将在所述音频录制模式下,由所述音频信息构成的音频流以及由所述字幕信息构成的字幕流合成为第一音频文件,以使得在播放所述第一音频文件时,同步输出所述音频流和所述字幕流。
PCT/CN2017/107014 2017-06-30 2017-10-20 视频文件录制方法、音频文件录制方法及移动终端 WO2019000721A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710525908.8A CN107316642A (zh) 2017-06-30 2017-06-30 视频文件录制方法、音频文件录制方法及移动终端
CN201710525908.8 2017-06-30

Publications (1)

Publication Number Publication Date
WO2019000721A1 true WO2019000721A1 (zh) 2019-01-03

Family

ID=60180331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/107014 WO2019000721A1 (zh) 2017-06-30 2017-10-20 视频文件录制方法、音频文件录制方法及移动终端

Country Status (2)

Country Link
CN (1) CN107316642A (zh)
WO (1) WO2019000721A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814732A (zh) * 2020-07-23 2020-10-23 上海优扬新媒信息技术有限公司 一种身份验证方法及装置
CN112672099A (zh) * 2020-12-31 2021-04-16 深圳市潮流网络技术有限公司 字幕数据生成和呈现方法、装置、计算设备、存储介质
CN112770160A (zh) * 2020-12-24 2021-05-07 沈阳麟龙科技股份有限公司 一种股票分析视频创作系统及方法
CN113014984A (zh) * 2019-12-18 2021-06-22 深圳市万普拉斯科技有限公司 实时添加字幕方法、装置、计算机设备和计算机存储介质
CN113781988A (zh) * 2021-07-30 2021-12-10 北京达佳互联信息技术有限公司 字幕显示方法、装置、电子设备及计算机可读存储介质
EP4236328A4 (en) * 2020-11-27 2024-04-24 Beijing Zitiao Network Technology Co Ltd METHOD AND APPARATUS FOR SHARING VIDEO, ELECTRONIC DEVICE AND STORAGE MEDIUM

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107895575A (zh) * 2017-11-10 2018-04-10 广东欧珀移动通信有限公司 屏幕录制方法、屏幕录制装置及电子终端
CN108063722A (zh) * 2017-12-20 2018-05-22 北京时代脉搏信息技术有限公司 视频数据生成方法、计算机可读存储介质和电子设备
CN110300274B (zh) * 2018-03-21 2022-05-10 腾讯科技(深圳)有限公司 视频文件的录制方法、装置及存储介质
CN110853662B (zh) * 2018-08-02 2022-06-24 深圳市优必选科技有限公司 语音交互方法、装置及机器人
CN109660744A (zh) * 2018-10-19 2019-04-19 深圳壹账通智能科技有限公司 基于大数据的智能双录方法、设备、存储介质及装置
CN112752047A (zh) * 2019-10-30 2021-05-04 北京小米移动软件有限公司 视频录制方法、装置、设备及可读存储介质
CN111816183B (zh) * 2020-07-15 2024-05-07 前海人寿保险股份有限公司 基于音视频录制的语音识别方法、装置、设备及存储介质
CN112261489A (zh) * 2020-10-20 2021-01-22 北京字节跳动网络技术有限公司 生成视频的方法、装置、终端和存储介质
TWI792207B (zh) * 2021-03-03 2023-02-11 圓展科技股份有限公司 過濾鏡頭操作雜音的方法及錄影系統
CN113905267B (zh) * 2021-08-27 2023-06-20 北京达佳互联信息技术有限公司 一种字幕编辑方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050018929A (ko) * 2005-02-01 2005-02-28 우종식 하나의 파일로 일반 음악, 반주 음악, 가사 미리불러주기, 코러스 기능, 뮤직비디오 제작이 가능한 음원생성 및 재생 방법과 그 장치
CN101382937A (zh) * 2008-07-01 2009-03-11 深圳先进技术研究院 基于语音识别的多媒体资源处理方法及其在线教学系统
CN103297710A (zh) * 2013-06-19 2013-09-11 江苏华音信息科技有限公司 汉语自动实时标注中外文字幕音像录播设备
CN106409296A (zh) * 2016-09-14 2017-02-15 安徽声讯信息技术有限公司 基于分核处理技术的语音快速转写校正系统
CN106792145A (zh) * 2017-02-22 2017-05-31 杭州当虹科技有限公司 一种音视频自动叠加字幕的方法和装置
CN106851401A (zh) * 2017-03-20 2017-06-13 惠州Tcl移动通信有限公司 一种自动添加字幕的方法及系统



Also Published As

Publication number Publication date
CN107316642A (zh) 2017-11-03

Similar Documents

Publication Publication Date Title
WO2019000721A1 (zh) 视频文件录制方法、音频文件录制方法及移动终端
US10825480B2 (en) Automatic processing of double-system recording
US10848889B2 (en) Intelligent audio rendering for video recording
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
JP2013106298A (ja) 撮像制御装置、撮像制御方法、撮像制御方法のプログラムおよび撮像装置
WO2021244056A1 (zh) 一种数据处理方法、装置和可读介质
JPWO2020222925A5 (zh)
JP2018189924A (ja) 情報処理装置、情報処理方法、およびプログラム
US10607625B2 (en) Estimating a voice signal heard by a user
EP4138381A1 (en) Method and device for video playback
JP5214394B2 (ja) カメラ
JP3838159B2 (ja) 音声認識対話装置およびプログラム
JP2011055386A (ja) 音響信号処理装置及び電子機器
JP2013168878A (ja) 録音機器
CN111696566B (zh) 语音处理方法、装置和介质
WO2013008869A1 (ja) 電子機器及びデータ生成方法
TWI687917B (zh) 語音系統及聲音偵測方法
JP5750668B2 (ja) カメラ、再生装置、および再生方法
JP2014122978A (ja) 撮像装置、音声認識方法、及びプログラム
JP2012105234A (ja) 字幕生成配信システム、字幕生成配信方法およびプログラム
JP2012068419A (ja) カラオケ装置
CN111696564B (zh) 语音处理方法、装置和介质
US20240144948A1 (en) Sound signal processing method and electronic device
CN111696565A (zh) 语音处理方法、装置和介质
WO2022071959A1 (en) Audio-visual hearing aid

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17915415

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20/05/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17915415

Country of ref document: EP

Kind code of ref document: A1