WO2022166801A1 - Data processing method, apparatus, device, and medium - Google Patents

Data processing method, apparatus, device, and medium

Info

Publication number
WO2022166801A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
user
video
target
data
Prior art date
Application number
PCT/CN2022/074513
Other languages
English (en)
French (fr)
Inventor
于广雯
樊华锋
杨涛
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority to JP2023547594A priority Critical patent/JP2024509710A/ja
Priority to KR1020237019353A priority patent/KR20230106170A/ko
Publication of WO2022166801A1 publication Critical patent/WO2022166801A1/zh
Priority to US17/989,620 priority patent/US12041313B2/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H04N 21/4888 Data services, e.g. news ticker for displaying teletext characters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/2222 Prompting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/414 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N 21/41407 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/433 Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N 21/4334 Recording operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8166 Monomedia components thereof involving executable data, e.g. software
    • H04N 21/8173 End-user applications, e.g. Web browser, game
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/64 Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/765 Interface circuits between an apparatus for recording and another apparatus
    • H04N 5/77 Interface circuits between an apparatus for recording and another apparatus between a recording apparatus and a television camera
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/91 Television signal processing therefor
    • H04N 5/915 Television signal processing therefor for field- or frame-skip recording or reproducing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals

Definitions

  • the present application relates to the field of Internet technology, in particular to data processing technology.
  • users can print out the content of the document and place it next to the camera as a reminder.
  • the user may not be able to quickly locate the content of the speech to be delivered, or the positioning may be wrong.
  • the camera will capture the user's actions, which will affect the quality of the final video.
  • Embodiments of the present application provide a data processing method, apparatus, device, and medium, which can improve the effectiveness of a word prompting function in a video recording service, thereby improving the quality of the recorded video.
  • an embodiment of the present application provides a data processing method, and the method is executed by a computer device, including:
  • Collect the user's voice in the video recording service; determine, in the prompt text data associated with the video recording service, the target text that matches the user's voice, and identify the target text;
  • an embodiment of the present application provides a data processing method, and the method is executed by a computer device, including:
  • Collect the user voice corresponding to the target user; perform text conversion on the user voice, and generate the user voice text corresponding to the user voice;
  • in the prompt text data, the same text as the user voice text is determined as the target text, and the target text is identified in the prompting application.
  • an embodiment of the present application provides a data processing apparatus, and the apparatus is deployed on computer equipment, including:
  • the startup module is used to respond to the business startup operation in the video application and start the video recording business in the video application;
  • the display module is used to collect the user's voice in the video recording service, determine the target text that matches the user's voice in the prompt text data associated with the video recording service, and identify the target text;
  • the acquiring module is configured to acquire the target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position in the prompt text data.
  • an embodiment of the present application provides a data processing apparatus, and the apparatus is deployed on computer equipment, including:
  • the prompt text uploading module is used to upload the prompt text data to the prompting application
  • the user voice acquisition module is used to collect the user voice corresponding to the target user, perform text conversion on the user voice, and generate the user voice text corresponding to the user voice;
  • the user voice text display module is used for determining the same text as the user voice text as the target text in the prompt text data, and marking the target text in the prompting application.
  • an embodiment of the present application provides a computer device, including a memory and a processor, the memory is connected to the processor, the memory is used for storing a computer program, and the processor is used for calling the computer program, so that the computer device executes the method provided by any of the above aspects of the embodiments of the present application.
  • One aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided by any of the above aspects of the embodiments of the present application.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the computer device to perform the method provided by any of the above aspects.
  • This embodiment of the present application may respond to a service start operation in a video application, start a video recording service in the video application, collect user voice in the video recording service, determine, in the prompt text data associated with the video recording service, the target text that matches the user voice, and mark the target text, so that the speaking user can quickly and accurately locate the content of the speech according to the mark, which improves the effectiveness of the text prompt function in the video recording service.
  • When the text position of the target text in the prompt text data is the end position in the prompt text data, the target video data corresponding to the video recording service is obtained.
  • In this way, the target text that matches the user's voice can be located and identified in the prompt text data; that is, the target text displayed in the video application matches the content of the user's speech. This improves the effectiveness of the text prompt function in the video recording service and reduces the risk of recording failure caused by users forgetting words, thereby improving the quality of the recorded video.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an interface for inputting prompt text data provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an interface for starting a video recording service in a video application provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an interface for displaying prompt text data provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an interface for displaying speech rate prompt information provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an interface for stopping a video recording service provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an interface for performing editing optimization on a recorded video provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an interface for recommending a tutorial video according to a speech error type provided by an embodiment of the present application.
  • FIG. 11 is a flowchart for realizing a video recording service provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of an application scenario of a teleprompter provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of implementing a data processing apparatus provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10d and a user terminal cluster, and the user terminal cluster may include one or more user terminals, and the number of user terminals is not limited here.
  • the user terminal cluster may specifically include a user terminal 10a, a user terminal 10b, a user terminal 10c, and the like.
  • the server 10d may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms.
  • the user terminal 10a, the user terminal 10b, the user terminal 10c, etc. may include: a smart phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), a wearable device (such as a smart watch, a smart bracelet, etc.), and smart terminals with video/image playback functions such as smart TVs.
  • the user terminal 10a, the user terminal 10b, and the user terminal 10c can be respectively connected to the server 10d via a network, so that each user terminal can exchange data with the server 10d through the network connection.
  • a video application with a video recording function may be installed in the user terminal 10a, wherein the video application may be a video editing application, a short video application, or the like.
  • the user can open the video application installed in the user terminal 10a, and the video application can provide the user with a video recording function, which can include a conventional shooting mode and a prompting shooting mode. The conventional shooting mode may refer to shooting the user directly with the camera built into the user terminal 10a or an external camera device, while the prompting shooting mode may refer to displaying manuscript content for the user on the terminal screen of the user terminal 10a while the user is being shot.
  • the manuscript content can be switched and displayed according to the user's voice progress (e.g. scrolling display), and the manuscript content here can also be referred to as prompt text data in the video recording service.
  • the user terminal 10a can respond to the trigger operation for the prompting shooting portal and display the recording page in the video application.
  • Before recording, the user can enter prompt text data on the recording page, or upload existing prompt text data to the recording page.
  • When the user starts video recording, the user terminal 10a can respond to the user's video recording start operation and start the video recording function in the video application. During the video recording process, the prompt text data can be displayed on the terminal screen of the user terminal 10a according to the progress of the user's voice.
  • When the user's speech rate increases, the switching display speed of the prompt text data in the video application (which can be a scrolling speed) is accelerated; when the user's speech rate decreases, the switching display speed of the prompt text data in the video application is slowed down. That is, the text of the prompt text data displayed in the video application matches the user's voice, which ensures the effectiveness of the text prompt function during the video recording process and helps the user complete the video recording smoothly, thereby improving the quality of the recorded video.
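The rate-adaptive scrolling described above could be sketched as follows. This is a minimal illustration only: the proportional control law, the function name `scroll_speed`, and the baseline rate of 4 characters per second are assumptions for illustration, not details stated in the application.

```python
def scroll_speed(chars_spoken: int, elapsed_seconds: float,
                 base_rate: float = 4.0, base_speed: float = 1.0) -> float:
    """Return a scroll-speed multiplier for the prompt text.

    chars_spoken / elapsed_seconds is the measured speech rate;
    base_rate is the speech rate at which text scrolls at base_speed.
    """
    if elapsed_seconds <= 0:
        return base_speed  # nothing measured yet: keep the default speed
    measured_rate = chars_spoken / elapsed_seconds
    # Speaking faster than base_rate yields a multiplier > 1 (scroll faster);
    # speaking slower yields a multiplier < 1 (scroll slower).
    return base_speed * measured_rate / base_rate
```

With the assumed baseline, a user speaking 8 characters per second gets a multiplier of 2.0, and one speaking 2 characters per second gets 0.5.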
  • FIG. 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application. Taking a video recording scenario as an example, the implementation process of the data processing method provided by the embodiment of the present application is described.
  • the user terminal 20a shown in FIG. 2 may be any user terminal in the user terminal cluster shown in FIG. 1, and the user terminal 20a is installed with a video application, and the video application has a video recording function.
  • User A (user A may refer to the user of the user terminal 20a) can open the video application in the user terminal 20a and enter the homepage of the video application. The user can perform a trigger operation on the shooting entry in the video application, and the user terminal 20a, in response to the trigger operation on the shooting entry, displays a shooting page 20m in the video application; the shooting page 20m may include a shooting area 20b, a filter control 20c, a shooting control 20d, a beauty control 20e, and the like.
  • the shooting area 20b is used to display the video picture captured by the user terminal 20a, and the video picture can be a picture of user A, obtained through a camera built into the user terminal 20a or a camera device having a communication connection with the user terminal 20a.
  • the shooting control 20d can be used to control the starting and stopping of video recording. After entering the shooting page 20m, a trigger operation on the shooting control 20d can indicate that shooting is started, and the shot video can be displayed in the shooting area 20b.
  • the filter control 20c can be used to apply filters to the video picture captured by the user terminal 20a; the beauty control 20e can be used to apply beautification to the portraits in the video picture captured by the user terminal 20a, such as automatically retouching the face shape of the portrait, enlarging the eyes of the portrait, and heightening the nose of the portrait.
  • the shooting page 20m may also include a prompting and shooting portal 20f.
  • User A can select the teleprompter shooting function in the video application; that is, user A can perform a trigger operation on the teleprompter shooting entry 20f in the shooting page 20m, and the user terminal 20a can respond to the trigger operation of user A on the teleprompter shooting entry 20f by switching the shooting page 20m in the video application to the recording page corresponding to the teleprompter shooting entry 20f.
  • the recording page can first display a text input area, and user A can enter the manuscript content required for recording the video in the text input area, which can be used to prompt user A during the video recording process. In short, during the video recording process, user A can record according to the manuscript content displayed in the video application, and the manuscript content at this time can also be called prompt text data 20g.
  • the statistical information 20h of the manuscript content input by user A can also be displayed in the text input area, and the statistical information 20h can include the number of words in the input manuscript content (that is, the number of prompt words; for example, the number of words in the manuscript content is 134) and the estimated video duration (e.g. 35 seconds) corresponding to the input manuscript content.
  • User A can add to or delete from the manuscript content according to the estimated video duration. For example, suppose user A wants to record a 1-minute video. When the estimated video duration corresponding to the manuscript content input by user A in the text input area is 4 minutes, user A can delete part of the manuscript content displayed in the text input area, so that the estimated video duration corresponding to the remaining manuscript content is about 1 minute (for example, the estimated video duration can range from 55 seconds to 65 seconds). When the estimated video duration corresponding to the manuscript content input in the text input area is 35 seconds, user A can add to the manuscript content displayed in the text input area, so that the estimated video duration corresponding to the expanded manuscript content is about 1 minute. The finally determined manuscript content is determined as the prompt text data 20g.
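The statistics 20h could be computed as below. The example figures above (134 characters, roughly 35 seconds) imply a speaking rate of about 3.8 characters per second; that rate, the function name `manuscript_stats`, and the whitespace-skipping rule are illustrative assumptions, not details from the application.

```python
def manuscript_stats(text: str, chars_per_second: float = 3.8) -> dict:
    """Return character count and estimated recording duration (seconds)
    for manuscript text, counting every non-whitespace character."""
    count = sum(1 for ch in text if not ch.isspace())
    return {
        "char_count": count,
        "est_seconds": round(count / chars_per_second),
    }
```

At the assumed rate, a 134-character manuscript yields an estimate of about 35 seconds, matching the example in the scenario.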
  • After user A determines the prompt text data 20g, user A can perform a trigger operation on the "Next" control in the recording page, and the user terminal 20a can respond to the trigger operation on the "Next" control, turn on the camera (or a camera device with a communication connection), and enter the video recording preparation state (that is, the state before the video starts recording). As shown in FIG. 2, the video picture 20i of user A collected by the user terminal 20a can be displayed on the recording page, and the recording page displays the prompt message "adjust the position and put down the mobile phone, and say 'start' to start teleprompter shooting"; that is, user A can adjust his position and the position of the user terminal 20a according to the video picture 20i. After adjusting the position, user A can initiate video recording by voice, e.g. the user can initiate video recording by saying "start".
  • the user terminal 20a may respond to the voice activation operation of the user A, start the video recording in the video application, and display the prompt text data 20g on the recording page.
  • the text displayed on the recording page may only be part of the text in the prompt text data 20g, such as one sentence in the prompt text data 20g, so after starting the video recording, the first sentence in the prompt text data 20g can be displayed first.
  • the user terminal 20a can collect the user voice corresponding to user A, and the client of the video application installed in the user terminal 20a can transmit the user voice to the background server 20j of the video application and send a voice matching instruction to the background server 20j.
  • the background server 20j can convert the user's voice into user voice text.
  • The background server 20j can also convert the user voice text into the first Hanyu Pinyin (when the user voice text is Chinese, the first syllable information can be called the first Hanyu Pinyin). Of course, after user A enters the prompt text data 20g in the text input area, the client of the video application can also transmit the prompt text data 20g to the background server 20j, so the background server 20j can convert the prompt text data 20g into the second Hanyu Pinyin (when the prompt text data is Chinese, the second syllable information can be called the second Hanyu Pinyin).
  • The background server 20j can match the first Hanyu Pinyin against the second Hanyu Pinyin, searching the second Hanyu Pinyin for the same pinyin as the first Hanyu Pinyin; that is, searching for the text position of the first Hanyu Pinyin in the second Hanyu Pinyin. The text corresponding to this text position in the prompt text data 20g is determined as the target text (that is, the text in the prompt text data 20g matched by the user's voice). The background server 20j can transmit the target text to the client of the video application, and the terminal device 20a can identify the target text in the video application (such as increasing the display size of the target text, changing the display color of the target text, enclosing the target text with a circle or a rectangular frame, etc.). Understandably, when user A speaks in the order of the prompt text data, the prompt text data can be scrolled and displayed on the recording page; when user A does not speak in the order of the prompt text data, the sentence in which the target text is located can be identified in the video application.
  • For example, the background server 20j can match the target text corresponding to the user's voice in the prompt text data 20g as: "weekend". On the recording page, the sentence containing the target text "weekend", "On weekends, participate in the consumption class of xx and xx in Changsha", can be marked (by increasing the display size of the text and making the text bold, as shown in the area 20k in FIG. 2).
  • the prompt text data 20g can be displayed directly on the recording page, or can be displayed on a sub-page that is independently displayed on the recording page; this application does not limit the display form of the prompt text data 20g on the recording page.
  • the purpose of matching the user's voice in the prompt text data 20g is to determine the text position of the user's voice in the prompt text data 20g. When converting the user's voice into user voice text, only the consistency between the pronunciation of the text and the user's voice needs to be considered; there is no need to consider the accuracy of the converted user voice text relative to the user's voice, so Hanyu Pinyin can be used for matching, which can improve the matching efficiency between the user voice and the prompt text data.
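The syllable-matching idea above can be illustrated with a toy sketch. The `PINYIN` table is a tiny stand-in for a real Chinese-to-pinyin converter, and the linear-scan search in `match_position` is an assumed strategy, not the application's actual algorithm; the point is only that homophones (e.g. 周 and 粥, both "zhou") still match even when speech recognition picks the wrong character.

```python
# Toy character-to-syllable table; a real system would use a full
# Chinese-to-pinyin converter (this table is an illustrative assumption).
PINYIN = {"周": "zhou", "末": "mo", "粥": "zhou", "去": "qu", "玩": "wan"}

def to_syllables(text: str) -> list[str]:
    """Map each character to its syllable, passing unknown characters through."""
    return [PINYIN.get(ch, ch) for ch in text]

def match_position(voice_text: str, prompt_text: str) -> int:
    """Return the start index of voice_text within prompt_text, comparing
    syllables rather than characters; -1 if no match is found."""
    v, p = to_syllables(voice_text), to_syllables(prompt_text)
    for i in range(len(p) - len(v) + 1):
        if p[i:i + len(v)] == v:
            return i
    return -1
```

Here `match_position("粥末", "周末去玩")` returns 0: the mis-recognized 粥 still matches 周 because only the pronunciation is compared.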
  • the user terminal 20a can collect the user voice spoken by user A in real time, and the background server 20j can determine the target text corresponding to the user voice in the prompt text data 20g in real time, so the prompt text data can be scrolled and displayed according to the progress of the user's voice. For example, when user A speaks the first sentence in the prompt text data 20g, the first sentence in the prompt text data 20g can be identified on the recording page; when user A speaks the second sentence in the prompt text data 20g, the display on the recording page can be switched from the first sentence in the prompt text data 20g to the second sentence, and the second sentence can be marked. The target text marked each time on the recording page is the content user A is currently speaking.
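The sentence-by-sentence switching described above could be driven by mapping a matched character position to a sentence index. Splitting on sentence-ending punctuation and the function name `current_sentence` are illustrative assumptions; the application does not specify how sentences are delimited.

```python
import re

def current_sentence(prompt_text: str, match_pos: int) -> tuple[int, str]:
    """Split prompt_text into sentences on Chinese/Western terminators and
    return (index, sentence) for the sentence containing match_pos."""
    # Lookbehind split keeps each terminator attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", prompt_text) if s]
    start = 0
    for i, s in enumerate(sentences):
        if match_pos < start + len(s):
            return i, s
        start += len(s)
    # Positions past the end fall into the last sentence.
    return len(sentences) - 1, sentences[-1]
```

As the matched position advances past a sentence boundary, the returned index increments, which is when the display would switch to (and mark) the next sentence.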
  • After recording ends, the user terminal 20a may turn off the video recording and determine the video recorded this time as the recorded video. If user A is satisfied with the video recorded this time, the video can be saved; if user A is not satisfied with the video recorded this time, user A can re-shoot. Of course, user A can also perform editing optimization on the recorded video to obtain the final recorded video, that is, to obtain the target video data.
  • prompt text data may be displayed according to the progress of the user's voice, so as to achieve the effect of accurate word prompting for the user, thereby improving the quality of the recorded video.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application. Understandably, the data processing method can be executed by a computer device, and the computer device can be a user terminal, or an independent server, or a cluster composed of multiple servers, or a system composed of a user terminal and a server, or a system composed of a user terminal and a server.
  • the computer device may also be a computer program application (including program code), which is not specifically limited here.
  • the data processing method may include the following S101-S103:
  • a user who needs to perform video recording may be referred to as a target user, and a device used by the target user for video recording may be referred to as a computer device.
  • when the target user performs a service start operation for the video recording service in the video application installed on the computer device, the computer device can respond to the service start operation in the video application and start the video recording service, that is, start video recording in the video application.
  • the service initiation operations may include, but are not limited to, contact trigger operations such as single-click, double-click, long-press, and tapping on the screen, and non-contact trigger operations such as voice, remote control, and gesture.
  • before the computer device starts the video recording service, the target user can also upload the prompt text data required by the video recording service to the video application. The prompt text data can be used to prompt the target user during the video recording service, which can greatly reduce the chance that the target user forgets words during video recording.
  • after the target user opens the video application installed on the computer device, the target user can enter the shooting page in the video application (for example, the shooting page 20m in the embodiment corresponding to FIG. 2 above), and the shooting page of the video application can include a teleprompter shooting entry.
  • the computer device may respond to the trigger operation on the teleprompter shooting entry in the video application, and display a recording page in the video application, where the recording page may include a text input area , the text input area can be used to edit the text content; the computer device can respond to the information editing operation for the text input area, and display the prompt text data determined by the information editing operation in the text input area.
  • the quantity threshold here can be preset according to actual needs, for example, the quantity threshold can be set to 100
  • the number of prompt texts and the estimated video duration corresponding to the number of prompt texts can be displayed in the text input area.
  • the shooting page can be switched to the recording page in the video application, and the target user can edit the text input area of the recording page.
  • the received manuscript content, that is, the above-mentioned prompt text data
  • the number of prompt words input in the text input area can be counted in real time, and when the number of prompt words is greater than the preset number threshold,
  • the number of prompt texts and the estimated video duration corresponding to the currently input prompt text data can be displayed in the text input area.
  • the teleprompter shooting portal can also be displayed on any page of the video application, and the embodiment of the present application does not limit the display position of the teleprompter shooting portal.
  • the estimated video duration can be used as the duration reference information of the finished video recorded in the subsequent video recording service.
  • the target user can add or delete text in the text input area. For example, when the estimated video duration displayed in the text input area is 35 seconds and the target user expects the recorded video to last 2 minutes, the target user can continue to edit the text in the text input area until the estimated video duration displayed there is within the set duration range (for example, between 1 minute 50 seconds and 2 minutes 10 seconds).
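  • As an illustration only (not part of the claimed method), the estimated video duration described above can be derived from the prompt-text character count and an average speaking rate; the 240-characters-per-minute constant below is an assumed value, not one given in this application:

```python
# Hypothetical sketch: estimate video duration from prompt-text length.
# CHARS_PER_MINUTE is an assumed average speaking rate, not from this application.
CHARS_PER_MINUTE = 240

def estimate_duration_seconds(prompt_text: str) -> int:
    """Return the estimated recording duration, in seconds, for the prompt text."""
    char_count = len(prompt_text.strip())
    return round(char_count / CHARS_PER_MINUTE * 60)
```

Under this assumption, a 240-character script yields an estimate of about one minute, and the user can keep editing until the estimate falls in the desired range.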
  • the displayed recording page can also display a text upload control, and the target user can perform a trigger operation on the text upload control in the recording page to upload the edited prompt text data to the recording page. That is, the computer device can respond to the trigger operation on the text upload control, determine the text content uploaded to the recording page as the prompt text data, and display the prompt text data in the text input area of the recording page.
  • the number of prompt texts corresponding to the prompt text data and the estimated video duration corresponding to the prompt text data can also be displayed.
  • the text upload control may include, but is not limited to, a paste text control and a select last text control. When the target user performs a trigger operation on the paste text control, the target user can directly paste pre-edited prompt text data into the text input area without editing text content on the spot. When the target user performs a trigger operation on the select last text control, the target user can reuse the prompt text data of the last video recording service in this one; for example, a target user who is not satisfied with the finished video recorded in the last video recording service can re-record in this video recording service without repeatedly inputting the same prompt text data, thereby improving the input efficiency of the prompt text data.
  • FIG. 4 is a schematic diagram of an interface for inputting prompt text data provided by an embodiment of the present application.
  • the user terminal 30a can respond to the trigger operation on the shooting entry (the user terminal 30a at this time can be the above-mentioned computer device) and display a shooting page 30g in the video application; the shooting page 30g may include a shooting area 30b, a filter control 30c, a shooting control 30d, a beauty control 30e, a teleprompter shooting entry 30f, and the like.
  • for descriptions of the functions of the shooting area 30b, the filter control 30c, the shooting control 30d, and the beauty control 30e in the video application, refer to the embodiment corresponding to FIG. 2 above (e.g., the functional description of the beauty control 20e); these will not be repeated here.
  • the user terminal 30a may respond to the trigger operation on the teleprompter shooting entry 30f in the shooting page 30g and switch from the shooting page 30g to the recording page 30h in the video application; the recording page 30h may include a text input area 30i, which may be used to directly edit text content.
  • after the target user clicks the text input area 30i, a keyboard 30p pops up on the recording page 30h, and the prompt text data required by this video recording service can be edited through the keyboard 30p.
  • the text content determined by the editing operation is displayed in the text input area 30i as hint text data.
  • the user terminal 30a can count the number of prompt characters of the prompt text data input in the text input area 30i in real time.
  • when the number of prompt characters of the prompt text data input in the text input area 30i is greater than the preset number threshold (for example, when the number threshold is set to 100), the number of prompt characters and the estimated filming duration (that is, the estimated video duration) corresponding to the input prompt text data can be displayed.
  • the target user enters the text content in the text input area 30i “On weekends, participate in the consumption class of xx and xx cooperation in Changsha.
  • the user terminal 30a determines through statistics that the number of prompt characters is 32 and the estimated filming duration is 15 seconds, that is, area 30m displays "the current number of characters is 32, and the estimated filming time is 15 seconds";
  • the target user can edit the text content according to the estimated filming duration displayed in area 30m. After the target user finishes editing the text content in the text input area 30i, the text content in the text input area 30i can be determined as the prompt text data, and a trigger operation can then be performed on the "Next" control 30n in the recording page 30h to trigger the user terminal 30a to enter the next operation of the video recording service.
  • the text input area 30i may further include a paste text control 30j and a last text control 30k.
  • when the target user performs a trigger operation on the paste text control 30j, it means that the target user has edited the prompt text data in another application and copied the prompt text data from that application; the user terminal 30a pastes the prompt text data copied by the target user into the text input area 30i in response to the trigger operation on the paste text control 30j.
  • the target user can perform a trigger operation on the last text control 30k; the user terminal 30a, in response to the trigger operation on the last text control 30k, obtains the prompt text data from the last video recording service, displays it in the text input area 30i, and directly uses the prompt text data of the last video recording service as the prompt text data of this video recording service.
  • the target user can adjust the prompt text data used in the last video recording service in the text input area 30i according to the experience of the last video recording service. For example, if sentence 1 of the previous prompt text data contained a logic error, the target user can modify the prompt text data of the previous video recording service in the text input area 30i during this video recording service.
  • the target user uses the paste text control 30j and the last text control 30k to input the prompt text data in the video recording service into the text input area 30i, which can improve the input efficiency of the prompt text data in the video recording service.
  • the target user can perform a voice start operation on the video recording service in the video application after completing the editing operation of the prompt text data, and the computer device can respond to the voice start operation and display, on the recording page in the video application, the recording countdown animation associated with the video recording service.
  • when the recording countdown animation ends, the video recording service in the video application is started and executed, that is, the video recording officially starts.
  • the camera device corresponding to the computer device can be turned on, and the target user can adjust the position of himself and the computer device according to the video screen displayed on the recording page to find the best shooting angle.
  • the animation cancellation control corresponding to the recording countdown animation can also be displayed on the recording page.
  • if the target user is ready to record before the recording countdown animation ends, the animation cancellation control can be triggered to cancel the recording countdown animation; that is, the computer device can respond to the target user's trigger operation on the animation cancellation control, cancel the recording countdown animation on the recording page, and start and execute the video recording service in the video application.
  • the video application will not directly enter the formal recording mode; instead, it plays the recording countdown animation on the recording page, giving the target user a short recording preparation time (that is, the duration of the recording countdown animation, such as 5 seconds). The formal recording mode is entered only after the recording countdown animation finishes playing; alternatively, a target user who is prepared in advance can cancel the playback of the recording countdown animation and directly enter the formal recording mode.
  • FIG. 5 is a schematic diagram of an interface for starting a video recording service in a video application according to an embodiment of the present application.
  • the target user can perform the next-step operation (such as a trigger operation on the "next step" control 30n in the embodiment corresponding to FIG. 4 above) and exit the text input area displayed on the recording page.
  • after the target user edits the prompt text data and performs the next step, the text input area in the recording page 40b can be exited, and the video picture of the target user can be displayed in the area 40c of the recording page 40b.
  • the prompt information 40d ("adjust the position, position the mobile phone and say 'start' to start the teleprompter shooting") can also be displayed on the recording page 40b. That is, before starting the video recording service, the user terminal 40a (which at this time can be called the computer device) can turn on its associated camera device (such as the camera built into the user terminal 40a), collect image data of the target user, render the collected image data into the video picture corresponding to the target user, and display that video picture in the area 40c of the recording page 40b.
  • the target user can adjust the position of himself and the lens according to the video image displayed in the area 40c, so as to find the best shooting angle.
  • after the target user adjusts the position of himself and the camera, that is, after the target user is ready to record the video, the target user can say "start" to start the video recording service in the video application.
  • the user terminal 40a may respond to the voice start operation for the video recording service and display a recording countdown animation in the area 40e of the recording page 40b; the recording countdown animation may last, for example, 5 seconds.
  • the first few sentences of the prompt text data (e.g., the first two sentences of the prompt text data)
  • after the recording countdown animation finishes playing, the user terminal 40a can start and execute the video recording service in the video application. If the target user does not want to wait for the recording countdown animation to finish before starting the video recording service, the target user can trigger the animation cancel control 40f on the recording page 40b to cancel the playback of the recording countdown animation and directly start and execute the video recording service.
  • after the video recording service starts, the target user can start to speak; the user terminal 40a can collect the target user's voice, find the target text matching the user's voice in the prompt text data, and identify the target text in the area 40g of the recording page 40b (for example, by bolding or enlarging the target text). The specific determination process of the target text is described in S102 below.
  • S102 Collect the user's voice in the video recording service, determine a target text matching the user's voice in the prompt text data associated with the video recording service, and identify the target text.
  • the computer device can enable the audio collection function, collect the target user's voice in the video recording service, find the target text matching the user's voice in the prompt text data, and identify, on the recording page, the target text included in the prompt text data.
  • the computer device can collect the target user's voice in the video recording service in real time and, by performing text conversion on the user's voice, determine the text position corresponding to the user's voice in the prompt text data, determine the target text corresponding to the user's voice according to the text position, and identify the target text on the recording page.
  • the identification can include, but is not limited to, changes to the text display color, text font size, or text background; the target text can refer to the text data containing the user's voice text.
  • for example, if the user's voice text is "New Year", the target text at this time can refer to a complete sentence containing "New Year", such as the target text: As the New Year arrives, I wish everyone a prosperous Year of the Ox.
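  • The expansion of a matched fragment such as "New Year" into the complete sentence containing it can be sketched as follows (the sentence delimiters and helper name are illustrative assumptions, not the disclosed implementation):

```python
import re

# Hypothetical helper: expand a matched voice-text fragment to the full sentence
# containing it, since the identified target text is a complete sentence.
def target_sentence(prompt_text, fragment):
    """Return the sentence of prompt_text that contains fragment, or None."""
    # Split after common Chinese/English sentence-ending punctuation.
    for sentence in re.split(r"(?<=[。！？.!?])", prompt_text):
        if sentence and fragment in sentence:
            return sentence
    return None
```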
  • the voice directly collected by the computer device may be referred to as the user's initial voice. That is, the computer device can collect the user's initial voice in the video recording service, perform Voice Activity Detection (VAD) on the user's initial voice to obtain the valid voice data in the user's initial voice, and determine the valid voice data as the user voice. Then the user voice can be converted into user voice text, the user voice text can be text-matched against the prompt text data associated with the video recording service, the target text matching the user voice text can be determined in the prompt text data, and the target text can be identified on the recording page.
  • the user's initial voice collected by the computer device may contain noise from the target user's environment and pauses in the target user's speech. Therefore, voice endpoint detection can be performed on the user's initial voice, the silence and noise can be deleted as interference information, and the valid voice data in the user's initial voice can be retained; the valid voice data at this time may be called the user voice of the target user.
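  • A minimal sketch of the endpoint-detection idea (a toy energy rule assumed for illustration; production VAD uses trained models rather than a fixed RMS threshold):

```python
# Toy energy-based voice activity detection (an assumption for illustration;
# real VAD systems use trained models, not a fixed RMS threshold).
def rms(frame):
    """Root-mean-square energy of one audio frame (a list of samples in [-1, 1])."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def filter_voice_frames(frames, energy_threshold=0.1):
    """Keep only frames whose energy exceeds the threshold (the 'valid voice data')."""
    return [f for f in frames if rms(f) > energy_threshold]
```

Frames below the assumed threshold are treated as the silence and noise that the text above describes deleting as interference information.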
  • the computer device can convert the user's voice into user voice text through a fast speech-to-text model, compare the user voice text with the prompt text data, and find the text position of the user voice text in the prompt text data; it can then determine, according to the text position, the target text corresponding to the user's voice in the prompt text data, and identify the target text on the recording page of the video recording service.
  • the fast speech-to-text model means that, in the process of converting the user's speech into text, there is no need to correct the text against its context, and whether the semantics are correct need not be considered.
  • the computer device can determine the target text corresponding to the user's voice in the prompt text data according to the pronunciation of the user voice text and the pronunciation of the prompt text data. That is, the computer device can obtain the first syllable information corresponding to the user voice text, obtain the second syllable information corresponding to the prompt text data associated with the video recording service, obtain, in the second syllable information, the target syllable information that is the same as the first syllable information, and determine the target text corresponding to the target syllable information in the prompt text data.
  • the syllable information may refer to pinyin information in Chinese, or phonetic symbol information in English, or the like.
  • when the prompt text data is in Chinese, the computer device can convert the user voice text into the first pinyin information, convert the prompt text data into the second pinyin information, find the text position corresponding to the first pinyin information within the second pinyin information, and determine the target text corresponding to the user's voice in the prompt text data according to the text position. When the prompt text data is in another language such as English, the computer device can convert the user voice text into the first phonetic symbol information, convert the prompt text data into the second phonetic symbol information, and then determine the target text corresponding to the user's voice in the prompt text data according to the first phonetic symbol information and the second phonetic symbol information.
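  • The syllable-level matching above can be sketched as follows; the tiny `PINYIN` lookup table is an assumption for illustration (a real system would use a full pinyin dictionary):

```python
# Toy pinyin lookup table -- an assumption for illustration; a real system
# would use a complete syllable dictionary for the target language.
PINYIN = {"新": "xin", "年": "nian", "心": "xin", "好": "hao"}

def to_syllables(text):
    """Map each character to its syllable; unknown characters map to themselves."""
    return [PINYIN.get(ch, ch) for ch in text]

def match_position(voice_text, prompt_text):
    """Start index of voice_text's syllables inside prompt_text's syllables, or -1."""
    v, p = to_syllables(voice_text), to_syllables(prompt_text)
    for i in range(len(p) - len(v) + 1):
        if p[i:i + len(v)] == v:
            return i
    return -1
```

Because the comparison runs on syllables, a recognizer that outputs the homophone 心 instead of 新 still matches the prompt text, which is why only pronunciation consistency, not transcription accuracy, matters at this step.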
  • the area used to display the target text on the recording page can be set according to the terminal screen size of the computer device. As shown in FIG. 5 above, the display width of the area 40g is the same as the screen width of the computer device (e.g., the user terminal 40a), and the display height of the area 40g is smaller than the screen height of the computer device.
  • when the terminal screen of the computer device is large (such as the display screen of a desktop computer), if the area used to display the target text were the same size and width as the terminal screen, the target user's gaze would wander while watching the target text in the video recording service.
  • a text prompt area corresponding to the target text can therefore be determined on the recording page of the video recording service according to the position of the camera device corresponding to the computer device, and the target text can be marked in the text prompt area according to its text position in the prompt text data.
  • in this way, the target user can face the camera head-on while reading the target text.
  • FIG. 6 is a schematic diagram of an interface for displaying prompt text data provided by an embodiment of the present application.
  • after the user terminal 50a (that is, the above-mentioned computer device) determines, in the prompt text data, the target text corresponding to the user's voice "Weekend, participate in the consumption class of xx and xx in Changsha", it can determine a text prompt area 50e for displaying the target text on the recording page 50b of the video recording service, with the text prompt area 50e located in the same orientation as the camera 50d.
  • the target user's video picture may be displayed in the area 50c of the recording page 50b, and the video recording duration (eg, the video recording duration is 00:13 seconds) may be displayed in the area 50f of the recording page 50b.
  • the computer device can collect the target user's initial voice in real time, obtain the voice duration corresponding to the user's initial voice and the number of voice characters it contains, and determine the ratio of the number of voice characters to the voice duration as the user's speech rate.
  • the speech rate threshold can be manually set based on actual needs, for example, the speech rate threshold is 500 words/minute
  • when the user's speech rate is greater than the speech rate threshold, the speech rate prompt information can be displayed on the recording page.
  • the speech rate prompt information may be used to prompt the target user associated with the video recording service to reduce the user's speech rate.
  • the computer device can obtain the user's speech rate of the target user in real time.
  • the user's speech rate is greater than the speech rate threshold, it indicates that the target user's speech rate in the video recording service is too fast, and the target user can be reminded to appropriately slow down the speech rate.
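  • The speech-rate check above can be sketched as follows; the 500 words/minute threshold comes from the example in the text, and equating "words" with voice characters is a simplifying assumption:

```python
# Sketch of the speech-rate check: rate = voice characters / duration, compared
# against a configurable threshold (500 words/minute per the example above).
SPEECH_RATE_THRESHOLD = 500  # words per minute

def speech_rate_wpm(char_count, duration_seconds):
    """User's speech rate: characters spoken per minute."""
    return char_count / duration_seconds * 60

def should_prompt_slowdown(char_count, duration_seconds):
    """True when the measured rate exceeds the threshold, so a prompt should show."""
    return speech_rate_wpm(char_count, duration_seconds) > SPEECH_RATE_THRESHOLD
```

For example, 100 characters spoken in 10 seconds is 600 words/minute, which exceeds the threshold and triggers the slow-down prompt.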
  • FIG. 7 is a schematic diagram of an interface for displaying speech rate prompt information provided by an embodiment of the present application.
  • the user's speech rate of the target user can be determined according to the number of voice characters and the voice duration included in the user's initial voice.
  • the speech rate prompt information 60c may be displayed on the recording page 60b of the video recording service (for example, the speech rate prompt information may be "Your current speaking rate is too fast; to ensure the quality of the recorded video, please slow down your speaking rate").
  • the target user may also be reminded to slow down the speech rate in the form of voice broadcast, and the embodiment of the present application does not limit the display form of the speech rate prompt information.
  • the recording page of the video recording service may further include a cancel recording control and a complete recording control.
  • the computer device can respond to the trigger operation on the cancel recording control, cancel the video recording service, delete the video data recorded by the video recording service, generate recording prompt information for the video recording service, and display the recording prompt information on the recording page, where the recording prompt information may include a re-recording control.
  • the computer device can respond to the trigger operation on the re-recording control, switch from displaying the target text on the recording page back to displaying the prompt text data (that is, display the prompt text data in the text input area of the recording page), and restart the video recording service.
  • the recording prompt information can also include a home page control. When the target user performs a trigger operation on the home page control, the computer device can respond to the trigger operation, switch the recording page to the application home page in the video application, that is, cancel the ongoing video recording service and refrain from re-enabling it for the time being.
  • the computer device may respond to the trigger operation on the complete recording control, stop the video recording service, and determine the video data recorded by the video recording service as the recorded target video data; that is, the video recording service can be stopped even when the prompt text data is not finished, and the video recorded before the video recording service is stopped is called the target video data.
  • FIG. 8 is a schematic diagram of an interface for stopping a video recording service provided by an embodiment of the present application.
  • the user terminal 70a (that is, the above-mentioned computer device) can determine, according to the target user's voice in the video recording service, the target text of the user's voice in the prompt text data of the video recording service and identify the target text on the recording page 70b; that is, the user terminal 70a can scroll the displayed prompt text data according to the progress of the user's voice.
  • a cancel recording control 70c and a complete recording control 70d may also be displayed in the recording page 70b.
  • the user terminal 70a may respond to the trigger operation on the complete recording control 70d, stop the video recording service, and save the video data recorded by the current video recording service; that is, the current video recording service is completed.
  • the user terminal 70a can respond to the trigger operation on the cancel recording control 70c, cancel the video recording service, and delete the video data recorded by this video recording service; the user terminal 70a can then generate recording prompt information 70e for the target user in the video recording service (for example, the recording prompt information can be "The recorded segment will be cleared; re-shoot the segment?"), and display the recording prompt information 70e on the recording page 70b of the video recording service.
  • the recording prompt information 70e may include a "return to the home page" control and a "re-shoot" control. When the target user performs a trigger operation on the "return to the home page" control, the user terminal 70a can exit the video recording service and return from the recording page 70b to the application home page of the video application; that is, the target user gives up re-shooting. When the target user performs a trigger operation on the "re-shoot" control, the user terminal 70a can exit the video recording service, return from the recording page 70b to the text input area, and display the prompt text data in the text input area; that is, the target user chooses to re-record the video.
  • the computer device can automatically end the video recording service without any operation by the target user, save the video data recorded in the video recording service, and determine the target video data from the video data recorded in the video recording service.
  • the computer device may determine the video data saved when the video recording service is stopped as the original video data, enter the editing page of the video application, and display the original video data and the editing optimization controls corresponding to the original video data in the editing page of the video application.
  • the target user can perform a trigger operation on the editing optimization control displayed in the editing page, and the computer device can respond to the trigger operation on the editing optimization control and display M editing optimization methods for the original video data, where M is a positive integer, that is, M can take a value of 1, 2, ....
  • the editing optimization method of pauses between sentences may be called the second editing method
  • the computer device can respond to the selection operation on the M editing optimization methods and, according to the editing optimization method determined by the selection operation, perform editing optimization processing on the original video data to obtain the target video data corresponding to the video recording service. It can be understood that the display area and display size of the original video data and the target video data in the editing page can be adjusted according to actual requirements.
  • for example, the display area of the original video data may be located at the top of the editing page, at the bottom of the editing page, in the middle area of the editing page, etc.; the display size of the original video data (or the target video data) can be a 16:9 aspect ratio, etc.
  • the computer device can obtain the target voice data contained in the original video data, convert the target voice data into a target text result, compare the target text result with the prompt text data, and determine text in the target text result that differs from the prompt text data as wrong text; the voice data corresponding to the wrong text is deleted from the original video data to obtain the target video data corresponding to the video recording service.
  • the computer device can use an accurate speech-to-text model to perform text conversion on the target voice data contained in the original video data. The accurate speech-to-text model can learn the semantic information in the target voice data: it considers not only the consistency between the converted text's pronunciation and the user's speech, but also the semantic information within the user's speech, and corrects the converted text through contextual semantic information.
  • the computer device can perform voice endpoint detection on the target voice data contained in the original video data, remove noise and silence from the original video data, obtain the valid voice data in the original video data, and convert that valid voice data into text to obtain the target text result corresponding to the target voice data. The text contained in the target text result is then compared, item by item, with the text contained in the prompt text data, and the text that differs between the target text result and the prompt text data is determined to be wrong text; the wrong text here may be caused by the target user making a slip of the tongue during the recording of the video recording service.
  • after the computer device deletes the voice data corresponding to the wrong text from the original video data, the final target video data can be obtained.
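  • The wrong-text comparison can be sketched as follows; `difflib` stands in for whatever alignment the actual system uses and is an assumption, not the disclosed algorithm:

```python
import difflib

# Sketch of the wrong-text step: align the accurate transcript against the
# prompt text and collect material present only in the transcript (slips of
# the tongue). difflib is an assumed stand-in for the real alignment.
def find_error_text(transcript, prompt):
    """Return substrings of transcript that deviate from prompt."""
    sm = difflib.SequenceMatcher(a=prompt, b=transcript)
    errors = []
    for tag, _, _, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "insert"):  # transcript-only or altered text
            errors.append(transcript[j1:j2])
    return errors
```

The character spans returned here would then be mapped back to their audio timestamps so the corresponding voice data can be cut from the original video.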
  • the computer equipment can convert the target voice data contained in the original video data into the target text result.
  • the text in the target text result that differs from the prompt text data is determined as the wrong text; the target text result can then be divided into N text characters, and the timestamps of the N text characters in the target voice data can be obtained, where N is a positive integer, such as 1, 2, .... The computer device can determine the speech pause segments in the target voice data according to the timestamps, and delete the speech pause segments and the voice data corresponding to the wrong text from the original video data to obtain the target video data corresponding to the video recording service.
  • The process of obtaining the speech pause segments may include: the computer device performs word segmentation on the target text result corresponding to the target speech data to obtain N text characters, and obtains the timestamp of each text character in the target speech data (that is, its timestamp in the original video data). According to the timestamps of every two adjacent text characters among the N text characters, the time interval between each pair of adjacent text characters is obtained. When a time interval is greater than the duration threshold (for example, the duration threshold can be set to 1.5 seconds), the speech segment between the two adjacent text characters can be determined as a speech pause segment; the number of speech pause segments can be one, multiple, or zero (that is, there is no speech pause segment).
  • For example, the N text characters can be represented, according to their order in the target text result, as: text character 1, text character 2, text character 3, text character 4, text character 5, and text character 6. The timestamp of text character 1 in the original video data is t1, the timestamp of text character 2 is t2, the timestamp of text character 3 is t3, the timestamp of text character 4 is t4, the timestamp of text character 5 is t5, and the timestamp of text character 6 is t6.
  • If the computer device calculates that the time interval between text character 2 and text character 3 is greater than the duration threshold, the speech segment between text character 2 and text character 3 can be determined as speech pause segment 1; likewise, if the time interval between text character 5 and text character 6 is greater than the duration threshold, the speech segment between them is determined as speech pause segment 2.
  • The final target video data can be obtained by deleting, from the original video data, the speech corresponding to the wrong text and the video segments corresponding to speech pause segment 1 and speech pause segment 2.
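The pause-detection rule described above (adjacent-character time interval greater than a duration threshold such as 1.5 seconds) can be sketched as follows; the function and variable names are illustrative assumptions:

```python
def find_pause_segments(timestamps, duration_threshold=1.5):
    """Given the timestamp (in seconds) of each text character in the
    original video, return the (start, end) spans where the gap between
    two adjacent characters exceeds the duration threshold."""
    pauses = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > duration_threshold:
            pauses.append((prev, curr))
    return pauses
```

With the six-character example above, timestamps such as `[0.0, 0.5, 2.5, 3.0, 3.4, 6.0]` would yield two pause segments, one between characters 2 and 3 and one between characters 5 and 6.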
  • FIG. 9 is a schematic diagram of an interface for editing and optimizing a recorded video provided by an embodiment of the present application.
  • the editing page 80b of the video application can be entered, and the video data 80c (such as the above-mentioned original video data) recorded in the video recording service can be previewed and played in the editing page 80b.
  • The video data 80c can be displayed in the editing page 80b at a ratio of 16:9, and the time axis 80d corresponding to the video data 80c can be displayed in the editing page 80b. The time axis 80d can include the video nodes in the video data 80c, and the target user can quickly locate a playback point in the video data 80c through the video nodes in the time axis 80d.
  • The clip optimization control 80e (also referred to as a clip optimization option button) may also be displayed in the editing page 80b. When the user terminal 80a (i.e., the computer device) detects a trigger operation on the clip optimization control 80e, it can respond to the trigger operation, and the selection page 80f pops up in the editing page 80b (in the embodiment of the present application, the selection page may refer to a certain area in the editing page, a sub-page displayed independently in the editing page, a floating page in the editing page, or a page covering the editing page; the display form of the selection page is not limited here).
  • In the selection page 80f, different editing optimization methods for the video data 80c can be displayed, together with the video duration corresponding to each editing optimization method. As shown in FIG. 9, if the target user selects "Remove the slip-of-the-tongue part" (i.e., the above-mentioned first editing mode), the video duration of the optimized video data 80c is 57 seconds (the original video duration of the video data 80c is 60 seconds); if the target user selects "Remove the slip-of-the-tongue part and the pauses" (i.e., the above-mentioned second editing mode), the video duration of the optimized video data 80c is 50 seconds; the target user may also choose, in the selection page 80f, not to do any processing, that is, to keep the video data 80c as it is.
  • If the target user selects the first editing mode, the user terminal 80a can perform text conversion on the target voice data in the video data 80c to obtain the target text result corresponding to the target voice data, perform text matching between the target text result and the prompt text data to determine the erroneous text, and delete the voice data corresponding to the erroneous text from the video data 80c to obtain the target video data.
  • If the target user selects the second editing mode, the user terminal 80a deletes both the voice data corresponding to the erroneous text and the voice pause segments in the video data 80c, and then obtains the target video data; the target video data here refers to the video data from which the slip-of-the-tongue parts and the pauses between sentences have been deleted.
  • the target user can save the target video data, or upload the target video data to the information release platform, so that all user terminals in the information release platform can watch the target video data.
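Deleting the erroneous-speech spans and pause spans from the original video amounts to keeping the complementary time intervals. A minimal sketch under that reading (function and variable names are assumptions, not the application's implementation):

```python
def keep_segments(video_duration, delete_spans):
    """Return the (start, end) spans of the original video that remain
    after deleting the given spans (wrong-text speech and pauses).
    Deletion spans may be unsorted or overlapping."""
    # Merge overlapping deletion spans first.
    merged = []
    for start, end in sorted(delete_spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    kept, cursor = [], 0.0
    for start, end in merged:
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < video_duration:
        kept.append((cursor, video_duration))
    return kept
```

For the 60-second example, deleting a 3-second slip and a 7-second pause leaves 50 seconds of kept video, matching the duration shown for the second editing mode.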
  • The above error text may include K error sub-texts, where K is a positive integer (for example, K can be 1, 2, ...). The computer device may determine the error frequency in the video recording service according to the K error sub-texts and the video duration corresponding to the original video data; when the error frequency is greater than the error threshold (for example, the error threshold can be set to 2 errors per minute), the speech error types corresponding to the K error sub-texts are identified, and corresponding tutorial videos can then be recommended in the video application.
  • the computer device can recommend a corresponding tutorial video for the target user in the video application according to the speech error type corresponding to the error text, where the speech error type includes but is not limited to: non-standard Mandarin, wrong pronunciation, and unclear words.
  • For example, if the target user makes three errors within one minute of video, the error frequency is greater than the error threshold, and the computer device can determine the speech error types of the wrong sub-texts corresponding to the three errors.
  • If the speech error type is the non-standard Mandarin type, the computer device can push a Mandarin tutorial video for the target user in the video application; if the speech error type is the pronunciation error type, the computer device can push a language tutorial video for the target user in the video application; if the speech error type is the slurred speech type, the computer device can push a dubbing tutorial video for the target user in the video application.
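The error-frequency check and the mapping from speech error type to tutorial category can be sketched as follows; the type and category names are illustrative assumptions, not identifiers from the application:

```python
def recommend_tutorial(error_count, video_duration_seconds,
                       error_type, threshold_per_minute=2.0):
    """If the error frequency (errors per minute) exceeds the threshold,
    return the tutorial category matching the speech error type;
    otherwise return None (no recommendation)."""
    frequency = error_count / (video_duration_seconds / 60.0)
    if frequency <= threshold_per_minute:
        return None
    # Mapping described in the text: non-standard Mandarin -> Mandarin
    # tutorial, pronunciation error -> language tutorial,
    # slurred speech -> dubbing tutorial.
    return {
        "nonstandard_mandarin": "mandarin_tutorial",
        "pronunciation_error": "language_tutorial",
        "slurred_speech": "dubbing_tutorial",
    }.get(error_type)
```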
  • FIG. 10 is a schematic diagram of an interface for recommending a tutorial video according to a speech error type provided by an embodiment of the present application.
  • If the target user selects the "remove the slip-of-the-tongue part" editing optimization mode, the original video data recorded in the video recording service is edited and optimized to obtain the target video data 90c (that is, with the slip-of-the-tongue parts removed).
  • The user terminal 90a (that is, the above-mentioned computer device) can display the target video data 90c in the editing page 90b, where the time axis 90d can also be displayed; the time axis 90d can include the video nodes associated with the target video data 90c, playback can be positioned at a specific time point in the target video data 90c by triggering a video node in the time axis 90d, and the target user can preview and play the target video data 90c on the editing page 90b.
  • the user terminal 90a may push a tutorial video matching the speech error type to the target user in the video application according to the speech error type corresponding to the error text in the editing optimization process.
  • If the speech error type corresponding to the error text is the non-standard Mandarin type (that is, the reason for the slip of the tongue is that the Mandarin is not standard), the user terminal 90a can obtain tutorial videos related to Mandarin video teaching in the video application (that is, Mandarin tutorial videos), and display the pushed Mandarin tutorial videos in the area 90e of the editing page 90b.
  • FIG. 11 is a flowchart for implementing a video recording service provided by an embodiment of the present application.
  • The implementation process of the video recording service is described by taking the client and the background server of the video application as an example; the client and the background server together can be called the computer device. The implementation process of the video recording service can be realized through the following S11-S25.
  • S11, input prompt text data: the target user can open the client of the video application, enter the shooting page of the client, and enter the recording page from the teleprompter shooting entrance of the shooting page; the recording page includes a text input area, and the target user can enter prompt text data in the text input area.
  • After the target user finishes editing the prompt text data, he can execute S12 and say "start" aloud; that is, "start" can be used as the wake-up word. The client can respond to the user's voice start operation and perform S13 to start the video recording service, that is, to enter the recording mode.
  • After entering the recording mode, the target user can read the text on the screen aloud (the screen is the screen of the terminal device on which the client is installed, and the text on the screen at this time can be part of the text content in the prompt text data; for example, the text displayed in the recording mode can be the first two sentences of the prompt text data).
  • S14: the client can collect the target user's initial voice, transmit the user's initial voice to the background server of the video application, and send a text conversion instruction to the background server.
  • After receiving the text conversion instruction, the background server can perform S15: detect the user's initial voice through the voice endpoint detection (VAD) technology, delete the noise and silence in the user's initial voice, and obtain the user voice corresponding to the target user (that is, the valid voice data).
  • S15 may be performed by the client through a local voice endpoint detection module, or may be performed by the background server using the VAD technology.
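The voice endpoint detection step can be illustrated with a deliberately crude energy-based sketch; real VAD systems, including the one referred to above, use far more robust statistical or neural detectors, and the frame length and threshold here are arbitrary assumptions:

```python
import numpy as np

def simple_vad(samples: np.ndarray, frame_len: int = 400,
               energy_threshold: float = 1e-3) -> np.ndarray:
    """Crude energy-based voice activity detection: split the signal into
    fixed-length frames and keep only frames whose mean energy exceeds a
    threshold, discarding silence and low-level noise."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)   # mean energy per frame
    voiced = frames[energies > energy_threshold]
    return voiced.reshape(-1)               # concatenated voiced samples
```

At a 16 kHz sampling rate, `frame_len=400` corresponds to 25 ms frames, a common analysis window in speech processing.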
  • S16: the background server can use a fast text conversion model to perform text conversion on the user's voice, converting the user's voice into text (that is, the user voice text), and continue to perform S17, converting the user voice text into pinyin (in the embodiment of the present application, the prompt text data is Chinese by default).
  • S18: the background server can obtain the prompt text data input by the target user, convert the prompt text data into pinyin, and match the pinyin of the user voice text against the pinyin of the prompt text data; it then continues to execute S19, finds the text position in the prompt text data that matches the user's voice, and transmits the text position of the user's voice in the prompt text data to the client.
  • S20: after receiving the text position transmitted by the background server, the client can determine the target text corresponding to the user's voice according to the text position and identify the target text on the recording page of the client; that is, the prompt text data can be scrolled and displayed according to the text position.
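The matching of S17-S19 can be sketched over abstract syllable sequences; a real system would first convert both texts to pinyin (for instance with the third-party pypinyin package), which is outside this illustrative sketch, and the preference for the latest occurrence is an assumed heuristic for following the reader's progress:

```python
def locate_in_prompt(user_syllables, prompt_syllables):
    """Find where the user's speech (as a syllable sequence, e.g. pinyin)
    occurs in the prompt text's syllable sequence. Returns the start index
    of the last match, or -1 if not found. Matching on syllables rather
    than characters tolerates homophone recognition errors."""
    n, m = len(prompt_syllables), len(user_syllables)
    for start in range(n - m, -1, -1):  # prefer the latest occurrence
        if prompt_syllables[start:start + m] == user_syllables:
            return start
    return -1
```

The returned index plays the role of the "text position" that the background server transmits to the client for scrolling the prompt text.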
  • The client can execute S21 to end the video recording service: the target user can trigger the complete recording control on the recording page, or trigger the cancel recording control on the recording page, to end the video recording service.
  • After the video recording service ends, the client can transmit the recorded video corresponding to the video recording service (that is, the above-mentioned original video data) to the background server and send a text conversion instruction to the background server.
  • After receiving the text conversion instruction, the background server can execute S22: use an accurate text conversion model to perform text conversion on the voice data contained in the recorded video, converting that voice data into text (that is, the target text result), and obtain the time at which each piece of text appears in the recorded video, which can also be called the timestamp of the text in the recorded video. The background server can then execute S23 and S24 in parallel.
  • S23: the background server can compare the target text result with the prompt text data and find the slip-of-the-tongue parts in the recorded video (that is, the voice data corresponding to the above-mentioned error text). S24: the background server can use the time at which the text appears in the recorded video (i.e., the timestamps) to find the pauses in the user's speech contained in the recorded video.
  • The background server can transmit the slip-of-the-tongue parts and the pause parts in the recorded video to the client.
  • After the client receives the slip-of-the-tongue parts and the pause parts transmitted by the background server, it can execute S25: according to the slip-of-the-tongue parts and the pause parts, the client can provide the target user with different editing optimization modes; the target user selects an appropriate editing optimization mode among them, and the client can edit and optimize the recorded video based on the editing optimization mode selected by the target user to obtain the final target video data.
  • In the embodiment of the present application, the video recording service can be started by voice, and the user can be provided with a teleprompter function during the recording process of the video recording service. The target text matching the user's speech is identified in the video application, that is, the target text displayed in the video application matches the content of the user's speech, which can improve the effectiveness of the text prompt function in the video recording service and reduce the risk of recording failure caused by the user forgetting words, thereby improving the quality of the recorded video. Starting or stopping the video recording service through the user's voice can reduce user operations in the video recording service and improve the video recording experience. After the video recording service ends, the recorded video can be automatically edited and optimized, which can further improve the quality of the recorded video.
  • FIG. 12 is a schematic flowchart of a data processing method provided by an embodiment of the present application. Understandably, the data processing method can be executed by a computer device, and the computer device can be a user terminal, an independent server, a cluster composed of multiple servers, a system composed of a user terminal and a server, or a computer program application (including program code), which is not specifically limited here.
  • the data processing method may include the following S201-S203:
  • S201: the target user can input prompt text data in the teleprompter application, or upload edited prompt text data to the teleprompter application. The computer device can respond to the target user's text input operation or text upload operation and display the prompt text data in the teleprompter application; that is, when using the prompting function provided by the teleprompter application, the prompt text data needs to be uploaded to the teleprompter application first.
  • the computer device in the embodiment of the present application may refer to a device installed with a teleprompter application, and may also be referred to as a teleprompter.
  • S202 Collect user voice corresponding to the target user, perform text conversion on the user voice, and generate user voice text corresponding to the user voice.
  • The computer device can collect the target user's initial voice, perform voice endpoint detection on the user's initial voice, delete the noise and silence contained in the user's initial voice, obtain the user voice corresponding to the target user (that is, the valid voice data in the user's initial voice), perform text conversion on the user voice, and generate the user voice text corresponding to the user voice.
  • S203: the computer device can convert the user voice text into first syllable information, convert the prompt text data into second syllable information, compare the first syllable information with the second syllable information, and determine the text position of the user voice text in the prompt text data; according to the text position, the target text matching the user's voice can be determined in the prompt text data, and the target text can be identified in the teleprompter application.
  • For the specific implementation of S202 and S203, reference may be made to S102 in the above-mentioned embodiment corresponding to FIG. 3, which will not be repeated here.
  • The number of target users can be one or more, and different target users can correspond to different prompt text data. When the number of target users is one, the determination and display process of the target text in the teleprompter application can refer to S102 in the embodiment corresponding to FIG. 3 above; when the number of target users is multiple, after the computer device collects the user voice, voiceprint recognition can be performed on the user voice, the user identity corresponding to the collected user voice can be determined according to the voiceprint recognition result, the target text corresponding to the user voice can be determined in the prompt text data corresponding to that user identity, and the target text can be identified in the teleprompter application.
  • Voiceprint recognition may refer to extracting voiceprint features (for example, spectrum, cepstrum, formants, fundamental tone, reflection coefficients, etc.) from the user's voice data; by identifying the voiceprint features, the user identity corresponding to the user's voice can be determined, so voiceprint recognition can also be called speaker recognition.
  • The following description takes the number of target users as 2, that is, the target users include a first user and a second user, and the prompt text data at this time includes a first prompt text corresponding to the first user and a second prompt text corresponding to the second user. The computer device can obtain the user voiceprint feature in the user voice and determine the user identity corresponding to the user voice according to the user voiceprint feature. If the user identity is the first user, the text in the first prompt text that is the same as the user voice text is determined as the target text, and the target text is identified in the teleprompter application; if the user identity is the second user, the text in the second prompt text that is the same as the user voice text is determined as the target text, and the target text is identified in the teleprompter application.
  • In a multi-user scenario, the user identity corresponding to the user voice needs to be determined first, and then the target text matching the user voice can be determined in the prompt text data corresponding to that user identity and identified, which can improve the effectiveness of the teleprompter function in the teleprompter application.
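The multi-user routing step can be sketched as matching a voiceprint embedding against enrolled users and then selecting that user's prompt text; the embedding extractor itself (spectrum/cepstrum features or a neural encoder, as mentioned above) is outside this sketch, and all names here are assumptions:

```python
import numpy as np

def identify_speaker(voice_embedding, enrolled):
    """Match a voiceprint embedding against enrolled users by cosine
    similarity and return the best-matching user id."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(enrolled, key=lambda uid: cosine(voice_embedding, enrolled[uid]))

def prompt_for_voice(voice_embedding, enrolled, prompts):
    """Pick the prompt text belonging to whichever user is speaking."""
    return prompts[identify_speaker(voice_embedding, enrolled)]
```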
  • FIG. 13 is a schematic diagram of an application scenario of a teleprompter provided by an embodiment of the present application. Taking the teleprompter scene of a party as an example, the data processing process is described below: the lines 90a of the hosts in the party can be edited in advance, and the lines 90a can be uploaded to the teleprompter (understandably, the device where the above-mentioned teleprompter application is located, which can provide the hosts with a line prompting function). The lines 90a can include the lines of host A and host B; after the teleprompter receives the lines 90a, it can store the lines 90a locally.
  • the teleprompter can collect the voice data of all hosts in real time.
  • The teleprompter can perform voiceprint recognition on the collected user voice and determine the corresponding user identity according to the voiceprint recognition result. When the user identity of the collected user voice is Xiao A, the teleprompter can search the lines of host Xiao A for the target text matching the collected user voice (such as "With the warm blessings of winter, full of joyful mood") and mark "With the warm blessings of winter, full of joyful mood" in the teleprompter; when the user identity of the collected user voice is Xiao B, the teleprompter can search the lines of host Xiao B for the target text matching the collected user voice (such as "In the past year, we paid sweat") and mark "In the past year, we paid sweat" in the teleprompter.
  • The teleprompter can identify the sentence that the target user is reading aloud, automatically follow the target user's reading progress, and scroll and display the prompt text data in the teleprompter, which can improve the effectiveness of the text prompt function in the teleprompter.
  • FIG. 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • the data processing apparatus can perform the steps in the above-mentioned embodiment corresponding to FIG. 3 .
  • the data processing apparatus 1 may include: a startup module 101 , a display module 102 , and an acquisition module 103 ;
  • a startup module 101 configured to start a video recording service in the video application in response to a service start operation in the video application;
  • the display module 102 is used to collect the user's voice in the video recording service, determine the target text that matches the user's voice in the prompt text data associated with the video recording service, and identify the target text;
  • the obtaining module 103 is configured to obtain target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data.
  • the specific function implementation manner of the startup module 101 , the display module 102 , and the acquisition module 103 may refer to S101 - S103 in the embodiment corresponding to FIG. 3 , and details will not be repeated here.
  • the data processing apparatus 1 may further include: a first recording page display module 104 , an editing module 105 , a first estimated duration display module 106 , a second recording page display module 107 , and a text upload module 108 , the second estimated duration display module 109;
  • the first recording page display module 104 is used to display the recording page in the video application in response to the triggering operation for the teleprompter shooting portal in the video application before starting the video recording service in the video application; the recording page includes a text input area;
  • the editing module 105 is configured to display the prompt text data determined by the information editing operation in the text input area in response to the information editing operation for the text input area;
  • the first estimated duration display module 106 is configured to display the number of prompt texts and the estimated video duration corresponding to the prompt text data in the text input area when the number of prompt texts corresponding to the prompt text data is greater than the quantity threshold.
  • the second recording page display module 107 is configured to display the recording page in the video application in response to a trigger operation for the prompting entry in the video application before starting the video recording service in the video application;
  • the recording page includes text uploading controls and text input areas;
  • the text uploading module 108 is configured to respond to the trigger operation for the text uploading control, determine the text content uploaded to the recording page as prompt text data, and display the prompt text data in the text input area;
  • the second estimated duration display module 109 is configured to display the number of prompt texts corresponding to the prompt text data and the estimated video duration corresponding to the prompt text data.
  • For the specific function implementation of the first recording page display module 104, the editing module 105, the first estimated duration display module 106, the second recording page display module 107, the text upload module 108, and the second estimated duration display module 109, refer to S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • When the first recording page display module 104, the editing module 105, and the first estimated duration display module 106 are performing corresponding operations, the second recording page display module 107, the text upload module 108, and the second estimated duration display module 109 suspend the execution of operations; when the second recording page display module 107, the text upload module 108, and the second estimated duration display module 109 are performing corresponding operations, the first recording page display module 104, the editing module 105, and the first estimated duration display module 106 suspend the execution of operations.
  • The first recording page display module 104 and the second recording page display module 107 may be combined into the same recording page display module; the first estimated duration display module 106 and the second estimated duration display module 109 may be combined into the same estimated duration display module.
  • the service activation operation includes a voice activation operation
  • the startup module 101 may include: a countdown animation display unit 1011, and a recording service startup unit 1012;
  • the countdown animation display unit 1011 is used to display the recording countdown animation associated with the video recording service in the recording page of the video application in response to the voice activation operation in the video application;
  • the recording service starting unit 1012 is configured to start and execute the video recording service in the video application when the recording countdown animation ends.
  • the specific function implementation manner of the countdown animation display unit 1011 and the recording service initiation unit 1012 may refer to S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • recording a countdown animation includes an animation cancellation control
  • the data processing apparatus 1 may further include: a countdown animation cancellation module 110;
  • the countdown animation cancellation module 110 is configured to, when the recording countdown animation ends, before starting and executing the video recording service in the video application, in response to a triggering operation for the animation cancellation control, cancel the display of the recording countdown animation, and start and execute the countdown animation. Execute the video recording service in the video application.
  • the specific function implementation manner of the countdown animation canceling module 110 may refer to S101 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the display module 102 may include: a voice endpoint detection unit 1021, a target text determination unit 1022, and a target text display unit 1023;
  • The voice endpoint detection unit 1021 is used to collect the user's initial voice in the video recording service, perform voice endpoint detection on the user's initial voice to obtain the valid voice data in the user's initial voice, and determine the valid voice data as the user voice;
  • the target text determining unit 1022 is used to convert the user's voice into the user's voice and text, perform text matching on the prompt text data associated with the user's voice text and the video recording service, and determine the target text that matches the user's voice text in the prompt text data;
  • the target text display unit 1023 is configured to identify the target text in the recording page of the video recording service.
  • the specific function implementation of the voice endpoint detection unit 1021 , the target text determination unit 1022 , and the target text display unit 1023 may refer to S102 in the embodiment corresponding to FIG. 3 , and will not be repeated here.
  • the target text determination unit 1022 may include: a syllable information acquisition subunit 10221, and a syllable matching subunit 10222;
  • the syllable information obtaining subunit 10221 is used to obtain the first syllable information of the user's voice text, and obtain the second syllable information of the prompt text data associated with the video recording service;
  • the syllable matching subunit 10222 is configured to obtain the same target syllable information as the first syllable information in the second syllable information, and determine the target text corresponding to the target syllable information in the prompt text data.
  • the specific function implementation manner of the syllable information acquisition subunit 10221 and the syllable matching subunit 10222 may refer to S102 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the target text display unit 1023 may include: a prompt area determination subunit 10231, and an identification subunit 10232;
  • The prompt area determination subunit 10231 is used to determine the text prompt area corresponding to the target text in the recording page of the video recording service;
  • the identification subunit 10232 is configured to identify the target text in the text prompt area according to the text position of the target text in the prompt text data.
  • the specific function implementation manner of the prompt area determination subunit 10231 and the identification subunit 10232 may refer to S102 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the recording page includes a cancel recording control
  • the data processing device 1 may further include: a recording cancellation module 111, a recording prompt information display module 112, and a re-recording module 113;
  • the recording cancellation module 111 is used to cancel the video recording service and delete the video data recorded by the video recording service in response to the triggering operation for the cancellation recording control;
  • the recording prompt information display module 112 is used to generate the recording prompt information for the video recording service, and display the recording prompt information on the recording page; the recording prompt information includes a re-recording control;
  • the re-recording module 113 is configured to switch and display the target text displayed on the recording page as prompt text data in response to the triggering operation for the re-recording control.
  • the specific function implementation of the recording cancellation module 111 , the recording prompt information display module 112 , and the re-recording module 113 can be referred to S102 in the embodiment corresponding to FIG. 3 , which will not be repeated here.
  • the recording page includes a complete recording control
  • the data processing apparatus 1 may include: a recording completion module 114;
  • the recording completion module 114 is configured to, when the text position of the target text in the prompt text data is the end position of the prompt text data and before the target video data corresponding to the video recording service is acquired, stop the video recording service in response to a trigger operation on the recording completion control, and determine the video data recorded by the video recording service as the target video data.
  • the acquisition module 103 may include: an original video acquisition unit 1031, an optimization control display unit 1032, an optimization mode display unit 1033, and an optimization processing unit 1034;
  • the original video acquisition unit 1031 is configured to stop the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data, and determine the video data recorded by the video recording service as the original video data;
  • the optimization control display unit 1032 is configured to display the original video data in the editing page of the video application, together with the editing optimization control corresponding to the original video data;
  • the optimization mode display unit 1033 is used to respond to the trigger operation for the editing optimization control, and display M editing optimization modes for the original video data; M is a positive integer;
  • the optimization processing unit 1034 is configured to, in response to the selection operation for the M editing optimization modes, perform editing optimization processing on the original video data according to the editing optimization mode determined by the selection operation to obtain target video data.
  • the specific function implementation manner of the original video acquisition unit 1031, the optimization control display unit 1032, the optimization mode display unit 1033, and the optimization processing unit 1034 may refer to S103 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the optimization processing unit 1034 may include: a first speech conversion subunit 10341, a text comparison subunit 10342, a speech deletion subunit 10343, a second speech conversion subunit 10344, and a timestamp acquisition subunit 10345 , the speech pause segment determination subunit 10346;
  • the first voice conversion subunit 10341 is used to obtain the target voice data contained in the original video data if the clipping optimization mode determined by the selection operation is the first clipping mode, and convert the target voice data into the target text result;
  • the text comparison subunit 10342 is used to perform text comparison between the target text result and the prompt text data, and determine the text that is different from the prompt text data in the target text result as the wrong text;
  • the voice deletion subunit 10343 is used to delete the voice data corresponding to the wrong text in the original video data to obtain the target video data.
  • the second voice conversion subunit 10344 is configured to, if the clipping optimization mode determined by the selection operation is the second clipping mode, convert the target voice data contained in the original video data into a target text result, and determine the text in the target text result that differs from the prompt text data as the error text;
  • the timestamp obtaining subunit 10345 is used to divide the target text result into N text characters, and obtain the timestamps of the N text characters in the target speech data respectively; N is a positive integer;
  • the speech pause segment determination subunit 10346 is configured to determine the speech pause segment in the target speech data according to the timestamp, delete the speech pause segment and the speech data corresponding to the wrong text in the original video data, and obtain the target video data.
  • the specific function implementation manner of the first speech conversion subunit 10341, the text comparison subunit 10342, the speech deletion subunit 10343, the second speech conversion subunit 10344, the timestamp acquisition subunit 10345, and the speech pause segment determination subunit 10346 may refer to S103 in the embodiment corresponding to FIG. 3 above, and details are not repeated here.
  • when the first speech conversion subunit 10341, the text comparison subunit 10342, and the speech deletion subunit 10343 perform their corresponding operations, the second speech conversion subunit 10344, the timestamp acquisition subunit 10345, and the speech pause segment determination subunit 10346 all suspend execution; conversely, when the second speech conversion subunit 10344, the timestamp acquisition subunit 10345, and the speech pause segment determination subunit 10346 perform their corresponding operations, the first speech conversion subunit 10341, the text comparison subunit 10342, and the speech deletion subunit 10343 all suspend execution.
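The pause deletion handled by the timestamp acquisition subunit 10345 and the speech pause segment determination subunit 10346 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the speech-recognition step yields one (start, end) timestamp pair per recognized character, and flags inter-character gaps longer than a threshold as pause segments to cut from the original video data.

```python
def find_pause_segments(char_timestamps, min_pause=0.8):
    """Return (start, end) gaps, in seconds, between consecutive recognized
    characters that are longer than min_pause.

    char_timestamps: list of (start, end) tuples, one per recognized
    character, ordered by time (an assumed output shape for the ASR step).
    """
    pauses = []
    for (_, prev_end), (next_start, _) in zip(char_timestamps, char_timestamps[1:]):
        if next_start - prev_end >= min_pause:
            # The speech between prev_end and next_start is silence/pause.
            pauses.append((prev_end, next_start))
    return pauses
```

The editor would then remove these segments (together with the voice data of any error text) from the original video data to obtain the target video data.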
  • the data processing apparatus 1 may further include: a user speech rate determination module 115, and a speech rate prompt information display module 116;
  • the user's speech rate determination module 115 is used to obtain the speech duration corresponding to the user's initial speech and the number of speech characters contained in the user's initial speech, and determine the ratio of the number of speech characters to the speech duration as the user's speech rate;
  • the speech rate prompt information display module 116 is used to display the speech rate prompt information on the recording page when the user's speech rate is greater than the speech rate threshold; the speech rate prompt information is used to prompt the target user associated with the video recording service to reduce the user's speech rate.
  • the specific function implementation of the user speech rate determination module 115 and the speech rate prompt information display module 116 may refer to S102 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
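The check performed by the user speech rate determination module 115 and the speech rate prompt information display module 116 reduces to a ratio and a threshold comparison. A minimal sketch; the 500-characters-per-minute threshold mirrors the example given later in the description, and the function name and return shape are illustrative assumptions:

```python
def speech_rate_warning(char_count, duration_seconds, threshold_cpm=500):
    """Compute the user's speech rate in characters per minute and decide
    whether the speech-rate prompt should be shown."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    rate = char_count / duration_seconds * 60  # characters per minute
    return rate, rate > threshold_cpm
```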
  • the error text includes K error sub-texts, and K is a positive integer
  • the data processing device 1 may further include: an error frequency determination module 117, an error type identification module 118, and a tutorial video push module 119;
  • the error frequency determination module 117 is used to determine the error frequency in the video recording service according to the video duration corresponding to the K error subtexts and the original video data;
  • the error type identification module 118 is used to identify the speech error types corresponding to the K error sub-texts respectively when the error frequency is greater than the error threshold;
  • the tutorial video push module 119 is configured to push the tutorial video associated with the speech error type to the target user associated with the video recording service in the video application.
  • the specific function implementation manner of the error frequency determination module 117 , the error type identification module 118 , and the tutorial video push module 119 may refer to S103 in the embodiment corresponding to FIG. 3 , which will not be repeated here.
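The gate used by the error frequency determination module 117 before the tutorial video push module 119 acts amounts to dividing the number of error sub-texts by the video duration and comparing against a threshold. A minimal sketch; the 0.05 errors-per-second threshold is an assumed illustrative value, as the document does not fix one:

```python
def should_push_tutorial(error_subtext_count, video_duration_seconds,
                         error_threshold=0.05):
    """Return True when the error frequency (errors per second of recorded
    video) exceeds the threshold, i.e. when a tutorial should be pushed."""
    if video_duration_seconds <= 0:
        raise ValueError("duration must be positive")
    frequency = error_subtext_count / video_duration_seconds
    return frequency > error_threshold
```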
  • in summary, the video recording service can be started by voice, and a word prompting function can be provided during the recording process of the video recording service: the target text that matches the user's voice is located in the prompt text data and identified in the video application, so that the text displayed in the video application matches what the user is currently saying. This improves the effectiveness of the text prompt function in the video recording service, reduces the risk of recording failure caused by forgetting words, and thus improves the quality of the recorded video. Starting or stopping the video recording service by the user's voice reduces user operations in the video recording service and improves the recording experience; after the video recording service ends, the recorded video can be automatically edited and optimized, further improving the quality of the recorded video.
  • FIG. 15 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • the data processing apparatus can perform the steps in the embodiment corresponding to FIG. 12 .
  • the data processing apparatus 2 can include: a prompt text uploading module 21 , a user voice collection module 22 , and a user voice text display module 23 ;
  • the prompt text uploading module 21 is used to upload the prompt text data to the prompting application
  • the user voice collection module 22 is used to collect the user voice corresponding to the target user, perform text conversion on the user voice, and generate the user voice text corresponding to the user voice;
  • the user voice text display module 23 is configured to determine the same text as the user voice text as the target text in the prompt text data, and identify the target text in the prompting application.
  • the specific implementation of the prompt text uploading module 21, the user voice collection module 22, and the user voice text display module 23 can refer to S201-S203 in the embodiment corresponding to FIG. 12, which will not be repeated here.
  • the target user includes a first user and a second user
  • the prompt text data includes a first prompt text corresponding to the first user and a second prompt text corresponding to the second user
  • the user voice text display module 23 may include: a user identity determination unit 231, a first determination unit 232, and a second determination unit 233;
  • the user identity determination unit 231 is used to obtain the user voiceprint feature in the user voice, and determine the user identity corresponding to the user voice according to the user voiceprint feature;
  • the first determining unit 232 is used for, if the user identity is the first user, in the first prompt text, determine the text identical to the user's voice text as the target text, and identify the target text in the prompting application;
  • the second determining unit 233 is configured to determine, in the second prompt text, the same text as the user's voice text as the target text if the user identity is the second user, and identify the target text in the prompting application.
  • the specific implementation manner of the user identity determination unit 231, the first determination unit 232, and the second determination unit 233 may refer to S203 in the embodiment corresponding to FIG. 12, which will not be repeated here.
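The voiceprint-based routing performed by the user identity determination unit 231, the first determination unit 232, and the second determination unit 233 can be sketched as a nearest-speaker lookup over voiceprint embeddings. The cosine-similarity comparison, the embedding format, and the threshold are assumptions for illustration; the document does not specify the voiceprint model:

```python
import math


def route_prompt_text(user_embedding, enrolled, prompts, threshold=0.75):
    """Pick the prompt text for whichever enrolled speaker best matches the
    voiceprint of the incoming user voice.

    enrolled: {user_id: embedding}; prompts: {user_id: prompt_text}.
    Returns (user_id, prompt_text), or (None, None) for an unknown speaker.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    best_id, best_sim = None, -1.0
    for user_id, emb in enrolled.items():
        sim = cosine(user_embedding, emb)
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    if best_sim < threshold:
        return None, None  # no enrolled speaker is similar enough
    return best_id, prompts[best_id]
```

With the first and second users enrolled, the unit would then identify the matched target text only within that user's own prompt text.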
  • the teleprompter can identify the sentence that the target user is reading aloud and, following the target user's reading progress, automatically recognize the target user's voice and scroll the prompt text data in the teleprompter, which improves the effectiveness of the text prompt function in the teleprompter.
  • FIG. 16 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1000 may include: a processor 1001 , a network interface 1004 and a memory 1005 , in addition, the above-mentioned computer device 1000 may further include: a user interface 1003 , and at least one communication bus 1002 .
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface or a wireless interface (e.g., a Wi-Fi interface).
  • the memory 1005 may be high-speed RAM memory or non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide a network communication function;
  • the user interface 1003 is mainly used to provide an input interface for the user; and
  • the processor 1001 can be used to invoke the device control application stored in the memory 1005 to implement:
  • collecting the user's voice in the video recording service, determining the target text that matches the user's voice in the prompt text data associated with the video recording service, and identifying the target text;
  • the computer device 1000 described in the embodiment of the present application can execute the description of the data processing method in the embodiment corresponding to FIG. 3 above, and can also execute the description of the data processing apparatus 1 in the embodiment corresponding to FIG. 14 above, which is not repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • FIG. 17 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 2000 may include: a processor 2001 , a network interface 2004 and a memory 2005 , in addition, the above-mentioned computer device 2000 may further include: a user interface 2003 , and at least one communication bus 2002 .
  • the communication bus 2002 is used to realize the connection and communication between these components.
  • the user interface 2003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 2003 may also include a standard wired interface and a wireless interface.
  • the network interface 2004 may include a standard wired interface or a wireless interface (e.g., a Wi-Fi interface).
  • the memory 2005 may be high-speed RAM memory or non-volatile memory, such as at least one disk memory.
  • the memory 2005 may also be at least one storage device located remotely from the aforementioned processor 2001 .
  • the memory 2005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 2004 can provide network communication functions;
  • the user interface 2003 is mainly used to provide an input interface for the user; and
  • the processor 2001 can be used to invoke the device control application stored in the memory 2005 to implement:
  • collecting the user voice corresponding to the target user, performing text conversion on the user voice, and generating the user voice text corresponding to the user voice;
  • in the prompt text data, determining the text identical to the user voice text as the target text, and identifying the target text in the prompting application.
  • the computer device 2000 described in the embodiment of the present application can execute the description of the data processing method in the embodiment corresponding to FIG. 12 above, and can also execute the description of the data processing apparatus 2 in the embodiment corresponding to FIG. 15 above, which is not repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • the embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program executed by the aforementioned data processing apparatus 1, the computer program including program instructions; when the processor executes the program instructions, the description of the data processing method in any one of the embodiments corresponding to FIG. 3, FIG. 11, and FIG. 12 can be executed.
  • the description of the beneficial effects of using the same method will not be repeated.
  • program instructions may be deployed for execution on one computing device, or on multiple computing devices located at one site, or alternatively, distributed across multiple sites and interconnected by a communications network.
  • multiple computing devices distributed in multiple locations and interconnected by a communication network can form a blockchain system.
  • the embodiments of the present application further provide a computer program product or computer program
  • the computer program product or computer program may include computer instructions, and the computer instructions may be stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor can execute the computer instructions, so that the computer device executes the data processing method in any of the corresponding embodiments of FIG. 3 , FIG. 11 and FIG. 12 . Therefore, it will not be repeated here.
  • the description of the beneficial effects of using the same method will not be repeated.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.


Abstract

Embodiments of the present application provide a data processing method, apparatus, device, and medium. The method includes: starting a video recording service in a video application in response to a service start operation in the video application; collecting the user's voice in the video recording service, determining, in the prompt text data associated with the video recording service, the target text that matches the user's voice, and identifying the target text; and when the text position of the target text in the prompt text data is the end position of the prompt text data, acquiring the target video data corresponding to the video recording service. The embodiments of the present application can improve the effectiveness of the word prompting function in a video recording service, thereby improving the quality of the recorded video.

Description

Data processing method, apparatus, device, and medium
This application claims priority to Chinese patent application No. 202110179007.4, filed with the Chinese Patent Office on February 8, 2021 and entitled "Data processing method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular to data processing technologies.
Background
With the development of short videos, more and more users (including people without any shooting or editing experience) are joining the ranks of multimedia creators and beginning to perform in front of the camera. Inexperienced multimedia creators often forget their lines in front of the camera, and even if they memorize the script, they may stumble or show unnatural expressions.
For this reason, when shooting a short video, a user may print out the script content and place it next to the camera as a prompt.
However, when the script is long, the user may be unable to quickly locate the content to be spoken next, or may locate the wrong position, so prompting lines by printing the script is not very effective. Moreover, when the user looks at the script beside the camera, the camera captures this movement, which degrades the quality of the final video.
Summary
Embodiments of the present application provide a data processing method, apparatus, device, and medium, which can improve the effectiveness of the word prompting function in a video recording service, thereby improving the quality of the recorded video.
In one aspect, an embodiment of the present application provides a data processing method, executed by a computer device, including:
starting a video recording service in a video application in response to a service start operation in the video application;
collecting the user's voice in the video recording service, determining, in the prompt text data associated with the video recording service, the target text that matches the user's voice, and identifying the target text; and
when the text position of the target text in the prompt text data is the end position of the prompt text data, acquiring the target video data corresponding to the video recording service.
In one aspect, an embodiment of the present application provides a data processing method, executed by a computer device, including:
uploading prompt text data to a prompting application;
collecting the user voice corresponding to a target user, performing text conversion on the user voice, and generating the user voice text corresponding to the user voice; and
in the prompt text data, determining the text identical to the user voice text as the target text, and identifying the target text in the prompting application.
In one aspect, an embodiment of the present application provides a data processing apparatus, deployed on a computer device, including:
a start module, configured to start a video recording service in a video application in response to a service start operation in the video application;
a display module, configured to collect the user's voice in the video recording service, determine, in the prompt text data associated with the video recording service, the target text that matches the user's voice, and identify the target text; and
an acquisition module, configured to acquire the target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data.
In one aspect, an embodiment of the present application provides a data processing apparatus, deployed on a computer device, including:
a prompt text uploading module, configured to upload prompt text data to a prompting application;
a user voice collection module, configured to collect the user voice corresponding to a target user, perform text conversion on the user voice, and generate the user voice text corresponding to the user voice; and
a user voice text display module, configured to determine, in the prompt text data, the text identical to the user voice text as the target text, and identify the target text in the prompting application.
In one aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory is connected to the processor, the memory is configured to store a computer program, and the processor is configured to invoke the computer program so that the computer device executes the method provided in any of the above aspects of the embodiments of the present application.
In one aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program being adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method provided in any of the above aspects of the embodiments of the present application.
According to one aspect of the present application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in any of the above aspects.
In the embodiments of the present application, in response to a service start operation in a video application, a video recording service in the video application can be started; the user's voice in the video recording service is collected, the target text associated with the user's voice is determined in the prompt text data associated with the video recording service, and the target text is identified, so that the user who is speaking can quickly and accurately locate the content to be spoken according to the identification, improving the effectiveness of the text prompt function in the video recording service. When the text position of the target text in the prompt text data is the end position of the prompt text data, the target video data corresponding to the video recording service is acquired. It can be seen that after the video recording service is started in the video application, the target text matching the user's voice can be located and identified in the prompt text data, i.e., the target text displayed in the video application matches the content the user is currently speaking, which improves the effectiveness of the text prompt function in the video recording service, reduces the risk of recording failure caused by the user forgetting words, and thus improves the quality of the recorded video.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an interface for inputting prompt text data provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an interface for starting a video recording service in a video application provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an interface for displaying prompt text data provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an interface for displaying speech rate prompt information provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of an interface for stopping a video recording service provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of an interface for editing and optimizing a recorded video provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of an interface for recommending tutorial videos according to speech error types provided by an embodiment of the present application;
FIG. 11 is a flowchart of an implementation of a video recording service provided by an embodiment of the present application;
FIG. 12 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of an application scenario of a teleprompter provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a computer device provided by an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings.
Please refer to FIG. 1, which is a schematic structural diagram of a network architecture provided by an embodiment of the present application. As shown in FIG. 1, the network architecture may include a server 10d and a user terminal cluster; the user terminal cluster may include one or more user terminals, and the number of user terminals is not limited here. As shown in FIG. 1, the user terminal cluster may specifically include a user terminal 10a, a user terminal 10b, a user terminal 10c, and so on. The server 10d may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The user terminals 10a, 10b, 10c, and the like may each be a smart terminal with a video/image playback function, such as a smartphone, tablet computer, notebook computer, palmtop computer, mobile internet device (MID), wearable device (e.g., smart watch or smart band), or smart TV. As shown in FIG. 1, the user terminals 10a, 10b, 10c, and the like may each establish a network connection with the server 10d, so that each user terminal can exchange data with the server 10d through the network connection.
Taking the user terminal 10a shown in FIG. 1 as an example, a video application with a video recording function may be installed in the user terminal 10a, where the video application may be a video editing application, a short video application, or the like. A user may open the video application installed in the user terminal 10a, and the video application may provide the user with a video recording function, which may include a regular shooting mode and a prompted shooting mode. In the regular shooting mode, while the user is being filmed with the camera built into the user terminal 10a (or an external camera device communicatively connected to the user terminal 10a), no script prompts may be provided for the user, so the user needs to prepare in advance the script content to be expressed during video recording (e.g., memorize it). In the prompted shooting mode, while the user is being filmed with the built-in camera or an external camera device, the script content can be displayed on the terminal screen of the user terminal 10a, and the displayed script content can be switched (e.g., scrolled) according to the progress of the user's speech; the script content here may also be referred to as the prompt text data in the video recording service. After the user performs a trigger operation on the entry corresponding to the prompted shooting mode in the video application (i.e., the prompted shooting entry), the user terminal 10a may, in response to the trigger operation on the prompted shooting entry, display a recording page in the video application; before video recording, the user may input prompt text data on the recording page, or upload existing prompt text data to the recording page.
When the user starts video recording, the user terminal 10a may, in response to the user's video recording start operation, start the video recording function in the video application, and during video recording, display content on the terminal screen of the user terminal 10a according to the progress of the user's speech. In other words, during video recording, the prompt text data can be displayed according to the progress of the user's voice: when the user speaks faster, the switching (e.g., scrolling) speed of the prompt text data in the video application speeds up; when the user speaks more slowly, the switching speed slows down. That is, the text of the prompt text data displayed in the video application matches the user's voice, which ensures the effectiveness of the text prompt function during video recording, helps the user complete the recording smoothly, and thus improves the quality of the recorded video.
Please also refer to FIG. 2, which is a schematic diagram of a data processing scenario provided by an embodiment of the present application; taking a video recording scenario as an example, the implementation process of the data processing method provided by the embodiments of the present application is described. The user terminal 20a shown in FIG. 2 may be any user terminal in the user terminal cluster shown in FIG. 1; a video application with a video recording function is installed in the user terminal 20a. User A (who may be the user of the user terminal 20a) may open the video application in the user terminal 20a, enter its home page, and perform a trigger operation on the shooting entry in the video application. In response to the trigger operation on the shooting entry, the user terminal 20a displays a shooting page 20m in the video application, which may include a shooting area 20b, a filter control 20c, a shooting control 20d, a beautification control 20e, and so on. The shooting area 20b is used to display the video frames captured by the user terminal 20a; these frames, directed at user A, may be captured by the camera built into the user terminal 20a or by a camera device communicatively connected to it. The shooting control 20d may be used to start and stop video recording: after entering the shooting page 20m, performing a trigger operation on the shooting control 20d starts shooting, and the captured video frames are displayed in the shooting area 20b; performing another trigger operation on the shooting control 20d during shooting stops shooting, and the video frame displayed in the shooting area 20b freezes on the frame at which shooting stopped. The filter control 20c may be used to apply image processing to the captured video frames to achieve special effects; for example, a skin-smoothing filter can retouch and polish the skin of the person in the captured frames. The beautification control 20e may be used to apply beautification to the person in the captured frames, such as automatically refining the face shape, enlarging the eyes, or heightening the nose.
The shooting page 20m may further include a prompted shooting entry 20f. When user A lacks video recording experience, in order to avoid forgetting lines during recording (which might require re-recording the video), user A may choose the prompted shooting function in the video application, i.e., perform a trigger operation on the prompted shooting entry 20f on the shooting page 20m. In response to user A's trigger operation on the prompted shooting entry 20f, the user terminal 20a switches the shooting page 20m in the video application to the recording page corresponding to the prompted shooting entry 20f. The recording page may first display a text input area, in which user A can enter the script content needed for recording; this content can be used to prompt user A during video recording. Simply put, while recording, user A can follow the script content displayed in the video application; this script content may also be called the prompt text data 20g. The text input area may also display statistics 20h about the script content entered by user A, which may include the number of characters in the entered script (i.e., the number of prompt characters, e.g., 134 characters) and the estimated video duration corresponding to the entered script (e.g., 35 seconds); user A can add or delete script content according to the estimated video duration. For example, if user A wants to record a one-minute video and the estimated video duration of the script entered in the text input area is 4 minutes, user A can trim the script displayed in the text input area so that the trimmed script corresponds to an estimated duration of about one minute (e.g., the estimated duration may range from 55 to 65 seconds); if the estimated duration of the entered script is 35 seconds, user A can expand the script so that its estimated duration is about one minute, and the finally determined script content is determined as the prompt text data 20g.
After user A determines the prompt text data 20g, user A may perform a trigger operation on the "Next" control on the recording page. In response, the user terminal 20a turns on its camera (or a communicatively connected camera device) and enters the video recording preparation state (i.e., before recording starts). As shown in FIG. 2, the recording page may display the video frame 20i of user A captured by the user terminal 20a, along with the prompt "Adjust your position, set down the phone, and say 'start' to begin prompted shooting." User A can adjust his or her own position and the position of the user terminal 20a according to the video frame 20i; once positioned, the user can start video recording by voice, e.g., by saying "start".
After user A says "start", the user terminal 20a may, in response to user A's voice start operation, start video recording in the video application and display the prompt text data 20g on the recording page. It can be understood that the text displayed on the recording page may be only part of the prompt text data 20g, such as one sentence, so after recording starts, the first sentence of the prompt text data 20g may be displayed first. When user A begins to speak during recording, the user terminal 20a can collect the user voice of user A, and the client of the video application installed in the user terminal 20a can transmit the user voice to the video application's backend server 20j together with a voice matching instruction. After receiving the user voice and the voice matching instruction, the backend server 20j can convert the user voice into user voice text. When the user voice text is Chinese (in which case the prompt text data 20g may by default be Chinese as well), the backend server 20j can further convert the user voice text into first Chinese pinyin (when the user voice text is Chinese, the first syllable information may be called the first Chinese pinyin). Of course, after user A enters the prompt text data 20g in the text input area, the client of the video application can likewise transmit the prompt text data 20g to the backend server 20j, so the backend server 20j can convert the prompt text data 20g into second Chinese pinyin (when the user voice text is Chinese, the second syllable information may be called the second Chinese pinyin). The backend server 20j can match the first Chinese pinyin against the second Chinese pinyin, i.e., look up, in the second Chinese pinyin, the pinyin identical to the first Chinese pinyin, thereby finding the text position of the first Chinese pinyin within the second Chinese pinyin, and determine the text at that position in the prompt text data 20g as the target text (i.e., the text in the prompt text data 20g matched by the user voice). The backend server 20j can transmit the target text to the client of the video application, and the terminal device 20a can identify the target text in the video application (e.g., by enlarging its display size, changing its display color, or enclosing it in a circle or rectangle). Understandably, when user A speaks in the order of the prompt text data, the prompt text data can be scrolled on the recording page; when user A does not follow that order, the displayed prompt text data can jump accordingly.
When the target text is a word or phrase, the sentence containing the target text can be identified in the video application. As shown in FIG. 2, when the user voice is "周末" ("weekend"), the backend server 20j can match the target text "周末" in the prompt text data 20g, and the sentence containing it, "周末，在长沙参加xx和xx合作的消费班", can be identified on the recording page (with enlarged and bold text, as shown in area 20k in FIG. 2).
It should be noted that the prompt text data 20g may be displayed directly on the recording page, or in a sub-page displayed independently of the recording page; the present application does not limit the display form of the prompt text data 20g on the recording page. The purpose of matching the user voice against the prompt text data 20g is to determine the text position of the user voice within the prompt text data 20g. When converting the user voice into user voice text, only the consistency between the pronunciation of the converted text and the user voice needs to be considered; the accuracy of the converted text relative to the user voice need not be considered. Therefore, matching can be performed via Chinese pinyin, which improves the matching efficiency between the user voice and the prompt text data.
The user terminal 20a can collect the user voice of user A's speech in real time, and the backend server 20j can determine in real time the target text corresponding to the user voice in the prompt text data 20g, so the prompt text data can be scrolled according to the progress of the user's speech. For example, when user A speaks the first sentence of the prompt text data 20g, the first sentence can be identified on the recording page; when user A reaches the second sentence, the display on the recording page can switch from the first sentence to the second sentence, which is then identified. The target text identified on the recording page at any moment is always the content user A is currently speaking. When user A reaches the last character of the prompt text data 20g, the user terminal 20a can stop video recording and determine the video recorded this time as the completed video. If user A is satisfied with this recording, the video can be saved; if not, user A can re-shoot. Of course, user A can also edit and optimize the completed video to obtain the final recorded video, i.e., the target video data.
In the video recording process shown in the embodiments of the present application, the prompt text data can be displayed according to the progress of the user's voice, thereby achieving accurate word prompting for the user and improving the quality of the recorded video.
Please refer to FIG. 3, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. Understandably, the data processing method may be executed by a computer device, which may be a user terminal, an independent server, a cluster of multiple servers, a system composed of a user terminal and a server, or a computer program application (including program code); no specific limitation is imposed here. As shown in FIG. 3, the data processing method may include the following S101-S103:
S101: In response to a service start operation in a video application, start a video recording service in the video application.
When a user wants to express opinions or show daily life in front of the camera, the user can record a video in the video application to obtain the desired video; the finally recorded video can be uploaded to an information publishing platform for sharing, so that users of the platform can watch it. In the embodiments of the present application, the user who needs to record a video may be called the target user, and the device used by the target user for video recording may be called the computer device. When the target user performs a service start operation for the video recording service in the video application installed on the computer device, the computer device may, in response to the service start operation in the video application, start the video recording service in the video application, i.e., start video recording. The service start operation may include, but is not limited to, contact trigger operations such as single click, double click, long press, and screen tapping, and contactless trigger operations such as voice, remote control, and gestures.
Before the computer device starts the video recording service, the target user may also upload the prompt text data needed in the video recording service to the video application; the prompt text data can be used to prompt the target user during the video recording service, greatly reducing the chance of the target user forgetting lines during recording. After opening the video application installed on the computer device, the target user can enter the shooting page of the video application (e.g., the shooting page 20m in the embodiment corresponding to FIG. 2), which may include a prompted shooting entry. When the target user performs a trigger operation on the prompted shooting entry on the shooting page, the computer device may, in response to the trigger operation on the prompted shooting entry in the video application, display a recording page in the video application. The recording page may include a text input area for editing text content. The computer device may, in response to an information editing operation on the text input area, display the prompt text data determined by the editing operation in the text input area; when the number of prompt characters in the prompt text data is greater than a quantity threshold (which can be preset according to actual needs, e.g., 100), the number of prompt characters and the estimated video duration corresponding to the prompt text data can be displayed in the text input area. In other words, after the target user triggers the prompted shooting entry on the shooting page, the shooting page is switched to the recording page in the video application, and the target user can edit the script content (i.e., the above prompt text data) needed in the video recording service in the text input area of the recording page. While the target user edits text in the text input area, the number of entered prompt characters can be counted in real time; when that number exceeds the preset quantity threshold, the number of prompt characters and the estimated video duration corresponding to the currently entered prompt text data can be displayed in the text input area. Besides being displayed on the shooting page, the prompted shooting entry may also be displayed on any page of the video application; the embodiments of the present application do not limit its display position.
The estimated video duration can serve as a duration reference for the finished video recorded in the subsequent video recording service. When there is a large difference between the estimated video duration displayed in the text input area and the video duration the target user expects to record, the target user can add to or trim the text in the text input area. For example, when the displayed estimated video duration is 35 seconds and the target user expects to record a 2-minute video, the target user can continue editing text in the text input area until the displayed estimated duration falls within the set range (e.g., between 1 minute 50 seconds and 2 minutes 10 seconds).
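The duration estimate above can be approximated by dividing the prompt character count by an average speaking rate. A minimal sketch; the chars_per_second value is an assumed average speaking rate, since the document does not specify the estimator:

```python
def estimate_video_seconds(char_count, chars_per_second=4.0):
    """Rough finished-video duration from prompt-text length.

    chars_per_second is an illustrative average (e.g., for Mandarin speech);
    real applications would calibrate it or account for pauses.
    """
    if chars_per_second <= 0:
        raise ValueError("speaking rate must be positive")
    return char_count / chars_per_second
```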
After the computer device responds to the trigger operation on the prompted shooting entry in the video application, the displayed recording page may also show a text upload control. The target user can perform a trigger operation on the text upload control to upload already edited prompt text data to the recording page; that is, the computer device may, in response to the trigger operation on the text upload control, determine the text content uploaded to the recording page as the prompt text data, display it in the text input area of the recording page, and also display the number of prompt characters and the estimated video duration corresponding to the prompt text data. The text upload control may include, but is not limited to, a paste-text control and a use-last-text control. When the target user triggers the paste-text control, it means that the target user can directly paste pre-edited prompt text data into the text input area without editing text content on the spot. When the target user triggers the use-last-text control, it means that the target user can reuse the prompt text data from the previous video recording service in the current one, i.e., the target user may have been dissatisfied with the video recorded last time and is re-recording it now; this avoids re-entering the same prompt text data and improves the input efficiency of the prompt text data.
Please also refer to FIG. 4, which is a schematic diagram of an interface for inputting prompt text data provided by an embodiment of the present application. As shown in FIG. 4, after the target user performs a trigger operation on the shooting entry in the video application installed on the user terminal 30a, the user terminal 30a (which here may be the above computer device) may, in response to the trigger operation on the shooting entry, display a shooting page 30g in the video application. The shooting page 30g may include a shooting area 30b, a filter control 30c, a shooting control 30d, a beautification control 30e, a prompted shooting entry 30f, and so on. For the functional descriptions of the shooting area 30b, filter control 30c, shooting control 30d, and beautification control 30e in the video application, refer to the descriptions of the shooting area 20b, filter control 20c, shooting control 20d, and beautification control 20e in the embodiment corresponding to FIG. 2 above, which are not repeated here.
When the target user performs a trigger operation on the prompted shooting entry 30f on the shooting page 30g, the user terminal 30a may, in response, switch the shooting page 30g to a recording page 30h in the video application. The recording page 30h may include a text input area 30i for directly editing text content. The target user can tap the text input area 30i to bring up a keyboard 30p on the recording page 30h, and use the keyboard 30p to edit the prompt text data needed for the current video recording service; the user terminal 30a may, in response to the target user's information editing operation, display the edited text content as prompt text data in the text input area 30i. At the same time, the user terminal 30a can count in real time the number of prompt characters of the prompt text data entered in the text input area 30i; when that number exceeds a preset quantity threshold (e.g., 100), the number of prompt characters and the expected finished-video duration (i.e., the estimated video duration) corresponding to the entered prompt text data can be displayed in area 30m of the text input area 30i. As shown in FIG. 4, when the target user enters the text "周末，在长沙参加xx与xx合作的消费班。当年别人都是通过公众号走线上" in the text input area 30i, the user terminal 30a counts 32 prompt characters and an expected finished-video duration of 15 seconds, i.e., area 30m displays "current character count 32, expected finished video 15 s". The target user can edit the text content with reference to the expected duration shown in area 30m; after finishing editing in the text input area 30i, the target user can confirm the text content in the text input area 30i as the prompt text data, and then perform a trigger operation on the "Next" control 30n on the recording page 30h to trigger the user terminal 30a to proceed to the next step of the video recording service.
As shown in FIG. 4, the text input area 30i may further include a paste-text control 30j and a last-text control 30k. When the target user performs a trigger operation on the paste-text control 30j, it means the target user has already edited the prompt text data in another application and copied it from there; the user terminal 30a, in response to the trigger operation on the paste-text control 30j, pastes the copied prompt text data into the text input area 30i. When the video the target user records in the current video recording service is a re-recording of the video recorded in the previous service, the target user can perform a trigger operation on the last-text control 30k; the user terminal 30a, in response, obtains the prompt text data of the previous video recording service and displays it in the text input area 30i, directly using the previous prompt text data for the current service. The target user can adjust the previous prompt text data in the text input area 30i based on the experience of the previous recording; for example, if the target user found a logical error in sentence 1 of the prompt text data during the previous recording, the prompt text data can be modified in the text input area 30i in the current recording.
It should be noted that for prompt text data entered into the text input area 30i through the paste-text control 30j or the last-text control 30k, the character count and expected finished-video duration can likewise be displayed in area 30m of the text input area 30i. In the embodiments of the present application, when the target user uses the paste-text control 30j or the last-text control 30k to enter the prompt text data for the video recording service into the text input area 30i, the input efficiency of the prompt text data in the video recording service can be improved.
When the service start operation is a voice start operation, the target user may, after finishing editing the prompt text data, perform the voice start operation on the video recording service in the video application. The computer device may, in response to the voice start operation, display a recording countdown animation associated with the video recording service on the recording page of the video application, and when the countdown animation ends, start and execute the video recording service, i.e., formally start recording the video. While the recording countdown animation is playing on the recording page, the camera device corresponding to the computer device can be turned on, and the target user can adjust his or her own position and the position of the computer device according to the video frame displayed on the recording page, so as to find the best shooting angle. The recording page may also display an animation cancel control corresponding to the recording countdown animation; when the target user is ready to record, the user can perform a trigger operation on the animation cancel control to cancel the countdown animation. That is, the computer device may, in response to the target user's trigger operation on the animation cancel control, stop displaying the recording countdown animation on the recording page and start and execute the video recording service. In other words, after the target user starts the video recording service by voice, the video application does not enter the formal recording mode directly; instead, the recording countdown animation is played on the recording page to give the target user a short preparation time (i.e., the duration of the countdown animation, e.g., 5 seconds), and the formal recording mode is entered only after the countdown animation finishes. Alternatively, if the target user is ready in advance, the countdown animation can be cancelled to enter the formal recording mode directly.
Please also refer to FIG. 5, which is a schematic diagram of an interface for starting a video recording service in a video application provided by an embodiment of the present application. After finishing editing the prompt text data, the target user can proceed to the next step (e.g., performing a trigger operation on the "Next" control 30n in the embodiment corresponding to FIG. 4), and the text input area is dismissed from the recording page. As shown in FIG. 5, after the target user finishes editing the prompt text data and proceeds, the text input area can be dismissed from the recording page 40b, the video frame of the target user is displayed in area 40c of the recording page 40b, and at the same time, the prompt 40d ("Adjust your position, set down the phone, and say 'start' to begin prompted shooting") can be displayed on the recording page 40b. That is, before the video recording service is started, the user terminal 40a (which here may be called the computer device) can turn on its associated camera device (e.g., its built-in camera), collect the target user's image data, render the collected image data into the target user's video frame, and display it in area 40c of the recording page 40b. The target user can adjust his or her own position and the camera position according to the video frame displayed in area 40c to find the best shooting angle.
After the target user has adjusted the positions, i.e., is ready to record, the user can say "start" to start the video recording service in the video application. When the target user says "start" to perform the voice start operation on the video recording service, the user terminal 40a may, in response, display a recording countdown animation in area 40e of the recording page 40b; the duration of the countdown animation may be 5 seconds. Of course, the first few sentences of the prompt text data (e.g., the first two) may also be displayed in area 40e of the recording page 40b.
When the recording countdown animation on the recording page 40b finishes playing, the user terminal 40a can start and execute the video recording service in the video application. If the target user does not want to wait for the countdown animation to finish before starting the video recording service, the user can perform a trigger operation on the animation cancel control 40f on the recording page 40b to cancel the animation and directly start and execute the video recording service. After formal recording starts, the target user can begin speaking; the user terminal 40a can collect the target user's voice, look up the target text matching the user voice in the prompt text data, and identify the target text in area 40g of the recording page 40b (e.g., by bolding and enlarging it). The specific process of determining the target text is described in S102 below.
S102: Collect the user's voice in the video recording service, determine, in the prompt text data associated with the video recording service, the target text that matches the user's voice, and identify the target text.
After formal recording starts, the computer device can turn on its audio collection function, collect the target user's voice in the video recording service, look up the target text matching the user voice in the prompt text data, and identify the target text contained in the prompt text data on the recording page. The computer device can collect the target user's voice in the video recording service in real time, convert it into text, determine the text position corresponding to the user voice in the prompt text data, determine the target text corresponding to the user voice according to that position, and identify the target text on the recording page. The identification may include, but is not limited to, text display color, font size, and text background. The target text may refer to text data containing the user voice text; for example, if the user voice text is "新年" ("New Year"), the target text may be the complete sentence containing "新年", such as "在新年到来之际，祝愿大家牛年大吉".
The computer device refers to the directly collected voice as the user's initial voice; that is, the computer device can collect the user's initial voice in the video recording service, perform voice activity detection (VAD) on it to obtain the valid voice data in the initial voice, and determine the valid voice data as the user voice. The user voice can then be converted into user voice text, which is text-matched against the prompt text data associated with the video recording service to determine the target text matching the user voice text in the prompt text data; the target text is then identified on the recording page of the video recording service. In other words, the user's initial voice collected by the computer device may contain noise from the target user's environment as well as pauses in the target user's speech, so voice endpoint detection can be performed on the initial voice to remove silence and noise as interference and keep the valid voice data, which may then be called the target user's user voice. The computer device can use a fast speech-to-text model to convert the user voice into user voice text, compare the user voice text with the prompt text data, find the text position of the user voice text within the prompt text data, determine the target text corresponding to the user voice according to the text position, and identify the target text on the recording page of the video recording service.
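The voice-endpoint-detection step above can be illustrated with a simple energy gate over audio frames. Production systems typically use a trained VAD (e.g., WebRTC VAD); this sketch, with assumed frame and threshold formats, only shows the idea of discarding silent frames and keeping the valid voice data:

```python
def trim_silence(frames, energy_threshold=0.01):
    """Keep only frames whose mean energy exceeds a threshold.

    frames: list of frames, each a list of float samples in [-1, 1].
    Returns the frames considered to contain valid voice data.
    """
    def energy(frame):
        return sum(s * s for s in frame) / len(frame)

    return [f for f in frames if energy(f) > energy_threshold]
```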
A fast speech-to-text model here means that when converting the user voice into text, there is no need to correct errors based on context or to consider whether the semantics are correct; it only needs to judge whether the pronunciation of the converted text is consistent with the user voice. When determining the target text matching the user voice in the prompt text data, the computer device can do so according to the pronunciation of the user voice text and the pronunciation of the prompt text data: the computer device can obtain the first syllable information corresponding to the user voice text, obtain the second syllable information corresponding to the prompt text data associated with the video recording service, obtain from the second syllable information the target syllable information identical to the first syllable information, and determine the target text corresponding to the target syllable information in the prompt text data.
The syllable information may refer to pinyin information in Chinese, phonetic-symbol information in English, and so on. When the prompt text data is Chinese, the computer device can convert the user voice text into first pinyin information and the prompt text data into second pinyin information, find the text position corresponding to the first pinyin information within the second pinyin information, and determine the target text corresponding to the user voice in the prompt text data according to the text position. When the prompt text data is in another language such as English, the computer device can convert the user voice text into first phonetic-symbol information and the prompt text data into second phonetic-symbol information, and then determine the target text corresponding to the user voice in the prompt text data accordingly. It can be understood that in Chinese the same pronunciation can correspond to different characters, so pinyin matching can improve the efficiency of determining the target text; for languages where different pronunciations correspond to different words (e.g., English), the computer device can directly match the letters of the user voice text against the letters of the prompt text data to determine the target text corresponding to the user voice.
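The pinyin-based matching described above can be sketched as a syllable-sequence search: both the recognized speech text and the prompt text are mapped to syllables (one per character), and the speech syllables are located within the prompt syllables, so homophone recognition errors still match. The parallel-sequence input format is an assumption for illustration; in practice, a library such as pypinyin could produce the syllables from the Chinese text:

```python
def locate_by_syllables(prompt_syllables, speech_syllables):
    """Find where the speech matches the prompt text, comparing syllables
    (e.g. pinyin) rather than characters.

    prompt_syllables: one syllable per prompt character, in order.
    speech_syllables: syllable sequence of the recognized speech text.
    Returns (start, end) character indices into the prompt, or None.
    """
    n, m = len(prompt_syllables), len(speech_syllables)
    if m == 0:
        return None
    for i in range(n - m + 1):
        if prompt_syllables[i:i + m] == speech_syllables:
            return i, i + m  # characters prompt[i:i+m] are the match
    return None
```

The matched index range would then be mapped back to the prompt text data to pick the target text (or its containing sentence) to identify on the recording page.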
需要说明的是,在视频录制业务中,录制页面中用于显示目标文本的区域可以根据计算机设备的终端屏幕尺寸进行设置,如上述图5所示录制页面40b中的区域40g,区域40g的显示宽度与计算机设备(如用户终端40a)的屏幕宽度相同,区域40g的显示高度小于计算机设备的屏幕高度。当计算机设备的终端屏幕尺寸较大(如台式电脑的显示屏)时,若用于显示目标文本的区域的尺寸宽度与计算机设备的终端屏幕尺寸宽度相同,则在视频录制业务中目标用户观看目标文本的动作(例如,目标用户在观看目标文本时,可能会从终端屏幕的左边移到右边)都会录制下来,造成最终录制视频中目标用户的动作和神情均不自然,进而导致录制视频的质量过低。因此,为了确保录制视频中目标用户的动作表现自然,可以根据计算机设备对应的摄像设备的位置,在视频录制业务的录制页面中确定目标文本对应的文本提示区域,根据目标文本在提示文本数据中的文本位置,在文本提示区域中对目标文本进行标识。换言之,在视频录制业务中,目标用户可以是正面面对镜头,当文本提示区域与计算机设备的摄像设备位于同一个方位时,视频录制业务所录制的视频中目标用户的动作是自然的。
Refer to FIG. 6, which is a schematic diagram of an interface for displaying prompt text data according to an embodiment of this application. As shown in FIG. 6, after user terminal 50a (the aforementioned computer device) determines in the prompt text data the target text corresponding to the user speech ("This weekend, attended the consumption class jointly organized by xx and xx in Changsha"), it can determine, according to the position of camera 50d of terminal device 50a, the text prompt area 50e used to display the target text on recording page 50b of the video recording service; text prompt area 50e and camera 50d are in the same direction. After formal recording begins, the target user's video picture can be displayed in area 50c of recording page 50b, and the video recording duration can be displayed in area 50f of recording page 50b (for example, a recording duration of 00:13).
In the video recording service, the computer device can collect the target user's initial user speech in real time, obtain the speech duration corresponding to the initial user speech and the number of spoken characters it contains, and determine the ratio of the character count to the speech duration as the user speech rate. When the user speech rate is greater than a speech rate threshold (the threshold can be set manually based on actual requirements, for example 500 characters per minute), speech rate prompt information can be displayed on the recording page; the speech rate prompt information can be used to prompt the target user associated with the video recording service to reduce the user speech rate. In other words, the computer device can obtain the target user's user speech rate in real time, and when the user speech rate is greater than the threshold, indicating that the target user is speaking too fast in the video recording service, remind the target user to slow down appropriately.
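The speech-rate check above reduces to a ratio and a threshold comparison; a minimal sketch (the 500 characters-per-minute default follows the example in the text, and the function names are illustrative):

```python
def speech_rate(char_count, duration_seconds):
    """User speech rate in characters per minute."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return char_count * 60.0 / duration_seconds

def needs_slow_down(char_count, duration_seconds, threshold=500):
    """True when the user speech rate exceeds the speech rate threshold,
    i.e. when the speech rate prompt information should be shown."""
    return speech_rate(char_count, duration_seconds) > threshold
```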
Refer to FIG. 7, which is a schematic diagram of an interface for displaying speech rate prompt information according to an embodiment of this application. As shown in FIG. 7, after user terminal 60a (the aforementioned computer device) collects the target user's initial user speech, it can determine the target user's user speech rate from the number of spoken characters contained in the initial user speech and the speech duration. When the target user's user speech rate in the video recording service is too fast (that is, greater than the speech rate threshold), speech rate prompt information 60c can be displayed on recording page 60b of the video recording service (for example, the speech rate prompt information may read "You are currently speaking too fast; to ensure recording quality, please slow down"). In practical applications, the reminder to slow down can also be delivered as a voice broadcast; this embodiment of this application does not limit the presentation form of the speech rate prompt information.
During video recording, the recording page of the video recording service may further include a recording cancel control and a recording complete control. After the target user performs a trigger operation on the recording cancel control on the recording page, the computer device can, in response to the trigger operation on the recording cancel control, cancel the video recording service, delete the video data recorded by the video recording service, generate recording prompt information for the video recording service, and display the recording prompt information on the recording page; the recording prompt information may include a re-record control. After the target user performs a trigger operation on the re-record control, the computer device can, in response, switch the target text displayed on the recording page back to the prompt text data, that is, display the prompt text data in the text input area of the recording page, and restart the video recording service. The recording prompt information may also include a return-to-home control; when the target user performs a trigger operation on the return-to-home control, the computer device can, in response, switch the recording page in the video application to the application home page, meaning that after the current video recording service is canceled, no new video recording service is started for the time being.
After the target user performs a trigger operation on the recording complete control on the recording page, the computer device can, in response to the trigger operation on the recording complete control, stop the video recording service and determine the video data recorded by the video recording service as the completed target video data; that is, the video recording service is stopped before the prompt text data has been fully delivered, and the video recorded before the video recording service is stopped is called the target video data.
Refer to FIG. 8, which is a schematic diagram of an interface for stopping a video recording service according to an embodiment of this application. As shown in FIG. 8, user terminal 70a (the aforementioned computer device) can determine, according to the target user's user speech in the video recording service, the target text corresponding to the user speech in the prompt text data of the video recording service, and mark the target text on recording page 70b; that is, user terminal 70a can scroll the prompt text data according to the progress of the user speech. During video recording, recording page 70b may also display recording cancel control 70c and recording complete control 70d. When the target user performs a trigger operation on recording complete control 70d, user terminal 70a can, in response, stop the video recording service and save the video data recorded in this video recording service, completing this recording. When the target user performs a trigger operation on recording cancel control 70c, user terminal 70a can, in response, cancel the video recording service and delete the video data recorded in this video recording service; user terminal 70a can generate recording prompt information 70e for the target user in the video recording service (for example, the recording prompt information may read "Recorded clips will be cleared. Re-shoot the clip?") and display recording prompt information 70e on recording page 70b. Recording prompt information 70e may include a "Back to home" control and a "Re-shoot" control. When the target user triggers the "Back to home" control, user terminal 70a can exit the video recording service and return from recording page 70b to the application home page of the video application, meaning the target user gives up re-shooting; when the target user triggers the "Re-shoot" control, user terminal 70a can exit the video recording service, return from recording page 70b to the text input area, and display the prompt text data in the text input area, meaning the target user chooses to re-record the video.
S103: Obtain target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data.
In the video recording service, when the text position of the target text in the prompt text data is the end position of the prompt text data, it indicates that the target user has finished the shooting work of the video recording service. Without any operation from the target user, the computer device can automatically end the video recording service, save the video data recorded in the video recording service, and determine the video data recorded in the video recording service as the target video data.
The computer device can determine the video data saved when the video recording service is stopped as the raw video data, enter the clip page of the video application, and display on the clip page the raw video data and the clip optimization control corresponding to the raw video data. The target user can perform a trigger operation on the clip optimization control displayed on the clip page, and the computer device can, in response, display M clip optimization modes for the raw video data, where M is a positive integer (M may be 1, 2, ...). In this embodiment of this application, the M clip optimization modes may include, but are not limited to: removing slips of the tongue (which may be called the first clip mode) and removing slips of the tongue plus pauses between sentences (which may be called the second clip mode). When the target user selects one of the M clip optimization modes, the computer device can, in response to the selection operation on the M clip optimization modes, perform clip optimization on the raw video data according to the clip optimization mode determined by the selection operation, to obtain the target video data corresponding to the video recording service. Understandably, the display area and display size of the raw video data and the target video data on the clip page can be adjusted based on actual requirements; for example, the display area of the raw video data (or target video data) may be at the top, bottom, or middle of the clip page, and the display size of the raw video data (or target video data) may be a 16:9 aspect ratio.
If the clip optimization mode determined by the selection operation is the first clip mode, that is, the target user chooses to remove slips of the tongue, the computer device can obtain the target speech data contained in the raw video data, convert the target speech data into a target text result, perform text comparison between the target text result and the prompt text data, and determine text in the target text result that differs from the prompt text data as error text; the speech data corresponding to the error text is then deleted from the raw video data to obtain the target video data corresponding to the video recording service. During clip optimization of the raw video data, the computer device can use a precise speech-to-text model to perform text conversion on the target speech data contained in the raw video data. The precise speech-to-text model can learn the semantic information in the target speech data: it considers not only the consistency between the pronunciation of the converted text and the user speech, but also the semantics of the user speech, and corrects the converted text using contextual semantic information. The computer device can perform voice activity detection on the target speech data contained in the raw video data to remove noise and silence and obtain the valid speech data in the raw video data, use the precise speech-to-text model to convert the valid speech data into the target text result corresponding to the target speech data, and compare the characters of the target text result one by one against the characters of the prompt text data; the text that differs between the target text result and the prompt text data is determined as error text, which may have been caused by the target user's slips of the tongue during recording in the video recording service. The computer device deletes the speech data corresponding to the error text from the raw video data to obtain the final target video data.
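The error-text step above is a diff between the transcript and the prompt text. A minimal sketch using Python's standard `difflib` (an implementation assumption; the embodiment only specifies a character-by-character comparison):

```python
import difflib

def find_error_texts(transcript, prompt):
    """Return substrings of the transcript that do not match the prompt
    text, i.e. candidate slips of the tongue to cut out."""
    matcher = difflib.SequenceMatcher(a=prompt, b=transcript, autojunk=False)
    errors = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'replace' = misspoken text, 'insert' = extra speech not in prompt.
        if op in ("replace", "insert"):
            errors.append(transcript[j1:j2])
    return errors
```

In the full pipeline, each returned substring would be mapped back to its timestamps in the raw video data so the corresponding speech segment can be deleted.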
If the clip optimization mode determined by the selection operation is the second clip mode, that is, the target user chooses to remove slips of the tongue and pauses between sentences, the computer device can convert the target speech data contained in the raw video data into a target text result, determine text in the target text result that differs from the prompt text data as error text, divide the target text result into N text characters, and obtain the timestamps of the N text characters in the target speech data, where N is a positive integer (N may be 1, 2, ...). The computer device can determine the speech pause segments in the target speech data according to the timestamps, and delete the speech pause segments and the speech data corresponding to the error text from the raw video data to obtain the target video data corresponding to the video recording service. For how the computer device determines the error text, refer to the description of the first clip mode above; details are not repeated here.
The process by which the computer device obtains the speech pause segments may include: the computer device performs word segmentation on the target text result corresponding to the target speech data to obtain N text characters, obtains each character's timestamp in the target speech data (that is, in the raw video data), and derives the time interval between every two adjacent text characters from their respective timestamps. When the time interval between two adjacent text characters is greater than a duration threshold (for example, 1.5 seconds), the speech segment between them can be determined as a speech pause segment; there may be one, several, or zero speech pause segments. For example, suppose the N text characters, in their order in the target text result, are text characters 1 through 6, with timestamps t1 through t6 in the raw video data (text character 5 at t5 and text character 6 at t6). If the computer device calculates that the interval between text characters 2 and 3 is greater than the duration threshold, the speech segment between text characters 2 and 3 is determined as speech pause segment 1; if the interval between text characters 5 and 6 is greater than the duration threshold, the speech segment between text characters 5 and 6 is determined as speech pause segment 2. Deleting from the raw video data the speech corresponding to the error text and the video segments corresponding to speech pause segments 1 and 2 yields the final target video data.
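The pause-detection step above can be sketched directly from the per-character timestamps; the 1.5-second default follows the example in the text, and the data shape (a list of character/timestamp pairs) is an assumption:

```python
def find_pause_segments(char_timestamps, gap_threshold=1.5):
    """char_timestamps: ordered list of (character, timestamp_seconds).
    Returns (start, end) timestamp pairs where the gap between adjacent
    characters exceeds gap_threshold, i.e. candidate pause segments."""
    pauses = []
    for (_, t_prev), (_, t_next) in zip(char_timestamps, char_timestamps[1:]):
        if t_next - t_prev > gap_threshold:
            pauses.append((t_prev, t_next))
    return pauses
```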
Refer to FIG. 9, which is a schematic diagram of an interface for clip-optimizing a recorded video according to an embodiment of this application. As shown in FIG. 9, after the video recording service is completed, the clip page 80b of the video application can be entered, where the video data 80c recorded in the video recording service (the aforementioned raw video data) can be previewed; video data 80c can be displayed on clip page 80b at a 16:9 ratio, and clip page 80b can also display a timeline 80d corresponding to video data 80c. Timeline 80d can include video nodes of video data 80c, through which the target user can quickly locate playback points in video data 80c. Clip page 80b can also display clip optimization control 80e (which may also be called the clip optimization option button). When the target user performs a trigger operation on clip optimization control 80e, user terminal 80a (the computer device) can, in response, pop up selection page 80f on clip page 80b (in this embodiment of this application, the selection page may be an area of the clip page, a sub-page displayed independently of the clip page, a floating page on the clip page, or a page covering the clip page; the presentation form of the selection page is not limited here).
Selection page 80f can display the different clip optimization modes for video data 80c, together with the video durations corresponding to each mode. As shown in FIG. 9, if the target user selects "remove slips of the tongue" (the aforementioned first clip mode) on selection page 80f, the optimized duration of video data 80c is 57 seconds (the original duration of video data 80c being 60 seconds); if the target user selects "remove slips of the tongue and pauses between sentences" (the aforementioned second clip mode), the optimized duration of video data 80c is 50 seconds; if the target user selects no processing, video data 80c is left unchanged. When the target user chooses the "remove slips of the tongue" clip optimization mode, user terminal 80a can perform text conversion on the target speech data in video data 80c to obtain the corresponding target text result, match the target text result against the prompt text data to determine the error text, and delete the speech data corresponding to the error text from video data 80c to obtain the target video data, which here means the video data with the slips of the tongue removed. When the target user chooses the "remove slips of the tongue and pauses between sentences" clip optimization mode, user terminal 80a deletes both the speech data corresponding to the error text and the speech pause segments from video data 80c to obtain the target video data, which here means the video data with the slips of the tongue and the pauses between sentences removed. After obtaining the target video data, the target user can save it or upload it to an information publishing platform so that user terminals on the information publishing platform can watch the target video data.
The error text may include K error sub-texts, where K is a positive integer (K may be 1, 2, ...). The computer device can determine the error frequency in the video recording service according to the K error sub-texts and the video duration corresponding to the raw video data; when the error frequency is greater than an error threshold (for example, 2 errors per minute), it can identify the speech error type corresponding to each of the K error sub-texts, and then push, in the video application, tutorial videos associated with the speech error types to the target user associated with the video recording service. In other words, the computer device can recommend corresponding tutorial videos to the target user in the video application according to the speech error type of the error text, where the speech error types include, but are not limited to: nonstandard Mandarin, mispronunciation, and unclear articulation. For example, if the raw video data is 1 minute long and the target user makes 3 errors in it, the computer device can determine the speech error types of the error sub-texts corresponding to the 3 errors: for the nonstandard Mandarin type, the computer device can push a Mandarin tutorial video to the target user in the video application; for the mispronunciation type, a language tutorial video; and for the unclear articulation type, a voice-over tutorial video.
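The threshold-and-recommend logic above can be sketched as follows; the error-type keys, tutorial names, and the 2-errors-per-minute default are illustrative values following the examples in the text:

```python
# Hypothetical mapping from speech error type to a tutorial category.
TUTORIALS = {
    "nonstandard_mandarin": "Mandarin tutorial video",
    "mispronunciation": "language tutorial video",
    "unclear_articulation": "voice-over tutorial video",
}

def recommend_tutorials(error_types, video_seconds, threshold_per_minute=2):
    """error_types: one entry per error sub-text (K entries).
    Returns tutorial recommendations when the error frequency exceeds
    the threshold, otherwise an empty list."""
    errors_per_minute = len(error_types) * 60.0 / video_seconds
    if errors_per_minute <= threshold_per_minute:
        return []
    return [TUTORIALS[t] for t in set(error_types) if t in TUTORIALS]
```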
Refer to FIG. 10, which is a schematic diagram of an interface for recommending tutorial videos by speech error type according to an embodiment of this application. As shown in FIG. 10, suppose the target user selected the "remove slips of the tongue" clip optimization mode, clip-optimized the raw video data recorded in the video recording service, and obtained the optimized target video data 90c (the recorded video with the slips of the tongue removed). User terminal 90a (the aforementioned computer device) can display target video data 90c on clip page 90b, which can also display a timeline 90d; timeline 90d can include video nodes associated with target video data 90c, and triggering a video node in timeline 90d locates a specific time point in target video data 90c for playback, so the target user can preview target video data 90c on clip page 90b. User terminal 90a can, according to the speech error type of the error text found during clip optimization, push in the video application a tutorial video matching that speech error type to the target user. As shown in FIG. 10, the speech error type of the error text is nonstandard Mandarin, that is, the slips of the tongue were caused by nonstandard Mandarin, so user terminal 90a can obtain a Mandarin teaching tutorial video (a Mandarin tutorial video) in the video application and display the pushed Mandarin tutorial video in area 90e of clip page 90b.
Refer to FIG. 11, which is an implementation flowchart of a video recording service according to an embodiment of this application. As shown in FIG. 11, the implementation process of the video recording service is described using the client and background server of the video application as an example; the client and background server here can be called the computer device. The implementation flow of the video recording service can be realized through S11-S25 below.

S11: Input prompt text data. The target user can open the client of the video application, enter the shooting page of the client, and enter the recording page from the teleprompter shooting entry of the shooting page; the recording page includes a text input area, in which the target user can input prompt text data. After finishing editing the prompt text data, the target user can perform S12, voice start with "start": "start" can serve as a wake word, and when the target user says "start", the client can respond to the user's voice start operation and perform S13, starting the video recording service, that is, entering recording mode.

S14: After entering recording mode, the target user can read the text on the screen aloud (the screen of the terminal device on which the client is installed; the on-screen text at this point may be part of the prompt text data, for example the first two sentences of the prompt text data displayed on entering recording mode). The client can collect the target user's initial user speech, transmit the initial user speech to the background server of the video application, and send a text conversion instruction to the background server. On receiving the initial user speech and the instruction from the client, the background server can perform S15, detecting the initial user speech through voice activity detection (VAD), deleting its noise and silence, and obtaining the user speech corresponding to the target user (the valid speech data). Note that S15 may be performed by the client through a local voice activity detection module, or by the background server using VAD.

S16: The background server can use a fast text conversion model to convert the user speech into text (the user speech text), then perform S17, converting the user speech text into pinyin (in this embodiment of this application, the prompt text data is assumed by default to be Chinese). It can then perform S18: the background server can obtain the prompt text data input by the target user, convert the prompt text data into pinyin, and match the pinyin of the user speech text against the pinyin of the prompt text data; it can then perform S19, finding in the prompt text data the text position matching the user speech and transmitting the text position of the user speech in the prompt text data to the client.

S20: After receiving the text position from the background server, the client can determine the target text corresponding to the user speech from the text position and mark the target text on the client's recording page, that is, scroll the prompt text data according to the text position. When the target user reads the last character of the prompt text data, the client can perform S21, ending the video recording service. The target user can also end the video recording service by triggering the recording complete control or the recording cancel control on the recording page.

After the video recording service ends, the client can transmit the recorded video corresponding to the video recording service (the aforementioned raw video data) to the background server and send a text conversion instruction to the background server. On receiving the text conversion instruction, the background server can perform S22, using a precise text conversion model to convert the speech data contained in the recorded video into text (the target text result), and obtaining when each piece of text occurs in the recorded video, which may also be called the timestamps of the text in the recorded video; the background server can then perform S23 and S24 in parallel.

S23: The background server can compare the target text result with the prompt text data to find the slip-of-the-tongue parts in the recorded video (the speech data corresponding to the aforementioned error text). S24: The background server can use the times at which the text occurs in the recorded video (the timestamps) to find the pause parts in the user speech contained in the recorded video. The background server can transmit both the slip parts and the pause parts of the recorded video to the client. After receiving the slip parts and the pause parts, the client can perform S25, offering the target user different clip optimization modes in the client based on the slip parts and the pause parts; the target user can select a suitable clip optimization mode from those offered by the client, and the client can clip-optimize the recorded video based on the mode selected by the target user to obtain the final target video data.
In this embodiment of this application, after the user inputs prompt text data in the video application, the video recording service can be started by voice, and a teleprompter function is provided for the user during recording in the video recording service; the target text matching the user speech can be located in the prompt text data and marked in the video application, so that the target text displayed in the video application matches what the user is currently saying. This can improve the effectiveness of the text prompt function in the video recording service, reduce the risk of failed recordings caused by the user forgetting lines, and thus improve the quality of the recorded video; starting or stopping the video recording service by the user's voice can reduce user operations in the video recording service and improve the video recording experience; and after the video recording service ends, the recorded video of the video recording service can be automatically clip-optimized, further improving the quality of the recorded video.
Refer to FIG. 12, which is a schematic flowchart of a data processing method according to an embodiment of this application. Understandably, the data processing method can be performed by a computer device, which may be a user terminal, an independent server, a cluster composed of multiple servers, a system composed of a user terminal and a server, or a computer program application (including program code); no specific limitation is made here. As shown in FIG. 12, the data processing method may include the following S201-S203.

S201: Upload prompt text data to a teleprompter application.

The target user can input prompt text data in the teleprompter application, or upload already-edited prompt text data to the teleprompter application. The computer device can, in response to the target user's text input operation or text upload operation, upload the prompt text data to the teleprompter application; that is, to use the teleprompter function provided by the teleprompter application, the prompt text data needs to be uploaded to the teleprompter application. Note that the computer device in this embodiment of this application may be a device on which the teleprompter application is installed, which may also be called a teleprompter.

S202: Collect user speech corresponding to the target user, perform text conversion on the user speech, and generate user speech text corresponding to the user speech.

The computer device can collect the target user's initial user speech, perform voice activity detection on the initial user speech, delete the noise and silence contained in it to obtain the user speech corresponding to the target user (the valid speech data in the initial user speech), perform text conversion on the user speech, and generate the user speech text corresponding to the user speech.

S203: Determine, in the prompt text data, text identical to the user speech text as target text, and mark the target text in the teleprompter application.

The computer device can convert the user speech text into first syllable information and the prompt text data into second syllable information, compare the first syllable information with the second syllable information, determine the text position of the user speech text in the prompt text data, determine from that text position the target text matching the user speech in the prompt text data, and mark the target text in the teleprompter application. For a more detailed description of S202 and S203, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

There may be one or more target users, and different target users may correspond to different prompt text data. When there is one target user, the determination and display of the target text in the teleprompter application can refer to S102 in the embodiment corresponding to FIG. 3 above. When there are multiple target users, after collecting the user speech, the computer device can perform voiceprint recognition on the user speech, determine the user identity corresponding to the collected user speech according to the voiceprint recognition result, determine the target text corresponding to the user speech in the prompt text data corresponding to that user identity, and mark the target text in the teleprompter application. Voiceprint recognition may refer to extracting voiceprint features from the user speech data (for example, spectrum, cepstrum, formants, pitch, and reflection coefficients); by recognizing the voiceprint features, the user identity corresponding to the user speech can be determined, so voiceprint recognition may also be called speaker recognition.
The following description takes two target users as an example, that is, the target users include a first user and a second user; the prompt text data then includes first prompt text corresponding to the first user and second prompt text corresponding to the second user. The computer device can obtain user voiceprint features from the user speech and determine the user identity corresponding to the user speech according to the user voiceprint features. If the user identity is the first user, the text identical to the user speech text is determined as the target text in the first prompt text, and the target text is marked in the teleprompter application; if the user identity is the second user, the text identical to the user speech text is determined as the target text in the second prompt text, and the target text is marked in the teleprompter application. In other words, when there are multiple target users, the user identity corresponding to the user speech must first be determined; the target text matching the user speech can then be determined in that identity's prompt text data and marked, which can improve the effectiveness of the teleprompter function in the teleprompter application.
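The multi-speaker routing above can be sketched as "identify the speaker, then look up that speaker's prompt text". The cosine-similarity comparison of voiceprint feature vectors below is an illustrative assumption; the embodiment does not specify a particular speaker-recognition technique:

```python
import math

def cosine(u, v):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify_speaker(voiceprint, enrolled):
    """enrolled: {user_id: reference_voiceprint}. Returns the user whose
    enrolled voiceprint is most similar to the observed one."""
    return max(enrolled, key=lambda uid: cosine(voiceprint, enrolled[uid]))

def prompt_for_speech(voiceprint, enrolled, prompts):
    """Route recognized speech to the prompt text of the identified user,
    e.g. host A's lines vs. host B's lines."""
    return prompts[identify_speaker(voiceprint, enrolled)]
```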
Refer to FIG. 13, which is a schematic diagram of an application scenario of a teleprompter according to an embodiment of this application. Taking a gala teleprompting scenario as an example to describe the data processing process, as shown in FIG. 13, the hosts' lines 90a (the prompt text data) can be edited in advance and uploaded to the teleprompter (understood as the device on which the aforementioned teleprompter application runs, which can provide a line prompt function for the hosts). Lines 90a can include the lines of host A and the lines of host B; after receiving lines 90a, the teleprompter can save them locally. During the gala, the teleprompter can collect all hosts' speech data in real time; when the teleprompter collects a host's user speech, it can perform voiceprint recognition on the user speech and determine the corresponding user identity according to the voiceprint recognition result. When the identity of the collected user speech is host A, the teleprompter can look up in host A's lines the target text matching the collected user speech (such as "With the warm blessings of winter and hearts full of joy") and mark "With the warm blessings of winter and hearts full of joy" in the teleprompter.

When the identity of the collected user speech is host B, the teleprompter can look up in host B's lines the target text matching the collected user speech (such as "Over the past year, we have given our sweat") and mark "Over the past year, we have given our sweat" in the teleprompter.

In this embodiment of this application, the teleprompter can mark the sentence the target user is currently reading, automatically recognize the target user's speech as the reading progresses, and scroll the prompt text data in the teleprompter, which can improve the effectiveness of the text prompt function in the teleprompter.
Refer to FIG. 14, which is a schematic structural diagram of a data processing apparatus according to an embodiment of this application. The data processing apparatus can perform the steps in the embodiment corresponding to FIG. 3 above. As shown in FIG. 14, the data processing apparatus 1 may include: a start module 101, a display module 102, and an obtaining module 103.

The start module 101 is configured to start a video recording service in a video application in response to a service start operation in the video application.

The display module 102 is configured to collect user speech in the video recording service, determine, in prompt text data associated with the video recording service, target text matching the user speech, and mark the target text.

The obtaining module 103 is configured to obtain target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data.

For the specific function implementations of the start module 101, the display module 102, and the obtaining module 103, refer to S101-S103 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the data processing apparatus 1 may further include: a first recording page display module 104, an editing module 105, a first estimated duration display module 106, a second recording page display module 107, a text upload module 108, and a second estimated duration display module 109.

The first recording page display module 104 is configured to, before the video recording service in the video application is started, display a recording page in the video application in response to a trigger operation on the teleprompter shooting entry in the video application; the recording page includes a text input area.

The editing module 105 is configured to, in response to an information editing operation on the text input area, display in the text input area the prompt text data determined by the information editing operation.

The first estimated duration display module 106 is configured to display, in the text input area, the prompt word count and the estimated video duration corresponding to the prompt text data when the prompt word count corresponding to the prompt text data is greater than a count threshold.

The second recording page display module 107 is configured to, before the video recording service in the video application is started, display a recording page in the video application in response to a trigger operation on the teleprompter shooting entry in the video application; the recording page includes a text upload control and a text input area.

The text upload module 108 is configured to, in response to a trigger operation on the text upload control, determine text content uploaded to the recording page as the prompt text data and display the prompt text data in the text input area.

The second estimated duration display module 109 is configured to display the prompt word count corresponding to the prompt text data and the estimated video duration corresponding to the prompt text data.

For the specific function implementations of the first recording page display module 104, the editing module 105, the first estimated duration display module 106, the second recording page display module 107, the text upload module 108, and the second estimated duration display module 109, refer to S101 in the embodiment corresponding to FIG. 3 above; details are not repeated here. When the first recording page display module 104, the editing module 105, and the first estimated duration display module 106 perform their operations, the second recording page display module 107, the text upload module 108, and the second estimated duration display module 109 suspend theirs, and vice versa. The first recording page display module 104 and the second recording page display module 107 may be merged into one recording page display module; the first estimated duration display module 106 and the second estimated duration display module 109 may be merged into one estimated duration display module.
In some feasible implementations, the service start operation includes a voice start operation, and the start module 101 may include: a countdown animation display unit 1011 and a recording service start unit 1012.

The countdown animation display unit 1011 is configured to, in response to a voice start operation in the video application, display on the recording page of the video application a recording countdown animation associated with the video recording service.

The recording service start unit 1012 is configured to start and execute the video recording service in the video application when the recording countdown animation ends.

For the specific function implementations of the countdown animation display unit 1011 and the recording service start unit 1012, refer to S101 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the recording countdown animation includes an animation cancel control, and the data processing apparatus 1 may further include a countdown animation cancel module 110, configured to, before the video recording service is started and executed when the recording countdown animation ends, cancel display of the recording countdown animation in response to a trigger operation on the animation cancel control, and start and execute the video recording service in the video application. For its specific function implementation, refer to S101 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the display module 102 may include: a voice activity detection unit 1021, a target text determining unit 1022, and a target text display unit 1023.

The voice activity detection unit 1021 is configured to collect initial user speech in the video recording service, perform voice activity detection on the initial user speech to obtain the valid speech data in the initial user speech, and determine the valid speech data as the user speech.

The target text determining unit 1022 is configured to convert the user speech into user speech text, perform text matching between the user speech text and the prompt text data associated with the video recording service, and determine, in the prompt text data, the target text matching the user speech text.

The target text display unit 1023 is configured to mark the target text on the recording page of the video recording service.

For the specific function implementations of the voice activity detection unit 1021, the target text determining unit 1022, and the target text display unit 1023, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the target text determining unit 1022 may include: a syllable information obtaining subunit 10221, configured to obtain first syllable information of the user speech text and second syllable information of the prompt text data associated with the video recording service; and a syllable matching subunit 10222, configured to obtain, in the second syllable information, target syllable information identical to the first syllable information, and determine in the prompt text data the target text corresponding to the target syllable information. For their specific function implementations, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the target text display unit 1023 may include: a prompt area determining subunit 10231, configured to determine, on the recording page of the video recording service, a text prompt area corresponding to the target text; and a marking subunit 10232, configured to mark the target text in the text prompt area according to the text position of the target text in the prompt text data. For their specific function implementations, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the recording page includes a recording cancel control, and the data processing apparatus 1 may further include: a recording cancel module 111, configured to cancel the video recording service and delete the video data recorded by the video recording service in response to a trigger operation on the recording cancel control; a recording prompt information display module 112, configured to generate recording prompt information for the video recording service and display it on the recording page, the recording prompt information including a re-record control; and a re-record module 113, configured to switch the target text displayed on the recording page to the prompt text data in response to a trigger operation on the re-record control. For their specific function implementations, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the recording page includes a recording complete control, and the data processing apparatus 1 may include a recording complete module 114, configured to, before the target video data corresponding to the video recording service is obtained when the text position of the target text in the prompt text data is the end position of the prompt text data, stop the video recording service in response to a trigger operation on the recording complete control, and determine the video data recorded by the video recording service as the target video data. For its specific function implementation, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.
In some feasible implementations, the obtaining module 103 may include: a raw video obtaining unit 1031, an optimization control display unit 1032, an optimization mode display unit 1033, and an optimization processing unit 1034.

The raw video obtaining unit 1031 is configured to stop the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data, and determine the video data recorded by the video recording service as raw video data.

The optimization control display unit 1032 is configured to display, on a clip page of the video application, the raw video data and a clip optimization control corresponding to the raw video data.

The optimization mode display unit 1033 is configured to display M clip optimization modes for the raw video data in response to a trigger operation on the clip optimization control, where M is a positive integer.

The optimization processing unit 1034 is configured to, in response to a selection operation on the M clip optimization modes, perform clip optimization on the raw video data according to the clip optimization mode determined by the selection operation, to obtain the target video data.

For the specific function implementations of the raw video obtaining unit 1031, the optimization control display unit 1032, the optimization mode display unit 1033, and the optimization processing unit 1034, refer to S103 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the optimization processing unit 1034 may include: a first speech conversion subunit 10341, a text comparison subunit 10342, a speech deletion subunit 10343, a second speech conversion subunit 10344, a timestamp obtaining subunit 10345, and a speech pause segment determining subunit 10346.

The first speech conversion subunit 10341 is configured to, if the clip optimization mode determined by the selection operation is the first clip mode, obtain the target speech data contained in the raw video data and convert the target speech data into a target text result.

The text comparison subunit 10342 is configured to perform text comparison between the target text result and the prompt text data, and determine text in the target text result that differs from the prompt text data as error text.

The speech deletion subunit 10343 is configured to delete the speech data corresponding to the error text from the raw video data to obtain the target video data.

The second speech conversion subunit 10344 is configured to, if the clip optimization mode determined by the selection operation is the second clip mode, convert the target speech data contained in the raw video data into a target text result, and determine text in the target text result that differs from the prompt text data as error text.

The timestamp obtaining subunit 10345 is configured to divide the target text result into N text characters and obtain the timestamps of the N text characters in the target speech data, where N is a positive integer.

The speech pause segment determining subunit 10346 is configured to determine speech pause segments in the target speech data according to the timestamps, and delete the speech pause segments and the speech data corresponding to the error text from the raw video data to obtain the target video data.

For the specific function implementations of subunits 10341-10346, refer to S103 in the embodiment corresponding to FIG. 3 above; details are not repeated here. When the first speech conversion subunit 10341, the text comparison subunit 10342, and the speech deletion subunit 10343 perform their operations, the second speech conversion subunit 10344, the timestamp obtaining subunit 10345, and the speech pause segment determining subunit 10346 suspend theirs, and vice versa.

In some feasible implementations, the data processing apparatus 1 may further include: a user speech rate determining module 115, configured to obtain the speech duration corresponding to the initial user speech and the number of spoken characters contained in it, and determine the ratio of the character count to the speech duration as the user speech rate; and a speech rate prompt information display module 116, configured to display speech rate prompt information on the recording page when the user speech rate is greater than a speech rate threshold, the speech rate prompt information being used to prompt the target user associated with the video recording service to reduce the user speech rate. For their specific function implementations, refer to S102 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In some feasible implementations, the error text includes K error sub-texts, K being a positive integer, and the data processing apparatus 1 may further include: an error frequency determining module 117, configured to determine the error frequency in the video recording service according to the K error sub-texts and the video duration corresponding to the raw video data; an error type identification module 118, configured to identify the speech error types respectively corresponding to the K error sub-texts when the error frequency is greater than an error threshold; and a tutorial video push module 119, configured to push, in the video application, tutorial videos associated with the speech error types to the target user associated with the video recording service. For their specific function implementations, refer to S103 in the embodiment corresponding to FIG. 3 above; details are not repeated here.

In this embodiment of this application, after the user inputs prompt text data in the video application, the video recording service can be started by voice, and a teleprompter function is provided for the user during recording; the target text matching the user speech can be located in the prompt text data and marked in the video application, so that the displayed target text matches what the user is currently saying, improving the effectiveness of the text prompt function in the video recording service, reducing the risk of failed recordings caused by forgotten lines, and thus improving the quality of the recorded video; starting or stopping the video recording service by the user's voice reduces user operations and improves the video recording experience; and after the video recording service ends, the recorded video can be automatically clip-optimized, further improving the quality of the recorded video.
Refer to FIG. 15, which is a schematic structural diagram of a data processing apparatus according to an embodiment of this application. The data processing apparatus can perform the steps in the embodiment corresponding to FIG. 12 above. As shown in FIG. 15, the data processing apparatus 2 may include: a prompt text upload module 21, a user speech collection module 22, and a user speech text display module 23.

The prompt text upload module 21 is configured to upload prompt text data to a teleprompter application.

The user speech collection module 22 is configured to collect user speech corresponding to a target user, perform text conversion on the user speech, and generate user speech text corresponding to the user speech.

The user speech text display module 23 is configured to determine, in the prompt text data, text identical to the user speech text as target text, and mark the target text in the teleprompter application.

For the specific implementations of the prompt text upload module 21, the user speech collection module 22, and the user speech text display module 23, refer to S201-S203 in the embodiment corresponding to FIG. 12 above; details are not repeated here.

The target user may include a first user and a second user, and the prompt text data may include first prompt text corresponding to the first user and second prompt text corresponding to the second user. The user speech text display module 23 may include: a user identity determining unit 231, configured to obtain user voiceprint features from the user speech and determine the user identity corresponding to the user speech according to the user voiceprint features; a first determining unit 232, configured to, if the user identity is the first user, determine in the first prompt text the text identical to the user speech text as target text and mark the target text in the teleprompter application; and a second determining unit 233, configured to, if the user identity is the second user, determine in the second prompt text the text identical to the user speech text as target text and mark the target text in the teleprompter application. For the specific implementations of the user identity determining unit 231, the first determining unit 232, and the second determining unit 233, refer to S203 in the embodiment corresponding to FIG. 12 above; details are not repeated here.

In this embodiment of this application, the teleprompter can mark the sentence the target user is currently reading, automatically recognize the target user's speech as the reading progresses, and scroll the prompt text data in the teleprompter, which can improve the effectiveness of the text prompt function in the teleprompter.
Refer to FIG. 16, which is a schematic structural diagram of a computer device according to an embodiment of this application. As shown in FIG. 16, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is configured to implement connection and communication among these components. The user interface 1003 may include a display and a keyboard, and optionally may further include standard wired and wireless interfaces. The network interface 1004 may include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example at least one magnetic disk memory; the memory 1005 may alternatively be at least one storage apparatus located away from the processor 1001. As shown in FIG. 16, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in FIG. 16, the network interface 1004 can provide a network communication function, the user interface 1003 is mainly configured to provide an input interface for the user, and the processor 1001 can be configured to invoke the device control application program stored in the memory 1005 to implement:

starting a video recording service in a video application in response to a service start operation in the video application;

collecting user speech in the video recording service, determining, in prompt text data associated with the video recording service, target text matching the user speech, and marking the target text; and

obtaining target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data.

It should be understood that the computer device 1000 described in this embodiment of this application can perform the description of the data processing method in the embodiment corresponding to FIG. 3 above, and can also perform the description of the data processing apparatus 1 in the embodiment corresponding to FIG. 14 above; details are not repeated here, nor is the description of the beneficial effects of the same method.
Refer to FIG. 17, which is a schematic structural diagram of a computer device according to an embodiment of this application. As shown in FIG. 17, the computer device 2000 may include a processor 2001, a network interface 2004, and a memory 2005, and may further include a user interface 2003 and at least one communication bus 2002. The communication bus 2002 is configured to implement connection and communication among these components. The user interface 2003 may include a display and a keyboard, and optionally may further include standard wired and wireless interfaces. The network interface 2004 may include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 2005 may be a high-speed RAM memory or a non-volatile memory, for example at least one magnetic disk memory; the memory 2005 may alternatively be at least one storage apparatus located away from the processor 2001. As shown in FIG. 17, the memory 2005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 2000 shown in FIG. 17, the network interface 2004 can provide a network communication function, the user interface 2003 is mainly configured to provide an input interface for the user, and the processor 2001 can be configured to invoke the device control application program stored in the memory 2005 to implement:

uploading prompt text data to a teleprompter application;

collecting user speech corresponding to a target user, performing text conversion on the user speech, and generating user speech text corresponding to the user speech; and

determining, in the prompt text data, text identical to the user speech text as target text, and marking the target text in the teleprompter application.

It should be understood that the computer device 2000 described in this embodiment of this application can perform the description of the data processing method in the embodiment corresponding to FIG. 12 above, and can also perform the description of the data processing apparatus 2 in the embodiment corresponding to FIG. 15 above; details are not repeated here, nor is the description of the beneficial effects of the same method.
In addition, it should be pointed out here that an embodiment of this application further provides a computer-readable storage medium storing the computer program executed by the aforementioned data processing apparatus 1; the computer program includes program instructions which, when executed by a processor, can perform the description of the data processing method in any of the embodiments corresponding to FIG. 3, FIG. 11, and FIG. 12 above; details are therefore not repeated here, nor is the description of the beneficial effects of the same method. For technical details not disclosed in this computer-readable storage medium embodiment, refer to the description of the method embodiments of this application. As an example, the program instructions may be deployed for execution on one computing device, on multiple computing devices at one location, or on multiple computing devices distributed across multiple locations and interconnected by a communication network; multiple computing devices distributed across multiple locations and interconnected by a communication network can form a blockchain system.

In addition, it should be noted that an embodiment of this application further provides a computer program product or computer program, which may include computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the description of the data processing method in any of the embodiments corresponding to FIG. 3, FIG. 11, and FIG. 12 above; details are therefore not repeated here, nor is the description of the beneficial effects of the same method. For technical details not disclosed in this computer program product or computer program embodiment, refer to the description of the method embodiments of this application.

A person of ordinary skill in the art can understand that all or part of the processes of the foregoing method embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a computer-readable storage medium and, when executed, can include the processes of the foregoing method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

What is disclosed above is merely preferred embodiments of this application and is certainly not intended to limit the scope of the claims of this application; therefore, equivalent changes made according to the claims of this application still fall within the scope covered by this application.

Claims (22)

  1. A data processing method, the method being performed by a computer device and comprising:
    starting a video recording service in a video application in response to a service start operation in the video application;
    collecting user speech in the video recording service, determining, in prompt text data associated with the video recording service, target text matching the user speech, and marking the target text; and
    obtaining target video data corresponding to the video recording service when a text position of the target text in the prompt text data is an end position of the prompt text data.
  2. The method according to claim 1, wherein before the starting a video recording service in the video application, the method further comprises:
    displaying a recording page in the video application in response to a trigger operation on a teleprompter shooting entry in the video application, the recording page comprising a text input area;
    displaying, in the text input area, prompt text data determined by an information editing operation in response to the information editing operation on the text input area; and
    displaying, in the text input area, a prompt word count and an estimated video duration corresponding to the prompt text data when the prompt word count corresponding to the prompt text data is greater than a count threshold.
  3. The method according to claim 1, wherein before the starting a video recording service in the video application, the method further comprises:
    displaying a recording page in the video application in response to a trigger operation on a teleprompter shooting entry in the video application, the recording page comprising a text upload control and a text input area;
    determining text content uploaded to the recording page as prompt text data in response to a trigger operation on the text upload control, and displaying the prompt text data in the text input area; and
    displaying a prompt word count corresponding to the prompt text data and an estimated video duration corresponding to the prompt text data.
  4. The method according to claim 1, wherein the service start operation comprises a voice start operation, and the starting a video recording service in the video application in response to a service start operation in the video application comprises:
    displaying, on a recording page of the video application, a recording countdown animation associated with the video recording service in response to the voice start operation in the video application; and
    starting and executing the video recording service in the video application when the recording countdown animation ends.
  5. The method according to claim 4, wherein the recording countdown animation comprises an animation cancel control, and before the starting and executing the video recording service in the video application when the recording countdown animation ends, the method further comprises:
    canceling display of the recording countdown animation in response to a trigger operation on the animation cancel control, and starting and executing the video recording service in the video application.
  6. The method according to claim 1, wherein the collecting user speech in the video recording service, determining, in prompt text data associated with the video recording service, target text matching the user speech, and marking the target text comprises:
    collecting initial user speech in the video recording service, performing voice activity detection on the initial user speech to obtain valid speech data in the initial user speech, and determining the valid speech data as the user speech;
    converting the user speech into user speech text, performing text matching between the user speech text and the prompt text data associated with the video recording service, and determining, in the prompt text data, the target text matching the user speech text; and
    marking the target text on a recording page of the video recording service.
  7. The method according to claim 6, wherein the performing text matching between the user speech text and the prompt text data associated with the video recording service, and determining, in the prompt text data, the target text matching the user speech text comprises:
    obtaining first syllable information of the user speech text, and obtaining second syllable information of the prompt text data associated with the video recording service; and
    obtaining, in the second syllable information, target syllable information identical to the first syllable information, and determining, in the prompt text data, the target text corresponding to the target syllable information.
  8. The method according to claim 6, wherein the marking the target text on a recording page of the video recording service comprises:
    determining, on the recording page of the video recording service, a text prompt area corresponding to the target text; and
    marking the target text in the text prompt area according to the text position of the target text in the prompt text data.
  9. The method according to any one of claims 4-8, wherein the recording page comprises a recording cancel control, and the method further comprises:
    canceling the video recording service and deleting video data recorded by the video recording service in response to a trigger operation on the recording cancel control;
    generating recording prompt information for the video recording service, and displaying the recording prompt information on the recording page, the recording prompt information comprising a re-record control; and
    switching the target text displayed on the recording page to the prompt text data in response to a trigger operation on the re-record control.
  10. The method according to any one of claims 4-8, wherein the recording page comprises a recording complete control, and before the obtaining target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data, the method further comprises:
    stopping the video recording service in response to a trigger operation on the recording complete control, and determining video data recorded by the video recording service as the target video data.
  11. The method according to claim 1, wherein the obtaining target video data corresponding to the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data comprises:
    stopping the video recording service when the text position of the target text in the prompt text data is the end position of the prompt text data, and determining video data recorded by the video recording service as raw video data;
    displaying, on a clip page of the video application, the raw video data and a clip optimization control corresponding to the raw video data;
    displaying M clip optimization modes for the raw video data in response to a trigger operation on the clip optimization control, M being a positive integer; and
    performing, in response to a selection operation on the M clip optimization modes, clip optimization on the raw video data according to a clip optimization mode determined by the selection operation, to obtain the target video data.
  12. The method according to claim 11, wherein the performing clip optimization on the raw video data according to a clip optimization mode determined by the selection operation, to obtain the target video data comprises:
    obtaining, if the clip optimization mode determined by the selection operation is a first clip mode, target speech data contained in the raw video data, and converting the target speech data into a target text result;
    performing text comparison between the target text result and the prompt text data, and determining text in the target text result that is different from the prompt text data as error text; and
    deleting speech data corresponding to the error text from the raw video data to obtain the target video data.
  13. The method according to claim 11, wherein the performing clip optimization on the raw video data according to a clip optimization mode determined by the selection operation, to obtain the target video data comprises:
    converting, if the clip optimization mode determined by the selection operation is a second clip mode, target speech data contained in the raw video data into a target text result, and determining text in the target text result that is different from the prompt text data as error text;
    dividing the target text result into N text characters, and obtaining timestamps of the N text characters in the target speech data, N being a positive integer; and
    determining speech pause segments in the target speech data according to the timestamps, and deleting the speech pause segments and speech data corresponding to the error text from the raw video data to obtain the target video data.
  14. The method according to claim 6, wherein during execution of the video recording service, the method further comprises:
    obtaining a speech duration corresponding to the initial user speech and a number of spoken characters contained in the initial user speech, and determining a ratio of the number of spoken characters to the speech duration as a user speech rate; and
    displaying speech rate prompt information on the recording page when the user speech rate is greater than a speech rate threshold, the speech rate prompt information being used to prompt a target user associated with the video recording service to reduce the user speech rate.
  15. The method according to any one of claims 12-13, wherein the error text comprises K error sub-texts, K being a positive integer, and the method further comprises:
    determining an error frequency in the video recording service according to the K error sub-texts and a video duration corresponding to the raw video data;
    identifying speech error types respectively corresponding to the K error sub-texts when the error frequency is greater than an error threshold; and
    pushing, in the video application, a tutorial video associated with the speech error types to a target user associated with the video recording service.
  16. A data processing method, the method being performed by a computer device and comprising:
    uploading prompt text data to a teleprompter application;
    collecting user speech corresponding to a target user, performing text conversion on the user speech, and generating user speech text corresponding to the user speech; and
    determining, in the prompt text data, text identical to the user speech text as target text, and marking the target text in the teleprompter application.
  17. The method according to claim 16, wherein the target user comprises a first user and a second user, the prompt text data comprises first prompt text corresponding to the first user and second prompt text corresponding to the second user, and the determining, in the prompt text data, text identical to the user speech text as target text, and marking the target text in the teleprompter application comprises:
    obtaining user voiceprint features from the user speech, and determining a user identity corresponding to the user speech according to the user voiceprint features;
    determining, if the user identity is the first user, text identical to the user speech text in the first prompt text as target text, and marking the target text in the teleprompter application; and
    determining, if the user identity is the second user, text identical to the user speech text in the second prompt text as target text, and marking the target text in the teleprompter application.
  18. A data processing apparatus, the apparatus being deployed on a computer device and comprising:
    a start module, configured to start a video recording service in a video application in response to a service start operation in the video application;
    a display module, configured to collect user speech in the video recording service, determine, in prompt text data associated with the video recording service, target text matching the user speech, and mark the target text; and
    an obtaining module, configured to obtain target video data corresponding to the video recording service when a text position of the target text in the prompt text data is an end position of the prompt text data.
  19. A data processing apparatus, the apparatus being deployed on a computer device and comprising:
    a prompt text upload module, configured to upload prompt text data to a teleprompter application;
    a user speech collection module, configured to collect user speech corresponding to a target user, perform text conversion on the user speech, and generate user speech text corresponding to the user speech; and
    a user speech text display module, configured to determine, in the prompt text data, text identical to the user speech text as target text, and mark the target text in the teleprompter application.
  20. A computer device, comprising a memory and a processor;
    the memory being connected to the processor, the memory being configured to store a computer program, and the processor being configured to invoke the computer program to cause the computer device to perform the method according to any one of claims 1 to 15, or to perform the method according to any one of claims 16 to 17.
  21. A computer-readable storage medium, storing a computer program adapted to be loaded and executed by a processor, to cause a computer device having the processor to perform the method according to any one of claims 1 to 15, or to perform the method according to any one of claims 16 to 17.
  22. A computer program product, configured to, when executed, perform the method according to any one of claims 1 to 15, or to perform the method according to any one of claims 16 to 17.
PCT/CN2022/074513 2021-02-08 2022-01-28 Data processing method, apparatus, device and medium WO2022166801A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023547594A JP2024509710A (ja) 2021-02-08 2022-01-28 データ処理方法、装置、機器、及びコンピュータプログラム
KR1020237019353A KR20230106170A (ko) 2021-02-08 2022-01-28 데이터 처리 방법 및 장치, 디바이스, 및 매체
US17/989,620 US12041313B2 (en) 2021-02-08 2022-11-17 Data processing method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110179007.4 2021-02-08
CN202110179007.4A CN114911448A (zh) 2021-02-08 2021-02-08 数据处理方法、装置、设备以及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/989,620 Continuation US12041313B2 (en) 2021-02-08 2022-11-17 Data processing method and apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2022166801A1 true WO2022166801A1 (zh) 2022-08-11

Family

ID=82741977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074513 WO2022166801A1 (zh) 2021-02-08 2022-01-28 数据处理方法、装置、设备以及介质

Country Status (5)

Country Link
US (1) US12041313B2 (zh)
JP (1) JP2024509710A (zh)
KR (1) KR20230106170A (zh)
CN (1) CN114911448A (zh)
WO (1) WO2022166801A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102385176B1 (ko) * 2021-11-16 2022-04-14 주식회사 하이 심리 상담 장치 및 그 방법
CN117975949B (zh) * 2024-03-28 2024-06-07 杭州威灿科技有限公司 基于语音转换的事件记录方法、装置、设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061256A1 (en) * 2016-01-25 2018-03-01 Wespeke, Inc. Automated digital media content extraction for digital lesson generation
CN111372119A (zh) * 2020-04-17 2020-07-03 维沃移动通信有限公司 多媒体数据录制方法、装置及电子设备
CN112738618A (zh) * 2020-12-28 2021-04-30 北京达佳互联信息技术有限公司 视频录制方法、装置及电子设备
CN113301362A (zh) * 2020-10-16 2021-08-24 阿里巴巴集团控股有限公司 视频元素展示方法及装置


Also Published As

Publication number Publication date
CN114911448A (zh) 2022-08-16
US20230109852A1 (en) 2023-04-13
US12041313B2 (en) 2024-07-16
JP2024509710A (ja) 2024-03-05
KR20230106170A (ko) 2023-07-12

Similar Documents

Publication Publication Date Title
CN110634483B (zh) Human-computer interaction method and apparatus, electronic device, and storage medium
US12020708B2 (en) Method and system for conversation transcription with metadata
US10769495B2 (en) Collecting multimodal image editing requests
WO2022166801A1 (zh) Data processing method, apparatus, device and medium
US20220343918A1 (en) Systems and methods for live broadcasting of context-aware transcription and/or other elements related to conversations and/or speeches
JP4875752B2 (ja) Speech recognition in editable audio streams
CN111050201B (zh) Data processing method and apparatus, electronic device, and storage medium
CN109859298B (zh) Image processing method and apparatus, device, and storage medium
JP6280312B2 (ja) Minutes recording device, minutes recording method, and program
CN112188266A (zh) Video generation method and apparatus, and electronic device
CN113343675B (zh) Subtitle generation method and apparatuses for generating subtitles
TWI807428B (zh) Method, system, and computer-readable recording medium for jointly managing text conversion records and memos related to audio files
CN112581965A (zh) Transcription method and apparatus, recording pen, and storage medium
CN114154459A (zh) Speech recognition text processing method and apparatus, electronic device, and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
WO2023213314A1 (zh) Method, apparatus, device, and storage medium for editing audio
CN112837668B (zh) Speech processing method and apparatuses for processing speech
CN115811665A (zh) Video generation method and apparatus, terminal device, and storage medium
WO2021017302A1 (zh) Data extraction method and apparatus, computer system, and readable storage medium
KR102599001B1 (ko) Template-based meeting document generation device and method
KR102446300B1 (ko) Method, system, and computer-readable recording medium for improving speech recognition rate for voice recording
KR20170130198A (ko) Mobile-based real-time script reading system and method
KR20170088255A (ko) System and method for providing electronic scripts for online actor script reading
CN113918114A (zh) Document control method and apparatus, computer device, and storage medium
CN118675555A (zh) Audio processing method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22749080

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20237019353

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2023547594

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08/12/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 22749080

Country of ref document: EP

Kind code of ref document: A1