WO2023212920A1 - Multi-modal rapid transliteration and annotation system based on self-built template - Google Patents

Info

Publication number
WO2023212920A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
file
axis control
data
boundary
Prior art date
Application number
PCT/CN2022/091181
Other languages
French (fr)
Chinese (zh)
Inventor
李斌
Original Assignee
湖南师范大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 湖南师范大学 filed Critical 湖南师范大学
Priority to PCT/CN2022/091181 priority Critical patent/WO2023212920A1/en
Priority to CN202280002307.8A priority patent/CN115136233B/en
Publication of WO2023212920A1 publication Critical patent/WO2023212920A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/451: Execution arrangements for user interfaces

Definitions

  • This application relates to the field of speech processing technology, and specifically relates to a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium.
  • Speech recognition technology identifies the corresponding speech content from the collected speech information, that is, recognizes digital speech information into corresponding text.
  • Speech transcription technology is used to convert speech into written text.
  • Speech transcription can be used for simple single-person speech transcription, or for complex multi-person speech transcription, such as conference speech transcription, court hearing speech transcription, classroom transcription, etc.
  • Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium, which provide a simple and convenient speech transcription and annotation method. The method realizes speech transcription and annotation through self-built language templates and supports rapid merging of segments and fine-tuning of boundaries, which improves the efficiency of transcription and annotation and adapts it to the usage needs of the various scenarios mentioned above.
  • a multi-modal rapid transcription and annotation method based on a self-built template includes: obtaining the project engineering file corresponding to the media file to be processed; obtaining the audio data of the media file according to the directory of the project engineering file; segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data; displaying the segment data of the audio data on an operation interface, where the operation interface is used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performing speech recognition processing on the processed segment data to obtain a transcribed text; updating the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, displaying the text fragments in the media file and the transcribed text that correspond to the playback progress of the media file.
  • In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a first editing operation on the movable end of the first boundary axis control of the active segment in the segment data, controlling the movable end of the first boundary axis control to move to a first position; determining whether there is, at the first position, a second boundary axis control that overlaps the movable end of the first boundary axis control, where the second boundary axis control is the boundary axis control corresponding to a second segment, and the active segment and the second segment are adjacent segments; and, if such a second boundary axis control exists at the first position, merging the active segment with the second segment.
  • In some embodiments, the method further includes: if there is no second boundary axis control overlapping the movable end of the first boundary axis control at the first position, adjusting the boundary of the active segment according to the first position.
  • In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a second editing operation on the movable end of the first boundary axis control of the active segment in the segment data, controlling the movable end of the first boundary axis control to move to a second position; determining whether there is, at the second position, a third boundary axis control that overlaps the movable end of the first boundary axis control, where the third boundary axis control is the boundary axis control corresponding to a third segment, and the active segment and the third segment are non-adjacent segments; and, if such a third boundary axis control exists at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
  • In some embodiments, the method further includes: if there is no third boundary axis control overlapping the movable end of the first boundary axis control at the second position, determining whether the target area between the stationary end position of the first boundary axis control and the second position overlaps any of the intermediate segments; if the target area does not overlap any of the intermediate segments, adjusting the boundary of the active segment according to the second position; or, if the target area overlaps at least one of the intermediate segments, merging the active segment with all intermediate segments that overlap the target area.
  • In some embodiments, performing segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data includes: performing segmentation processing on the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data, to obtain segment data of the audio data.
  • In some embodiments, performing segmentation processing on the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data includes: obtaining initial segment data of the audio data; determining whether the average amplitude in the current segment of the initial segment data is greater than the noise amplitude threshold; if it is greater, marking the current segment as a voiced segment; trimming the audio points in the current segment marked as a voiced segment from the segment starting point and the segment end point to remove silence or noise in the current segment; if the starting position of the current segment after trimming is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the starting position of the current segment after trimming is not the same as the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segment data of the audio data in this way to obtain the segment data of the audio data.
  • obtaining the initial segment data of the audio data includes: performing initial segment processing on the audio data according to a preset language template to obtain the initial segment data of the audio data.
  • In some embodiments, obtaining the project engineering file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if no corresponding project engineering file has been created, creating the project engineering file corresponding to the media file based on the template file; or, if a corresponding project engineering file has already been created, obtaining that created project engineering file.
  • In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting an export file corresponding to the target file type from the project engineering file, where the target file type is any one of the preset file types.
  • In some embodiments, the method further includes: in response to an import instruction, obtaining the imported file; and, when the file type of the imported file is any one of the preset file types, importing the imported file into the project engineering file.
  • In some embodiments, displaying the segment data of the audio data on the operation interface includes: displaying, on the operation interface, the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information.
  • In some embodiments, the method further includes: in response to a hide waveform instruction, hiding the segment waveform information and the timeline information on the operation interface. In some embodiments, the method further includes: in response to an insert breakpoint operation for a target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so as to split the target segment at the breakpoint.
  • In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data. After performing speech recognition processing on the processed segment data to obtain the transcribed text, the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain a modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.
  • the method further includes: responding to annotation instructions for the target text fragment, annotating the target text fragment to obtain annotated transcribed text.
  • a multi-modal rapid transcription and annotation system based on self-built templates includes:
  • the first acquisition unit is used to acquire the project engineering file corresponding to the media file to be processed
  • the second acquisition unit is used to acquire the audio data of the media file according to the directory of the project engineering file
  • a segmentation unit configured to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data
  • a display unit configured to display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and boundary axis controls;
  • a processing unit configured to perform boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, to obtain processed segment data
  • a transliteration unit used to perform speech recognition processing on the processed segment data to obtain transcribed text
  • An update unit configured to update the project engineering file according to the transcribed text to obtain an updated project engineering file, where the updated project engineering file carries the transcribed text;
  • a playback unit configured to display text segments in the media file and the transcribed text corresponding to the playback progress of the media file when the updated project engineering file is played on the display interface.
  • a computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the steps in the multi-modal fast transcription and annotation method based on the self-built template described in the first aspect.
  • a terminal device includes a processor and a memory. A computer program is stored in the memory, and the processor executes the steps in the multi-modal fast transcription and annotation method based on the self-built template described in the first aspect by calling the computer program stored in the memory.
  • Embodiments of the present application provide a multi-modal fast transcribing and annotating method based on a self-built template, a multi-modal fast transcribing and annotating system based on a self-built template, and a storage medium. The method obtains the project engineering file corresponding to the media file to be processed; obtains the audio data of the media file according to the directory of the project engineering file; segments the audio data according to the amplitude of the audio data to obtain the segment data of the audio data; displays the segment data of the audio data on the operation interface, which is used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performs boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performs speech recognition processing on the processed segment data to obtain the transcribed text; updates the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, displays the text fragments in the media file and the transcribed text that correspond to the playback progress of the media file.
  • Embodiments of the present application can provide a simple and convenient speech transcription and annotation method, which can realize transcription of multiple kinds of speech through self-built multi-language templates, and can support template import for the many languages or dialects for which speech recognition cannot be performed.
  • Horizontal dragging on the boundary axis control enables fine-tuning of the boundary, which improves the efficiency of speech transcription and annotation and adapts it to the usage needs of the various scenarios mentioned above.
  • Figure 1 is a schematic flowchart of a multi-modal rapid transcription and annotation method based on a self-built template provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the first application scenario provided by the embodiment of the present application.
  • Figure 3 is a schematic diagram of the second application scenario provided by the embodiment of the present application.
  • Figure 4 is a schematic diagram of the third application scenario provided by the embodiment of the present application.
  • Figure 5 is a schematic diagram of the fourth application scenario provided by the embodiment of the present application.
  • Figure 6 is a schematic diagram of the fifth application scenario provided by the embodiment of the present application.
  • Figure 7 is a schematic diagram of the sixth application scenario provided by the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a multi-modal fast transcription and annotation system based on self-built templates provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Embodiments of the present application provide a multi-modal fast transcribing and annotating method based on a self-built template, a multi-modal fast transcribing and annotating system based on a self-built template, and a storage medium.
  • the multi-modal fast transcription and annotation method based on the self-built template in the embodiment of the present application can be executed by a terminal device, where the terminal device can be a terminal or a server.
  • the terminal can be a terminal device such as a smartphone, a tablet, a touch screen, a personal computer (Personal Computer, PC), etc.
  • the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. It can also provide cloud services, cloud databases, cloud storage, network services, cloud communications, middleware services, domain name services, Cloud servers for security services, content distribution network services, and basic cloud computing services such as big data and artificial intelligence platforms, but are not limited to these.
  • Figure 1 is a schematic flowchart of a multi-modal fast transcription and annotation method based on self-built templates provided by an embodiment of the present application, and Figures 2 to 7 are schematic diagrams of application scenarios provided by an embodiment of the present application.
  • The multi-modal fast transcription and annotation method based on the self-built template of the embodiment of the present application can be applied to the multi-modal fast transcription and annotation system based on the self-built template of the embodiment of the present application. The system can be configured on the terminal device.
  • The terminal device may be a terminal, and the method includes the following steps:
  • Step 110 Obtain the project engineering file corresponding to the media file to be processed.
  • In some embodiments, obtaining the project engineering file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if no corresponding project engineering file has been created, creating a project engineering file corresponding to the media file based on the template file; or, if a corresponding project engineering file has already been created, obtaining that created project engineering file.
  • the media file may be an audio file or a video file.
  • The target client can be a multi-modal rapid transcription and annotation system based on a self-built template. It is tool software developed specifically for rapid transcription and annotation of audio and video language materials.
  • The software can have built-in multi-language templates for Mandarin, Chinese dialects, minority languages, and so on, which directly support the discourse transcription of the Chinese Language Resource Protection Project.
  • the multi-language template can be a multi-layer annotation template.
  • multi-language templates can be built according to project needs. For example, language transliteration annotation templates corresponding to different languages can also be built-in.
  • The target client can also be used in many application scenarios, such as the production of video subtitles (*.SRT), the production of mp3 music plug-in lyrics (*.LRC), the transcription of various recordings, language listening teaching, audio-visual teaching, spoken language corpus construction, multimedia resource library construction, situational language research, and multi-modal research in classroom teaching.
  • the target client can save the historical records so that when the media files are opened next time, the project files with the same name corresponding to the historical records can be directly called.
  • the history record is record information of media files that have been opened within the historical period recorded by the target client.
  • For example, if there is a project file with the same name as the media file, it is determined that a corresponding project file has been created for the media file; the created project file with the same name as the media file is obtained directly from the storage path, and step 120 is then performed.
  • For example, if there is no project file with the same name as the media file, a project file with the same name as the media file is created based on the template file and loaded, and step 120 is then performed.
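The same-name lookup described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `.proj` suffix, the function name `get_or_create_project`, and the choice to create the project by copying the template file are all assumptions, since the source does not name a file extension or describe how the template is instantiated.

```python
import shutil
from pathlib import Path

# Hypothetical suffix for project files; the source does not specify one.
PROJECT_SUFFIX = ".proj"


def get_or_create_project(media_path: Path, template_path: Path) -> Path:
    """Return the project file sharing the media file's name.

    If no such file exists yet, create it by copying the template file
    (an assumed way of "creating based on the template file").
    """
    project_path = media_path.with_suffix(PROJECT_SUFFIX)
    if not project_path.exists():
        shutil.copyfile(template_path, project_path)
    return project_path
```

Calling the function a second time for the same media file returns the existing project file untouched, which matches the behavior of reopening a previously processed media file.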
  • Step 120 Obtain the audio data of the media file according to the directory of the project engineering file.
  • For example, an audio and video data parsing thread is started; the media file to be processed is located in the storage path of the media file based on the media file information recorded in the directory of the project engineering file; and the audio data of the media file is extracted from the media file by the audio and video data parsing thread.
  • Step 130 Perform segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data.
  • For example, it is first determined whether the audio data needs to be segmented. If it does, the audio data is segmented, and after segmentation completes, a notification of the end of the audio data segmentation is sent to the main thread. If it is determined that the audio data does not need to be segmented, the notification of the end of segmentation is sent to the main thread without segmenting.
  • For example, whether the audio data needs to be segmented can be determined by detecting whether the audio data in the project file contains divided segment data. If divided segment data exists, it is determined that the audio data does not need to be segmented. If there is no divided segment data, it is determined that the audio data needs to be segmented.
  • segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data includes: segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain Segment data of audio data.
  • the audio data may be initially segmented according to a preset segmentation interval, or the audio data may be initially segmented according to a silence segment. Then, based on the relationship between the noise amplitude threshold and the amplitude of the audio data, the audio data is subjected to a second segmentation process to obtain segment data of the audio data.
  • In some embodiments, segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain segment data of the audio data includes: obtaining the initial segment data of the audio data; determining whether the average amplitude in the current segment of the initial segment data is greater than the noise amplitude threshold; if it is greater, marking the current segment as a voiced segment; trimming the audio points within the current segment marked as a voiced segment from the beginning and end of the segment to remove silence or noise in the current segment; if the starting position of the current segment after trimming is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the starting position of the current segment after trimming is different from the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segment data of the audio data in this way to obtain the segment data of the audio data.
  • obtaining the initial segment data of the audio data includes: performing initial segment processing on the audio data according to a preset language template, and obtaining the initial segment data of the audio data.
  • the preset language template has the ability to segment segments.
  • the preset language templates may include built-in or self-built multi-language templates in the target client to achieve rapid creation of initial segmented data.
  • the multi-language template can be a multi-layer annotation template.
  • The multi-language template may include language templates corresponding to different national languages, dialects of different regions, voices of different characters, etc., such as language templates corresponding to English, Mandarin, minority languages, Chinese dialects, female voices, male voices, children's voices, etc.
  • the built-in multi-language templates can be language templates inserted through third-party software, and multiple speech transcriptions can be realized through the built-in multi-language templates.
  • the self-built multi-language template can be a language template created directly in the target client, and multiple speech transcription annotations can be realized by self-building multiple language templates.
  • The preset language template includes a multi-language template built in or self-built in the target client. The multi-language template can include language templates corresponding to different national languages, dialects of different regions, voices of different characters, etc.
  • Since different speaker genders and languages may produce different noise, judging by a single noise threshold may cause one-sided speech segmentation. Therefore, in this embodiment, the corresponding noise amplitude threshold is automatically generated based on the current segmented speech signal.
  • a noise amplitude threshold generation module can be built in, a preset language template can be input into the noise amplitude threshold generation module, and the noise amplitude threshold corresponding to the current segmented speech signal can be adaptively determined.
  • The speech signal corresponding to the current segment is obtained, and the amplitude distribution function corresponding to the speech signal of the current segment is obtained by fitting [formula not reproduced], where x represents the signal amplitude corresponding to the current segmented speech, and the function is parameterized by the signal variance of the current segmented speech.
  • The noise amplitude threshold corresponding to the current segmented speech is then determined [formula not reproduced], where Tam represents the noise amplitude threshold and the remaining terms are the standard deviation, the average amplitude, and the preset amplitude factor.
  • noise or non-noise in the speech can be adaptively detected according to the speech condition, thereby improving the accuracy of noise detection and segmentation.
  • the audio data can be initially segmented according to the preset segmentation interval to obtain the initial segment data of the audio data.
  • the preset segmentation interval may be an interval set according to a regular sentence segmentation time.
  • the audio data can be initially segmented based on the silence segments to obtain the initial segment data of the audio data.
  • For example, the silent segments in the audio data are detected, and the initial segmentation is performed based on the positions of the silent segments in the audio data: the head end of a silent segment is connected to the end of the previous initial segment, and the tail end of the silent segment is connected to the beginning of the next initial segment.
  • Silent segments whose audio length is greater than the preset length can be used as target silent segments that serve as the basis for initial segmentation. For example, the silent segments in the audio data are detected first, the silent segments whose audio length is greater than the preset length are selected as the target silent segments, and the initial segmentation is then performed based on the positions of the target silent segments in the audio data.
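The silence-based initial segmentation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: audio is represented as a plain list of sample amplitudes, a sample counts as silent when its absolute amplitude is at or below `silence_level`, and lengths are measured in samples. The names `find_silent_runs`, `initial_segments`, `silence_level`, and `min_silence_len` are hypothetical and do not come from the source.

```python
def find_silent_runs(samples, silence_level):
    """Return (start, end) index pairs, end exclusive, of consecutive silent samples."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if abs(value) <= silence_level:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(samples)))
    return runs


def initial_segments(samples, silence_level, min_silence_len):
    """Split the audio at silent runs longer than min_silence_len samples.

    Each initial segment ends where a target silent run begins, and the next
    initial segment begins where that silent run ends.
    """
    targets = [
        (start, end)
        for start, end in find_silent_runs(samples, silence_level)
        if end - start > min_silence_len
    ]
    segments = []
    cursor = 0
    for start, end in targets:
        if start > cursor:
            segments.append((cursor, start))
        cursor = end
    if cursor < len(samples):
        segments.append((cursor, len(samples)))
    return segments
```

Short silent runs (such as a brief pause inside a sentence) stay inside their initial segment because they do not exceed the preset length; only the longer target silent segments act as split points.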
  • A second segmentation process is then performed on the initial segment data. Specifically, it is judged whether the average amplitude in the current segment is greater than the noise amplitude threshold. If the average amplitude in the current segment is greater than the noise amplitude threshold, the current segment is marked as a voiced segment, and the audio points within the current segment marked as a voiced segment are trimmed from the beginning and end of the segment to remove silence or noise in the current segment. If the starting position of the trimmed current segment is the same as the end position of the previous segment, the current segment and the previous segment are merged, and the merged segment is used as a segment in the segment data; if the two positions are different, the current segment is marked as a new segment, and the new segment is used as a segment in the segment data.
  • If the average amplitude in the current segment is not greater than the noise amplitude threshold, the current segment is marked as a silent segment, and the current segment marked as a silent segment is discarded and is not used as a segment in the segment data.
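The second segmentation pass can be sketched as follows. This is a minimal illustration under assumptions: audio is a list of sample amplitudes, "average amplitude" is taken as the mean absolute amplitude, a single fixed `threshold` stands in for the adaptively generated noise amplitude threshold, and trimming removes leading and trailing samples at or below that threshold. The function name `refine_segments` is hypothetical.

```python
def refine_segments(samples, initial, threshold):
    """Second segmentation pass over initial (start, end) segments, end exclusive.

    Segments whose mean absolute amplitude does not exceed the threshold are
    discarded as silent. Voiced segments are trimmed at both ends, then merged
    with the previous kept segment when they start exactly where it ends.
    """
    result = []
    for start, end in initial:
        if end <= start:
            continue
        window = samples[start:end]
        mean_amplitude = sum(abs(v) for v in window) / len(window)
        if mean_amplitude <= threshold:
            continue  # silent segment: discard
        # Trim silence or noise from the segment start and segment end.
        while start < end and abs(samples[start]) <= threshold:
            start += 1
        while end > start and abs(samples[end - 1]) <= threshold:
            end -= 1
        if result and result[-1][1] == start:
            # Trimmed start coincides with the previous segment's end: merge.
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result
```

Because a voiced segment has a mean amplitude above the threshold, it always contains at least one sample above the threshold, so trimming never empties it.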
  • Step 140 Display the segment data of the audio data on the operation interface, which is used to provide a display interface and a boundary axis control.
  • For example, an operation interface 200 of the target client is provided; segment data 201 of the audio data is displayed on the operation interface 200, and the operation interface 200 is used to provide a display interface 202 and a boundary axis control 203.
  • The operation interface may also include, for example, menu interfaces such as file, editing, settings, and help; operation interfaces for transcription mode, annotation mode, and full-text mode; and a playback interface for the display interface.
  • displaying the segment data of the audio data on the operation interface includes: displaying the segment waveform information of the segment data of the audio data on the operation interface, and the timeline information corresponding to the segment waveform information.
  • the method further includes: in response to the hide waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.
  • the segment waveform information and timeline information can be displayed or hidden in a flexible display manner.
  • Step 150 In response to the editing operation on the boundary axis control, perform boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data.
  • a segment dragging operation is performed by dragging the boundary axis control to adjust segment boundaries or merge segments. That is, segments can be merged quickly by dragging the boundary axis control corresponding to a segment displayed on the operation interface, and boundary fine-tuning can be realized by dragging that boundary axis control horizontally to the left or right.
  • the segment waveform can also be displayed on the operation interface, and the boundary axis control corresponding to the segment waveform displayed on the operation interface can be directly dragged horizontally on the left or right side to realize boundary fine-tuning.
  • when judging whether the merge condition is met, it is mainly judged whether the final boundary point of the active segment passes beyond the adjacent boundary of the segment to be merged. For example, when merging to the right, the right boundary of the active segment must pass beyond the left boundary of the segment to be merged, and the two segments must be different. When merging to the left, the left boundary of the active segment must pass beyond the right boundary of the segment to be merged, and the two segments must be different.
  • the implementation logic for obtaining the end segment is: traverse the entire list of segments in order, and compare the left and right boundaries of each segment with the horizontal position of the mouse end point.
  • when merging left, if the right boundary of a segment is greater than the end position of the mouse, that segment is the end segment;
  • when merging right, if the left boundary of a segment is greater than the end position of the mouse, the segment before that segment is the end segment.
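The end-segment lookup above can be sketched as follows. The tuple representation of a segment, the function name `find_end_segment`, and the `direction` strings are hypothetical. The fallback to the last segment when merging right and no left boundary lies beyond the mouse is also an assumption, since the text does not cover that case.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (left boundary, right boundary), sorted by position


def find_end_segment(segments: List[Segment], mouse_x: float,
                     direction: str) -> Optional[int]:
    """Return the index of the end segment for a drag ending at mouse_x."""
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    for i, (left, right) in enumerate(segments):
        if direction == "left" and right > mouse_x:
            return i  # this segment is the end segment
        if direction == "right" and left > mouse_x:
            return i - 1 if i > 0 else None  # the segment before this one
    # Assumption: when merging right past every left boundary, use the last segment.
    if direction == "right" and segments:
        return len(segments) - 1
    return None
```

A linear scan mirrors the sequential traversal in the description; for long segment lists a binary search over the boundaries would give the same answer.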
  • the view change diagram of the operation interface 300 shown in FIG. 3 shows a schematic diagram of adjusting segment boundaries.
  • the user hovers the mouse over the first boundary axis control 3031 of the segment that needs to be adjusted.
  • the terminal determines the active segment 3011 currently to be adjusted by detecting the hovering position of the mouse. The user can then press and hold the left mouse button to start dragging the boundary label at one end of the first boundary axis control 3031. After dragging to the desired position, the user releases the left mouse button to complete the drag operation on the active segment, and the boundary of the active segment 3011 is updated to the new position.
  • the editing operation for the first boundary axis control 3031 may be a drag operation, a click operation, etc.
  • the boundary label at the end of the first boundary axis control 3031 that is not dragged is defined as the stationary end, and the stationary end is located at position A;
  • the dragged boundary label is defined as the active end, which is at position B before being dragged.
  • Diagram 3-1 in Figure 3 shows the picture before dragging
  • diagram 3-2 in Figure 3 shows the picture of updating the boundary position of the first boundary axis control 3031 after dragging.
  • the active end of the first boundary axis control 3031 is controlled to move from position B to position C to adjust the boundary. If the boundary position of the dragged active segment is not within the boundary range of any other segment, the boundary label at the active end of the active segment 3011 is updated to position C; that is, the boundary of the active segment 3011 is adjusted from segment AB to segment AC.
  • the view change diagram of the operation interface 400 shown in FIG. 4 shows a schematic diagram of the segment merging operation.
  • the user hovers the mouse over the first boundary axis control 4031 of the segment that needs to be adjusted.
  • the terminal determines the active segment 4011 currently to be adjusted by detecting the hovering position of the mouse. The user can then press and hold the left mouse button to start dragging the boundary label at one end of the first boundary axis control 4031. After dragging to the desired position, the user releases the left mouse button to complete the drag operation on the active segment, and the boundary of the active segment 4011 is updated to the new position.
  • the editing operation for the first boundary axis control 4031 may be a drag operation, a click operation, etc.
  • the boundary label at the end of the first boundary axis control 4031 that is not dragged is defined as the stationary end, and the stationary end is located at position D;
  • the dragged boundary label is defined as the active end, which is at position E before being dragged.
  • Diagram 4-1 in Figure 4 shows the picture before dragging
  • Diagram 4-2 in Figure 4 shows the change in the boundary position of the first boundary axis control 4031 during the dragging process
  • Diagram 4-3 in Figure 4 shows the segment merging after dragging.
  • the active end of the first boundary axis control 4031 is controlled to move from the position E across the position A to the position F.
  • the boundary labels located in other segments can be displayed as different icons from other boundary labels.
  • the active end of the first boundary axis control 4031 is controlled to move from position E across position A to position F, so that the active end is dragged into another segment. At this time, the icon of the active end at position F can be in the shape of a small light blue candle, while other boundary labels can be displayed as red right-angled icons. The user can merge the segments by releasing the mouse. If the boundary of the dragged active segment 4031 exceeds the adjacent boundaries of other segments, all segments within the range that overlap with the boundary of the dragged active segment can be merged.
  • if the active segment 4031 exceeds the left boundary (position A) of the other segment 4032, the active segment 4031 and the other segment 4032 can be merged to obtain the merged segment 4013.
  • the boundary of the boundary axis control 4033 of the merged segment 4013 is the DC segment.
  • performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data includes: in response to a first editing operation on the active end of the first boundary axis control of the active segment in the segment data, controlling the active end of the first boundary axis control to move to a first position;
  • determining whether there is, at the first position, a second boundary axis control that overlaps with the active end of the first boundary axis control, where the second boundary axis control is the boundary axis control corresponding to the second segment, and the active segment and the second segment are adjacent segments;
  • if there is a second boundary axis control at the first position that overlaps with the active end of the first boundary axis control, merging the active segment and the second segment.
  • when the backend program processes audio data, in order to avoid merging a segment with itself, it is necessary to determine before merging whether the active segment and the second segment are the same. Specifically, it can be determined whether the left boundaries of the two segments are the same and whether the right boundaries are also the same. If both the left boundaries and the right boundaries are the same, the active segment and the second segment are judged to be the same segment. If the left boundaries are different and/or the right boundaries are different, it is determined that they are not the same segment; the active segment is thereby accurately distinguished from the second segment and is then merged with it.
  • the method further includes: if there is no second boundary axis control at the first position that overlaps with the active end of the first boundary axis control, adjusting the boundary of the active segment according to the first position.
  • the merging function of two adjacent segments can be realized.
  • Figure 3 shows a schematic diagram of boundary adjustment processing on segment data
  • Figure 4 shows a schematic diagram of segment merging processing on segment data.
  • the active end of the first boundary axis control 3031 of the active segment 3011 in the segment data is controlled to move from position B to the first position.
  • the first position is position C in Figure 3. If there is no second boundary axis control overlapping the active end of the first boundary axis control 3031 at the first position (position C), the boundary of the active segment 3011 is adjusted according to the first position (position C); that is, the boundary of the active segment 3011 is adjusted from segment AB to segment AC.
  • the active end of the first boundary axis control 4031 of the active segment 4011 in the segment data is controlled to move to the first position.
  • the first position is position F in Figure 4. If there is a second boundary axis control 4032 that overlaps with the active end of the first boundary axis control 4031 at the first position (position F), then the active segment 4011 and the second segment 4012 are merged to obtain the merged segment 4013.
  • the boundary of the boundary axis control 4033 of the merged segment 4013 is the DC segment.
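The choice between boundary adjustment and adjacent-segment merging can be sketched as below. The sketch only covers dragging the right-hand end of the active segment toward its right neighbor. The `Segment` class and the function names are hypothetical, and the overlap test (the drop position lying beyond the neighbor's left boundary) follows the merge condition in the description and assumes the drop position does not pass the neighbor's right boundary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    left: float
    right: float


def is_same_segment(a: Segment, b: Segment) -> bool:
    """Two segments are the same when both their boundaries coincide."""
    return a.left == b.left and a.right == b.right


def drag_right_end(segments: List[Segment], index: int,
                   first_position: float) -> List[Segment]:
    """Return a new segment list after dragging the right end of segments[index]."""
    active = segments[index]
    if first_position <= active.left:
        raise ValueError("the active end may not cross the stationary end in this sketch")
    result = [Segment(s.left, s.right) for s in segments]  # leave the input untouched
    neighbor = segments[index + 1] if index + 1 < len(segments) else None
    if (neighbor is not None and not is_same_segment(active, neighbor)
            and first_position > neighbor.left):
        # The active end lands inside the adjacent segment: merge the two.
        result[index:index + 2] = [Segment(active.left, neighbor.right)]
    else:
        # No overlap with the neighbor: only the boundary moves.
        result[index].right = first_position
    return result
```

With the stationary end at D and the neighbor ending at C, the merged segment spans D to C, which matches the DC boundary described for the merged segment 4013.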
  • performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data includes: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, controlling the active end of the first boundary axis control to move to a second position;
  • determining whether there is, at the second position, a third boundary axis control that overlaps with the active end of the first boundary axis control, where the third boundary axis control is the boundary axis control corresponding to the third segment, and the active segment and the third segment are non-adjacent segments;
  • if there is a third boundary axis control at the second position that overlaps with the active end of the first boundary axis control, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
  • the method further includes: if there is no third boundary axis control at the second position that overlaps with the active end of the first boundary axis control, determining whether the target area between the stationary end position of the first boundary axis control and the second position overlaps with any intermediate segment; if the target area does not overlap with any intermediate segment, adjusting the boundary of the active segment according to the second position; or, if the target area overlaps with at least one intermediate segment, merging the active segment with all intermediate segments that overlap with the target area.
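The non-adjacent case can be sketched with a single overlap test over the target area. This is a simplification under stated assumptions: segments are `(left, right)` tuples sorted by position and non-overlapping, the function name `apply_drag` and its parameters are hypothetical, and the two cases in the text (the active end landing on a third segment, or the target area covering intermediate segments) are both treated as "every segment that overlaps the target area is merged".

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (left boundary, right boundary)


def apply_drag(segments: List[Segment], index: int, drag_right_end: bool,
               second_position: float) -> List[Segment]:
    """Return a new segment list after dragging one end of segments[index]."""
    left, right = segments[index]
    stationary = left if drag_right_end else right
    # Target area between the stationary end and the drop position.
    lo, hi = sorted((stationary, second_position))
    hits = [i for i, (l, r) in enumerate(segments)
            if i != index and l < hi and r > lo]
    if not hits:
        # Nothing overlaps the target area: plain boundary adjustment.
        return segments[:index] + [(lo, hi)] + segments[index + 1:]
    first, last = min(hits + [index]), max(hits + [index])
    # Merge the active segment with every segment the target area overlaps.
    merged = (min(lo, segments[first][0]), max(hi, segments[last][1]))
    return segments[:first] + [merged] + segments[last + 1:]
```

For example, dragging the right end of the first of three segments into the third one merges all three, while stopping in the gap after the second segment merges only the first two.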
  • the method further includes: in response to an insert breakpoint operation for the target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so as to split the target segment into segments based on the breakpoint.
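Breakpoint insertion can be read as splitting one segment into two at the chosen position. The following minimal sketch assumes that reading; the tuple representation and the function name `insert_breakpoint` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (left boundary, right boundary)


def insert_breakpoint(segments: List[Segment], index: int,
                      breakpoint_pos: float) -> List[Segment]:
    """Split segments[index] into two segments at breakpoint_pos."""
    left, right = segments[index]
    if not left < breakpoint_pos < right:
        raise ValueError("the breakpoint must lie strictly inside the target segment")
    return (segments[:index]
            + [(left, breakpoint_pos), (breakpoint_pos, right)]
            + segments[index + 1:])
```

The two resulting segments share the breakpoint as a common boundary, so they can later be adjusted or merged like any other pair of neighbors.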
  • Step 160 Perform speech recognition processing on the processed segment data to obtain transcribed text.
  • automatic transcription can be implemented by calling the speech recognition module configured on the terminal or a third-party speech recognition module to perform speech recognition processing on the processed segment data to obtain the transcribed text.
  • the transcribed text includes text fragments corresponding to each segment in the segment data.
  • the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain the modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.
  • the modification instructions may include instructions such as modifying words, deleting words, adding words, modifying fonts, modifying font size, modifying font color, etc.
  • the method further includes: responding to an annotation instruction for the target text segment, annotating the target text segment to obtain annotated transcribed text.
  • the target text fragment can be annotated in any of the following ways: industry field annotation, content category annotation, part-of-speech annotation, dependency annotation, entity annotation, relationship annotation, event annotation, reading comprehension annotation and question and answer annotation.
  • Step 170 Update the project engineering file according to the transcribed text to obtain an updated project engineering file.
  • the updated project engineering file carries the transcribed text.
  • For example, save the transcribed text and the path of the media file together in a fixed-format (.Baf) project file to update the project file.
  • the memory data used for display can be initialized using the audio results parsed by the audio and video data parsing thread and the segmented information obtained by segmentation processing, and then set default values for some parameters that need to be used.
  • Step 180 When the updated project engineering file is played on the display interface, the media file and the text fragments in the transcribed text that correspond to the playback progress of the media file are displayed.
  • the playback progress can also be controlled through the playback controls on the display interface.
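One way to find the text fragment that matches the playback progress is a binary search over the segment start times. This is only a sketch under assumptions: fragments are `(start, end, text)` tuples in seconds, sorted by start time and non-overlapping, and the function name `fragment_at` is hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Fragment = Tuple[float, float, str]  # (start seconds, end seconds, transcribed text)


def fragment_at(fragments: List[Fragment], position: float) -> Optional[str]:
    """Return the text whose segment contains the playback position, if any."""
    starts = [f[0] for f in fragments]
    # Index of the last fragment that starts at or before the position.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and position < fragments[i][1]:
        return fragments[i][2]
    return None  # the position falls in a gap between segments
```

Returning `None` for gaps lets the display interface clear the text while silence between segments is playing.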
  • the embodiment of this application also provides a multi-format import and export function, which supports the import of Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json format files, etc., and supports export to the above file types as well as to the eaf format. This facilitates the migration of transcribed files across formats.
  • corresponding interface functions for reading files and writing files can be provided for different file types and file reading and writing methods, so that different types of files can be read in or written out when importing or exporting files.
  • Excel, srt and other format files and corresponding media files can be imported at the same time, data files can be converted to Baf format, and multiple file formats can be optionally exported at one time.
  • the method further includes: in response to the export instruction carrying the target file type, exporting an export file corresponding to the target file type from the project engineering file, where the target file type is any one of the preset file types.
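Export by target file type can be sketched as a lookup table that maps each type to a writer function. The sketch covers only srt, lrc, and json, which can be produced with the Python standard library. The fragment representation (start and end in milliseconds plus text), the JSON field names, and all function names are assumptions; the application does not describe the internal layout of the exported files.

```python
import json
from typing import Callable, Dict, List, Tuple

Fragment = Tuple[int, int, str]  # (start ms, end ms, transcribed text)


def _srt_time(ms: int) -> str:
    """Format milliseconds as the SubRip timestamp HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(fragments: List[Fragment]) -> str:
    blocks = [f"{i}\n{_srt_time(a)} --> {_srt_time(b)}\n{text}\n"
              for i, (a, b, text) in enumerate(fragments, 1)]
    return "\n".join(blocks)


def to_lrc(fragments: List[Fragment]) -> str:
    lines = []
    for start, _end, text in fragments:
        m, rem = divmod(start, 60_000)
        s, ms = divmod(rem, 1000)
        lines.append(f"[{m:02d}:{s:02d}.{ms // 10:02d}]{text}")
    return "\n".join(lines) + "\n"


def to_json(fragments: List[Fragment]) -> str:
    # Field names are hypothetical.
    return json.dumps([{"start_ms": a, "end_ms": b, "text": t}
                       for a, b, t in fragments], ensure_ascii=False)


EXPORTERS: Dict[str, Callable[[List[Fragment]], str]] = {
    "srt": to_srt, "lrc": to_lrc, "json": to_json,
}


def export(fragments: List[Fragment], target_types: List[str]) -> Dict[str, str]:
    """Export to every requested type; reject types outside the preset list."""
    unknown = [t for t in target_types if t not in EXPORTERS]
    if unknown:
        raise ValueError(f"unsupported target file types: {unknown}")
    return {t: EXPORTERS[t](fragments) for t in target_types}
```

Passing several target types in one call corresponds to exporting multiple file formats at one time; other preset types such as Excel or Word would need additional writer functions registered in the same table.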
  • Figure 5 is a schematic diagram of a file export application scenario, and 5-1 in Figure 5 is a schematic diagram of the file export interface.
  • the target file type of the export, etc. can be set on the file export interface; for example, the target file type is set to Excel and the export language is set to Mandarin.
  • the exported Excel format file has the content shown in 5-2 in Figure 5.
  • Figure 6 is a schematic diagram of another application scenario of file export, and 6-1 in Figure 6 is a schematic diagram of the file export interface.
  • the target file type of the export can be set on the file export interface; for example, the target file type can be set to Excel, Word, and EAF at the same time, and the export language can be set to dialect.
  • after the export command is executed, the file can be exported according to the settings.
  • when the target file type is set to multiple file formats at the same time, multiple file formats can be exported at one time.
  • the exported Excel format file has the content shown in 6-2 in Figure 6.
  • preset file types may include: Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, json format files, etc.
  • the method further includes: in response to the import instruction, obtaining the import file; when the file type of the import file belongs to any of the preset file types, importing the import file into the project engineering file.
  • preset file types may include: Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, json format files, etc.
  • import of files of the above file types is supported, which facilitates the migration of transcribed files through multi-format file import.
  • the embodiment of this application obtains the project engineering file corresponding to the media file to be processed; obtains the audio data of the media file according to the directory of the project engineering file; performs segmentation processing on the audio data according to the amplitude of the audio data to obtain the segment data of the audio data; displays the segment data of the audio data on the operation interface, where the operation interface is used to provide a display interface and a boundary axis control; in response to the editing operation on the boundary axis control, performs boundary adjustment processing or segment merging processing on the segment data to obtain the processed segment data; performs speech recognition processing on the processed segment data to obtain the transcribed text; updates the project engineering file according to the transcribed text to obtain the updated project engineering file, which carries the transcribed text; and, when the updated project engineering file is played on the display interface, displays the media file and the text fragments in the transcribed text that correspond to the playback progress of the media file.
  • Embodiments of the present application can provide a simple and convenient speech transcription method, which can realize multiple speech transcriptions by self-building multiple language templates and dragging the boundary axis control corresponding to the segment displayed on the operation interface.
  • the embodiment of the present application also provides a multi-modal fast transcription and annotation system based on the self-built template.
  • FIG. 8 is a schematic structural diagram of a multi-modal fast transcription and annotation system based on a self-built template provided by an embodiment of the present application.
  • the multi-modal fast transcription and annotation system 800 based on the self-built template is applied to a terminal device that provides a graphical user interface.
  • the multi-modal fast transcription and annotation system 800 based on the self-built template may include:
  • the first obtaining unit 801 is used to obtain the project engineering file corresponding to the media file to be processed
  • the second acquisition unit 802 is used to acquire the audio data of the media file according to the directory of the project engineering file;
  • the segmentation unit 803 is used to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;
  • the display unit 804 is used to display the segment data of the audio data on the operation interface, and the operation interface is used to provide a display interface and boundary axis control;
  • the processing unit 805 is configured to perform boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, and obtain processed segment data;
  • the transcription unit 806 is used to perform speech recognition processing on the processed segment data to obtain transcribed text;
  • the update unit 807 is used to update the project engineering file according to the transcribed text to obtain an updated project engineering file, and the updated project engineering file carries the transcribed text;
  • the playback unit 808 is configured to display the media file and the text fragments in the transcribed text that correspond to the playback progress of the media file when the updated project engineering file is played on the display interface.
  • the processing unit 805 may be configured to: in response to the first editing operation on the active end of the first boundary axis control of the active segment in the segment data, control the active end of the first boundary axis control to move to the first position; determine whether there is a second boundary axis control overlapping the active end of the first boundary axis control at the first position, where the second boundary axis control is the boundary axis control corresponding to the second segment, and the active segment and the second segment are adjacent segments;
  • if there is a second boundary axis control overlapping the active end of the first boundary axis control at the first position, merge the active segment and the second segment.
  • the processing unit 805 may also be configured to: if there is no second boundary axis control at the first position that overlaps the active end of the first boundary axis control, adjust the boundary of the active segment according to the first position.
  • the processing unit 805 may be configured to: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, control the active end of the first boundary axis control to move to the second position; determine whether there is a third boundary axis control that overlaps with the active end of the first boundary axis control at the second position, where the third boundary axis control is the boundary axis control corresponding to the third segment, and the active segment and the third segment are non-adjacent segments;
  • if there is a third boundary axis control at the second position that overlaps the active end of the first boundary axis control, merge the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
  • the processing unit 805 may also be configured to: if there is no third boundary axis control at the second position that overlaps the active end of the first boundary axis control, determine whether the target area between the stationary end position of the first boundary axis control and the second position overlaps with any intermediate segment; if the target area does not overlap with any intermediate segment, adjust the boundary of the active segment according to the second position; or, if the target area overlaps with at least one intermediate segment, merge the active segment with all the intermediate segments that overlap with the target area.
  • the segmentation unit 803 may be used to segment the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain segment data of the audio data.
  • the segmentation unit 803 can be used to: obtain the initial segment data of the audio data; determine whether the average amplitude of the current segment in the initial segment data is greater than the noise amplitude threshold; if the average amplitude of the current segment in the initial segment data is greater than the noise amplitude threshold, mark the current segment as a voiced segment; trim the audio points in the current segment marked as a voiced segment from the segment start point and the segment end point to remove silence or noise in the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merge the trimmed current segment with the previous segment; if the start position of the trimmed current segment is different from the end position of the previous segment, mark the trimmed current segment as a new segment.
  • the segmentation unit 803 when acquiring the initial segment data of the audio data, may be used to: perform initial segmentation processing on the audio data according to the preset language template, and obtain the initial segment data of the audio data.
  • the first obtaining unit 801 can be used to: obtain the media file to be processed; detect whether a corresponding project engineering file has been created for the media file; if it is detected that no corresponding project engineering file has been created for the media file, create a project engineering file corresponding to the media file based on the template file; or, if it is detected that a corresponding project engineering file has been created for the media file, obtain the created project engineering file corresponding to the media file.
  • the processing unit 805 may also be configured to, in response to the export instruction carrying the target file type, export an export file corresponding to the target file type from the project engineering file, where the target file type is any one of the preset file types.
  • the processing unit 805 can also be used to: in response to the import instruction, obtain the import file; and, when the file type of the import file is any one of the preset file types, import the import file into the project engineering file.
  • the display unit 804 may be configured to display the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information on the operation interface.
  • the display unit 804 may also be configured to hide the segment waveform information and the timeline information on the operation interface in response to the hide waveform instruction.
  • the processing unit 805 may also be configured to, in response to the insert breakpoint operation for the target segment in the segment data, insert a breakpoint in the boundary axis control of the target segment, so as to split the target segment into segments based on the breakpoint.
  • the transcribed text includes text fragments corresponding to each segment in the segment data.
  • after the transcription unit 806 performs speech recognition processing on the processed segment data to obtain the transcribed text, it may also be used to: in response to a modification instruction for a target text fragment in the transcribed text, modify the target text fragment to obtain a modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.
  • the transcription unit 806 may also be configured to, in response to an annotation instruction for the target text fragment, annotate the target text fragment to obtain annotated transcribed text.
  • system embodiments and method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, they will not be repeated here.
  • the system shown in Figure 8 can execute the above-mentioned embodiments of the self-built template-based multi-modal rapid transcription and annotation method, and the aforementioned and other operations and/or functions of each unit in the system respectively implement the corresponding processes of the above-mentioned method embodiments. For the sake of brevity, they are not repeated here.
  • embodiments of the present application also provide a terminal device.
  • the terminal device may be a terminal or a server.
  • the terminal may be a device such as a smartphone, a tablet, a laptop, a smart TV, a smart speaker, a wearable smart device, or a personal computer.
  • Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 900 includes a processor 901 with one or more processing cores, a memory 902 with one or more computer-readable storage media, and a computer program stored on the memory 902 and capable of running on the processor.
  • the processor 901 is electrically connected to the memory 902.
  • the structure of the terminal equipment shown in the figures does not constitute a limitation on the terminal equipment, and may include more or fewer components than shown in the figures, or combine certain components, or arrange different components.
  • the processor 901 is the control center of the terminal device 900. It uses various interfaces and lines to connect the parts of the entire terminal device 900 and, by running or loading software programs and/or modules stored in the memory 902 and calling data stored in the memory 902, performs various functions of the terminal device 900 and processes data, thereby monitoring the terminal device 900 as a whole.
  • the processor 901 in the terminal device 900 loads instructions corresponding to the processes of one or more application programs into the memory 902 according to the following steps, and the processor 901 runs the application programs stored in the memory 902 to achieve various functions:
  • the segment data is subjected to boundary adjustment processing or segment merging processing to obtain processed segment data; speech recognition processing is performed on the processed segment data to obtain a transcribed text; the project engineering file is updated according to the transcribed text to obtain an updated project engineering file, which carries the transcribed text; when the updated project engineering file is played on the display interface, the media file and a text segment in the transcribed text corresponding to the playback progress of the media file are displayed.
  • the terminal device 900 further includes: a display unit 903, a radio frequency circuit 904, an audio circuit 905, an input unit 906, and a power supply 907.
  • the processor 901 is electrically connected to the display unit 903, the radio frequency circuit 904, the audio circuit 905, the input unit 906 and the power supply 907 respectively.
  • the structure of the terminal device shown in FIG. 9 does not constitute a limitation on the terminal device, and may include more or fewer components than shown in the figure, or combine certain components, or arrange different components.
  • the display unit 903 may be used to display information input by the user or information provided to the user as well as various graphical user interfaces of the terminal device. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display unit 903 may include a display panel and a touch panel.
  • the radio frequency circuit 904 can be used to send and receive radio frequency signals, so as to establish wireless communication with network equipment or other terminal equipment and exchange signals with that equipment.
  • the audio circuit 905 can be used to provide an audio interface between the user and the terminal device through speakers and microphones.
  • the input unit 906 can be used to receive input numbers, character information or user characteristic information (such as fingerprints, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the power supply 907 is used to power various components of the terminal device 900.
  • the power supply 907 can be logically connected to the processor 901 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • Power supply 907 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the terminal device 900 may also include a camera, a sensor, a wireless fidelity module, a Bluetooth module, etc., which will not be described again here.
  • embodiments of the present application provide a computer-readable storage medium in which multiple computer programs are stored.
  • the computer programs can be loaded by a processor to execute the steps of any of the self-built-template-based multi-modal fast transcription and annotation methods provided by the embodiments of the present application.
  • for the specific implementation of each of the above operations, refer to the previous embodiments; the details are not described again here.
  • the storage medium may include: read-only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.
  • because the computer program stored in the storage medium can execute the steps of any of the self-built-template-based multi-modal fast transcription and annotation methods provided by the embodiments of the present application, it can achieve the beneficial effects of any of those methods.
  • these beneficial effects are detailed in the previous embodiments and are not described again here.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the corresponding process of any self-built-template-based multi-modal fast transcription and annotation method in the embodiments of the present application. For brevity, the details are not repeated here.
  • An embodiment of the present application also provides a computer program.
  • the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the corresponding process of any self-built-template-based multi-modal fast transcription and annotation method in the embodiments of the present application. For brevity, the details are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application discloses a multi-modal rapid transliteration and annotation system based on a self-built template, comprising: a first acquisition unit for acquiring a project engineering file corresponding to a media file; a second acquisition unit for acquiring audio data of the media file according to a directory of the project engineering file; a segmentation unit for performing segmentation processing on the audio data according to an amplitude of the audio data to obtain segment data of the audio data; a display unit for displaying the segment data on an operation interface, the operation interface being used for providing a display interface and a boundary axis control; a processing unit for performing, in response to an editing operation for the boundary axis control, boundary adjustment or segment merging on the segment data to obtain processed segment data, and then performing speech recognition processing to obtain a transliteration text; a transliteration unit for updating the project engineering file according to the transliteration text; and a playing unit for displaying, when playing the updated project engineering file on the display interface, the media file and a text fragment in the transliteration text corresponding to the playing progress of the media file.

Description

A multi-modal fast transcription and annotation system based on a self-built template
Technical Field
This application relates to the field of speech processing technology, and specifically to a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium.
Background
With the development of computer technology, speech recognition technology is being applied more and more widely. Speech recognition technology identifies the speech content in collected speech information, that is, it recognizes digital speech information as the corresponding text.
Speech transcription technology converts speech into written text. It is used both for simple single-speaker transcription and for complex multi-speaker transcription, such as the transcription of conferences, court hearings, and classroom sessions.
However, existing speech transcription and annotation tools do not allow users to build their own language templates and have poor extensibility. They also cannot quickly merge segments or fine-tune segment boundaries, and therefore cannot meet the needs of real-world usage scenarios such as: production of external video subtitles (*.SRT), production of external lyrics for mp3 music (*.LRC), transcription of various recordings, language listening instruction, audio-visual-oral instruction, spoken corpus construction, multimedia resource library construction, body language research, and multimodal research on classroom teaching.
Technical Problem
Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium, which offer a simple and convenient way to transcribe and annotate speech. Speech transcription and annotation can be performed through self-built language templates, and segments can be quickly merged and their boundaries fine-tuned, which improves transcription and annotation efficiency and meets the needs of the scenarios described above.
Technical Solution
In one aspect, a multi-modal fast transcription and annotation method based on a self-built template is provided. The method includes: obtaining a project engineering file corresponding to a media file to be processed; obtaining audio data of the media file according to the directory of the project engineering file; segmenting the audio data according to its amplitude to obtain segment data of the audio data; displaying the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performing speech recognition processing on the processed segment data to obtain a transcribed text; updating the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, displaying the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a first editing operation on the movable end of a first boundary axis control of an active segment in the segment data, moving the movable end of the first boundary axis control to a first position; determining whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, the second boundary axis control being the boundary axis control of a second segment, and the active segment and the second segment being adjacent segments; and, if such a second boundary axis control exists at the first position, merging the active segment with the second segment.
In some embodiments, after determining whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, the method further includes: if no such second boundary axis control exists at the first position, adjusting the boundary of the active segment according to the first position.
In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a second editing operation on the movable end of the first boundary axis control of the active segment in the segment data, moving the movable end of the first boundary axis control to a second position; determining whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, the third boundary axis control being the boundary axis control of a third segment, and the active segment and the third segment being non-adjacent segments; and, if such a third boundary axis control exists at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
In some embodiments, after determining whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, the method further includes: if no such third boundary axis control exists at the second position, determining whether the target area between the position of the stationary end of the first boundary axis control and the second position overlaps any of the intermediate segments; if the target area does not overlap any of the intermediate segments, adjusting the boundary of the active segment according to the second position; or, if the target area overlaps at least one of the intermediate segments, merging the active segment with all intermediate segments that overlap the target area.
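The drag-to-merge and drag-to-adjust rules above can be sketched roughly as follows. This is a minimal illustration rather than the implementation of the application: the `Segment` type and `drag_end_boundary` function are hypothetical names, only the end boundary is treated as the movable end, a boundary axis control is assumed to span its segment's time range, and touching a neighbouring segment's start is assumed to count as overlapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str = ""


def drag_end_boundary(segments: List[Segment], idx: int, new_end: float) -> List[Segment]:
    """Drag the end (movable end) of segments[idx] to new_end.

    `segments` must be sorted by start time and non-overlapping.
    Returns a new list; the input list is not modified.
    """
    active = segments[idx]
    if new_end <= active.start:
        raise ValueError("the end boundary must stay after the start boundary")

    # Later segments reached by the region [active.start, new_end].
    # Assumption: touching a segment's start counts as overlapping it.
    absorbed = [s for s in segments[idx + 1:] if s.start <= new_end]

    if not absorbed:
        # No other segment is reached: plain boundary adjustment.
        resized = Segment(active.start, new_end, active.text)
        return segments[:idx] + [resized] + segments[idx + 1:]

    # Otherwise merge the active segment with every segment it reached,
    # including any intermediate segments in between.
    merged = Segment(
        start=active.start,
        end=max(new_end, absorbed[-1].end),
        text=" ".join(t for t in [active.text] + [s.text for s in absorbed] if t),
    )
    rest = segments[idx + 1 + len(absorbed):]
    return segments[:idx] + [merged] + rest
```

With segments at 0–1 s, 1.5–2.5 s and 3–4 s, dragging the first segment's end to 1.2 s only resizes it, dragging it to 2.0 s merges the first two segments, and dragging it to 3.5 s merges all three.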
In some embodiments, segmenting the audio data according to its amplitude to obtain segment data of the audio data includes: segmenting the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data, to obtain the segment data of the audio data.
In some embodiments, segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data, to obtain the segment data of the audio data, includes: obtaining initial segmentation data of the audio data; determining whether the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold; if so, marking the current segment as a voiced segment; trimming the segment start point and segment end point of the audio points within the current segment marked as voiced, to remove silence or noise within the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the start position of the trimmed current segment differs from the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segmentation data of the audio data in this way to obtain the segment data of the audio data.
In some embodiments, obtaining the initial segmentation data of the audio data includes: performing initial segmentation on the audio data according to a preset language template, to obtain the initial segmentation data of the audio data.
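The amplitude-based segmentation steps above can be illustrated with the following sketch. It rests on stated assumptions: the initial segmentation is approximated by fixed-length windows (the application obtains it from a preset language template instead), amplitude is taken as the absolute sample value, and the name `segment_by_amplitude` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_by_amplitude(
    samples: Sequence[float],
    window: int,
    noise_threshold: float,
) -> List[Tuple[int, int]]:
    """Return voiced segments as (start, end) sample indices, end exclusive."""
    segments: List[Tuple[int, int]] = []
    for lo in range(0, len(samples), window):
        hi = min(lo + window, len(samples))
        chunk = [abs(x) for x in samples[lo:hi]]

        # Skip windows whose average amplitude does not exceed the threshold.
        if sum(chunk) / len(chunk) <= noise_threshold:
            continue

        # Trim leading and trailing samples at or below the threshold.
        start, end = lo, hi
        while start < end and abs(samples[start]) <= noise_threshold:
            start += 1
        while end > start and abs(samples[end - 1]) <= noise_threshold:
            end -= 1

        if segments and segments[-1][1] == start:
            # Trimmed start coincides with the previous end: merge.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

Per-sample trimming is crude for real waveforms, which cross zero constantly; a practical version would more likely trim on a short-term energy envelope.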
In some embodiments, obtaining the project engineering file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project engineering file has already been created for the media file; if no corresponding project engineering file has been created, creating the project engineering file corresponding to the media file based on a template file; or, if a corresponding project engineering file has already been created, obtaining that project engineering file.
In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting from the project engineering file an export file corresponding to the target file type, the target file type being any one of a set of preset file types.
In some embodiments, the method further includes: in response to an import instruction, obtaining an import file; and, when the file type of the import file is any one of the preset file types, importing the import file into the project engineering file.
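As an illustration of exporting by target file type, the sketch below renders timed text fragments as SubRip (SRT) subtitles or LRC lyrics, the two formats named in the background section. The application does not list its preset file types here, so treating SRT and LRC as the preset types is an assumption, and the `EXPORTERS` table and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TimedText = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: List[TimedText]) -> str:
    """Render numbered SRT cues separated by blank lines."""
    blocks = [
        f"{i}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(segments, 1)
    ]
    return "\n".join(blocks)


def to_lrc(segments: List[TimedText]) -> str:
    """Render LRC lines of the form [mm:ss.xx]text (start times only)."""
    lines = []
    for start, _end, text in segments:
        cs = round(start * 100)
        m, cs = divmod(cs, 6000)
        s, cs = divmod(cs, 100)
        lines.append(f"[{m:02d}:{s:02d}.{cs:02d}]{text}")
    return "\n".join(lines) + "\n"


# Assumed set of preset file types; the real set is not given in the text.
EXPORTERS: Dict[str, Callable[[List[TimedText]], str]] = {"srt": to_srt, "lrc": to_lrc}


def export(segments: List[TimedText], file_type: str) -> str:
    """Export only when the target type is one of the preset types."""
    try:
        exporter = EXPORTERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported export type: {file_type}") from None
    return exporter(segments)
```

Keeping the exporters in a lookup table mirrors the "any one of the preset file types" check: an unknown type is rejected before any conversion is attempted.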
In some embodiments, displaying the segment data of the audio data on the operation interface includes: displaying, on the operation interface, segment waveform information of the segment data of the audio data, and timeline information corresponding to the segment waveform information.
In some embodiments, the method further includes: in response to a hide-waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.
In some embodiments, the method further includes: in response to an insert-breakpoint operation on a target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so as to split the target segment at the breakpoint.
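Inserting a breakpoint amounts to splitting one segment into two at the chosen position. A minimal sketch, assuming segments are plain `(start, end)` pairs in seconds and using the hypothetical name `insert_breakpoint`:

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start seconds, end seconds)


def insert_breakpoint(segments: List[Span], idx: int, position: float) -> List[Span]:
    """Split segments[idx] at `position`, returning a new segment list."""
    start, end = segments[idx]
    if not start < position < end:
        raise ValueError("the breakpoint must lie strictly inside the target segment")
    return segments[:idx] + [(start, position), (position, end)] + segments[idx + 1:]
```

How any existing transcribed text is divided between the two halves is left open here, since the text above does not specify it.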
In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data, and after performing speech recognition processing on the processed segment data to obtain the transcribed text, the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain a modified transcribed text, the target text fragment being at least one text fragment in the transcribed text.
In some embodiments, the method further includes: in response to an annotation instruction for the target text fragment, annotating the target text fragment to obtain an annotated transcribed text.
In another aspect, a multi-modal fast transcription and annotation system based on a self-built template is provided. The system includes:
a first acquisition unit, configured to obtain a project engineering file corresponding to a media file to be processed;
a second acquisition unit, configured to obtain audio data of the media file according to the directory of the project engineering file;
a segmentation unit, configured to segment the audio data according to its amplitude to obtain segment data of the audio data;
a display unit, configured to display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and boundary axis controls;
a processing unit, configured to perform boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on a boundary axis control, to obtain processed segment data;
a transcription unit, configured to perform speech recognition processing on the processed segment data to obtain a transcribed text;
an update unit, configured to update the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text;
a playback unit, configured to display, when the updated project engineering file is played on the display interface, the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
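The playback unit's task of showing the text fragment that matches the current playback progress can be sketched as a sorted lookup. This is an illustrative assumption about the data layout, not the system's actual design: fragments are taken to be `(start, end, text)` tuples sorted by start time, and `fragment_at` is a hypothetical name.

```python
import bisect
from typing import List, Optional, Tuple

TimedText = Tuple[float, float, str]  # (start seconds, end seconds, text)


def fragment_at(segments: List[TimedText], position: float) -> Optional[str]:
    """Return the text whose segment contains `position`, or None in a gap.

    `segments` must be sorted by start time and non-overlapping.
    """
    starts = [start for start, _end, _text in segments]
    # Index of the last segment starting at or before `position`.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and position < segments[i][1]:
        return segments[i][2]
    return None
```

A player would call this on each progress update; when calls are frequent, the `starts` list can be built once and reused.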
In another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program suitable for being loaded by a processor to execute the steps of the multi-modal fast transcription and annotation method based on a self-built template described in the first aspect.
In another aspect, a terminal device is provided. The terminal device includes a processor and a memory in which a computer program is stored; by calling the computer program stored in the memory, the processor executes the steps of the multi-modal fast transcription and annotation method based on a self-built template described in the first aspect.
Beneficial Effects
Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium. A project engineering file corresponding to a media file to be processed is obtained; audio data of the media file is obtained according to the directory of the project engineering file; the audio data is segmented according to its amplitude to obtain segment data of the audio data; the segment data is displayed on an operation interface, which provides a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, boundary adjustment processing or segment merging processing is performed on the segment data to obtain processed segment data; speech recognition processing is performed on the processed segment data to obtain a transcribed text; the project engineering file is updated according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file are displayed.
Embodiments of the present application can provide a simple and convenient way to transcribe and annotate speech. Transcription of multiple kinds of speech can be achieved through self-built multi-language templates, and template import is supported for the many languages or dialects for which speech recognition is not available, ultimately enabling fast and efficient sentence segmentation, transcription, and annotation. Segments can be merged quickly by dragging the boundary axis control of a segment displayed on the operation interface, and boundaries can be fine-tuned by dragging horizontally on the boundary axis control of the segment waveform displayed on the operation interface. This improves the efficiency of speech transcription and annotation and meets the needs of the scenarios described above.
Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of the multi-modal fast transcription and annotation method based on a self-built template provided by an embodiment of the present application.
Figure 2 is a schematic diagram of a first application scenario provided by an embodiment of the present application.
Figure 3 is a schematic diagram of a second application scenario provided by an embodiment of the present application.
Figure 4 is a schematic diagram of a third application scenario provided by an embodiment of the present application.
Figure 5 is a schematic diagram of a fourth application scenario provided by an embodiment of the present application.
Figure 6 is a schematic diagram of a fifth application scenario provided by an embodiment of the present application.
Figure 7 is a schematic diagram of a sixth application scenario provided by an embodiment of the present application.
Figure 8 is a schematic structural diagram of the multi-modal fast transcription and annotation system based on a self-built template provided by an embodiment of the present application.
Figure 9 is a schematic structural diagram of the terminal device provided by an embodiment of the present application.
Embodiments of the Invention
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium. Specifically, the method of the embodiments of the present application can be executed by a terminal device, which may be a terminal, a server, or a similar device. The terminal may be a smartphone, a tablet, a touch screen, a personal computer (PC), or another terminal device. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network services, and big data and artificial intelligence platforms, but is not limited to these.
Each is described in detail below. Note that the order in which the following embodiments are described does not limit their order of priority.
Refer to Figures 1 to 7. Figure 1 is a schematic flowchart of the multi-modal fast transcription and annotation method based on a self-built template provided by an embodiment of the present application, and Figures 2 to 7 are schematic diagrams of application scenarios provided by embodiments of the present application. The method can be applied to the multi-modal fast transcription and annotation system based on a self-built template of the embodiments of the present application, and that system can be configured on a terminal device. The method includes the following steps:
Step 110: Obtain the project file corresponding to the media file to be processed.
In some embodiments, obtaining the project file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project file has been created for the media file; if no corresponding project file has been created, creating the project file corresponding to the media file based on a template file; or, if a corresponding project file has already been created, obtaining that existing project file.
For example, a target client may be provided and started, and a media file to be processed is then opened or imported through the target client so as to obtain the media file. The media file may be, for example, an audio file or a video file.
For example, the target client may be tool software of the multi-modal fast transcription and annotation system based on a self-built template, developed specifically for the rapid transcription and annotation of audio and video language materials. The software may have built-in multi-language templates for Mandarin, Chinese dialects, minority languages, and so on, directly supporting discourse transcription for the Chinese Language Resources Protection Project. The multi-language template may be a multi-layer annotation template. Multi-language templates may also be self-built according to project needs; for example, language transcription and annotation templates corresponding to different languages may also be built in. In addition, the target client can be applied in scenarios such as the production of external video subtitles (*.SRT), the production of external lyrics for mp3 music (*.LRC), the transcription of various recordings, language listening teaching, audio-visual-oral teaching, spoken corpus construction, multimedia resource library construction, body language research, and multi-modal research on classroom teaching.
Then, whether a corresponding project file has been created for the media file is detected by checking whether a project file with the same name as the media file exists in the storage path. For media files that have been opened before, the target client may save a history record, so that the next time such a media file is opened, the same-name project file corresponding to the history record is loaded directly. A project file then only needs to be created for media files that are opened for the first time or are not in the history record, which streamlines the processing flow. For example, the history record is the record information of media files opened within the historical period recorded by the target client.
For example, if a project file with the same name as the media file exists, it is determined that a corresponding project file has been created for the media file; the existing same-name project file is obtained directly from the storage path, and step 120 is then performed.
For example, if no project file with the same name as the media file exists, a same-name project file corresponding to the media file is created based on the template file and loaded, and step 120 is then performed.
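The same-name lookup described above can be sketched as follows. This is only a minimal illustration: the `.proj` extension, the function name `get_project_file`, and the choice to create the project file by copying the template are assumptions, since the source does not specify a file format or creation mechanism.

```python
import shutil
import tempfile
from pathlib import Path

PROJECT_SUFFIX = ".proj"  # hypothetical extension; the source does not name one


def get_project_file(media_path: Path, template_path: Path) -> tuple[Path, bool]:
    """Return (project_path, created) for the given media file.

    The project file shares the media file's name. If it already exists it is
    reused; otherwise it is created from the template (assumed here: a copy).
    """
    project_path = media_path.with_suffix(PROJECT_SUFFIX)
    if project_path.exists():
        return project_path, False
    shutil.copyfile(template_path, project_path)
    return project_path, True


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    media = Path(tmp) / "interview.wav"
    media.write_bytes(b"")
    template = Path(tmp) / "template.proj"
    template.write_text("{}")
    first = get_project_file(media, template)   # no project yet: created
    second = get_project_file(media, template)  # project exists: reused
```

On the first call the project file is created from the template; on the second call the existing file is returned without being overwritten.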
Step 120: Obtain the audio data of the media file according to the directory of the project file.
For example, an audio/video data parsing thread is started. According to the media file information recorded in the directory of the project file, the media file to be processed that corresponds to the directory is located in the storage path of the media file, and the audio data is extracted from the media file by the audio/video data parsing thread.
Step 130: Segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data.
For example, before segmentation is performed, it is first determined whether the audio data needs to be segmented. If segmentation is needed, the audio data is segmented, and a notification that segmentation has finished is sent to the main thread once segmentation is complete. If segmentation is not needed, the notification that segmentation has finished is sent to the main thread directly.
Whether the audio data needs to be segmented can be determined by detecting whether the project file already contains divided segment data for the audio data. If divided segment data exists, segmentation is not needed. If no divided segment data exists, segmentation is needed.
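The decision and notification flow above can be sketched as a worker thread that reports back through a queue. This is an illustrative sketch only: representing the project as a dictionary with a `segments` key, the message string, and the placeholder segmentation (one segment spanning the whole signal) are all assumptions.

```python
import queue
import threading


def needs_segmentation(project: dict) -> bool:
    """Segmentation is needed only when the project holds no segment data yet."""
    return not project.get("segments")


def segmentation_worker(project: dict, samples: list, done: queue.Queue) -> None:
    """Segment if required, then notify the main thread in either case."""
    if needs_segmentation(project):
        # Placeholder for the real segmentation: one segment over the whole signal.
        project["segments"] = [(0, len(samples))]
    done.put("segmentation_finished")


def run_segmentation(project: dict, samples: list) -> str:
    """Start the worker and wait (in the main thread) for its notification."""
    done: queue.Queue = queue.Queue()
    worker = threading.Thread(target=segmentation_worker, args=(project, samples, done))
    worker.start()
    message = done.get(timeout=5)
    worker.join()
    return message
```

Sending the same notification on both branches lets the main thread continue with a single code path, whether or not segmentation actually ran.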
In some embodiments, segmenting the audio data according to the amplitude of the audio data to obtain the segment data of the audio data includes: segmenting the audio data according to the magnitude relationship between a noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data.
For example, initial segmentation may be performed on the audio data according to a preset segmentation interval or according to silent segments. A second segmentation is then performed according to the magnitude relationship between the noise amplitude threshold and the amplitude of the audio data, to obtain the segment data of the audio data.
In some embodiments, segmenting the audio data according to the magnitude relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data includes: obtaining initial segment data of the audio data; determining whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if so, marking the current segment as a voiced segment; trimming the segment start point and segment end point of the audio points within the current segment marked as voiced, so as to remove silence or noise within the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the start position of the trimmed current segment differs from the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segment data of the audio data in this way to obtain the segment data of the audio data.
In some embodiments, obtaining the initial segment data of the audio data includes: performing initial segmentation on the audio data according to a preset language template to obtain the initial segment data of the audio data.
For example, the preset language template is capable of dividing speech into segments. The preset language template may include multi-language templates built into or self-built within the target client, enabling rapid creation of the initial segment data. The multi-language template may be a multi-layer annotation template. For example, the multi-language template may contain language templates corresponding to the languages of different countries, the dialects of different regions, the voices of different speaker roles, and so on, such as templates for English, Mandarin, minority languages, Chinese dialects, female voices, male voices, and children's voices. A built-in multi-language template may be a language template imported through third-party software, and building in multiple language templates enables transcription of multiple kinds of speech. A self-built multi-language template may be a language template created directly within the target client, and self-building multiple language templates enables transcription and annotation of multiple kinds of speech.
In some embodiments, the preset language template includes multi-language templates built into or self-built within the target client, and the multi-language template may contain language templates corresponding to the languages of different countries, the dialects of different regions, the voices of different speaker roles, and so on. Because different speaker genders and their corresponding languages may give rise to different noise, judging by a single noise threshold may make the speech segmentation one-sided. Therefore, in this embodiment, the corresponding noise amplitude threshold is generated automatically based on the speech signal of the current segment. For example, a noise amplitude threshold generation module may be built in, the preset language template is input into the noise amplitude threshold generation module, and the noise amplitude threshold corresponding to the speech signal of the current segment is determined adaptively.
Specifically, in this embodiment, the speech signal corresponding to the current segment is obtained, and the amplitude distribution function of the speech signal of the current segment is obtained by fitting as:
Figure PCTCN2022091181-appb-000001
where x denotes the signal amplitude of the speech of the current segment, and σ denotes the signal variance of the speech of the current segment;
the signal standard deviation of the speech of the current segment is determined based on the amplitude distribution function;
based on the product of the standard deviation, the average amplitude, and a preset amplitude factor, the noise amplitude threshold of the speech of the current segment is determined as:
Figure PCTCN2022091181-appb-000002
where Tam denotes the noise amplitude threshold, Figure PCTCN2022091181-appb-000003 denotes the standard deviation, Figure PCTCN2022091181-appb-000004 denotes the average amplitude, and α denotes the preset amplitude factor. By determining the noise amplitude threshold and segmenting the speech in this way, this embodiment can adaptively distinguish noise from non-noise according to the speech conditions, thereby improving the accuracy of noise detection and segmentation.
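The threshold computation can be sketched as follows. The exact formulas appear only as images in the source, so this sketch takes the textual description literally (threshold = standard deviation × average amplitude × preset amplitude factor) and, as a simplifying assumption, uses the sample standard deviation of the segment instead of fitting the amplitude distribution function. The function name and the default value of `alpha` are hypothetical.

```python
import statistics


def noise_amplitude_threshold(samples: list, alpha: float = 0.5) -> float:
    """Per-segment noise amplitude threshold (illustrative sketch).

    Assumptions, not taken from the source formulas:
      * the standard deviation is the population standard deviation of the samples;
      * the average amplitude is the mean of the absolute sample values;
      * alpha (the preset amplitude factor) defaults to 0.5.
    """
    mean_amplitude = statistics.fmean(abs(s) for s in samples)
    std_dev = statistics.pstdev(samples)
    return alpha * std_dev * mean_amplitude
```

Because the threshold is computed from each segment's own statistics, a loud segment receives a higher threshold than a quiet one, which is the adaptive behaviour the description aims for.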
For example, initial segmentation may be performed on the audio data according to a preset segmentation interval to obtain the initial segment data of the audio data. The preset segmentation interval may be, for example, an interval set according to typical sentence-break timing.
For example, initial segmentation may be performed on the audio data according to silent segments to obtain the initial segment data of the audio data. Silent segments in the audio data are detected, and initial segmentation is performed based on the positions of the silent segments in the audio data: the start of a silent segment adjoins the end of the preceding initial segment, and the end of the silent segment adjoins the start of the following initial segment.
For example, short silences caused by ordinary pauses within a sentence could split a complete sentence into several initial segments and produce too many initial segments. To avoid this, short silent segments may be ignored before initial segmentation, and only silent segments whose audio length exceeds a preset length are used as the target silent segments on which initial segmentation is based. For example, the silent segments in the audio data are detected first, those whose audio length exceeds the preset length are selected as target silent segments, and initial segmentation is then performed based on the positions of the target silent segments in the audio data.
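Silence-based initial segmentation with a minimum silence length can be sketched as follows. This is only an illustration: treating a sample as silent when its absolute value is at or below `silence_level`, measuring length in samples, and using half-open `[start, end)` index ranges are assumptions not stated in the source.

```python
def find_silent_runs(samples: list, silence_level: float, min_length: int) -> list:
    """Return [start, end) ranges of silent runs that are at least min_length long."""
    runs = []
    start = None
    for i, sample in enumerate(samples):
        if abs(sample) <= silence_level:
            if start is None:
                start = i  # a silent run begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))  # long enough to be a target silent segment
            start = None
    if start is not None and len(samples) - start >= min_length:
        runs.append((start, len(samples)))
    return runs


def initial_segments(samples: list, silence_level: float, min_length: int) -> list:
    """Split the signal at target silent segments; short pauses are ignored."""
    segments = []
    cursor = 0
    for start, end in find_silent_runs(samples, silence_level, min_length):
        if start > cursor:
            segments.append((cursor, start))  # segment ends where the silence begins
        cursor = end  # next segment begins where the silence ends
    if cursor < len(samples):
        segments.append((cursor, len(samples)))
    return segments
```

In this sketch a one-sample pause inside a sentence does not split the sentence, whereas a silence of three or more samples (with `min_length=3`) does.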
Then, a second segmentation is performed on the initial segment data according to the magnitude relationship between the noise amplitude threshold and the amplitude of the audio data. Specifically, it is determined whether the average amplitude within the current segment is greater than the noise amplitude threshold. If so, the current segment is marked as a voiced segment, and the segment start point and segment end point of the audio points within it are trimmed to remove silence or noise within the current segment. If the start position of the trimmed current segment is the same as the end position of the previous segment, the current segment is merged with the previous segment, and the merged segment is taken as one segment in the segment data. If the two positions differ, the current segment is marked as a new segment, which may be taken as one segment in the segment data.
For example, if the average amplitude within the current segment of the initial segment data is not greater than the noise amplitude threshold, the current segment is marked as a silent segment; it may be discarded and is not taken as a segment in the segment data.
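The second segmentation pass (mark voiced, trim, then merge or start a new segment) can be sketched as follows. This is a minimal illustration under assumptions: segments are half-open `[start, end)` sample index ranges, a single fixed `threshold` is passed in rather than computed per segment, and trimming removes edge samples whose absolute value does not exceed the threshold.

```python
def mean_amplitude(samples: list, start: int, end: int) -> float:
    """Average absolute amplitude over samples[start:end]."""
    return sum(abs(s) for s in samples[start:end]) / (end - start)


def trim(samples: list, start: int, end: int, threshold: float) -> tuple:
    """Shrink [start, end) from both sides until the edge samples exceed the threshold."""
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return start, end


def refine_segments(samples: list, initial: list, threshold: float) -> list:
    """Turn initial segments into final segments."""
    result = []
    for start, end in initial:
        if end <= start or mean_amplitude(samples, start, end) <= threshold:
            continue  # silent segment: discarded
        # Voiced segment: at least one sample exceeds the threshold,
        # so the trimmed range is never empty.
        start, end = trim(samples, start, end, threshold)
        if result and result[-1][1] == start:
            # Trimmed start coincides with the previous end: merge the two.
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result
```

Two neighbouring initial segments are merged only when no silence remains between them after trimming, so a sentence cut in half by a fixed-interval initial split is rejoined.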
Step 140: Display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control.
For example, as shown in Figure 2, an operation interface 200 of the target client is provided, and the segment data 201 of the audio data is displayed on the operation interface 200. The operation interface 200 is used to provide a display interface 202 and a boundary axis control 203.
For example, other editing or operation entries may also be displayed on the operation interface 200, such as entries for File, Edit, Settings, and Help; operation entries for the transcription mode, annotation mode, and full-text mode; and a playback entry for the display interface.
In some embodiments, displaying the segment data of the audio data on the operation interface includes: displaying, on the operation interface, the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information.
In some embodiments, the method further includes: in response to a hide-waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.
For example, the segment waveform information and timeline information can be shown or hidden based on instructions input by the user, so the display is flexible.
Step 150: In response to an editing operation on the boundary axis control, perform boundary adjustment or segment merging on the segment data to obtain processed segment data.
For example, a segment drag operation can be performed by dragging the boundary axis control, so as to adjust segment boundaries or merge segments. That is, segments can be merged quickly by dragging the boundary axis control corresponding to a segment displayed on the operation interface, and a boundary can be fine-tuned by dragging that boundary axis control horizontally to the left or right. For example, segment waveforms may also be displayed on the operation interface, and a boundary can be fine-tuned by dragging the boundary axis control corresponding to a displayed segment waveform horizontally to the left or right.
For example: the boundary axis control is right-clicked and the information of the current active segment is recorded; the list of all current segment information is cached; the boundary axis control is dragged in response to a drag operation triggered by pressing and holding the left mouse button; whether an active segment exists is determined and, if so, the left and right boundary points of the temporary active segment are updated; when the left button is released, whether the preceding operation was a drag is determined and, if so, the segment at the current mouse position is obtained; whether the merge condition is satisfied is then determined and, if so, the segments are merged, and if not, the boundary information of the active segment is updated.
For example, determining whether the merge condition is satisfied mainly means determining whether the final boundary point of the active segment has passed the adjacent boundary of the segment to be merged. When merging to the right, the right boundary of the active segment must pass the left boundary of the segment to be merged, and the two segments must not be the same segment. When merging to the left, the left boundary of the active segment must pass the right boundary of the segment to be merged, and the two segments must not be the same segment.
For example, the end segment is obtained as follows: the whole segment list is traversed in order, and the left and right boundaries of each segment are compared with the horizontal position at which the mouse was released. When merging to the left, the first segment whose right boundary is greater than the mouse release position is the end segment. When merging to the right, when a segment's left boundary is greater than the mouse release position, the segment before it is the end segment.
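The end-segment lookup can be sketched as follows. This is an illustrative sketch: segments are modelled as `(left, right)` tuples sorted by position, and the function returns an index. The fallback for a rightward drag released inside or beyond the last segment is an assumption, because the source does not describe that case.

```python
from typing import Optional

Segment = tuple[float, float]  # (left boundary, right boundary)


def find_end_segment(segments: list[Segment], mouse_x: float, direction: str) -> Optional[int]:
    """Return the index of the end segment for a drag released at mouse_x, or None."""
    for i, (left, right) in enumerate(segments):
        if direction == "left" and right > mouse_x:
            return i  # first segment whose right boundary lies past the mouse
        if direction == "right" and left > mouse_x:
            # The segment before the first one that starts past the mouse.
            return i - 1 if i > 0 else None
    if direction == "right" and segments and mouse_x >= segments[-1][0]:
        # Assumption (not in the source): no segment starts past the mouse,
        # so the last segment is treated as the end segment.
        return len(segments) - 1
    return None
```

A return value of `None` corresponds to the case in which no end segment exists, which the merge check then treats as "cannot merge".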
For example, taking merging to the right as an example, the merge condition is checked as follows. First, whether an end segment exists is detected; if not, merging is not possible and the merge condition is not satisfied. If an end segment exists, whether it is the same segment as the active segment is determined; if so, merging is not possible and the merge condition is not satisfied. If it is a different segment, whether the current right boundary of the active segment is greater than the left boundary of the end segment is determined; if so, merging is possible and the merge condition is satisfied; otherwise, merging is not possible and the merge condition is not satisfied.
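The merge-condition check can be sketched as follows, together with its mirror image for merging to the left. This is a sketch under assumptions: segments are `(left, right)` tuples, `active` holds the active segment's boundaries as recorded before the drag, and the dragged boundary is passed separately; the function names are hypothetical.

```python
from typing import Optional

Segment = tuple[float, float]  # (left boundary, right boundary)


def can_merge_right(active: Segment, dragged_right: float, end_segment: Optional[Segment]) -> bool:
    """Merge condition when the active segment's right boundary is dragged rightward."""
    if end_segment is None:
        return False  # no end segment: cannot merge
    if end_segment == active:
        return False  # same left and right boundaries: same segment
    return dragged_right > end_segment[0]  # must pass the end segment's left boundary


def can_merge_left(active: Segment, dragged_left: float, end_segment: Optional[Segment]) -> bool:
    """Mirror case: the active segment's left boundary is dragged leftward."""
    if end_segment is None or end_segment == active:
        return False
    return dragged_left < end_segment[1]  # must pass the end segment's right boundary
```

Comparing both boundaries to decide whether two segments are the same follows the identity test described for the back-end program.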
For example, Figure 3 is a schematic diagram of view changes of the operation interface 300, illustrating the adjustment of a segment boundary. The user hovers the mouse over the first boundary axis control 3031 of the segment to be adjusted, and the terminal determines the active segment 3011 to be adjusted by detecting the hover position of the mouse. The user can then press and hold the left mouse button to drag the boundary label at one end of the first boundary axis control 3031, and release the left mouse button at the desired position to complete the drag operation, whereupon the boundary of the active segment 3011 is updated to the new position. The editing operation on the first boundary axis control 3031 may be a drag operation, a click operation, or the like. Taking the drag operation as an example, the boundary label at the end of the first boundary axis control 3031 that is not dragged is defined as the stationary end, located at position A; the boundary label at the end that is dragged is defined as the active end, located at position B before being dragged. Diagram 3-1 in Figure 3 shows the view before dragging, and diagram 3-2 in Figure 3 shows the view after dragging, in which the boundary position of the first boundary axis control 3031 has been updated. In response to a first editing operation on the active end of the first boundary axis control 3031 of the active segment 3011, the active end is controlled to move from position B to position C so as to adjust the boundary. If the boundary position of the dragged active segment does not fall within the boundary range of any other segment, the boundary label at that end of the active segment 3011 is updated to position C; that is, the boundary of the active segment 3011 is adjusted from span AB to span AC.
For example, Figure 4 is a schematic diagram of view changes of the operation interface 400, illustrating a segment merging operation. The user hovers the mouse over the first boundary axis control 4031 of the segment to be adjusted, and the terminal determines the active segment 4011 to be adjusted by detecting the hover position of the mouse. The user can then press and hold the left mouse button to drag the boundary label at one end of the first boundary axis control 4031, and release the left mouse button at the desired position to complete the drag operation, whereupon the boundary of the active segment 4011 is updated to the new position. The editing operation on the first boundary axis control 4031 may be a drag operation, a click operation, or the like. Taking the drag operation as an example, the boundary label at the end of the first boundary axis control 4031 that is not dragged is defined as the stationary end, located at position D; the boundary label at the end that is dragged is defined as the active end, located at position E before being dragged. Diagram 4-1 in Figure 4 shows the view before dragging, diagram 4-2 shows the boundary position of the first boundary axis control 4031 changing during dragging, and diagram 4-3 shows the merged segments after dragging. In response to a first editing operation on the active end of the first boundary axis control 4031 of the active segment 4011, the active end is controlled to move from position E, past position A, to position F. When a boundary label is dragged into another segment, it may be displayed with an icon different from that of the other boundary labels. For example, when the active end of the first boundary axis control 4031 is moved from position E past position A to position F, and is thereby dragged into another segment, the icon of the active end at position F may appear as a small light-blue candle, while the other boundary labels are displayed as red right-angle icons; the user releases the mouse to merge the segments. If the boundary of the dragged active segment 4011 passes the adjacent boundary of another segment, all segments within the range overlapped by the boundary of the dragged active segment can be merged. For example, when the boundary of the active segment 4011 passes the left boundary (position A) of the other segment 4012, the active segment 4011 and the other segment 4012 are merged to obtain the merged segment 4013, and the boundary of the boundary axis control 4033 of the merged segment 4013 is span DC.
In some embodiments, performing boundary adjustment or segment merging on the segment data in response to an editing operation on the boundary axis control, to obtain the processed segment data, includes: in response to a first editing operation on the active end of a first boundary axis control of an active segment in the segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, where the second boundary axis control is the boundary axis control corresponding to a second segment, and the active segment and the second segment are adjacent segments; and, if such a second boundary axis control exists at the first position, merging the active segment with the second segment.
In some embodiments, when the back-end program processes the audio data, in order to avoid merging a segment with itself, it is further determined, before the active segment and the second segment are merged, whether the two are the same segment. Specifically, it is determined whether the left boundaries of the two segments are the same and whether their right boundaries are the same. If both the left boundaries and the right boundaries are the same, the active segment and the second segment are determined to be the same segment. If the left boundaries differ and/or the right boundaries differ, they are determined not to be the same segment; the active segment and the second segment are thereby reliably distinguished and are then merged.
In some embodiments, after determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the method further includes: if no such second boundary axis control exists at the first position, adjusting the boundary of the active segment according to the first position.
For example, two adjacent segments can be merged quickly by dragging the segments displayed on the operation interface. The basic merge operation combines two adjacent segments. On this basis, if several segments need to be merged at once, they are merged pairwise in segment order, which allows any number of segments to be merged. For example, dragging the boundary label of one segment until it touches the boundary label of the adjacent segment merges the two into a new segment; dragging the boundary label of a segment across intervening segments until it touches the boundary label of another segment merges multiple segments.
Please refer to Figures 3 and 4. Figure 3 is a schematic diagram of boundary adjustment of segment data, and Figure 4 is a schematic diagram of segment merging of segment data.
As shown in Figure 3, in response to a first editing operation on the active end of the first boundary axis control 3031 of the active segment 3011 in the segment data, the active end of the first boundary axis control 3031 is moved from position A to the first position, which is position C in Figure 3. Because no second boundary axis control overlapping the active end of the first boundary axis control 3031 exists at the first position (position C), the boundary of the active segment 3011 is adjusted according to the first position (position C); that is, the boundary of the active segment 3011 is adjusted from span AB to span AC.
As shown in Figure 4, in response to a first editing operation on the active end of the first boundary axis control 4031 of the active segment 4011 in the segment data, the active end of the first boundary axis control 4031 is moved to the first position, which is position F in Figure 4. Because a second boundary axis control 4032 overlapping the active end of the first boundary axis control 4031 exists at the first position (position F), the active segment 4011 and the second segment 4012 are merged to obtain the merged segment 4013. The boundary of the boundary axis control 4033 of the merged segment 4013 is span DC.
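The adjacent-segment behavior described above (adjust the boundary when nothing is touched, merge when the dragged end lands on the neighboring segment) can be illustrated with a minimal Python sketch. This is not the application's implementation: the `Segment` type, the function names, the restriction to dragging the right-hand end, and the rule that "overlap" means the new position falls within the neighbor's span are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    start: float      # left boundary, in seconds
    end: float        # right boundary, in seconds
    text: str = ""    # transcribed text, if any


def same_segment(a: Segment, b: Segment) -> bool:
    """Two segments are the same when both boundaries coincide."""
    return a.start == b.start and a.end == b.end


def drag_right_end(segments, index, new_end):
    """Drag the right end of segments[index] to new_end (hypothetical helper).

    Assumes `segments` is a list sorted by time with no overlaps.
    Returns a new list; the input list is left unchanged.
    """
    active = segments[index]
    nxt = segments[index + 1] if index + 1 < len(segments) else None

    # Merge case: the dragged end lands on the adjacent segment's span.
    if (nxt is not None and not same_segment(active, nxt)
            and nxt.start <= new_end <= nxt.end):
        merged = Segment(active.start, nxt.end, active.text + nxt.text)
        return segments[:index] + [merged] + segments[index + 2:]

    # Adjustment case: nothing is touched, so only the boundary moves.
    adjusted = replace(active, end=new_end)
    return segments[:index] + [adjusted] + segments[index + 1:]
```

Returning a new list rather than mutating in place is a design choice of this sketch; it makes an undo step straightforward, which the application does not discuss.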
In some embodiments, performing boundary adjustment or segment merging on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, moving the active end of the first boundary axis control to a second position; determining whether a third boundary axis control that overlaps the active end of the first boundary axis control exists at the second position, where the third boundary axis control is the boundary axis control of a third segment and the active segment and the third segment are non-adjacent segments; and, if such a third boundary axis control exists at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
In some embodiments, after determining whether a third boundary axis control that overlaps the active end of the first boundary axis control exists at the second position, the method further includes: if no such third boundary axis control exists at the second position, determining whether the target region between the stationary end of the first boundary axis control and the second position overlaps any intermediate segment; if the target region does not overlap any intermediate segment, adjusting the boundary of the active segment according to the second position; or, if the target region overlaps at least one intermediate segment, merging the active segment with all intermediate segments that overlap the target region.
For example, multiple segments can be merged quickly by dragging the boundary axis control of a segment displayed on the operation interface. Specifically, after the first boundary axis control of the active segment is dragged, the active segment and all intermediate segments that overlap the target region are merged, so that several segments are merged at once. The target region is the region between the stationary end of the first boundary axis control and the second position. In other words, when the dragged boundary ends up within the range of other segments, all segments within that range are merged.
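The multi-segment case can be sketched in the same spirit. The following Python example is an assumption-laden illustration, not the application's code: segments are plain `(start, end)` tuples sorted by time, only the right-hand end is dragged, and a later segment counts as overlapping the target region when its left boundary is at or before the new position.

```python
def drag_right_end_multi(segments, index, new_end):
    """Drag the right end of segments[index] to new_end (hypothetical helper).

    The target region runs from the stationary (left) end of the active
    segment to new_end. Every later segment that starts inside this region
    is merged into the active segment; if there is none, only the boundary
    of the active segment is adjusted.
    """
    start, _old_end = segments[index]
    later = segments[index + 1:]

    # Because the list is sorted, the covered segments form a prefix of `later`.
    covered = [seg for seg in later if seg[0] <= new_end]

    if not covered:
        # No overlap: plain boundary adjustment.
        return segments[:index] + [(start, new_end)] + later

    # Overlap: the merged segment extends to whichever is further right,
    # the new position or the end of the last covered segment.
    merged_end = max(new_end, covered[-1][1])
    return segments[:index] + [(start, merged_end)] + later[len(covered):]
```

With segments `[(0, 1), (2, 3), (4, 5), (6, 7)]`, dragging the first segment's right end to 4.5 lands inside the third segment, so the first three segments collapse into `(0, 5)`.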
In some embodiments, the method further includes: in response to a breakpoint insertion operation on a target segment in the segment data, inserting a breakpoint into the boundary axis control of the target segment, so that the target segment is split at the breakpoint.
For example, inserting a breakpoint splits the target segment, which makes segment adjustment more flexible.
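A breakpoint split is the inverse of a merge. The short Python sketch below assumes segments are `(start, end)` tuples and that a breakpoint must fall strictly inside the target segment; the function name and the validation rule are illustrative assumptions.

```python
def split_segment(segments, index, breakpoint_time):
    """Split segments[index] into two segments at breakpoint_time (hypothetical helper)."""
    start, end = segments[index]
    if not start < breakpoint_time < end:
        raise ValueError("breakpoint must lie strictly inside the segment")
    halves = [(start, breakpoint_time), (breakpoint_time, end)]
    return segments[:index] + halves + segments[index + 1:]
```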
Step 160: perform speech recognition on the processed segment data to obtain transcribed text.
For example, automatic transcription can be performed by calling a speech recognition module configured on the terminal or a third-party speech recognition module, which performs speech recognition on the processed segment data to obtain the transcribed text.
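Because the recognizer may be either a local module or a third-party one, one way to picture step 160 is as a loop that cuts the audio per segment and hands each clip to a pluggable callable. In the Python sketch below, `recognize` is a placeholder for whichever engine is used; its signature, the sample-list representation of audio, and the dictionary layout of the result are assumptions, not part of the application.

```python
def transcribe_segments(samples, sample_rate, segments, recognize):
    """Run a recognizer over each segment and collect one text fragment per segment.

    samples      -- sequence of audio samples (mono)
    sample_rate  -- samples per second
    segments     -- list of (start_seconds, end_seconds) tuples
    recognize    -- callable(clip, sample_rate) -> str; stands in for the
                    terminal's or a third party's speech recognition module
    """
    fragments = []
    for start, end in segments:
        # Convert segment boundaries from seconds to sample indices.
        clip = samples[int(start * sample_rate):int(end * sample_rate)]
        fragments.append({"start": start, "end": end,
                          "text": recognize(clip, sample_rate)})
    return fragments
```

Keeping the boundaries next to each text fragment is what later allows the fragment to be matched against the playback position.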
In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data, and after performing speech recognition on the processed segment data to obtain the transcribed text, the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.
For example, after automatic transcription has generated the initial transcribed text, the user can enter, through the operation interface, a modification instruction for a target text fragment in the transcribed text in order to update the transcribed text manually. The modification instruction may include modifying words, deleting words, adding words, and changing the font, font size, or font color.
In some embodiments, the method further includes: in response to an annotation instruction for the target text fragment, annotating the target text fragment to obtain annotated transcribed text.
For example, an annotation instruction for the target text fragment can be entered through the operation interface to annotate the target text fragment and obtain the annotated transcribed text. The target text fragment can be given any of the following kinds of annotation: industry domain, content category, part of speech, dependency relation, entity, relation, event, reading comprehension, and question answering.
Step 170: update the project file according to the transcribed text to obtain an updated project file, where the updated project file carries the transcribed text.
For example, the transcribed text and the path of the media file are saved together in a project file with a fixed format (.Baf), thereby updating the project file. The updated project file carries the transcribed text.
For example, updating the project file may involve: initializing the waveform of the audio data, constructing the array of segment waveform information, and updating the display of the segment waveform information; saving the media file information and segment data to the project file; sending a media-file-changed notification; switching the media file in the player; updating the title information in the software; and updating the interface and related control information in the controller.
During initialization, the in-memory data used for display can be initialized from the audio results parsed by the audio/video data parsing thread and the segment information obtained by segmentation, after which default values are set for the parameters that will be needed.
Step 180: when the updated project file is played on the display interface, display the media file together with the text fragment of the transcribed text that corresponds to the playback progress of the media file.
For example, when the updated project file is played on the display interface, the media file is displayed together with the text fragment of the transcribed text that corresponds to the playback progress of the media file. The playback progress can also be controlled through the playback controls on the display interface.
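Finding the text fragment that corresponds to the current playback position is a lookup over time-ordered fragments. The Python sketch below uses a binary search from the standard `bisect` module; the `(start, end, text)` tuple layout and the half-open interval convention are assumptions for illustration.

```python
import bisect


def fragment_at(fragments, position):
    """Return the text of the fragment playing at `position`, or None.

    fragments -- list of (start_seconds, end_seconds, text), sorted by start
    position  -- current playback position in seconds
    A fragment covers the half-open interval [start, end).
    """
    starts = [start for start, _end, _text in fragments]
    # Index of the last fragment whose start is at or before the position.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and position < fragments[i][1]:
        return fragments[i][2]
    return None  # the position falls in a gap between segments
```

In a player, the list of start times would normally be built once when the project file is loaded rather than on every lookup.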
For example, the embodiments of this application also provide a multi-format import and export function. It supports importing Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others, and supports exporting the above file types as well as the eaf format. This makes it easy to migrate transcription files between formats. For multi-format import and export, interface functions for reading and writing files can be provided for each file type and file access method, so that files of different types can be read on import and written on export. For example, Excel, srt, and other format files can be imported together with their corresponding media files, data files can be converted to the Baf format, and multiple file formats can be selected and exported in a single operation.
For example, the correspondence between import file types and import interfaces may be as shown in Table 1:
Table 1

| File type | Import interface |
| --- | --- |
| Xls, Xlsx | DoImportFile_Excel |
| Lrc | DoImportFile_Lrc |
| Srt | DoImportFile_Srt |
| Docx | DoImportFile_Docx |
| Json | DoImportFile_Json |
| Aud | DoImportFile_Aud |
| Txt | DoImportFile_Txt |
For example, the correspondence between export file types and export interfaces may be as shown in Table 2:
Table 2

| File type | Export interface |
| --- | --- |
| Xls, Xlsx | ExportFile_Excel |
| Lrc | DoExportFile_LRC |
| Srt | DoExportFile_SRT |
| Aud | DoExportFile_Audacity |
| STL | DoExportFile_STL |
| Docx, Txt | DoExportFile_Txt |
| EAF | IBAF::SaveTo |
In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting from the project file an export file corresponding to the target file type, where the target file type is any one of the preset file types.
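Routing an export instruction to a per-format writer can be sketched as a small lookup table. The Python example below is illustrative only: the function names are hypothetical and are not the interfaces listed in Table 2, only two formats are shown, and the output follows the commonly used SRT and LRC text layouts.

```python
def _srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def export_srt(fragments):
    """Render (start, end, text) fragments as numbered SRT cues."""
    cues = [f"{i}\n{_srt_time(start)} --> {_srt_time(end)}\n{text}\n"
            for i, (start, end, text) in enumerate(fragments, 1)]
    return "\n".join(cues)


def export_lrc(fragments):
    """Render fragments as LRC lines of the form [mm:ss.xx]text."""
    lines = []
    for start, _end, text in fragments:
        minutes, seconds = divmod(start, 60)
        lines.append(f"[{int(minutes):02d}:{seconds:05.2f}]{text}")
    return "\n".join(lines) + "\n"


# Hypothetical registry: target file type -> writer function.
EXPORTERS = {"srt": export_srt, "lrc": export_lrc}


def export(fragments, file_type):
    """Export fragments in the requested type, rejecting non-preset types."""
    try:
        exporter = EXPORTERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported export type: {file_type}") from None
    return exporter(fragments)
```

Exporting several formats in one operation then reduces to calling `export` once per selected type.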
For example, Figure 5 is a schematic diagram of a file export application scenario. Part 5-1 of Figure 5 shows the file export interface, on which the target file type and other export settings can be configured; for instance, the target file type is set to Excel and the export language is set to Mandarin. After the export instruction is executed, the file is exported according to these settings; the exported Excel file has the content shown in part 5-2 of Figure 5.
For example, Figure 6 is a schematic diagram of another file export application scenario. Part 6-1 of Figure 6 shows the file export interface, on which the target file type and other export settings can be configured; for instance, the target file type can be set to Excel, Word, and EAF at the same time, and the export language is set to dialect. After the export instruction is executed, the files are exported according to these settings. When several file formats are selected as the target file type, all of them are exported in a single operation; the exported Excel file has the content shown in part 6-2 of Figure 6.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others. Export is supported for the above file types as well as the eaf format, which makes it easy to migrate transcription files across formats.
In some embodiments, the method further includes: in response to an import instruction, obtaining an import file; and, when the file type of the import file is any one of the preset file types, importing the import file into the project file.
For example, Figure 7 is a schematic diagram of the file import interface. On the file import interface, the user can select an import file, or an import file together with a media file. When the file type of the import file is any one of the preset file types, the import file is imported into the project file.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others. Import is supported for the above file types, which makes it easy to migrate transcription files across formats.
All of the above technical solutions may be combined in any manner to form optional embodiments of this application, which are not described one by one here.
In the embodiments of this application, a project file corresponding to a media file to be processed is obtained; audio data of the media file is obtained according to the directory of the project file; the audio data is segmented according to its amplitude to obtain segment data of the audio data; the segment data is displayed on an operation interface, which provides a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, boundary adjustment or segment merging is performed on the segment data to obtain processed segment data; speech recognition is performed on the processed segment data to obtain transcribed text; the project file is updated according to the transcribed text to obtain an updated project file that carries the transcribed text; and, when the updated project file is played on the display interface, the media file is displayed together with the text fragment of the transcribed text that corresponds to the playback progress of the media file. The embodiments of this application thus provide a simple and convenient way to transcribe speech: transcription of multiple languages is supported through self-built language templates, segments are merged quickly by dragging the boundary axis controls of the segments displayed on the operation interface, and boundaries are fine-tuned by dragging horizontally on the boundary axis control of a segment waveform displayed on the operation interface. This improves the efficiency of speech transcription and annotation and meets the usage needs of the various scenarios described above.
To better implement the multi-modal rapid transcription and annotation method based on a self-built template described in the embodiments of this application, the embodiments of this application also provide a multi-modal rapid transcription and annotation system based on a self-built template. Please refer to Figure 8, which is a schematic structural diagram of the system provided by an embodiment of this application. The multi-modal rapid transcription and annotation system 800 based on a self-built template is applied to a terminal device that provides a graphical user interface, and may include:
a first obtaining unit 801, configured to obtain the project file corresponding to the media file to be processed;
a second obtaining unit 802, configured to obtain the audio data of the media file according to the directory of the project file;
a segmentation unit 803, configured to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;
a display unit 804, configured to display the segment data of the audio data on the operation interface, where the operation interface provides a display interface and boundary axis controls;
a processing unit 805, configured to perform boundary adjustment or segment merging on the segment data in response to an editing operation on a boundary axis control, to obtain processed segment data;
a transcription unit 806, configured to perform speech recognition on the processed segment data to obtain transcribed text;
an update unit 807, configured to update the project file according to the transcribed text to obtain an updated project file, where the updated project file carries the transcribed text;
a playback unit 808, configured to display, when the updated project file is played on the display interface, the media file together with the text fragment of the transcribed text that corresponds to the playback progress of the media file.
In some embodiments, the processing unit 805 may be configured to: in response to a first editing operation on the active end of a first boundary axis control of an active segment in the segment data, move the active end of the first boundary axis control to a first position; determine whether a second boundary axis control that overlaps the active end of the first boundary axis control exists at the first position, where the second boundary axis control is the boundary axis control of a second segment and the active segment and the second segment are adjacent segments; and, if such a second boundary axis control exists at the first position, merge the active segment with the second segment.
In some embodiments, after determining whether a second boundary axis control that overlaps the active end of the first boundary axis control exists at the first position, the processing unit 805 may be further configured to: if no such second boundary axis control exists at the first position, adjust the boundary of the active segment according to the first position.
In some embodiments, the processing unit 805 may be configured to: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, move the active end of the first boundary axis control to a second position; determine whether a third boundary axis control that overlaps the active end of the first boundary axis control exists at the second position, where the third boundary axis control is the boundary axis control of a third segment and the active segment and the third segment are non-adjacent segments; and, if such a third boundary axis control exists at the second position, merge the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
In some embodiments, after determining whether a third boundary axis control that overlaps the active end of the first boundary axis control exists at the second position, the processing unit 805 may be further configured to: if no such third boundary axis control exists at the second position, determine whether the target region between the stationary end of the first boundary axis control and the second position overlaps any intermediate segment; if the target region does not overlap any intermediate segment, adjust the boundary of the active segment according to the second position; or, if the target region overlaps at least one intermediate segment, merge the active segment with all intermediate segments that overlap the target region.
In some embodiments, the segmentation unit 803 may be configured to segment the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data, to obtain the segment data of the audio data.
In some embodiments, when segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data, the segmentation unit 803 may be configured to: obtain initial segment data of the audio data; determine whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if the average amplitude within the current segment is greater than the noise amplitude threshold, mark the current segment as a voiced segment; trim the start point and end point of the current segment marked as voiced, based on the audio points within it, to remove silence or noise from the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merge the trimmed current segment with the previous segment; if the start position of the trimmed current segment differs from the end position of the previous segment, mark the trimmed current segment as a new segment; and traverse the initial segment data of the audio data in this way to obtain the segment data of the audio data.
In some embodiments, when obtaining the initial segment data of the audio data, the segmentation unit 803 may be configured to: perform initial segmentation of the audio data according to a preset language template to obtain the initial segment data of the audio data.
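The amplitude-based segmentation steps listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: audio is a list of sample values, initial segments are `(start, end)` sample-index pairs with an exclusive end (however the preset language template produces them), amplitude means absolute sample value, and trimming removes leading and trailing samples at or below the threshold. None of these details is specified by the application.

```python
def segment_by_amplitude(samples, initial_segments, noise_threshold):
    """Derive voiced segments from initial segments using a noise threshold."""
    result = []
    for start, end in initial_segments:
        window = samples[start:end]
        if not window:
            continue

        # Keep only segments whose average amplitude exceeds the threshold.
        average = sum(abs(x) for x in window) / len(window)
        if average <= noise_threshold:
            continue  # silence or noise, not a voiced segment

        # Trim leading and trailing samples that are silence or noise.
        # A voiced window has at least one loud sample, so start < end holds.
        while abs(samples[start]) <= noise_threshold:
            start += 1
        while abs(samples[end - 1]) <= noise_threshold:
            end -= 1

        if result and result[-1][1] == start:
            # The trimmed segment begins where the previous one ended: merge.
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result
```

For samples `[0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 4, 4]`, initial segments `[(0, 4), (4, 8), (8, 12)]`, and a threshold of 1, the first two segments are trimmed to `(2, 4)` and `(4, 6)` and merged because they touch, while the third becomes a separate segment `(10, 12)`.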
In some embodiments, the first obtaining unit 801 may be configured to: obtain the media file to be processed; detect whether a corresponding project file has been created for the media file; if no corresponding project file has been created for the media file, create the project file corresponding to the media file based on a template file; or, if a corresponding project file has already been created for the media file, obtain that project file.
In some embodiments, the processing unit 805 may be further configured to: in response to an export instruction carrying a target file type, export from the project file an export file corresponding to the target file type, where the target file type is any one of the preset file types.
In some embodiments, the processing unit 805 may be further configured to: in response to an import instruction, obtain an import file; and, when the file type of the import file is any one of the preset file types, import the import file into the project file.
In some embodiments, the display unit 804 may be configured to display, on the operation interface, the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information.
In some embodiments, the display unit 804 may be further configured to hide the segment waveform information and the timeline information on the operation interface in response to a hide-waveform instruction.
In some embodiments, the processing unit 805 may be further configured to: in response to a breakpoint insertion operation on a target segment in the segment data, insert a breakpoint into the boundary axis control of the target segment, so that the target segment is split at the breakpoint.
In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data, and after performing speech recognition on the processed segment data to obtain the transcribed text, the transcription unit 806 may be further configured to: in response to a modification instruction for a target text fragment in the transcribed text, modify the target text fragment to obtain modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.
In some embodiments, the transcription unit 806 may be further configured to: in response to an annotation instruction for the target text fragment, annotate the target text fragment to obtain annotated transcribed text.
All of the above technical solutions may be combined in any manner to form optional embodiments of the present application, which are not described one by one here.
It should be understood that the system embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments; to avoid repetition, they are not repeated here. Specifically, the system shown in Figure 8 can execute the above embodiments of the multi-modal fast transcription and annotation method based on a self-built template, and the foregoing and other operations and/or functions of the units in the system respectively implement the corresponding processes of the method embodiments, which, for brevity, are not repeated here.
Correspondingly, an embodiment of the present application further provides a terminal device, which may be a terminal or a server; the terminal may be a smartphone, tablet computer, laptop, smart TV, smart speaker, wearable smart device, personal computer, or similar device. Figure 9 is a schematic structural diagram of the terminal device provided by an embodiment of the present application. The terminal device 900 includes a processor 901 with one or more processing cores, a memory 902 with one or more computer-readable storage media, and a computer program stored in the memory 902 and executable on the processor. The processor 901 is electrically connected to the memory 902. Those skilled in the art will understand that the terminal device structure shown in the figure does not limit the terminal device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The processor 901 is the control center of the terminal device 900. It connects the various parts of the terminal device 900 through various interfaces and lines, and performs the functions of the terminal device 900 and processes data by running or loading software programs and/or modules stored in the memory 902 and by invoking data stored in the memory 902, thereby monitoring the terminal device 900 as a whole.
In this embodiment of the present application, the processor 901 in the terminal device 900 loads instructions corresponding to the processes of one or more application programs into the memory 902 according to the following steps, and runs the application programs stored in the memory 902 to implement various functions:
obtaining a project file corresponding to a media file to be processed; obtaining audio data of the media file according to the directory of the project file; segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data; displaying the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performing speech recognition on the processed segment data to obtain transcribed text; updating the project file according to the transcribed text to obtain an updated project file, the updated project file carrying the transcribed text; and, when the updated project file is played on the display interface, displaying the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
For the specific implementation of each of the above operations, refer to the preceding embodiments; details are not repeated here.
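As a rough illustration of the amplitude-based segmentation step, the sketch below follows the flow recited in the claims: take initial segments, keep those whose average amplitude exceeds a noise threshold, trim quiet samples from both ends, and merge a trimmed segment with the previous one when they meet. Several details are assumptions of the sketch rather than statements from the source: the initial segments are fixed-length windows (the source derives them from a preset language template), audio is a plain list of numeric samples, and positions are sample indices with an exclusive end.

```python
def segment_by_amplitude(samples, window, noise_threshold):
    """Return (start, end) sample-index pairs (end exclusive) for voiced segments.

    samples: sequence of numeric audio samples.
    window: length of each initial segment, in samples (assumed fixed here).
    noise_threshold: amplitudes at or below this value count as noise/silence.
    """
    segments = []
    for w_start in range(0, len(samples), window):
        chunk = samples[w_start:w_start + window]
        mean_amp = sum(abs(x) for x in chunk) / len(chunk)
        if mean_amp <= noise_threshold:
            continue  # initial segment is not voiced; skip it

        # Trim leading and trailing samples that do not exceed the threshold.
        # A mean above the threshold guarantees at least one loud sample.
        loud = [i for i, x in enumerate(chunk) if abs(x) > noise_threshold]
        start = w_start + loud[0]
        end = w_start + loud[-1] + 1

        if segments and segments[-1][1] == start:
            # Trimmed segment begins where the previous one ended: merge them.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

For example, `segment_by_amplitude([0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 6, 6, 0], 4, 1)` yields two segments: the first two windows are trimmed and merged because they meet at index 4, the silent third window is dropped, and the last window is trimmed to its loud samples.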
In some embodiments, as shown in Figure 9, the terminal device 900 further includes a display unit 903, a radio frequency circuit 904, an audio circuit 905, an input unit 906, and a power supply 907. The processor 901 is electrically connected to each of the display unit 903, the radio frequency circuit 904, the audio circuit 905, the input unit 906, and the power supply 907. Those skilled in the art will understand that the terminal device structure shown in Figure 9 does not limit the terminal device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The display unit 903 may be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the terminal device, which may be composed of graphics, text, icons, video, and any combination thereof. The display unit 903 may include a display panel and a touch panel.
The radio frequency circuit 904 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other terminal devices and exchange signals with them.
The audio circuit 905 may be used to provide an audio interface between the user and the terminal device through a speaker and a microphone.
The input unit 906 may be used to receive entered digits, character information, or user characteristic information (such as fingerprint, iris, or facial information), and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
The power supply 907 is used to supply power to the components of the terminal device 900. In some embodiments, the power supply 907 may be logically connected to the processor 901 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system. The power supply 907 may further include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
Although not shown in Figure 9, the terminal device 900 may further include a camera, sensors, a wireless fidelity (Wi-Fi) module, a Bluetooth module, and the like, which are not described here.
For the specific implementation of each of the above operations, refer to the preceding embodiments; details are not repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by instructions, or by instructions controlling the relevant hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium storing a plurality of computer programs that can be loaded by a processor to execute the steps of any of the multi-modal fast transcription and annotation methods based on a self-built template provided by the embodiments of the present application. For the specific implementation of each of the above operations, refer to the preceding embodiments; details are not repeated here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Because the computer programs stored in the storage medium can execute the steps of any of the multi-modal fast transcription and annotation methods based on a self-built template provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any of those methods; see the preceding embodiments for details, which are not repeated here.
An embodiment of the present application further provides a computer program product that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the corresponding processes of any of the multi-modal fast transcription and annotation methods based on a self-built template in the embodiments of the present application, which, for brevity, are not repeated here.
An embodiment of the present application further provides a computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the corresponding processes of any of the multi-modal fast transcription and annotation methods based on a self-built template in the embodiments of the present application, which, for brevity, are not repeated here.
The multi-modal fast transcription and annotation method based on a self-built template, the multi-modal fast transcription and annotation system based on a self-built template, and the storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (18)

  1. A multi-modal fast transcription and annotation method based on a self-built template, characterized in that the method comprises:
    obtaining a project file corresponding to a media file to be processed;
    obtaining audio data of the media file according to the directory of the project file;
    segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data;
    displaying the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control;
    in response to an editing operation on the boundary axis control, performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data;
    performing speech recognition on the processed segment data to obtain transcribed text;
    updating the project file according to the transcribed text to obtain an updated project file, the updated project file carrying the transcribed text; and
    when the updated project file is played on the display interface, displaying the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
  2. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the performing, in response to an editing operation on the boundary axis control, boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data comprises:
    in response to a first editing operation on the active end of a first boundary axis control of an active segment in the segment data, controlling the active end of the first boundary axis control to move to a first position;
    determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the second boundary axis control being the boundary axis control corresponding to a second segment, and the active segment and the second segment being adjacent segments; and
    if a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, merging the active segment and the second segment.
  3. The multi-modal fast transcription and annotation method based on a self-built template according to claim 2, characterized in that, after the determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the method further comprises:
    if no second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, adjusting the boundary of the active segment according to the first position.
  4. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the performing, in response to an editing operation on the boundary axis control, boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data comprises:
    in response to a second editing operation on the active end of a first boundary axis control of an active segment in the segment data, controlling the active end of the first boundary axis control to move to a second position;
    determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the third boundary axis control being the boundary axis control corresponding to a third segment, and the active segment and the third segment being non-adjacent segments; and
    if a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
  5. The multi-modal fast transcription and annotation method based on a self-built template according to claim 4, characterized in that, after the determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the method further comprises:
    if no third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, determining whether the target region between the stationary-end position of the first boundary axis control and the second position overlaps any of the intermediate segments;
    if the target region between the stationary-end position of the first boundary axis control and the second position does not overlap any of the intermediate segments, adjusting the boundary of the active segment according to the second position; or
    if the target region between the stationary-end position of the first boundary axis control and the second position overlaps at least one of the intermediate segments, merging the active segment with all intermediate segments that overlap the target region.
  6. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data comprises:
    segmenting the audio data according to the magnitude relationship between a noise amplitude threshold and the amplitude of the audio data, to obtain the segment data of the audio data.
  7. The multi-modal fast transcription and annotation method based on a self-built template according to claim 6, characterized in that the segmenting the audio data according to the magnitude relationship between a noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data comprises:
    obtaining initial segmentation data of the audio data;
    determining whether the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold;
    if the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold, marking the current segment as a voiced segment;
    trimming the segment start point and segment end point of the audio points within the current segment marked as a voiced segment, to remove silence or noise within the current segment;
    if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment;
    if the start position of the trimmed current segment differs from the end position of the previous segment, marking the trimmed current segment as a new segment; and
    traversing the initial segmentation data of the audio data to obtain the segment data of the audio data.
  8. The multi-modal fast transcription and annotation method based on a self-built template according to claim 7, characterized in that the obtaining initial segmentation data of the audio data comprises:
    performing initial segmentation on the audio data according to a preset language template to obtain the initial segmentation data of the audio data.
  9. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the obtaining a project file corresponding to a media file to be processed comprises:
    obtaining the media file to be processed;
    detecting whether a corresponding project file has been created for the media file;
    if it is detected that no corresponding project file has been created for the media file, creating the project file corresponding to the media file based on a template file; or
    if it is detected that a corresponding project file has been created for the media file, obtaining the created project file corresponding to the media file.
  10. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the method further comprises:
    in response to an export instruction carrying a target file type, exporting from the project file an export file corresponding to the target file type, the target file type being any one of the preset file types.
  11. The multi-modal fast transcription and annotation method based on a self-built template according to claim 10, characterized in that the method further comprises:
    obtaining an import file in response to an import instruction; and
    when the file type of the import file is any one of the preset file types, importing the import file into the project file.
  12. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the displaying the segment data of the audio data on an operation interface comprises:
    displaying, on the operation interface, segment waveform information of the segment data of the audio data and timeline information corresponding to the segment waveform information.
  13. The multi-modal fast transcription and annotation method based on a self-built template according to claim 12, characterized in that the method further comprises:
    in response to a hide-waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.
  14. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the method further comprises:
    in response to an insert-breakpoint operation on a target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so that the target segment is split based on the breakpoint.
  15. The multi-modal fast transcription and annotation method based on a self-built template according to claim 1, characterized in that the transcribed text includes a text fragment corresponding to each segment in the segment data, and, after the performing speech recognition on the processed segment data to obtain transcribed text, the method further comprises:
    in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain modified transcribed text, the target text fragment being at least one text fragment in the transcribed text.
  16. The multi-modal fast transcription and annotation method based on a self-built template according to claim 15, characterized in that the method further comprises:
    in response to an annotation instruction for the target text fragment, annotating the target text fragment to obtain annotated transcribed text.
  17. A multi-modal fast transcription and annotation system based on a self-built template, characterized in that the system comprises:
    a first obtaining unit, configured to obtain a project file corresponding to a media file to be processed;
    a second obtaining unit, configured to obtain audio data of the media file according to the directory of the project file;
    a segmentation unit, configured to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;
    a display unit, configured to display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control;
    a processing unit, configured to perform, in response to an editing operation on the boundary axis control, boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data;
    a transcription unit, configured to perform speech recognition on the processed segment data to obtain transcribed text;
    an update unit, configured to update the project file according to the transcribed text to obtain an updated project file, the updated project file carrying the transcribed text; and
    a playback unit, configured to display, when the updated project file is played on the display interface, the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
  18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being adapted to be loaded by a processor to execute the steps of the multi-modal fast transcription and annotation method based on a self-built template according to any one of claims 1 to 16.
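
For readers who want a concrete picture of the boundary-axis editing recited in claims 2 to 5, the sketch below collapses those cases into one rule and is not part of the claims. It assumes segments are sorted, non-overlapping `(start, end)` tuples, and that a boundary axis control "overlaps" the dragged end when the drop position falls within that segment's span. The function name `drag_boundary` and the `"start"`/`"end"` argument are hypothetical. Every segment that the region between the stationary end and the drop position overlaps is merged into the active segment; if none is, only the boundary moves.

```python
def drag_boundary(segments, index, active_end, pos):
    """Drag one end of segments[index] to pos and merge what the new span covers.

    segments: sorted, non-overlapping (start, end) tuples.
    active_end: "start" or "end", naming the end being dragged.
    Returns a new sorted list of (start, end) tuples.
    """
    seg_start, seg_end = segments[index]
    stationary = seg_end if active_end == "start" else seg_start
    if pos == stationary:
        raise ValueError("drop position must differ from the stationary end")
    lo, hi = sorted((stationary, pos))

    merged_lo, merged_hi = lo, hi
    result = []
    for j, (s, e) in enumerate(segments):
        if j == index:
            continue
        overlaps_region = s < hi and e > lo   # shares interior with target region
        under_drop_point = s <= pos <= e      # dragged end lands on this segment
        if overlaps_region or under_drop_point:
            merged_lo = min(merged_lo, s)
            merged_hi = max(merged_hi, e)
        else:
            result.append((s, e))
    result.append((merged_lo, merged_hi))
    result.sort()
    return result
```

With `segs = [(0, 2), (3, 5), (6, 8), (9, 11)]`, dragging the end of the first segment to 2.5 only adjusts its boundary, dragging it to 3.5 merges it with the adjacent segment, and dragging it to 7 merges it with the non-adjacent third segment and the intermediate one.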
PCT/CN2022/091181 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template WO2023212920A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/091181 WO2023212920A1 (en) 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template
CN202280002307.8A CN115136233B (en) 2022-05-06 2022-05-06 Multi-mode rapid transfer and labeling system based on self-built template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/091181 WO2023212920A1 (en) 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template

Publications (1)

Publication Number Publication Date
WO2023212920A1 true WO2023212920A1 (en) 2023-11-09

Family

ID=83387058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091181 WO2023212920A1 (en) 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template

Country Status (2)

Country Link
CN (1) CN115136233B (en)
WO (1) WO2023212920A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270437A1 (en) * 2007-04-26 2008-10-30 Custom Speech Usa, Inc. Session File Divide, Scramble, or Both for Manual or Automated Processing by One or More Processing Nodes
CN107657947A (en) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 Method of speech processing and its device based on artificial intelligence
CN110740275A (en) * 2019-10-30 2020-01-31 中央电视台 nonlinear editing systems
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
WO2021259221A1 (en) * 2020-06-23 2021-12-30 北京字节跳动网络技术有限公司 Video translation method and apparatus, storage medium, and electronic device
CN114268829A (en) * 2021-12-22 2022-04-01 中电金信软件有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN114420125A (en) * 2020-10-12 2022-04-29 腾讯科技(深圳)有限公司 Audio processing method, device, electronic equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8270815B2 (en) * 2008-09-22 2012-09-18 A-Peer Holding Group Llc Online video and audio editing
US9666208B1 (en) * 2015-12-14 2017-05-30 Adobe Systems Incorporated Hybrid audio representations for editing audio content
CN108681530A (en) * 2018-05-04 2018-10-19 北京天元创新科技有限公司 Web-based official document generation method and system

Also Published As

Publication number Publication date
CN115136233A (en) 2022-09-30
CN115136233B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
US12020708B2 (en) Method and system for conversation transcription with metadata
US20220230374A1 (en) User interface for generating expressive content
KR101674851B1 (en) Automatically creating a mapping between text data and audio data
US20220059096A1 (en) Systems and Methods for Improved Digital Transcript Creation Using Automated Speech Recognition
US9330720B2 (en) Methods and apparatus for altering audio output signals
US20180286459A1 (en) Audio processing
CN107464555B (en) Method, computing device and medium for enhancing audio data including speech
US11836180B2 (en) System and management of semantic indicators during document presentations
US20100003006A1 (en) Video searching apparatus, editing apparatus, video searching method, and program
US20080077869A1 (en) Conference supporting apparatus, method, and computer program product
US20140013192A1 (en) Techniques for touch-based digital document audio and user interface enhancement
WO2022001579A1 (en) Audio processing method and apparatus, device, and storage medium
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN114023301A (en) Audio editing method, electronic device and storage medium
WO2022037600A1 (en) Abstract recording method and apparatus, and computer device and storage medium
CN112040142B (en) Method for video authoring on mobile terminal
WO2016179128A1 (en) Techniques to automatically generate bookmarks for media files
US20200135169A1 (en) Audio playback device and audio playback method thereof
CN114095782A (en) Video processing method and device, computer equipment and storage medium
US20160027471A1 (en) Systems and methods for creating, editing and publishing recorded videos
CN109460548B (en) Intelligent robot-oriented story data processing method and system
WO2023212920A1 (en) Multi-modal rapid transliteration and annotation system based on self-built template
US11238865B2 (en) Function performance based on input intonation
Arons Authoring and transcription tools for speech-based hypermedia systems
JP2022051500A (en) Related information provision method and system

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22940614

Country of ref document: EP

Kind code of ref document: A1