WO2023212920A1 - Multi-modal rapid transliteration and annotation system based on self-built template

Info

Publication number: WO2023212920A1
Application number: PCT/CN2022/091181
Authority: WO
Inventors: 李斌
Original assignee: 湖南师范大学
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2023-11-09
Also published as: CN115136233A; CN115136233B

Abstract

The present application discloses a multi-modal rapid transliteration and annotation system based on a self-built template, comprising: a first acquisition unit for acquiring a project engineering file corresponding to a media file; a second acquisition unit for acquiring audio data of the media file according to a directory of the project engineering file; a segmentation unit for performing segmentation processing on the audio data according to an amplitude of the audio data to obtain segment data of the audio data; a display unit for displaying the segment data on an operation interface, the operation interface being used for providing a display interface and a boundary axis control; a processing unit for performing, in response to an editing operation for the boundary axis control, boundary adjustment or segment merging on the segment data to obtain processed segment data, and then performing speech recognition processing to obtain a transliteration text; a transliteration unit for updating the project engineering file according to the transliteration text; and a playing unit for displaying, when playing the updated project engineering file on the display interface, a text fragment corresponding to a playing progress of the media file in the media file and the transliteration text.

Description

A multi-modal fast transcription and annotation system based on self-built templates

Technical field

This application relates to the field of speech processing technology, and specifically relates to a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium.

Background art

With the development of computer technology, the application of speech recognition technology is becoming more and more widespread. Speech recognition technology identifies the corresponding speech content from the collected speech information, that is, recognizes digital speech information into corresponding text.

Transcription technology is used to convert speech into written text. Speech transcription can be used for simple single-person speech transcription, or for complex multi-person speech transcription, such as conference speech transcription, court hearing speech transcription, classroom transcription, etc.

However, existing speech transcription and annotation tools do not support self-built language templates and scale poorly. They also cannot merge segments quickly or fine-tune segment boundaries, so they cannot meet the needs of many real-world scenarios, for example: production of video subtitles (*.SRT), production of external lyrics files for MP3 music (*.LRC), transcription of various recordings, language listening teaching, audio-visual teaching, spoken corpus construction, multimedia resource library construction, situational language research, and multimodal research on classroom teaching.

Technical problem

Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium. They provide a simple and convenient speech transcription and annotation method that performs speech transcription and annotation through self-built language templates and supports rapid merging of segments and fine-tuning of boundaries, improving the efficiency of transcription and annotation so as to meet the needs of the scenarios mentioned above.

Technical solutions

In one aspect, a multi-modal rapid transcription and annotation method based on a self-built template is provided. The method includes: obtaining the project engineering file corresponding to the media file to be processed; obtaining audio data of the media file according to the directory of the project engineering file; performing segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data; displaying the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performing boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performing speech recognition processing on the processed segment data to obtain a transcribed text; updating the project engineering file according to the transcribed text to obtain an updated project engineering file, the updated project engineering file carrying the transcribed text; and, when playing the updated project engineering file on the display interface, displaying the text fragment in the media file and the transcribed text that corresponds to the playback progress of the media file.

In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a first editing operation on the movable end of a first boundary axis control of an active segment in the segment data, controlling the movable end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, the second boundary axis control being the boundary axis control corresponding to a second segment, and the active segment and the second segment being adjacent segments; and, if a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, merging the active segment with the second segment.

In some embodiments, after determining whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, the method further includes: if no second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, adjusting the boundary of the active segment according to the first position.

In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a second editing operation on the movable end of the first boundary axis control of the active segment in the segment data, controlling the movable end of the first boundary axis control to move to a second position; determining whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, the third boundary axis control being the boundary axis control corresponding to a third segment, and the active segment and the third segment being non-adjacent segments; and, if a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.

In some embodiments, after determining whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, the method further includes: if no third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, determining whether the target area between the stationary end of the first boundary axis control and the second position overlaps any of the intermediate segments; if the target area does not overlap any of the intermediate segments, adjusting the boundary of the active segment according to the second position; or, if the target area overlaps at least one of the intermediate segments, merging the active segment with all intermediate segments that overlap the target area.
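The drag rules in the preceding embodiments (adjust the boundary when nothing is hit, merge with an adjacent or non-adjacent segment when the dragged end reaches it, and merge with every intermediate segment the dragged span covers) can be illustrated with a small sketch. It is not the application's implementation: the `Segment` class, the `drag_boundary` function, and the choice to treat "overlapping a boundary axis control" as "the dragged span strictly overlaps another segment's time range" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds


def drag_boundary(segments, active, side, position):
    """Drag one end of segments[active] to `position` and return the new list.

    `side` is "start" or "end" (the movable end); the other end stays fixed.
    Assumption: a segment is "hit" when it strictly overlaps the span between
    the stationary end and the new position.
    """
    seg = segments[active]
    if side == "end":
        if position <= seg.start:
            raise ValueError("end must stay to the right of the stationary start")
        lo, hi = seg.start, position
    elif side == "start":
        if position >= seg.end:
            raise ValueError("start must stay to the left of the stationary end")
        lo, hi = position, seg.end
    else:
        raise ValueError("side must be 'start' or 'end'")

    # Indices of all other segments covered (even partly) by the dragged span.
    hit = {i for i, s in enumerate(segments)
           if i != active and s.start < hi and s.end > lo}

    # No hit: plain boundary adjustment. Otherwise: merge everything that was hit.
    merged = Segment(
        min([lo] + [segments[i].start for i in hit]),
        max([hi] + [segments[i].end for i in hit]),
    )
    kept = [s for i, s in enumerate(segments) if i != active and i not in hit]
    return sorted(kept + [merged], key=lambda s: s.start)
```

With segments at 0-1 s, 2-3 s and 4-5 s, dragging the end of the first segment to 1.5 s only moves its boundary, dragging it to 2.5 s merges it with its neighbour, and dragging it to 4.2 s merges all three.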

In some embodiments, performing segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data includes: performing segmentation processing on the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data.

In some embodiments, performing segmentation processing on the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data includes: obtaining initial segment data of the audio data; determining whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if the average amplitude within the current segment is greater than the noise amplitude threshold, marking the current segment as a voiced segment; trimming the audio points within the current segment marked as a voiced segment, starting from the segment start point and the segment end point, to remove silence or noise from the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the start position of the trimmed current segment is not the same as the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segment data of the audio data in this way to obtain the segment data of the audio data.

In some embodiments, obtaining the initial segment data of the audio data includes: performing initial segment processing on the audio data according to a preset language template to obtain the initial segment data of the audio data.

In some embodiments, obtaining the project engineering file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if it is detected that no corresponding project engineering file has been created for the media file, creating the project engineering file corresponding to the media file based on a template file; or, if it is detected that a corresponding project engineering file has been created for the media file, obtaining the created project engineering file corresponding to the media file.

In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting an export file corresponding to the target file type from the project engineering file, the target file type being any one of the preset file types.
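As an illustration of exporting by target file type, the sketch below writes timed text segments as SubRip subtitles (*.SRT) or as LRC lyrics (*.LRC), the two formats named in the background section. The source does not list the preset file types or describe the exporter, so the function names, the `(start, end, text)` tuple layout, and the choice of these two formats as the preset types are assumptions.

```python
def to_srt_timestamp(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_lrc_timestamp(seconds):
    """Format seconds as the LRC time tag [mm:ss.xx]."""
    cs = round(seconds * 100)
    minutes, cs = divmod(cs, 6000)
    secs, cs = divmod(cs, 100)
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


def export_segments(segments, file_type):
    """Render (start, end, text) tuples as text in the requested file type."""
    if file_type == "srt":
        blocks = [
            f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n"
            for index, (start, end, text) in enumerate(segments, start=1)
        ]
        return "\n".join(blocks)  # blank line between subtitle blocks
    if file_type == "lrc":
        return "".join(f"{to_lrc_timestamp(start)}{text}\n" for start, _, text in segments)
    # Hypothetical preset list: anything else is rejected.
    raise ValueError(f"unsupported export type: {file_type}")
```

An import path could apply the same check on the file type before parsing the file back into segments.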

In some embodiments, the method further includes: in response to an import instruction, obtaining an imported file; and, when the file type of the imported file is any one of the preset file types, importing the imported file into the project engineering file.

In some embodiments, displaying the segment data of the audio data on the operation interface includes: displaying, on the operation interface, the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information.

In some embodiments, the method further includes: in response to a hide waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.

In some embodiments, the method further includes: in response to an insert breakpoint operation for a target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so as to split the target segment at the breakpoint.
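The breakpoint operation amounts to splitting one segment into two at a chosen time. A minimal sketch follows, assuming segments are `(start, end)` tuples in seconds; the function name `insert_breakpoint` is hypothetical.

```python
def insert_breakpoint(segments, index, position):
    """Split segments[index] into two segments at `position` (seconds)."""
    start, end = segments[index]
    if not start < position < end:
        raise ValueError("breakpoint must lie strictly inside the target segment")
    # The two halves share the breakpoint as their common boundary.
    return segments[:index] + [(start, position), (position, end)] + segments[index + 1:]
```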

In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data. After performing speech recognition processing on the processed segment data to obtain the transcribed text, the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain a modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.

In some embodiments, the method further includes: responding to annotation instructions for the target text fragment, annotating the target text fragment to obtain annotated transcribed text.

On the other hand, a multi-modal rapid transcription and annotation system based on self-built templates is provided, and the system includes:

The first acquisition unit is used to acquire the project engineering file corresponding to the media file to be processed;

The second acquisition unit is used to acquire the audio data of the media file according to the directory of the project engineering file;

A segmentation unit, configured to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;

A display unit configured to display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and boundary axis controls;

A processing unit, configured to perform boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, to obtain processed segment data;

A transliteration unit, used to perform speech recognition processing on the processed segment data to obtain transcribed text;

An update unit, configured to update the project engineering file according to the transcribed text to obtain an updated project engineering file, where the updated project engineering file carries the transcribed text;

A playback unit, configured to display text segments in the media file and the transcribed text corresponding to the playback progress of the media file when the updated project engineering file is played on the display interface.
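The playback unit's task of showing the text fragment that matches the playback progress can be sketched as a lookup over time-sorted segments. This is only an illustration: the `(start, end, text)` layout and the function name `text_at` are assumptions, and a binary search is one reasonable choice rather than the method stated in the source.

```python
from bisect import bisect_right


def text_at(segments, progress):
    """Return the text of the segment containing `progress` (seconds), or None.

    `segments` is a list of (start, end, text) tuples sorted by start time,
    with half-open time ranges [start, end).
    """
    starts = [start for start, _, _ in segments]
    index = bisect_right(starts, progress) - 1  # last segment starting at or before progress
    if index >= 0 and progress < segments[index][1]:
        return segments[index][2]
    return None  # playback is in a gap between segments
```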

In another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the steps of the multi-modal fast transcription and annotation method based on a self-built template described in the first aspect.

In another aspect, a terminal device is provided. The terminal device includes a processor and a memory. A computer program is stored in the memory, and the processor is used to execute, by calling the computer program stored in the memory, the steps of the multi-modal fast transcription and annotation method based on a self-built template described in the first aspect.

Beneficial effects

Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium, which: obtain the project engineering file corresponding to the media file to be processed; obtain the audio data of the media file according to the directory of the project engineering file; segment the audio data according to the amplitude of the audio data to obtain the segment data of the audio data; display the segment data of the audio data on the operation interface, the operation interface being used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, perform boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; perform speech recognition processing on the processed segment data to obtain the transcribed text; update the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, display the text fragment in the media file and the transcribed text that corresponds to the playback progress of the media file.

Embodiments of the present application can provide a simple and convenient speech transcription and annotation method. Transcription of multiple kinds of speech can be realized through self-built multi-language templates, and template import is supported for the many languages or dialects on which speech recognition cannot be performed, so that fast and efficient segmentation, transcription and annotation are ultimately achieved. Segments can be merged quickly by dragging the boundary axis control of a segment displayed on the operation interface, and a boundary can be fine-tuned by dragging horizontally, directly on the operation interface, on the boundary axis control corresponding to the segment waveform. This improves the efficiency of speech transcription and annotation so as to meet the needs of the scenarios mentioned above.

Description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can also be obtained based on these drawings without exerting creative efforts.

Figure 1 is a schematic flowchart of a multi-modal rapid transcription and annotation method based on a self-built template provided by an embodiment of the present application.

Figure 2 is a schematic diagram of the first application scenario provided by the embodiment of the present application.

Figure 3 is a schematic diagram of the second application scenario provided by the embodiment of the present application.

Figure 4 is a schematic diagram of the third application scenario provided by the embodiment of the present application.

Figure 5 is a schematic diagram of the fourth application scenario provided by the embodiment of the present application.

Figure 6 is a schematic diagram of the fifth application scenario provided by the embodiment of the present application.

Figure 7 is a schematic diagram of the sixth application scenario provided by the embodiment of the present application.

Figure 8 is a schematic structural diagram of a multi-modal fast transcription and annotation system based on self-built templates provided by an embodiment of the present application.

Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Embodiments of the invention

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making creative efforts fall within the scope of protection of this application.

Embodiments of the present application provide a multi-modal fast transcription and annotation method based on a self-built template, a multi-modal fast transcription and annotation system based on a self-built template, and a storage medium. Specifically, the multi-modal fast transcription and annotation method based on the self-built template in the embodiment of the present application can be executed by a terminal device, where the terminal device can be a terminal or a server. The terminal can be a device such as a smartphone, a tablet, a touch screen, or a personal computer (PC). The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution network services, and big data and artificial intelligence platforms, but is not limited to these.

Each is explained in detail below. It should be noted that the description order of the following embodiments is not used to limit the priority order of the embodiments.

Please refer to Figures 1 to 7. Figure 1 is a schematic flowchart of a multi-modal fast transcription and annotation method based on a self-built template provided by an embodiment of the present application, and Figures 2 to 7 are schematic diagrams of application scenarios provided by an embodiment of the present application. The method can be applied to the multi-modal fast transcription and annotation system based on a self-built template of the embodiment of the present application, and the system can be configured on a terminal device. The method includes the following steps:

Step 110: Obtain the project engineering file corresponding to the media file to be processed.

In some embodiments, obtaining the project engineering file corresponding to the media file to be processed includes: obtaining the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if it is detected that no corresponding project engineering file has been created for the media file, creating a project engineering file corresponding to the media file based on a template file; or, if it is detected that a project engineering file corresponding to the media file has been created, obtaining the created project engineering file corresponding to the media file.

For example, you can provide a target client, start the target client, and then open or import a media file to be processed through the target client to obtain the media file. For example, the media file may be an audio file or a video file.

For example, the target client can be the multi-modal rapid transcription and annotation system based on a self-built template, a software tool developed specifically for rapid transcription and annotation of audio and video language materials. The software can have built-in multi-language templates for Mandarin, Chinese dialects, minority languages and the like, directly supporting discourse transcription for the Chinese Language Resource Protection Project. The multi-language template can be a multi-layer annotation template. In addition, multi-language templates can be self-built according to project needs, and language transcription and annotation templates corresponding to different languages can also be built in. The target client can also be used in many application scenarios, such as production of video subtitles (*.SRT), production of external lyrics files for MP3 music (*.LRC), transcription of various recordings, language listening teaching, audio-visual teaching, spoken corpus construction, multimedia resource library construction, situational language research, and multi-modal research on classroom teaching.

Then, whether a corresponding project engineering file has been created for the media file is detected by checking whether a project file with the same name as the media file exists in the storage path. For media files that have been opened before, the target client can save history records, so that the next time such a media file is opened, the project file with the same name corresponding to the history record can be called directly instead of being created again from the media file, which streamlines processing. For example, the history record is record information of media files that have been opened within the historical period recorded by the target client.

For example, if a project file with the same name as the media file exists, it is determined that a corresponding project file has been created for the media file; the created project file with the same name as the media file is then obtained directly from the storage path, and step 120 is performed.

For example, if there is no project file with the same name as the media file, a project file with the same name corresponding to the media file is created based on the template file, and the corresponding project file is loaded, and then step 120 is performed.
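The lookup described above (reuse a project file with the same name as the media file, otherwise create one from the template file) can be sketched as follows. This is a minimal sketch rather than the application's implementation: the `.proj` suffix, the function name `get_or_create_project`, and creating the project by copying the template byte for byte are assumptions made for illustration.

```python
import shutil
from pathlib import Path

PROJECT_SUFFIX = ".proj"  # hypothetical extension; the source does not name one


def get_or_create_project(media_path, template_path):
    """Return (project_path, created) for the given media file.

    The project file sits next to the media file and shares its base name.
    """
    media = Path(media_path)
    project = media.with_suffix(PROJECT_SUFFIX)
    created = not project.exists()
    if created:
        # No same-name project yet: start from the template file.
        shutil.copyfile(template_path, project)
    return project, created
```

The second call for the same media file finds the existing project and leaves it untouched, which mirrors the "already created" branch.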

Step 120: Obtain the audio data of the media file according to the directory of the project engineering file.

For example, an audio and video data parsing thread is started; the media file to be processed is located in the storage path of the media file based on the media file information recorded in the directory of the project engineering file; and the audio data of the media file is extracted from the media file by the audio and video data parsing thread.

Step 130: Perform segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data.

For example, before performing segmentation processing, it is also necessary to determine whether the audio data needs to be segmented. If it is determined that the audio data needs to be segmented, the audio data is segmented. After the audio data is segmented, a notification of the end of the audio data segmentation is sent to the main thread. If it is determined that the audio data does not need to be segmented, a notification of the end of segmentation of the audio data is sent to the main thread.

Among them, it can be determined whether the audio data needs to be segmented by detecting whether the audio data in the project file contains divided segment data. If divided segment data exists, it is determined that the audio data does not need to be segmented. If there is no divided segment data, it is determined that the audio data needs to be segmented.

In some embodiments, segmenting the audio data according to the amplitude of the audio data to obtain segment data of the audio data includes: segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data.

For example, the audio data may be initially segmented according to a preset segmentation interval, or the audio data may be initially segmented according to a silence segment. Then, based on the relationship between the noise amplitude threshold and the amplitude of the audio data, the audio data is subjected to a second segmentation process to obtain segment data of the audio data.

In some embodiments, segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain segment data of the audio data includes: obtaining the initial segment data of the audio data; determining whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if the average amplitude within the current segment is greater than the noise amplitude threshold, marking the current segment as a voiced segment; trimming the audio points within the current segment marked as a voiced segment, starting from the beginning and the end of the segment, to remove silence or noise from the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment with the previous segment; if the start position of the trimmed current segment is different from the end position of the previous segment, marking the trimmed current segment as a new segment; and traversing the initial segment data of the audio data in this way to obtain the segment data of the audio data.
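The traversal described in this embodiment can be sketched in a few lines. The sketch assumes that audio is a list of sample values, that segments are half-open `(start, end)` sample-index ranges, and that a sample counts as silence or noise when its absolute value does not exceed the threshold; the function name `segment_by_amplitude` and these representations are illustrative assumptions, not details given in the source.

```python
def segment_by_amplitude(samples, initial_segments, threshold):
    """Refine initial segments using a noise amplitude threshold.

    samples:          list of numeric sample values
    initial_segments: list of (start, end) index ranges, end exclusive
    threshold:        noise amplitude threshold
    """
    result = []
    for start, end in initial_segments:
        magnitudes = [abs(v) for v in samples[start:end]]
        # Keep only voiced segments: average amplitude above the threshold.
        if not magnitudes or sum(magnitudes) / len(magnitudes) <= threshold:
            continue
        # Trim silence or noise inward from the segment start and end.
        while start < end and abs(samples[start]) <= threshold:
            start += 1
        while end > start and abs(samples[end - 1]) <= threshold:
            end -= 1
        if result and result[-1][1] == start:
            # Trimmed segment begins where the previous one ends: merge them.
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result
```

For instance, two consecutive loud initial segments with no silence between them come out as one merged segment, while an all-quiet initial segment is dropped.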

In some embodiments, obtaining the initial segment data of the audio data includes: performing initial segment processing on the audio data according to a preset language template, and obtaining the initial segment data of the audio data.

For example, the preset language template is capable of dividing segments. The preset language templates may include multi-language templates built into or self-built in the target client, enabling rapid creation of initial segment data. The multi-language template can be a multi-layer annotation template. For example, the multi-language templates may include language templates corresponding to different national languages, dialects of different regions, voices of different speakers, and so on, such as language templates for English, Mandarin, minority languages, Chinese dialects, female voices, male voices, children's voices, etc. The built-in multi-language templates can be language templates inserted through third-party software, and transcription of multiple kinds of speech can be realized through them. The self-built multi-language templates can be language templates created directly in the target client, and transcription and annotation of multiple kinds of speech can be realized through them.

In some embodiments, the preset language template includes a multi-language template built into or self-built in the target client. The multi-language template can include language templates corresponding to different national languages, dialects in different regions, voices of different speakers, etc. Because speakers of different genders and different languages may produce different noise, judging by a single noise threshold may bias the speech segmentation. Therefore, in this embodiment, the noise amplitude threshold is generated automatically from the speech signal of the current segment. For example, a noise amplitude threshold generation module can be built in; the preset language template is input into this module, and the noise amplitude threshold corresponding to the speech signal of the current segment is determined adaptively.

Specifically, in this embodiment, the speech signal corresponding to the current segment is obtained, and the amplitude distribution function corresponding to the speech signal of the current segment is obtained by fitting:

Here, x represents the signal amplitude of the current segmented speech, and σ represents the signal variance of the current segmented speech;

Determine the signal standard deviation corresponding to the current segmented speech based on the amplitude distribution function;

Based on the product of the standard deviation, the average amplitude, and the preset amplitude factor, the noise amplitude threshold corresponding to the current segmented speech is determined to be:

$$T_{am} = \alpha \cdot \sigma_{s} \cdot \bar{x}$$

where $T_{am}$ represents the noise amplitude threshold, $\sigma_{s}$ represents the standard deviation, $\bar{x}$ represents the average amplitude, and $\alpha$ represents the preset amplitude factor. In this embodiment, by determining the noise amplitude threshold in this way and segmenting the speech accordingly, noise and non-noise portions of the speech can be detected adaptively according to the speech conditions, thereby improving the accuracy of noise detection and segmentation.
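The threshold rule (amplitude factor times standard deviation times average amplitude) can be illustrated with a minimal Python sketch. It makes several assumptions that are not stated in the application: the standard deviation is estimated directly from the samples rather than from a fitted distribution, the average amplitude is taken as the mean absolute sample value, and the function name and the value of `alpha` are hypothetical.

```python
import math

def noise_amplitude_threshold(samples, alpha):
    """Return alpha * std * mean(|x|) for the samples of one segment.

    Assumption: std is the population standard deviation of the raw samples,
    and the average amplitude is the mean of their absolute values.
    """
    if not samples:
        raise ValueError("segment has no samples")
    n = len(samples)
    mean_signed = sum(samples) / n
    variance = sum((x - mean_signed) ** 2 for x in samples) / n
    std = math.sqrt(variance)
    mean_abs = sum(abs(x) for x in samples) / n
    return alpha * std * mean_abs

# Hypothetical usage: a louder segment yields a higher threshold.
quiet = noise_amplitude_threshold([0.1, -0.1, 0.1, -0.1], alpha=0.5)
loud = noise_amplitude_threshold([1.0, -1.0, 1.0, -1.0], alpha=0.5)
```

Because both the standard deviation and the average amplitude scale with signal level, the threshold adapts to each segment instead of being fixed for the whole recording.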

For example, the audio data can be initially segmented according to the preset segmentation interval to obtain the initial segment data of the audio data. For example, the preset segmentation interval may be an interval set according to a regular sentence segmentation time.

For example, the audio data can be initially segmented based on silent segments to obtain the initial segment data of the audio data. Silent segments in the audio data are detected, and the initial segmentation is performed based on their positions in the audio data: the head of a silent segment adjoins the end of the previous initial segment, and the tail of the silent segment adjoins the beginning of the next initial segment.

For example, normal pauses within a sentence produce short silent segments, and splitting on them would divide one complete sentence into multiple initial segments and yield too many initial segments. To avoid this, short silent segments can be ignored before the initial segmentation, and only silent segments whose audio length is greater than a preset length are used as target silent segments on which the initial segmentation is based. For example, the silent segments in the audio data are detected first, those whose audio length is greater than the preset length are selected as target silent segments, and the initial segmentation is then performed based on the positions of the target silent segments in the audio data.
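The silence-based initial segmentation described above can be sketched as follows. This is only an illustration under assumptions: audio is a plain list of sample amplitudes, a sample counts as silent when its absolute value is at most a hypothetical `silence_level`, and `min_len` (the preset length) is measured in samples. The function names are not taken from the application.

```python
def find_target_silences(amplitudes, silence_level, min_len):
    """Return (start, end) index pairs (end exclusive) of silent runs
    that are at least min_len samples long; shorter runs are ignored."""
    runs = []
    start = None
    for i, a in enumerate(amplitudes):
        if abs(a) <= silence_level:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(amplitudes) - start >= min_len:
        runs.append((start, len(amplitudes)))
    return runs

def initial_segments(amplitudes, silence_level, min_len):
    """Split the audio into the stretches lying between target silences."""
    segments = []
    cursor = 0
    for s, e in find_target_silences(amplitudes, silence_level, min_len):
        if s > cursor:
            segments.append((cursor, s))
        cursor = e
    if cursor < len(amplitudes):
        segments.append((cursor, len(amplitudes)))
    return segments

# The single silent sample at index 2 is too short to split on;
# the three-sample silence at indices 5..7 is a target silence.
amps = [5, 5, 0, 5, 5, 0, 0, 0, 5, 5]
print(initial_segments(amps, silence_level=1, min_len=2))
```

Each returned segment ends where a target silence begins and the next one starts where that silence ends, matching the adjacency described in the text.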

Then, based on the relationship between the noise amplitude threshold and the amplitude of the audio data, a second segmentation process is performed on the initial segment data. Specifically, it is judged whether the average amplitude in the current segment is greater than the noise amplitude threshold. If it is, the current segment is marked as a voiced segment, and the audio points within the voiced segment are trimmed from the beginning and the end of the segment to remove silence or noise from the current segment. If the start and end positions of the current segment and the previous segment are the same, the current segment is merged with the previous segment, and the merged segment is used as a segment in the segment data; if the start and end positions of the current segment and the previous segment are different, the current segment is marked as a new segment, and the new segment is used as a segment in the segment data.

For example, if the average amplitude in the current segment of the initial segment data is not greater than the noise amplitude threshold, the current segment is marked as a silent segment; a segment marked as silent is discarded and is not used as a segment in the segment data.
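A minimal sketch of this second pass is given below, assuming segments are `(start, end)` sample-index pairs with an exclusive end and that a single threshold is applied to every segment. The rule for merging with the previous segment is left out for brevity, and the function name is hypothetical.

```python
def refine_segments(amplitudes, segments, threshold):
    """Keep voiced segments, trimmed at both ends; drop silent ones."""
    refined = []
    for start, end in segments:
        chunk = [abs(a) for a in amplitudes[start:end]]
        if not chunk or sum(chunk) / len(chunk) <= threshold:
            continue  # average amplitude not above threshold: silent, discarded
        # Trim leading and trailing samples that do not exceed the threshold.
        # A voiced segment always has at least one sample above the threshold,
        # so both loops stop before the segment becomes empty.
        while abs(amplitudes[start]) <= threshold:
            start += 1
        while abs(amplitudes[end - 1]) <= threshold:
            end -= 1
        refined.append((start, end))
    return refined

amps = [0, 0, 8, 9, 0, 0, 1, 0, 0]
# The first segment is voiced and trimmed to samples 2..3; the second is silent.
print(refine_segments(amps, [(0, 5), (5, 9)], threshold=2))
```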

Step 140: Display the segment data of the audio data on the operation interface, which is used to provide a display interface and a boundary axis control.

For example, as shown in FIG. 2 , an operation interface 200 of the target client is provided, segment data 201 of the audio data is displayed on the operation interface 200 , and the operation interface 200 is used to provide a display interface 202 and a boundary axis control 203 .

For example, other editing interfaces or operation interfaces may also be displayed on the operation interface 200, such as menus for file, editing, settings, and help; operation interfaces for transcription mode, annotation mode, and full-text mode; and a playback interface for the display interface.

In some embodiments, displaying the segment data of the audio data on the operation interface includes: displaying the segment waveform information of the segment data of the audio data on the operation interface, and the timeline information corresponding to the segment waveform information.

In some embodiments, the method further includes: in response to the hide waveform instruction, hiding the segment waveform information and the timeline information on the operation interface.

For example, based on instructions input by the user, the segment waveform information and timeline information can be displayed or hidden in a flexible display manner.

Step 150: In response to the editing operation on the boundary axis control, perform boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data.

For example, a segment dragging operation can be realized by dragging the boundary axis control to adjust segment boundaries or merge segments. That is, segments can be merged quickly by dragging the boundary axis control corresponding to a segment displayed on the operation interface, and boundary fine-tuning can be achieved by dragging the left or right end of that boundary axis control horizontally. For example, the segment waveform can also be displayed on the operation interface, and the left or right end of the boundary axis control corresponding to the displayed segment waveform can be dragged horizontally to fine-tune the boundary.

For example, the boundary axis control can be right-clicked to record the information of the currently active segment, and the current list of all segment information is cached. In response to a drag operation triggered by pressing and holding the left mouse button, the boundary axis control is dragged, and it is determined whether the active segment exists; if so, the left and right boundary points of the temporary active segment are updated. When the left button is released, it is determined whether this is the final drag operation; if so, the segment at the current mouse position is obtained. It is then determined whether the merge conditions are met: if so, the segments are merged; if not, the boundary information of the active segment is updated.

For example, when judging whether the merge condition is met, it is mainly judged whether the final boundary point of the active segment exceeds the adjacent boundary of the merged segment. For example, when merging segments to the right, the right boundary of the active segment must exceed the left boundary of the merged segment before merging, and the two segments must be different. When merging segments to the left, the left boundary of the active segment must exceed the right boundary of the merged segment before merging, and the two segments must be different.

For example, the logic for obtaining the end segment is as follows: the entire list of segments is traversed in order, and the left and right boundaries of each segment are compared with the horizontal position of the mouse end point. When merging to the left, the first segment whose right boundary is greater than the end position of the mouse is the end segment; when merging to the right, once a segment whose left boundary is greater than the end position of the mouse is found, the segment before it is the end segment.

For example, taking merging to the right as an example, when judging whether the merge conditions are met, it is first checked whether the end segment exists. If the end segment does not exist, the segments cannot be merged and the merge conditions are not met. If the end segment exists, it is determined whether the end segment and the active segment are the same segment; if they are, they cannot be merged and the merge conditions are not met. If they are not the same segment, it is determined whether the current right boundary of the active segment is greater than the left boundary of the end segment: if it is greater, the segments can be merged and the merge conditions are met; otherwise, they cannot be merged and the merge conditions are not met.
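The end-segment lookup and the rightward merge check can be sketched as below. The `Segment` class and both function names are hypothetical, and two details are assumptions rather than statements from the application: segments are kept sorted by position, and when merging right with the mouse beyond every left boundary, the last segment is treated as the end segment.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    left: float
    right: float

def find_end_segment(segments: List[Segment], mouse_x: float,
                     merging_right: bool) -> Optional[Segment]:
    """Traverse the list in order and return the segment under the mouse end point."""
    for i, seg in enumerate(segments):
        if merging_right:
            if seg.left > mouse_x:
                # The segment before the first one that starts past the mouse.
                return segments[i - 1] if i > 0 else None
        elif seg.right > mouse_x:
            return seg
    # Assumption: mouse is to the right of every left boundary.
    return segments[-1] if (merging_right and segments) else None

def can_merge_right(active: Segment, end_seg: Optional[Segment],
                    dragged_right: float) -> bool:
    """Apply the three checks: end segment exists, differs, and is overlapped."""
    if end_seg is None:
        return False
    if end_seg == active:  # same left and right boundaries: same segment
        return False
    return dragged_right > end_seg.left

segs = [Segment(0, 2), Segment(3, 5), Segment(6, 8)]
active = segs[0]
end_seg = find_end_segment(segs, mouse_x=4.0, merging_right=True)
print(end_seg, can_merge_right(active, end_seg, dragged_right=4.0))
```

Dataclass equality compares `left` and `right`, which corresponds to judging two segments identical when both boundaries coincide.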

For example, the view change diagram of the operation interface 300 shown in FIG. 3 illustrates the adjustment of segment boundaries. The user hovers the mouse over the first boundary axis control 3031 of the segment to be adjusted, and the terminal determines the active segment 3011 to be adjusted by detecting the hovering position of the mouse. The user can then press and hold the left mouse button to drag the boundary label at one end of the first boundary axis control 3031. After dragging to the desired position, the user releases the left mouse button to complete the drag operation, and the boundary of the active segment 3011 is updated to the new position. The editing operation for the first boundary axis control 3031 may be a drag operation, a click operation, etc. Taking the drag operation as an example, the boundary label at the end of the first boundary axis control 3031 that is not dragged is defined as the stationary end, which is located at position A; the boundary label that is dragged is defined as the active end, which is at position B before being dragged. Diagram 3-1 in FIG. 3 shows the view before dragging, and diagram 3-2 in FIG. 3 shows the view after dragging, with the boundary position of the first boundary axis control 3031 updated. In response to the first editing operation for the active end of the first boundary axis control 3031 of the active segment 3011, the active end of the first boundary axis control 3031 is controlled to move from position B to position C to adjust the boundary. If the boundary position of the dragged active segment is not within the boundary range of any other segment, the boundary label at the active end of the active segment 3011 is updated to position C, that is, the boundary of the active segment 3011 is adjusted from segment AB to segment AC.

For example, the view change diagram of the operation interface 400 shown in FIG. 4 illustrates the segment merging operation. The user hovers the mouse over the first boundary axis control 4031 of the segment to be adjusted, and the terminal determines the active segment 4011 to be adjusted by detecting the hovering position of the mouse. The user can then press and hold the left mouse button to drag the boundary label at one end of the first boundary axis control 4031. After dragging to the desired position, the user releases the left mouse button to complete the drag operation, and the boundary of the active segment 4011 is updated to the new position. The editing operation for the first boundary axis control 4031 may be a drag operation, a click operation, etc. Taking the drag operation as an example, the boundary label at the end of the first boundary axis control 4031 that is not dragged is defined as the stationary end, which is located at position D; the boundary label that is dragged is defined as the active end, which is at position E before being dragged. Diagram 4-1 in FIG. 4 shows the view before dragging, diagram 4-2 shows the change of the boundary position of the first boundary axis control 4031 during dragging, and diagram 4-3 shows the segments merged after dragging. In response to the first editing operation for the active end of the first boundary axis control 4031 of the active segment 4011, the active end of the first boundary axis control 4031 is controlled to move from position E, across position A, to position F. For example, when a boundary label is dragged into another segment, that boundary label can be displayed with an icon different from those of the other boundary labels. For example, if the active end of the first boundary axis control 4031 is moved from position E across position A to position F, the active end has been dragged into another segment. At this time, the icon of the active end at position F can take the shape of a small light-blue candle, while the other boundary labels are displayed as red right-angle icons. The user can merge the segments by releasing the mouse. If the boundary of the dragged active segment 4011 exceeds the adjacent boundary of another segment, all segments within the range that overlap the boundary of the dragged active segment can be merged. For example, if the boundary of the active segment 4011 exceeds the left boundary (position A) of the other segment 4012, the active segment 4011 and the other segment 4012 can be merged to obtain the merged segment 4013, and the boundary of the boundary axis control 4033 of the merged segment 4013 is segment DC.

In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to an editing operation on the boundary axis control, to obtain processed segment data, includes: in response to a first editing operation on the active end of the first boundary axis control of the active segment in the segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether there is, at the first position, a second boundary axis control that overlaps the active end of the first boundary axis control, where the second boundary axis control is the boundary axis control corresponding to a second segment, and the active segment and the second segment are adjacent segments; and if there is such a second boundary axis control at the first position, merging the active segment and the second segment.

In some embodiments, when the backend program processes the audio data, in order to avoid merging a segment with itself, it is necessary to determine before merging whether the active segment and the second segment are the same segment. Specifically, it can be determined whether the left boundaries of the two segments are the same and whether their right boundaries are the same. If both the left boundaries and the right boundaries are the same, the active segment and the second segment are judged to be the same segment. If the left boundaries differ and/or the right boundaries differ, the active segment and the second segment are judged not to be the same segment. The active segment is thereby accurately distinguished from the second segment, and the two are then merged.

In some embodiments, after determining whether there is a second boundary axis control that overlaps the active end of the first boundary axis control at the first position, the method further includes: if there is no second boundary axis control overlapping the active end of the first boundary axis control at the first position, adjusting the boundary of the active segment according to the first position.

For example, two adjacent segments can be merged quickly by dragging the segments displayed on the operation interface. The basic merging function operates on two adjacent segments. On this basis, if multiple segments need to be merged at the same time, they are merged one by one in sequence, so that any number of segments can be merged. For example, two adjacent segments can be merged into a new segment by dragging the boundary label of one so that it touches the boundary label of the other; likewise, multiple segments can be merged by dragging the boundary label of one segment across, and into contact with, the boundary labels of other segments.

Please refer to Figures 3 and 4. Figure 3 shows a schematic diagram of boundary adjustment processing on segment data, and Figure 4 shows a schematic diagram of segment merging processing on segment data.

As shown in FIG. 3, in response to the first editing operation for the active end of the first boundary axis control 3031 of the active segment 3011 in the segment data, the active end of the first boundary axis control 3031 is controlled to move from position B to the first position, which is position C in FIG. 3. If there is no second boundary axis control overlapping the active end of the first boundary axis control 3031 at the first position (position C), the boundary of the active segment 3011 is adjusted according to the first position (position C), that is, the boundary of the active segment 3011 is adjusted from segment AB to segment AC.

As shown in FIG. 4, in response to the first editing operation for the active end of the first boundary axis control 4031 of the active segment 4011 in the segment data, the active end of the first boundary axis control 4031 is controlled to move to the first position, which is position F in FIG. 4. If there is a second boundary axis control 4032 that overlaps the active end of the first boundary axis control 4031 at the first position (position F), the active segment 4011 and the second segment 4012 are merged to obtain the merged segment 4013. The boundary of the boundary axis control 4033 of the merged segment 4013 is segment DC.

In some embodiments, performing boundary adjustment processing or segment merging processing on the segment data in response to the editing operation for the boundary axis control, to obtain processed segment data, includes: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, controlling the active end of the first boundary axis control to move to a second position; determining whether there is, at the second position, a third boundary axis control that overlaps the active end of the first boundary axis control, where the third boundary axis control is the boundary axis control corresponding to a third segment, and the active segment and the third segment are non-adjacent segments; and if there is such a third boundary axis control at the second position, merging the active segment, the third segment, and the intermediate segments between the active segment and the third segment.

In some embodiments, after determining whether there is a third boundary axis control that overlaps the active end of the first boundary axis control at the second position, the method further includes: if there is no third boundary axis control overlapping the active end of the first boundary axis control at the second position, determining whether the target area between the stationary end position of the first boundary axis control and the second position overlaps any intermediate segment; if the target area does not overlap any intermediate segment, adjusting the boundary of the active segment according to the second position; or, if the target area overlaps at least one intermediate segment, merging the active segment and all intermediate segments that overlap the target area.

For example, multiple segments can be merged quickly by dragging the boundary axis control of a segment displayed on the operation interface. Specifically, after the first boundary axis control of the active segment is dragged, the active segment and all intermediate segments that overlap the target area can be merged, so that multiple segments are merged at the same time. The target area is the area between the stationary end position of the first boundary axis control and the second position. That is, when the dragged boundary position falls within the range of other segments, all segments within that range can be merged.
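The adjust-or-merge decision can be illustrated with the following sketch, which treats the target area as the interval between the stationary end and the drop position. It assumes segments are sorted, non-overlapping `(left, right)` tuples; the function name and the `drag_right_end` flag are hypothetical.

```python
def apply_drag(segments, active_idx, new_pos, drag_right_end=True):
    """Return a new segment list after dragging one end of the active segment.

    If the target area overlaps other segments, they are merged with the
    active segment; otherwise only the active segment's boundary moves.
    """
    left, right = segments[active_idx]
    stationary = left if drag_right_end else right
    lo, hi = min(stationary, new_pos), max(stationary, new_pos)
    # Other segments that overlap the target area (lo, hi).
    hit = [i for i, (l, r) in enumerate(segments)
           if i != active_idx and l < hi and r > lo]
    if not hit:
        result = list(segments)
        result[active_idx] = (lo, hi)  # plain boundary adjustment
        return result
    group = hit + [active_idx]
    merged = (min([lo] + [segments[i][0] for i in hit]),
              max([hi] + [segments[i][1] for i in hit]))
    kept = [s for i, s in enumerate(segments) if i not in group]
    kept.insert(min(group), merged)  # merged segment takes the first slot
    return kept

segs = [(0, 2), (3, 5), (6, 8), (9, 11)]
print(apply_drag(segs, 0, 2.5))  # boundary adjustment only
print(apply_drag(segs, 0, 7))    # merges the first three segments
```

Dropping the boundary inside a non-adjacent segment merges that segment and every intermediate one in a single step, which is the multi-segment case described above.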

In some embodiments, the method further includes: in response to an insert-breakpoint operation for a target segment in the segment data, inserting a breakpoint in the boundary axis control of the target segment, so that the target segment is split into sub-segments at the breakpoint.

For example, you can insert breakpoints to perform segmented processing on the target segment, increasing the flexibility of segment adjustment.
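A breakpoint split can be sketched as below, assuming segments are `(left, right)` tuples and that a breakpoint must lie strictly inside the target segment; the function name is hypothetical.

```python
def insert_breakpoint(segments, index, position):
    """Split segments[index] at position and return a new segment list."""
    left, right = segments[index]
    if not (left < position < right):
        raise ValueError("breakpoint must lie strictly inside the segment")
    return (segments[:index]
            + [(left, position), (position, right)]
            + segments[index + 1:])

print(insert_breakpoint([(0, 4), (5, 9)], 1, 7))
```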

Step 160: Perform speech recognition processing on the processed segment data to obtain transcribed text.

For example, automatic transcription can be implemented by calling the speech recognition module configured on the terminal or a third-party speech recognition module to perform speech recognition processing on the processed segment data to obtain the transcribed text.

In some embodiments, the transcribed text includes a text fragment corresponding to each segment in the segment data. After performing speech recognition processing on the processed segment data to obtain the transcribed text, the method further includes: in response to a modification instruction for a target text fragment in the transcribed text, modifying the target text fragment to obtain modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.

For example, after the initial transcribed text is automatically transcribed, the user can enter modification instructions for the target text fragment in the transcribed text through the operation interface to manually update the transcribed text. The modification instructions may include instructions such as modifying words, deleting words, adding words, modifying fonts, modifying font size, modifying font color, etc.

In some embodiments, the method further includes: responding to an annotation instruction for the target text segment, annotating the target text segment to obtain annotated transcribed text.

For example, you can input annotation instructions for the target text fragment through the operation interface, annotate the target text fragment, and obtain the annotated transcribed text. For example, the target text fragment can be annotated in any of the following ways: industry field annotation, content category annotation, part-of-speech annotation, dependency annotation, entity annotation, relationship annotation, event annotation, reading comprehension annotation and question and answer annotation.

Step 170: Update the project engineering file according to the transcribed text to obtain an updated project engineering file. The updated project engineering file carries the transcribed text.

For example, save the transcribed text and the path of the media file together in a fixed-format (.Baf) project file to update the project file. The updated project engineering files carry the transcribed text.

For example, when the project file is updated, the waveform of the audio data can be initialized, an array of segment waveform information constructed, and the display of the segment waveform information updated; the media file information and segment data are saved to the project file; a media file change message is issued; the player switches to the new media file; the software updates the title information; and the controller updates the interface and related control information.

During initialization, the memory data used for display can be initialized using the audio results parsed by the audio and video data parsing thread and the segmented information obtained by segmentation processing, and then set default values for some parameters that need to be used.

Step 180: When the updated project engineering file is played on the display interface, text fragments in the media file and the transcribed text corresponding to the playback progress of the media file are displayed.

For example, when the updated project file is played on the display interface, the text fragments in the media file and the transcribed text corresponding to the playback progress of the media file are displayed. You can also control the playback progress through the playback controls on the display interface.
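Looking up the text fragment for the current playback position can be sketched with a binary search. This assumes fragments are stored as `(start, end, text)` tuples sorted by start time with an exclusive end; the representation and function name are hypothetical and not taken from the application.

```python
import bisect

def fragment_at(fragments, t):
    """Return the text whose time span contains playback time t, else None."""
    starts = [f[0] for f in fragments]
    i = bisect.bisect_right(starts, t) - 1  # last fragment starting at or before t
    if i >= 0 and fragments[i][0] <= t < fragments[i][1]:
        return fragments[i][2]
    return None

frags = [(0.0, 2.0, "hello"), (3.0, 5.0, "world")]
print(fragment_at(frags, 1.0))  # inside the first fragment
print(fragment_at(frags, 2.5))  # in the gap between fragments
```

In a real player, the list of start times would be built once rather than on every lookup.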

For example, the embodiment of this application also provides a multi-format import and export function, which supports the import of Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json format files, and supports export of the above file types as well as the eaf format. This facilitates the migration of transcribed files through multi-format file import and export. For this function, corresponding interface functions for reading and writing files can be provided for the different file types and file reading and writing methods, so that files of different types can be read when importing and written when exporting. For example, Excel, srt, and other format files can be imported together with their corresponding media files, data files can be converted to the Baf format, and multiple file formats can optionally be exported at one time.

For example, the corresponding relationship between the file types of the import format and the import interface can be shown in Table 1:

Table 1

File type	Import interface
Xls, Xlsx	DoImportFile_Excel
Lrc	DoImportFile_Lrc
Srt	DoImportFile_Srt
Docx	DoImportFile_Docx
Json	DoImportFile_Json
Aud	DoImportFile_Aud
Txt	DoImportFile_Txt

For example, the corresponding relationship between the file type of the export format and the export interface can be shown in Table 2:

Table 2

File type	Export interface
Xls, Xlsx	ExportFile_Excel
Lrc	DoExportFile_LRC
Srt	DoExportFile_SRT
Aud	DoExportFile_Audacity
STL	DoExportFile_STL
Docx, Txt	DoExportFile_Txt
EAF	IBAF::SaveTo

In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting an export file corresponding to the target file type from the project engineering file, where the target file type is any one of the preset file types.

For example, FIG. 5 is a schematic diagram of a file export application scenario. On the file export interface shown in 5-1 of FIG. 5, the target file type to be exported and other options can be set; for instance, the target file type is set to Excel and the export language is set to Mandarin. After the export command is executed, the file is exported according to these settings. For example, the exported Excel file has the content shown in 5-2 of FIG. 5.

For example, FIG. 6 is a schematic diagram of another file export application scenario. On the file export interface shown in 6-1 of FIG. 6, the target file type to be exported can be set; for instance, the target file type can be set to Excel, Word, and EAF at the same time, and the export language can be set to dialect. After the export command is executed, the files are exported according to these settings. When the target file type is set to multiple file formats at the same time, multiple file formats can be exported at one time. The exported Excel file has the content shown in 6-2 of FIG. 6.

For example, the preset file types may include: Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json format files, etc. Export of the above file types and of eaf format files can be supported, which facilitates the migration of transcribed files through multi-format file export.

In some embodiments, the method further includes: in response to the import instruction, obtaining the import file; when the file type of the import file belongs to any of the preset file types, importing the import file into the project engineering file.

For example, as shown in the schematic diagram of the file import interface in FIG. 7, the user can choose on the file import interface to import files only, or to import files together with media files. When the file type of the import file belongs to any of the preset file types, the import file is imported into the project engineering file.

For example, the preset file types may include: Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json format files, etc. Import of the above file types can be supported, which facilitates the migration of transcribed files through multi-format file import.

All the above technical solutions can be combined in any way to form optional embodiments of the present application, and will not be described again one by one.

The embodiment of this application obtains the project engineering file corresponding to the media file to be processed; obtains the audio data of the media file according to the directory of the project engineering file; performs segmentation processing on the audio data according to the amplitude of the audio data to obtain the segment data of the audio data; displays the segment data of the audio data on the operation interface, which is used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performs boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; performs speech recognition processing on the processed segment data to obtain transcribed text; updates the project engineering file according to the transcribed text to obtain an updated project engineering file that carries the transcribed text; and, when the updated project engineering file is played on the display interface, displays the text fragments in the media file and the transcribed text that correspond to the playback progress of the media file. Embodiments of the present application can thus provide a simple and convenient speech transcription method: transcription of multiple kinds of speech can be realized by self-building multiple language templates; segments can be merged rapidly by dragging the boundary axis control corresponding to a segment displayed on the operation interface; and boundaries can be fine-tuned by dragging horizontally on the boundary axis control corresponding to the segment waveform displayed on the operation interface. This improves the efficiency of speech transcription and annotation and adapts to the usage needs of the various scenarios mentioned above.

In order to facilitate better implementation of the multi-modal fast transcription and annotation method based on the self-built template of the embodiment of the present application, the embodiment of the present application also provides a multi-modal fast transcription and annotation system based on the self-built template. Please refer to FIG. 8 , which is a schematic structural diagram of a multi-modal fast transcription and annotation system based on a self-built template provided by an embodiment of the present application. Among them, the multi-modal fast transcription and annotation system 800 based on the self-built template is applied to a terminal device that provides a graphical user interface. The multi-modal fast transcription and annotation system 800 based on the self-built template may include:

The first obtaining unit 801 is used to obtain the project engineering file corresponding to the media file to be processed;

The second acquisition unit 802 is used to acquire the audio data of the media file according to the directory of the project engineering file;

The segmentation unit 803 is used to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;

The display unit 804 is used to display the segment data of the audio data on the operation interface, and the operation interface is used to provide a display interface and boundary axis control;

The processing unit 805 is configured to perform boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, and obtain processed segment data;

The transcription unit 806 is used to perform speech recognition processing on the processed segment data to obtain transcribed text;

The update unit 807 is used to update the project engineering file according to the transcribed text to obtain an updated project engineering file, and the updated project engineering file carries the transcribed text;

The playback unit 808 is configured to display, when the updated project engineering file is played on the display interface, the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.
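As a minimal illustrative sketch, and not as a limitation of the embodiments, the lookup performed by the playback unit 808 can be thought of as finding the text fragment whose time interval contains the current playback position. The names `TextFragment` and `fragment_at`, the use of seconds as the time unit, and the half-open interval convention are assumptions introduced only for this example; the fragments are assumed to be sorted by start time and not to overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextFragment:
    start: float  # start time in seconds (assumed unit)
    end: float    # end time in seconds, exclusive
    text: str


def fragment_at(fragments: List[TextFragment], position: float) -> Optional[TextFragment]:
    """Return the fragment whose [start, end) interval contains position, if any.

    Assumes fragments are sorted by start time and do not overlap.
    """
    starts = [f.start for f in fragments]
    # Index of the last fragment that starts at or before the position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < fragments[i].end:
        return fragments[i]
    return None  # the position falls in a gap between fragments
```

A player would call `fragment_at` on each progress update and show the returned text next to the media; a `None` result corresponds to a stretch with no transcribed segment.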

In some embodiments, the processing unit 805 may be configured to: in response to a first editing operation on the active end of the first boundary axis control of the active segment in the segment data, control the active end of the first boundary axis control to move to a first position; determine whether there is a second boundary axis control overlapping the active end of the first boundary axis control at the first position, where the second boundary axis control is the boundary axis control corresponding to a second segment, and the active segment and the second segment are adjacent segments; and if there is a second boundary axis control overlapping the active end of the first boundary axis control at the first position, merge the active segment and the second segment.

In some embodiments, after determining whether there is a second boundary axis control that overlaps the active end of the first boundary axis control at the first position, the processing unit 805 may also be configured to: if there is no second boundary axis control overlapping the active end of the first boundary axis control at the first position, adjust the boundary of the active segment according to the first position.
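The merge-or-adjust decision for an adjacent segment described above can be sketched as follows. This is only an illustration under stated assumptions: segments are modelled as sorted, non-overlapping time intervals; only the right-hand end of the active segment is dragged; and a boundary axis control is assumed to span its segment's interval, so "overlap" is taken to mean that the drop position falls inside the neighbouring interval. The names `Segment` and `drag_right_edge` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float
    end: float


def drag_right_edge(segments: List[Segment], active: int, position: float) -> List[Segment]:
    """Drag the right edge of segments[active] to position.

    Merges with the next (adjacent) segment if the position lands on it,
    otherwise adjusts the boundary. Returns a new list; the input is not modified.
    """
    result = [Segment(s.start, s.end) for s in segments]
    seg = result[active]
    if position <= seg.start:
        raise ValueError("position must lie after the stationary end")
    nxt = result[active + 1] if active + 1 < len(result) else None
    if nxt is not None and position > nxt.end:
        # Dropping beyond the adjacent segment is outside this sketch.
        raise ValueError("position lies beyond the adjacent segment")
    if nxt is not None and nxt.start <= position <= nxt.end:
        # Overlap with the adjacent segment's control: merge the two segments.
        seg.end = nxt.end
        del result[active + 1]
    else:
        # No overlap: the drop position becomes the new boundary.
        seg.end = position
    return result
```

Returning a new list rather than editing in place keeps the previous segmentation available, for example for an undo step; that choice is a convenience of the sketch, not something the embodiments prescribe.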

In some embodiments, the processing unit 805 may be configured to: in response to a second editing operation on the active end of the first boundary axis control of the active segment in the segment data, control the active end of the first boundary axis control to move to a second position; determine whether there is a third boundary axis control overlapping the active end of the first boundary axis control at the second position, where the third boundary axis control is the boundary axis control corresponding to a third segment, and the active segment and the third segment are non-adjacent segments; and if there is a third boundary axis control overlapping the active end of the first boundary axis control at the second position, merge the active segment, the third segment, and the intermediate segments between the active segment and the third segment.

In some embodiments, after determining whether there is a third boundary axis control that overlaps the active end of the first boundary axis control at the second position, the processing unit 805 may also be configured to: if there is no third boundary axis control overlapping the active end of the first boundary axis control at the second position, determine whether the target area between the stationary end position of the first boundary axis control and the second position overlaps with any intermediate segment; if the target area between the stationary end position of the first boundary axis control and the second position does not overlap with any intermediate segment, adjust the boundary of the active segment according to the second position; or if the target area between the stationary end position of the first boundary axis control and the second position overlaps with at least one intermediate segment, merge the active segment with all the intermediate segments that overlap the target area.
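The three outcomes described above for a drag that may pass over other segments can be sketched as follows. The sketch rests on several assumptions that are not stated in the embodiments: segments are sorted, non-overlapping intervals; only the right-hand end is dragged, so the stationary end is the segment start; every segment after the active one is treated as a candidate; and, when the drop position lies in a gap after swallowed segments, the merged segment is taken to end at the drop position. The names `Segment` and `drag_right_edge_far` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float
    end: float


def drag_right_edge_far(segments: List[Segment], active: int, position: float) -> List[Segment]:
    """Drag the right edge of segments[active] to position, possibly across other segments."""
    segs = [Segment(s.start, s.end) for s in segments]
    seg = segs[active]
    if position <= seg.start:
        raise ValueError("position must lie after the stationary end")
    later = range(active + 1, len(segs))

    # Case 1: the drop position lands on another segment's control.
    hit = next((j for j in later if segs[j].start <= position <= segs[j].end), None)
    if hit is not None:
        seg.end = segs[hit].end
        del segs[active + 1:hit + 1]  # removes the hit segment and everything in between
        return segs

    # Target area: from the stationary end (seg.start) to the drop position.
    covered = [j for j in later if segs[j].start < position]
    if not covered:
        # Case 2: nothing inside the target area, so only adjust the boundary.
        seg.end = position
        return segs

    # Case 3: merge with every segment inside the target area.
    seg.end = max(position, segs[covered[-1]].end)  # assumed: the boundary ends at the drop position
    del segs[active + 1:covered[-1] + 1]
    return segs
```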

In some embodiments, the segmentation unit 803 may be used to segment the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain segment data of the audio data.

In some embodiments, when segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data, the segmentation unit 803 may be used to: obtain initial segment data of the audio data; determine whether the average amplitude of the current segment in the initial segment data is greater than the noise amplitude threshold; if the average amplitude of the current segment in the initial segment data is greater than the noise amplitude threshold, mark the current segment as a sound segment; trim the audio points of the current segment marked as a sound segment at the segment start point and the segment end point to remove silence or noise in the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merge the trimmed current segment with the previous segment; if the start position of the trimmed current segment is different from the end position of the previous segment, mark the trimmed current segment as a new segment; and traverse the initial segment data of the audio data in this way to obtain the segment data of the audio data.
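The traversal described above can be sketched as follows. The sketch makes assumptions the embodiments leave open: the initial segment data is produced by cutting the audio into fixed-length windows (the embodiments instead mention a preset language template), the amplitude of a sample is its absolute value, and a sample counts as silence or noise when its amplitude does not exceed the threshold. The function name `segment_by_amplitude` and its parameters are hypothetical.

```python
from typing import List, Tuple


def segment_by_amplitude(samples: List[float], window: int,
                         noise_threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample-index pairs, end exclusive, for the sound segments."""
    segments: List[Tuple[int, int]] = []
    for w_start in range(0, len(samples), window):
        w_end = min(w_start + window, len(samples))
        chunk = samples[w_start:w_end]
        mean_amp = sum(abs(x) for x in chunk) / len(chunk)
        if mean_amp <= noise_threshold:
            continue  # not a sound segment
        # Trim silence or noise from both ends of the sound segment.
        start, end = w_start, w_end
        while start < end and abs(samples[start]) <= noise_threshold:
            start += 1
        while end > start and abs(samples[end - 1]) <= noise_threshold:
            end -= 1
        if segments and segments[-1][1] == start:
            # Contiguous with the previous segment: merge the two.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

For example, with a window of 4 samples and a threshold of 0.1, two consecutive loud windows whose trimmed ranges touch are returned as a single segment, while a loud window that follows a silent one starts a new segment.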

In some embodiments, when acquiring the initial segment data of the audio data, the segmentation unit 803 may be used to: perform initial segmentation processing on the audio data according to the preset language template, and obtain the initial segment data of the audio data.

In some embodiments, the first obtaining unit 801 can be used to: obtain the media file to be processed; detect whether a corresponding project engineering file has been created for the media file; if it is detected that no corresponding project engineering file has been created for the media file, create a project engineering file corresponding to the media file based on a template file; or if it is detected that a corresponding project engineering file has already been created for the media file, obtain the created project engineering file corresponding to the media file.
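The detect-then-create-or-load behaviour described above can be sketched as follows. The on-disk layout is an assumption made only for illustration: the project engineering file is taken to be a JSON file stored next to the media file under the hypothetical name `<stem>.project.json`, and the template is represented as a plain dictionary. The function name `get_or_create_project` is likewise hypothetical.

```python
import json
from pathlib import Path


def get_or_create_project(media_path: Path, template: dict) -> dict:
    """Load the project file for media_path, creating it from template if absent."""
    # Hypothetical naming convention: clip.wav -> clip.project.json in the same directory.
    project_path = media_path.with_name(media_path.stem + ".project.json")
    if project_path.exists():
        # A project file was already created for this media file: reuse it.
        return json.loads(project_path.read_text(encoding="utf-8"))
    # No project file yet: start from the template and record the media file name.
    project = dict(template)
    project["media"] = media_path.name
    project_path.write_text(json.dumps(project, ensure_ascii=False, indent=2),
                            encoding="utf-8")
    return project
```

Keeping the project file in the same directory as the media file is one way to let later steps locate the audio from the directory of the project file.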

In some embodiments, the processing unit 805 may also be configured to: in response to an export instruction carrying a target file type, export an export file corresponding to the target file type from the project engineering file, where the target file type is any one of the preset file types.

In some embodiments, the processing unit 805 may also be used to: in response to an import instruction, obtain an import file; and when the file type of the import file is any one of the preset file types, import the import file into the project engineering file.
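The export and import checks against the preset file types can be sketched as follows. The embodiments do not enumerate the preset file types in this passage, so the set `PRESET_TYPES` (JSON and tab-separated text), the project layout with a `segments` list, and the function names `export_project` and `import_into_project` are all hypothetical and serve only to illustrate the type check.

```python
import json
from typing import Dict, List

PRESET_TYPES = {"json", "tsv"}  # hypothetical preset file types


def export_project(project: Dict, file_type: str) -> str:
    """Serialise the project's segments in the requested preset file type."""
    if file_type not in PRESET_TYPES:
        raise ValueError(f"unsupported export type: {file_type}")
    segments: List[Dict] = project["segments"]
    if file_type == "json":
        return json.dumps(segments, ensure_ascii=False)
    # tsv: one segment per line as start<TAB>end<TAB>text
    return "\n".join(f"{s['start']}\t{s['end']}\t{s['text']}" for s in segments)


def import_into_project(project: Dict, content: str, file_type: str) -> Dict:
    """Return a copy of project whose segments are read from content."""
    if file_type not in PRESET_TYPES:
        raise ValueError(f"unsupported import type: {file_type}")
    if file_type == "json":
        segments = json.loads(content)
    else:
        segments = []
        for line in content.splitlines():
            start, end, text = line.split("\t", 2)
            segments.append({"start": float(start), "end": float(end), "text": text})
    return {**project, "segments": segments}
```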

In some embodiments, the display unit 804 may be configured to display the segment waveform information of the segment data of the audio data and the timeline information corresponding to the segment waveform information on the operation interface.

In some embodiments, the display unit 804 may also be configured to hide the segment waveform information and the timeline information on the operation interface in response to the hide waveform instruction.

In some embodiments, the processing unit 805 may also be configured to: in response to an insert-breakpoint operation on a target segment in the segment data, insert a breakpoint in the boundary axis control of the target segment, so as to segment the target segment based on the breakpoint.
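Splitting a target segment at an inserted breakpoint can be sketched as follows, assuming that segments are modelled as (start, end) time pairs and that the breakpoint must lie strictly inside the target segment. The name `insert_breakpoint` and the parameter `cut` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end)


def insert_breakpoint(segments: List[Segment], index: int, cut: float) -> List[Segment]:
    """Split segments[index] into two segments at the time cut; returns a new list."""
    start, end = segments[index]
    if not start < cut < end:
        raise ValueError("breakpoint must lie strictly inside the target segment")
    return segments[:index] + [(start, cut), (cut, end)] + segments[index + 1:]
```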

In some embodiments, the transcribed text includes text fragments corresponding to each segment in the segment data. After performing speech recognition processing on the processed segment data to obtain the transcribed text, the transcription unit 806 may also be used to: in response to a modification instruction for a target text fragment in the transcribed text, modify the target text fragment to obtain a modified transcribed text, where the target text fragment is at least one text fragment in the transcribed text.

In some embodiments, the transcription unit 806 may also be configured to: in response to an annotation instruction for the target text fragment, annotate the target text fragment to obtain an annotated transcribed text.

It should be understood that the system embodiments and the method embodiments may correspond to each other, and for similar descriptions reference may be made to the method embodiments; to avoid repetition, they are not repeated here. Specifically, the system shown in Figure 8 can execute the above embodiments of the multi-modal fast transcription and annotation method based on the self-built template, and the aforementioned and other operations and/or functions of each unit in the system respectively implement the corresponding processes of the above method embodiments, which, for the sake of brevity, are not repeated here.

Correspondingly, embodiments of the present application also provide a terminal device. The terminal device may be a terminal or a server. The terminal may be a device such as a smartphone, a tablet, a laptop, a smart TV, a smart speaker, a wearable smart device, or a personal computer. FIG. 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 9, the terminal device 900 includes a processor 901 with one or more processing cores, a memory 902 with one or more computer-readable storage media, and a computer program stored on the memory 902 and capable of running on the processor. The processor 901 is electrically connected to the memory 902. Those skilled in the art can understand that the structure of the terminal device shown in the figure does not constitute a limitation on the terminal device, which may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

The processor 901 is the control center of the terminal device 900. It uses various interfaces and lines to connect the parts of the entire terminal device 900, and performs the various functions of the terminal device 900 and processes data by running or loading the software programs and/or modules stored in the memory 902 and calling the data stored in the memory 902, thereby monitoring the terminal device 900 as a whole.

In this embodiment of the present application, the processor 901 in the terminal device 900 loads the instructions corresponding to the processes of one or more application programs into the memory 902 according to the following steps, and the processor 901 runs the application programs stored in the memory 902 to implement various functions:

Obtain the project engineering file corresponding to the media file to be processed; obtain the audio data of the media file according to the directory of the project engineering file; perform segmentation processing on the audio data according to the amplitude of the audio data to obtain the segment data of the audio data; display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, perform boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data; perform speech recognition processing on the processed segment data to obtain a transcribed text; update the project engineering file according to the transcribed text to obtain an updated project engineering file, which carries the transcribed text; and, when the updated project engineering file is played on the display interface, display the media file and the text fragment in the transcribed text that corresponds to the playback progress of the media file.

For the specific implementation of each of the above operations, please refer to the previous embodiments; details are not described again here.

In some embodiments, as shown in FIG. 9, the terminal device 900 further includes: a display unit 903, a radio frequency circuit 904, an audio circuit 905, an input unit 906, and a power supply 907. The processor 901 is electrically connected to the display unit 903, the radio frequency circuit 904, the audio circuit 905, the input unit 906 and the power supply 907 respectively. Those skilled in the art can understand that the structure of the terminal device shown in FIG. 9 does not constitute a limitation on the terminal device, which may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

The display unit 903 may be used to display information input by the user or information provided to the user as well as various graphical user interfaces of the terminal device. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof. The display unit 903 may include a display panel and a touch panel.

The radio frequency circuit 904 can be used to send and receive radio frequency signals, so as to establish wireless communication with network equipment or other terminal equipment and to exchange signals with the network equipment or other terminal equipment.

The audio circuit 905 can be used to provide an audio interface between the user and the terminal device through speakers and microphones.

The input unit 906 can be used to receive input numbers, character information or user characteristic information (such as fingerprints, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.

The power supply 907 is used to power various components of the terminal device 900. In some embodiments, the power supply 907 can be logically connected to the processor 901 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system. The power supply 907 may also include any components such as one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, and power status indicators.

Although not shown in FIG. 9, the terminal device 900 may also include a camera, a sensor, a wireless fidelity module, a Bluetooth module, etc., which will not be described again here.

Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by controlling relevant hardware through instructions. The instructions can be stored in a computer-readable storage medium, and loaded and executed by the processor.

To this end, embodiments of the present application provide a computer-readable storage medium in which multiple computer programs are stored. The computer programs can be loaded by the processor to execute the steps in any of the multi-modal fast transcription and annotation methods based on self-built templates provided by the embodiments of the present application. For the specific implementation of each of the above operations, please refer to the previous embodiments; details are not described again here.

The storage medium may include: read-only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.

Since the computer program stored in the storage medium can execute the steps in any of the multi-modal fast transcription and annotation methods based on self-built templates provided by the embodiments of the present application, it can achieve the beneficial effects achievable by any of those methods. For details, see the previous embodiments; they are not described again here.

Embodiments of the present application also provide a computer program product. The computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the computer device to perform the corresponding process in any of the multi-modal fast transcription and annotation methods based on the self-built template in the embodiments of the present application. For the sake of brevity, details are not repeated here.

An embodiment of the present application also provides a computer program. The computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the computer device to perform the corresponding process in any of the multi-modal fast transcription and annotation methods based on the self-built template in the embodiments of the present application. For the sake of brevity, details are not repeated here.

The above is a detailed introduction to the multi-modal fast transcription and annotation method based on a self-built template, the multi-modal fast transcription and annotation system based on a self-built template, and the storage medium provided by the embodiments of the present application. Specific examples are used herein to describe the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core ideas. At the same time, those skilled in the art may, based on the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this description should not be construed as limiting the present application.

Claims

A multi-modal fast transcription and annotation method based on self-built templates, characterized in that the method includes:

Obtain the project engineering file corresponding to the media file to be processed;

Obtain the audio data of the media file according to the directory of the project engineering file;

Perform segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data;

Display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and a boundary axis control;

In response to the editing operation on the boundary axis control, perform boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data;

Perform speech recognition processing on the processed segment data to obtain transcribed text;

Update the project engineering file according to the transcribed text to obtain an updated project engineering file, and the updated project engineering file carries the transcribed text;

When the updated project engineering file is played on the display interface, a text segment corresponding to the playback progress of the media file in the media file and the transcribed text is displayed.
The multi-modal fast transcription and annotation method based on self-built templates according to claim 1, characterized in that the performing, in response to the editing operation on the boundary axis control, of boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data includes:

In response to a first editing operation for an active end of a first boundary axis control of an active segment in the segment data, controlling the active end of the first boundary axis control to move to a first position;

Determine whether there is a second boundary axis control that overlaps the active end of the first boundary axis control at the first position, where the second boundary axis control is the boundary axis control corresponding to a second segment, and the active segment and the second segment are adjacent segments;

If there is a second boundary axis control that overlaps the active end of the first boundary axis control at the first position, the active segment and the second segment are merged.
The multi-modal fast transcription and annotation method based on a self-built template according to claim 2, characterized in that, after the determining whether there is a second boundary axis control that overlaps the active end of the first boundary axis control at the first position, the method further includes:

If there is no second boundary axis control overlapping the active end of the first boundary axis control at the first position, the boundary of the active segment is adjusted according to the first position.
The multi-modal fast transcription and annotation method based on self-built templates according to claim 1, characterized in that the performing, in response to the editing operation on the boundary axis control, of boundary adjustment processing or segment merging processing on the segment data to obtain processed segment data includes:

In response to a second editing operation for the active end of the first boundary axis control of the active segment in the segment data, controlling the active end of the first boundary axis control to move to a second position;

Determine whether there is a third boundary axis control that overlaps the active end of the first boundary axis control at the second position, where the third boundary axis control is the boundary axis control corresponding to a third segment, and the active segment and the third segment are non-adjacent segments;

If there is a third boundary axis control that overlaps the active end of the first boundary axis control at the second position, merge the active segment, the third segment, and the intermediate segments between the active segment and the third segment.
The multi-modal fast transcription and annotation method based on a self-built template according to claim 4, characterized in that, after the determining whether there is a third boundary axis control that overlaps the active end of the first boundary axis control at the second position, the method further includes:

If there is no third boundary axis control overlapping the active end of the first boundary axis control at the second position, determine whether the target area between the stationary end position of the first boundary axis control and the second position overlaps with any of the intermediate segments;

If the target area between the stationary end position of the first boundary axis control and the second position does not overlap with any of the intermediate segments, adjust the boundary of the active segment according to the second position; or

If the target area between the stationary end position of the first boundary axis control and the second position overlaps with at least one of the intermediate segments, merge the active segment with all the intermediate segments that overlap the target area.
The multi-modal fast transcription and annotation method based on self-built templates according to claim 1, characterized in that the performing of segmentation processing on the audio data according to the amplitude of the audio data to obtain segment data of the audio data includes:

The audio data is segmented according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain segment data of the audio data.
The multi-modal fast transcription and annotation method based on self-built templates according to claim 6, characterized in that the segmenting of the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the segment data of the audio data includes:

Obtain initial segment data of the audio data;

Determine whether the average amplitude within the current segment in the initial segment data is greater than the noise amplitude threshold;

If the average amplitude within the current segment in the initial segment data is greater than the noise amplitude threshold, mark the current segment as a sound segment;

Trim the audio points within the current segment marked as a sound segment at the segment start point and the segment end point, to remove silence or noise within the current segment;

If the start position of the trimmed current segment is the same as the end position of the previous segment, merge the trimmed current segment and the previous segment;

If the start position of the trimmed current segment is different from the end position of the previous segment, mark the trimmed current segment as a new segment;

The initial segment data of the audio data is traversed and processed to obtain the segment data of the audio data.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 7, wherein the obtaining the initial segment data of the audio data includes:

Perform initial segmentation processing on the audio data according to a preset language template to obtain initial segment data of the audio data.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 1, characterized in that said obtaining the project engineering files corresponding to the media files to be processed includes:

Get the media files to be processed;

Detect whether the corresponding project file has been created for the media file;

If it is detected that the media file does not have a corresponding project file, create a project file corresponding to the media file based on the template file; or

If it is detected that a corresponding project file has been created for the media file, the created project file corresponding to the media file is obtained.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 1, characterized in that the method further includes:

In response to an export instruction carrying a target file type, an export file corresponding to the target file type is exported from the project engineering file, where the target file type is any one of the preset file types.
The multi-modal fast transcription and annotation method based on self-built templates according to claim 10, characterized in that the method further includes:

In response to the import command, obtain the import file;

When the file type of the imported file belongs to any of the preset file types, the imported file is imported into the project engineering file.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 1, wherein the segment data of the audio data displayed on the operation interface includes:

The segment waveform information of the segment data of the audio data and the time axis information corresponding to the segment waveform information are displayed on the operation interface.
The multi-modal rapid transcription and annotation method based on self-built templates according to claim 12, characterized in that the method further includes:

In response to the hide waveform instruction, the segment waveform information and the timeline information are hidden on the operation interface.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 1, characterized in that the method further includes:

In response to an operation of inserting a breakpoint for a target segment in the segment data, a breakpoint is inserted in a boundary axis control of the target segment to perform segmentation processing on the target segment based on the breakpoint.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 1, wherein the transcribed text includes text fragments corresponding to each segment in the segment data, and after the performing of speech recognition processing on the processed segment data to obtain the transcribed text, the method further includes:

In response to a modification instruction for a target text segment in the transcribed text, the target text segment is modified to obtain a modified transcribed text, where the target text segment is at least one text segment in the transcribed text.
The multi-modal fast transcription and annotation method based on self-built templates as claimed in claim 15, characterized in that the method further includes:

In response to the annotation instruction for the target text segment, the target text segment is annotated to obtain an annotated transcribed text.
A multi-modal rapid transcription and annotation system based on self-built templates, characterized in that the system includes:

The first acquisition unit is used to acquire the project engineering file corresponding to the media file to be processed;

The second acquisition unit is used to acquire the audio data of the media file according to the directory of the project engineering file;

A segmentation unit, configured to segment the audio data according to the amplitude of the audio data to obtain segment data of the audio data;

A display unit configured to display the segment data of the audio data on an operation interface, the operation interface being used to provide a display interface and boundary axis controls;

A processing unit, configured to perform boundary adjustment processing or segment merging processing on the segment data in response to the editing operation on the boundary axis control, to obtain processed segment data;

A transcription unit, used to perform speech recognition processing on the processed segment data to obtain transcribed text;

An update unit, configured to update the project engineering file according to the transcribed text to obtain an updated project engineering file, where the updated project engineering file carries the transcribed text;

A playback unit, configured to display text segments in the media file and the transcribed text corresponding to the playback progress of the media file when the updated project engineering file is played on the display interface.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the steps in the multi-modal rapid transcription and annotation method based on self-built templates according to any one of claims 1-16.