WO2022237448A1 - Method and device for generating a speech recognition training set
- Publication number: WO2022237448A1 (application PCT/CN2022/087029)
- Authority: WIPO (PCT)
- Prior art keywords: video frame, text, audio, video, processed
Classifications

- G10L 15/04 (Speech recognition: Segmentation; Word boundary detection)
- G10L 15/06 (Speech recognition: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L 15/063 (Speech recognition: Training)
- G10L 15/26 (Speech recognition: Speech to text systems)
- H04N 21/435 (Selective content distribution: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream)
Definitions
- the embodiments of the present application relate to the field of computer technology, and in particular to a method and device for generating a speech recognition training set.
- ASR: automatic speech recognition.
- Embodiments of the present disclosure propose a method and device for generating a speech recognition training set.
- in a first aspect, the present disclosure provides a method for generating a speech recognition training set, including: acquiring audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizing the audio to be processed to obtain an audio text; recognizing the text information in the video to be processed to obtain a video text; and, based on the consistency between the audio text and the video text, using the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- in a second aspect, the present disclosure provides a device for generating a speech recognition training set, including: an acquisition unit configured to acquire audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; a first recognition unit configured to recognize the audio to be processed to obtain an audio text; a second recognition unit configured to recognize the text information in the video to be processed to obtain a video text; and an obtaining unit configured to, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- in a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any implementation manner of the first aspect.
- in a fourth aspect, the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation manner of the first aspect.
- in a fifth aspect, the present disclosure provides a computer program product including a computer program, wherein the computer program, when executed by a processor, implements the method described in any implementation manner of the first aspect.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
- FIG. 2 is a flowchart of an embodiment of a method for generating a speech recognition training set according to the present disclosure;
- FIG. 3 is a schematic diagram of the text splicing process according to the present embodiment;
- FIG. 4 is a schematic diagram of an application scenario of the method for generating a speech recognition training set according to the present embodiment;
- FIG. 5 is a flowchart of another embodiment of the method for generating a speech recognition training set according to the present disclosure;
- FIG. 6 is a structural diagram of an embodiment of a device for generating a speech recognition training set according to the present disclosure; and
- FIG. 7 is a schematic structural diagram of a computer system suitable for implementing the embodiments of the present disclosure.
- FIG. 1 shows an exemplary architecture 100 to which the method and device for generating a speech recognition training set of the present disclosure can be applied.
- a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
- the communication connections between the terminal devices 101, 102, and 103 constitute a topological network, and the network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105.
- the network 104 may include various connection types, such as wired links, wireless communication links, or fiber-optic cables, among others.
- the terminal devices 101, 102, and 103 may be hardware devices or software that support network connections for data interaction and data processing.
- when the terminal devices 101, 102, 103 are hardware, they can be various electronic devices that support functions such as network connection, information acquisition, interaction, display, and processing, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and the like.
- when the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
- the server 105 may be a server that provides various services, for example, a background processing server that acquires the audio to be processed and the corresponding video to be processed sent by users through the terminal devices 101, 102, 103, performs information processing, and automatically constructs a speech recognition training set.
- the server can also train an initial speech recognition model based on the speech recognition training set, or optimize a pre-trained speech recognition model.
- server 105 may be a cloud server.
- the server may be hardware or software.
- the server can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
- when the server is software, it can be implemented as multiple pieces of software or software modules (such as software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
- the method for generating the speech recognition training set may be executed by the server, by a terminal device, or by the server and the terminal device in cooperation with each other.
- correspondingly, each part (for example, each unit) included in the device for generating the speech recognition training set may be provided entirely in the server, entirely in the terminal device, or distributed between the server and the terminal device.
- a flow 200 of an embodiment of a method for generating a speech recognition training set is shown, including the following steps:
- Step 201: acquire audio to be processed and video to be processed.
- the execution subject of the method for generating the speech recognition training set can obtain the audio and video to be processed remotely or locally through a wired network connection or a wireless network connection.
- the video to be processed includes text information corresponding to the audio to be processed.
- the data including corresponding audio to be processed and video to be processed may be various audio and video data such as movies, TV dramas, and short videos.
- the text information in the video to be processed is subtitle information
- the audio to be processed is voice information corresponding to the subtitle information.
- the speech data represented by the audio to be processed may be various types of speech, including but not limited to foreign language audio, Mandarin audio, and dialect audio.
- the audio to be processed and the video to be processed may be data with a long duration or data with a short duration.
- Step 202: recognize the audio to be processed to obtain the audio text.
- the execution subject can identify the audio to be processed to obtain the audio text.
- the above execution subject may process the audio to be processed based on the automatic speech recognition model to obtain the audio text.
- the automatic speech recognition model is used to represent the correspondence between the audio to be processed and the text.
- the above execution subject may perform the above step 202 in the following manner:
- the silent part in the audio to be processed is deleted to obtain multiple non-silent audio segments.
- the execution subject may use the silent part in the audio to be processed as a segmentation point, and divide the audio to be processed after the silent part is deleted to obtain multiple audio segments.
- the above-mentioned execution body can set a duration threshold, further cut the audio clips whose duration exceeds the duration threshold into segments no longer than the threshold, and record the start and end time of each audio clip.
- multiple audio segments are identified to obtain multiple audio segment texts included in the audio text.
- the execution subject may input each of the multiple audio clips into the automatic speech recognition model to obtain multiple audio clip texts.
- a plurality of audio segments correspond to a plurality of audio segment texts one by one, and the plurality of audio segment texts constitute an audio text.
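- one possible implementation of this segmentation step is sketched below in Python; the pydub library, the silence thresholds, and the duration threshold MAX_CLIP_MS are assumptions for illustration, since the disclosure only specifies a silence detection algorithm and a configurable duration threshold:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

MAX_CLIP_MS = 15_000  # assumed duration threshold

def split_audio(path, min_silence_ms=300, silence_thresh_db=-40):
    """Delete silent parts and return (clip, start_s, end_s) tuples."""
    audio = AudioSegment.from_file(path)
    clips = []
    # detect_nonsilent yields [start_ms, end_ms] spans between silent parts
    for start_ms, end_ms in detect_nonsilent(audio, min_silence_ms, silence_thresh_db):
        cursor = start_ms
        # further cut clips longer than the duration threshold,
        # recording each clip's start and end time
        while cursor < end_ms:
            stop = min(cursor + MAX_CLIP_MS, end_ms)
            clips.append((audio[cursor:stop], cursor / 1000.0, stop / 1000.0))
            cursor = stop
    return clips
```

- each resulting clip would then be fed to the automatic speech recognition model to produce its audio segment text.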
- Step 203: recognize the text information in the video to be processed to obtain the video text.
- the execution subject can identify the text information in the video to be processed to obtain the video text.
- the above execution subject can use OCR (Optical Character Recognition) technology to identify the text information included in each video frame, and splice the text information corresponding to each video frame according to the playback sequence of the video frames in the video to be processed to obtain the video text.
- OCR is a relatively mature technology at present and will not be described in detail here.
- the above execution subject may perform the above step 203 in the following manner:
- a plurality of video frame sequences corresponding to a plurality of audio segments are determined from the video to be processed.
- for each audio segment, the execution subject extracts a plurality of video frames corresponding to the audio segment from the video to be processed to obtain a video frame sequence.
- assuming the start and end times of the k-th audio clip are $t_{s_k}$ and $t_{e_k}$ respectively, the above execution body can determine the start and end video frames of the video frame sequence corresponding to the audio segment as $n_{s_k} = \lceil t_{s_k} \cdot f_P \rceil$ and $n_{e_k} = \lfloor t_{e_k} \cdot f_P \rfloor$ in sequence, where $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ represent rounding up and rounding down respectively, and $f_P$ represents the frame rate of the video to be processed.
- the above-mentioned execution body can also preset a sampling rate and, based on the sampling rate, extract video frames between the start frame and the end frame to obtain the video frame sequence corresponding to the audio segment.
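- a minimal sketch of this time-to-frame mapping, using the ceil/floor convention above (the `sample_every` parameter standing in for the preset sampling rate is an assumption):

```python
import math

def frame_range(t_start, t_end, fps, sample_every=1):
    """Map an audio clip's start/end times (seconds) to video frame indices:
    the start frame is rounded up, the end frame rounded down."""
    n_start = math.ceil(t_start * fps)
    n_end = math.floor(t_end * fps)
    return list(range(n_start, n_end + 1, sample_every))

# e.g. a clip spanning 1.24 s to 3.80 s of a 25 fps video:
# frame_range(1.24, 3.80, 25, sample_every=5) -> [31, 36, ..., 91]
```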
- the text information in each video frame in the multiple video frame sequences is identified to obtain the video frame text included in the video text.
- the execution subject may use OCR technology to identify text information in each video frame in a plurality of video frame sequences, and obtain video frame text included in the video text.
- for a given video frame, the execution subject may recognize no text information (that is, the video frame includes no text information), or may recognize text information in multiple places and obtain multiple video frame texts.
- the multiple video frame texts include subtitle information in the video frames, as well as other text information appearing in the video frames (for example, store name information in shop signs, road name information in road signs, slogan information, etc.).
- adjacent video frames may include the same subtitle information.
- a preset identifier may be added for a video frame to represent the situation that the video frame includes no text information, or includes the same text information as an adjacent video frame.
- the preset identifier may be any preset identifier, such as "Blank".
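- a sketch of the per-frame recognition with the preset identifier, assuming OpenCV (cv2) for frame access and pytesseract as a stand-in OCR engine (the disclosure does not name a specific OCR implementation):

```python
import cv2
import pytesseract

BLANK = "Blank"  # the preset identifier

def frame_texts(video_path, frame_indices):
    """OCR each sampled frame; emit BLANK for frames with no text or with
    the same text as the previous sampled frame."""
    cap = cv2.VideoCapture(video_path)
    results, previous = [], None
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        text = pytesseract.image_to_string(frame).strip()
        if not text or text == previous:
            results.append([BLANK])  # no text, or same as the adjacent frame
        else:
            # an OCR pass may find several text regions (subtitles, signs, ...)
            # and BLANK is kept as an additional candidate for every frame
            results.append(text.splitlines() + [BLANK])
            previous = text
    cap.release()
    return results
```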
- Step 204: based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- in response to determining that the audio text and the video text are consistent, the execution subject uses the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- the above execution subject may perform the above step 204 in the following manner:
- in a first step, for each video frame in the video frame sequence, one video frame text is selected from the at least one video frame text identified in that video frame, and the selected texts are spliced in playback order; enumerating all such selections yields the multiple video frame sequence texts corresponding to the video frame sequence.
- as an example, if the video frame sequence includes 3 video frames and the numbers of video frame texts corresponding to the 3 video frames are 3, 4, and 3 in sequence, the video frame sequence corresponds to a total of 36 (3×4×3) video frame sequence texts.
- the video frame text set corresponding to each video frame includes not only the video frame texts identified from the video frame, but also the preset identifier, which represents that the video frame includes the same text information as an adjacent video frame or includes no text information.
- continuing the example, after the preset identifier is added, the numbers of video frame texts corresponding to the 3 video frames become 4, 5, and 4 in sequence, so the video frame sequence corresponds to a total of 80 (4×5×4) video frame sequence texts.
- in this example, the preset identifier is "Blank".
- the recognition results corresponding to video frames 301, 302, 303 are 304, 305, 306 in sequence.
- Each video frame text in each video frame can be combined with video frame text in other video frames to obtain multiple video frame sequence texts.
- for example, the recognition results of video frames 301, 302, 303 may be combined as "what's the weather like today".
- the target video frame sequence text is determined according to the edit distance between each video frame sequence text in the plurality of video frame sequence texts and the target audio segment text.
- the target audio segment text is the audio segment text corresponding to the audio segment corresponding to the video frame sequence.
- the edit distance refers to the minimum number of editing operations required to convert one string into another.
- the execution subject may determine, among the multiple video frame sequence texts, the video frame sequence text with the smallest edit distance to the target audio segment text as the target video frame sequence text.
- each of the multiple audio clips is used as a speech sample, and the target video frame sequence text corresponding to the audio clip is used as a label to obtain a speech recognition training set.
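- a self-contained sketch of this selection step; `edit_distance` is the classic Levenshtein dynamic program, and `pick_target` chooses the candidate closest to the ASR output:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of edits turning a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def pick_target(sequence_texts, audio_segment_text):
    """The target video frame sequence text is the candidate with the
    smallest edit distance to the target audio segment text."""
    return min(sequence_texts, key=lambda t: edit_distance(t, audio_segment_text))
```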
- the above execution subject may perform the above first step in the following manner:
- in the first step, for each video frame including text information in the video frame sequence, the following operations are performed: the multiple texts to be spliced corresponding to the video frame are determined, and the multiple texts to be spliced are spliced with the at least one video frame text in the video frame to obtain multiple spliced texts.
- then, based on the edit distance between each of the multiple spliced texts and the target audio segment text, a preset number of spliced texts are selected from the multiple spliced texts and used as the multiple texts to be spliced corresponding to the next video frame of the video frame.
- as an example, the execution subject may sort the spliced texts by edit distance from small to large, and select the preset number of spliced texts with the smallest edit distances as the multiple texts to be spliced corresponding to the next video frame of the video frame.
- the preset number can be specifically set according to the actual situation, for example, it can be 10.
- as another example, the above-mentioned execution body can set a preset distance threshold and delete the spliced texts whose edit distance is greater than the preset distance threshold.
- the above-mentioned execution body can also combine the two approaches, that is, selecting a preset number of spliced texts and deleting spliced texts whose edit distance is greater than the preset distance threshold, to determine the multiple texts to be spliced corresponding to the next video frame of the video frame.
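- taken together, this first step amounts to a beam search over per-frame candidates; a sketch, reusing the `edit_distance` helper above, with the preset number of retained candidates as the beam width:

```python
def splice_with_pruning(per_frame_texts, target_text, beam=10):
    """Grow spliced texts frame by frame, keeping only the `beam`
    candidates closest to the target audio segment text."""
    candidates = [""]
    for texts in per_frame_texts:
        # selecting "Blank" leaves a candidate unchanged
        expanded = {c + (t if t != "Blank" else "") for c in candidates
                    for t in texts}
        candidates = sorted(expanded,
                            key=lambda c: edit_distance(c, target_text))[:beam]
    return candidates
```

- pruning keeps the candidate count bounded by the beam width instead of letting it grow multiplicatively with the number of frames (e.g. the 36 and 80 combinations in the examples above).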
- the above execution subject may determine the matching degree between the retained plurality of spliced texts and the audio text through a formula of the following form: $Q_i = 1 - \frac{d(p_{ki}, S_k)}{\max(|p_{ki}|, |S_k|)}$, where $d(\cdot,\cdot)$ represents the edit distance calculation function of two texts, $|\cdot|$ represents the length of a text, $p_{ki}$ represents the $i$-th spliced text, $S_k$ represents the audio text, and $Q_i$ represents the matching degree between the two texts.
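- the matching degree can then be computed directly from the `edit_distance` helper above; this follows the normalized form given here, which is one plausible reading of the filing's formula:

```python
def matching_degree(spliced_text, audio_text):
    """Q_i = 1 - d(p_ki, S_k) / max(|p_ki|, |S_k|); 1.0 means identical."""
    if not spliced_text and not audio_text:
        return 1.0
    d = edit_distance(spliced_text, audio_text)
    return 1.0 - d / max(len(spliced_text), len(audio_text))
```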
- for each video frame sequence in the multiple video frame sequences, the execution subject may also, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than the preset distance threshold, delete the training samples corresponding to the video frame sequence from the speech recognition training set, thereby filtering out low-quality training samples.
- FIG. 4 is a schematic diagram 400 of an application scenario of the method for generating a speech recognition training set according to this embodiment.
- the server 401 obtains the audio 402 to be processed and the video 403 to be processed.
- the video to be processed 403 includes text information corresponding to the audio to be processed 402.
- the server 401 recognizes the audio to be processed 402 to obtain an audio text 404, and recognizes the text information in the video to be processed 403 to obtain a video text 405.
- the server 401 determines the consistency between the audio text 404 and the video text 405, takes the audio to be processed 402 as a speech sample, and takes the video text 405 as a label to obtain a speech recognition training set 406.
- the method provided by the above-mentioned embodiments of the present disclosure acquires the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizes the audio to be processed to obtain the audio text; recognizes the text information in the video to be processed to obtain the video text; and, based on the consistency between the audio text and the video text, uses the audio to be processed as the speech sample and the video text as the label to obtain the speech recognition training set, thereby providing a method for automatically acquiring a speech recognition training set and improving the flexibility and efficiency of constructing speech recognition training sets.
- the execution subject may train an untrained initial speech recognition model based on a speech recognition training set, or optimize a pre-trained speech recognition model.
- as an example, the above-mentioned execution body adopts a machine learning algorithm, takes the audio to be processed in a training sample as input and takes the label (the video text or target video frame sequence text) corresponding to the input audio as the expected output, to train the untrained initial speech recognition model or to optimize the pre-trained speech recognition model, obtaining the final speech recognition model.
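- as an illustrative sketch, the resulting (speech sample, label) pairs can be serialized into a JSON-lines manifest, a common input format for ASR training toolkits; the field names `audio` and `text` are assumptions:

```python
import json

def write_manifest(pairs, out_path="train_manifest.jsonl"):
    """Write (audio_clip_path, label_text) pairs, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, label in pairs:
            record = {"audio": audio_path, "text": label}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```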
- FIG. 5 a schematic flow 500 of an embodiment of a method for generating a speech recognition training set according to the present disclosure is shown, including the following steps:
- Step 501: acquire audio to be processed and video to be processed.
- the video to be processed includes text information corresponding to the audio to be processed
- Step 502: delete the silent parts in the audio to be processed to obtain multiple non-silent audio segments.
- Step 503: recognize the multiple audio segments to obtain multiple audio segment texts included in the audio text.
- Step 504: determine, from the video to be processed, multiple video frame sequences corresponding to the multiple audio segments.
- Step 505: recognize the text information in each video frame in the multiple video frame sequences to obtain the video frame texts included in the video text.
- Step 506: for each video frame sequence in the multiple video frame sequences, perform the following operations:
- Step 5061: splice the text information included in each video frame in the video frame sequence, taking one video frame text out of the at least one video frame text identified in each video frame as a unit, to obtain multiple video frame sequence texts corresponding to the video frame sequence.
- Step 5062: determine the target video frame sequence text according to the edit distance between each of the multiple video frame sequence texts and the target audio segment text, wherein the target audio segment text is the audio segment text corresponding to the audio segment that corresponds to the video frame sequence.
- Step 507: use each audio segment in the multiple audio segments as a speech sample and the target video frame sequence text corresponding to the audio segment as a label to obtain the speech recognition training set.
- the process 500 of the method for generating a speech recognition training set in this embodiment specifically illustrates the segmentation process of the audio to be processed and the video to be processed, as well as the video frame text splicing process, which improves the accuracy of the training samples in the speech recognition training set.
- as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a device for generating a speech recognition training set, which corresponds to the method embodiment shown in FIG. 2.
- the device can be specifically applied to various electronic devices.
- the device for generating a speech recognition training set includes: an acquisition unit 601 configured to acquire the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed;
- a first recognition unit 602 configured to recognize the audio to be processed to obtain the audio text;
- a second recognition unit 603 configured to recognize the text information in the video to be processed to obtain the video text;
- an obtaining unit 604 configured to, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- the first recognition unit 602 is further configured to: delete, according to a silence detection algorithm, the silent parts in the audio to be processed to obtain multiple non-silent audio segments; and recognize the multiple audio segments to obtain multiple audio segment texts included in the audio text.
- the second recognition unit 603 is further configured to: determine, from the video to be processed, multiple video frame sequences corresponding to the multiple audio segments; and recognize the text information in each video frame in the multiple video frame sequences to obtain the video frame texts included in the video text.
- the obtaining unit 604 is further configured to: for each video frame sequence in the multiple video frame sequences, perform the following operations: splice the text information included in each video frame in the video frame sequence, taking one video frame text out of the at least one video frame text identified in the video frames of the video frame sequence as a unit, to obtain multiple video frame sequence texts corresponding to the video frame sequence; determine the target video frame sequence text according to the edit distance between each of the multiple video frame sequence texts and the target audio segment text, wherein the target audio segment text is the audio segment text corresponding to the audio segment that corresponds to the video frame sequence; and use each audio segment in the multiple audio segments as a speech sample and the target video frame sequence text corresponding to the audio segment as a label to obtain the speech recognition training set.
- the obtaining unit 604 is further configured to: for each video frame including text information in the video frame sequence, perform the following operations: determine the multiple texts to be spliced corresponding to the video frame, and splice the multiple texts to be spliced with the at least one video frame text in the video frame to obtain multiple spliced texts; and, based on the edit distance between the multiple spliced texts and the target audio segment text, select a preset number of spliced texts from the multiple spliced texts as the multiple texts to be spliced corresponding to the next video frame of the video frame.
- the above device further includes: a deletion unit (not shown in the figure), configured to, for each video frame sequence in the multiple video frame sequences, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than a preset distance threshold, delete the training samples corresponding to the video frame sequence from the speech recognition training set.
- in the device for generating a speech recognition training set, the acquisition unit acquires the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; the first recognition unit recognizes the audio to be processed to obtain the audio text; the second recognition unit recognizes the text information in the video to be processed to obtain the video text; and the obtaining unit, based on the consistency between the audio text and the video text, uses the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set, thereby providing a device for automatically acquiring a speech recognition training set and improving the flexibility and efficiency of constructing speech recognition training sets.
- FIG. 7 shows a schematic structural diagram of a computer system 700 suitable for implementing the devices of the embodiments of the present disclosure (such as the devices 101, 102, 103, and 105 shown in FIG. 1).
- the device shown in FIG. 7 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
- a computer system 700 includes a processor (such as a central processing unit, CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703.
- in the RAM 703, various programs and data necessary for the operation of the system 700 are also stored.
- the processor 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704.
- An input/output (I/O) interface 705 is also connected to the bus 704 .
- the following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, etc.; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 708 including a hard disk, etc.; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like.
- the communication section 709 performs communication processing via a network such as the Internet.
- a drive 710 is also connected to the I/O interface 705 as needed.
- a removable medium 711 such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc. is mounted on the drive 710 as necessary so that a computer program read therefrom is installed into the storage section 708 as necessary.
- embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program codes for executing the methods shown in the flowcharts.
- the computer program may be downloaded and installed from a network via the communication section 709 and/or installed from the removable medium 711.
- when the computer program is executed by the processor 701, the above-mentioned functions defined in the method of the present disclosure are performed.
- the computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
- a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
- computer program code for carrying out the operations of the present disclosure can be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on a remote computer or server.
- in the latter case, the remote computer can be connected to the client computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
- each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware.
- the described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit, a first recognition unit, a second recognition unit, and an obtaining unit.
- the names of these units do not, in some cases, constitute a limitation on the units themselves.
- for example, the obtaining unit can also be described as "a unit that, based on the consistency between the audio text and the video text, uses the audio to be processed as the speech sample and the video text as the label to obtain the speech recognition training set".
- the present disclosure also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device.
- the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the device, the device: acquires audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizes the audio to be processed to obtain an audio text; recognizes the text information in the video to be processed to obtain a video text; and, based on the consistency between the audio text and the video text, uses the audio to be processed as a speech sample and the video text as a label to obtain a speech recognition training set.
Claims (15)
- A method for generating a speech recognition training set, comprising: acquiring audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizing the audio to be processed to obtain an audio text; recognizing the text information in the video to be processed to obtain a video text; and, based on the consistency between the audio text and the video text, using the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- The method according to claim 1, wherein the recognizing the audio to be processed to obtain an audio text comprises: deleting, according to a silence detection algorithm, the silent parts in the audio to be processed to obtain multiple non-silent audio segments; and recognizing the multiple audio segments to obtain multiple audio segment texts included in the audio text.
- The method according to claim 2, wherein the recognizing the text information in the video to be processed to obtain a video text comprises: determining, from the video to be processed, multiple video frame sequences in one-to-one correspondence with the multiple audio segments; and recognizing the text information in each video frame in the multiple video frame sequences to obtain the video frame texts included in the video text.
- The method according to claim 3, wherein the obtaining the speech recognition training set based on the consistency between the audio text and the video text, using the audio to be processed as a speech sample and the video text as a label, comprises: for each video frame sequence in the multiple video frame sequences, performing the following operations: splicing the text information included in each video frame in the video frame sequence, taking one video frame text out of the at least one video frame text identified in the video frames of the video frame sequence as a unit, to obtain multiple video frame sequence texts corresponding to the video frame sequence; determining a target video frame sequence text according to the edit distance between each of the multiple video frame sequence texts and a target audio segment text, wherein the target audio segment text is the audio segment text corresponding to the audio segment that corresponds to the video frame sequence; and using each audio segment in the multiple audio segments as a speech sample and the target video frame sequence text corresponding to the audio segment as a label to obtain the speech recognition training set.
- The method according to claim 4, wherein the splicing the text information included in each video frame in the video frame sequence, taking one video frame text out of the at least one video frame text identified in the video frames of the video frame sequence as a unit, to obtain multiple video frame sequence texts corresponding to the video frame sequence comprises: for each video frame including text information in the video frame sequence, performing the following operations: determining multiple texts to be spliced corresponding to the video frame, and splicing the multiple texts to be spliced with the at least one video frame text in the video frame to obtain multiple spliced texts; and, based on the edit distances between the multiple spliced texts and the target audio segment text, selecting a preset number of spliced texts from the multiple spliced texts as the multiple texts to be spliced corresponding to the next video frame of the video frame.
- The method according to claim 4, further comprising: for each video frame sequence in the multiple video frame sequences, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than a preset distance threshold, deleting the training samples corresponding to the video frame sequence from the speech recognition training set.
- A device for generating a speech recognition training set, comprising: an acquisition unit configured to acquire audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; a first recognition unit configured to recognize the audio to be processed to obtain an audio text; a second recognition unit configured to recognize the text information in the video to be processed to obtain a video text; and an obtaining unit configured to, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
- The device according to claim 7, wherein the first recognition unit is further configured to: delete, according to a silence detection algorithm, the silent parts in the audio to be processed to obtain multiple non-silent audio segments; and recognize the multiple audio segments to obtain multiple audio segment texts included in the audio text.
- The device according to claim 8, wherein the second recognition unit is further configured to: determine, from the video to be processed, multiple video frame sequences in one-to-one correspondence with the multiple audio segments; and recognize the text information in each video frame in the multiple video frame sequences to obtain the video frame texts included in the video text.
- The device according to claim 9, wherein the obtaining unit is further configured to: for each video frame sequence in the multiple video frame sequences, perform the following operations: splice the text information included in each video frame in the video frame sequence, taking one video frame text out of the at least one video frame text identified in the video frames of the video frame sequence as a unit, to obtain multiple video frame sequence texts corresponding to the video frame sequence; determine a target video frame sequence text according to the edit distance between each of the multiple video frame sequence texts and a target audio segment text, wherein the target audio segment text is the audio segment text corresponding to the audio segment that corresponds to the video frame sequence; and use each audio segment in the multiple audio segments as a speech sample and the target video frame sequence text corresponding to the audio segment as a label to obtain the speech recognition training set.
- The device according to claim 10, wherein the obtaining unit is further configured to: for each video frame including text information in the video frame sequence, perform the following operations: determine multiple texts to be spliced corresponding to the video frame, and splice the multiple texts to be spliced with the at least one video frame text in the video frame to obtain multiple spliced texts; and, based on the edit distances between the multiple spliced texts and the target audio segment text, select a preset number of spliced texts from the multiple spliced texts as the multiple texts to be spliced corresponding to the next video frame of the video frame.
- The device according to claim 10, further comprising: a deletion unit configured to, for each video frame sequence in the multiple video frame sequences, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than a preset distance threshold, delete the training samples corresponding to the video frame sequence from the speech recognition training set.
- A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
- An electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023568632A (publication JP2024517902A) | 2021-05-08 | 2022-04-15 | Method and device for generating a speech recognition training set |
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110514350.XA (publication CN115312032A) | 2021-05-08 | 2021-05-08 | Method and device for generating a speech recognition training set |
| CN202110514350.X | 2021-05-08 | | |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2022237448A1 | 2022-11-17 |
Family
ID=83853589
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/087029 | WO2022237448A1 | 2021-05-08 | 2022-04-15 |
Country Status (3)

| Country | Link |
|---|---|
| JP | JP2024517902A |
| CN | CN115312032A |
| WO | WO2022237448A1 |
- 2021-05-08: CN application CN202110514350.XA filed; publication CN115312032A, status pending
- 2022-04-15: JP application JP2023568632A filed; publication JP2024517902A, status pending
- 2022-04-15: PCT application PCT/CN2022/087029 filed; published as WO2022237448A1
Patent Citations (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106604125A * | 2016-12-29 | 2017-04-26 | Beijing QIYI Century Science & Technology Co., Ltd. | Video subtitle determination method and apparatus |
| CN108924626A * | 2018-08-17 | 2018-11-30 | Tencent Technology (Shenzhen) Co., Ltd. | Picture generation method, apparatus, device, and storage medium |
| CN109257659A * | 2018-11-16 | 2019-01-22 | Beijing Microlive Vision Technology Co., Ltd. | Subtitle adding method and apparatus, electronic device, and computer-readable storage medium |
| CN109803180A * | 2019-03-08 | 2019-05-24 | Tencent Technology (Shenzhen) Co., Ltd. | Video preview image generation method and apparatus, computer device, and storage medium |
| CN111639233A * | 2020-05-06 | 2020-09-08 | Guangdong Genius Technology Co., Ltd. | Learning video subtitle adding method and apparatus, terminal device, and storage medium |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116631447A * | 2023-07-24 | 2023-08-22 | iFLYTEK Co., Ltd. | Noise extraction method, apparatus, device, and readable storage medium |
| CN116631447B * | 2023-07-24 | 2023-12-01 | iFLYTEK Co., Ltd. | Noise extraction method, apparatus, device, and readable storage medium |
Also Published As

| Publication number | Publication date |
|---|---|
| JP2024517902A | 2024-04-23 |
| CN115312032A | 2022-11-08 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22806418; Country: EP; Kind code: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 18559398; Country: US; Ref document number: 2023568632; Country: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 22806418; Country: EP; Kind code: A1 |