WO2022237448A1 - Method and apparatus for generating a speech recognition training set (语音识别训练集的生成方法及装置) - Google Patents

Method and apparatus for generating a speech recognition training set (语音识别训练集的生成方法及装置) Download PDF

Info

Publication number
WO2022237448A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
text
audio
video
processed
Prior art date
Application number
PCT/CN2022/087029
Other languages
English (en)
French (fr)
Inventor
付立
Original Assignee
京东科技控股股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技控股股份有限公司 filed Critical 京东科技控股股份有限公司
Publication of WO2022237448A1 publication Critical patent/WO2022237448A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular to a method and device for generating a speech recognition training set.
  • ASR: Automatic Speech Recognition
  • Embodiments of the present disclosure propose a method and device for generating a speech recognition training set.
  • the present disclosure provides a method for generating a speech recognition training set, including: acquiring audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizing the audio to be processed to obtain an audio text; recognizing the text information in the video to be processed to obtain a video text; and, based on the consistency between the audio text and the video text, using the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
  • the present disclosure provides an apparatus for generating a speech recognition training set, including: an acquisition unit configured to acquire audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; a first recognition unit configured to recognize the audio to be processed to obtain an audio text; a second recognition unit configured to recognize the text information in the video to be processed to obtain a video text; and an obtaining unit configured to, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
  • the present disclosure provides a computer-readable medium, on which a computer program is stored, wherein, when the program is executed by a processor, the method described in any implementation manner of the first aspect is implemented.
  • the present disclosure provides an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation manner of the first aspect.
  • the present disclosure provides a computer program product including a computer program, when the computer program is executed by a processor, the method as described in any implementation manner of the first aspect can be implemented.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
  • Fig. 2 is a flowchart of an embodiment of a method for generating a speech recognition training set according to the present disclosure
  • Fig. 3 is a schematic diagram of the text splicing process according to the present embodiment
  • FIG. 4 is a schematic diagram of an application scenario of a method for generating a speech recognition training set according to this embodiment
  • FIG. 5 is a flow chart of another embodiment of a method for generating a speech recognition training set according to the present disclosure
  • FIG. 6 is a structural diagram of an embodiment of a device for generating a speech recognition training set according to the present disclosure
  • FIG. 7 is a schematic structural diagram of a computer system suitable for implementing the embodiments of the present disclosure.
  • FIG. 1 shows an exemplary architecture 100 to which the method and device for generating a speech recognition training set of the present disclosure can be applied.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the communication connections between the terminal devices 101 , 102 , and 103 constitute a topological network, and the network 104 is used to provide a communication link medium between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, and 103 may be hardware devices or software that support network connections for data interaction and data processing.
  • the terminal devices 101, 102, 103 are hardware, they can be various electronic devices that support network connection, information acquisition, interaction, display, processing and other functions, including but not limited to smartphones, tablet computers, e-book readers, Laptops and desktop computers and more.
  • the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. It can be implemented, for example, as a plurality of software or software modules for providing distributed services, or as a single software or software module. No specific limitation is made here.
  • the server 105 may be a server that provides various services, such as a background processing server that acquires the corresponding video and audio to be processed sent by the user through the terminal devices 101, 102, 103, performs information processing, and automatically constructs a speech recognition training set.
  • the server can also train an initial speech recognition model based on the speech recognition training set, or optimize a pre-trained speech recognition model.
  • server 105 may be a cloud server.
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • the server is software, it can be implemented as multiple software or software modules (such as software or software modules for providing distributed services), or as a single software or software module. No specific limitation is made here.
  • the method for generating the speech recognition training set may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other.
  • each part (for example, each unit) included in the apparatus for generating the speech recognition training set may be all set in the server, all may be set in the terminal device, or may be set in the server and the terminal device separately.
  • a flow 200 of an embodiment of a method for generating a speech recognition training set is shown, including the following steps:
  • Step 201 acquire audio and video to be processed.
  • the execution subject of the method for generating the speech recognition training set can obtain the audio and video to be processed remotely or locally through a wired network connection or a wireless network connection.
  • the video to be processed includes text information corresponding to the audio to be processed.
  • the data including corresponding audio to be processed and video to be processed may be various audio and video data such as movies, TV dramas, and short videos.
  • the text information in the video to be processed is subtitle information
  • the audio to be processed is voice information corresponding to the subtitle information.
  • the speech data represented by the audio to be processed may be various types of speech, including but not limited to foreign language audio, Mandarin audio, and dialect audio.
  • the audio to be processed and the video to be processed may be data with a long duration or data with a short duration.
  • Step 202 identifying the audio to be processed to obtain the audio text.
  • the execution subject can identify the audio to be processed to obtain the audio text.
  • the above execution subject may process the audio to be processed based on the automatic speech recognition model to obtain the audio text.
  • the automatic speech recognition model is used to represent the correspondence between the audio to be processed and the text.
  • the above execution subject may perform the above step 202 in the following manner:
  • the silent part in the audio to be processed is deleted to obtain multiple non-silent audio segments.
  • the execution subject may use the silent part in the audio to be processed as a segmentation point, and divide the audio to be processed after the silent part is deleted to obtain multiple audio segments.
  • For the case where an obtained audio segment is long, the execution subject can set a duration threshold, further cut the audio segments whose duration exceeds the threshold into pieces whose length equals the threshold, and record the start and end time of each audio segment. For example, to prevent factors such as background music from keeping the silence detection from fully splitting the audio, a duration threshold T (e.g., T = 10 s) can be set and segments longer than T forcibly cut into multiple segments of duration T.
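  • For illustration only, the silence-based segmentation described above could be approximated with a simple energy threshold; the frame length, energy threshold and maximum segment duration below are assumed values, not parameters given by the disclosure. The sketches in this section use Python.

```python
import numpy as np

def split_on_silence(samples: np.ndarray, sr: int, frame_ms: int = 30,
                     energy_thresh: float = 1e-4, max_seg_s: float = 10.0):
    """Return (start_s, end_s) pairs of non-silent audio segments.

    Frames whose mean energy falls below energy_thresh are treated as silence
    and used as split points; segments longer than max_seg_s are further cut
    into max_seg_s-long pieces, mirroring the duration-threshold step above.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh

    segments, start = [], None
    for i, v in enumerate(voiced):
        t = i * frame_ms / 1000.0
        if v and start is None:
            start = t                      # non-silent segment begins
        elif not v and start is not None:
            segments.append((start, t))    # segment ends at this silent frame
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms / 1000.0))

    bounded = []                           # enforce the duration threshold T
    for s, e in segments:
        while e - s > max_seg_s:
            bounded.append((s, s + max_seg_s))
            s += max_seg_s
        bounded.append((s, e))
    return bounded
```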
  • multiple audio segments are identified to obtain multiple audio segment texts included in the audio text.
  • the execution subject may input each of the multiple audio clips into the automatic speech recognition model to obtain multiple audio clip texts.
  • a plurality of audio segments correspond to a plurality of audio segment texts one by one, and the plurality of audio segment texts constitute an audio text.
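  • Continuing the sketch above, the per-segment recognition step might look as follows; asr_model is a placeholder object assumed to expose a transcribe() method, since the disclosure does not name a particular automatic speech recognition implementation.

```python
def transcribe_segments(samples, sr, segments, asr_model):
    """Run ASR on each non-silent segment and return the audio segment texts
    that together make up the audio text (one text per segment)."""
    segment_texts = []
    for start_s, end_s in segments:
        clip = samples[int(start_s * sr):int(end_s * sr)]
        # transcribe() is an assumed interface of whatever ASR model is used.
        segment_texts.append(asr_model.transcribe(clip, sample_rate=sr))
    return segment_texts
```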
  • Step 203 identifying text information in the video to be processed to obtain video text.
  • the execution subject can identify the text information in the video to be processed to obtain the video text.
  • As an example, for each video frame making up the video to be processed, the execution subject can use OCR (Optical Character Recognition) technology to identify the text information included in the video frame, and splice the text information corresponding to each video frame according to the playback order of the video frames in the video to be processed, to obtain the video text.
  • the OCR technology is a relatively mature technology at present, and will not be repeated here.
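  • A possible realization of this OCR step, sketched with OpenCV and pytesseract (the library choice, sampling interval and language pack are assumptions, not requirements of the disclosure):

```python
import cv2
import pytesseract

def video_frame_texts(video_path: str, sample_every: int = 5):
    """OCR every sample_every-th frame and return the recognized text per
    sampled frame, in playback order."""
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # "chi_sim" assumes Chinese subtitles and an installed language pack.
            text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
            # "Blank" is the preset identifier described later for frames
            # without text information.
            texts.append(text if text else "Blank")
        idx += 1
    cap.release()
    return texts
```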
  • the above execution subject may perform the above step 203 in the following manner:
  • a plurality of video frame sequences corresponding to a plurality of audio segments are determined from the video to be processed.
  • the execution subject extracts a plurality of video frames corresponding to the audio segment from the video to be processed to obtain a sequence of video frames.
  • As an example, if the start and end times of the audio segment are t_sk and t_ek respectively, the execution subject can determine the start and end video frames of the video frame sequence corresponding to the audio segment as ⌈t_sk·f_P⌉ and ⌊t_ek·f_P⌋, where ⌈·⌉ and ⌊·⌋ denote rounding up and rounding down respectively, and f_P denotes the frame rate of the video to be processed.
  • The execution subject can preset a sampling rate and, based on it, extract video frames between the start frame ⌈t_sk·f_P⌉ and the end frame ⌊t_ek·f_P⌋ to obtain the video frame sequence corresponding to the audio segment.
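  • The mapping from an audio segment's boundaries to frame indices can be written directly; this small sketch only illustrates the rounding described above (variable names are illustrative):

```python
import math

def frame_range(t_start: float, t_end: float, fps: float):
    """Map an audio segment [t_start, t_end] in seconds to start/end video
    frame indices, rounding the start up and the end down."""
    return math.ceil(t_start * fps), math.floor(t_end * fps)

# e.g. a segment from 3.2 s to 7.9 s in a 25 fps video spans frames 80..197
print(frame_range(3.2, 7.9, 25.0))
```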
  • the text information in each video frame in the multiple video frame sequences is identified to obtain the video frame text included in the video text.
  • the execution subject may use OCR technology to identify text information in each video frame in a plurality of video frame sequences, and obtain video frame text included in the video text.
  • It can be understood that, for a given video frame, the execution subject may not recognize any text information (that is, the video frame includes no text information), or may recognize text information at multiple places and thereby obtain multiple video frame texts.
  • the multiple video frame texts include subtitle information in the video frames, and text information in the video frames (for example, store name information in shop signs, road name information in road signs, slogan information, etc.).
  • In other cases, the text information included in adjacent video frames may be identical; for example, adjacent video frames include the same subtitle information.
  • a preset identifier may be added to a video frame to represent a situation that the video frame does not include text information, or includes the same text information as that of an adjacent video frame.
  • the preset identifier may be any preset identifier, such as "Blank".
  • Step 204 based on the consistency between the audio text and the video text, the audio to be processed is used as a speech sample, and the video text is used as a label to obtain a speech recognition training set.
  • the execution subject can use the audio to be processed as a speech sample and the video text as a label to obtain a speech recognition training set.
  • As an example, when the audio text is consistent with the video text, the execution subject uses the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
  • the above execution subject may perform the above step 204 in the following manner:
  • First, taking one video frame text out of the at least one video frame text identified in each video frame of the video frame sequence as a unit, the text information included in each video frame in the video frame sequence is spliced to obtain multiple video frame sequence texts corresponding to the video frame sequence.
  • As an example, if the video frame sequence includes 3 video frames and the numbers of video frame texts corresponding to the 3 video frames are 3, 4 and 3 respectively, then there are 36 (3 × 4 × 3) video frame sequence texts corresponding to the video frame sequence.
  • In some optional implementations, the video frame text set corresponding to each video frame includes not only the video frame texts identified from the video frame, but also the preset identifier characterizing that the video frame includes the same text information as an adjacent video frame; the preset identifier can also represent the case where the video frame includes no text information.
  • Continuing the above example, after the preset identifier is added to each video frame's text set, the numbers of video frame texts corresponding to the 3 video frames become 4, 5 and 4 respectively, so there are 80 (4 × 5 × 4) video frame sequence texts corresponding to the video frame sequence.
  • As shown in Fig. 3, the preset identifier is "Blank".
  • the recognition results corresponding to video frames 301, 302, 303 are 304, 305, 306 in sequence.
  • Each video frame text in each video frame can be combined with video frame text in other video frames to obtain multiple video frame sequence texts.
  • the recognition results of video frames 301, 302, 303 may be combined as "what's the weather like today”.
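  • The exhaustive combination illustrated by Fig. 3 can be sketched with itertools.product; the candidate lists below are made up for illustration and include the "Blank" identifier:

```python
from itertools import product

# One candidate list per video frame: OCR results plus the "Blank" identifier.
frame_candidates = [
    ["今天", "Blank"],
    ["的天气", "今天的天气", "Blank"],
    ["怎么样", "Blank"],
]

def sequence_texts(candidates_per_frame):
    """Enumerate every video frame sequence text, skipping "Blank" entries
    when splicing."""
    return ["".join(t for t in combo if t != "Blank")
            for combo in product(*candidates_per_frame)]

print(len(sequence_texts(frame_candidates)))  # 2 * 3 * 2 = 12 candidates
```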
  • the target video frame sequence text is determined according to the edit distance between each video frame sequence text in the plurality of video frame sequence texts and the target audio segment text.
  • the target audio segment text is the audio segment text corresponding to the audio segment corresponding to the video frame sequence.
  • the edit distance refers to the minimum number of editing operations required to convert one string into another.
  • the execution subject may determine the video frame sequence text with the smallest edit distance between the multiple video frame sequence texts and the target audio segment text as the target video frame sequence text.
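  • A standard dynamic-programming edit distance is enough for this selection step; the sketch below picks the candidate closest to the target audio segment text:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete ca
                        dp[j - 1] + 1,      # insert cb
                        prev + (ca != cb))  # substitute ca -> cb
            prev = cur
    return dp[-1]

def pick_target_text(candidate_texts, audio_segment_text):
    """Return the video frame sequence text with the smallest edit distance
    to the target audio segment text."""
    return min(candidate_texts, key=lambda c: edit_distance(c, audio_segment_text))
```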
  • each of the multiple audio clips is used as a speech sample, and the target video frame sequence text corresponding to the audio clip is used as a label to obtain a speech recognition training set.
  • the above execution subject may perform the above first step in the following manner:
  • multiple texts to be spliced corresponding to the video frame are determined, and multiple texts to be spliced are spliced with at least one video frame text in the video frame to obtain multiple spliced texts.
  • Then, based on the edit distance between the multiple spliced texts and the target audio segment text, a preset number of spliced texts are selected from the multiple spliced texts and used as the multiple texts to be spliced corresponding to the next video frame of the video frame.
  • As an example, the execution subject may sort the spliced texts by edit distance from small to large, and select the first preset number of spliced texts as the multiple texts to be spliced corresponding to the next video frame of the video frame.
  • the preset number can be specifically set according to the actual situation, for example, it can be 10.
  • In the case where the number of obtained spliced texts is small (for example, smaller than the preset number), the execution subject can set a preset distance threshold and delete the spliced texts whose edit distance is smaller than the preset distance threshold.
  • It can be understood that, for the case where the number of spliced texts is large, the execution subject can also combine selecting a preset number of texts with deleting texts whose edit distance is smaller than the preset distance threshold, to determine the multiple texts to be spliced corresponding to the next video frame of the video frame.
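  • The frame-by-frame splicing with pruning described above behaves like a small beam search; a sketch under those assumptions, reusing the edit_distance helper from the previous example (the beam size of 10 is the example value given in the text):

```python
def splice_with_pruning(frame_candidates, target_text, edit_distance, beam_size=10):
    """Splice frame texts left to right, keeping after each frame only the
    beam_size partial splices closest (by edit distance) to the target
    audio segment text."""
    to_splice = [""]  # partial texts carried over to the next video frame
    for candidates in frame_candidates:
        spliced = [p + (c if c != "Blank" else "")
                   for p in to_splice for c in candidates]
        spliced.sort(key=lambda s: edit_distance(s, target_text))
        to_splice = spliced[:beam_size]
    return to_splice[0] if to_splice else ""
```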
  • As yet another example, the execution subject may determine the matching degree between each retained spliced text and the audio text through the following formula: Q_i = -| d(p_ki, S_k) - | ‖S_k‖ - ‖p_ki‖ | |, where d(·,·) denotes the edit distance between two texts, ‖·‖ denotes the length of a text, p_ki denotes the spliced text, S_k denotes the audio text, and Q_i denotes the matching degree between the two texts. To further reduce the number of spliced texts retained after each splicing step, a matching-degree threshold T_h can be set, and a spliced text is deleted when Q_i < T_h; for example, T_h = -3.
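  • Written as code, the matching degree and the threshold-based filtering (the T_h = -3 example value comes from the description) could look like this, again reusing an edit_distance function:

```python
def matching_degree(spliced_text: str, audio_text: str, edit_distance) -> float:
    """Q_i = -| d(p_ki, S_k) - | ||S_k|| - ||p_ki|| | |"""
    d = edit_distance(spliced_text, audio_text)
    return -abs(d - abs(len(audio_text) - len(spliced_text)))

def keep_spliced_text(spliced_text, audio_text, edit_distance, t_h: float = -3.0):
    """Keep a spliced text only if its matching degree reaches the threshold T_h."""
    return matching_degree(spliced_text, audio_text, edit_distance) >= t_h
```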
  • In some optional implementations, for each video frame sequence in the multiple video frame sequences, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than a preset distance threshold, the execution subject may also delete the training samples corresponding to the video frame sequence from the speech recognition training set, thereby filtering out low-quality training samples.
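  • The final quality filter can be expressed as a simple list comprehension; the triple layout of the training entries and the threshold value are assumptions made for illustration:

```python
def filter_training_set(entries, edit_distance, max_distance: int = 5):
    """Drop (audio_clip, label, asr_text) entries whose label (target video
    frame sequence text) is too far from the recognized audio segment text."""
    return [(clip, label) for clip, label, asr_text in entries
            if edit_distance(label, asr_text) <= max_distance]
```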
  • FIG. 4 is a schematic diagram 400 of an application scenario of the method for generating a speech recognition training set according to this embodiment.
  • the server 401 obtains the audio 402 to be processed and the video 403 to be processed.
  • the video to be processed 403 includes text information corresponding to the audio to be processed 402 .
  • the server 401 identifies the audio to be processed 402 to obtain an audio text 404 ; recognizes the text information in the video to be processed 403 to obtain a video text 405 .
  • the server 401 determines the consistency between the audio text 404 and the video text 405, takes the audio 402 to be processed as a speech sample, and takes the video text 405 as a label to obtain a speech recognition training set 406.
  • The method provided by the above-mentioned embodiments of the present disclosure acquires the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizes the audio to be processed to obtain the audio text; recognizes the text information in the video to be processed to obtain the video text; and, based on the consistency between the audio text and the video text, uses the audio to be processed as the speech sample and the video text as the label to obtain the speech recognition training set. This provides a method of automatically acquiring a speech recognition training set and improves the flexibility and efficiency of constructing such a training set.
  • the execution subject may train an untrained initial speech recognition model based on a speech recognition training set, or optimize a pre-trained speech recognition model.
  • Specifically, the execution subject adopts a machine learning algorithm, takes the audio to be processed in a training sample as input and the label corresponding to the input audio as the expected output, to train the untrained initial speech recognition model, or to optimize the pre-trained speech recognition model, and obtain the final speech recognition model.
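  • A minimal training-loop sketch over the generated (speech sample, label) pairs; the model, loss function and optimizer are placeholders assumed to follow a PyTorch-style interface, since the disclosure does not fix a particular machine learning framework:

```python
def train(model, optimizer, loss_fn, training_set, epochs: int = 10):
    """Fit a speech recognition model on the generated training set, using the
    audio as input and the target video frame sequence text as the label."""
    for _ in range(epochs):
        for speech_sample, label in training_set:
            optimizer.zero_grad()          # PyTorch-style optimizer assumed
            prediction = model(speech_sample)
            loss = loss_fn(prediction, label)
            loss.backward()
            optimizer.step()
    return model
```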
  • FIG. 5 a schematic flow 500 of an embodiment of a method for generating a speech recognition training set according to the present disclosure is shown, including the following steps:
  • Step 501 acquire audio and video to be processed.
  • the video to be processed includes text information corresponding to the audio to be processed
  • Step 502 delete the silent part in the audio to be processed to obtain a plurality of non-muted audio segments.
  • Step 503 identifying a plurality of audio segments to obtain a plurality of audio segment texts included in the audio text.
  • step 504 a plurality of video frame sequences corresponding to a plurality of audio segments are determined from the video to be processed.
  • Step 505 identifying the text information in each video frame in the plurality of video frame sequences to obtain the video frame text included in the video text.
  • Step 506 for each video frame sequence in the plurality of video frame sequences perform the following operations:
  • Step 5061: taking one video frame text out of the at least one video frame text identified in each video frame of the video frame sequence as a unit, splice the text information included in each video frame in the video frame sequence to obtain multiple video frame sequence texts corresponding to the video frame sequence.
  • Step 5062 Determine the target video frame sequence text according to the edit distance between each video frame sequence text in the plurality of video frame sequence texts and the target audio segment text, wherein the target audio segment text is corresponding to the video frame sequence The audio fragment text to which the audio fragment corresponds.
  • Step 507 Each of the multiple audio clips is used as a speech sample, and the target video frame sequence text corresponding to the audio clip is used as a label to obtain a speech recognition training set.
  • As can be seen from this embodiment, compared with the embodiment corresponding to FIG. 2, the flow 500 of the method for generating a speech recognition training set in this embodiment specifically illustrates the segmentation of the audio to be processed and the video to be processed, as well as the splicing of video frame texts, which improves the accuracy of the training samples in the speech recognition training set.
  • the present disclosure provides an embodiment of a device for generating a speech recognition training set, which corresponds to the method embodiment shown in FIG. 2 ,
  • the device can be specifically applied to various electronic devices.
  • As shown in FIG. 6, the apparatus for generating a speech recognition training set includes: an acquisition unit 601 configured to acquire the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; a first recognition unit 602 configured to recognize the audio to be processed to obtain the audio text; a second recognition unit 603 configured to recognize the text information in the video to be processed to obtain the video text; and an obtaining unit 604 configured to, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.
  • In some optional implementations, the first recognition unit 602 is further configured to: delete, according to a silence detection algorithm, the silent part in the audio to be processed to obtain multiple non-silent audio segments; and recognize the multiple audio segments to obtain the multiple audio segment texts included in the audio text.
  • In some optional implementations, the second recognition unit 603 is further configured to: determine, from the video to be processed, multiple video frame sequences in one-to-one correspondence with the multiple audio segments; and recognize the text information in each video frame in the multiple video frame sequences to obtain the video frame texts included in the video text.
  • In some optional implementations, the obtaining unit 604 is further configured to: for each video frame sequence in the multiple video frame sequences, perform the following operations: taking one video frame text out of the at least one video frame text identified in each video frame of the video frame sequence as a unit, splice the text information included in each video frame in the video frame sequence to obtain multiple video frame sequence texts corresponding to the video frame sequence; determine the target video frame sequence text according to the edit distance between each of the multiple video frame sequence texts and the target audio segment text, wherein the target audio segment text is the audio segment text corresponding to the audio segment that corresponds to the video frame sequence; and use each audio segment in the multiple audio segments as a speech sample and the target video frame sequence text corresponding to the audio segment as a label to obtain the speech recognition training set.
  • In some optional implementations, the obtaining unit 604 is further configured to: for each video frame including text information in the video frame sequence, perform the following operations: determine the multiple texts to be spliced corresponding to the video frame, and splice the multiple texts to be spliced with the at least one video frame text in the video frame to obtain multiple spliced texts; and based on the edit distance between the multiple spliced texts and the target audio segment text, select a preset number of spliced texts from the multiple spliced texts as the multiple texts to be spliced corresponding to the next video frame of the video frame.
  • In some optional implementations, the above apparatus further includes: a deletion unit (not shown in the figure) configured to, for each video frame sequence in the multiple video frame sequences, in response to determining that the edit distance between the target video frame sequence text corresponding to the video frame sequence and the target audio segment text is greater than a preset distance threshold, delete the training samples corresponding to the video frame sequence from the speech recognition training set.
  • In this embodiment, the acquisition unit in the apparatus for generating a speech recognition training set acquires the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; the first recognition unit recognizes the audio to be processed to obtain the audio text; the second recognition unit recognizes the text information in the video to be processed to obtain the video text; and the obtaining unit, based on the consistency between the audio text and the video text, uses the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set. This provides an apparatus for automatically acquiring a speech recognition training set, which improves the flexibility and efficiency of constructing such a training set.
  • FIG. 7 it shows a schematic structural diagram of a computer system 700 suitable for implementing the devices of the embodiments of the present disclosure (such as the devices 101 , 102 , 103 , and 105 shown in FIG. 1 ).
  • the device shown in FIG. 7 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 7, the computer system 700 includes a processor (e.g., a CPU, central processing unit) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703.
  • In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored.
  • the processor 701 , ROM 702 and RAM 703 are connected to each other via a bus 704 .
  • An input/output (I/O) interface 705 is also connected to the bus 704 .
  • the following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, etc.; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 708 including a hard disk, etc. and a communication section 709 including a network interface card such as a LAN card, a modem, or the like.
  • the communication section 709 performs communication processing via a network such as the Internet.
  • a drive 710 is also connected to the I/O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc. is mounted on the drive 710 as necessary so that a computer program read therefrom is installed into the storage section 708 as necessary.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program codes for executing the methods shown in the flowcharts.
  • the computer program may be downloaded and installed from a network via communication portion 709 and/or installed from removable media 711 .
  • the computer program is executed by the processor 701, the above-mentioned functions defined in the method of the present disclosure are performed.
  • the computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out the operations of the present disclosure can be written in one or more programming languages, or combinations thereof, including object-oriented programming languages—such as Java, Smalltalk, C++, and conventional procedural programming language—such as "C" or a similar programming language.
  • the program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the client computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as through an Internet service provider). Internet connection).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware.
  • the described units may also be set in a processor, for example, may be described as: a processor including an acquiring unit, a first identifying unit, a second identifying unit, and an obtaining unit.
  • the names of these units do not constitute a limitation on the unit itself in some cases.
  • For example, the obtaining unit can also be described as "a unit that, based on the consistency between the audio text and the video text, uses the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set".
  • the present disclosure also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device.
  • the above computer-readable medium carries one or more programs, and when the above one or more programs are executed by the device, the computer device is caused to: acquire the audio to be processed and the video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognize the audio to be processed to obtain the audio text; recognize the text information in the video to be processed to obtain the video text; and, based on the consistency between the audio text and the video text, use the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.

Abstract

The present disclosure discloses a method and apparatus for generating a speech recognition training set. A specific implementation of the method includes: acquiring audio to be processed and video to be processed, wherein the video to be processed includes text information corresponding to the audio to be processed; recognizing the audio to be processed to obtain an audio text; recognizing the text information in the video to be processed to obtain a video text; and, based on the consistency between the audio text and the video text, using the audio to be processed as a speech sample and the video text as a label to obtain the speech recognition training set.

Description

语音识别训练集的生成方法及装置
本专利申请要求于2021年05月08日提交的、申请号为202110514350.X、发明名称为“语音识别训练集的生成方法及装置”的中国专利申请的优先权,该申请的全文以引用的方式并入本申请中。
技术领域
本申请实施例涉及计算机技术领域,具体涉及一种语音识别训练集的生成方法及装置。
背景技术
近年来,随着深度学习技术的高速发展,采用基于深度神经网络的自动语音识别(Automatic Speech Recognition,ASR)模型进行语音识别,已经成为当前语音识别技术领域的主流趋势。为了提高语音识别模型的泛化性能,需要广泛、大量地收集语音数据,并通过人工标注构建的训练集来优化语音识别模型。
发明内容
本公开实施例提出了一种语音识别训练集的生成方法及装置。
在一种或多种实施例中,本公开提供了一种语音识别训练集的生成方法,包括:获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;识别待处理音频,得到音频文本;识别待处理视频中的文本信息,得到视频文本;基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
在一种或多种实施例中,本公开提供了一种语音识别训练集的生成装置,包括:获取单元,被配置成获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;第一识别单 元,被配置成识别待处理音频,得到音频文本;第二识别单元,被配置成识别待处理视频中的文本信息,得到视频文本;得到单元,被配置成基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
在一种或多种实施例中,本公开提供了一种计算机可读介质,其上存储有计算机程序,其中,程序被处理器执行时实现如第一方面任一实现方式描述的方法。
在一种或多种实施例中,本公开提供了一种电子设备,包括:一个或多个处理器;存储装置,其上存储有一个或多个程序,当一个或多个程序被一个或多个处理器执行,使得一个或多个处理器实现如第一方面任一实现方式描述的方法。
在一种或多种实施例中,本公开提供了一种包括计算机程序的计算机程序产品,该计算机程序在被处理器执行时能够实现如第一方面中任一实现方式描述的方法。
附图说明
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本公开的其它特征、目的和优点将会变得更明显:
图1是本公开的一个实施例可以应用于其中的示例性系统架构图;
图2是根据本公开语音识别训练集的生成方法的一个实施例的流程图;
图3是根据本实施例的文本拼接过程的示意图;
图4是根据本实施例的语音识别训练集的生成方法的应用场景的示意图;
图5是根据本公开的语音识别训练集的生成方法的又一个实施例的流程图;
图6是根据本公开的语音识别训练集的生成装置的一个实施例的结构图;
图7是适于用来实现本公开实施例的计算机系统的结构示意图。
具体实施方式
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释相关发明,而非对该发明的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与有关发明相关的部分。
需要说明的是,在不冲突的情况下,本公开中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本公开。
图1示出了可以应用本公开的语音识别训练集的生成方法及装置的示例性架构100。
如图1所示,系统架构100可以包括终端设备101、102、103,网络104和服务器105。终端设备101、102、103之间通信连接构成拓扑网络,网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
终端设备101、102、103可以是支持网络连接从而进行数据交互和数据处理的硬件设备或软件。当终端设备101、102、103为硬件时,其可以是支持网络连接,信息获取、交互、显示、处理等功能的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、膝上型便携计算机和台式计算机等等。当终端设备101、102、103为软件时,可以安装在上述所列举的电子设备中。其可以实现成例如用来提供分布式服务的多个软件或软件模块,也可以实现成单个软件或软件模块。在此不做具体限定。
服务器105可以是提供各种服务的服务器,例如获取用户通过终端设备101、102、103发送的相对应的待处理视频和待处理音频,进行信息处理,自动构建语音识别训练集的后台处理服务器。此外,服务器还可以基于语音识别训练集训练初始语音识别模型,或优化预训练的语音识别模型。作为示例,服务器105可以是云端服务器。
需要说明的是,服务器可以是硬件,也可以是软件。当服务器为硬件时,可以实现成多个服务器组成的分布式服务器集群,也可以实 现成单个服务器。当服务器为软件时,可以实现成多个软件或软件模块(例如用来提供分布式服务的软件或软件模块),也可以实现成单个软件或软件模块。在此不做具体限定。
还需要说明的是,本公开的实施例所提供的语音识别训练集的生成方法可以由服务器执行,也可以由终端设备执行,还可以由服务器和终端设备彼此配合执行。相应地,语音识别训练集的生成装置包括的各个部分(例如各个单元)可以全部设置于服务器中,也可以全部设置于终端设备中,还可以分别设置于服务器和终端设备中。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。当语音识别训练集的生成方法运行于其上的电子设备不需要与其他电子设备进行数据传输时,该系统架构可以仅包括语音识别训练集的生成方法运行于其上的电子设备(例如服务器或终端设备)。继续参考图2,示出了语音识别训练集的生成方法的一个实施例的流程200,包括以下步骤:
步骤201,获取待处理音频和待处理视频。
本实施例中,语音识别训练集的生成方法的执行主体(例如图1中的服务器)可以通过有有线网络连接方式或无线网络连接方式从远程,或从本地获取待处理音频和待处理视频。其中,待处理视频中包括对应于待处理音频的文本信息。
作为示例,包括相对应的待处理音频和待处理视频的数据可以是电影、电视剧、短视频等各种音视频数据。待处理视频中的文本信息是字幕信息,待处理音频是对应于字幕信息的语音信息。
本实施例中,待处理音频所表征的语音数据可以是各种类型的语音,包括但不限于外语音频、国语音频、方言音频。待处理音频和待处理视频可以是时长较长的数据,也可以是时长较短的数据。
步骤202,识别待处理音频,得到音频文本。
本实施例中,上述执行主体可以识别待处理音频,得到音频文本。
作为示例,上述执行主体可以基于自动语音识别模型,处理待处理音频,得到音频文本。自动语音识别模型用于表征待处理音频与文 本之间的对应关系。
在本实施例的一些可选的实现方式中,上述执行主体可以通过如下方式执行上述步骤202:
第一,根据静音检测算法,删除待处理音频中的静音部分,得到非静音的多个音频片段。
本实现方式中,上述执行主体可以待处理音频中的静音部分为分割点,将删除静音部分后的待处理音频进行分割,得到多个音频片段。
针对于所得到的音频片段较长的情况,上述执行主体可以设置时长阈值,将时长大于时长阈值的音频片段,以时长阈值所表征的时长为单位进行进一步的切割,并记录每个音频片的起止时间。
作为示例,为了防止背景音乐等因素导致静音检测算法无法完全截断音频而导致获得的音频片段较长,设定时长阈值T,将时长大于时长阈值T的音频片段强行切割为多个时长为T的片段。其中,时长阈值可以根据实际情况具体设置,例如,T=10s。
第二,识别多个音频片段,得到音频文本包括的多个音频片段文本。
本实现方式中,上述执行主体可以将多个音频片段中的每个音频片段输入自动语音识别模型,得到多个音频片段文本。其中,多个音频片段与多个音频片段文本一一对应,多个音频片段文本构成音频文本。
步骤203,识别待处理视频中的文本信息,得到视频文本。
本实施例中,上述执行主体可以识别待处理视频中的文本信息,得到视频文本。
作为示例,针对于组成待处理视频的每个视频帧,上述执行主体可以利用OCR(Optical Character Recognition,光学字符识别)技术,识别该视频帧中所包括的文本信息,并按照待处理视频中的视频帧的播放顺序,拼接每个视频帧对应的的文本信息,得到视频文本。其中,OCR技术是目前较为成熟的技术,在此不做赘述。
在本实施例的一些可选的实现方式中,上述执行主体可以通过如下方式执行上述步骤203:
第一,从待处理视频中确定出与多个音频片段一一对应的多个视频帧序列。
本实现方式中,对于多个音频片段中的每个音频片段,上述执行主体从待处理视频中抽取该音频片段对应的多个视频帧,得到视频帧序列。
作为示例，该音频片段的起止时间分别为 t_sk 和 t_ek，上述执行主体可以将该音频片段对应的视频帧序列的起止视频帧依次确定为 ⌈t_sk·f_P⌉ 和 ⌊t_ek·f_P⌋。其中，⌈·⌉ 和 ⌊·⌋ 分别表征向上取整和向下取整，f_P 表征待处理视频的帧率。上述执行主体可以预先设置采样率，并基于采样率从起始帧 ⌈t_sk·f_P⌉ 至终止帧 ⌊t_ek·f_P⌋ 之间抽取视频帧，得到该音频片段对应的视频帧序列。
第二,识别多个视频帧序列中的每个视频帧中的文本信息,得到视频文本包括的视频帧文本。
本实现方式中,上述执行主体可以利用OCR技术,识别多个视频帧序列中的每个视频帧中的文本信息,得到视频文本包括的视频帧文本。
可以理解,对于每个视频帧,上述执行主体可能并未识别出文本信息,也即该视频帧中并不包括文本信息;也可能识别出多处文本信息,得到多个视频帧文本。例如,多个视频帧文本包括视频帧中的字幕信息,以及视频帧画面中的文本信息(例如是店铺招牌中的店铺名称信息、道路指示牌中的道路名称信息、广告语信息等)。
在另一些情形中,相邻帧中包括的文本信息中存在相同的文本信息。例如,相邻的视频帧中包括的字幕信息相同。
本实施例中,可以为视频帧添加预设标识,以表征视频帧中不包括文本信息,或包括与相邻视频帧相同的文本信息的情形。其中,预设标识可以是预先设置的任意标识,例如是“Blank”。
步骤204,基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
本实施例中,上述执行主体可以基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训 练集。
作为示例,当音频文本与视频文本一致时,上述执行主体以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
在本实施例的一些可选的实现方式中,上述执行主体可以通过如下方式执行上述步骤204:
首先,对于多个视频帧序列中的每个视频帧序列,执行如下操作:
第一,以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本。
作为示例,该视频帧序列包括3个视频帧,3个视频帧对应的视频帧文本的数量依次为3个、4个、3个,则该视频帧序列对应的多个视频帧序列文本共有36(3×4×3)个。
在一些可选的实现方式中,每个视频帧对应的视频帧文本集合中,除了包括从该视频帧中识别到的视频帧文本,还包括表征该视频帧包括与相邻视频帧相同的文本信息的预设标识。预设标识还可以表征视频帧中不包括文本信息的情形。
继续参照上述示例,在每个视频帧对应的视频帧文本结合中添加预设标识后,3个视频帧对应的视频帧文本的数量依次为4个、5个、4个,则该视频帧序列对应的多个视频帧序列文本共有80(4×5×4)个。
如图3所示,预设标识为“Blank”。对应于视频帧301、302、303的识别结果依次为304、305、306。对于每一个视频帧中的每一个视频帧文本均可以与其他的视频帧中视频帧文本组合,得到多个视频帧序列文本。例如,可以将视频帧301、302、303的识别结果组合为“今天的天气怎么样”。
第二,根据多个视频帧序列文本中的每个视频帧序列文本与目标音频片段文本之间的编辑距离,确定目标视频帧序列文本。
其中,目标音频片段文本为对应于该视频帧序列的音频片段所对应的音频片段文本。编辑距离是指两个字符串之间,由一个转成另一 个所需要的最少编辑操作次数。
作为示例,上述执行主体可以将多个视频帧序列文本中与目标音频片段文本之间的编辑距离最小的视频帧序列文本,确定为目标视频帧序列文本。
然后,以多个音频片段中的每个音频片段为语音样本,以该音频片段对应的目标视频帧序列文本为标签,得到语音识别训练集。
在本实施例的一些可选的实现方式中,上述执行主体可以通过如下方式执行上述第一步骤:
对于该视频帧序列中包括文本信息的每个视频帧,执行如下操作:
首先,确定该视频帧对应的多个待拼接文本,并将多个待拼接文本与该视频帧中的至少一个视频帧文本进行拼接,得到多个拼接后文本。
然后,基于多个拼接后文本与目标音频片段文本之间的编辑距离,从多个拼接后文本选取出预设数量个拼接后文本,作为该视频帧的下一视频帧所对应的多个待拼接文本。
作为示例,上述执行主体可以将编辑距离由小到大进行排序,选取前预设数量个拼接后文本,作为该视频帧的下一视频帧所对应的多个待拼接文本。其中,预设数量可以根据实际情况进行具体设置,例如,可以是10。
在所得到的拼接后文本的数量较小(例如,拼接后文本的数量小于预设数量)的情况下,上述执行主体可以设置预设距离阈值,将编辑距离小于预设距离阈值的拼接后文本删除。
可以理解,上述执行主体针对于拼接后文本的数量较大的情况,也可以结合选取预设数量个文本、删除编辑距离小于预设距离阈值的文本的方式,确定该视频帧的下一视频帧所对应的多个待拼接文本。
作为又一示例,上述执行主体可以通过如下公式确定所保留的多个拼接后文本与音频文本之间的匹配度:
Q_i = -| d(p_ki, S_k) - | ‖S_k‖ - ‖p_ki‖ | |
其中，d(·,·)表示两个文本的编辑距离计算函数，‖·‖表示文本的长度，p_ki表示拼接后文本，S_k表示音频文本，Q_i表示两个文本之间的匹配度。为了进一步减少每次拼接后得到的拼接后文本的数量，设计匹配度阈值T_h，当Q_i<T_h时，则删除该拼接后文本。例如T_h=-3。
在本实施例的一些可选的实现方式中,上述执行主体还可以对于多个视频帧序列中的每个视频帧序列,响应于确定该视频帧序列对应的目标视频帧序列文本与目标音频片段文本之间的编辑距离大于预设距离阈值,删除语音识别训练集中对应于该视频帧序列的训练样本,从而过滤掉低质量的训练样本。
继续参见图4,图4是根据本实施例的语音识别训练集的生成方法的应用场景的一个示意图400。在图4的应用场景中,首先,服务器401获取待处理音频402和待处理视频403。其中,待处理视频403中包括对应于待处理音频402的文本信息。然后,服务器401识别待处理音频402,得到音频文本404;识别待处理视频403中的文本信息,得到视频文本405。最后,服务器401基确定音频文本404与视频文本405的一致性,以待处理音频402为语音样本,以视频文本405为标签,得到语音识别训练集406。
本公开的上述实施例提供的方法,通过获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;识别待处理音频,得到音频文本;识别待处理视频中的文本信息,得到视频文本;基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集,从而提供了一种语音识别训练集的自动获取方法,提高了构建语音识别训练集的灵活性和效率。
在本实施例的一些可选的实现方式中,上述执行主体可以基于语音识别训练集,训练未经训练的初始语音识别模型,或优化预训练的语音识别模型。
具体的,上述执行主体采用机器学习算法,以训练样本中的待处理音频为输入,以所输入的待处理音频为期望输出,训练未经训练的初始语音识别模型,或优化预训练的语音识别模型,得到最终的语音识别模型。
继续参考图5,示出了根据本公开的语音识别训练集的生成方法的一个实施例的示意性流程500,包括以下步骤:
步骤501,获取待处理音频和待处理视频。
其中,待处理视频中包括对应于待处理音频的文本信息;
步骤502,根据静音检测算法,删除待处理音频中的静音部分,得到非静音的多个音频片段。
步骤503,识别多个音频片段,得到音频文本包括的多个音频片段文本。
步骤504,从待处理视频中确定出与多个音频片段一一对应的多个视频帧序列。
步骤505,识别多个视频帧序列中的每个视频帧中的文本信息,得到视频文本包括的视频帧文本。
步骤506,对于多个视频帧序列中的每个视频帧序列,执行如下操作:
步骤5061,以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本。
步骤5062,根据多个视频帧序列文本中的每个视频帧序列文本与目标音频片段文本之间的编辑距离,确定目标视频帧序列文本,其中,目标音频片段文本为对应于该视频帧序列的音频片段所对应的音频片段文本。
步骤507。以多个音频片段中的每个音频片段为语音样本,以该音频片段对应的目标视频帧序列文本为标签,得到语音识别训练集。
从本实施例中可以看出,与图2对应的实施例相比,本实施例中的语音识别训练集的生成方法的流程400具体说明了待处理音频和待处理视频的分割过程,以及视频帧文本的拼接过程,提高了语音识别训练集中的训练样本的准确性。
继续参考图6,作为对上述各图所示方法的实现,本公开提供了一种语音识别训练集的生成装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图6所示,语音识别训练集的生成装置包括:包括:获取单元601,被配置成获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;第一识别单元602,被配置成识别待处理音频,得到音频文本;第二识别单元603,被配置成识别待处理视频中的文本信息,得到视频文本;得到单元604,被配置成基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
在本实施例的一些可选的实现方式中,第一识别单元602,进一步被配置成:根据静音检测算法,删除待处理音频中的静音部分,得到非静音的多个音频片段;识别多个音频片段,得到音频文本包括的多个音频片段文本。
在本实施例的一些可选的实现方式中,第二识别单元603,进一步被配置成:从待处理视频中确定出与多个音频片段一一对应的多个视频帧序列;识别多个视频帧序列中的每个视频帧中的文本信息,得到视频文本包括的视频帧文本。
在本实施例的一些可选的实现方式中,得到单元604,进一步被配置成:对于多个视频帧序列中的每个视频帧序列,执行如下操作:以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本;根据多个视频帧序列文本中的每个视频帧序列文本与目标音频片段文本之间的编辑距离,确定目标视频帧序列文本,其中,目标音频片段文本为对应于该视频帧序列的音频片段所对应的音频片段文本;以多个音频片段中的每个音频片段为语音样本,以该音频片段对应的目标视频帧序列文本为标签,得到语音识别训练集。
在本实施例的一些可选的实现方式中,得到单元604,进一步被配置成:对于该视频帧序列中包括文本信息的每个视频帧,执行如下 操作:确定该视频帧对应的多个待拼接文本,并将多个待拼接文本与该视频帧中的至少一个视频帧文本进行拼接,得到多个拼接后文本;基于多个拼接后文本与目标音频片段文本之间的编辑距离,从多个拼接后文本选取出预设数量个拼接后文本,作为该视频帧的下一视频帧所对应的多个待拼接文本。
在本实施例的一些可选的实现方式中,上述装置还包括:删除单元(图中未示出),被配置成对于多个视频帧序列中的每个视频帧序列,响应于确定该视频帧序列对应的目标视频帧序列文本与目标音频片段文本之间的编辑距离大于预设距离阈值,删除语音识别训练集中对应于该视频帧序列的训练样本。
本实施例中,语音识别训练集的生成装置中的获取单元获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;第一识别单元识别待处理音频,得到音频文本;第二识别单元识别待处理视频中的文本信息,得到视频文本;得到单元基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集,从而提供了一种语音识别训练集的自动获取装置,提高了构建语音识别训练集的灵活性和效率。
下面参考图7,其示出了适于用来实现本公开实施例的设备(例如图1所示的设备101、102、103、105)的计算机系统700的结构示意图。图7示出的设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图7所示,计算机系统700包括处理器(例如CPU,中央处理器)701,其可以根据存储在只读存储器(ROM)702中的程序或者从存储部分708加载到随机访问存储器(RAM)703中的程序而执行各种适当的动作和处理。在RAM703中,还存储有系统700操作所需的各种程序和数据。处理器701、ROM702以及RAM703通过总线704彼此相连。输入/输出(I/O)接口705也连接至总线704。
以下部件连接至I/O接口705:包括键盘、鼠标等的输入部分706;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的 输出部分707;包括硬盘等的存储部分708;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分709。通信部分709经由诸如因特网的网络执行通信处理。驱动器710也根据需要连接至I/O接口705。可拆卸介质711,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器710上,以便于从其上读出的计算机程序根据需要被安装入存储部分708。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分709从网络上被下载和安装,和/或从可拆卸介质711被安装。在该计算机程序被处理器701执行时,执行本公开的方法中限定的上述功能。
需要说明的是,本公开的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含 的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,程序设计语言包括面向目标的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如”C”语言或类似的程序设计语言。程序代码可以完全地在客户计算机上执行、部分地在客户计算机上执行、作为一个独立的软件包执行、部分在客户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到客户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的装置、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。所描述的单元也可以设置在处理器中,例如,可以描述为:一种处理器,包括获取单元、第一识别单元、第二识别单元和得到单元。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定,例如,得到单元还可以被描述为“基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集的单元”。
作为另一方面,本公开还提供了一种计算机可读介质,该计算机可读介质可以是上述实施例中描述的设备中所包含的;也可以是单独存在,而未装配入该设备中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该装置执行时,使得该计算机设备:获取待处理音频和待处理视频,其中,待处理视频中包括对应于待处理音频的文本信息;识别待处理音频,得到音频文本;识别待处理视频中的文本信息,得到视频文本;基于音频文本与视频文本的一致性,以待处理音频为语音样本,以视频文本为标签,得到语音识别训练集。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的发明范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述发明构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。

Claims (15)

  1. 一种语音识别训练集的生成方法,包括:
    获取待处理音频和待处理视频,其中,所述待处理视频中包括对应于所述待处理音频的文本信息;
    识别所述待处理音频,得到音频文本;
    识别所述待处理视频中的文本信息,得到视频文本;以及
    基于所述音频文本与所述视频文本的一致性,以所述待处理音频为语音样本,以所述视频文本为标签,得到所述语音识别训练集。
  2. 根据权利要求1所述的方法,其中,所述识别所述待处理音频,得到音频文本,包括:
    根据静音检测算法,删除所述待处理音频中的静音部分,得到非静音的多个音频片段;以及
    识别所述多个音频片段,得到所述音频文本包括的多个音频片段文本。
  3. 根据权利要求2所述的方法,其中,所述识别所述待处理视频中的文本信息,得到视频文本,包括:
    从所述待处理视频中确定出与所述多个音频片段一一对应的多个视频帧序列;以及
    识别所述多个视频帧序列中的每个视频帧中的文本信息,得到所述视频文本包括的视频帧文本。
  4. 根据权利要求3所述的方法,其中,所述基于所述音频文本与所述视频文本的一致性,以所述待处理音频为语音样本,以所述视频文本为标签,得到所述语音识别训练集,包括:
    对于所述多个视频帧序列中的每个视频帧序列,执行如下操作:以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本 信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本;根据所述多个视频帧序列文本中的每个视频帧序列文本与目标音频片段文本之间的编辑距离,确定目标视频帧序列文本,其中,所述目标音频片段文本为对应于该视频帧序列的音频片段所对应的音频片段文本;以及
    以所述多个音频片段中的每个音频片段为语音样本,以该音频片段对应的目标视频帧序列文本为标签,得到所述语音识别训练集。
  5. 根据权利要求4所述的方法,其中,所述以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本,包括:
    对于该视频帧序列中包括文本信息的每个视频帧,执行如下操作:
    确定该视频帧对应的多个待拼接文本,并将所述多个待拼接文本与该视频帧中的至少一个视频帧文本进行拼接,得到多个拼接后文本;以及
    基于所述多个拼接后文本与所述目标音频片段文本之间的编辑距离,从所述多个拼接后文本选取出预设数量个拼接后文本,作为该视频帧的下一视频帧所对应的多个待拼接文本。
  6. 根据权利要求4所述的方法,其中,还包括:
    对于所述多个视频帧序列中的每个视频帧序列,响应于确定该视频帧序列对应的目标视频帧序列文本与目标音频片段文本之间的编辑距离大于预设距离阈值,删除所述语音识别训练集中对应于该视频帧序列的训练样本。
  7. 一种语音识别训练集的生成装置,包括:
    获取单元,被配置成获取待处理音频和待处理视频,其中,所述待处理视频中包括对应于所述待处理音频的文本信息;
    第一识别单元,被配置成识别所述待处理音频,得到音频文本;
    第二识别单元,被配置成识别所述待处理视频中的文本信息,得到视频文本;
    得到单元,被配置成基于所述音频文本与所述视频文本的一致性,以所述待处理音频为语音样本,以所述视频文本为标签,得到所述语音识别训练集。
  8. 根据权利要求7所述的装置,其中,所述第一识别单元,进一步被配置成:
    根据静音检测算法,删除所述待处理音频中的静音部分,得到非静音的多个音频片段;识别所述多个音频片段,得到所述音频文本包括的多个音频片段文本。
  9. 根据权利要求8所述的装置,其中,所述第二识别单元,进一步被配置成:
    从所述待处理视频中确定出与所述多个音频片段一一对应的多个视频帧序列;识别所述多个视频帧序列中的每个视频帧中的文本信息,得到所述视频文本包括的视频帧文本。
  10. 根据权利要求9所述的装置,其中,所述得到单元,进一步被配置成:
    对于所述多个视频帧序列中的每个视频帧序列,执行如下操作:以该视频帧序列中的视频帧中被识别出的至少一个视频帧文本中的一个视频帧文本为单位,对该视频帧序列中的每个视频帧所包括的文本信息进行拼接,得到该视频帧序列对应的多个视频帧序列文本;根据所述多个视频帧序列文本中的每个视频帧序列文本与目标音频片段文本之间的编辑距离,确定目标视频帧序列文本,其中,所述目标音频片段文本为对应于该视频帧序列的音频片段所对应的音频片段文本;
    以所述多个音频片段中的每个音频片段为语音样本,以该音频片段对应的目标视频帧序列文本为标签,得到所述语音识别训练集。
  11. 根据权利要求10所述的装置,其中,所述得到单元,进一步被配置成:
    对于该视频帧序列中包括文本信息的每个视频帧,执行如下操作:确定该视频帧对应的多个待拼接文本,并将所述多个待拼接文本与该视频帧中的至少一个视频帧文本进行拼接,得到多个拼接后文本;基于所述多个拼接后文本与所述目标音频片段文本之间的编辑距离,从所述多个拼接后文本选取出预设数量个拼接后文本,作为该视频帧的下一视频帧所对应的多个待拼接文本。
  12. 根据权利要求10所述的装置,其中,还包括:
    删除单元,被配置成对于所述多个视频帧序列中的每个视频帧序列,响应于确定该视频帧序列对应的目标视频帧序列文本与目标音频片段文本之间的编辑距离大于预设距离阈值,删除所述语音识别训练集中对应于该视频帧序列的训练样本。
  13. 一种计算机可读介质,其上存储有计算机程序,其中,所述程序被处理器执行时实现如权利要求1-6中任一所述的方法。
  14. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,其上存储有一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-6中任一所述的方法。
  15. 一种计算机程序产品,包括计算机程序,所述计算机程序在被处理器执行时实现根据权利要求1-6中任一项所述的方法。
PCT/CN2022/087029 2021-05-08 2022-04-15 语音识别训练集的生成方法及装置 WO2022237448A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110514350.XA CN115312032A (zh) 2021-05-08 2021-05-08 语音识别训练集的生成方法及装置
CN202110514350.X 2021-05-08

Publications (1)

Publication Number Publication Date
WO2022237448A1 true WO2022237448A1 (zh) 2022-11-17

Family

ID=83853589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087029 WO2022237448A1 (zh) 2021-05-08 2022-04-15 语音识别训练集的生成方法及装置

Country Status (2)

Country Link
CN (1) CN115312032A (zh)
WO (1) WO2022237448A1 (zh)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106604125A (zh) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 一种视频字幕的确定方法及装置
CN108924626A (zh) * 2018-08-17 2018-11-30 腾讯科技(深圳)有限公司 图片生成方法、装置、设备及存储介质
CN109257659A (zh) * 2018-11-16 2019-01-22 北京微播视界科技有限公司 字幕添加方法、装置、电子设备及计算机可读存储介质
CN109803180A (zh) * 2019-03-08 2019-05-24 腾讯科技(深圳)有限公司 视频预览图生成方法、装置、计算机设备及存储介质
CN111639233A (zh) * 2020-05-06 2020-09-08 广东小天才科技有限公司 学习视频字幕添加方法、装置、终端设备和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631447A (zh) * 2023-07-24 2023-08-22 科大讯飞股份有限公司 噪声提取方法、装置、设备及可读存储介质
CN116631447B (zh) * 2023-07-24 2023-12-01 科大讯飞股份有限公司 噪声提取方法、装置、设备及可读存储介质

Also Published As

Publication number Publication date
CN115312032A (zh) 2022-11-08

Similar Documents

Publication Publication Date Title
CN110349564B (zh) 一种跨语言语音识别方法和装置
CN108052577B (zh) 一种通用文本内容挖掘方法、装置、服务器及存储介质
CN111488489B (zh) 视频文件的分类方法、装置、介质及电子设备
CN109754783B (zh) 用于确定音频语句的边界的方法和装置
JP6681450B2 (ja) 情報処理方法および装置
WO2020052069A1 (zh) 用于分词的方法和装置
CN113486833B (zh) 多模态特征提取模型训练方法、装置、电子设备
US20180157657A1 (en) Method, apparatus, client terminal, and server for associating videos with e-books
US20190156832A1 (en) Diarization Driven by the ASR Based Segmentation
JP7394809B2 (ja) ビデオを処理するための方法、装置、電子機器、媒体及びコンピュータプログラム
WO2023083142A1 (zh) 分句方法、装置、存储介质及电子设备
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
WO2022037419A1 (zh) 音频内容识别方法、装置、设备和计算机可读介质
CN110138654B (zh) 用于处理语音的方法和装置
US20220385996A1 (en) Method for generating target video, apparatus, server, and medium
CN108877779B (zh) 用于检测语音尾点的方法和装置
WO2023071578A1 (zh) 一种文本对齐语音的方法、装置、设备及介质
CN111160004A (zh) 一种断句模型的建立方法及装置
WO2022237448A1 (zh) 语音识别训练集的生成方法及装置
CN111241209A (zh) 用于生成信息的方法和装置
CN114550702A (zh) 一种语音识别方法和装置
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN112954453B (zh) 视频配音方法和装置、存储介质和电子设备
JP2024506495A (ja) 議事録の処理方法、装置、機器及び媒体
CN112633004A (zh) 文本标点符号删除方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22806418

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18559398

Country of ref document: US

Ref document number: 2023568632

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE