WO2020024690A1 - 语音标注方法、装置及设备 (Speech annotation method, apparatus, and device) - Google Patents

语音标注方法、装置及设备 (Speech annotation method, apparatus, and device)

Info

Publication number
WO2020024690A1
WO2020024690A1 (PCT/CN2019/089176; CN2019089176W)
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
pinyin
speech
original
data
Prior art date
Application number
PCT/CN2019/089176
Other languages
English (en)
French (fr)
Inventor
官砚楚
杨磊
陈力
韩喆
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020024690A1 publication Critical patent/WO2020024690A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/26 Speech to text systems

Definitions

  • This specification relates to the field of data processing, and in particular to a speech annotation method, apparatus, and device.
  • This specification provides a speech annotation method, apparatus, and device.
  • A speech annotation method includes:
  • acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
  • comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result.
  • The step of segmenting the speech data into sentences to obtain at least one piece of speech sentence data includes:
  • determining the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame and a preset energy threshold, and segmenting the speech data accordingly to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
  • Comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin (pinyin with tones).
  • Comparing the recognition sentence information obtained by performing speech recognition on the speech sentence data with the original sentence information in the original text information includes:
  • converting the ordered original sentence information and the ordered recognition sentence information into tonal pinyin to obtain an original pinyin sequence, which includes the original pinyin sentences, and a recognized pinyin sequence, which includes the recognized pinyin sentences;
  • comparing the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain the comparison result.
  • Forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result includes:
  • selecting, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number, and verifying the corresponding original sentence information against the corresponding recognition sentence information;
  • using the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  • The preset selection conditions include one of the following:
  • if the maximum similarity in the comparison result is greater than a preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
  • if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  • A speech annotation apparatus includes:
  • an information acquisition module configured to acquire original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • a data segmentation module configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
  • a speech-pair forming module configured to compare the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and to form a text-speech pair from the original sentence information and the speech sentence data according to the comparison result.
  • The data segmentation module is specifically configured to:
  • determine the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame and a preset energy threshold, and segment the speech data accordingly to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
  • Comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
  • The speech-pair forming module is specifically configured to:
  • convert the ordered original sentence information and the ordered recognition sentence information into tonal pinyin to obtain an original pinyin sequence, which includes the original pinyin sentences, and a recognized pinyin sequence, which includes the recognized pinyin sentences;
  • compare the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain the comparison result.
  • The speech-pair forming module is further specifically configured to:
  • select, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number, and verify the corresponding original sentence information against the corresponding recognition sentence information;
  • use the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  • The preset selection conditions include one of the following:
  • if the maximum similarity in the comparison result is greater than a preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
  • if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  • A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
  • acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
  • comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result.
  • By acquiring original text information and the speech data corresponding to it, segmenting the speech data into sentences to obtain multiple pieces of speech sentence data containing speech, performing speech recognition on the speech sentence data, comparing the resulting recognition sentence information with the original sentence information in the original text information for similarity, and then forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result, automatic annotation is implemented and the efficiency of obtaining text-speech pairs is improved.
  • Fig. 1 is a flowchart of a speech annotation method according to an exemplary embodiment of this specification.
  • Fig. 2 is a flowchart of another speech annotation method according to an exemplary embodiment of this specification.
  • Fig. 3 is a hardware structure diagram of a computer device in which a speech annotation apparatus according to an exemplary embodiment of this specification is located.
  • Fig. 4 is a block diagram of a speech annotation apparatus according to an exemplary embodiment of this specification.
  • Although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "at the time of", or "in response to determining".
  • Speech recognition converts speech into text; acoustic models and language models are involved in the recognition process.
  • Speech synthesis maps text to audio; an acoustic model is also involved in the synthesis process.
  • For example, an end-to-end speech recognition model may connect the input end (speech waveform or feature sequence) to the output end (word or character sequence) with a neural network, moving traditional modules such as the acoustic model, pronunciation dictionary, and language model into processing inside the neural network.
  • Likewise, an end-to-end speech synthesis model may map from the input end (word or character sequence) to the output end (speech waveform or feature sequence).
  • Building an acoustic model depends on a large amount of speech data and the correct text information corresponding to it, so as to obtain the statistical relationship between speech and text; the speech data and its corresponding correct text information are used to train the model and obtain the acoustic model.
  • The process of determining speech data and the correct text information corresponding to it may be called speech annotation; the correct text information may serve as the annotation result of the speech data and may also be called annotation data.
  • In the embodiments of this specification, original text information and the speech data corresponding to it are acquired, the speech data is segmented into sentences to obtain multiple pieces of speech sentence data containing speech, speech recognition is performed on the speech sentence data, the resulting recognition sentence information is compared for similarity with the original sentence information in the original text information, and text-speech pairs are then formed from the original sentence information and the speech sentence data according to the comparison result, achieving automatic annotation and improving the efficiency of obtaining text-speech pairs.
  • Fig. 1 is a flowchart of a speech annotation method according to an exemplary embodiment of this specification.
  • The method includes:
  • Step 102: original text information and speech data are acquired, the speech data including recording data obtained by reading the original text information aloud;
  • Step 104: the speech data is segmented into sentences to obtain at least one piece of speech sentence data;
  • Step 106: the recognition sentence information obtained by performing speech recognition on the speech sentence data is compared for similarity with the original sentence information in the original text information, and text-speech pairs are formed from the original sentence information and the speech sentence data according to the comparison result.
  • The original text information may include the read-aloud text (also called the recording text) used for voice recording; the speech data may include the recording data obtained by reading the original text information aloud.
  • For example, a professional may be invited to read the original text information aloud to obtain speech data corresponding to it; as another example, news text and news audio can be obtained from a news platform.
  • One or more sentences may exist in the original text information, and correspondingly the speech data also contains one or more spoken sentences.
  • A sentence in the original text information can be called original sentence information; the pieces of original sentence information can form an original text sequence; converting original sentence information into pinyin yields an original pinyin sentence; and the original pinyin sentences can form an original pinyin sequence.
  • Correspondingly, segmenting the speech data yields speech sentence data; performing speech recognition on the speech sentence data yields recognition sentence information; the pieces of recognition sentence information can form a recognized text sequence; converting recognition sentence information into pinyin yields recognized pinyin sentences; and the recognized pinyin sentences can form a recognized pinyin sequence.
  • Sentence segmentation of the speech data is performed to obtain speech sentence data with continuous speech; speech sentence data can also be called speech segment data or speech fragment data.
  • Different punctuation marks indicate pauses of different lengths: sentence-final punctuation indicates a longer pause than mid-sentence punctuation, and whether a pause occurs can be judged from the presence or absence of speech, so sentence segmentation of the speech data can be achieved.
  • Sentence segmentation can be performed on the speech data based on voice endpoint detection.
  • Voice endpoint detection (Voice Activity Detection, VAD), also known as voice activity detection or voice boundary detection, can separate speech signals from non-speech signals in the original speech data and locate the start and end points of a speech signal.
  • The start and end points can be called endpoints. The embodiments of this specification can therefore detect endpoints by voice endpoint detection and judge, from the interval between endpoints, whether two stretches of sound belong to the same sentence, thereby segmenting the speech data into sentences. This restricts processing to speech signals, ignoring non-speech signals, and also yields speech sentence data that actually contains speech. Further, sentence segmentation can also take pause duration into account.
  • Since energy differs greatly between silence and speaking, an energy threshold (short-term energy) can be used to detect and segment the speech.
  • Segmenting the speech data through voice endpoint detection to obtain at least one piece of speech sentence data containing speech may include:
  • determining the start and end positions of continuous speech in the speech data according to the relationship between each frame's short-term energy and a preset energy threshold, and segmenting the speech data accordingly to obtain at least one piece of speech sentence data.
  • The speech data can be framed and windowed, and the short-term energy of each frame computed.
  • The speech data is divided into multiple short-term energy frames, and the energy of a short-term energy frame can be determined from the amplitudes of the audio signal at the sampling points within the frame.
  • Specifically, the energy of each sampling point can be determined from the amplitude of the audio signal at that point; the energies are then summed, and the resulting sum is taken as the energy of the short-term energy frame.
  • By comparing the frame energies with the preset energy threshold, the start and end positions of continuous speech in the speech data can be determined; based on the interval between the end point of the previous speech segment and the start point of the current one, it is judged whether adjacent speech segments belong to the same sentence, yielding speech sentence data that contains a speaking voice.
  • The speech data may contain speech unrelated to the original text information, for example, interjections or asides uttered by the reader (such as an announcer) while reading aloud; such speech is usually short.
  • To improve annotation accuracy, the speech sentence data may be limited to speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
  • For example, after initial speech sentence data is obtained, the initial speech sentence data whose frame count is greater than or equal to the preset frame-count threshold may be retained as the speech sentence data.
  • The frame count may be the number of consecutive speech frames.
  • A frame can span 20-50 ms, and one character takes roughly 200-300 ms.
  • The preset frame-count threshold can be derived from the desired character length. For example, to exclude fragments of no more than 5 Chinese characters from the speech sentence data, the number of audio frames corresponding to 5 Chinese characters can be taken as the preset frame-count threshold (at 25 ms per frame and 200 ms per character, 5 characters correspond to about 5 × 200 / 25 = 40 frames).
  • Excluding speech sentence data whose frame count is below the preset frame-count threshold filters out speech in the speech data that is not part of the original text information, yields speech sentence data more strongly correlated with the original text information, and thereby improves the accuracy of the speech annotation.
  • The speech sentence data can then be subjected to speech recognition to obtain the recognition sentence information.
  • Any speech recognition means from the related art can be used; no limitation is imposed here.
  • A sentence-splitting method can be used to segment the original text information and obtain the original sentence information.
  • To distinguish the two kinds of text, the text obtained by recognizing the speech sentence data is called recognition sentence information, and the text obtained by segmenting the original text information is called original sentence information.
  • Sentence segmentation can be performed by checking whether a character is a designated character.
  • The designated character may be a period, an exclamation mark, a question mark, an ellipsis, a line break, or the like.
  • Since pauses in the speech data are usually caused by sentence boundaries, the speech data is segmented and recognized to obtain the recognition sentence information, and the original text information is segmented to obtain the original sentence information.
  • The similarity of the recognition sentence information and the original sentence information is then compared; comparing fragment with fragment improves comparison efficiency. In particular, comparing local text information, using the position of the speech sentence data within the speech data and the position of the original sentence information within the original text information, improves comparison efficiency further.
  • Comparing the similarity of the recognition sentence information with the original sentence information may include comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information.
  • The pinyin may be tonal pinyin; pinyin with tones improves the accuracy of the text similarity comparison.
  • Comparing the similarity of the recognition sentence information with the original sentence information may include comparing the recognized pinyin sequence converted from the recognition sentence information with the original pinyin sequence converted from the original sentence information.
  • The original pinyin sentence at the i-th sequence number in the original pinyin sequence may be compared with the recognized pinyin sentences at sequence numbers (i-k) through (i+k) in the recognized pinyin sequence, achieving a local comparison.
  • Comparing the recognition sentence information obtained by performing speech recognition on the speech sentence data with the original sentence information in the original text information may include:
  • obtaining the original pinyin sequence, which includes the original pinyin sentences, and the recognized pinyin sequence, which includes the recognized pinyin sentences;
  • comparing the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain the comparison result.
  • The original text information is divided into sentences by punctuation to obtain the original sentence information, and the original sentence information is ordered by its order in the original text information to obtain the original text sequence.
  • Converting the original sentence information in the original text sequence into tonal pinyin yields the original pinyin sequence, which consists of the original pinyin sentences.
  • Converting the recognition sentence information in the recognized text sequence into tonal pinyin yields the recognized pinyin sequence, which consists of the recognized pinyin sentences.
  • The pinyin tones can be represented by four different digits, for example 1 through 4; to distinguish the two groups of pinyin sequences, they are named the original pinyin sequence and the recognized pinyin sequence.
  • The recognized pinyin sentences at sequence numbers (i-k) through (i+k) can be taken as the comparison objects for the original pinyin sentence at sequence number i, achieving fast pinyin-based local text similarity matching and improving the accuracy of the text-speech pairs.
  • The similarity can be determined from the ratio of the number of matching pinyin in the two pinyin sentences (the original pinyin sentence and the recognized pinyin sentence) to the total number of pinyin in the two sentences.
  • Since the similarity between an original pinyin sentence and a recognized pinyin sentence can serve as the similarity between the original sentence information and the speech sentence data, the speech sentence data that satisfies the conditions can be selected according to the comparison result and combined with the original sentence information into a text-speech pair.
  • The preset selection condition may be any condition for selecting suitable speech sentence data.
  • The preset selection condition may be: if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, select the recognized pinyin sentence corresponding to the maximum similarity.
  • In that case, the speech sentence data corresponding to the similarity that is greater than the preset similarity threshold and is the maximum in the comparison result is selected to form a text-speech pair with the original sentence information.
  • The preset selection condition may also be: among the recognized pinyin sentences whose similarity is greater than the preset similarity threshold, select the one with the largest sequence number.
  • The speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number is then selected to form a text-speech pair with the original sentence information.
  • The preset selection condition may also be: if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, select the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity.
  • The recognized pinyin sentence with the largest sequence number is thus selected from the recognized pinyin sentences corresponding to the maximum similarity.
  • Alternatively, if both the maximum similarity and the second-largest similarity are greater than the preset similarity threshold, the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities is selected.
  • In that case, the recognized pinyin sentence with the largest sequence number is selected from the recognized pinyin sentences corresponding to the maximum and second-largest similarities.
  • The maximum and second-largest similarities are the top two similarities, that is, the comparison result contains only one maximum and one second-largest similarity. If the top two similarities in the comparison result are both greater than the preset similarity threshold, the speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number can be chosen to form a text-speech pair with the original sentence information: when a sentence is read aloud repeatedly, the last reading is usually the most accurate.
  • The original sentence information can also be verified against the recognition sentence information, to delete characters that were skipped or to add extra characters that were read.
  • The speech sentence data that meets the similarity condition is selected, and the verified text information forms a text-speech pair with it.
  • Forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result may include:
  • selecting, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
  • verifying the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence;
  • using the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  • The preset selection conditions include one of the following:
  • if the maximum similarity in the comparison result is greater than the preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity; if there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
  • if both the maximum similarity and the second-largest similarity are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  • The verification may also include other processing, which is not detailed here.
  • Verifying the original sentence information against the recognition sentence information yields correct text information for the speech sentence data while avoiding the difficulty of modifying the speech data.
  • Fig. 2 is a flowchart of another speech annotation method according to an exemplary embodiment of this specification.
  • After the original text is acquired (step 202), the speech produced by reading it aloud can be collected to obtain a speech file, corresponding to the speech data in Fig. 1 (step 204).
  • A Chinese-character-to-pinyin conversion algorithm can be used to convert the original sentence information (sentences) in the original text into pinyin, obtaining the original pinyin sentences (step 206).
  • The speech file is segmented based on short-term energy, obtaining multiple pieces of speech sentence data (step 208).
  • The speech sentence data is subjected to speech recognition and converted into Chinese characters, obtaining the recognition sentence information (step 210).
  • The Chinese characters are converted into tonal pinyin, obtaining the recognized pinyin sentences (step 212).
  • Fast text similarity matching over a local range is performed, obtaining the comparison result (step 214).
  • The candidate sentence set is filtered according to the comparison result; the candidate sentence set may include the recognized pinyin sentences within the specified offset before and after the current sequence number.
  • Using the recognition sentence information corresponding to the selected recognized pinyin sentence, the corresponding original sentence information is verified and the punctuation corrected, obtaining the annotation data of the speech sentence data corresponding to the recognition sentence information and thus forming a text-speech pair (step 216).
  • This embodiment covers the whole process from speech segmentation, through speech recognition, to fast pinyin-based local text similarity matching; the entire flow can automatically complete annotation for end-to-end speech models.
  • Comparing text using pinyin with tones sidesteps inaccuracies in the speech recognition process.
  • Local comparison using the relative positions of the text handles the cleaning of the original speech data well through the degree of pinyin matching, while improving both the accuracy and the recall of the speech annotation.
  • Corresponding to the method embodiments, this specification also provides embodiments of a speech annotation apparatus and of the electronic equipment to which it is applied.
  • The embodiments of the speech annotation apparatus of this specification can be applied to computer devices.
  • The apparatus embodiments may be implemented by software, or by hardware, or by a combination of software and hardware. Taking software implementation as an example, as an apparatus in the logical sense, it is formed by the processor of the computer device in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them.
  • Fig. 3 is a hardware structure diagram of the computer device in which the speech annotation apparatus of this specification is located; in addition to the processor 310, network interface 320, memory 330, and non-volatile storage 340 shown in Fig. 3, the computer device in which the speech annotation apparatus 331 of the embodiment is located may generally include other hardware according to the actual function of the device, which is not described again here.
  • Fig. 4 is a block diagram of a speech annotation apparatus according to an exemplary embodiment of this specification.
  • The apparatus includes:
  • an information acquisition module 42 configured to acquire original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • a data segmentation module 44 configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
  • a speech-pair forming module 46 configured to compare the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
  • The data segmentation module is specifically configured to:
  • determine the start and end positions of continuous speech in the speech data according to the relationship between each frame's short-term energy and a preset energy threshold, and segment the speech data accordingly to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
  • Comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
  • The speech-pair forming module is specifically configured to:
  • convert the ordered original sentence information and the ordered recognition sentence information into tonal pinyin to obtain an original pinyin sequence, which includes the original pinyin sentences, and a recognized pinyin sequence, which includes the recognized pinyin sentences;
  • compare the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain the comparison result.
  • The speech-pair forming module is further specifically configured to:
  • select, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number, and verify the corresponding original sentence information against the corresponding recognition sentence information;
  • use the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  • The preset selection conditions include one of the following:
  • if the maximum similarity in the comparison result is greater than a preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
  • if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  • For the relevant parts, reference may be made to the description of the method embodiments.
  • The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification. A person of ordinary skill in the art can understand and implement this without creative effort.
  • An embodiment of this specification further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
  • acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
  • comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result.
  • A computer storage medium stores program instructions, the program instructions including:
  • acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
  • segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
  • comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result.
  • The embodiments of this specification may take the form of a computer program product implemented on one or more storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing program code.
  • Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • The information may be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of this specification provide a speech annotation method, apparatus, and device. Original text information and speech data corresponding to the original text information are acquired, and the speech data is segmented into sentences, yielding multiple pieces of speech sentence data that contain speech. Speech recognition is then performed on the speech sentence data, the resulting recognition sentence information is compared for similarity with the original sentence information in the original text information, and, according to the comparison result, the original sentence information and the speech sentence data are combined into text-speech pairs, achieving automated annotation and improving the efficiency of obtaining text-speech pairs.

Description

Speech Annotation Method, Apparatus, and Device
Technical Field
This specification relates to the field of data processing, and in particular to speech annotation methods, apparatuses, and devices.
Background
Whether in speech recognition or speech synthesis scenarios, training a good acoustic model depends on a large amount of speech data together with the correct text information corresponding to that speech data; the speech data and its corresponding correct text information are referred to as text-speech pairs for short. The process of determining speech data and the correct text information corresponding to it may be called speech annotation, and the correct text information may be called the annotation data of the speech data. In the related art, speech data is usually transcribed into text by manual dictation, and the correct text corresponding to the speech data is then determined by human judgment combined with semantic context and other factors, yielding text-speech pairs. However, this way of annotating speech relies on human labor, is inefficient, and incurs a high labor cost.
Summary
To overcome the problems in the related art, this specification provides a speech annotation method, apparatus, and device.
According to a first aspect of the embodiments of this specification, a speech annotation method is provided, the method including:
acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, segmenting the speech data into sentences to obtain at least one piece of speech sentence data includes:
determining the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data, the frame count of the speech sentence data being greater than or equal to a preset frame-count threshold.
In one embodiment, comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
In one embodiment, comparing the similarity of the recognition sentence information obtained by performing speech recognition on the speech sentence data with the original sentence information in the original text information includes:
dividing the original text information into sentences by punctuation and ordering them, to obtain an original text sequence;
performing speech recognition on the speech sentence data and ordering the results, to obtain a recognized text sequence;
converting the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, comparing the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result includes:
selecting, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
verifying the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
using the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
In one embodiment, the preset selection conditions include one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
According to a second aspect of the embodiments of this specification, a speech annotation apparatus is provided, the apparatus including:
an information acquisition module configured to acquire original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
a data segmentation module configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
a speech-pair forming module configured to compare the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, the data segmentation module is specifically configured to:
determine the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
segment the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data, the frame count of the speech sentence data being greater than or equal to a preset frame-count threshold.
In one embodiment, comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
In one embodiment, the speech-pair forming module is specifically configured to:
divide the original text information into sentences by punctuation and order them, to obtain an original text sequence;
perform speech recognition on the speech sentence data and order the results, to obtain a recognized text sequence;
convert the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, compare the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, the speech-pair forming module is further specifically configured to:
select, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
verify the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
use the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
In one embodiment, the preset selection conditions include one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
According to a third aspect of the embodiments of this specification, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
The technical solutions provided by the embodiments of this specification may have the following beneficial effects:
In the embodiments of this specification, by acquiring original text information and the speech data corresponding to it, segmenting the speech data into sentences to obtain multiple pieces of speech sentence data containing speech, performing speech recognition on the speech sentence data, comparing the resulting recognition sentence information with the original sentence information in the original text information for similarity, and then forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result, automated annotation is achieved and the efficiency of obtaining text-speech pairs is improved.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit this specification.
Brief Description of the Drawings
The accompanying drawings here are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with this specification and, together with the specification, serve to explain its principles.
Fig. 1 is a flowchart of a speech annotation method according to an exemplary embodiment of this specification.
Fig. 2 is a flowchart of another speech annotation method according to an exemplary embodiment of this specification.
Fig. 3 is a hardware structure diagram of a computer device in which a speech annotation apparatus according to an exemplary embodiment of this specification is located.
Fig. 4 is a block diagram of a speech annotation apparatus according to an exemplary embodiment of this specification.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "the", and "said" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "at the time of", or "in response to determining".
Speech recognition converts speech into text; the recognition process involves an acoustic model and a language model. Speech synthesis maps text to audio; the synthesis process also involves an acoustic model. For example, an end-to-end speech recognition model may connect the input end (speech waveform or feature sequence) to the output end (word or character sequence) with a neural network, moving traditional modules such as the acoustic model, pronunciation dictionary, and language model into processing inside the neural network. A speech synthesis model may likewise map from the input end (word or character sequence) to the output end (speech waveform or feature sequence).
Building an acoustic model depends on a large amount of speech data and the correct text information corresponding to that speech data, so as to obtain the statistical relationship between speech and text; the speech data and its corresponding correct text information are used to train the model and obtain the acoustic model. The process of determining speech data and the correct text information corresponding to it may be called speech annotation; the correct text information may serve as the annotation result of the speech data and may also be called annotation data.
In current speech annotation methods, speech data is usually transcribed into text by manual dictation to obtain text-speech pairs. However, the number of text-speech pairs required is large, and manual annotation is inefficient and incurs a high labor cost.
In view of this, in the embodiments of this specification, by acquiring original text information and the speech data corresponding to it, segmenting the speech data into sentences to obtain multiple pieces of speech sentence data containing speech, performing speech recognition on the speech sentence data, comparing the resulting recognition sentence information with the original sentence information in the original text information for similarity, and then forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result, automated annotation is achieved and the efficiency of obtaining text-speech pairs is improved. Exemplary embodiments of this specification are described below with reference to the accompanying drawings.
As shown in Fig. 1, which is a flowchart of a speech annotation method according to an exemplary embodiment of this specification, the method includes:
In step 102, original text information and speech data are acquired, the speech data including recording data obtained by reading the original text information aloud;
In step 104, the speech data is segmented into sentences to obtain at least one piece of speech sentence data;
In step 106, the recognition sentence information obtained by performing speech recognition on the speech sentence data is compared for similarity with the original sentence information in the original text information, and text-speech pairs are formed from the original sentence information and the speech sentence data according to the comparison result.
In this embodiment, the original text information and the speech data are related. The original text information may include the read-aloud text (also called the recording text) used for the voice recording; the speech data may include the recording data obtained by reading the original text information aloud. For example, professionals may be invited to read the original text information aloud to obtain speech data corresponding to it. As another example, news text and news audio may be obtained from a news platform.
The original text information may contain one or more sentences, and correspondingly the speech data also contains one or more spoken sentences. To distinguish them in this embodiment, the prefix "original" is added to names of information derived from the original text information, and the prefix "recognized" (or "recognition") to names of data obtained by recognizing the speech data. For example, a sentence in the original text information may be called original sentence information; the pieces of original sentence information may form an original text sequence; converting original sentence information into pinyin yields an original pinyin sentence; and the original pinyin sentences may form an original pinyin sequence. Correspondingly, segmenting the speech data yields speech sentence data; performing speech recognition on the speech sentence data yields recognition sentence information; the pieces of recognition sentence information may form a recognized text sequence; converting recognition sentence information into pinyin yields recognized pinyin sentences; and the recognized pinyin sentences may form a recognized pinyin sequence.
In this embodiment, the speech data is segmented into sentences in order to obtain speech sentence data with continuous speech. Speech sentence data may also be called speech segment data or speech fragment data. Different punctuation marks indicate pauses of different lengths: sentence-final punctuation indicates a longer pause than mid-sentence punctuation, and whether a pause occurs can in turn be judged from whether speech is present. Sentence segmentation of the speech data can therefore be achieved.
In one embodiment, the speech data may be segmented into sentences based on voice endpoint detection. Voice endpoint detection (Voice Activity Detection, VAD), also called voice activity detection or voice boundary detection, detects the presence or absence of speech in a noisy environment. Voice endpoint detection can separate speech signals from non-speech signals in the original speech data and locate the start and end points of a speech signal; the start and end points may be called endpoints. The embodiments of this specification can therefore detect endpoints by voice endpoint detection and judge, based on the interval between endpoints, whether two stretches of sound belong to the same sentence, thereby achieving sentence segmentation of the speech data. This not only restricts processing to speech signals while ignoring non-speech signals, but also yields speech sentence data that actually contains speech. Further, sentence segmentation may also take pause duration into account.
Since energy differs greatly between silence and speaking, to improve the accuracy of segmentation, in one embodiment the speech may be detected and segmented using an energy threshold (also called short-term energy). For example, segmenting the speech data through voice endpoint detection to obtain at least one piece of speech sentence data containing speech may include:
determining the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data.
Here, the speech data may be framed and windowed, and the short-term energy of each frame computed. For example, according to a preset sampling frequency of the speech signal, the speech data is divided into multiple short-term energy frames, and the energy of a short-term energy frame may be determined from the amplitudes of the audio signal at the sampling points within that frame. Specifically, the energy of each sampling point may be determined from the amplitude of the audio signal at that point; the energies are then summed, and the resulting sum is taken as the energy of the short-term energy frame. By comparing the energy of each short-term energy frame with the preset energy threshold, the start and end positions of continuous speech in the speech data can be determined; based on the interval between the end point of the previous speech segment and the start point of the current speech segment in adjacent segments, it is judged whether adjacent speech segments belong to the same sentence, yielding speech sentence data that contains speech.
In practice, the speech data may contain speech unrelated to the original text information, for example, interjections or asides uttered by the reader (such as an announcer) while reading the original text aloud, and such speech is usually quite short. To improve the accuracy of the speech annotation, in one embodiment, the speech sentence data may be limited to speech sentence data whose frame count is greater than or equal to a preset frame-count threshold. For example, after initial speech sentence data is obtained using the energy threshold, the initial speech sentence data whose frame count is greater than or equal to the preset frame-count threshold may be retained as the speech sentence data.
Here, the frame count may be the number of consecutive speech frames. In one example, a frame may span 20-50 ms, and one character takes roughly 200-300 ms. The preset frame-count threshold may be derived from the desired character length. For example, to exclude fragments of no more than 5 Chinese characters from the speech sentence data, the number of audio frames corresponding to 5 Chinese characters may be taken as the preset frame-count threshold.
This embodiment excludes speech sentence data whose frame count is below the preset frame-count threshold, which filters out speech contained in the speech data but unrelated to the original text information and yields speech sentence data more strongly correlated with the original text information, thereby improving the accuracy of the speech annotation.
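For illustration, the following is a minimal Python sketch of this short-term-energy segmentation, including the frame-count filter; the frame length, energy threshold, merge gap, and minimum frame count are assumed values chosen for the example, not values prescribed by this embodiment:

```python
import numpy as np

def energy_vad_segments(samples, sr, frame_ms=25, energy_thresh=1e-3,
                        merge_gap_s=0.3, min_frames=40):
    """Segment a mono waveform into sentence-like pieces by short-term energy.

    frame_ms, energy_thresh, merge_gap_s, and min_frames are illustrative
    assumptions; the embodiment only states that per-frame energy is compared
    with a preset threshold, nearby segments are merged by pause length, and
    segments shorter than a preset frame count are dropped.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
    # Short-term energy: mean of squared amplitudes within each frame.
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    voiced = energy >= energy_thresh

    # Collect runs of voiced frames as (start_frame, end_frame) pairs.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, n_frames))

    # Merge adjacent segments whose gap is below the pause threshold: a short
    # pause means the two pieces belong to the same sentence.
    max_gap = int(merge_gap_s * 1000 / frame_ms)
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Drop segments below the preset frame-count threshold (e.g. interjections
    # unrelated to the original text), then return sample index ranges.
    return [(s * frame_len, e * frame_len) for s, e in merged
            if e - s >= min_frames]
```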
After the speech sentence data is obtained, speech recognition may be performed on it to obtain the recognition sentence information. Any speech recognition means from the related art may be used; no limitation is imposed here. For the original text information, to improve both the accuracy and the efficiency of speech annotation, a sentence-splitting method may be used to segment the original text information and obtain the original sentence information. To distinguish the texts obtained in the two ways, the text obtained by recognizing the speech sentence data is called recognition sentence information, and the text obtained by segmenting the original text information is called original sentence information. In one example, segmentation may be performed by checking whether a character is a designated character; for example, the designated characters may be the period, exclamation mark, question mark, ellipsis, line break, and the like.
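A minimal sketch of this punctuation-based splitting might look as follows, where the set of designated characters is taken from the example above and is an assumption, not an exhaustive list:

```python
import re

# Designated sentence-ending characters (period, exclamation mark, question
# mark, ellipsis, line break); the exact set is an illustrative assumption.
SENTENCE_END = r'[。！？!?\n]|……'

def split_sentences(text):
    """Divide original text information into pieces of original sentence
    information by the designated punctuation characters."""
    return [p.strip() for p in re.split(SENTENCE_END, text) if p.strip()]

# Example: split_sentences('今天天气很好。我们去公园吧！')
# -> ['今天天气很好', '我们去公园吧']
```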
Since pauses in the speech data are usually caused by sentence boundaries, the speech data is segmented and recognized to obtain the recognition sentence information, the original text information is segmented to obtain the original sentence information, and the recognition sentence information is compared for similarity with the original sentence information; comparing fragment with fragment improves comparison efficiency. In particular, combining the position of the speech sentence data within the speech data with the position of the original sentence information within the original text information, and comparing the similarity of local text information, further improves comparison efficiency.
In practice, the same pinyin may represent different words; for example, the speech "bǐ'jì" may correspond to the text "笔记" (notes) or "笔迹" (handwriting), so speech recognition may be inaccurate. To avoid the effect of recognition errors on the text similarity comparison, in one example, comparing the similarity of the recognition sentence information with the original sentence information may include comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information.
It can be seen that comparing pinyin in this way sidesteps inaccuracies in the speech recognition process.
Further, the pinyin may be tonal pinyin; pinyin with tones improves the accuracy of the text similarity comparison.
To improve comparison efficiency, the original text information and the recognition sentence information obtained by speech recognition may each be ordered, and the comparison performed over sequences. For example, comparing the similarity of the recognition sentence information with the original sentence information may include comparing the recognized pinyin sequence converted from the recognition sentence information with the original pinyin sequence converted from the original sentence information.
In practice, a sentence may be read repeatedly because the reader misreads it. To still obtain accurate text-speech pairs, when comparing text similarity, the original pinyin sentence at sequence number i in the original pinyin sequence may be compared for similarity with the recognized pinyin sentences at sequence numbers (i-k) through (i+k) in the recognized pinyin sequence, achieving a local comparison. Specifically, comparing the similarity of the recognition sentence information obtained by performing speech recognition on the speech sentence data with the original sentence information in the original text information may include:
dividing the original text information into sentences by punctuation and ordering them, to obtain an original text sequence;
performing speech recognition on the speech sentence data and ordering the results, to obtain a recognized text sequence;
converting the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, comparing the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
Here, dividing the original text information into sentences by punctuation yields the original sentence information, and ordering the original sentence information by its order in the original text information yields the original text sequence. Performing speech recognition on the speech sentence data yields the recognition sentence information, and ordering the recognition sentence information by the order of the speech sentence data within the speech data yields the recognized text sequence. Converting the original sentence information in the original text sequence into tonal pinyin yields the original pinyin sequence, which consists of the original pinyin sentences; converting the recognition sentence information in the recognized text sequence into tonal pinyin yields the recognized pinyin sequence, which consists of the recognized pinyin sentences. In one example, the pinyin tones may be represented by four different digits, for example 1 through 4. To distinguish the two groups of pinyin sequences, they are named the original pinyin sequence and the recognized pinyin sequence.
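As an illustration of this conversion, the pypinyin library (named here as an assumption; the embodiment does not prescribe a specific Chinese-character-to-pinyin algorithm) can emit tonal pinyin with the tone encoded as a trailing digit:

```python
from pypinyin import lazy_pinyin, Style

def to_tonal_pinyin(sentence):
    """Convert a Chinese sentence to tonal pinyin, tone encoded as a digit,
    e.g. '语音标注' -> ['yu3', 'yin1', 'biao1', 'zhu4']."""
    return lazy_pinyin(sentence, style=Style.TONE3)
```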
The original pinyin sentences in the original pinyin sequence are traversed, and the following comparison is performed for each original pinyin sentence in the sequence:
the original pinyin sentence at the current sequence number is compared for similarity with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
Here, assuming the current sequence number is i and the specified offset is k, the recognized pinyin sentences at sequence numbers (i-k) through (i+k) may be taken as the comparison objects for the original pinyin sentence at sequence number i, achieving fast pinyin-based local text similarity matching and thereby improving the text information obtained as the annotation result of the speech sentence data, that is, the accuracy of the text-speech pairs.
Regarding similarity, in one example, the similarity may be determined from the ratio of the number of matching pinyin in the two pinyin sentences (the original pinyin sentence and the recognized pinyin sentence) to the total number of pinyin in the two sentences. For example, the formula sim(i,j) = 2*M/T may be used, where M is the number of pinyin that can be matched between the original pinyin sentence at sequence number i and the recognized pinyin sentence at sequence number j, T is the total number of pinyin in the original pinyin sentence at sequence number i and the recognized pinyin sentence at sequence number j, and j ∈ [i-k, i+k].
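A minimal sketch of the formula and the local window comparison follows; counting M via a sequence alignment is one possible reading, since the embodiment does not fix how matching pinyin are counted:

```python
from difflib import SequenceMatcher

def sim(orig_pinyin, rec_pinyin):
    """sim(i,j) = 2*M/T, with M the number of matched pinyin between the two
    sentences and T the total pinyin count of both. Counting matches via a
    sequence alignment is an illustrative assumption."""
    matcher = SequenceMatcher(None, orig_pinyin, rec_pinyin)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(orig_pinyin) + len(rec_pinyin)
    return 2 * matched / total if total else 0.0

def local_compare(orig_seq, rec_seq, i, k):
    """Compare the original pinyin sentence at sequence number i with the
    recognized pinyin sentences at sequence numbers (i-k)..(i+k); return
    (sequence_number, similarity) pairs."""
    lo, hi = max(0, i - k), min(len(rec_seq) - 1, i + k)
    return [(j, sim(orig_seq[i], rec_seq[j])) for j in range(lo, hi + 1)]
```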
In one embodiment, since the similarity between an original pinyin sentence and a recognized pinyin sentence can serve as the similarity between the original sentence information and the speech sentence data, the speech sentence data whose similarity satisfies the conditions can be selected according to the comparison result and combined with the original sentence information into text-speech pairs.
Here, a preset selection condition may be any condition for selecting suitable speech sentence data.
In one example, the preset selection condition may be: if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, select the recognized pinyin sentence corresponding to the maximum similarity.
In this embodiment, the speech sentence data corresponding to the similarity that is greater than the preset similarity threshold and is the maximum in the comparison result can be selected to form a text-speech pair with the original sentence information.
In another example, the preset selection condition may be: among the recognized pinyin sentences whose similarity is greater than the preset similarity threshold, select the one with the largest sequence number.
In this embodiment, if two or more similarities in the comparison result exceed the preset similarity threshold, the speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number can be chosen to form a text-speech pair with the original sentence information, reflecting the fact that when a sentence is read repeatedly, the last reading is usually the most accurate.
In another example, the preset selection condition may be: if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, select the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity.
In this embodiment, at least two maximum similarities may exist in the comparison result, so the recognized pinyin sentence with the largest sequence number can be selected from the recognized pinyin sentences corresponding to the maximum similarity.
In another example, if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities is selected.
In this embodiment, the recognized pinyin sentence with the largest sequence number can be selected from the recognized pinyin sentences corresponding to the maximum and second-largest similarities. Further, the maximum and second-largest similarities are the top two similarities, that is, the comparison result contains only one maximum similarity and one second-largest similarity. If the top two similarities in the comparison result are both greater than the preset similarity threshold, the speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number can be chosen to form a text-speech pair with the original sentence information, again reflecting that in practice, when a sentence is read repeatedly, the last reading is usually the most accurate.
It can be understood that other preset selection conditions are possible, as long as similarity is at least one of the selection factors.
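As a sketch of one of the selection conditions above (prefer the largest sequence number among the candidates tied at the maximum similarity), with the threshold value an assumption:

```python
def select_candidate(scored, threshold=0.8):
    """scored: (sequence_number, similarity) pairs from the local comparison.
    Returns the chosen sequence number, or None if no candidate exceeds the
    preset similarity threshold (0.8 is an illustrative assumption)."""
    passing = [(j, s) for j, s in scored if s > threshold]
    if not passing:
        return None
    best = max(s for _, s in passing)
    # Among the candidates tied at the maximum similarity, prefer the largest
    # sequence number: with repeated readings, the last one is usually best.
    return max(j for j, s in passing if s == best)
```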
In practice, characters may be skipped or read in addition during recording, and modifying the speech itself is relatively difficult. In view of this, in one embodiment, the recognition sentence information can also be used to verify the original sentence information, deleting characters that were skipped or adding extra characters that were read. According to the comparison result, the speech sentence data whose similarity satisfies the conditions is selected and combined with the verified text information into text-speech pairs. For example, forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result may include:
selecting, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
verifying the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
using the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
Here, the preset selection conditions include one of the following:
if the maximum similarity in the comparison result is greater than the preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity; if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
It can be understood that the verification may also include other processing, which is not detailed one by one here.
In this embodiment, the original sentence information is verified against the recognition sentence information, so that correct text information for the speech sentence data is obtained while the difficulty of modifying the speech data is avoided.
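One possible sketch of the verification step, using character-level alignment to delete skipped characters and add extra ones; the alignment strategy and the handling of mismatches are assumptions, since the embodiment does not prescribe an algorithm:

```python
from difflib import SequenceMatcher

def verify(original, recognized):
    """Adjust the original sentence to match what was actually read aloud:
    drop characters the reader skipped and add extra characters that were
    read. Character-level alignment is an illustrative assumption."""
    pieces = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, original,
                                              recognized).get_opcodes():
        if op == 'equal':
            pieces.append(original[i1:i2])    # read exactly as written
        elif op == 'insert':
            pieces.append(recognized[j1:j2])  # extra characters read aloud
        elif op == 'replace':
            # Keep the original wording here: a mismatch is more likely a
            # recognition error (e.g. a homophone) than a misreading.
            pieces.append(original[i1:i2])
        # op == 'delete': characters skipped by the reader are dropped
    return ''.join(pieces)
```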
The technical features in the above implementations can be combined arbitrarily, as long as there is no conflict or contradiction between the combined features; for reasons of space they are not described one by one, but any combination of the technical features in the above implementations also falls within the scope disclosed by this specification.
One such combination is illustrated below.
As shown in Fig. 2, which is a flowchart of another speech annotation method according to an exemplary embodiment of this specification: after the original text is acquired (step 202), the speech produced by reading the original text aloud can be collected to obtain a speech file, corresponding to the speech data in Fig. 1 (step 204). For the original text, a Chinese-character-to-pinyin conversion algorithm can be used to convert the original sentence information (sentences) in the original text into pinyin, obtaining the original pinyin sentences (step 206). The speech file is segmented based on short-term energy, obtaining multiple pieces of speech sentence data (step 208). Speech recognition is performed on the speech sentence data and the result converted into Chinese characters, obtaining the recognition sentence information (step 210). The Chinese characters are converted into tonal pinyin, obtaining the recognized pinyin sentences (step 212). Based on the original pinyin sentences and the recognized pinyin sentences, fast text similarity matching over a local range is performed, obtaining the comparison result (step 214). The candidate sentence set is filtered according to the comparison result; the candidate sentence set may include the recognized pinyin sentences within the specified offset before and after the current sequence number. Using the recognition sentence information corresponding to the selected recognized pinyin sentence, the original sentence information corresponding to the original pinyin sentence is verified and the punctuation corrected, obtaining the annotation data of the speech sentence data corresponding to the recognition sentence information and thus forming a text-speech pair (step 216).
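Tying the sketches above together, a hedged outline of the Fig. 2 flow might look as follows; `recognize` is a hypothetical placeholder for whatever speech recognizer is used, and k is the window offset:

```python
def annotate(original_text, samples, sr, k=2):
    """End-to-end sketch of the Fig. 2 flow, reusing the illustrative helpers
    sketched above. `recognize` is a placeholder, not a prescribed API."""
    orig_sentences = split_sentences(original_text)                 # steps 202/206
    orig_pinyin = [to_tonal_pinyin(s) for s in orig_sentences]
    segments = energy_vad_segments(samples, sr)                     # step 208
    rec_sentences = [recognize(samples[a:b], sr) for a, b in segments]  # step 210
    rec_pinyin = [to_tonal_pinyin(s) for s in rec_sentences]        # step 212

    pairs = []
    for i in range(len(orig_pinyin)):
        scored = local_compare(orig_pinyin, rec_pinyin, i, k)       # step 214
        j = select_candidate(scored)
        if j is not None:                                           # step 216
            text = verify(orig_sentences[i], rec_sentences[j])
            pairs.append((text, segments[j]))
    return pairs
```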
This embodiment covers the whole process from speech segmentation, through speech recognition, to fast pinyin-based local text similarity matching, and the entire flow can automatically complete annotation for end-to-end speech models. Using pinyin with tones when comparing text sidesteps inaccuracies in the speech recognition process. When comparing text, local comparison combined with the relative positions of the text handles the cleaning of the original speech data well through the degree of pinyin matching, while improving both the accuracy and the recall of the speech annotation.
Corresponding to the foregoing embodiments of the speech annotation method, this specification also provides embodiments of a speech annotation apparatus and of the electronic equipment to which it is applied.
The embodiments of the speech annotation apparatus of this specification can be applied to computer devices. The apparatus embodiments may be implemented by software, or by hardware, or by a combination of software and hardware. Taking software implementation as an example, as an apparatus in the logical sense, it is formed by the processor of the computer device in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them. In terms of hardware, as shown in Fig. 3, which is a hardware structure diagram of the computer device in which the speech annotation apparatus of this specification is located, in addition to the processor 310, network interface 320, memory 330, and non-volatile storage 340 shown in Fig. 3, the computer device in which the speech annotation apparatus 331 of the embodiment is located may generally include other hardware according to the actual function of the device, which is not described again here.
As shown in Fig. 4, which is a block diagram of a speech annotation apparatus according to an exemplary embodiment of this specification, the apparatus includes:
an information acquisition module 42 configured to acquire original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
a data segmentation module 44 configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
a speech-pair forming module 46 configured to compare the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, the data segmentation module is specifically configured to:
determine the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
segment the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data, the frame count of the speech sentence data being greater than or equal to a preset frame-count threshold.
In one embodiment, comparing the similarity of the recognition sentence information with the original sentence information includes comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
In one embodiment, the speech-pair forming module is specifically configured to:
divide the original text information into sentences by punctuation and order them, to obtain an original text sequence;
perform speech recognition on the speech sentence data and order the results, to obtain a recognized text sequence;
convert the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, compare the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, the speech-pair forming module is further specifically configured to:
select, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
verify the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
use the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
In one embodiment, the preset selection conditions include one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this specification. A person of ordinary skill in the art can understand and implement this without creative effort.
Correspondingly, an embodiment of this specification further provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
The embodiments in this specification are described progressively; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments are substantially similar to the method embodiments, so they are described relatively simply; for the relevant parts, reference may be made to the description of the method embodiments.
A computer storage medium stores program instructions, the program instructions including:
acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
The embodiments of this specification may take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing program code. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Other embodiments of this specification will readily occur to those skilled in the art after considering the specification and practicing the invention applied for here. This specification is intended to cover any variations, uses, or adaptations of this specification that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this specification. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this specification indicated by the following claims.
It should be understood that this specification is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall be included within its scope of protection.

Claims (13)

  1. A speech annotation method, the method comprising:
    acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
    segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
    comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
  2. The method according to claim 1, wherein segmenting the speech data into sentences to obtain at least one piece of speech sentence data comprises:
    determining the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
    segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data, the frame count of the speech sentence data being greater than or equal to a preset frame-count threshold.
  3. The method according to claim 1, wherein comparing the similarity of the recognition sentence information with the original sentence information comprises comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
  4. The method according to claim 3, wherein comparing the similarity of the recognition sentence information obtained by performing speech recognition on the speech sentence data with the original sentence information in the original text information comprises:
    dividing the original text information into sentences by punctuation and ordering them, to obtain an original text sequence;
    performing speech recognition on the speech sentence data and ordering the results, to obtain a recognized text sequence;
    converting the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
    for each original pinyin sentence in the original pinyin sequence, comparing the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
  5. The method according to claim 4, wherein forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result comprises:
    selecting, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
    verifying the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
    using the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  6. The method according to claim 5, wherein the preset selection conditions comprise one of the following:
    if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity;
    if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
    if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  7. A speech annotation apparatus, the apparatus comprising:
    an information acquisition module configured to acquire original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
    a data segmentation module configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
    a speech-pair forming module configured to compare the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
  8. The apparatus according to claim 7, wherein the data segmentation module is specifically configured to:
    determine the start and end positions of continuous speech in the speech data according to the relationship between the short-term energy of each frame of the speech data and a preset energy threshold;
    segment the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data, the frame count of the speech sentence data being greater than or equal to a preset frame-count threshold.
  9. The apparatus according to claim 7, wherein comparing the similarity of the recognition sentence information with the original sentence information comprises comparing the pinyin of the recognition sentence information with the pinyin of the original sentence information, the pinyin being tonal pinyin.
  10. The apparatus according to claim 9, wherein the speech-pair forming module is specifically configured to:
    divide the original text information into sentences by punctuation and order them, to obtain an original text sequence;
    perform speech recognition on the speech sentence data and order the results, to obtain a recognized text sequence;
    convert the original text sequence and the recognized text sequence into tonal pinyin, respectively, to obtain an original pinyin sequence and a recognized pinyin sequence, the original pinyin sequence including original pinyin sentences and the recognized pinyin sequence including recognized pinyin sentences;
    for each original pinyin sentence in the original pinyin sequence, compare the similarity of the original pinyin sentence at the current sequence number with the recognized pinyin sentences whose sequence numbers lie within a specified offset before and after the current sequence number in the recognized pinyin sequence, to obtain a comparison result.
  11. The apparatus according to claim 10, wherein the speech-pair forming module is further specifically configured to:
    select, according to the comparison result and preset selection conditions, a recognized pinyin sentence from the recognized pinyin sentences within the specified offset before and after the current sequence number;
    verify the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognized pinyin sentence, the verification including deleting characters that were skipped or adding extra characters that were read;
    use the text information obtained from the verification as the annotation data of the speech sentence data corresponding to the recognition sentence information, forming a text-speech pair.
  12. The apparatus according to claim 11, wherein the preset selection conditions comprise one of the following:
    if the maximum similarity in the comparison result is greater than a preset similarity threshold and there is exactly one maximum similarity, selecting the recognized pinyin sentence corresponding to the maximum similarity;
    if the maximum similarity in the comparison result is greater than the preset similarity threshold and there are at least two maximum similarities, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum similarity;
    if both the maximum similarity and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
  13. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
    acquiring original text information and speech data, the speech data including recording data obtained by reading the original text information aloud;
    segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
    comparing the similarity of recognition sentence information, obtained by performing speech recognition on the speech sentence data, with original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
PCT/CN2019/089176 2018-08-02 2019-05-30 语音标注方法、装置及设备 WO2020024690A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810873608.3A CN109065031B (zh) 2018-08-02 2018-08-02 语音标注方法、装置及设备
CN201810873608.3 2018-08-02

Publications (1)

Publication Number Publication Date
WO2020024690A1 true WO2020024690A1 (zh) 2020-02-06

Family

ID=64832878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089176 WO2020024690A1 (zh) 2018-08-02 2019-05-30 语音标注方法、装置及设备

Country Status (3)

Country Link
CN (1) CN109065031B (zh)
TW (1) TW202008349A (zh)
WO (1) WO2020024690A1 (zh)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065031B (zh) * 2018-08-02 2020-05-12 阿里巴巴集团控股有限公司 语音标注方法、装置及设备
CN109830229A (zh) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 音频语料智能清洗方法、装置、存储介质和计算机设备
CN109493869A (zh) * 2018-12-25 2019-03-19 苏州思必驰信息科技有限公司 音频数据的采集方法及系统
CN109948124B (zh) * 2019-03-15 2022-12-23 腾讯科技(深圳)有限公司 语音文件切分方法、装置及计算机设备
CN110310626A (zh) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 语音训练数据生成方法、装置、设备及可读存储介质
CN110534100A (zh) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 一种基于语音识别的中文语音校对方法和装置
CN110400580B (zh) * 2019-08-30 2022-06-17 北京百度网讯科技有限公司 音频处理方法、装置、设备和介质
CN110503958A (zh) * 2019-08-30 2019-11-26 厦门快商通科技股份有限公司 语音识别方法、系统、移动终端及存储介质
CN110610698B (zh) * 2019-09-12 2022-09-27 上海依图信息技术有限公司 一种语音标注方法及装置
CN110718226B (zh) * 2019-09-19 2023-05-05 厦门快商通科技股份有限公司 语音识别结果处理方法、装置、电子设备及介质
CN110827827A (zh) * 2019-11-27 2020-02-21 维沃移动通信有限公司 一种语音播报方法及电子设备
CN112069805A (zh) * 2019-12-20 2020-12-11 北京来也网络科技有限公司 结合rpa与ai的文本标注方法、装置、设备及存储介质
CN113112997A (zh) * 2019-12-25 2021-07-13 华为技术有限公司 数据采集的方法及装置
CN112307748A (zh) * 2020-03-02 2021-02-02 北京字节跳动网络技术有限公司 用于处理文本的方法和装置
CN111429880A (zh) * 2020-03-04 2020-07-17 苏州驰声信息科技有限公司 一种切割段落音频的方法、系统、装置、介质
CN111710332B (zh) * 2020-06-30 2023-07-07 北京达佳互联信息技术有限公司 语音处理方法、装置、电子设备及存储介质
CN111883110B (zh) * 2020-07-30 2024-02-06 上海携旅信息技术有限公司 语音识别的声学模型训练方法、系统、设备及介质
CN111986654B (zh) * 2020-08-04 2024-01-19 云知声智能科技股份有限公司 降低语音识别系统延时的方法及系统
CN112133309B (zh) * 2020-09-22 2021-08-24 掌阅科技股份有限公司 音频和文本的同步方法、计算设备及存储介质
CN112185390B (zh) * 2020-09-27 2023-10-03 中国商用飞机有限责任公司北京民用飞机技术研究中心 机上信息辅助方法及装置
CN113535017B (zh) * 2020-09-28 2024-03-15 腾讯科技(深圳)有限公司 一种绘本文件的处理、同步显示方法、装置及存储介质
CN112863490B (zh) * 2021-01-07 2024-04-30 广州欢城文化传媒有限公司 一种语料获取方法及装置
CN113205814B (zh) * 2021-04-28 2024-03-12 平安科技(深圳)有限公司 语音数据标注方法、装置、电子设备及存储介质
CN113672760B (zh) * 2021-08-19 2023-07-11 北京字跳网络技术有限公司 一种文本对应关系构建方法及其相关设备
CN113723086B (zh) * 2021-08-31 2023-09-05 平安科技(深圳)有限公司 一种文本处理方法、系统、设备及介质
CN113923479A (zh) * 2021-11-12 2022-01-11 北京百度网讯科技有限公司 音视频剪辑方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0965176A (ja) * 1995-08-21 1997-03-07 Fujitsu General Ltd プロンプタ装置
US20110153316A1 (en) * 2009-12-21 2011-06-23 Jonathan Pearl Acoustic Perceptual Analysis and Synthesis System
CN107516509A (zh) * 2017-08-29 2017-12-26 苏州奇梦者网络科技有限公司 用于新闻播报语音合成的语音库构建方法及系统
CN107657947A (zh) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 基于人工智能的语音处理方法及其装置
CN109065031A (zh) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 语音标注方法、装置及设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9129605B2 (en) * 2012-03-30 2015-09-08 Src, Inc. Automated voice and speech labeling
CN105632484B (zh) * 2016-02-19 2019-04-09 云知声(上海)智能科技有限公司 语音合成数据库停顿信息自动标注方法及系统
CN107578769B (zh) * 2016-07-04 2021-03-23 科大讯飞股份有限公司 语音数据标注方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0965176A (ja) * 1995-08-21 1997-03-07 Fujitsu General Ltd プロンプタ装置
US20110153316A1 (en) * 2009-12-21 2011-06-23 Jonathan Pearl Acoustic Perceptual Analysis and Synthesis System
CN107516509A (zh) * 2017-08-29 2017-12-26 苏州奇梦者网络科技有限公司 用于新闻播报语音合成的语音库构建方法及系统
CN107657947A (zh) * 2017-09-20 2018-02-02 百度在线网络技术(北京)有限公司 基于人工智能的语音处理方法及其装置
CN109065031A (zh) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 语音标注方法、装置及设备

Also Published As

Publication number Publication date
TW202008349A (zh) 2020-02-16
CN109065031A (zh) 2018-12-21
CN109065031B (zh) 2020-05-12

Similar Documents

Publication Publication Date Title
WO2020024690A1 (zh) 语音标注方法、装置及设备
CN110557589B (zh) 用于整合记录的内容的系统和方法
CN109599093B (zh) 智能质检的关键词检测方法、装置、设备及可读存储介质
JP4600828B2 (ja) 文書対応付け装置、および文書対応付け方法
WO2018192186A1 (zh) 语音识别方法及装置
KR101587866B1 (ko) 음성 인식용 발음사전 확장 장치 및 방법
CN112133277B (zh) 样本生成方法及装置
CN112259083B (zh) 音频处理方法及装置
Nasib et al. A real time speech to text conversion technique for bengali language
JP2017058507A (ja) 音声認識装置、音声認識方法、プログラム
Peláez-Moreno et al. Analyzing phonetic confusions using formal concept analysis
Wu et al. Music chord recognition based on midi-trained deep feature and blstm-crf hybird decoding
Meinedo et al. Age and gender detection in the I-DASH project
Mary et al. Searching speech databases: features, techniques and evaluation measures
Howitt Vowel landmark detection
CN109213970B (zh) 笔录生成方法及装置
CN111179914B (zh) 一种基于改进动态时间规整算法的语音样本筛选方法
CN109389969B (zh) 语料库优化方法及装置
Backstrom et al. Forced-alignment of the sung acoustic signal using deep neural nets
JP4825290B2 (ja) 無声化位置検出装置及び方法とそれを用いたセグメンテーション装置及び方法、及びプログラム
Raškinis et al. From speech corpus to intonation corpus: clustering phrase pitch contours of Lithuanian
Reddy et al. Automatic pitch accent contour transcription for Indian languages
JP6565416B2 (ja) 音声検索装置、音声検索方法及びプログラム
Darģis et al. Development and evaluation of speech synthesis corpora for Latvian
WO2018169772A2 (en) Quality feedback on user-recorded keywords for automatic speech recognition systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19843077

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19843077

Country of ref document: EP

Kind code of ref document: A1