WO2019218467A1 - Method, apparatus, terminal device and medium for dialect recognition in audio and video calls - Google Patents

Method, apparatus, terminal device and medium for dialect recognition in audio and video calls

Info

Publication number
WO2019218467A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
pronunciation
word
dialect
sequence
Prior art date
Application number
PCT/CN2018/097145
Other languages
English (en)
French (fr)
Inventor
张辉
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019218467A1 publication Critical patent/WO2019218467A1/zh

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/141: Systems for two-way working between two video terminals, e.g. videophone
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The present application belongs to the field of data processing technologies, and in particular relates to a method, apparatus, terminal device and medium for dialect recognition in audio and video calls.
  • The embodiments of the present application provide an audio and video call dialect recognition method and terminal device, to solve the prior-art problem of being unable to accurately recognize and translate dialects during audio and video calls so as to ensure that such calls proceed normally.
  • A first aspect of the embodiments of the present application provides an audio and video call dialect recognition method, including: collecting voice data of a user during a call, and determining the dialect type of the voice data; dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data; acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  • A second aspect of the embodiments of the present application provides an audio and video call dialect recognition apparatus, including:
  • a voice collection module, configured to collect voice data of a user during a call and determine the dialect type of the voice data;
  • a data segment division module, configured to divide the voice data into a plurality of voice data segments, where the voice data segments are in one-to-one correspondence with each word constituting the voice data;
  • a speech recognition module, configured to acquire a pronunciation dictionary and a language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words and the language model stores grammar rule data of the dialect type; and
  • a character screening module, configured to analyze the plurality of candidate character sequences based on the language model, so as to screen out from them the character sequence with the highest matching degree with the voice data, and send the character sequence to the peer terminal device of the call for display.
  • A third aspect of the embodiments of the present application provides a terminal device, including a memory and a processor, the memory storing computer readable instructions executable on the processor, where the following steps are implemented when the processor executes the computer readable instructions:
  • collecting voice data of a user during a call, and determining the dialect type of the voice data; dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data; acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and analyzing the plurality of candidate character sequences based on the language model, so as to screen out the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  • A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing computer readable instructions, where the following steps are implemented when the computer readable instructions are executed by at least one processor:
  • collecting voice data of a user during a call, and determining the dialect type of the voice data; dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data; acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and analyzing the plurality of candidate character sequences based on the language model, so as to screen out the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effects: the user's voice data is collected and divided into word-level voice data segments, determining the voice data segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on those segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data, and hence that the user's dialect is recognized accurately. Finally, the recognized character sequence is sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect. The embodiments of the present application can therefore accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and keep audio and video calls proceeding normally.
  • FIG. 1 is a schematic flowchart of an implementation process of an audio and video call dialect identification method according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic flowchart showing an implementation process of an audio and video call dialect identification method provided in Embodiment 2 of the present application;
  • FIG. 3 is a schematic flowchart showing an implementation process of an audio and video call dialect identification method according to Embodiment 3 of the present application;
  • FIG. 4 is a schematic flowchart showing an implementation process of an audio and video call dialect identification method provided in Embodiment 4 of the present application;
  • FIG. 5 is a schematic flowchart of an implementation process of an audio and video call dialect identification method provided in Embodiment 5 of the present application;
  • FIG. 6 is a schematic structural diagram of an audio and video call dialect identifying apparatus provided in Embodiment 6 of the present application;
  • FIG. 7 is a schematic diagram of a terminal device provided in Embodiment 7 of the present application.
  • FIG. 1 is a flowchart showing an implementation of an audio and video call dialect identification method according to Embodiment 1 of the present application, which is described in detail as follows:
  • S101: Collect voice data of the user during the call, and determine the dialect type of the voice data.
  • The dialect type may be selected manually by the user, or identified using language identification technology. Considering that there are currently many dialect types, that language identification is difficult to implement, and that it occupies considerable processor resources, it is preferable to present all dialect types for which recognition and translation are supported to the user and let the user select the dialect type manually.
  • S102: Divide the voice data into a plurality of voice data segments, where the voice data segments are in one-to-one correspondence with each word constituting the voice data.
  • After the voice data is acquired, speech recognition must be performed on it. To achieve this, the voice data first needs to be divided: the start and end time corresponding to each word is determined, so that the voice data is divided at the word level into segments in one-to-one correspondence with the words, allowing the specific word corresponding to each segment, and its conversion probability, to be subsequently queried from the pronunciation dictionary.
  • The method for dividing the voice data into word-level segments may be set by technicians according to actual needs, including but not limited to traditional speech analysis algorithms such as dynamic programming, or a neural network that adaptively learns word-level division of the voice data, for example a Connectionist Temporal Classification (CTC) model.
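  • For illustration only, the following is a minimal sketch of how word-level boundaries might be derived from the frame-wise outputs of a CTC model; the patent does not prescribe an implementation, and the frame duration, the blank index and the model producing `log_probs` are all assumptions.

```python
import numpy as np

# Hedged sketch: greedy CTC decoding that also recovers per-label
# (start, end) times, usable as word-level segment boundaries.
# Assumes `log_probs` is a (T frames x V labels) array produced by some
# pretrained CTC acoustic model, with label 0 reserved for blank.

def ctc_segments(log_probs: np.ndarray, frame_dur: float = 0.02):
    best = log_probs.argmax(axis=1)        # best label per frame
    segments, prev = [], 0                 # prev starts as blank
    for t, lab in enumerate(best):
        if lab != 0 and lab != prev:       # a new non-blank label begins
            segments.append([lab, t, t + 1])
        elif lab != 0:                     # same label continues
            segments[-1][2] = t + 1
        prev = lab
    # convert frame indices to seconds
    return [(lab, s * frame_dur, e * frame_dur) for lab, s, e in segments]
```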
  • S103: Acquire the pronunciation dictionary and language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data. The pronunciation dictionary records the dialect pronunciation data corresponding to words, and the language model stores the grammar rule data of the dialect type.
  • The pronunciation dictionary stores the dialect pronunciation corresponding to each word, for query during speech recognition. In practice, just as with polyphonic characters in Mandarin, a word in a dialect may also have several different pronunciations, so in the pronunciation dictionary each word corresponds to one or more dialect pronunciations. The pronunciation dictionary in the embodiments of the present application may be obtained by technicians recording the dialect pronunciations of words from a large number of users in advance, or by the user recording his or her own dialect pronunciation data for the words.
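  • As a concrete illustration of the data structure just described (one word mapped to one or more dialect pronunciations), a minimal sketch follows; the representation of a pronunciation is left abstract, and all names are illustrative rather than part of the embodiment.

```python
from collections import defaultdict

# Hedged sketch of the pronunciation dictionary: each word is associated
# with a list of dialect pronunciations, so recordings from technicians
# or from the user can simply be appended as additional entries.

class PronunciationDict:
    def __init__(self):
        self._entries = defaultdict(list)     # word -> [pronunciation, ...]

    def add(self, word: str, pronunciation) -> None:
        """Associate one more dialect pronunciation with `word`."""
        self._entries[word].append(pronunciation)

    def pronunciations(self, word: str) -> list:
        return self._entries.get(word, [])
```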
  • The language model stores the grammar rules of the dialect, used to grammatically check the character sequences obtained by speech recognition and judge their matching degree with the voice data. The grammar rules in the language model may be set manually by technicians, such as subject-verb-object rules for Chinese, or derived as corresponding grammar rules from the linguistic regularities of the dialect obtained after training on large amounts of text.
  • Since many words have identical or similar pronunciations, each voice data segment may be recognized as several similar dialect pronunciations; that is, each segment may have multiple corresponding candidate words. For example, the pronunciation "zhang" in Chinese may correspond simultaneously to words such as 张, 长 and 章, so several words may all be recognized as candidates. When the voice data segments are recognized in the embodiments of the present application, multiple candidate character sequences composed of the candidate words are therefore obtained; for example, "zhang da" in Chinese may be recognized as candidate character sequences such as 长大 (grow up), 张大 (open wide), 胀大 (swell) and 张达 (Zhang Da), as illustrated in the sketch below.
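  • A small sketch of this candidate expansion, using the "zhang da" example above; the per-segment candidate lists are illustrative.

```python
from itertools import product

# Hedged sketch: each voice data segment yields several candidate words
# with similar dialect pronunciations, and the candidate character
# sequences are all combinations of one candidate per segment.

segment_candidates = [["长", "张", "胀"],   # candidates for "zhang"
                      ["大", "达"]]          # candidates for "da"

candidate_sequences = ["".join(words) for words in product(*segment_candidates)]
# e.g. 长大, 长达, 张大, 张达, 胀大, 胀达
```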
  • S104: Analyze the plurality of candidate character sequences based on the language model, so as to screen out the character sequence with the highest matching degree with the voice data from the plurality of candidate character sequences, and send the character sequence to the peer terminal device of the call for display.
  • After the candidate character sequences are obtained, each candidate sequence's probability can be decomposed, via the chain rule, into the product of the probabilities of its constituent words. Combining the conversion probability between each voice data segment and candidate word with the continuation probabilities of candidate words given by the grammar rules yields the probability that each word appears after the previous one, and hence the overall probability of each candidate character sequence (i.e., its matching degree with the voice data). Since methods for computing and screening candidate character sequence probabilities with a language model are already mature in the prior art, the detailed steps are not repeated in the embodiments of the present application.
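  • The scoring just described might look like the following sketch, where the per-segment conversion probabilities and the bigram continuation probabilities are assumed lookup tables rather than a real API:

```python
import math

# Hedged sketch: chain-rule scoring of one candidate character sequence.
# seg_probs[i][w] stands in for P(word w | voice segment i) from the
# pronunciation dictionary, and bigram[(prev, w)] for the grammar-rule
# probability that w follows prev; both are illustrative assumptions.

def sequence_score(words, seg_probs, bigram, floor=1e-9):
    score, prev = 0.0, "<s>"
    for i, w in enumerate(words):
        score += math.log(seg_probs[i].get(w, floor))    # acoustic term
        score += math.log(bigram.get((prev, w), floor))  # grammar term
        prev = w
    return score

# The best-matching sequence is then simply the arg max:
# best = max(candidate_sequences, key=lambda s: sequence_score(s, sp, bg))
```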
  • In the embodiments of the present application, the user's voice data is collected and divided into word-level voice data segments, determining the segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on the segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data, and hence that the user's dialect is recognized accurately.
  • The recognized character sequence is then sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect. The embodiments of the present application can therefore accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and keep audio and video calls proceeding normally.
  • In practice, commonly used languages such as Chinese and English have large speaker populations and highly standardized pronunciations, so recognition performed directly against their universal pronunciation standards, e.g., directly against Hanyu Pinyin or English phonetic symbols, already achieves high accuracy.
  • Dialects, by contrast, have fewer speakers and rarely a unified pronunciation standard, being essentially passed down orally from generation to generation; even within the same dialect, the pronunciations of people from different places may differ greatly. In dialect speech recognition it is therefore difficult to guarantee accuracy by simply fixing a single pronunciation standard as for common languages, yet the dialect pronunciations of words must still be set in the pronunciation dictionary. Whether dialect speech recognition is accurate is thus directly tied to the accuracy of the words' dialect pronunciation data stored in the pronunciation dictionary. Therefore, in Embodiment 2, after the recognized character sequence is obtained, the character sequence and the corresponding voice data are used to update the pronunciation dictionary, as detailed below.
  • S201: Analyze the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data.
  • Since the resulting character sequences are, in theory, correct results of recognizing and translating the current user's voice data, the voice data segment of each word is highly informative for the current user. Compared with directly presetting non-unified standard dialect pronunciation data, the word pronunciations in these segments better satisfy the current user's personal characteristics; updating the dialect pronunciation data of words in the pronunciation dictionary with these segments as the standard can therefore greatly improve speech recognition accuracy for the user.
  • S202: Based on the analysis result, update the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  • The specific update method is not limited by the embodiments of the present application and may be chosen by technicians, including but not limited to directly taking the voice data segment corresponding to a word as one of its dialect pronunciations and storing it together with the dialect pronunciation data originally stored for that word (i.e., associating and storing multiple dialect pronunciations per word, as with polyphonic characters). The amount of dialect pronunciation data per word in the dictionary then grows with the user's use, and the speech recognition rate for the user grows accordingly. Preferably, to keep the stored data from growing too large, a similarity computation may be performed after each store over the stored pronunciations of the same word, deduplicating those that are overly similar and retaining one of them, as sketched below.
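  • The deduplication step might be sketched as follows, assuming for illustration that each stored pronunciation is a fixed-length feature vector and that cosine similarity with a 0.95 threshold stands in for the unspecified similarity computation:

```python
import numpy as np

# Hedged sketch: after each store, overly similar pronunciations of the
# same word are merged, keeping one representative of each cluster.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedup_pronunciations(prons, threshold: float = 0.95):
    kept = []
    for p in prons:
        if all(cosine(p, k) < threshold for k in kept):
            kept.append(p)        # sufficiently distinct: retain it
    return kept
```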
  • As a preferred embodiment, some text modification functions may also be provided to the user before S201, so that erroneous character sequences can be corrected in time, ensuring the accuracy of the character sequences ultimately used to update the pronunciation dictionary and hence the accuracy of the dictionary update.
  • By updating the pronunciation dictionary promptly after the recognized character sequence is obtained, the dictionary becomes better suited to the user's personal speech recognition, the recognition rate of the user's dialect improves, and the dictionary stays accurate and effective in real time.
  • Embodiment 3 of the present application provides the user with a pronunciation dictionary setting function, so that the user can update and set the dialect pronunciation data of each word in the pronunciation dictionary, letting the dictionary further satisfy the user's actual situation, as detailed below:
  • S301: If a pronunciation modification instruction input by the user is received, determine in the pronunciation dictionary the to-be-modified pronunciation words indicated by the instruction.
  • When the user needs to configure the pronunciation dictionary, it is first necessary to determine which words' dialect pronunciation data are to be modified, i.e., to determine the to-be-modified pronunciation words; once determined, the corresponding pronunciation modification instruction is input to the audio and video device.
  • S302: Extract, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and update the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
  • After the pronunciation modification instruction input by the user is received and the to-be-modified pronunciation words are determined, the user's pronunciation data for those words is received and their dialect pronunciation data is updated. The specific update method is not limited here; see the related description in Embodiment 2 of the present application.
  • Since pronunciation data recorded by the user personally fits the user's actual situation better than preset universal pronunciation data, performing speech recognition against it as the reference can greatly alleviate the low recognition accuracy caused by dialects lacking standard pronunciations, satisfy the user's personalized characteristics, strongly safeguard the recognition rate of the user's dialect, and improve the accuracy of the pronunciation dictionary.
  • Embodiment 4 automatically identifies, after the character sequence is recognized, whether the user needs to modify it, and updates the pronunciation dictionary based on the modified character sequence and the corresponding voice data, as detailed below:
  • S401: Collect voice data of the user during the call, divide the voice data into multiple voice data segments, and analyze the segments based on the pronunciation dictionary and the language model to determine the character sequence with the highest matching degree corresponding to the voice data.
  • S402: From the number of voice data segments and the duration of the currently collected voice data, together with the number of voice data segments and the duration of the previously collected voice data, calculate and judge whether the difference between the speech rates of the current and previous voice data is greater than a preset speech rate threshold, and judge whether the similarity between the character sequence corresponding to the current voice data and that corresponding to the previous voice data is greater than a preset similarity threshold.
  • Here, speech rate = number of voice data segments / duration of the voice data. In real life, when people find that the other party has not heard or understood them clearly, they repeat themselves at a lower speech rate. Therefore, to facilitate use and improve the accuracy of the pronunciation dictionary, the embodiments of the present application use the user's speech rate, together with the character sequence content relative to the previous utterance, to automatically identify whether the user needs to modify the previous utterance, and then use the modified content to update the pronunciation dictionary in a targeted manner. Both the speech rate threshold and the similarity threshold may be set by technicians according to the actual situation, as in the sketch below.
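  • A minimal sketch of the S402 check follows; the threshold values and the character-overlap similarity measure are illustrative assumptions, not values fixed by the embodiment:

```python
from difflib import SequenceMatcher

# Hedged sketch: detect a likely correction by combining the drop in
# speech rate (segments per second) with the similarity of the two
# recognized character sequences.

def is_correction(curr_segments: int, curr_duration: float,
                  prev_segments: int, prev_duration: float,
                  curr_text: str, prev_text: str,
                  rate_threshold: float = 1.0,
                  sim_threshold: float = 0.6) -> bool:
    rate_drop = prev_segments / prev_duration - curr_segments / curr_duration
    similarity = SequenceMatcher(None, curr_text, prev_text).ratio()
    return rate_drop > rate_threshold and similarity > sim_threshold
```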
  • S403: If the difference in speech rate is greater than the speech rate threshold and the similarity is greater than the similarity threshold, determine the differing words in the character sequence corresponding to the current voice data, extract from the voice data the voice data segments corresponding to the differing words, and update the dialect pronunciation data corresponding to the differing words based on the extracted segments. When the speech rate drops markedly and the spoken content is highly repetitive, the user is modifying the previous utterance, and the differing words are exactly the content that needed modification; since such words are the ones most likely to have been misrecognized, updating their dialect pronunciation data greatly improves the accuracy of the pronunciation dictionary and of speech recognition.
  • Embodiment 5 of the present application provides the user with a manual modification function for the character sequence after it is recognized, and updates the pronunciation dictionary based on the modified character sequence and the corresponding voice data, as detailed below:
  • S501: Receive a text modification instruction input by the user.
  • S502: Determine, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replace the to-be-modified words with the standard words indicated in the text modification instruction.
  • To make it easy for the user to correct an erroneous character sequence when speech recognition goes wrong, a manual modification function is provided: after obtaining the recognition result, the user selects the to-be-modified word in the character sequence, inputs the corresponding standard word to replace it, and clicks to confirm the modification, generating the corresponding text modification instruction; the to-be-modified word can then be replaced with the standard word automatically. For example, when the recognition result is 我像吃蛋糕 ("I am like eating cake"), the user only needs to select the to-be-modified word 像, input the standard word 想 (yielding 我想吃蛋糕, "I want to eat cake"), and click to confirm; on receiving the text modification instruction, the audio and video device modifies the character sequence accordingly.
  • S503: Extract from the voice data the voice data segment corresponding to the to-be-modified word, so as to update the dialect pronunciation data corresponding to the standard word.
  • Since the user only modifies when an error has actually occurred, and the input standard word is necessarily the fully accurate word, the voice data segment corresponding to the to-be-modified word is in fact a user voice data segment of the standard word. Updating the standard word's dialect pronunciation data on this basis can therefore greatly improve the accuracy of the pronunciation dictionary and of speech recognition, as in the sketch below. The specific update method is not limited here; see the related description in Embodiment 2 of the present application.
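  • Tying S502 and S503 together, a hedged sketch follows (reusing the PronunciationDict sketch above; the word-to-segment alignment passed in as `segments` is an assumption):

```python
# Hedged sketch: replace the to-be-modified word with the standard word,
# then reuse that word's voice segment to update the standard word's
# dialect pronunciation data, as in the 我像吃蛋糕 -> 我想吃蛋糕 example.

def apply_text_fix(words, segments, bad_index, standard_word, pron_dict):
    words[bad_index] = standard_word                  # e.g. 像 -> 想
    pron_dict.add(standard_word, segments[bad_index]) # update dictionary
    return "".join(words)
```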
  • As a preferred embodiment, after the user has made multiple audio and video calls, the method further includes:
  • recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
  • To make the language model better match the user's actual language habits, it is updated and trained on the character sequence results of the user's real speech recognition. Since model training requires a fair amount of sample data, the language model is trained and updated only after the user has made multiple audio and video calls and obtained a fair number of character sequence recognition results. For example, after the user has made 20 audio and video calls, deep learning can be performed on the character sequence results from those 20 calls to analyze their grammatical structure and obtain the corresponding grammar rules, thereby training and updating the language model; a simplified count-based variant is sketched below.
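  • As one simplified stand-in for the training update described above, bigram counts could be re-estimated from the recorded sequences once enough calls have accumulated; the 20-call threshold comes from the example in the text, everything else is illustrative:

```python
from collections import Counter

# Hedged sketch: re-estimate bigram continuation probabilities from the
# character sequences recorded across the user's calls. A count-based
# model stands in for the deep-learning training mentioned above.

def update_bigrams(recorded_sequences, min_calls: int = 20):
    if len(recorded_sequences) < min_calls:
        return None                          # not enough samples yet
    pair_counts, prev_counts = Counter(), Counter()
    for seq in recorded_sequences:
        tokens = ["<s>"] + list(seq)
        for prev, curr in zip(tokens, tokens[1:]):
            pair_counts[(prev, curr)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}
```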
  • In the embodiments of the present application, the user's voice data is collected and divided into word-level voice data segments, determining the segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on the segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data and hence that the user's dialect is recognized accurately.
  • The recognized character sequence is sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect.
  • The embodiments of the present application further provide the user with manual and automatic modification of the recognized character sequences, update the pronunciation dictionary based on the character sequences obtained by normal recognition or after modification, and train and update the language model, so that the pronunciation dictionary and language model increasingly conform to the user's personalized characteristics with use and satisfy the user's individual needs; their accuracy in recognizing the user's dialect thus keeps rising with use, strongly safeguarding accurate recognition and translation of the user's dialect.
  • The user can also set or modify the pronunciation dictionary personally, making dictionary modification ever more efficient and letting the dictionary better satisfy the user's actual personal needs. Therefore, the embodiments of the present application can accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and keep audio and video calls proceeding normally.
  • FIG. 6 is a structural block diagram of the audio and video call dialect recognition apparatus provided by an embodiment of the present application; for convenience of description, only the parts related to the embodiment are shown.
  • The audio and video call dialect recognition apparatus illustrated in FIG. 6 may be the execution body of the audio and video call dialect recognition method provided in Embodiment 1 above.
  • The audio and video call dialect recognition apparatus includes:
  • the voice collection module 61 is configured to collect voice data of the user during the call, and determine a dialect type of the voice data.
  • the data segment dividing module 62 is configured to divide the voice data into a plurality of voice data segments, and the voice data segments are in one-to-one correspondence with each word constituting the voice data.
  • a speech recognition module 63, configured to acquire the pronunciation dictionary and language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records the dialect pronunciation data corresponding to words and the language model stores the grammar rule data of the dialect type; and
  • a character screening module 64, configured to analyze the plurality of candidate character sequences based on the language model, so as to screen out from them the character sequence with the highest matching degree with the voice data, and send the character sequence to the peer terminal device of the call for display.
  • Further, the audio and video call dialect recognition apparatus includes:
  • a correspondence analysis module, configured to analyze the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
  • a first data update module, configured to update, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  • Further, the audio and video call dialect recognition apparatus includes:
  • a word determination module, configured to determine in the pronunciation dictionary, if a pronunciation modification instruction input by the user is received, the to-be-modified pronunciation words indicated by the instruction; and
  • a second data update module, configured to extract, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and update the corresponding dialect pronunciation data based on it.
  • Further, the audio and video call dialect recognition apparatus includes:
  • an instruction receiving module, configured to receive a text modification instruction input by the user; a replacement module, configured to determine in the character sequence the to-be-modified words indicated by the instruction and replace them with the indicated standard words; and a third data update module, configured to extract from the voice data the voice data segments corresponding to the to-be-modified words, to update the dialect pronunciation data corresponding to the standard words.
  • Further, the audio and video call dialect recognition apparatus includes:
  • a module configured to record the character sequences generated during the user's calls and use the recorded character sequences as training sample data to train and update the language model.
  • Although terms such as "first" and "second" are used herein to describe various elements in the embodiments of the present application, these elements should not be limited by these terms; the terms are only used to distinguish one element from another.
  • For example, a first contact could be named a second contact, and similarly a second contact could be named a first contact, without departing from the scope of the various described embodiments; the first contact and the second contact are both contacts, but they are not the same contact.
  • FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
  • the terminal device 7 of this embodiment includes a processor 70, a memory 71 in which computer readable instructions 72 executable on the processor 70 are stored.
  • When the processor 70 executes the computer readable instructions 72, the steps in the foregoing embodiments of the audio and video call dialect recognition method are implemented, such as steps S101 to S104 shown in FIG. 1.
  • Alternatively, when executing the computer readable instructions 72, the processor 70 implements the functions of the modules/units in the foregoing apparatus embodiments, such as the functions of modules 61 to 64 shown in FIG. 6.
  • the terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 70 and a memory 71. It will be understood by those skilled in the art that FIG. 7 is only an example of the terminal device 7, and does not constitute a limitation of the terminal device 7, and may include more or less components than those illustrated, or combine some components or different components.
  • the terminal device may further include input and transmission devices, a network access device, a bus, and the like.
  • The processor 70 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor.
  • the memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7.
  • The memory 71 may also be an external storage device of the terminal device 7, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7.
  • the memory 71 is configured to store the computer readable instructions and other programs and data required by the terminal device.
  • the memory 71 can also be used to temporarily store data that has been sent or is about to be transmitted.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium.
  • The present application may implement all or part of the processes in the foregoing method embodiments by instructing the relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium.
  • When executed by a processor, the computer readable instructions may implement the steps of the various method embodiments described above.
  • The computer readable instructions include computer readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like.
  • The computer readable medium may include any entity or apparatus capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunications signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application provides a method, apparatus, terminal device and medium for dialect recognition in audio and video calls, applicable to the field of data processing technologies. The method includes: collecting voice data of a user during a call and determining the dialect type of the voice data; dividing the voice data into a plurality of voice data segments; acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data; and analyzing the plurality of candidate character sequences based on the language model, so as to screen out from them the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display. The embodiments of this application can accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and ensure that audio and video calls proceed normally.

Description

Method, apparatus, terminal device and medium for dialect recognition in audio and video calls
This application claims priority to Chinese patent application No. 201810456906.2, filed with the Chinese Patent Office on May 14, 2018 and entitled "Audio and video call dialect recognition method and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of data processing technologies, and in particular relates to a method, apparatus, terminal device and medium for dialect recognition in audio and video calls.
Background
With existing audio and video call software, only the voices of the two parties can be heard during a call. When one of the parties speaks a dialect that the other party does not understand, a language barrier arises during the audio and video call, so that the two parties cannot carry on the call normally. The prior art therefore cannot accurately recognize and translate dialects during audio and video calls so as to ensure that such calls proceed normally.
Technical Problem
In view of this, the embodiments of this application provide an audio and video call dialect recognition method and terminal device, to solve the prior-art problem of being unable to accurately recognize and translate dialects during audio and video calls so as to ensure that such calls proceed normally.
Technical Solution
A first aspect of the embodiments of this application provides an audio and video call dialect recognition method, including:
collecting voice data of a user during a call, and determining the dialect type of the voice data;
dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
A second aspect of the embodiments of this application provides an audio and video call dialect recognition apparatus, including:
a voice collection module, configured to collect voice data of a user during a call and determine the dialect type of the voice data;
a data segment division module, configured to divide the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
a speech recognition module, configured to acquire a pronunciation dictionary and a language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
a character screening module, configured to analyze the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and send the character sequence to the peer terminal device of the call for display.
A third aspect of the embodiments of this application provides a terminal device, including a memory and a processor, the memory storing computer readable instructions executable on the processor, where the following steps are implemented when the processor executes the computer readable instructions:
collecting voice data of a user during a call, and determining the dialect type of the voice data;
dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
A fourth aspect of the embodiments of this application provides a computer readable storage medium storing computer readable instructions, where the following steps are implemented when the computer readable instructions are executed by at least one processor:
collecting voice data of a user during a call, and determining the dialect type of the voice data;
dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
Beneficial Effects
Compared with the prior art, the embodiments of this application have the following beneficial effects: the user's voice data is collected and divided into word-level voice data segments, determining the voice data segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on those segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data, and hence that the user's dialect is recognized accurately. Finally, the recognized character sequence is sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect. The embodiments of this application can therefore accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and ensure that audio and video calls proceed normally.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings may be obtained from them without creative effort.
FIG. 1 is a schematic flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 1 of this application;
FIG. 2 is a schematic flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 2 of this application;
FIG. 3 is a schematic flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 3 of this application;
FIG. 4 is a schematic flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 4 of this application;
FIG. 5 is a schematic flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 5 of this application;
FIG. 6 is a schematic structural diagram of the audio and video call dialect recognition apparatus provided in Embodiment 6 of this application;
FIG. 7 is a schematic diagram of the terminal device provided in Embodiment 7 of this application.
Embodiments of the Invention
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of this application. However, it will be clear to those skilled in the art that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary detail does not obscure the description of this application.
To illustrate the technical solutions described in this application, specific embodiments are described below.
FIG. 1 shows a flowchart of the implementation of the audio and video call dialect recognition method provided in Embodiment 1 of this application, detailed as follows:
S101: Collect voice data of the user during a call, and determine the dialect type of the voice data.
The dialect type may be selected manually by the user, or identified using language identification technology. Considering that there are currently many dialect types, that language identification is difficult to implement, and that it occupies considerable processor resources, it is preferable to present all dialect types for which recognition and translation are supported to the user and let the user select the dialect type manually.
S102: Divide the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data.
After the voice data is acquired, speech recognition needs to be performed on it. To achieve this, the voice data must first be divided: the start and end time corresponding to each word is determined, so that at the word level the voice data is divided into segments in one-to-one correspondence with the words, allowing the specific word corresponding to each segment, and its conversion probability, to be subsequently queried from the pronunciation dictionary.
The method for dividing the voice data into word-level segments may be set by technicians according to actual needs, including but not limited to traditional speech analysis algorithms such as dynamic programming, or a neural network that adaptively learns word-level division of the voice data, for example a Connectionist Temporal Classification (CTC) model.
S103: Acquire the pronunciation dictionary and language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words and the language model stores grammar rule data of the dialect type.
The pronunciation dictionary stores the dialect pronunciation corresponding to each word, for query during speech recognition. In practice, just as with polyphonic characters in Mandarin, a word in a dialect may also have several different pronunciations, so in the pronunciation dictionary each word corresponds to one or more dialect pronunciations. The pronunciation dictionary in the embodiments of this application may be obtained by technicians recording the dialect pronunciations of words from a large number of users in advance, or by the user recording his or her own dialect pronunciation data for the words. The language model stores the grammar rules of the dialect, used to grammatically check the character sequences obtained by recognition and judge their matching degree with the voice data; the grammar rules may be set manually by technicians, such as subject-verb-object rules for Chinese, or derived as corresponding grammar rules from the linguistic regularities of the dialect obtained after training on large amounts of text.
Since many words have identical or similar pronunciations, each voice data segment may be recognized as several similar dialect pronunciations, i.e., each segment may have multiple corresponding candidate words. For example, the pronunciation "zhang" in Chinese may correspond simultaneously to the words 张, 长 and 章, so several words may all be recognized as candidates. When the voice data segments are recognized in the embodiments of this application, multiple candidate character sequences composed of the candidate words are therefore obtained; for example, "zhang da" in Chinese may be recognized as candidate character sequences such as 长大 (grow up), 张大 (open wide), 胀大 (swell) and 张达 (Zhang Da).
S104: Analyze the plurality of candidate character sequences based on the language model, so as to screen out the character sequence with the highest matching degree with the voice data from the plurality of candidate character sequences, and send the character sequence to the peer terminal device of the call for display.
After the candidate character sequences are obtained, each sequence's probability can be decomposed, via the chain rule, into the product of the probabilities of its constituent words; combining the conversion probability between each voice data segment and candidate word with the continuation probabilities of candidate words given by the grammar rules yields the probability that each word appears after the previous one, and hence the probability of each candidate character sequence (i.e., its matching degree with the voice data). Since methods for computing and screening candidate character sequence probabilities with a language model are already mature in the prior art, the detailed steps are not repeated in the embodiments of this application.
As a preferred embodiment of this application, considering that relying on grammar rules alone may yield low recognition accuracy for words with the same part of speech and identical or similar pronunciations, such as 人才 and 人材 in Chinese, which simple grammar rules can hardly distinguish, the embodiments of this application may further perform semantic analysis on the several candidate character sequences of similar, highest matching probability screened out by the language model, to determine the best character sequence; one possible sketch follows.
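For illustration only, the semantic analysis of near-tied candidates might be sketched as follows; the sentence encoder `embed` and the score margin are assumptions, as the embodiment does not fix a particular semantic-analysis technique.

```python
import numpy as np

# Hedged sketch: after language-model scoring, candidates within a small
# margin of the best score are re-ranked by semantic fit with the recent
# dialogue context, e.g. to separate 人才 from 人材.

def semantic_rerank(candidates, lm_scores, context, embed, margin=0.5):
    best = max(lm_scores)
    tied = [c for c, s in zip(candidates, lm_scores) if best - s <= margin]
    ctx = embed(context)
    def fit(cand):
        v = embed(cand)
        return float(np.dot(v, ctx) / (np.linalg.norm(v) * np.linalg.norm(ctx)))
    return max(tied, key=fit)
```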
In the embodiments of this application, the user's voice data is collected and divided into word-level voice data segments, determining the segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on the segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data and hence that the user's dialect is recognized accurately. Finally, the recognized character sequence is sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect. The embodiments of this application can therefore accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and ensure that audio and video calls proceed normally.
In practice, commonly used languages such as Chinese and English have large speaker populations and highly standardized pronunciations, so high-accuracy recognition can be achieved by performing speech recognition directly against their universal pronunciation standards, e.g., directly against Hanyu Pinyin or English phonetic symbols. Dialects, however, have fewer speakers and rarely a unified pronunciation standard, being essentially passed down orally from generation to generation, so even within the same dialect the pronunciations of people from different places may differ greatly. In dialect speech recognition it is therefore difficult to guarantee accuracy by simply fixing a single pronunciation standard as for common languages; yet speech recognition still requires the dialect pronunciations of words to be set in the pronunciation dictionary. Whether dialect speech recognition is accurate is thus directly tied to the accuracy of the words' dialect pronunciation data stored in the pronunciation dictionary.
Therefore, as Embodiment 2 of this application, shown in FIG. 2, to improve speech recognition accuracy, after the recognized character sequence is obtained, the character sequence and the corresponding voice data are used to update the pronunciation dictionary, detailed as follows:
S201: Analyze the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data.
Since the resulting character sequences are, in theory, correct results of recognizing and translating the current user's voice data, the voice data segment of each word is highly informative for the current user. Compared with directly presetting non-unified standard dialect pronunciation data, the word pronunciations in these segments better satisfy the current user's personal characteristics; updating the dialect pronunciation data of words in the pronunciation dictionary with these segments as the standard can therefore greatly improve speech recognition accuracy for the individual user.
S202: Based on the analysis result, update the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
The embodiments of this application do not limit the specific update method, which may be chosen by technicians, including but not limited to directly taking the voice data segment corresponding to a word as one of its dialect pronunciations and storing it together with the dialect pronunciation data originally stored for that word (i.e., associating and storing multiple dialect pronunciations per word, as with polyphonic characters). The amount of dialect pronunciation data per word in the dictionary then grows with the user's use, and the speech recognition rate for the user grows accordingly. Preferably, to prevent the stored pronunciation data from becoming too large, after each store a similarity computation may be performed on the stored multiple dialect pronunciations of the same word, deduplicating those whose similarity is too high and retaining one of them.
As a preferred embodiment of this application, to improve the accuracy of pronunciation dictionary updates, some text modification functions may also be provided to the user before S201, so that erroneous character sequences can be corrected in time, ensuring the accuracy of the character sequences ultimately used to update the pronunciation dictionary; see the related descriptions of Embodiment 4 and Embodiment 5 of this application.
In the embodiments of this application, the pronunciation dictionary is updated promptly after the character sequence recognized from the user's voice data is obtained, making the dictionary better suited to the user's personal speech recognition, improving the accuracy of recognizing the user's dialect, and ensuring that the pronunciation dictionary remains accurate and effective in real time.
As Embodiment 3 of this application, shown in FIG. 3, to improve speech recognition accuracy, Embodiment 3 provides the user with a pronunciation dictionary setting function, so that the user can set and update the dialect pronunciation data of each word in the pronunciation dictionary, letting the dictionary further satisfy the user's actual situation, detailed as follows:
S301: If a pronunciation modification instruction input by the user is received, determine in the pronunciation dictionary the to-be-modified pronunciation words indicated by the instruction.
When the user needs to configure the pronunciation dictionary, it is first necessary to determine which words' dialect pronunciation data are to be modified, i.e., to determine the to-be-modified pronunciation words. Once determined, the corresponding pronunciation modification instruction is input to the audio and video device.
S302: Extract, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and update the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
After the pronunciation modification instruction input by the user is received and the to-be-modified pronunciation words are determined, the user's pronunciation data for those words is received and their dialect pronunciation data is updated. The specific update method is not limited here; see the related description in Embodiment 2 of this application.
Since pronunciation data recorded by the user personally fits the user's actual situation better than preset universal pronunciation data, performing speech recognition against it as the reference can greatly alleviate the low recognition accuracy caused by dialects lacking standard pronunciations, satisfy the user's personalized characteristics, strongly safeguard the recognition rate of the user's dialect, and improve the accuracy of the pronunciation dictionary.
As Embodiment 4 of this application, shown in FIG. 4, to improve speech recognition accuracy, Embodiment 4 automatically identifies, after the recognized character sequence is obtained, whether the user needs to modify the character sequence, and updates the pronunciation dictionary based on the modified character sequence and the corresponding voice data, detailed as follows:
S401: Collect voice data of the user during the call, divide the voice data into a plurality of voice data segments, and analyze the segments based on the pronunciation dictionary and language model to determine the character sequence with the highest matching degree corresponding to the voice data.
This is the same as the operations of Embodiment 1 of this application and is not repeated here.
S402: From the number of voice data segments and the duration of the currently collected voice data, together with the number of voice data segments and the duration of the previously collected voice data, calculate and judge whether the difference between the speech rates of the current and previous voice data is greater than a preset speech rate threshold, and judge whether the similarity between the character sequence corresponding to the current voice data and that corresponding to the previous voice data is greater than a preset similarity threshold.
Here, speech rate = number of voice data segments / duration of the voice data. In real life, when people find that the other party has not heard or understood them clearly, they repeat themselves at a lower speech rate. Therefore, to facilitate use and improve the accuracy of the pronunciation dictionary, the embodiments of this application use the user's speech rate, together with the character sequence content relative to the previous utterance, to automatically identify whether the user needs to modify the previous utterance, and then use the modified content to update the pronunciation dictionary in a targeted manner. Both the speech rate threshold and the similarity threshold may be set by technicians according to the actual situation.
S403: If the difference in speech rate is greater than the speech rate threshold and the similarity is greater than the similarity threshold, determine the differing words in the character sequence corresponding to the current voice data, extract from the voice data the voice data segments corresponding to the differing words, and update the dialect pronunciation data corresponding to the differing words based on the extracted segments.
When the speech rate drops markedly and the spoken content is highly repetitive, the user is modifying the previous utterance, and the differing words in the character sequence are exactly the content to be modified. In practice, modification is needed precisely because the voice data was misrecognized, so the differing words have a relatively high probability of having been recognized incorrectly; updating the dialect pronunciation data of these error-prone words can greatly improve the accuracy of the pronunciation dictionary and of speech recognition. The specific update method is not limited here; see the related description in Embodiment 2 of this application.
As Embodiment 5 of this application, shown in FIG. 5, to improve speech recognition accuracy, Embodiment 5 provides the user with a manual modification function for the character sequence after it is recognized, and updates the pronunciation dictionary based on the modified character sequence and the corresponding voice data, detailed as follows:
S501: Receive a text modification instruction input by the user.
S502: Determine, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replace the to-be-modified words with the standard words indicated in the text modification instruction.
To make it convenient for the user to modify an erroneous character sequence when speech recognition goes wrong, the embodiments of this application provide a manual modification function: after obtaining the character sequence recognition result, the user simply selects the to-be-modified words in the sequence, inputs the corresponding standard words to replace them, and clicks to confirm, generating the corresponding text modification instruction; the embodiments of this application can then automatically replace the to-be-modified words with the standard words. For example, when the recognition result is 我像吃蛋糕 ("I am like eating cake"), the user only needs to select the to-be-modified word 像, input the standard word 想, and click to confirm, generating a text modification instruction; on receiving it, the audio and video device modifies the character sequence accordingly (yielding 我想吃蛋糕, "I want to eat cake").
S503: Extract from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
Since the user only modifies when an error has actually occurred, and the input standard word is necessarily the fully accurate word, the voice data segment corresponding to the to-be-modified word is in fact a user voice data segment of the standard word. Updating the dialect pronunciation data of the standard word on this basis can therefore greatly improve the accuracy of the pronunciation dictionary and of speech recognition. The specific update method is not limited here; see the related description in Embodiment 2 of this application.
As a preferred embodiment of this application, after the user has made multiple audio and video calls, the method further includes:
recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
Because the grammar rules in the language model are set according to technicians' experience or the language habits of most dialect speakers, and because a dialect's grammar, with its small speaker population, is only weakly standardized, preset grammar rules may not suit an individual user well. To improve the effectiveness of the language model and make it better match the user's actual language habits, the embodiments of this application update and train the language model based on the character sequence results of the user's real speech recognition. Since model training requires a fair amount of sample data, the language model is trained and updated only after the user has made multiple audio and video calls and obtained a fair number of character sequence recognition results; for example, after the user has made 20 audio and video calls, deep learning is performed on the character sequence results from those 20 calls to analyze their grammatical structure and obtain the corresponding grammar rules, thereby training and updating the language model.
In the embodiments of this application, the user's voice data is collected and divided into word-level voice data segments, determining the segment corresponding to each word in the user's voice data; the pronunciation dictionary and language model corresponding to the user's dialect are then used to perform speech recognition and character sequence screening on the segments, ensuring that the final character sequence is the recognition result that best matches the user's voice data and hence that the user's dialect is recognized accurately. Finally, the recognized character sequence is sent to the peer terminal device of the audio and video call for display, so that the call partner can learn the content of the call from the displayed text even without understanding the user's dialect.
Meanwhile, the embodiments of this application also provide the user with manual and automatic modification of the recognized character sequences, update the pronunciation dictionary based on the character sequences obtained by normal recognition or after modification, and train and update the language model, so that the pronunciation dictionary and language model increasingly conform to the user's personalized characteristics with use and satisfy the user's individual needs; their accuracy in recognizing the user's dialect thus keeps rising with use, strongly safeguarding accurate recognition and translation of the user's dialect. The user can also set or modify the pronunciation dictionary personally, making dictionary modification ever more efficient and letting the dictionary better satisfy the user's actual personal needs. Therefore, the embodiments of this application can accurately recognize and translate dialects during audio and video calls, ensure that both parties learn the call content even when a dialect is not understood, and ensure that audio and video calls proceed normally.
Corresponding to the methods of the above embodiments, FIG. 6 shows a structural block diagram of the audio and video call dialect recognition apparatus provided in the embodiments of this application; for convenience of description, only the parts related to the embodiments of this application are shown. The apparatus illustrated in FIG. 6 may be the execution body of the audio and video call dialect recognition method provided in Embodiment 1 above.
Referring to FIG. 6, the audio and video call dialect recognition apparatus includes:
a voice collection module 61, configured to collect voice data of the user during a call and determine the dialect type of the voice data;
a data segment division module 62, configured to divide the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
a speech recognition module 63, configured to acquire the pronunciation dictionary and language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, where the pronunciation dictionary records dialect pronunciation data corresponding to words and the language model stores grammar rule data of the dialect type; and
a character screening module 64, configured to analyze the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and send the character sequence to the peer terminal device of the call for display.
Further, the audio and video call dialect recognition apparatus includes modules configured to:
analyze the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
update, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
Further, the audio and video call dialect recognition apparatus includes modules configured to:
determine in the pronunciation dictionary, if a pronunciation modification instruction input by the user is received, the to-be-modified pronunciation words indicated by the instruction; and
extract, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and update the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
Further, the audio and video call dialect recognition apparatus includes modules configured to:
receive a text modification instruction input by the user;
determine, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replace the to-be-modified words with the standard words indicated in the instruction; and
extract from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
Further, the audio and video call dialect recognition apparatus includes a module configured to:
record the character sequences generated during the user's calls, and use the recorded character sequences as training sample data to train and update the language model.
For the process by which each module of the audio and video call dialect recognition apparatus provided in the embodiments of this application implements its function, refer to the description of Embodiment 1 shown in FIG. 1 above; details are not repeated here.
It should be understood that the size of the sequence numbers of the steps in the above embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
It should also be understood that although the terms "first", "second" and the like are used in some embodiments herein to describe various elements, these elements should not be limited by the terms, which are only used to distinguish one element from another. For example, a first contact could be named a second contact, and similarly a second contact could be named a first contact, without departing from the scope of the various described embodiments; the first contact and the second contact are both contacts, but they are not the same contact.
FIG. 7 is a schematic diagram of the terminal device provided in an embodiment of this application. As shown in FIG. 7, the terminal device 7 of this embodiment includes a processor 70 and a memory 71 storing computer readable instructions 72 executable on the processor 70. When executing the computer readable instructions 72, the processor 70 implements the steps in the foregoing embodiments of the audio and video call dialect recognition method, such as steps 101 to 104 shown in FIG. 1; alternatively, when executing the computer readable instructions 72, the processor 70 implements the functions of the modules/units in the foregoing apparatus embodiments, such as the functions of modules 61 to 64 shown in FIG. 6.
The terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will appreciate that FIG. 7 is merely an example of the terminal device 7 and does not constitute a limitation on it; it may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal device may also include input and transmission devices, network access devices, a bus, and the like.
The processor 70 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as its hard disk or internal memory. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store the computer readable instructions and the other programs and data required by the terminal device, and may also be used to temporarily store data that has been or is about to be sent.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may exist physically separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated module/unit may be stored in a computer readable storage medium. With this understanding, this application may implement all or part of the processes of the above method embodiments by instructing the relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium and which, when executed by a processor, implement the steps of the above method embodiments. The computer readable instructions include computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include any entity or apparatus capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunications signals.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. An audio and video call dialect recognition method, comprising:
    collecting voice data of a user during a call, and determining the dialect type of the voice data;
    dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
    acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
    analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  2. The audio and video call dialect recognition method according to claim 1, further comprising:
    analyzing the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
    updating, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  3. The audio and video call dialect recognition method according to claim 1, further comprising:
    if a pronunciation modification instruction input by the user is received, determining in the pronunciation dictionary the to-be-modified pronunciation words indicated by the pronunciation modification instruction; and
    extracting, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and updating the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
  4. The audio and video call dialect recognition method according to claim 1, further comprising:
    receiving a text modification instruction input by the user;
    determining, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replacing the to-be-modified words with the standard words indicated in the text modification instruction; and
    extracting from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
  5. The audio and video call dialect recognition method according to claim 1, further comprising:
    recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
  6. An audio and video call dialect recognition apparatus, comprising:
    a voice collection module, configured to collect voice data of a user during a call and determine the dialect type of the voice data;
    a data segment division module, configured to divide the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
    a speech recognition module, configured to acquire a pronunciation dictionary and a language model corresponding to the dialect type, perform speech recognition on the voice data segments based on the pronunciation dictionary, and determine a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
    a character screening module, configured to analyze the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and send the character sequence to the peer terminal device of the call for display.
  7. The audio and video call dialect recognition apparatus according to claim 6, further comprising:
    a correspondence analysis module, configured to analyze the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
    a first data update module, configured to update, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  8. The audio and video call dialect recognition apparatus according to claim 6, further comprising:
    a word determination module, configured to determine in the pronunciation dictionary, if a pronunciation modification instruction input by the user is received, the to-be-modified pronunciation words indicated by the pronunciation modification instruction; and
    a second data update module, configured to extract, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and update the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
  9. The audio and video call dialect recognition apparatus according to claim 6, further comprising:
    an instruction receiving module, configured to receive a text modification instruction input by the user;
    a replacement module, configured to determine, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replace the to-be-modified words with the standard words indicated in the text modification instruction; and
    a third data update module, configured to extract from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
  10. The audio and video call dialect recognition apparatus according to claim 6, further comprising:
    recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
  11. A terminal device, comprising a memory and a processor, the memory storing computer readable instructions executable on the processor, wherein the following steps are implemented when the processor executes the computer readable instructions:
    collecting voice data of a user during a call, and determining the dialect type of the voice data;
    dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
    acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
    analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  12. The terminal device according to claim 11, wherein the following steps are further implemented when the processor executes the computer readable instructions:
    analyzing the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
    updating, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  13. The terminal device according to claim 11, wherein the following steps are further implemented when the processor executes the computer readable instructions:
    if a pronunciation modification instruction input by the user is received, determining in the pronunciation dictionary the to-be-modified pronunciation words indicated by the pronunciation modification instruction; and
    extracting, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and updating the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
  14. The terminal device according to claim 11, wherein the following steps are further implemented when the processor executes the computer readable instructions:
    receiving a text modification instruction input by the user;
    determining, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replacing the to-be-modified words with the standard words indicated in the text modification instruction; and
    extracting from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
  15. The terminal device according to claim 11, wherein the following step is further implemented when the processor executes the computer readable instructions:
    recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
  16. A computer readable storage medium storing computer readable instructions, wherein the following steps are implemented when the computer readable instructions are executed by at least one processor:
    collecting voice data of a user during a call, and determining the dialect type of the voice data;
    dividing the voice data into a plurality of voice data segments, the voice data segments being in one-to-one correspondence with each word constituting the voice data;
    acquiring a pronunciation dictionary and a language model corresponding to the dialect type, performing speech recognition on the voice data segments based on the pronunciation dictionary, and determining a plurality of candidate character sequences corresponding to the voice data, wherein the pronunciation dictionary records dialect pronunciation data corresponding to words, and the language model stores grammar rule data of the dialect type; and
    analyzing the plurality of candidate character sequences based on the language model, so as to screen out, from the plurality of candidate character sequences, the character sequence with the highest matching degree with the voice data, and sending the character sequence to the peer terminal device of the call for display.
  17. The computer readable storage medium according to claim 16, wherein the steps further comprise:
    analyzing the voice data and the character sequence to determine the voice data segment corresponding to each word of the character sequence in the voice data; and
    updating, based on the analysis result, the dialect pronunciation data corresponding to each word constituting the character sequence in the pronunciation dictionary.
  18. The computer readable storage medium according to claim 16, wherein the steps further comprise:
    if a pronunciation modification instruction input by the user is received, determining in the pronunciation dictionary the to-be-modified pronunciation words indicated by the pronunciation modification instruction; and
    extracting, from the collected voice data of the user, the user pronunciation data corresponding to the to-be-modified pronunciation words, and updating the dialect pronunciation data corresponding to the to-be-modified pronunciation words based on the user pronunciation data.
  19. The computer readable storage medium according to claim 16, wherein the steps further comprise:
    receiving a text modification instruction input by the user;
    determining, in the character sequence, the to-be-modified words indicated by the text modification instruction, and replacing the to-be-modified words with the standard words indicated in the text modification instruction; and
    extracting from the voice data the voice data segments corresponding to the to-be-modified words, so as to update the dialect pronunciation data corresponding to the standard words.
  20. The computer readable storage medium according to claim 16, wherein the steps further comprise:
    recording the character sequences generated during the user's calls, and using the recorded character sequences as training sample data to train and update the language model.
PCT/CN2018/097145 2018-05-14 2018-07-26 Method, apparatus, terminal device and medium for dialect recognition in audio and video calls WO2019218467A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810456906.2 2018-05-14
CN201810456906.2A CN108682420B (zh) 2018-05-14 2018-05-14 Audio and video call dialect recognition method and terminal device

Publications (1)

Publication Number Publication Date
WO2019218467A1 true WO2019218467A1 (zh) 2019-11-21

Family

ID=63805007

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/097145 WO2019218467A1 (zh) 2018-05-14 2018-07-26 Method, apparatus, terminal device and medium for dialect recognition in audio and video calls

Country Status (2)

Country Link
CN (1) CN108682420B (zh)
WO (1) WO2019218467A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448699A (zh) * 2018-12-15 2019-03-08 深圳壹账通智能科技有限公司 Speech-to-text conversion method, apparatus, computer device and storage medium
CN110211565B (zh) * 2019-05-06 2023-04-04 平安科技(深圳)有限公司 Dialect recognition method and apparatus, and computer readable storage medium
CN110047467B (zh) * 2019-05-08 2021-09-03 广州小鹏汽车科技有限公司 Speech recognition method and apparatus, storage medium and control terminal
WO2021000068A1 (zh) * 2019-06-29 2021-01-07 播闪机械人有限公司 Speech recognition method and apparatus for non-native speakers
CN110517664B (zh) * 2019-09-10 2022-08-05 科大讯飞股份有限公司 Multi-dialect recognition method, apparatus and device, and readable storage medium
CN110827803A (zh) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Method, apparatus and device for constructing a dialect pronunciation dictionary, and readable storage medium
CN111326144B (zh) * 2020-02-28 2023-03-03 网易(杭州)网络有限公司 Voice data processing method, apparatus, medium and computing device
CN112652309A (zh) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Dialect speech conversion method, apparatus, device and storage medium
CN113053362A (zh) * 2021-03-30 2021-06-29 建信金融科技有限责任公司 Speech recognition method, apparatus, device and computer readable medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7124080B2 (en) * 2001-11-13 2006-10-17 Microsoft Corporation Method and apparatus for adapting a class entity dictionary used with language models
CN1841496A (zh) * 2005-03-31 2006-10-04 株式会社东芝 Method and apparatus for measuring speech rate, and sound recording device
US20070239455A1 (en) * 2006-04-07 2007-10-11 Motorola, Inc. Method and system for managing pronunciation dictionaries in a speech application
JP5105943B2 (ja) * 2007-04-13 2012-12-26 日本放送協会 Utterance evaluation device and utterance evaluation program
JP5029168B2 (ja) * 2007-06-25 2012-09-19 富士通株式会社 Device, program and method for reading text aloud
CN103187052B (zh) * 2011-12-29 2015-09-02 北京百度网讯科技有限公司 Method and apparatus for building a language model for speech recognition
CN103680498A (zh) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and device
CN103578471B (zh) * 2013-10-18 2017-03-01 威盛电子股份有限公司 Speech recognition method and electronic device thereof
CN103903615B (zh) * 2014-03-10 2018-11-09 联想(北京)有限公司 Information processing method and electronic device
CN106935239A (zh) * 2015-12-29 2017-07-07 阿里巴巴集团控股有限公司 Method and apparatus for constructing a pronunciation dictionary
CN107068144A (zh) * 2016-01-08 2017-08-18 王道平 Method for facilitating manual text correction in speech recognition
CN106448675B (zh) * 2016-10-21 2020-05-01 科大讯飞股份有限公司 Recognized text correction method and system
CN106531182A (zh) * 2016-12-16 2017-03-22 上海斐讯数据通信技术有限公司 Language learning system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455530A (zh) * 2012-10-25 2013-12-18 河南省佰腾电子科技有限公司 Portable device for creating a personalized text database corresponding to speech
CN103578464A (zh) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model building method, speech recognition method and electronic device
CN105573988A (zh) * 2015-04-28 2016-05-11 宇龙计算机通信科技(深圳)有限公司 Voice conversion method and terminal
CN106384593A (zh) * 2016-09-05 2017-02-08 北京金山软件有限公司 Voice information conversion and information generation method and apparatus
US20180096687A1 (en) * 2016-09-30 2018-04-05 International Business Machines Corporation Automatic speech-to-text engine selection
CN106356065A (zh) * 2016-10-31 2017-01-25 努比亚技术有限公司 Mobile terminal and voice conversion method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460214A (zh) * 2020-04-02 2020-07-28 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, apparatus, medium and device
CN111460214B (zh) * 2020-04-02 2024-04-19 北京字节跳动网络技术有限公司 Classification model training method, audio classification method, apparatus, medium and device
CN112905247A (zh) * 2021-01-25 2021-06-04 斑马网络技术有限公司 Method and apparatus for automatically detecting and switching languages, terminal device, and storage medium
CN117690416A (zh) * 2024-02-02 2024-03-12 江西科技学院 Artificial intelligence interaction method and artificial intelligence interaction system
CN117690416B (zh) * 2024-02-02 2024-04-12 江西科技学院 Artificial intelligence interaction method and artificial intelligence interaction system

Also Published As

Publication number Publication date
CN108682420A (zh) 2018-10-19
CN108682420B (zh) 2023-07-07

Similar Documents

Publication Publication Date Title
WO2019218467A1 (zh) Method, apparatus, terminal device and medium for dialect recognition in audio and video calls
KR102401942B1 (ko) Method and apparatus for evaluating translation quality
Xiao et al. "Rate my therapist": automated detection of empathy in drug and alcohol counseling via speech and language processing
WO2020224119A1 (zh) Audio corpus screening method and apparatus for speech recognition, and computer device
CN112115706B (zh) Text processing method and apparatus, electronic device and medium
CN110457673B (zh) Method and apparatus for converting natural language into sign language
Wassink et al. Uneven success: automatic speech recognition and ethnicity-related dialects
CN110135879B (zh) Automatic customer service quality scoring method based on natural language processing
CN110600033B (zh) Learning status evaluation method and apparatus, storage medium and electronic device
CN109256133A (zh) Voice interaction method, apparatus, device and storage medium
Bitchener The relationship between the negotiation of meaning and language learning: A longitudinal study
CN109102824B (zh) Speech error correction method and apparatus based on human-computer interaction
Turrisi et al. EasyCall corpus: a dysarthric speech dataset
ES2751375T3 (es) Linguistic analysis based on a selection of words, and linguistic analysis device
CN111062221A (zh) Data processing method and apparatus, electronic device and storage medium
US20210264812A1 (en) Language learning system and method
CN113393841B (zh) Training method, apparatus and device for a speech recognition model, and storage medium
CN111966839B (zh) Data processing method and apparatus, electronic device and computer storage medium
CN111831832B (zh) Word list construction method, electronic device and computer readable medium
CN112309429A (zh) Plosive loss detection method, apparatus and device, and computer readable storage medium
CN112349290B (zh) Triple-based method for computing speech recognition accuracy
US11947872B1 (en) Natural language processing platform for automated event analysis, translation, and transcription verification
CN115132182A (zh) Data recognition method, apparatus, device and readable storage medium
CN113761865A (zh) Sound-text realignment and information presentation method and apparatus, electronic device and storage medium
CN112509570B (zh) Voice signal processing method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18918890

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18918890

Country of ref document: EP

Kind code of ref document: A1