WO2018036466A1 - Speech recognition processing method and device - Google Patents

Speech recognition processing method and device

Info

Publication number
WO2018036466A1
WO2018036466A1 (application PCT/CN2017/098437, CN2017098437W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
voice
sound
voice information
time
Prior art date
Application number
PCT/CN2017/098437
Other languages
English (en)
French (fr)
Inventor
闫晓梅
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2018036466A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/103 - Formatting, i.e. changing of presentation of documents

Definitions

  • the present invention relates to the field of office equipment, and in particular, to a voice recognition processing method and apparatus.
  • in a conference, the presenter usually projects a PPT through the projector while speaking, but in many cases the content of the presenter's talk is not fully written into the PPT, which is very inconvenient for the audience, especially listeners with hearing impairments.
  • to pair speech with text, some smart projectors have added a speech recognition function that can project the presenter's improvised content in text form, but the projected text takes a single form and the display effect is poor.
  • a main object of the embodiments of the present invention is to provide a voice recognition processing method and apparatus, intended to provide multiple display forms and improve the display effect.
  • a voice recognition processing method provided by an embodiment of the present invention includes the following steps: acquiring voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features; converting the voice information into text and marking the characters in the text according to the sound features; and projecting and displaying the marked text.
  • converting the voice information into text and marking the characters in the text according to the sound features comprises: taking the start time of the speech-to-text conversion as the starting point, calculating the average amplitude of the sound wave within a first set time t1 and setting it as a first reference value X0; calculating the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer; and, when (Xn-X0)/X0 is greater than a preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
  • converting the voice information into text and marking the characters in the text according to the sound features further comprises: calculating the average frequency of the sound wave within the first set time t1 and setting it as a second reference value Y0; calculating the average frequency Yn of the sound wave within the (n+1)th t1; and, when (Yn-Y0)/Y0 is greater than the preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
  • after the marking step, with a second set time t2 as the period, when (n+1)×t1 = t2 the average amplitude of the sound wave within the first set time t1 is recalculated with the end time of the (n+1)th t1 as the new starting point and set as the first reference value X0, and the average frequency of the sound wave within the first set time t1 is recalculated and set as the second reference value Y0.
  • the marking process comprises bolding, coloring or highlighting.
  • an embodiment of the present invention further provides a voice recognition processing device, where the voice recognition processing device includes:
  • a voice acquiring module configured to acquire voice information detected by a voice recognition module of the projector, where the voice information includes voice content and sound features;
  • a voice processing module configured to convert the voice information into text, and to mark the characters in the text according to the sound features;
  • the projection module is configured to project and display the text after the mark processing.
  • the voice processing module comprises:
  • the amplitude reference unit is configured to take the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within the first set time t1, and set it as the first reference value X0;
  • an amplitude obtaining unit is configured to calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer;
  • the first marking processing unit is configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Xn-X0)/X0 is greater than a preset value.
  • the voice processing module further includes:
  • the frequency reference unit is configured to take the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1, and set it as the second reference value Y0;
  • a frequency acquisition unit is configured to calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer;
  • the second marking processing unit is configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Yn-Y0)/Y0 is greater than the preset value.
  • the voice recognition processing device further includes:
  • the reference value acquisition module is configured to take the second set time t2 as the period and, when (n+1)×t1 = t2, recalculate the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and set it as the first reference value X0, and recalculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
  • the marking process comprises bolding, coloring or highlighting.
  • a storage medium is further provided, which may store execution instructions for performing the voice recognition processing method of the above embodiments.
  • the voice recognition processing method and apparatus provided by the embodiments of the present invention first acquire the voice information detected by a voice recognition module of a projector, where the voice information comes from speech delivered by a presenter in real time or from a voice file saved in the projector; the voice information is then converted into text, the characters corresponding to important speech are colored, highlighted or bolded according to sound features such as volume or pitch, and finally the marked text is projected and displayed.
  • FIG. 1 is a schematic flow chart of a first embodiment of a voice recognition processing method according to the present invention
  • FIG. 2 is a schematic diagram of a refinement flow of converting voice information into text in a second embodiment of a voice recognition processing method according to the present invention
  • FIG. 3 is a schematic diagram of a refinement flow of converting voice information into text in a third embodiment of a voice recognition processing method according to the present invention.
  • FIG. 4 is a schematic diagram of functional modules of a first embodiment of a voice recognition processing apparatus according to the present invention.
  • FIG. 5 is a schematic diagram of the refined functional modules of the voice processing module in a second embodiment of the voice recognition processing apparatus of the present invention.
  • FIG. 6 is a schematic diagram of the refined functional modules of the voice processing module in a third embodiment of the voice recognition processing apparatus of the present invention.
  • the invention provides a speech recognition processing method and device.
  • the voice recognition processing method includes the following steps:
  • Step S100: acquiring the voice information detected by the voice recognition module of the projector, the voice information including voice content and sound features.
  • there are generally two sources for the voice information detected by the voice recognition module of the projector: one is real-time voice information that the projector acquires from microphones, where two microphones may be used to capture the speech so that speech noise can be reduced through noise reduction technology; the other is voice information obtained from a voice file saved on the projector itself.
  • Step S200 Convert the voice information into text, and perform marking processing on the characters in the text according to the sound feature.
  • before speech recognition starts, it is sometimes necessary to cut off the silence at the beginning and end of the audio to reduce interference with subsequent steps; this silence removal is generally called VAD (Voice Activity Detection) and relies on signal processing techniques. To analyze the sound it must be divided into frames, i.e., cut into many short segments; framing is generally implemented with a moving window function rather than a simple cut, and adjacent frames overlap: with a frame length of 25 ms and a frame shift of 10 ms, every two frames overlap by 25 - 10 = 15 ms. After framing, the speech becomes many short segments.
  • the waveform has almost no descriptive power in the time domain, so the waveform must be transformed.
  • a common transformation is to extract MFCC features: according to the physiological characteristics of the human ear, each frame's waveform is turned into a multi-dimensional vector which, roughly speaking, contains the content information of that frame of speech. This process is called acoustic feature extraction. At this point the sound has become a matrix of 12 rows (assuming the acoustic feature is 12-dimensional) and N columns, called the observation sequence, where N is the total number of frames. Each frame is represented by a 12-dimensional vector, and the shade of each color patch indicates the magnitude of the vector value. The following describes how this matrix is turned into text. Two concepts are introduced first:
  • Phoneme: the pronunciation of a word is made up of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University. Chinese generally uses all the initials and finals directly as the phoneme set, and Chinese recognition further distinguishes tonal from atonal variants.
  • State: a speech unit finer-grained than a phoneme; a phoneme is usually divided into three states.
  • the first step is to identify the frame as a state
  • the second step is to combine the states into phonemes
  • the third step is to combine the phonemes into words.
  • Each small vertical bar represents one frame: several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word.
  • Sound is a sound wave generated by the vibration of an object. It is a wave that propagates through a medium (air or solid, liquid) and can be perceived by human or animal auditory organs.
  • the object that initially produces the vibration is called the sound source.
  • as a kind of wave, sound has frequency and amplitude as the important attributes describing it.
  • frequency corresponds to what we usually call pitch.
  • sounds with frequencies between 20 Hz and 20 kHz can be perceived by the human ear.
  • amplitude affects the loudness of the sound.
  • a sound can be decomposed into a superposition of sine waves of different frequencies and intensities; this process of transformation (or decomposition) is called the Fourier transform.
  • Sound has many characteristics, such as loudness, pitch, and timbre, and we distinguish sounds by these characteristics. Loudness is the subjective perception of how strong a sound is (commonly called volume), measured in decibels (dB); it is determined by the amplitude and by the listener's distance from the sound source: the larger the amplitude, the greater the loudness, and the smaller the distance between listener and source, the greater the loudness.
  • Pitch indicates how high or low a sound is (treble, bass), measured in hertz (Hz); it is determined by frequency, and the higher the frequency, the higher the pitch.
  • using these sound characteristics, the speech content that the presenter (or a voice file) emphasizes can be identified, and the corresponding characters can be marked, for example bolded, colored, highlighted, or underlined; the audience can quickly grasp the most important content, which improves the effectiveness of the presentation and also makes the projector more engaging to use.
  • Step S300: the marked text is projected and displayed.
  • this technique is not limited to projectors; any other application that converts speech to text can incorporate marking the text according to sound characteristics.
  • the voice recognition processing method proposed by the present invention first acquires the voice information detected by the voice recognition module of a projector, where the voice information comes from speech delivered by the presenter in real time or from a voice file saved in the projector; it then converts the voice information into text and marks the characters corresponding to important speech by coloring, highlighting or bolding them according to sound features such as volume or pitch; finally, the marked text is projected and displayed.
  • in a second embodiment of the voice recognition processing method of the present invention, based on the first embodiment, the step of converting the voice information into text and marking the characters in the text according to the sound features includes:
  • Step S210: taking the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within the first set time t1 and set it as the first reference value X0.
  • in this embodiment the first set time t1 is 1 s: the average amplitude of the sound wave during the first second after the starting point is calculated and set as the first reference value X0.
  • Step S220: calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer.
  • after X0 is set, the average amplitude of the sound wave is obtained for each subsequent second, i.e., the average amplitudes Xn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
  • Step S230: when (Xn-X0)/X0 is greater than the preset value, the characters corresponding to the voice information in the (n+1)th t1 are marked.
  • in a third embodiment, the step of converting the voice information into text and marking the characters in the text according to the sound features further includes:
  • Step S211: taking the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
  • this embodiment uses the frequency of the sound wave to judge whether the speech is a part that needs emphasis; the first set time t1 is again 1 s, and the average frequency of the sound wave during the first second after the starting point is calculated and set as the second reference value Y0.
  • Step S221: calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer.
  • after Y0 is set, the average frequency of the sound wave is obtained for each subsequent second, i.e., the average frequencies Yn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
  • Step S231: when (Yn-Y0)/Y0 is greater than the preset value, the characters corresponding to the voice information in the (n+1)th t1 are marked.
  • preferably, after converting the voice information into text and marking the characters according to the sound features, the method further includes: with the second set time t2 as the period, when (n+1)×t1 = t2, recalculating the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and setting it as the first reference value X0, and recalculating the average frequency of the sound wave within the first set time t1 and setting it as the second reference value Y0.
  • for example, if the second set time t2 is 10 min and the first set time t1 is 1 s, reaching the 600th second marks the end of one period and the next period begins at the 601st second; at that point the average amplitude of the sound wave in the 601st second is re-determined and set as the new first reference value X0, after which steps S220 and S230 continue, and the average frequency of the sound wave in the 601st second is re-determined and set as the new second reference value Y0, after which steps S221 and S231 continue.
  • re-determining X0 and Y0 at intervals makes it possible to judge more accurately whether the speech of the presenter or of the voice file has changed: if the volume rises and/or a higher pitch is used, the speech is a part that needs emphasis, and the corresponding characters are marked.
  • the speech recognition processing method further includes: saving the marked text to a mobile device connected to the projector.
  • a voice recognition processing apparatus includes:
  • the voice acquiring module 100 is configured to acquire voice information detected by a voice recognition module of the projector, where the voice information includes voice content and sound features.
  • there are generally two sources for the voice information detected by the voice recognition module of the projector: one is real-time voice information that the projector acquires from microphones, where two microphones may be used to capture the speech so that speech noise can be reduced through noise reduction technology; the other is voice information obtained from a voice file saved on the projector itself.
  • the voice processing module 200 is configured to convert the voice information into text, and perform tag processing on the characters in the text according to the sound feature.
  • the waveform has almost no descriptive power in the time domain, so the waveform must be transformed.
  • a common transformation is to extract MFCC features: according to the physiological characteristics of the human ear, each frame's waveform is turned into a multi-dimensional vector which, roughly speaking, contains the content information of that frame of speech. This process is called acoustic feature extraction. At this point the sound has become a matrix of 12 rows (assuming the acoustic feature is 12-dimensional) and N columns, called the observation sequence, where N is the total number of frames. Each frame is represented by a 12-dimensional vector, and the shade of each color patch indicates the magnitude of the vector value. The following describes how this matrix is turned into text. Two concepts are introduced first:
  • Phoneme: the pronunciation of a word is made up of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University. Chinese generally uses all the initials and finals directly as the phoneme set, and Chinese recognition further distinguishes tonal from atonal variants.
  • State: a speech unit finer-grained than a phoneme; a phoneme is usually divided into three states.
  • the first step is to identify the frame as a state
  • the second step is to combine the states into phonemes
  • the third step is to combine the phonemes into words.
  • Each small vertical bar represents one frame: several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word.
  • Sound is a sound wave generated by the vibration of an object. It is a wave that propagates through a medium (air or solid, liquid) and can be perceived by human or animal auditory organs.
  • the object that initially produces the vibration is called the sound source.
  • as a kind of wave, sound has frequency and amplitude as the important attributes describing it.
  • frequency corresponds to what we usually call pitch.
  • sounds with frequencies between 20 Hz and 20 kHz can be perceived by the human ear.
  • amplitude affects the loudness of the sound.
  • a sound can be decomposed into a superposition of sine waves of different frequencies and intensities; this process of transformation (or decomposition) is called the Fourier transform.
  • Sound has many characteristics, such as loudness, pitch, and timbre, and we distinguish sounds by these characteristics. Loudness is the subjective perception of how strong a sound is (commonly called volume), measured in decibels (dB); it is determined by the amplitude and by the listener's distance from the sound source: the larger the amplitude, the greater the loudness, and the smaller the distance between listener and source, the greater the loudness.
  • Pitch indicates how high or low a sound is (treble, bass), measured in hertz (Hz); it is determined by frequency, and the higher the frequency, the higher the pitch.
  • using these sound characteristics, the speech content that the presenter (or a voice file) emphasizes can be identified, and the corresponding characters can be marked, for example bolded, colored, highlighted, or underlined; the audience can quickly grasp the most important content, which improves the effectiveness of the presentation and also makes the projector more engaging to use.
  • the projection module 300 is configured to project and display the text after the mark processing.
  • this technique is not limited to projectors; any other application that converts speech to text can incorporate marking the text according to sound characteristics.
  • in the voice recognition processing device of the present invention, the voice acquiring module 100 first acquires the voice information detected by the voice recognition module of the projector, where the voice information comes from speech delivered by the presenter in real time or from a voice file saved in the projector; the voice processing module 200 then converts the voice information into text and marks the characters corresponding to important speech by coloring, highlighting or bolding them according to sound features such as volume or pitch; finally, the projection module 300 projects and displays the marked text.
  • the voice processing module 200 includes:
  • the amplitude reference unit 210 is configured to take the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within the first set time t1, and set it as the first reference value X0.
  • in this embodiment the first set time t1 is 1 s: the average amplitude of the sound wave during the first second after the starting point is calculated and set as the first reference value X0.
  • the amplitude acquisition unit 220 is configured to calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer.
  • after X0 is set, the average amplitude of the sound wave is obtained for each subsequent second, i.e., the average amplitudes Xn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
  • the first marking processing unit 230 is configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Xn-X0)/X0 is greater than a preset value.
  • the voice processing module 200 further includes:
  • the frequency reference unit 211 is configured to take the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1, and set it as the second reference value Y0.
  • this embodiment uses the frequency of the sound wave to judge whether the speech is a part that needs emphasis; the first set time t1 is again 1 s, and the average frequency of the sound wave during the first second after the starting point is calculated and set as the second reference value Y0.
  • the frequency acquisition unit 221 is configured to calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer.
  • after Y0 is set, the average frequency of the sound wave is obtained for each subsequent second, i.e., the average frequencies Yn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
  • the second marking processing unit 231 is configured to mark the text corresponding to the voice information in the (n+1)th t1 when (Yn-Y0)/Y0 is greater than the preset value.
  • the voice recognition processing device further includes:
  • the reference value acquisition module is configured to take the second set time t2 as the period and, when (n+1)×t1 = t2, recalculate the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and set it as the first reference value X0, and recalculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
  • for example, if the second set time t2 is 10 min and the first set time t1 is 1 s, reaching the 600th second marks the end of one period and the next period begins at the 601st second; at that point the average amplitude and the average frequency of the sound wave in the 601st second are re-determined and set, respectively, as the new first reference value X0 and the new second reference value Y0.
  • re-determining X0 and Y0 at intervals makes it possible to judge more accurately whether the speech of the presenter or of the voice file has changed: if the volume rises and/or a higher pitch is used, the speech is a part that needs emphasis, and the corresponding characters are marked.
  • the foregoing technical solution provided by the embodiments of the present invention can be applied to a voice recognition process: the voice information detected by the voice recognition module of a projector is first acquired, where the voice information comes from speech delivered by a presenter in real time or from a voice file saved in the projector; the voice information is then converted into text, the characters corresponding to important speech are colored, highlighted or bolded according to sound features such as volume or pitch, and finally the marked text is projected and displayed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

A speech recognition processing method and device. The method includes the following steps: acquiring voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features (S100); converting the voice information into text and marking the characters in the text according to the sound features (S200); and projecting and displaying the marked text (S300). The method achieves multiple display forms and improves the display effect.

Description

Speech recognition processing method and device
Technical Field
The present invention relates to the field of office equipment, and in particular to a speech recognition processing method and device.
Background Art
In conferences, a presenter usually projects a PPT through a projector while speaking, but in many cases the content of the talk is not fully written into the PPT, which is very inconvenient for the audience, especially listeners with hearing impairments. To pair the speech with text, some smart projectors have already added a speech recognition function that can project the presenter's improvised content as text, but the projected text takes a single form and the effect is poor.
Summary of the Invention
A main object of the embodiments of the present invention is to provide a speech recognition processing method and device, intended to provide multiple display forms and improve the display effect.
To achieve the above object, a speech recognition processing method provided by an embodiment of the present invention includes the following steps:
acquiring voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features;
converting the voice information into text, and marking the characters in the text according to the sound features;
projecting and displaying the marked text.
Preferably, converting the voice information into text and marking the characters in the text according to the sound features includes:
taking the start time of the speech-to-text conversion as the starting point, calculating the average amplitude of the sound wave within a first set time t1 and setting it as a first reference value X0;
calculating the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer;
when (Xn-X0)/X0 is greater than a preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
Preferably, converting the voice information into text and marking the characters in the text according to the sound features further includes:
taking the start time of the speech-to-text conversion as the starting point, calculating the average frequency of the sound wave within the first set time t1 and setting it as a second reference value Y0;
calculating the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer;
when (Yn-Y0)/Y0 is greater than the preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
Preferably, after converting the voice information into text and marking the characters in the text according to the sound features, the method further includes:
with a second set time t2 as the period, when (n+1)×t1 = t2, recalculating the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and setting it as the first reference value X0, and recalculating the average frequency of the sound wave within the first set time t1 and setting it as the second reference value Y0.
Preferably, the marking includes bolding, coloring, or highlighting.
In addition, to achieve the above object, an embodiment of the present invention further provides a speech recognition processing device, which includes:
a voice acquiring module, configured to acquire voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features;
a voice processing module, configured to convert the voice information into text and mark the characters in the text according to the sound features;
a projection module, configured to project and display the marked text.
Preferably, the voice processing module includes:
an amplitude reference unit, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within a first set time t1, and set it as a first reference value X0;
an amplitude acquisition unit, configured to calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer;
a first marking processing unit, configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Xn-X0)/X0 is greater than a preset value.
Preferably, the voice processing module further includes:
a frequency reference unit, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1, and set it as a second reference value Y0;
a frequency acquisition unit, configured to calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer;
a second marking processing unit, configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Yn-Y0)/Y0 is greater than the preset value.
Preferably, the speech recognition processing device further includes:
a reference value acquisition module, configured to take a second set time t2 as the period and, when (n+1)×t1 = t2, recalculate the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and set it as the first reference value X0, and recalculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
Preferably, the marking includes bolding, coloring, or highlighting.
In an embodiment of the present invention, a storage medium is further provided; the storage medium may store execution instructions for performing the speech recognition processing method of the above embodiments.
The speech recognition processing method and device proposed by the embodiments of the present invention first acquire the voice information detected by the voice recognition module of a projector, where the voice information comes from speech delivered by a presenter in real time or from a voice file saved in the projector; the voice information is then converted into text, the characters corresponding to important speech are colored, highlighted, or bolded according to sound features such as volume or pitch, and finally the marked text is projected and displayed.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a first embodiment of the speech recognition processing method of the present invention;
FIG. 2 is a schematic diagram of the refined flow of converting voice information into text in a second embodiment of the speech recognition processing method of the present invention;
FIG. 3 is a schematic diagram of the refined flow of converting voice information into text in a third embodiment of the speech recognition processing method of the present invention;
FIG. 4 is a schematic diagram of the functional modules of a first embodiment of the speech recognition processing device of the present invention;
FIG. 5 is a schematic diagram of the refined functional modules of the voice processing module in a second embodiment of the speech recognition processing device of the present invention;
FIG. 6 is a schematic diagram of the refined functional modules of the voice processing module in a third embodiment of the speech recognition processing device of the present invention.
The realization of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
The present invention provides a speech recognition processing method and device.
Referring to FIG. 1, in a first embodiment of the speech recognition processing method of the present invention, the method includes the following steps:
Step S100: acquire the voice information detected by the voice recognition module of the projector, the voice information including voice content and sound features.
Specifically, there are generally two sources for the voice information detected by the voice recognition module of the projector. One is real-time voice information that the projector acquires from microphones; two microphones may be used to capture the speech, the goal being to reduce speech noise through noise reduction technology. The other is voice information obtained from a voice file saved on the projector itself.
Step S200: convert the voice information into text, and mark the characters in the text according to the sound features.
Specifically, converting voice information into text is by now a fairly mature technology; to aid understanding of the present invention, its principle is introduced below.
Before speech recognition starts, it is sometimes necessary to cut off the silence at the beginning and end of the audio to reduce interference with subsequent steps. This silence removal operation is generally called VAD (Voice Activity Detection) and requires some signal processing techniques. To analyze the sound, it must be divided into frames, that is, cut into many short segments, each called a frame. Framing is generally implemented with a moving window function rather than a simple cut, and adjacent frames overlap: with a frame length of 25 ms and a frame shift of 10 ms, every two frames overlap by 25 - 10 = 15 ms. After framing, the speech becomes many short segments, but a waveform has almost no descriptive power in the time domain, so it must be transformed. A common transformation is to extract MFCC features: according to the physiological characteristics of the human ear, each frame's waveform is turned into a multi-dimensional vector which, roughly speaking, contains the content information of that frame of speech. This process is called acoustic feature extraction. At this point the sound has become a matrix of 12 rows (assuming the acoustic feature is 12-dimensional) and N columns, called the observation sequence, where N is the total number of frames; each frame is represented by a 12-dimensional vector, and the shade of each color patch indicates the magnitude of the vector value. The following describes how this matrix is turned into text.
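As an illustration of the framing and feature extraction just described, here is a minimal sketch in Python; it assumes the librosa library, an input file name of speech.wav, and a 16 kHz sampling rate (all illustrative), while the 25 ms frame length, 10 ms frame shift, and 12 coefficients follow the example values in the text.

```python
import librosa

# Load a mono waveform; the file name and sampling rate are assumptions.
y, sr = librosa.load("speech.wav", sr=16000)

frame_length = int(0.025 * sr)  # frame length 25 ms -> 400 samples
hop_length = int(0.010 * sr)    # frame shift 10 ms -> 160 samples

# Extract 12-dimensional MFCC features: the result is the 12-row, N-column
# "observation sequence" matrix, one column per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=frame_length, hop_length=hop_length)
print(mfcc.shape)  # (12, N), where N is the total number of frames
```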
Two concepts are introduced first.
Phoneme: the pronunciation of a word is made up of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University. Chinese generally uses all the initials and finals directly as the phoneme set, and Chinese recognition further distinguishes tonal from atonal variants.
State: a speech unit finer-grained than a phoneme. A phoneme is usually divided into three states.
Speech recognition is then performed in the following steps:
First, identify the frames as states;
Second, combine the states into phonemes;
Third, combine the phonemes into words.
Each small vertical bar represents one frame: several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word.
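These three combination steps can be made concrete with a toy sketch; the state and phoneme tables below are invented purely for illustration, and a real recognizer would use trained acoustic and language models rather than fixed lookup tables.

```python
from itertools import groupby

# Invented toy tables: three states form a phoneme, phonemes form a word.
STATE_TO_PHONEME = {("s1", "s2", "s3"): "h", ("s4", "s5", "s6"): "ai"}
PHONEMES_TO_WORD = {("h", "ai"): "hi"}

def decode(frame_states):
    # Step 1: several consecutive frames correspond to one state.
    states = [s for s, _ in groupby(frame_states)]
    # Step 2: every three states combine into one phoneme.
    phonemes = [STATE_TO_PHONEME[tuple(states[i:i + 3])]
                for i in range(0, len(states), 3)]
    # Step 3: the phonemes combine into a word.
    return PHONEMES_TO_WORD[tuple(phonemes)]

frames = ["s1", "s1", "s2", "s2", "s2", "s3", "s4", "s5", "s5", "s6"]
print(decode(frames))  # -> "hi"
```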
Sound is a sound wave produced by the vibration of an object; it is a wave phenomenon that propagates through a medium (air, or a solid or liquid) and can be perceived by the auditory organs of humans or animals. The object that initially produces the vibration is called the sound source.
As a kind of wave, sound has frequency and amplitude as the important attributes describing it. Frequency corresponds to what we usually call pitch, and sounds with frequencies between 20 Hz and 20 kHz can be perceived by the human ear, while amplitude affects the loudness of the sound. A sound can be decomposed into a superposition of sine waves of different frequencies and intensities; this process of transformation (or decomposition) is called the Fourier transform. Sound has many characteristics, such as loudness, pitch, and timbre, and it is by these characteristics that we distinguish sounds. Loudness is the subjective perception of how strong a sound is (commonly called volume), measured in decibels (dB); it is determined by the amplitude and by the listener's distance from the sound source: the larger the amplitude, the greater the loudness, and the smaller the distance between listener and source, the greater the loudness. Pitch indicates how high or low a sound is (treble, bass), measured in hertz (Hz); it is determined by frequency, and the higher the frequency, the higher the pitch.
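The two wave attributes the method relies on, amplitude (the basis for the reference values X0 and Xn) and frequency (the basis for Y0 and Yn), can be computed per window roughly as follows. This sketch assumes a mono numpy waveform; reading "average amplitude" as the mean absolute sample value and "average frequency" as the strongest FFT component is one plausible interpretation, not something the text mandates.

```python
import numpy as np

def window_features(samples: np.ndarray, sr: int):
    amplitude = float(np.mean(np.abs(samples)))      # average amplitude
    spectrum = np.abs(np.fft.rfft(samples))          # Fourier transform
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    dominant = float(freqs[np.argmax(spectrum)])     # strongest component
    return amplitude, dominant

# One synthetic 1 s window (t1 = 1 s) containing a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
window = 0.5 * np.sin(2 * np.pi * 440 * t)
print(window_features(window, sr))  # approx. (0.318, 440.0)
```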
Using these sound characteristics, the speech content that the presenter (or a voice file) emphasizes can be identified, and the characters corresponding to that content can be marked, for example with bold, color, highlighting, or underlining; the audience can then quickly grasp the most important content, which improves the effectiveness of the presentation and also makes the projector more engaging to use.
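The marking step itself can be sketched as below, assuming the recognized text arrives as (text, flag) segments in the flag scheme described in the embodiments that follow; the HTML-style bold and highlight tags are illustrative stand-ins for the bolding, coloring, highlighting, or underlining mentioned above.

```python
def render(segments):
    # Wrap flagged segments in emphasis tags; leave the rest untouched.
    parts = [f"<b><mark>{text}</mark></b>" if flag else text
             for text, flag in segments]
    return "".join(parts)

print(render([("The deadline is ", 0), ("next Friday", 1), (".", 0)]))
# -> The deadline is <b><mark>next Friday</mark></b>.
```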
Step S300: project and display the marked text.
This technique is not limited to projectors; any other application that converts speech to text can incorporate marking the text according to sound characteristics.
In the speech recognition processing method proposed by the present invention, the voice information detected by the voice recognition module of the projector is first acquired, where the voice information comes from speech delivered by the presenter in real time or from a voice file saved in the projector; the voice information is then converted into text, and the characters corresponding to important speech are marked, for example colored, highlighted, or bolded, according to sound features such as volume or pitch; finally, the marked text is projected and displayed.
Further, referring to FIG. 2, in a second embodiment of the speech recognition processing method of the present invention, based on the first embodiment, the step of converting the voice information into text and marking the characters in the text according to the sound features includes:
Step S210: taking the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within the first set time t1 and set it as the first reference value X0.
Specifically, in this embodiment the first set time t1 is 1 s. Taking the start time of the speech-to-text conversion as the starting point, the average amplitude of the sound wave during the first second after the starting point is calculated and set as the first reference value X0.
Step S220: calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer.
Specifically, after the first reference value X0 is set, the average amplitude of the sound wave is obtained for each subsequent second, i.e., the average amplitudes Xn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
Step S230: when (Xn-X0)/X0 is greater than the preset value, mark the characters corresponding to the voice information in the (n+1)th t1.
Specifically, a flag is defined to control the marking of characters. After Xn is obtained, (Xn-X0)/X0 is computed. If (Xn-X0)/X0 is greater than the preset value, for example 10%, the presenter has raised the volume and this speech is content the presenter wants to emphasize, so flag=1 is assigned; if (Xn-X0)/X0 is less than or equal to 10%, this part of the talk is not a key point, so flag=0 is assigned. When the speech is converted into text, the characters are bolded, colored, or highlighted if flag=1, and marking stops if flag=0.
By defining the first reference value X0 and comparing the subsequently obtained average amplitudes against it, this embodiment judges whether the speech is content that needs emphasis and marks the characters accordingly, so that the audience can intuitively grasp the key content and the presentation is more effective.
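The decision rule of this embodiment can be sketched as follows, assuming a stream of 1 s windows (t1 = 1 s), each paired with the text recognized during that second; the helper mean_amplitude and the 10% threshold follow the example above but are otherwise illustrative.

```python
import numpy as np

def mean_amplitude(window: np.ndarray) -> float:
    return float(np.mean(np.abs(window)))

def mark_by_amplitude(windows, texts, preset=0.10):
    x0 = mean_amplitude(windows[0])      # first second -> reference X0
    marked = [(texts[0], 0)]             # the reference window is unmarked
    for window, text in zip(windows[1:], texts[1:]):
        xn = mean_amplitude(window)      # (n+1)-th second -> Xn
        flag = 1 if (xn - x0) / x0 > preset else 0
        marked.append((text, flag))      # flag=1 -> emphasize the characters
    return marked
```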
Further, referring to FIG. 3, in a third embodiment of the speech recognition processing method of the present invention, on the basis of the first or second embodiment, the step of converting the voice information into text and marking the characters in the text according to the sound features further includes:
Step S211: taking the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
Specifically, this embodiment uses the frequency of the sound wave as the basis for judging whether the speech is a part that needs emphasis. In this embodiment the first set time t1 is again 1 s. Taking the start time of the speech-to-text conversion as the starting point, the average frequency of the sound wave during the first second after the starting point is calculated and set as the second reference value Y0.
Step S221: calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer.
Specifically, after the second reference value Y0 is set, the average frequency of the sound wave is obtained for each subsequent second, i.e., the average frequencies Yn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
Step S231: when (Yn-Y0)/Y0 is greater than the preset value, mark the characters corresponding to the voice information in the (n+1)th t1.
Specifically, the flag is again used to control the marking of characters. After Yn is obtained, (Yn-Y0)/Y0 is computed. If (Yn-Y0)/Y0 is greater than the preset value, for example 10%, the presenter is using a higher pitch and this speech is content the presenter wants to emphasize, so flag=1 is assigned; if (Yn-Y0)/Y0 is less than or equal to 10%, this part of the talk is not a key point, so flag=0 is assigned. When the speech is converted into text, the characters are bolded, colored, or highlighted if flag=1, and marking stops if flag=0.
The frequency of the sound wave may be used together with the amplitude of the second embodiment to judge whether the speech is key content, or the amplitude or the frequency may be used alone.
Preferably, after converting the voice information into text and marking the characters in the text according to the sound features, the method further includes:
with the second set time t2 as the period, when (n+1)×t1 = t2, recalculating the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and setting it as the first reference value X0, and recalculating the average frequency of the sound wave within the first set time t1 and setting it as the second reference value Y0.
Specifically, if the second set time t2 is 10 min and the first set time t1 is 1 s, then reaching the 600th second marks the end of one period, and the next period begins at the 601st second. At that point the average amplitude of the sound wave in the 601st second is re-determined and set as the new first reference value X0, after which steps S220 and S230 continue; likewise, the average frequency of the sound wave in the 601st second is re-determined and set as the new second reference value Y0, after which steps S221 and S231 continue.
Re-determining the first reference value X0 and the second reference value Y0 at intervals makes it possible to judge more accurately whether the speech of the presenter or of the voice file has changed: if the volume rises and/or a higher pitch is used, the speech is a part that needs emphasis, and the characters corresponding to that speech are marked.
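Combining the amplitude test, the frequency test, and the periodic baseline reset gives roughly the sketch below: with t1 = 1 s and t2 = 10 min, one period spans 600 windows, and both X0 and Y0 are recomputed from the first window of each new period. The feature computation repeats the assumptions of the earlier sketches.

```python
import numpy as np

def features(window: np.ndarray, sr: int):
    # Average amplitude and dominant frequency of one t1 window (see above).
    amp = float(np.mean(np.abs(window)))
    spec = np.abs(np.fft.rfft(window))
    freq = float(np.fft.rfftfreq(len(window), d=1.0 / sr)[np.argmax(spec)])
    return amp, freq

def mark_stream(windows, texts, sr, t2_windows=600, preset=0.10):
    marked, x0, y0 = [], None, None
    for n, (window, text) in enumerate(zip(windows, texts)):
        if n % t2_windows == 0:            # first t1 of each t2 period
            x0, y0 = features(window, sr)  # re-determine X0 and Y0
            marked.append((text, 0))       # baseline window stays unmarked
            continue
        xn, yn = features(window, sr)
        flag = 1 if ((xn - x0) / x0 > preset or
                     (yn - y0) / y0 > preset) else 0
        marked.append((text, flag))
    return marked
```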
Further, the speech recognition processing method also includes: saving the marked text to a mobile device connected to the projector.
Referring to FIG. 4, the speech recognition processing device proposed in a first embodiment of the present invention includes:
the voice acquiring module 100, configured to acquire the voice information detected by the voice recognition module of the projector, the voice information including voice content and sound features.
Specifically, there are generally two sources for the voice information detected by the voice recognition module of the projector. One is real-time voice information that the projector acquires from microphones; two microphones may be used to capture the speech, the goal being to reduce speech noise through noise reduction technology. The other is voice information obtained from a voice file saved on the projector itself.
the voice processing module 200, configured to convert the voice information into text and mark the characters in the text according to the sound features.
Specifically, converting voice information into text is by now a fairly mature technology; to aid understanding of the present invention, its principle is introduced below.
Before speech recognition starts, it is sometimes necessary to cut off the silence at the beginning and end of the audio to reduce interference with subsequent steps. This silence removal operation is generally called VAD (Voice Activity Detection) and requires some signal processing techniques. To analyze the sound, it must be divided into frames, that is, cut into many short segments, each called a frame. Framing is generally implemented with a moving window function rather than a simple cut, and adjacent frames overlap: with a frame length of 25 ms and a frame shift of 10 ms, every two frames overlap by 25 - 10 = 15 ms. After framing, the speech becomes many short segments, but a waveform has almost no descriptive power in the time domain, so it must be transformed. A common transformation is to extract MFCC features: according to the physiological characteristics of the human ear, each frame's waveform is turned into a multi-dimensional vector which, roughly speaking, contains the content information of that frame of speech. This process is called acoustic feature extraction. At this point the sound has become a matrix of 12 rows (assuming the acoustic feature is 12-dimensional) and N columns, called the observation sequence, where N is the total number of frames; each frame is represented by a 12-dimensional vector, and the shade of each color patch indicates the magnitude of the vector value. Two concepts are introduced before describing how this matrix is turned into text.
Phoneme: the pronunciation of a word is made up of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University. Chinese generally uses all the initials and finals directly as the phoneme set, and Chinese recognition further distinguishes tonal from atonal variants.
State: a speech unit finer-grained than a phoneme. A phoneme is usually divided into three states.
Speech recognition is then performed in the following steps:
First, identify the frames as states;
Second, combine the states into phonemes;
Third, combine the phonemes into words.
Each small vertical bar represents one frame: several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word.
Sound is a sound wave produced by the vibration of an object; it is a wave phenomenon that propagates through a medium (air, or a solid or liquid) and can be perceived by the auditory organs of humans or animals. The object that initially produces the vibration is called the sound source.
As a kind of wave, sound has frequency and amplitude as the important attributes describing it. Frequency corresponds to what we usually call pitch, and sounds with frequencies between 20 Hz and 20 kHz can be perceived by the human ear, while amplitude affects the loudness of the sound. A sound can be decomposed into a superposition of sine waves of different frequencies and intensities; this process of transformation (or decomposition) is called the Fourier transform. Sound has many characteristics, such as loudness, pitch, and timbre, and it is by these characteristics that we distinguish sounds. Loudness is the subjective perception of how strong a sound is (commonly called volume), measured in decibels (dB); it is determined by the amplitude and by the listener's distance from the sound source: the larger the amplitude, the greater the loudness, and the smaller the distance between listener and source, the greater the loudness. Pitch indicates how high or low a sound is (treble, bass), measured in hertz (Hz); it is determined by frequency, and the higher the frequency, the higher the pitch.
Using these sound characteristics, the speech content that the presenter (or a voice file) emphasizes can be identified, and the characters corresponding to that content can be marked, for example with bold, color, highlighting, or underlining; the audience can then quickly grasp the most important content, which improves the effectiveness of the presentation and also makes the projector more engaging to use.
the projection module 300, configured to project and display the marked text.
This technique is not limited to projectors; any other application that converts speech to text can incorporate marking the text according to sound characteristics.
In the speech recognition processing device proposed by the present invention, the voice acquiring module 100 first acquires the voice information detected by the voice recognition module of the projector, where the voice information comes from speech delivered by the presenter in real time or from a voice file saved in the projector; the voice processing module 200 then converts the voice information into text and marks the characters corresponding to important speech by coloring, highlighting, or bolding them according to sound features such as volume or pitch; finally, the projection module 300 projects and displays the marked text.
Further, referring to FIG. 5, in a second embodiment of the speech recognition processing device of the present invention, on the basis of the first embodiment, the voice processing module 200 includes:
the amplitude reference unit 210, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within the first set time t1, and set it as the first reference value X0.
Specifically, in this embodiment the first set time t1 is 1 s: the average amplitude of the sound wave during the first second after the starting point is calculated and set as the first reference value X0.
the amplitude acquisition unit 220, configured to calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer.
Specifically, after the first reference value X0 is set, the average amplitude of the sound wave is obtained for each subsequent second, i.e., the average amplitudes Xn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
the first marking processing unit 230, configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Xn-X0)/X0 is greater than the preset value.
Specifically, a flag is defined to control the marking of characters. After Xn is obtained, (Xn-X0)/X0 is computed. If (Xn-X0)/X0 is greater than the preset value, for example 10%, the presenter has raised the volume and this speech is content the presenter wants to emphasize, so flag=1 is assigned; if (Xn-X0)/X0 is less than or equal to 10%, this part of the talk is not a key point, so flag=0 is assigned. When the speech is converted into text, the characters are bolded, colored, or highlighted if flag=1, and marking stops if flag=0.
By defining the first reference value X0 and comparing the subsequently obtained average amplitudes against it, this embodiment judges whether the speech is content that needs emphasis and marks the characters accordingly, so that the audience can intuitively grasp the key content and the presentation is more effective.
Further, referring to FIG. 6, in a third embodiment of the speech recognition processing device of the present invention, on the basis of the first or second embodiment, the voice processing module 200 further includes:
the frequency reference unit 211, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1, and set it as the second reference value Y0.
Specifically, this embodiment uses the frequency of the sound wave as the basis for judging whether the speech is a part that needs emphasis. In this embodiment the first set time t1 is again 1 s. Taking the start time of the speech-to-text conversion as the starting point, the average frequency of the sound wave during the first second after the starting point is calculated and set as the second reference value Y0.
the frequency acquisition unit 221, configured to calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer.
Specifically, after the second reference value Y0 is set, the average frequency of the sound wave is obtained for each subsequent second, i.e., the average frequencies Yn of the sound wave in the 2nd s, 3rd s, 4th s, ..., (n+1)th s.
the second marking processing unit 231, configured to mark the text corresponding to the voice information in the (n+1)th t1 when (Yn-Y0)/Y0 is greater than the preset value.
Specifically, the flag is again used to control the marking. After Yn is obtained, (Yn-Y0)/Y0 is computed. If (Yn-Y0)/Y0 is greater than the preset value, for example 10%, the presenter is using a higher pitch and this speech is content the presenter wants to emphasize, so flag=1 is assigned; if (Yn-Y0)/Y0 is less than or equal to 10%, this part of the talk is not a key point, so flag=0 is assigned. When the speech is converted into text, the characters are bolded, colored, or highlighted if flag=1, and marking stops if flag=0.
The frequency of the sound wave may be used together with the amplitude of the second embodiment to judge whether the speech is key content, or the amplitude or the frequency may be used alone.
Preferably, the speech recognition processing device further includes:
a reference value acquisition module, configured to take the second set time t2 as the period and, when (n+1)×t1 = t2, recalculate the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and set it as the first reference value X0, and recalculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
Specifically, if the second set time t2 is 10 min and the first set time t1 is 1 s, reaching the 600th second marks the end of one period and the next period begins at the 601st second; at that point the average amplitude and the average frequency of the sound wave in the 601st second are re-determined and set, respectively, as the new first reference value X0 and the new second reference value Y0.
Re-determining the first reference value X0 and the second reference value Y0 at intervals makes it possible to judge more accurately whether the speech of the presenter or of the voice file has changed: if the volume rises and/or a higher pitch is used, the speech is a part that needs emphasis, and the characters corresponding to that speech are marked.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope; any equivalent structural or flow transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Industrial Applicability
The technical solution provided by the embodiments of the present invention can be applied to a speech recognition process: the voice information detected by the voice recognition module of a projector is first acquired, where the voice information comes from speech delivered by a presenter in real time or from a voice file saved in the projector; the voice information is then converted into text, and the characters corresponding to important speech are marked, for example colored, highlighted, or bolded, according to sound features such as volume or pitch; finally, the marked text is projected and displayed.

Claims (10)

  1. A speech recognition processing method, comprising the following steps:
    acquiring voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features;
    converting the voice information into text, and marking the characters in the text according to the sound features;
    projecting and displaying the marked text.
  2. The speech recognition processing method of claim 1, wherein converting the voice information into text and marking the characters in the text according to the sound features comprises:
    taking the start time of the speech-to-text conversion as the starting point, calculating the average amplitude of the sound wave within a first set time t1 and setting it as a first reference value X0;
    calculating the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer;
    when (Xn-X0)/X0 is greater than a preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
  3. The speech recognition processing method of claim 1 or 2, wherein converting the voice information into text and marking the characters in the text according to the sound features further comprises:
    taking the start time of the speech-to-text conversion as the starting point, calculating the average frequency of the sound wave within the first set time t1 and setting it as a second reference value Y0;
    calculating the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer;
    when (Yn-Y0)/Y0 is greater than the preset value, marking the characters corresponding to the voice information in the (n+1)th t1.
  4. The speech recognition processing method of claim 3, further comprising, after converting the voice information into text and marking the characters in the text according to the sound features:
    with a second set time t2 as the period, when (n+1)×t1 = t2, recalculating the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and setting it as the first reference value X0, and recalculating the average frequency of the sound wave within the first set time t1 and setting it as the second reference value Y0.
  5. The speech recognition processing method of claim 1, wherein the marking comprises bolding, coloring, or highlighting.
  6. A speech recognition processing device, comprising:
    a voice acquiring module, configured to acquire voice information detected by a voice recognition module of a projector, the voice information including voice content and sound features;
    a voice processing module, configured to convert the voice information into text and mark the characters in the text according to the sound features;
    a projection module, configured to project and display the marked text.
  7. The speech recognition processing device of claim 6, wherein the voice processing module comprises:
    an amplitude reference unit, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average amplitude of the sound wave within a first set time t1, and set it as a first reference value X0;
    an amplitude acquisition unit, configured to calculate the average amplitude Xn of the sound wave within the (n+1)th t1, where n is any positive integer;
    a first marking processing unit, configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Xn-X0)/X0 is greater than a preset value.
  8. The speech recognition processing device of claim 6 or 7, wherein the voice processing module further comprises:
    a frequency reference unit, configured to take the start time of the speech-to-text conversion as the starting point, calculate the average frequency of the sound wave within the first set time t1, and set it as a second reference value Y0;
    a frequency acquisition unit, configured to calculate the average frequency Yn of the sound wave within the (n+1)th t1, where n is any positive integer;
    a second marking processing unit, configured to mark the characters corresponding to the voice information in the (n+1)th t1 when (Yn-Y0)/Y0 is greater than the preset value.
  9. The speech recognition processing device of claim 8, further comprising:
    a reference value acquisition module, configured to take a second set time t2 as the period and, when (n+1)×t1 = t2, recalculate the average amplitude of the sound wave within the first set time t1 with the end time of the (n+1)th t1 as the new starting point and set it as the first reference value X0, and recalculate the average frequency of the sound wave within the first set time t1 and set it as the second reference value Y0.
  10. The speech recognition processing device of claim 6, wherein the marking comprises bolding, coloring, or highlighting.
PCT/CN2017/098437 2016-08-24 2017-08-22 Speech recognition processing method and device WO2018036466A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610715090.1A CN107785020B (zh) 2016-08-24 2016-08-24 Speech recognition processing method and device
CN201610715090.1 2016-08-24

Publications (1)

Publication Number Publication Date
WO2018036466A1 true WO2018036466A1 (zh) 2018-03-01

Family

ID=61245498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/098437 WO2018036466A1 (zh) 2016-08-24 2017-08-22 Speech recognition processing method and device

Country Status (2)

Country Link
CN (1) CN107785020B (zh)
WO (1) WO2018036466A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108769638B (zh) * 2018-07-25 2020-07-21 京东方科技集团股份有限公司 Projection control method and apparatus, projection device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050087312A (ko) * 2004-02-26 2005-08-31 한국흑판교재주식회사 Speech recognition method for lecture content and lecture material editing system using the same
JP2006245876A (ja) * 2005-03-02 2006-09-14 Matsushita Electric Ind Co Ltd Conference system using a projector having a network function
CN102290049A (zh) * 2010-06-18 2011-12-21 上海市静安区教育学院附属学校 Speech-to-text conversion device
CN102339193A (zh) * 2010-07-21 2012-02-01 Tcl集团股份有限公司 Method and system for voice-controlled conference presentation
CN103869471A (zh) * 2014-01-09 2014-06-18 盈诺飞微电子(上海)有限公司 Head-mounted speech recognition projection device and system
CN104796584A (zh) * 2015-04-23 2015-07-22 南京信息工程大学 Teleprompter device with speech recognition function

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101176146B (zh) * 2005-05-18 2011-05-18 松下电器产业株式会社 Speech synthesis device
DE102007007830A1 (de) * 2007-02-16 2008-08-21 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for generating a data stream and apparatus and method for reading a data stream
JP5433696B2 (ja) * 2009-07-31 2014-03-05 株式会社東芝 Speech processing device
US8447610B2 (en) * 2010-02-12 2013-05-21 Nuance Communications, Inc. Method and apparatus for generating synthetic speech with contrastive stress
EP2763136B1 (en) * 2013-02-04 2016-04-06 Telefónica, S.A. Method and system for obtaining relevant information from a voice communication
US10629188B2 (en) * 2013-03-15 2020-04-21 International Business Machines Corporation Automatic note taking within a virtual meeting
EP2860706A3 (en) * 2013-09-24 2015-08-12 Agnitio S.L. Anti-spoofing
CN104184870A (zh) * 2014-07-29 2014-12-03 小米科技有限责任公司 Call record marking method and device, and electronic equipment
CN105810211B (zh) * 2015-07-13 2019-11-29 维沃移动通信有限公司 Audio data processing method and terminal
CN105206271A (zh) * 2015-08-25 2015-12-30 北京宇音天下科技有限公司 Voice wake-up method for a smart device and system implementing the method
CN105679312B (zh) * 2016-03-04 2019-09-10 重庆邮电大学 Speech feature processing method for voiceprint recognition in a noisy environment


Also Published As

Publication number Publication date
CN107785020B (zh) 2022-01-25
CN107785020A (zh) 2018-03-09

Similar Documents

Publication Publication Date Title
US8762144B2 (en) Method and apparatus for voice activity detection
JP6673828B2 Device for improving language processing in autism
US8473282B2 (en) Sound processing device and program
Marxer et al. The impact of the Lombard effect on audio and visual speech recognition systems
US20090197224A1 (en) Language Learning Apparatus, Language Learning Aiding Method, Program, and Recording Medium
US20210327446A1 (en) Method and apparatus for reconstructing voice conversation
EP4189974A2 System and method for headphone equalization and room adaptation for binaural reproduction in augmented reality
Cooke et al. Computational auditory scene analysis: Listening to several things at once
JP2023503718A Speech recognition
US9058820B1 (en) Identifying speech portions of a sound model using various statistics thereof
WO2018036466A1 Speech recognition processing method and device
CN111613223B Speech recognition method and system, mobile terminal, and storage medium
Azar et al. Sound visualization for the hearing impaired
WO2020085323A1 Speech processing method, speech processing device, and speech processing program
JP2016164628A Reading-aloud evaluation device, reading-aloud evaluation method, and program
JP2006139162A Language learning device
JP2005209000A Speech visualization method and recording medium storing the method
JP6918471B2 Control method of a dialogue assistance system, dialogue assistance system, and program
Mishra et al. Automatic speech recognition using template model for man-machine interface
Zilany A novel neural feature for a text-dependent speaker identification system.
Waghmare et al. A Comparative Study of the Various Emotional Speech Databases
US20230038118A1 (en) Correction method of synthesized speech set for hearing aid
JP2019213001A Hearing aid and program
Xu et al. Interactions of tone and intonation in whispered Mandarin
US20230223032A1 (en) Method and apparatus for reconstructing voice conversation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17842889

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17842889

Country of ref document: EP

Kind code of ref document: A1