WO2022160749A1 - Role separation method for speech processing device, and speech processing device thereof - Google Patents

Role separation method for speech processing device, and speech processing device thereof Download PDF

Info

Publication number
WO2022160749A1
WO2022160749A1 (PCT/CN2021/120412)
Authority
WO
WIPO (PCT)
Prior art keywords
information
character
time
angle
text
Prior art date
Application number
PCT/CN2021/120412
Other languages
English (en)
French (fr)
Inventor
陈文明
张世明
吕周谨
朱浩华
陈永金
Original Assignee
深圳壹秘科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹秘科技有限公司 filed Critical 深圳壹秘科技有限公司
Publication of WO2022160749A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the present invention relates to the technical field of audio, and in particular, to the technical field of speech recognition.
  • in the field of speech recognition, role separation technology was proposed decades ago, but the results of its practical application have not been very satisfactory.
  • so-called role separation means distinguishing the voices of two or more different people within the voice information.
  • role separation technology was originally embodied as voice separation technology, which originated from the "cocktail party effect": in a complex mixture of voices, humans can effectively select and track the voice of one speaker. This is an innate physiological ability of human beings, but it is not easy to reproduce through science and technology.
  • the concept of blind source signal separation, proposed by Herault and Jutten in the 1980s, refers to recovering unobserved original signals from multiple observed mixed signals. The word "blind" emphasizes two points: 1) the original signals are not known; 2) the way the signals were mixed is not known.
  • the traditional, commonly used blind source separation methods fall into three categories: blind separation algorithms based on information theory or likelihood estimation, blind separation algorithms based on second-order statistics, and blind separation algorithms based on higher-order statistics (HOS, Higher-Order Statistics).
  • all three methods are classification methods built on statistical information, so they carry errors, and in noisy environments the errors become even larger.
  • the present application provides a role separation method with relatively high accuracy and a voice processing device thereof.
  • a method for role separation for a speech processing device includes: performing speech recognition on acquired audio information to acquire first text information, wherein the first text information includes text information and first time information corresponding to the text information; acquiring orientation information of the audio information, the orientation information including angle information and second time information, wherein the angle information is the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device and corresponds to role information; and associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.
  • a voice processing device includes: a voice recognition unit configured to perform voice recognition on the acquired audio information to acquire first text information, wherein the first text information includes text information and first time information corresponding to the text information; an orientation acquisition unit configured to acquire orientation information of the audio information, the orientation information including angle information and second time information, the angle information being the rotation angle relative to a preset 0 degrees on the voice processing device and corresponding to role information; and a role separation unit configured to associate, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.
  • the beneficial effect of the present application is that, after voice recognition is performed on the acquired audio information, the text information and the first time information corresponding to the text information are acquired; at the same time, the angle information at which the sound source corresponding to the audio information reaches the sound pickup device is also acquired, together with the second time information corresponding to the angle information.
  • the angle information corresponds to the role information.
  • through the first time information and the second time information, the role information corresponding to the text information is determined, thereby realizing role separation.
  • FIG. 1 is a flowchart of a method for role separation for a speech processing apparatus according to Embodiment 1 of the present application.
  • FIG. 2 is a schematic diagram of partitioning a space around a speech processing device in Embodiment 1 of the present application.
  • FIG. 3 is a schematic diagram of a first way of matching text information with role information in Embodiment 1 of the present application.
  • FIG. 4 is a schematic diagram of a second way of matching text information with role information in Embodiment 1 of the present application.
  • FIG. 5 is a schematic block diagram of a speech processing apparatus according to Embodiment 2 of the present application.
  • FIG. 6 is a schematic structural diagram of a speech processing apparatus according to Embodiment 3 of the present application.
  • the embodiments of the present application can be applied to various speech processing apparatuses with a speech input function.
  • for example: a voice recorder, an audio conference terminal, or an intelligent electronic device with a recording function.
  • a preferred application scenario of the embodiments of the present application is one where the participants' positions are relatively fixed, for example, one-on-one conversations, face-to-face talks, or interviews.
  • the technical solutions of the present application will be described below through specific embodiments.
  • a method for role separation for a speech processing apparatus includes:
  • S110: perform speech recognition on the acquired audio information to obtain the first text information, where the first text information includes text information and the first time information corresponding to the text information; optionally, the audio information is acquired through a sound pickup device; optionally, the sound pickup device may be a microphone or a microphone array; optionally, the first time information is the start time and end time of the text information; optionally, the first text information can be converted into JSON format;
  • S120: acquire orientation information of the audio information, where the orientation information includes angle information and second time information; the angle information is the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information; the rotation angle may be measured clockwise from the 0-degree direction to the sound source, or counterclockwise;
  • optionally, the orientation information is generated and recorded once every preset time interval; optionally, the second time information is the moment at which the angle information is recorded;
  • S130: according to the first time information and the second time information, associate the text information with the role information corresponding to the angle information.
  • the start time of the first time, the end time of the first time, and the second time are all time offsets (time differences) relative to the moment when the sound pickup device starts picking up sound.
  • the moment when sound pickup starts is usually also the moment when the audio device starts recognizing speech, so the start time of the first time, the end time of the first time, and the second time can equally be expressed as offsets relative to the moment speech recognition starts.
  • optionally, S110, performing speech recognition on the acquired audio information to obtain the first text information, includes:
  • performing speech recognition on the acquired audio information, and recognizing the first time information corresponding to each word in the text information.
  • for example, suppose a user says "this is a complete sentence"; the recognized text content is "this is a complete sentence".
  • the first time information of this sentence includes: the start time of the sentence is 500 milliseconds, i.e., the start time is offset 500 milliseconds from the moment speech recognition started; the end time of the sentence is 2500 milliseconds, i.e., the end time is offset 2500 milliseconds from the moment speech recognition started.
  • the first time information of each word is: "this" starts at 500 milliseconds and ends at 800 milliseconds; "a" starts at 800 milliseconds and ends at 1200 milliseconds; "complete" starts at 1200 milliseconds and ends at 1800 milliseconds; "sentence" starts at 1800 milliseconds and ends at 2500 milliseconds.
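  • to make the JSON conversion mentioned above concrete, the following is a minimal Python sketch of what the first text information could look like; the field names are hypothetical, since the publication only specifies that the text and its sentence-level and word-level start/end times are kept:

      import json

      # Hypothetical layout of the first text information after conversion to JSON.
      # All times are millisecond offsets from the moment speech recognition started.
      first_text_info = {
          "text": "this is a complete sentence",
          "begin_time": 500,
          "end_time": 2500,
          "words": [
              {"word": "this",     "begin_time": 500,  "end_time": 800},
              {"word": "a",        "begin_time": 800,  "end_time": 1200},
              {"word": "complete", "begin_time": 1200, "end_time": 1800},
              {"word": "sentence", "begin_time": 1800, "end_time": 2500},
          ],
      }
      print(json.dumps(first_text_info, indent=2))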
  • the orientation information of the audio information is acquired; if the audio information is acquired through a sound pickup device, the angle information can be generated according to the direction-of-arrival (DOA, Direction of Arrival) technology of the sound pickup device, where the sound pickup device can be a microphone or a microphone array; the orientation information can also be generated from the sound source and the position information of the sound pickup device, where the sound pickup device can be a directional microphone.
  • the angle information is the angle between the sound source direction and the 0-degree direction on the sound pickup device.
  • 0 degrees is a fixed direction on the voice processing device, which can be marked on the device; the direction corresponding to this mark is zero degrees.
  • the space around the speech processing device is divided into two regions, one corresponding to role 1 and the other to role 2. For example: if the clockwise angle between the sound source direction and 0 degrees is within a first preset range, such as between 0 and 90 degrees or between 270 and 360 degrees, the sound source is confirmed as the voice of role 1; if the clockwise angle between the sound source direction and 0 degrees is within a second preset range, such as between 90 and 270 degrees, the sound source is confirmed as the voice of role 2.
  • in use, role 1 sits opposite role 2, with the 0 degrees of the speech processing device facing role 1.
  • when the voice processing device acquires voice information, it can determine whether it is role 1's or role 2's voice information according to the angle between the direction of the sound source producing that voice and the device's 0-degree direction.
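  • as an illustration, a minimal Python sketch of the two-region partition described above; the ranges are the example values from the text, and a real device would obtain the angle from its DOA processing:

      def role_for_angle(angle_deg: float) -> int:
          """Map a clockwise angle from the device's preset 0 degrees to a role."""
          angle = angle_deg % 360
          if angle < 90 or angle >= 270:  # first preset range: 0-90 or 270-360 degrees
              return 1                    # role 1, the side the 0-degree mark faces
          return 2                        # second preset range: 90-270 degrees

      assert role_for_angle(80) == 1      # matches the Table 1 entries further below
      assert role_for_angle(250) == 2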
  • the orientation information further includes session type information, and the session type information can be used to distinguish the usage of the angle information.
  • the session type information may include at least one of the following: 1 indicates that the session type is a local two-person dialogue, 2 indicates that the session type is call mode, and 3 indicates that the session type is speech mode.
  • the type information may be obtained through hardware input, i.e., buttons for the corresponding types are preset on the voice processing device, and when a button of a given type is triggered, the voice processing device obtains the corresponding type information; alternatively, the type information can be inferred from how the voice information was obtained, how many roles it contains, and the like. For example: if the voices of both roles in the voice information are obtained through the local voice pickup device, a local two-person dialogue is determined; if the voice information contains only one role's voice and it is obtained through the local voice pickup device, speech mode is determined; if the voice information contains two voices, one acquired by the local voice pickup device and one acquired by the communication module of the internal circuit, call mode is confirmed.
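  • purely to illustrate the self-judgment rules above, a small sketch with hypothetical inputs (how many role voices arrived via the local pickup device versus via the internal communication module; the real device would derive these counts from its acquisition paths):

      def infer_session_type(local_voices: int, remote_voices: int) -> int:
          # Rules from the text, expressed over voice counts per acquisition path.
          if local_voices == 2 and remote_voices == 0:
              return 1    # local two-person dialogue
          if local_voices == 1 and remote_voices == 1:
              return 2    # call mode: one local voice, one via the comm module
          if local_voices == 1 and remote_voices == 0:
              return 3    # speech mode: a single local voice
          raise ValueError("combination not covered by the text's examples")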
  • for example, when the session type is 1, i.e., a local two-person dialogue, the method divides the acquired angle information into roles according to the preset ranges, such as role 1 and role 2; please refer to FIG. 2 and the above text description of FIG. 2 for the division method.
  • for another example, when the session type is 2, i.e., call mode, the angle value of the remote party's role (say, role 1) is directly set to a preset value, which can be any value outside 0 to 360 degrees, such as 361 degrees, while the angle of locally received audio information can be any angle between 0 and 360 degrees; sound information whose angle information lies between 0 and 360 degrees is then confirmed as the local role (say, role 2). In this way, in call mode, the local role can be distinguished from the remote role through the angle information.
  • for another example, when the session type is 3, i.e., speech mode, there is only one role's voice input, so all angle information corresponds to a single role (say, role 1) and all text information is associated with that role; in addition, the angle information can be used to adjust the sound pickup direction of the microphone array, i.e., sound at the specified angle is strengthened and other directions are weakened.
  • the following example illustrates the orientation information obtained in S120. Assuming that the voice processing device generates and records the orientation information every 40 milliseconds, the acquired information can be stored or recorded in the format of Table 1:

      Second time information | Session type information | Angle information
      0                       | 1                        | 80
      40                      | 1                        | 250

      Table 1
  • from the orientation information in Table 1, the voice processing device can determine: at the moment when the second time information is 0 milliseconds, the session type in the generated and recorded orientation information is 1, i.e., a local two-person dialogue, and the voice information at that moment corresponds to role 1; at the moment when the second time information is 40 milliseconds, the session type in the generated and recorded orientation information is still a local two-person dialogue, and the voice information at that moment corresponds to role 2.
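  • the following sketch shows how recorded orientation entries like those in Table 1 could be resolved to roles across the three session types, reusing role_for_angle from the earlier sketch; the 361-degree preset for the remote party in call mode is the example value given in the text, and the record layout is an assumption:

      REMOTE_PARTY_ANGLE = 361  # example preset for the remote role in call mode

      def role_for_record(session_type: int, angle: float) -> int:
          # Resolve one recorded orientation entry to a role.
          if session_type == 1:                       # local two-person dialogue
              return role_for_angle(angle)
          if session_type == 2:                       # call mode
              return 1 if angle == REMOTE_PARTY_ANGLE else 2
          return 1                                    # speech mode: single speaker

      records = [(0, 1, 80), (40, 1, 250)]            # (second_time_ms, type, angle), as in Table 1
      for t, s, a in records:
          print(f"{t} ms -> role {role_for_record(s, a)}")  # 0 ms -> role 1, 40 ms -> role 2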
  • Mode 1: referring to FIG. 3, first determine the role information corresponding to the angle information, and then associate the text information with that role information according to the first time and the second time. Specifically, this includes the following steps:
  • S1311, determine the role information corresponding to the angle information;
  • S1312, when the second time matches the first time, confirm that the text information matches the role information corresponding to the angle information.
  • in this application, the second time matching the first time may mean that the second time is identical to the first time, or that the second time lies within the time range of the first time; the text information matching the role information means confirming that the two are associated, i.e., that the text information corresponds to the role information.
  • specifically, according to the first time information and the second time information, the orientation information generated and recorded within the time period of the first time information is obtained; since that orientation information already had its corresponding role information confirmed in step S1311, the text information corresponding to the first time information can be matched with the role information.
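  • a compact sketch of Mode 1 under these assumptions, using the hypothetical sentence and record layouts from the earlier sketches; the resulting list of role hits is exactly what the counting step described below tallies:

      def second_matches_first(t_ms: int, sentence: dict) -> bool:
          # "Matches": the second time equals, or lies within, the first time range.
          return sentence["begin_time"] <= t_ms <= sentence["end_time"]

      def associate_mode1(sentence: dict, records: list) -> list:
          # S1311: each record resolves to a role; S1312: keep the roles of the
          # records whose second time matches the sentence's first time.
          return [role_for_record(s, a) for t, s, a in records
                  if second_matches_first(t, sentence)]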
  • Mode 2: referring to FIG. 4, first associate the text information with the angle information according to the timestamps of the first time and the second time; then determine the corresponding role information from the angle information, thereby associating the text information with the role information. Specifically, this includes the following steps: S1321, when the second time matches the first time, confirm that the text information corresponds to the angle information; S1322, determine the role information corresponding to the angle information; S1323, determine that the text information matches the role information corresponding to the angle information.
  • specifically, according to the first time information and the second time information, the orientation information generated and recorded within the time period of the first time information is obtained; then the role information corresponding to that orientation information is determined; finally, the text information corresponding to the first time information can be matched with the role information.
  • in S1312 and S1323, confirming that the text information matches the role information corresponding to the angle information specifically includes: counting the number of appearances of the first role and the second role within the first time period, and when the first role appears more often, or far more often, than the second role, determining that the text information within the first time range corresponds to the first role. For example, within the first time range (500 ms to 2500 ms), the first role (role 1) appears 48 times and the second role (role 2) appears 3 times, so the text information within the first time range, "this is a complete sentence", is determined to correspond to the first role.
  • optionally, the number of appearances of each role may also be counted for each word.
  • for example, the first time information of the word "complete" is: start time 1200 ms, end time 1800 ms; the orientation information between 1200 ms and 1800 ms is then obtained, the appearances of the first role and the second role within this period are counted from that orientation information, and the role with the most appearances is taken as the role information corresponding to the word.
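  • the per-word counting could be sketched as follows, again with the hypothetical record and word layouts used in the earlier sketches:

      from collections import Counter

      def role_for_word(word: dict, records: list):
          """Majority vote over the orientation records inside the word's time span."""
          votes = Counter(role_for_record(s, a) for t, s, a in records
                          if word["begin_time"] <= t <= word["end_time"])
          return votes.most_common(1)[0][0] if votes else None

      # e.g. for "complete" (1200-1800 ms), the records logged every 40 ms in that
      # window are tallied, and the most frequent role wins.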
  • the method further includes:
  • S140: output second text information, where the second text information includes the role information and the text information corresponding to the role information.
  • the second text information can be output in the form of printing or generating an electronic text file, so that users can view or edit it.
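  • combining the sketches above, the second text information could be assembled along these lines (hypothetical layouts throughout; associate_mode1 and the record format come from the earlier sketches):

      def second_text(sentences: list, records: list) -> str:
          # Assemble the role-labelled transcript (second text information).
          lines = []
          for s in sentences:
              hits = associate_mode1(s, records)
              role = max(set(hits), key=hits.count) if hits else 0
              lines.append(f"Role {role}: {s['text']}")
          return "\n".join(lines)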
  • in Embodiment 1, after voice recognition is performed on the acquired audio information, the text information and the first time information corresponding to the text information are acquired; at the same time, the angle information between the sound source corresponding to the audio information and the 0 degrees of the speech processing device is also acquired, together with the second time information corresponding to the angle information.
  • the angle information corresponds to the role information. Through the first time information and the second time information, the role information corresponding to the text information is determined, thereby realizing role separation.
  • since the role corresponding to the text information converted from the audio information is determined from the input angle of the audio source, there is no need to add hardware deployment and set up a dedicated sound pickup device for each role, nor to use algorithms or deep learning methods to separate the roles in the audio information; therefore, hardware costs are saved, no venue restrictions apply, and the application is flexible and convenient.
  • at the same time, since the angle information is used directly to determine the corresponding role, and the angle information is relatively accurate, this approach is not prone to the errors of role separation by algorithms or deep learning; therefore, it can also reduce the computational complexity of the speech processing device and improve the accuracy of role separation.
  • FIG. 5 shows a speech processing apparatus 200 according to Embodiment 2 of the present application.
  • the voice processing device 200 includes, but is not limited to, any of a voice recorder, an audio conference terminal, or an intelligent electronic device with a recording function; it may also be a voice device, computer, or other intelligent electronic device that contains no voice pickup function and only provides the role separation processing function. Embodiment 2 does not limit this. The speech processing apparatus 200 includes:
  • the speech recognition unit 210 is configured to perform speech recognition on the acquired audio information to obtain first text information, where the first text information includes the text information and the first time information corresponding to the text information; optionally, the audio information is acquired through a sound pickup device; optionally, the sound pickup device may be a microphone or a microphone array; optionally, the first time information is the start time and end time of the text information;
  • the orientation acquisition unit 220 is configured to acquire the orientation information of the audio information, the orientation information including angle information and second time information, the angle information being the rotation angle of the sound source relative to the preset 0 degrees on the speech processing device and corresponding to the role information; the rotation angle may be measured clockwise from the 0-degree direction to the sound source, or counterclockwise; optionally, the orientation information is generated and recorded once every preset time interval; optionally, the second time information is the moment at which the angle information is recorded;
  • a role separation unit 230 configured to associate the text information with the role information corresponding to the angle information according to the first time information and the second time information.
  • the voice processing apparatus 200 further includes: a sound pickup apparatus 240 for acquiring voice information.
  • the sound pickup device 240 may be a microphone, or a microphone array.
  • the start time of the first time, the end time of the first time, and the second time are all time offsets (time differences) relative to the moment when the sound pickup device starts picking up sound.
  • the moment when sound pickup starts is usually also the moment when the audio device starts recognizing speech, so these times can equally be expressed as offsets relative to the moment speech recognition starts.
  • the speech recognition unit 210 is specifically configured to perform speech recognition on the acquired audio information, and to recognize the first time information corresponding to each word in the text information.
  • for specific examples, refer to the example for S110 in Embodiment 1, which will not be repeated here.
  • the orientation acquisition unit 220 may use a sound pickup device to obtain the orientation information; the angle information may then be generated according to the direction-of-arrival (DOA, Direction of Arrival) technology of the sound pickup device, where the sound pickup device may be a microphone or a microphone array; the orientation information may also be generated from the sound source and the position information of the sound pickup device, where the sound pickup device may be a directional microphone.
  • for how the angle information and the role information correspond, please refer to Embodiment 1 and the description of S120 and FIG. 2, which will not be repeated here.
  • the role separation unit 230 may associate the text information with the role information in two ways. Specifically:
  • Mode 1: the role separation unit 230 is specifically configured to confirm, when the second time matches the first time, that the text information matches the role information corresponding to the angle information; for details, please refer to Embodiment 1 and FIG. 3 and the description of S1311 and S1312, which will not be repeated here.
  • Mode 2: the role separation unit 230 is specifically configured to confirm, when the second time matches the first time, that the text information corresponds to the angle information; determine the role information corresponding to the angle information; and determine that the text information matches the role information corresponding to the angle information; for details, please refer to Embodiment 1 and FIG. 4 and the description of S1321 to S1323, which will not be repeated here.
  • the role separation unit 230 is also specifically configured to count the number of appearances of the first role and the second role within the first time period, and, when the first role appears far more often than the second role, to determine that the text information within the first time range corresponds to the first role; for specific examples, please refer to the corresponding description in Embodiment 1, which will not be repeated here.
  • the role information includes at least a first role and a second role; angle information within the first range corresponds to the first role, and angle information within the second range corresponds to the second role.
  • the orientation information further includes a session type, where the session type is used to distinguish the usage of the angle information.
  • the role separation unit 230 is further configured to output second text information, where the second text information includes the role information and text information corresponding to the role information.
  • FIG. 6 is a schematic structural diagram of a speech processing apparatus 300 according to Embodiment 3 of the present application.
  • the speech processing apparatus 300 includes: a processor 310, a memory 320, and a communication interface 340.
  • the processor 310, the memory 320 and the communication interface 340 are connected to each other through a bus system.
  • the processor 310 may be an independent component, or may be a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor DSP, or at least one field-programmable gate array (FPGA), etc.
  • the memory 320 is a computer-readable storage medium on which programs executable on the processor 310 are stored.
  • the processor 310 invokes the program in the memory 320 to execute any one of the role separation methods for a speech processing device provided in Embodiment 1, and transmits the result obtained by the processor 310 to other devices through the communication interface 340, wirelessly or by wire.
  • the voice processing device 300 further includes: a sound pickup device 330 for acquiring voice information.
  • the processor 310 , the memory 320 , the sound pickup device 330 and the communication interface 340 realize mutual communication connection through a bus system.
  • the processor 310 invokes the program in the memory 320, executes any one of the role separation methods for a voice processing device provided in Embodiment 1, processes the voice information obtained by the sound pickup device 330, and transmits the result obtained by the processor 310 to other devices through the communication interface 340, wirelessly or by wire.
  • the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • when implemented in software, the functions may be implemented by a processor executing software instructions.
  • the software instructions may consist of corresponding software modules.
  • the software modules may be stored in a computer-readable storage medium, which may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that includes an integration of one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)), or semiconductor media (e.g., Solid State Disk (SSD)), etc.
  • the computer-readable storage medium includes but is not limited to random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), Erasable Programmable Read-Only Memory (Erasable Programmable ROM, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disks, removable hard disks, compact discs (CD-ROMs), or any other form of storage medium known in the art.
  • An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the computer-readable storage medium.
  • the computer-readable storage medium can also be an integral part of the processor.
  • the processor and computer-readable storage medium may reside in an ASIC. Additionally, the ASIC may reside in access network equipment, target network equipment or core network equipment.
  • the processor and the computer-readable storage medium may also exist as discrete components in the access network device, the target network device or the core network device. When implemented in software, it can also be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer program instructions may be stored in, or transmitted from one computer-readable storage medium to another computer-readable storage medium as described above; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A role separation method for a speech processing device, and a speech processing device. The method includes: performing speech recognition on audio information acquired by a sound pickup device to obtain first text information (S110), where the first text information includes text information and first time information corresponding to the text information; acquiring, through the sound pickup device, orientation information of the audio information, the orientation information including angle information and second time information (S120), where the angle information is the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information; and associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information (S130). The method and device can improve the accuracy of role separation and realize the role separation function in speech information processing without adding hardware costs or hardware deployment and without using traditional algorithms or deep learning methods.

Description

Role separation method for speech processing device, and speech processing device thereof

Technical Field

The present invention relates to the technical field of audio, and in particular to the technical field of speech recognition.
Background Art

In the field of speech recognition technology, role separation was proposed decades ago, but the results of its practical application have not been very satisfactory. So-called role separation means distinguishing the voices of two or more different people within voice information.

Role separation technology was originally embodied as voice separation technology, which originated from the "cocktail party effect": in a complex mixture of voices, humans can effectively select and track the voice of one speaker. This is an innate physiological ability of human beings, but reproducing it through science and technology is not easy. The concept of blind source signal separation, proposed by Herault and Jutten in the 1980s, refers to recovering unobserved original signals from multiple observed mixed signals. The word "blind" emphasizes two points: 1) the original signals are not known; 2) the way the signals were mixed is not known. The traditional, commonly used blind source separation methods fall into three categories: blind separation algorithms based on information theory or likelihood estimation, blind separation algorithms based on second-order statistics, and blind separation algorithms based on higher-order statistics (HOS, Higher-Order Statistics). All three are classification methods built on statistical information, so they carry errors, and in environments noisy with human voices the errors become even larger.

Because blind source separation algorithms are inaccurate, hardware-based schemes for determining the sound source later appeared. For example, in a conference venue each person is given a dedicated microphone, so everyone's collected speech is independent and the roles are naturally separated. Although this method is more accurate than earlier blind source separation techniques, it requires hardware to be deployed in advance, involves extensive preparation, complex operation, and high cost, and is inflexible to use.

In recent years, with the development of artificial intelligence, deep learning has replaced some traditional algorithms, and many schemes using deep learning for role separation have appeared. These schemes widely use MFCC (Mel-scale Frequency Cepstral Coefficients) to extract voice features and then train a model through a neural network; to further improve the recognition rate, a segment of speech can also be pre-recorded. Their accuracy is higher than that of traditional algorithms, but they require massive data for support, cost a lot, and still carry a certain inaccuracy.
Summary of the Invention

The present application provides a role separation method with relatively high accuracy and a speech processing device thereof.

The present application provides the following technical solutions:

In one aspect, a role separation method for a speech processing device is provided, including: performing speech recognition on acquired audio information to obtain first text information, where the first text information includes text information and first time information corresponding to the text information; acquiring orientation information of the audio information, the orientation information including angle information and second time information, where the angle information is the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information; and associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.

In another aspect, a speech processing device is provided, including: a speech recognition unit configured to perform speech recognition on acquired audio information to obtain first text information, where the first text information includes text information and first time information corresponding to the text information; an orientation acquisition unit configured to acquire orientation information of the audio information, the orientation information including angle information and second time information, the angle information being the rotation angle relative to a preset 0 degrees on the speech processing device and corresponding to role information; and a role separation unit configured to associate, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.

The beneficial effect of the present application is that, after speech recognition is performed on the acquired audio information, the text information and the first time information corresponding to the text information are acquired; at the same time, the angle information at which the sound source corresponding to the audio information reaches the sound pickup device, together with the second time information corresponding to the angle information, is also acquired. The angle information corresponds to the role information. Through the first time information and the second time information, the role information corresponding to the text information is determined, thereby realizing role separation. In this solution, because the role information is determined from the angle at which the audio source enters the sound pickup device, there is no need to add hardware deployment and set up a dedicated sound pickup device for each role, nor to use algorithms or deep learning methods to separate the roles in the audio information; this saves hardware costs, imposes no venue restrictions, and is flexible and convenient to apply. Meanwhile, because the corresponding role is determined directly from the angle information, which is relatively accurate, traditional algorithms or deep learning methods are not needed for role separation, which can also reduce the computational complexity of the speech processing device and improve the accuracy of role separation.
Brief Description of the Drawings

FIG. 1 is a flowchart of a role separation method for a speech processing device provided in Embodiment 1 of the present application.

FIG. 2 is a schematic diagram of partitioning the space around the speech processing device in Embodiment 1 of the present application.

FIG. 3 is a schematic diagram of a first way of matching text information with role information in Embodiment 1 of the present application.

FIG. 4 is a schematic diagram of a second way of matching text information with role information in Embodiment 1 of the present application.

FIG. 5 is a schematic block diagram of a speech processing device provided in Embodiment 2 of the present application.

FIG. 6 is a schematic structural diagram of a speech processing device provided in Embodiment 3 of the present application.
Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only intended to explain the present application, not to limit it. The present application can, however, be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of the present application will be understood more thoroughly and comprehensively.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present application belongs. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.

It should be understood that the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

The embodiments of the present application can be applied to various speech processing devices with a voice recording function, for example, voice recorders, audio conference terminals, or intelligent electronic devices with a recording function.

A preferred application scenario of the embodiments of the present application is one where the participants' positions are relatively fixed, for example, one-on-one conversations, face-to-face talks, or interviews. The technical solutions of the present application are described below through specific embodiments.
Embodiment 1

Referring to FIG. 1, Embodiment 1 of the present application provides a role separation method for a speech processing device, which includes:

S110: Perform speech recognition on acquired audio information to obtain first text information, where the first text information includes text information and first time information corresponding to the text information. Optionally, the audio information is acquired through a sound pickup device; optionally, the sound pickup device may be a microphone or a microphone array; optionally, the first time information is the start time and end time of the text information; optionally, the first text information can be converted into JSON format.

S120: Acquire orientation information of the audio information, where the orientation information includes angle information and second time information; the angle information is the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information. The rotation angle may be measured clockwise from the 0-degree direction to the sound source, or counterclockwise. Optionally, the orientation information is generated and recorded once every preset time interval; optionally, the second time information is the moment at which the angle information is recorded.

S130: According to the first time information and the second time information, associate the text information with the role information corresponding to the angle information.

Optionally, the start time of the first time, the end time of the first time, and the second time are all time offsets (time differences) relative to the moment when the sound pickup device starts picking up sound. The moment when sound pickup starts is usually also the moment when the audio device starts recognizing speech, so these times can equally be expressed as offsets relative to the moment speech recognition starts.

Optionally, in S110, performing speech recognition on the acquired audio information to obtain the first text information includes: performing speech recognition on the acquired audio information and recognizing the first time information corresponding to each word in the text information.
S110 is illustrated below with an example. Suppose a user says "this is a complete sentence"; after receiving this speech, the audio device recognizes the following information (the figure in the original publication shows the recognized text together with its sentence-level and word-level start and end times):

That is, the recognized text content is "this is a complete sentence"; the first time information of this sentence includes: the start time of the sentence is 500 milliseconds, i.e., the start time is offset 500 milliseconds from the moment speech recognition started; the end time of the sentence is 2500 milliseconds, i.e., the end time is offset 2500 milliseconds from the moment speech recognition started.

As a further refinement, each word in the sentence can also be recognized, together with its start and end times. The first time information of each word is: "this" starts at 500 milliseconds and ends at 800 milliseconds; "a" starts at 800 milliseconds and ends at 1200 milliseconds; "complete" starts at 1200 milliseconds and ends at 1800 milliseconds; "sentence" starts at 1800 milliseconds and ends at 2500 milliseconds.
Optionally, in S120, the orientation information of the audio information is acquired. If the audio information is acquired through a sound pickup device, the angle information can be generated according to the direction-of-arrival (DOA, Direction of Arrival) technology of the sound pickup device, where the sound pickup device may be a microphone or a microphone array; the orientation information may also be generated from the sound source and the position information of the sound pickup device, where the sound pickup device may be a directional microphone.

Referring to FIG. 2, the angle information is the angle between the sound source direction and the 0-degree direction on the sound pickup device. Optionally, 0 degrees is a fixed direction on the speech processing device, which can be marked on the device; the direction corresponding to this mark is zero degrees.

Assuming the role information includes at least a first role (role 1 in FIG. 2) and a second role (role 2 in FIG. 2), the space around the speech processing device is divided into two regions: one corresponds to role 1 and the other to role 2. For example: if the clockwise angle between the sound source direction and 0 degrees is within a first preset range, such as between 0 and 90 degrees or between 270 and 360 degrees, the sound source is confirmed as the voice of role 1; if the clockwise angle between the sound source direction and 0 degrees is within a second preset range, such as between 90 and 270 degrees, the sound source is confirmed as the voice of role 2.

For example, two people A and B are holding a conversation or interview; if the clockwise angle between sound source A and 0 degrees is 80 degrees, sound source A is confirmed as role 1; if the clockwise angle between sound source B and 0 degrees is 250 degrees, sound source B is confirmed as role 2.

In use, role 1 sits opposite role 2, with the 0 degrees of the speech processing device facing role 1. During a conversation or interview, when the speech processing device acquires voice information, it can determine whether it is role 1's or role 2's voice information according to the angle between the direction of the sound source producing that voice and the device's 0-degree direction.

The above two roles are only an example; optionally, the scheme can also be configured with three or four roles.
Optionally, the orientation information further includes session type information, which can be used to distinguish the usage of the angle information.

Optionally, the session type information may include at least one of the following: 1 indicates that the session type is a local two-person dialogue, 2 indicates that the session type is call mode, and 3 indicates that the session type is speech mode.

Optionally, the type information may be obtained through hardware input, i.e., buttons for the corresponding types are preset on the speech processing device, and when a button of a given type is triggered, the device obtains the corresponding type information. Alternatively, the type information can be inferred from how the voice information was obtained, how many roles it contains, and other such information. For example: if the voices of both roles in the voice information are obtained through the local voice pickup device, a local two-person dialogue is determined; if the voice information contains only one role's voice and it is obtained through the local voice pickup device, speech mode is determined; if the voice information contains two voices, one acquired by the local voice pickup device and one acquired by the communication module of the internal circuit, call mode is confirmed.

For example, when the session type is 1, i.e., a local two-person dialogue, the method divides the acquired angle information into roles according to the preset ranges, such as role 1 and role 2; see FIG. 2 and the text above describing FIG. 2 for the division method.

For another example, when the session type is 2, i.e., call mode, the angle value of the remote party's role (say, role 1) is directly set to a preset value, which can be any value outside 0 to 360 degrees, such as 361 degrees, while the angle of locally received audio information can be any angle between 0 and 360 degrees; sound information whose angle information lies between 0 and 360 degrees is then confirmed as the local role (say, role 2). In this way, in call mode, the local role can be distinguished from the remote role through the angle information.

For another example, when the session mode is 3, i.e., speech mode, there is only one role's voice input, so all angle information is determined to correspond to a single role (say, role 1), and all text information is associated with role 1. As a refinement, in speech mode the angle information can also be used to adjust the sound pickup direction of the microphone array, i.e., sound at the specified angle is strengthened and other directions are weakened.

The orientation information obtained in S120 is illustrated below with an example. Assuming the speech processing device generates and records the orientation information every 40 milliseconds, the acquired information can be stored or recorded in the format of Table 1 below:
Second time information | Session type information | Angle information
0                       | 1                        | 80
40                      | 1                        | 250

Table 1
From the orientation information in Table 1, the speech processing device can determine: at the moment when the second time information is 0 milliseconds, the session type in the generated and recorded orientation information is 1, i.e., a local two-person dialogue, and the voice information at that moment corresponds to role 1; at the moment when the second time information is 40 milliseconds, the session type in the generated and recorded orientation information is still a local two-person dialogue, and the voice information at that moment corresponds to role 2.

Optionally, S130, associating the text information with the role information corresponding to the angle information according to the first time information and the second time information, can be implemented in the following two ways:

Mode 1: Referring to FIG. 3, first determine the role information corresponding to the angle information, and then associate the text information with that role information according to the first time and the second time. Specifically, this includes the following steps:

S1311, determine the role information corresponding to the angle information;

S1312, when the second time matches the first time, confirm that the text information matches the role information corresponding to the angle information. In this application, the second time matching the first time may mean that the second time is identical to the first time, or that the second time lies within the time range of the first time; the text information matching the role information means confirming that the two are associated, i.e., that the text information corresponds to the role information.

Specifically, according to the first time information and the second time information, the orientation information generated and recorded within the time period of the first time information is obtained; since that orientation information already had its corresponding role information confirmed in step S1311, the text information corresponding to the first time information can be matched with the role information.
Mode 2: Referring to FIG. 4, first associate the text information with the angle information according to the timestamps of the first time and the second time; then determine the corresponding role information from the angle information, thereby associating the text information with the role information. Specifically, this includes the following steps:

S1321, when the second time matches the first time, confirm that the text information corresponds to the angle information;

S1322, determine the role information corresponding to the angle information;

S1323, determine that the text information matches the role information corresponding to the angle information.

Specifically, according to the first time information and the second time information, the orientation information generated and recorded within the time period of the first time information is obtained; then the role information corresponding to that orientation information is determined; finally, the text information corresponding to the first time information can be matched with the role information.
Optionally, in S1312 and S1323, confirming that the text information matches the role information corresponding to the angle information specifically includes:

counting the number of appearances of the first role and the second role within the first time period;

when the first role appears more often, or far more often, than the second role, determining that the text information within the first time range corresponds to the first role.

As shown in FIG. 3, within the first time range (500 ms to 2500 ms), the first role (role 1) appears 48 times and the second role (role 2) appears 3 times, so the text information within the first time range, "this is a complete sentence", is determined to correspond to the first role.

Optionally, this scheme can also count the number of appearances of each role for each word. For example, the first time information of the word "complete" is: start time 1200 ms, end time 1800 ms; the orientation information between 1200 ms and 1800 ms is then obtained, the appearances of the first role and the second role within this period are counted from that orientation information, and the role with the most appearances is taken as the role information corresponding to the word.
Optionally, the method further includes:

S140: Output second text information, where the second text information includes the role information and the text information corresponding to the role information. Optionally, it can be output by printing or by generating an electronic text file, so that users can view or edit it.

In Embodiment 1 of the present application, after speech recognition is performed on the acquired audio information, the text information and the first time information corresponding to the text information are acquired; at the same time, the angle information between the sound source corresponding to the audio information and the 0 degrees of the speech processing device, together with the second time information corresponding to the angle information, is also acquired. The angle information corresponds to the role information. Through the first time information and the second time information, the role information corresponding to the text information is determined, thereby realizing role separation. In Embodiment 1, because the role corresponding to the text information converted from the audio information is determined from the input angle of the audio source, there is no need to add hardware deployment and set up a dedicated sound pickup device for each role, nor to use algorithms or deep learning methods to separate the roles in the audio information; this saves hardware costs, imposes no venue restrictions, and is flexible and convenient to apply. Meanwhile, because the corresponding role is determined directly from the angle information, which is relatively accurate and not prone to the errors of role separation by algorithms or deep learning, the computational complexity of the speech processing device can be reduced and the accuracy of role separation improved.
Embodiment 2

Referring to FIG. 5, Embodiment 2 of the present application provides a speech processing device 200. The speech processing device 200 includes, but is not limited to, any of a voice recorder, an audio conference terminal, or an intelligent electronic device with a recording function; it may also be a voice device, computer, or other intelligent electronic device that contains no voice pickup function and only provides the role separation processing function. Embodiment 2 does not limit this. The speech processing device 200 includes:

a speech recognition unit 210, configured to perform speech recognition on acquired audio information to obtain first text information, where the first text information includes text information and first time information corresponding to the text information; optionally, the audio information is acquired through a sound pickup device; optionally, the sound pickup device may be a microphone or a microphone array; optionally, the first time information is the start time and end time of the text information;

an orientation acquisition unit 220, configured to acquire orientation information of the audio information, the orientation information including angle information and second time information, the angle information being the rotation angle of the sound source relative to a preset 0 degrees on the speech processing device and corresponding to role information; the rotation angle may be measured clockwise from the 0-degree direction to the sound source, or counterclockwise; optionally, the orientation information is generated and recorded once every preset time interval; optionally, the second time information is the moment at which the angle information is recorded;

a role separation unit 230, configured to associate, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.

Optionally, the speech processing device 200 further includes: a sound pickup device 240 for acquiring voice information. Specifically, the sound pickup device 240 may be a microphone or a microphone array.

Optionally, the start time of the first time, the end time of the first time, and the second time are all time offsets (time differences) relative to the moment when the sound pickup device starts picking up sound. The moment when sound pickup starts is usually also the moment when the audio device starts recognizing speech, so these times can equally be expressed as offsets relative to the moment speech recognition starts.

Optionally, the speech recognition unit 210 is specifically configured to perform speech recognition on the acquired audio information and recognize the first time information corresponding to each word in the text information. For specific examples, refer to the example for S110 in Embodiment 1, which will not be repeated here.

Optionally, the orientation acquisition unit 220 may use a sound pickup device to obtain the orientation information; the angle information may then be generated according to the direction-of-arrival (DOA, Direction of Arrival) technology of the sound pickup device, where the sound pickup device may be a microphone or a microphone array; the orientation information may also be generated from the sound source and the position information of the sound pickup device, where the sound pickup device may be a directional microphone. For how the angle information and the role information correspond, refer to Embodiment 1 and the description of S120 and FIG. 2, which will not be repeated here.
Optionally, the role separation unit 230 may associate the text information with the role information in two ways. Specifically:

Mode 1: the role separation unit 230 is specifically configured to confirm, when the second time matches the first time, that the text information matches the role information corresponding to the angle information. For details, refer to Embodiment 1 and FIG. 3 and the description of S1311 and S1312, which will not be repeated here.

Mode 2: the role separation unit 230 is specifically configured to confirm, when the second time matches the first time, that the text information corresponds to the angle information; determine the role information corresponding to the angle information; and determine that the text information matches the role information corresponding to the angle information. For details, refer to Embodiment 1 and FIG. 4 and the description of S1321 to S1323, which will not be repeated here.

Optionally, the role separation unit 230 is further specifically configured to count the number of appearances of the first role and the second role within the first time period, and, when the first role appears far more often than the second role, to determine that the text information within the first time range corresponds to the first role. For specific examples, refer to the corresponding description in Embodiment 1, which will not be repeated here.

Optionally, the role information includes at least a first role and a second role; angle information within the first range corresponds to the first role, and angle information within the second range corresponds to the second role.

Optionally, the orientation information further includes a session type, which is used to distinguish the usage of the angle information.

Optionally, the role separation unit 230 is further configured to output second text information, where the second text information includes the role information and the text information corresponding to the role information.

For anything not exhaustively described in Embodiment 2, refer to the same or corresponding parts in Embodiment 1, which will not be repeated here.
Embodiment 3

Referring to FIG. 6, which is a schematic structural diagram of a speech processing device 300 provided in Embodiment 3 of the present application, the speech processing device 300 includes: a processor 310, a memory 320, and a communication interface 340. The processor 310, the memory 320, and the communication interface 340 are communicatively connected to one another through a bus system.

The processor 310 may be an independent component, or may be a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor DSP or at least one field-programmable gate array (FPGA). The memory 320 is a computer-readable storage medium on which a program executable on the processor 310 is stored.

The processor 310 invokes the program in the memory 320 to execute any one of the role separation methods for a speech processing device provided in Embodiment 1, and transmits the result obtained by the processor 310 to other devices through the communication interface 340, wirelessly or by wire.

Optionally, the speech processing device 300 further includes: a sound pickup device 330 for acquiring voice information. The processor 310, the memory 320, the sound pickup device 330, and the communication interface 340 are communicatively connected to one another through a bus system. The processor 310 invokes the program in the memory 320, executes any one of the role separation methods for a speech processing device provided in Embodiment 1, processes the voice information acquired by the sound pickup device 330, and transmits the result obtained by the processor 310 to other devices through the communication interface 340, wirelessly or by wire.

For anything not exhaustively described in Embodiment 3, refer to the same or corresponding parts in Embodiment 1, which will not be repeated here.

Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules. The software modules may be stored in a computer-readable storage medium, which may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)), or semiconductor media (e.g., Solid State Disk (SSD)), etc. The computer-readable storage medium includes but is not limited to random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), Erasable Programmable Read-Only Memory (Erasable Programmable ROM, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disks, removable hard disks, compact discs (CD-ROMs), or any other form of storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the computer-readable storage medium. Of course, the computer-readable storage medium can also be an integral part of the processor. The processor and the computer-readable storage medium may reside in an ASIC. In addition, the ASIC may reside in an access network device, a target network device, or a core network device. Of course, the processor and the computer-readable storage medium may also exist as discrete components in the access network device, the target network device, or the core network device. When implemented in software, the functions may also be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer or a chip (which may contain a processor), the processes or functions described in the specific embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer program instructions may be stored in the above computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave).

The above embodiments illustrate but do not limit the present invention, and those skilled in the art can design many alternative examples within the scope of the claims. Those skilled in the art should appreciate that the present application is not limited to the precise structures described above and shown in the accompanying drawings, and that appropriate adjustments, modifications, equivalent substitutions, improvements, and the like may be made to the specific implementations without departing from the scope of the present invention as defined by the appended claims. Therefore, any modifications and changes made in accordance with the concepts and principles of the present invention fall within the scope of the present invention as defined by the appended claims.

Claims (14)

  1. A role separation method for a speech processing device, characterized in that the method comprises:
    performing speech recognition on acquired audio information to obtain first text information, wherein the first text information comprises text information and first time information corresponding to the text information;
    acquiring orientation information of the audio information, the orientation information comprising angle information and second time information, wherein the angle information is a rotation angle of a sound source relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information;
    associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.
  2. The role separation method for a speech processing device according to claim 1, characterized in that performing speech recognition on the acquired audio information to obtain the first text information comprises:
    performing speech recognition on the acquired audio information, and recognizing the first time information corresponding to each word in the text information.
  3. The role separation method for a speech processing device according to claim 1, characterized in that associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information comprises:
    determining the role information corresponding to the angle information;
    when the second time matches the first time, confirming that the text information matches the role information corresponding to the angle information.
  4. The role separation method for a speech processing device according to claim 1, characterized in that associating, according to the first time information and the second time information, the text information with the role information corresponding to the angle information comprises:
    when the second time matches the first time, confirming that the text information corresponds to the angle information;
    determining the role information corresponding to the angle information;
    determining that the text information matches the role information corresponding to the angle information.
  5. The role separation method for a speech processing device according to any one of claims 1 to 4, characterized in that the role information comprises at least a first role and a second role; angle information within a first range corresponds to the first role, and angle information within a second range corresponds to the second role.
  6. The role separation method for a speech processing device according to any one of claims 1 to 4, characterized in that the orientation information further comprises a session type, and the session type is used to distinguish the usage of the angle information.
  7. The role separation method for a speech processing device according to any one of claims 1 to 4, wherein the method further comprises: outputting second text information, the second text information comprising the role information and text information corresponding to the role information.
  8. A speech processing device, characterized in that the speech processing device comprises:
    a speech recognition unit, configured to perform speech recognition on acquired audio information to obtain first text information, wherein the first text information comprises text information and first time information corresponding to the text information;
    an orientation acquisition unit, configured to acquire orientation information of the audio information, the orientation information comprising angle information and second time information, wherein the angle information is a rotation angle relative to a preset 0 degrees on the speech processing device, and the angle information corresponds to role information;
    a role separation unit, configured to associate, according to the first time information and the second time information, the text information with the role information corresponding to the angle information.
  9. The speech processing device according to claim 8, characterized in that the speech recognition unit is specifically configured to perform speech recognition on the acquired audio information and recognize the first time information corresponding to each word in the text information.
  10. The speech processing device according to claim 8, characterized in that the role separation unit is specifically configured to confirm, when the second time matches the first time, that the text information matches the role information corresponding to the angle information.
  11. The speech processing device according to claim 8, characterized in that the role separation unit is specifically configured to: confirm, when the second time matches the first time, that the text information corresponds to the angle information; determine the role information corresponding to the angle information; and determine that the text information matches the role information corresponding to the angle information.
  12. The speech processing device according to any one of claims 8 to 11, characterized in that the role information comprises at least a first role and a second role; angle information within a first range corresponds to the first role, and angle information within a second range corresponds to the second role.
  13. The speech processing device according to any one of claims 8 to 11, characterized in that the orientation information further comprises a session type, and the session type is used to distinguish the usage of the angle information.
  14. The speech processing device according to any one of claims 8 to 11, characterized in that the role separation unit is further configured to output second text information, the second text information comprising the role information and text information corresponding to the role information.
PCT/CN2021/120412 2021-01-29 2021-09-24 Role separation method for speech processing device, and speech processing device thereof WO2022160749A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110127955.3A 2021-01-29 2021-01-29 Role separation method for speech processing device, and speech processing device thereof
CN202110127955.3 2021-01-29

Publications (1)

Publication Number Publication Date
WO2022160749A1 true WO2022160749A1 (zh) 2022-08-04

Family

ID=76121307

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120412 Role separation method for speech processing device, and speech processing device thereof 2021-01-29 2021-09-24 WO2022160749A1 (zh)

Country Status (2)

Country Link
CN (1) CN112908336A (zh)
WO (1) WO2022160749A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908336A (zh) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 一种用于语音处理装置的角色分离方法及其语音处理装置
CN113835065B (zh) * 2021-09-01 2024-05-17 深圳壹秘科技有限公司 Deep-learning-based sound source direction determination method, apparatus, device, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097878A (zh) * 2018-01-30 2019-08-06 阿拉的(深圳)人工智能有限公司 Multi-role voice prompt method, cloud device, prompt system, and storage medium
US20190251344A1 (en) * 2018-02-12 2019-08-15 Avodah Labs, Inc. Visual language interpretation system and user interface
CN110175260A (zh) * 2019-05-21 2019-08-27 深圳壹秘科技有限公司 Method, device, and computer-readable storage medium for distinguishing recording roles
CN110322869A (zh) * 2019-05-21 2019-10-11 平安科技(深圳)有限公司 Role-based conference speech synthesis method, apparatus, computer device, and storage medium
CN110691258A (zh) * 2019-10-30 2020-01-14 中央电视台 Program material production method and apparatus, computer storage medium, and electronic device
CN112908336A (zh) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 Role separation method for speech processing device, and speech processing device thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160026317A (ko) * 2014-08-29 2016-03-09 삼성전자주식회사 Voice recording method and apparatus
CN110459239A (zh) * 2019-03-19 2019-11-15 深圳壹秘科技有限公司 Sound-data-based role analysis method, apparatus, and computer-readable storage medium
CN110189764B (zh) * 2019-05-29 2021-07-06 深圳壹秘科技有限公司 System and method for displaying separated roles, and recording device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097878A (zh) * 2018-01-30 2019-08-06 阿拉的(深圳)人工智能有限公司 Multi-role voice prompt method, cloud device, prompt system, and storage medium
US20190251344A1 (en) * 2018-02-12 2019-08-15 Avodah Labs, Inc. Visual language interpretation system and user interface
US20200387697A1 (en) * 2018-02-12 2020-12-10 Avodah, Inc. Real-time gesture recognition method and apparatus
CN110175260A (zh) * 2019-05-21 2019-08-27 深圳壹秘科技有限公司 Method, device, and computer-readable storage medium for distinguishing recording roles
CN110322869A (zh) * 2019-05-21 2019-10-11 平安科技(深圳)有限公司 Role-based conference speech synthesis method, apparatus, computer device, and storage medium
CN110691258A (zh) * 2019-10-30 2020-01-14 中央电视台 Program material production method and apparatus, computer storage medium, and electronic device
CN112908336A (zh) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 Role separation method for speech processing device, and speech processing device thereof

Also Published As

Publication number Publication date
CN112908336A (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
  • JP7536789B2 (ja) Customized output to optimize for user preference in a distributed system
EP3963576B1 (en) Speaker attributed transcript generation
US10743107B1 (en) Synchronization of audio signals from distributed devices
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
US11875796B2 (en) Audio-visual diarization to identify meeting attendees
US9626970B2 (en) Speaker identification using spatial information
  • WO2022160749A1 (zh) Role separation method for speech processing device, and speech processing device thereof
US9210269B2 (en) Active speaker indicator for conference participants
  • CN113906503A (zh) Processing overlapping speech from distributed devices
  • WO2020073633A1 (zh) Conference speaker, and conference recording method, device, system, and computer storage medium
US20200351603A1 (en) Audio Stream Processing for Distributed Device Meeting
  • CN109560941A (zh) Conference recording method, apparatus, intelligent terminal, and storage medium
US11114115B2 (en) Microphone operations based on voice characteristics
US11468895B2 (en) Distributed device meeting initiation
  • CN113611308B (zh) Speech recognition method, apparatus, system, server, and storage medium
  • CN113921026A (zh) Speech enhancement method and apparatus
  • WO2022143040A1 (zh) Volume adjustment method, electronic device, terminal, and storage medium
  • JP7207568B2 (ja) Output method, output program, and output device
US20230421702A1 (en) Distributed teleconferencing using personalized enhancement models
EP4300492A1 (en) Method of noise reduction for intelligent network communication
  • TWI764020B (zh) Video conference system and method thereof
  • RU2821283C2 (ru) Customized output optimized for user preferences in a distributed system
  • WO2023088156A1 (zh) Sound speed correction method and apparatus
  • CN118248168A (zh) User profile analysis method and apparatus based on a large model and badge device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21922341

Country of ref document: EP

Kind code of ref document: A1