WO2019227552A1 - Speech localization method and apparatus based on behavior recognition (基于行为识别的语音定位方法以及装置) - Google Patents


Info

Publication number
WO2019227552A1
WO2019227552A1 · PCT/CN2018/092791 · CN2018092791W
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
speech
video signal
venue
behavior
Prior art date
Application number
PCT/CN2018/092791
Other languages
English (en)
French (fr)
Inventor
卢启伟
杨宁
刘胜强
Original Assignee
深圳市鹰硕技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市鹰硕技术有限公司
Publication of WO2019227552A1 publication Critical patent/WO2019227552A1/zh


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a speech localization method, device, electronic device, and computer-readable storage medium based on behavior recognition.
  • Rapid localization of the speaker allows the corresponding audio and/or video acquisition device to quickly and automatically turn toward the speaker, improving the audio and/or video acquisition quality.
  • Existing methods based on recognizing the speaker's face place high demands on the speaker and the other participants, requiring their facial features to differ significantly.
  • The hardware requirements for a video capture device that resolves facial features are correspondingly higher.
  • Methods based on multi-microphone localization, or on a dedicated speaker sound system, require many auxiliary devices, which raises configuration and operating costs.
  • Application number CN201611131001.5 discloses a voice localization method, device, and system.
  • The method includes: receiving voice information through multiple microphones and determining whether the voice information contains a first keyword voice; if it does, recording the positioning information of the first keyword voice received by each microphone; and determining the sound source location based on the position coordinates of the microphones and the positioning information.
  • With that voice localization method, device, and system, in a multi-person conference or other speech recognition setting the speaker only needs to utter the keyword voice for the speaker's direction to be located immediately, enabling directional sound pickup and improving the quality of the captured sound.
  • The localization system includes an image recognition and tracking subsystem and a voice localization and enhancement subsystem.
  • The image recognition and tracking subsystem includes: a camera that captures an image sequence; and an image recognition and tracking unit that identifies a person and buffers the three-dimensional coordinates of the face, wakes the voice localization and enhancement subsystem when the identified person performs a first predefined operation and sends the three-dimensional coordinates of the face, and then tracks the person and sends updated three-dimensional coordinates of the face.
  • The voice localization and enhancement subsystem includes: a microphone array that collects voice information; and a voice localization and enhancement unit that, according to a spatial filtering algorithm and the received three-dimensional coordinates of the face, steers the microphone array's directional focus to collect the person's voice information and locates the person from the collected voice information.
  • Application number CN201510066532.X discloses a partially adaptive array speech localization and enhancement method, comprising a basic generalized sidelobe canceller structure, the design of a blocking matrix, the design of a component filter, and a post Wiener filtering stage.
  • The method borrows the component structure, adds a post Wiener filter, and uses partially adaptive technology to preserve the algorithm's denoising performance, effectively suppress both incoherent and coherent noise, accelerate convergence, and reduce computational complexity.
  • The improved speech enhancement system has a higher output signal-to-noise ratio.
  • Simulation results show that the method has a higher output signal-to-noise ratio than a microphone array speech enhancement system based on a full-band generalized sidelobe canceller.
  • The above methods all rely either on recognizing the speaker's facial features or on multi-microphone localization of the speaker's position; none of them locates the speaker simply and reliably without depending on many auxiliary devices.
  • The purpose of the present disclosure is to provide a speech localization method, device, electronic device, and computer-readable storage medium based on behavior recognition, so as to overcome, at least to some extent, one or more problems caused by the limitations and defects of the related art.
  • A speech localization method based on behavior recognition, including:
  • a signal acquisition step: when a specific voice signal is received, acquiring time information of receiving the specific voice signal and a video signal collected by a video acquisition device in a time period corresponding to the time information;
  • a behavior feature matching step: analyzing N user behavior features in the video signal and matching the N user behavior features against a preset standard behavior feature to obtain a matching result;
  • a speaker determination step: if it is determined from the matching result that the N user behavior features include a speech behavior feature, taking the user corresponding to the speech behavior feature in the video signal as the speaker;
  • a speaking position localization step: analyzing the speaker's position in the venue from the video signal, determining the speaking position in the venue where the speaker is located, and controlling the audio/video acquisition device to aim at that speaking position.
  • the signal acquisition step includes:
  • the first speech feature matched to the key speech feature is used as a specific speech signal.
  • the behavior feature matching step includes:
  • if the N user behavior characteristics include a user behavior characteristic whose degree of matching with the preset standard behavior characteristic is greater than or equal to the preset matching degree, that user behavior characteristic is used as the speech behavior characteristic.
  • the matching result is that speech behavior characteristics are included in the N user behavior characteristics
  • the matching result is that the N user behavior characteristics do not include the speech behavior characteristics.
  • the method further includes:
  • the venue position in the video signal, after the position mapping, is divided into regions, and the location identifier of each region in the video signal and of the corresponding region in the actual venue is determined.
  • analyzing the position of the speaker in the venue where the speaker is located in the video signal includes:
  • the determining the speaking position in the venue where the speaker is located in the speaking position positioning step includes:
  • the audio / video capture device includes at least one microphone.
  • the speaking position positioning step includes:
  • the audio collection sound intensity of the microphone is adjusted.
  • the speaking position positioning step includes:
  • the method includes:
  • the audio / video acquisition device is reset.
  • the determining that the speech ends includes:
  • the speaking behavior characteristics include a raising hand behavior characteristic and a standing behavior characteristic.
  • a speech positioning device based on behavior recognition including:
  • a signal acquisition module configured to obtain, when a specific voice signal is received, time information for receiving the specific voice signal, and a video signal collected by a video acquisition device in a time period corresponding to the time information;
  • a behavior characteristic matching module configured to analyze N user behavior characteristics in the video signal, and match the N user behavior characteristics with preset standard behavior characteristics to obtain a matching result
  • a speaker determination module configured to determine, according to the matching result, that the N user behavior characteristics include a speech behavior characteristic, and use a user corresponding to the speech behavior characteristic in a video signal as a speaker;
  • a speaking position localization module, configured to analyze the speaker's position in the venue from the video signal, determine the speaking position in the venue where the speaker is located, and control the audio/video acquisition device to aim at that speaking position.
  • an electronic device including:
  • a processor; and
  • a memory storing computer-readable instructions which, when executed by the processor, implement the method according to any one of the foregoing.
  • a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method according to any one of the above.
  • In the speech localization method based on behavior recognition, when a specific speech signal is received, time information of receiving the specific speech signal and a video signal collected by a video acquisition device in a time period corresponding to the time information are acquired; the N user behavior characteristics in the video signal are analyzed and matched against a preset standard behavior characteristic; if it is determined from the matching result that the N user behavior characteristics include a speech behavior characteristic,
  • the user in the video signal corresponding to the speech behavior characteristic is taken as the speaker; the speaker's position in the video signal is analyzed and determined, and the audio/video acquisition device is controlled to aim at the speaking position in the venue where the speaker is located.
  • Because the recognition features are obvious and easy to match, the matching recognition rate is improved; using the specific speech signal as an enable marker for behavior feature matching further assists matching, saves matching computing resources, and improves the recognition rate.
  • FIG. 1 shows a flowchart of a speech localization method based on behavior recognition according to an exemplary embodiment of the present disclosure;
  • FIG. 2 shows a schematic diagram of an application scenario of a speech localization method based on behavior recognition according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of an application scenario of a speech localization method based on behavior recognition according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of an interactive application scenario of a voice positioning method based on behavior recognition according to an exemplary embodiment of the present disclosure
  • FIG. 5 shows a schematic block diagram of a speech localization device based on behavior recognition according to an exemplary embodiment of the present disclosure;
  • FIG. 6 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • FIG. 7 schematically illustrates a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
  • A speech localization method based on behavior recognition is first provided; it can be applied to electronic devices such as computers.
  • The speech localization method based on behavior recognition may include the following steps:
  • Signal acquisition step S110: when a specific voice signal is received, acquire time information of receiving the specific voice signal and a video signal collected by a video acquisition device in a time period corresponding to the time information;
  • Behavior feature matching step S120: analyze N user behavior features in the video signal, and match the N user behavior features against a preset standard behavior feature to obtain a matching result;
  • Speaker determination step S130: if it is determined from the matching result that the N user behavior features include a speech behavior feature, take the user corresponding to the speech behavior feature in the video signal as the speaker;
  • Speaking position localization step S140: analyze the speaker's position in the venue from the video signal, determine the speaking position in the venue where the speaker is located, and control the audio/video acquisition device to aim at that position.
  • Because the recognition features are obvious and easy to match, the matching recognition rate is improved;
  • the specific speech signal serves as an enable marker for behavior feature matching, which assists matching, saves matching computing resources, and further improves the matching recognition rate.
  • When a specific voice signal is received, time information of receiving the specific voice signal and a video signal collected by a video acquisition device in the corresponding time period may be acquired.
  • Using a user-specific voice signal as the start signal of the behavior-recognition-based speech localization method reduces misjudgment of standard behavior characteristics; once the specific voice signal is confirmed, its time information is determined so that video images from the same moment can be obtained from the video capture device for the next step of behavior feature verification.
  • The signal collecting step includes: collecting a first audio signal; extracting a first voice feature from the first audio signal; matching the first voice feature against the key voice features in a preset key voice feature model; and using a first voice feature that matches the key voice features as the specific voice signal.
  • The specific voice signal may be a phrase commonly used in meetings, such as "please *** speak", a phrase commonly used in teaching scenarios, such as "please *** answer" or "the classmate raising a hand", or any other common signature phrase that announces the next speaker.
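As a rough illustration of matching a first voice feature against preset key voice features, the sketch below works on a transcript, with keyword templates standing in for the acoustic key voice feature model; the template list and threshold are assumptions, not values from the patent.

```python
# Hypothetical sketch: a transcript and keyword templates stand in for the
# acoustic first voice feature and the preset key voice feature model.

KEY_TEMPLATES = ("please", "speak", "answer")  # assumed key voice features

def is_specific_voice_signal(transcript, templates=KEY_TEMPLATES, min_hits=2):
    """Treat the utterance as the specific voice signal when at least
    min_hits of the key voice features are matched."""
    words = set(transcript.lower().split())
    return sum(1 for t in templates if t in words) >= min_hits
```

A phrase like "Please Li Ming answer the question" matches two templates and would enable the behavior feature matching step; ordinary chatter would not.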
  • N user behavior features in the video signal may be analyzed, and the N user behavior features may be matched with a preset standard behavior feature to obtain a matching result.
  • The behavior feature matching step includes: analyzing N user behavior features in the video signal and matching the N user behavior features against a preset standard behavior feature; if the N user behavior features are determined to include a user behavior feature whose degree of matching with the preset standard behavior feature is greater than or equal to the preset matching degree, that user behavior feature is used as the speech behavior feature, and the matching result is that the N user behavior features include a speech behavior feature; if the matching degrees of all N user behavior features with the preset standard behavior feature are less than the preset matching degree, the matching result is that the N user behavior features do not include a speech behavior feature.
  • Each user's behavior features differ more or less from the standard behavior feature.
  • The degree of matching reflects such differences, and the preset matching degree serves as the criterion for whether a user behavior feature counts as the standard behavior feature.
  • the speaking behavior characteristics include a raising hand behavior characteristic and a standing behavior characteristic.
  • FIG. 2 is a schematic diagram of a teaching scene in which a user's hand-raising behavior feature is identified and matched as the standard behavior feature.
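One common way to compute such a matching degree (an illustrative choice; the patent does not prescribe a metric) is cosine similarity between a user's behavior-feature vector, e.g. pose keypoints, and the preset standard feature vector:

```python
import math

def matching_degree(user_vec, standard_vec):
    """Cosine similarity between a user's behavior-feature vector and the
    preset standard behavior-feature vector (illustrative metric)."""
    dot = sum(u * s for u, s in zip(user_vec, standard_vec))
    nu = math.sqrt(sum(u * u for u in user_vec))
    ns = math.sqrt(sum(s * s for s in standard_vec))
    return dot / (nu * ns) if nu and ns else 0.0

def speech_behavior_indices(user_vecs, standard_vec, preset_degree=0.9):
    """Indices of the N user behavior features whose matching degree reaches
    the preset matching degree; an empty list means the matching result is
    that no speech behavior feature is included."""
    return [i for i, v in enumerate(user_vecs)
            if matching_degree(v, standard_vec) >= preset_degree]
```

A user whose feature vector nearly coincides with the standard vector clears the preset matching degree and becomes a speech behavior feature; users with unrelated behavior fall well below it.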
  • In the speaker determination step S130, if it is determined from the matching result that the N user behavior characteristics include a speech behavior characteristic, the user corresponding to the speech behavior characteristic in the video signal is taken as the speaker.
  • the subject of the user behavior characteristic is determined as a speaker.
  • In the speaking position localization step S140, the speaker's position in the venue may be analyzed from the video signal, the speaking position in the venue where the speaker is located may be determined, and the audio/video acquisition device may be controlled to aim at that position in the venue.
  • The speaker's speaking position in the venue is determined from the venue information or by analysis of the video signal, so that the volume, focus, direction, and other settings of the audio/video acquisition device can be further adjusted to benefit audio/video acquisition.
  • The method further includes: determining the ratio of the venue position in the video signal to the actual venue position; mapping the venue position in the video signal to the actual venue position according to that ratio; and dividing the mapped venue position in the video signal into regions, determining the location identifier of each region in the video signal and of the corresponding region in the actual venue. With a preset region division for venues such as conference rooms or classrooms, accurate localization of speakers can be achieved from the video signal alone, without complicated algorithms that consume large amounts of computing resources.
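A minimal sketch of this ratio-based mapping and region division, assuming a fixed camera whose frame covers the whole venue; the frame size, venue size, and grid shape below are hypothetical examples, not values from the patent.

```python
def map_to_venue(px, py, frame_size, venue_size):
    """Map a pixel position in the video signal to actual venue coordinates
    using the frame-to-venue ratio."""
    fw, fh = frame_size
    vw, vh = venue_size
    return px * vw / fw, py * vh / fh

def region_identifier(x, y, venue_size, cols=4, rows=3):
    """Divide the venue into a cols x rows grid of regions and return the
    location identifier ('row-col') of the region containing point (x, y)."""
    vw, vh = venue_size
    col = min(int(x / vw * cols), cols - 1)
    row = min(int(y / vh * rows), rows - 1)
    return f"{row}-{col}"
```

For example, a speaker detected at pixel (960, 540) of a 1920 x 1080 frame over an 8 m x 6 m venue maps to venue point (4.0, 3.0), i.e. region "1-2" of the 3 x 4 grid; the audio/video device is then aimed at the actual-venue region carrying that location identifier.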
  • Analyzing the speaker's position in the venue from the video signal includes: analyzing the region of the speaker in the video signal, and determining the location identifier corresponding to the speaker's region in the video signal.
  • Determining the speaking position in the speaking position localization step includes: using the location identifier corresponding to the speaker's region in the video signal as the speaking position identifier; using the region corresponding to the speaking position identifier as the speaking position; and controlling the audio/video collection device to aim at the region in the actual venue corresponding to the speaking position identifier.
  • Aiming the audio equipment at the region of the actual venue corresponding to the speaking position identifier captures the speaker's speech more clearly; aiming the video equipment at that region enables focusing and similar operations, yielding a clearer video picture of the speaker.
  • the audio / video acquisition device includes at least one microphone.
  • An audio collection array composed of multiple microphones places higher demands on microphone angle; positioning the microphones by this method therefore improves their audio collection effect and enhances the user experience. As shown in FIG. 3, in a scene with an audio collection array of multiple microphones, after the user's position is located through user behavior feature recognition, the multiple microphones are aimed at the speaker.
  • The speaking position localization step includes: determining the distance between the speaker and the microphone from the speaking position, and adjusting the microphone's audio collection sound intensity according to that distance. Adjusting the sound intensity is another important aspect of the user experience: in a simple application scenario, for example, the sound intensity for a speaker far from the microphone can be boosted to capture the speaker's audio more clearly.
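A simple distance-based gain rule can sketch this adjustment. Free-field sound pressure falls off roughly as 1/r, so a gain that grows linearly with distance is a plausible first-order compensation; the reference distance, base gain, and cap below are illustrative assumptions, not values from the patent.

```python
def microphone_gain(distance_m, reference_m=1.0, base_gain=1.0, max_gain=8.0):
    """Raise the audio collection sound intensity for distant speakers.
    Gain grows linearly with distance relative to a reference distance and
    is capped so that very distant speakers do not just amplify noise."""
    gain = base_gain * max(distance_m, reference_m) / reference_m
    return min(gain, max_gain)
```

A speaker at the reference distance keeps unit gain, a speaker four times farther gets four times the gain, and gain saturates at the cap.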
  • the step of locating the speaking position includes: after determining the speaking position in the venue where the speaker is located, controlling the audio collection device to track and align the speaking position.
  • Localization and tracking of the speaker are achieved by matching the speaker's behavior features through the video acquisition device, so the audio/video acquisition device can stay aimed at and continue tracking the speaker even when the speaker's position changes dynamically.
  • The method includes: resetting the audio/video capture device after determining that the speech has finished. Once the speaker finishes, the audio/video acquisition device is reset to its initial position so that the next speaker can be quickly aimed at and tracked.
  • Determining that the speech has ended includes: acquiring a second audio signal collected by the audio/video acquisition device; extracting a second voice feature from the second audio signal; and if the second voice feature is determined to meet a preset ending feature condition, determining that the speech has ended.
  • The ending feature condition may be a preset duration: if the audio/video capture device detects no sound for longer than the preset duration, the condition is met. It may also be a preset ending feature database against which the second voice feature is matched; if the match succeeds, the preset ending feature condition is met.
  • The second voice feature may be "end of speech", "response completed", "thank you *** for the speech", and so on.
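Both ending conditions can be sketched together. The silence threshold and the phrase list below are hypothetical stand-ins for the preset duration and the ending feature database; a transcript stands in for the second voice feature.

```python
# Hypothetical ending feature database: phrases matched against the second
# voice feature, represented here as a transcript.
ENDING_PHRASES = ("end of speech", "response completed", "thank you")

def speech_ended(silence_seconds, transcript, max_silence=5.0, endings=ENDING_PHRASES):
    """The speech is judged finished if no sound was captured for longer
    than the preset duration, or if the second voice feature matches the
    ending feature database."""
    if silence_seconds > max_silence:
        return True
    text = transcript.lower()
    return any(p in text for p in endings)
```

When either condition fires, the audio/video acquisition device can be reset to its initial position for the next speaker.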
  • FIG. 4 is a schematic diagram of a data interaction scenario applied to a PC and a portable handheld device.
  • The speech localization device 500 based on behavior recognition may include: a signal acquisition module 510, a behavior feature matching module 520, a speaker determination module 530, and a speaking position localization module 540, wherein:
  • the signal acquisition module 510 is configured to acquire time information of receiving a specific voice signal and a video signal collected by a video acquisition device in a time period corresponding to the time information when a specific voice signal is received;
  • a behavior feature matching module 520 configured to analyze N user behavior features in the video signal, and match the N user behavior features with a preset standard behavior feature to obtain a matching result;
  • the speaker determination module 530 is configured to: if it is determined that the N user behavior characteristics include a speech behavior characteristic according to the matching result, use a user corresponding to the speech behavior characteristic in the video signal as a speaker;
  • The speaking position localization module 540 is used to analyze the speaker's position in the venue from the video signal, determine the speaking position in the venue where the speaker is located, and control the audio/video acquisition device to aim at that speaking position.
  • Although modules or units of the speech localization device 500 based on behavior recognition are mentioned in the detailed description above, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • an electronic device capable of implementing the above method.
  • An electronic device 600 according to such an embodiment of the present invention is described below with reference to FIG. 6.
  • the electronic device 600 shown in FIG. 6 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiment of the present invention.
  • the electronic device 600 is expressed in the form of a general-purpose computing device.
  • the components of the electronic device 600 may include, but are not limited to, the at least one processing unit 610, the at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
  • The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps of the various exemplary embodiments according to the present invention described in the "exemplary method" section above of this specification.
  • the processing unit 610 may perform steps S110 to S140 as shown in FIG. 1.
  • the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and / or a cache storage unit 6202, and may further include a read-only storage unit (ROM) 6203.
  • the storage unit 620 may further include a program / utility tool 6204 having a set of (at least one) program modules 6205.
  • program modules 6205 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment.
  • The bus 630 may be one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The electronic device 600 may also communicate with one or more external devices 670 (such as a keyboard, pointing device, or Bluetooth device), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., a router or modem) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication is performed through an input/output (I/O) interface 650.
  • the electronic device 600 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and / or a public network, such as the Internet) through the network adapter 660. As shown, the network adapter 660 communicates with other modules of the electronic device 600 through the bus 630.
  • The technical solution according to the embodiments of the present disclosure may be embodied as a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes instructions that cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to an embodiment of the present disclosure.
  • a computer-readable storage medium on which a program product capable of implementing the above-mentioned method of the present specification is stored.
  • Various aspects of the present invention may also be implemented as a program product, which includes program code.
  • When the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "exemplary method" section above of this specification.
  • A program product 700 for implementing the above method according to an embodiment of the present invention is described; it may use a portable compact disc read-only memory (CD-ROM), includes program code, and may run on a terminal device such as a personal computer.
  • the program product of the present invention is not limited thereto.
  • the readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • the program code for performing the operations of the present invention can be written in any combination of one or more programming languages, which include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as "C" or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • the recognized features are more distinct and easier to match, which improves the matching recognition rate.
  • the specific voice signal serves as an enabling marker that assists the matching, saving matching computation resources and further improving the matching recognition rate.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Telephone Function (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

A behavior-recognition-based voice localization method, apparatus, electronic device, and storage medium. The method includes: upon receiving a specific voice signal, obtaining time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information (S110); analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result (S120); if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker (S130); and analyzing and determining the speaker's position in the venue from the video signal, and controlling an audio/video capture device to aim at the speaker's speaking position in the venue (S140). The method localizes the speaker by recognizing and matching the speaker's behavior features.

Description

Behavior-recognition-based voice localization method and apparatus. Technical Field
The present disclosure relates to the field of computer technology, and in particular to a behavior-recognition-based voice localization method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In settings such as conferences and classrooms, rapidly localizing a speaker allows the corresponding audio and/or video capture device to aim at the speaker quickly and automatically, improving the quality of audio and/or video capture.
However, existing approaches based on recognizing the speaker's facial features place high demands on the speaker and the other participants, whose facial features must differ substantially, and also impose stricter hardware requirements on the video capture device. Approaches based on multi-microphone localization, or on a dedicated speaking system, require a large number of additional accessories, which increases configuration and operating costs.
In the prior art, application No. CN201611131001.5 discloses a voice localization method, apparatus, and system. The method includes: receiving voice information through a plurality of microphones, and determining whether the voice information contains a first keyword utterance; if it does, recording the localization information with which each microphone received the first keyword utterance; and computing the position of the sound source that produced the first keyword utterance from the position coordinates of the microphones and the localization information. With this voice localization method, apparatus, and system, in a multi-person conference or other speech recognition setting, a speaker only needs to utter the keyword for the speaker's direction to be localized immediately, enabling directional sound pickup and improving pickup quality.
Application No. CN201610304047.6 discloses an image-assisted voice localization and enhancement system and method. The localization system includes an image recognition and tracking subsystem and a voice localization and enhancement subsystem. The image recognition and tracking subsystem includes: a camera for capturing image sequences; and an image recognition and tracking unit for recognizing persons and caching three-dimensional face coordinates, waking the voice localization and enhancement subsystem when a recognized person performs a first predefined action and sending the three-dimensional face coordinates, and tracking the recognized person and sending updated three-dimensional face coordinates. The voice localization and enhancement subsystem includes: a microphone array for capturing voice information; and a voice localization and enhancement unit that, based on a spatial filtering algorithm and the received three-dimensional face coordinates, controls the microphone array to focus directionally on capturing the person's voice information, and localizes the person from the captured voice information.
Application No. CN201510066532.X discloses a partitioned-processing array-based voice localization and enhancement method, comprising the basic structure of a generalized sidelobe canceller, the design of a blocking matrix, the design of component filters, and an external Wiener filtering part. The method borrows the component structure, adds a post Wiener filter, and uses partially adaptive techniques to guarantee the denoising performance of the algorithm, effectively suppressing both incoherent and coherent noise, accelerating convergence, and reducing computational complexity. Compared with a traditional microphone-array speech enhancement system based on a generalized sidelobe canceller, the improved speech enhancement system achieves a higher output signal-to-noise ratio. Simulation experiments show that, compared with a microphone-array speech enhancement system based on a full-band generalized sidelobe canceller, the method achieves a higher output signal-to-noise ratio.
The above methods all rely either on recognizing the speaker's facial features or on multi-microphone localization of the speaker's position; none of them solves the problem of localizing the speaker simply and reliably without depending on excessive auxiliary equipment.
Therefore, one or more technical solutions that can solve at least the above problems are needed.
It should be noted that the information disclosed in the Background section above is provided only to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Summary of the Invention
An object of the present disclosure is to provide a behavior-recognition-based voice localization method, apparatus, electronic device, and computer-readable storage medium, thereby overcoming, at least to some extent, one or more problems caused by the limitations and drawbacks of the related art.
According to one aspect of the present disclosure, a behavior-recognition-based voice localization method is provided, including:
a signal acquisition step: upon receiving a specific voice signal, obtaining time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching step: analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination step: if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker;
a speaking position localization step: analyzing the speaker's position in the venue from the video signal, determining the speaker's speaking position in the venue, and controlling an audio/video capture device to aim at the speaker's speaking position in the venue.
In an exemplary embodiment of the present disclosure, the signal acquisition step includes:
capturing a first audio signal;
extracting a first voice feature from the first audio signal;
matching the first voice feature against key voice features in a preset key voice feature model;
taking a first voice feature that matches the key voice features as the specific voice signal.
In an exemplary embodiment of the present disclosure, the behavior feature matching step includes:
analyzing the N user behavior features in the video signal, and matching the N user behavior features against the preset standard behavior features;
if the N user behavior features are determined to include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that user behavior feature as a speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature;
if the matching degrees of all N user behavior features with the preset standard behavior features are below the preset matching degree, the matching result being that the N user behavior features do not include a speaking behavior feature.
In an exemplary embodiment of the present disclosure, the method further includes:
determining the ratio between venue positions in the video signal and actual venue positions;
mapping venue positions in the video signal to actual venue positions according to the ratio;
dividing the mapped venue positions in the video signal into regions, and determining a position identifier for each region in the video signal and its corresponding region in the actual venue.
In an exemplary embodiment of the present disclosure, in the speaking position localization step, analyzing the speaker's position in the venue from the video signal includes:
analyzing the region of the video signal in which the speaker appears;
determining the position identifier corresponding to the speaker's region in the video signal.
In an exemplary embodiment of the present disclosure, determining the speaker's speaking position in the venue in the speaking position localization step includes:
taking the position identifier corresponding to the speaker's region in the video signal as the speaking position identifier;
taking the region corresponding to the speaking position identifier as the speaking position, and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier.
In an exemplary embodiment of the present disclosure, the audio/video capture device includes at least one microphone.
In an exemplary embodiment of the present disclosure, the speaking position localization step includes:
determining the distance between the speaker and the microphone according to the speaking position;
adjusting the audio capture gain of the microphone according to the distance.
In an exemplary embodiment of the present disclosure, the speaking position localization step includes:
after determining the speaker's speaking position in the venue, controlling the audio capture device to track and stay aimed at the speaking position.
In an exemplary embodiment of the present disclosure, the method includes:
after determining that the speech has ended, resetting the audio/video capture device.
In an exemplary embodiment of the present disclosure, determining that the speech has ended includes:
obtaining a second audio signal captured by the audio/video capture device;
extracting a second voice feature from the second audio signal;
if the second voice feature is determined to satisfy a preset closing-remark feature condition, determining that the speech has ended.
In an exemplary embodiment of the present disclosure, the speaking behavior feature includes a hand-raising behavior feature and a standing-up behavior feature.
According to one aspect of the present disclosure, a behavior-recognition-based voice localization apparatus is provided, including:
a signal acquisition module, configured to, upon receiving a specific voice signal, obtain time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching module, configured to analyze N user behavior features in the video signal, and match the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination module, configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker;
a speaking position localization module, configured to analyze the speaker's position in the venue from the video signal, determine the speaker's speaking position in the venue, and control an audio/video capture device to aim at the speaker's speaking position in the venue.
According to one aspect of the present disclosure, an electronic device is provided, including:
a processor; and
a memory storing computer-readable instructions that, when executed by the processor, implement the method according to any one of the above.
According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the method according to any one of the above when executed by a processor.
In the behavior-recognition-based voice localization method of the exemplary embodiments of the present disclosure, upon receiving a specific voice signal, time information of the receipt of the specific voice signal is obtained, together with a video signal captured by a video capture device during the period corresponding to the time information; N user behavior features in the video signal are analyzed and matched against preset standard behavior features; if the matching result indicates that the N user behavior features include a speaking behavior feature, the user in the video signal corresponding to the speaking behavior feature is taken as the speaker; and the speaker's position in the venue is analyzed and determined from the video signal, and an audio/video capture device is controlled to aim at the speaker's speaking position in the venue. On the one hand, because user behavior features are used as the matching features for the speaker, the recognized features are more distinct and easier to match, improving the matching recognition rate; on the other hand, using the specific voice signal as an enabling marker for behavior feature matching assists the matching, saving matching computation resources and further improving the matching recognition rate.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The above and other features and advantages of the present disclosure will become more apparent from the detailed description of its exemplary embodiments with reference to the accompanying drawings.
FIG. 1 shows a flowchart of a behavior-recognition-based voice localization method according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of an application scenario of a behavior-recognition-based voice localization method according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an application scenario of a behavior-recognition-based voice localization method according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an interactive application scenario of a behavior-recognition-based voice localization method according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of a behavior-recognition-based voice localization apparatus according to an exemplary embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
FIG. 7 schematically shows a diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be implemented in many forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concepts of the exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and repeated description of them will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will recognize that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, materials, apparatuses, steps, and so forth. In other instances, well-known structures, methods, apparatuses, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the present disclosure.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities, or parts of them, may be implemented in software, in one or more modules combining software and hardware, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
In this exemplary embodiment, a behavior-recognition-based voice localization method is first provided, which can be applied to electronic devices such as computers. Referring to FIG. 1, the behavior-recognition-based voice localization method may include the following steps:
signal acquisition step S110: upon receiving a specific voice signal, obtaining time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
behavior feature matching step S120: analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result;
speaker determination step S130: if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker;
speaking position localization step S140: analyzing the speaker's position in the venue from the video signal, determining the speaker's speaking position in the venue, and controlling an audio/video capture device to aim at the speaker's speaking position in the venue.
According to the behavior-recognition-based voice localization method in this exemplary embodiment, on the one hand, because user behavior features are used as the matching features for the speaker, the recognized features are more distinct and easier to match, improving the matching recognition rate; on the other hand, using the specific voice signal as an enabling marker for behavior feature matching assists the matching, saving matching computation resources and further improving the matching recognition rate.
The behavior-recognition-based voice localization method in this exemplary embodiment is further described below.
In the signal acquisition step S110, upon receiving a specific voice signal, time information of the receipt of the specific voice signal may be obtained, together with a video signal captured by a video capture device during the period corresponding to the time information.
In this exemplary embodiment, using a user's specific voice signal as the start signal of the behavior-recognition-based voice localization method reduces misjudgment of standard behavior features. After a signal is judged to be the specific voice signal, its time information is obtained so that video images from the same time can be retrieved from the video capture device for the next step of behavior feature verification.
In this exemplary embodiment, the signal acquisition step includes: capturing a first audio signal; extracting a first voice feature from the first audio signal; matching the first voice feature against key voice features in a preset key voice feature model; and taking a first voice feature that matches the key voice features as the specific voice signal. The specific voice signal may be a phrase commonly heard in meetings, such as "Please, ***, take the floor" or "Let us welcome *** to speak", or a phrase commonly heard in teaching, such as "Please, ***, answer" or "The student raising a hand", that is, a typical marker phrase inviting a candidate user to speak.
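As a non-authoritative illustration only (not part of the claimed method), the keyword matching just described could be sketched as follows. The disclosure matches acoustic voice features against a key voice feature model; this sketch assumes the audio has already been transcribed to text, and the phrase list and substring matching are illustrative assumptions:

```python
# Hypothetical sketch of the signal acquisition step: match an utterance,
# assumed already transcribed to text, against preset trigger phrases that
# stand in for the patent's key voice feature model.
TRIGGER_PATTERNS = [
    "please take the floor",
    "welcome to speak",
    "please answer",
]

def detect_specific_voice_signal(transcript: str) -> bool:
    """Return True if the transcript contains any preset key phrase."""
    text = transcript.lower()
    return any(pattern in text for pattern in TRIGGER_PATTERNS)

def on_audio(transcript: str, timestamp: float):
    """If the trigger fires, record the time information used later to
    fetch the video frames from the same period."""
    if detect_specific_voice_signal(transcript):
        return {"trigger_time": timestamp}
    return None
```

For example, `on_audio("Now, welcome to speak, Ms. Li", 12.5)` would return `{"trigger_time": 12.5}`, the time information with which the matching video segment is retrieved.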
In the behavior feature matching step S120, the N user behavior features in the video signal may be analyzed and matched against the preset standard behavior features to obtain a matching result.
In this exemplary embodiment, a conference or teaching scene generally contains multiple users and thus multiple corresponding user behavior features. These behavior features are recognized and counted, then matched against the preset standard behavior features; the resulting matching result can be used to further judge whether a feature is the behavior feature corresponding to a speaker.
In this exemplary embodiment, the behavior feature matching step includes: analyzing the N user behavior features in the video signal, and matching them against the preset standard behavior features; if the N user behavior features include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that feature as a speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature; if the matching degrees of all N user behavior features with the preset standard behavior features are below the preset matching degree, the matching result is that the N user behavior features do not include a speaking behavior feature. When matching user behavior features in a real scene, each user's behavior features differ more or less from the standard behavior features; the matching degree reflects this difference, and the preset matching degree serves as the criterion for whether a user's behavior feature counts as a standard behavior feature.
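A minimal sketch of this thresholding logic, under the assumption that behavior features have already been reduced to similarity scores in [0, 1] by some recognition model (the scoring model itself, and the 0.8 threshold, are assumptions, not values fixed by the disclosure):

```python
# Hypothetical sketch: select the user whose behavior feature best matches
# a preset standard feature, subject to a preset matching-degree threshold.
PRESET_MATCHING_DEGREE = 0.8  # assumed threshold

def find_speaking_behavior(user_scores):
    """user_scores maps user id -> matching degree against the standard
    behavior feature (e.g. hand-raising). Returns the best-matching user
    at or above the threshold, or None if no feature qualifies."""
    qualified = {user: score for user, score in user_scores.items()
                 if score >= PRESET_MATCHING_DEGREE}
    if not qualified:
        return None  # matching result: no speaking behavior feature
    return max(qualified, key=qualified.get)
```

For example, `find_speaking_behavior({"u1": 0.3, "u2": 0.92})` returns `"u2"`, while a score set entirely below the threshold returns `None`.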
In this exemplary embodiment, depending on the application scenario, the speaking behavior feature includes a hand-raising behavior feature and a standing-up behavior feature. The schematic diagram of a behavior-recognition-based voice localization method shown in FIG. 2 illustrates a teaching scene in which a user's hand-raising behavior feature is recognized and matched as a standard behavior feature.
In the speaker determination step S130, if the matching result indicates that the N user behavior features include a speaking behavior feature, the user in the video signal corresponding to the speaking behavior feature may be taken as the speaker.
In this exemplary embodiment, after a user behavior feature has been matched against the standard behavior feature and confirmed as a standard behavior feature, which indicates that the user intends or tends to speak and has performed that behavior, the subject of the user behavior feature is determined to be the speaker.
In the speaking position localization step S140, the speaker's position in the venue may be analyzed from the video signal, the speaker's speaking position in the venue determined, and the audio/video capture device controlled to aim at the speaker's speaking position in the venue.
In this exemplary embodiment, after the speaker is determined, the speaker's speaking position in the venue is determined from venue information, video signal analysis, or similar means, so that the volume, focus, direction, and other settings of the audio/video capture device that benefit audio/video capture can be adjusted further.
In this exemplary embodiment, the method further includes: determining the ratio between venue positions in the video signal and actual venue positions; mapping venue positions in the video signal to actual venue positions according to the ratio; and dividing the mapped venue positions in the video signal into regions, determining a position identifier for each region in the video signal and its corresponding region in the actual venue. With a preset region division of a conference or teaching venue, the speaker can be localized accurately from the video signal captured by the video capture device alone, without complex algorithms that consume large amounts of computing resources.
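One way to picture this region mapping is the sketch below; the pixel-to-metre ratio, region size, and grid width are illustrative assumptions, not values prescribed by the disclosure:

```python
# Hypothetical sketch: map a pixel position in the video signal to a venue
# region identifier, via a fixed video-to-venue ratio and a grid division.
PIXELS_PER_METRE = 100   # assumed ratio between video and actual venue
REGION_SIZE_M = 1.0      # assumed edge length of one venue region, metres
GRID_COLS = 8            # assumed number of region columns in the venue

def position_identifier(px: int, py: int) -> int:
    """Convert a video pixel coordinate to the identifier of the venue
    region it maps onto."""
    venue_x = px / PIXELS_PER_METRE   # position mapping by the preset ratio
    venue_y = py / PIXELS_PER_METRE
    col = int(venue_x // REGION_SIZE_M)   # region division
    row = int(venue_y // REGION_SIZE_M)
    return row * GRID_COLS + col          # the region's position identifier
```

A speaker detected at pixel (350, 120) would map to venue coordinates (3.5 m, 1.2 m) and hence to region identifier 11 in this assumed 8-column grid.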
In this exemplary embodiment, in the speaking position localization step, analyzing the speaker's position in the venue from the video signal includes: analyzing the region of the video signal in which the speaker appears; and determining the position identifier corresponding to the speaker's region in the video signal. The speaker's region in the video signal is matched to the position in the actual scene, and a position identifier is generated as the speaker's position information.
In this exemplary embodiment, determining the speaker's speaking position in the venue in the speaking position localization step includes: taking the position identifier corresponding to the speaker's region in the video signal as the speaking position identifier; taking the region corresponding to the speaking position identifier as the speaking position; and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier. In a conference or teaching scene, aiming the audio device at the venue region corresponding to the speaking position identifier allows the speaker's speech to be captured more clearly; aiming the video device at that region allows operations such as focusing, yielding a clearer video image of the speaker.
In this exemplary embodiment, the audio/video capture device includes at least one microphone. In particular, some scenes with high audio requirements use an audio capture array composed of multiple microphones, which places high demands on microphone angles; localizing the microphones with this method can therefore improve their audio capture quality and enhance the user experience. FIG. 3 is a schematic diagram of aiming multiple microphones at the speaker, after the user's position has been localized through user behavior feature recognition, in a scene with an audio capture array composed of multiple microphones.
In this exemplary embodiment, the speaking position localization step includes: determining the distance between the speaker and the microphone according to the speaking position; and adjusting the audio capture gain of the microphone according to the distance. Gain adjustment is another important aspect of enhancing the user experience: in a simple application scenario, boosting the audio gain for a speaker who is farther from the microphone yields a clearer recording of that speaker.
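For illustration only, a distance-proportional gain rule could look like the following; the reference distance, the linear law, and the cap are assumptions, since the disclosure does not fix a particular formula:

```python
# Hypothetical sketch: scale microphone capture gain with speaker distance
# so that farther speakers are boosted. Constants and the linear rule are
# assumptions, not values taken from the disclosure.
REFERENCE_DISTANCE_M = 1.0   # assumed distance at which gain is 1.0
MAX_GAIN = 4.0               # assumed cap to avoid amplifying room noise

def microphone_gain(distance_m: float) -> float:
    """Return a capture gain proportional to the speaker's distance."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return min(distance_m / REFERENCE_DISTANCE_M, MAX_GAIN)
```

A speaker 2.5 m away would be captured at 2.5x gain, while a very distant speaker is clamped at the assumed maximum of 4.0x.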
In this exemplary embodiment, the speaking position localization step includes: after determining the speaker's speaking position in the venue, controlling the audio capture device to track and stay aimed at the speaking position. By localizing and tracking the speaker through the video capture device's matching of the speaker's behavior features, the audio/video capture device can keep aiming at and tracking the speaker even as the speaker changes position.
In this exemplary embodiment, the method includes: after determining that the speech has ended, resetting the audio/video capture device. After the speaker finishes speaking, the audio/video capture device is reset to its initial position so that the next speaker can be aimed at and tracked quickly.
In this exemplary embodiment, determining that the speech has ended includes: obtaining a second audio signal captured by the audio/video capture device; extracting a second voice feature from the second audio signal; and, if the second voice feature is determined to satisfy a preset closing-remark feature condition, determining that the speech has ended. The closing-remark feature condition may be a preset duration: if no sound is detected from the speaking position at which the audio/video capture device is aimed for longer than the preset duration, the condition is satisfied. It may also be a preset closing-remark feature library: the second voice feature is matched against the library, and if the match succeeds the condition is satisfied. The second voice feature may be a phrase such as "That concludes my remarks", "I have finished my answer", or "Thank you, ***, for the speech".
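The two closing conditions just described, a silence timeout or a matched closing phrase, can be sketched as below; the timeout value and phrase list are illustrative assumptions, and the transcript-based matching stands in for the voice feature matching of the disclosure:

```python
# Hypothetical sketch: a speech is considered ended either after silence
# longer than a preset duration, or when a preset closing phrase is heard.
SILENCE_TIMEOUT_S = 8.0  # assumed preset duration
CLOSING_PHRASES = [      # assumed closing-remark feature library
    "that concludes my remarks",
    "i have finished my answer",
]

def speech_ended(silence_seconds: float, transcript: str = "") -> bool:
    """Return True if either closing-remark feature condition is met."""
    if silence_seconds > SILENCE_TIMEOUT_S:
        return True                                   # silence condition
    text = transcript.lower()
    return any(p in text for p in CLOSING_PHRASES)    # phrase condition
```

Once this returns True, the capture device would be reset to its initial position, as described above.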
In this exemplary embodiment, the method can be applied on a PC, on a portable handheld device, and with data interaction between the two types of device. FIG. 4 is a schematic diagram of a scenario in which the method is applied on a PC and a portable handheld device with data interaction between them.
It should be noted that although the steps of the method of the present disclosure are described in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps, and so on.
Furthermore, in this exemplary embodiment, a behavior-recognition-based voice localization apparatus is also provided. Referring to FIG. 5, the behavior-recognition-based voice localization apparatus 500 may include a signal acquisition module 510, a behavior feature matching module 520, a speaker determination module 530, and a speaking position localization module 540, wherein:
the signal acquisition module 510 is configured to, upon receiving a specific voice signal, obtain time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
the behavior feature matching module 520 is configured to analyze N user behavior features in the video signal, and match the N user behavior features against preset standard behavior features to obtain a matching result;
the speaker determination module 530 is configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker;
the speaking position localization module 540 is configured to analyze the speaker's position in the venue from the video signal, determine the speaker's speaking position in the venue, and control an audio/video capture device to aim at the speaker's speaking position in the venue.
The specific details of the modules of the behavior-recognition-based voice localization apparatus above have already been described in detail in the corresponding behavior-recognition-based voice localization method, and are therefore not repeated here.
It should be noted that although several modules or units of the behavior-recognition-based voice localization apparatus 500 are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Furthermore, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that various aspects of the present invention may be implemented as a system, a method, or a program product. Accordingly, various aspects of the present invention may take the form of a fully hardware embodiment, a fully software embodiment (including firmware, microcode, and so on), or an embodiment combining hardware and software, which may collectively be referred to here as a "circuit", "module", or "system".
An electronic device 600 according to such an embodiment of the present invention is described below with reference to FIG. 6. The electronic device 600 shown in FIG. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in FIG. 6, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the processing unit 610 may perform steps S110 through S140 as shown in FIG. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202, and may further include a read-only storage unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 670 (such as a keyboard, a pointing device, a Bluetooth device, and so on), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router, a modem, and so on) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. In addition, the electronic device 600 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown in the figure, the network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and so on.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described here may be implemented in software, or in software combined with the necessary hardware. Accordingly, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and includes instructions for causing a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, or the like) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the method described above in this specification is stored. In some possible embodiments, various aspects of the present invention may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification.
Referring to FIG. 7, a program product 700 for implementing the above method according to an embodiment of the present invention is described; it may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
Program code contained on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical cable, RF, and so on, or any suitable combination of the above.
Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present invention, and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the chronological order of these processes. It is also easy to understand that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the present disclosure will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed by the present disclosure. The specification and the embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Industrial Applicability
On the one hand, because user behavior features are used as the matching features for the speaker, the recognized features are more distinct and easier to match, improving the matching recognition rate; on the other hand, using the specific voice signal as an enabling marker for behavior feature matching assists the matching, saving matching computation resources and further improving the matching recognition rate.

Claims (15)

  1. A behavior-recognition-based voice localization method, characterized in that the method comprises:
    a signal acquisition step: upon receiving a specific voice signal, obtaining time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
    a behavior feature matching step: analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result;
    a speaker determination step: if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker;
    a speaking position localization step: analyzing the speaker's position in the venue from the video signal, determining the speaker's speaking position in the venue, and controlling an audio/video capture device to aim at the speaker's speaking position in the venue.
  2. The method according to claim 1, characterized in that the signal acquisition step comprises:
    capturing a first audio signal;
    extracting a first voice feature from the first audio signal;
    matching the first voice feature against key voice features in a preset key voice feature model;
    taking a first voice feature that matches the key voice features as the specific voice signal.
  3. The method according to claim 1, characterized in that the behavior feature matching step comprises:
    analyzing the N user behavior features in the video signal, and matching the N user behavior features against the preset standard behavior features;
    if the N user behavior features are determined to include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that user behavior feature as a speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature;
    if the matching degrees of all N user behavior features with the preset standard behavior features are below the preset matching degree, the matching result being that the N user behavior features do not include a speaking behavior feature.
  4. The method according to claim 1, characterized in that the method further comprises:
    determining the ratio between venue positions in the video signal and actual venue positions;
    mapping venue positions in the video signal to actual venue positions according to the ratio;
    dividing the mapped venue positions in the video signal into regions, and determining a position identifier for each region in the video signal and its corresponding region in the actual venue.
  5. The method according to claim 4, characterized in that, in the speaking position localization step, analyzing the speaker's position in the venue from the video signal comprises:
    analyzing the region of the video signal in which the speaker appears;
    determining the position identifier corresponding to the speaker's region in the video signal.
  6. The method according to claim 5, characterized in that determining the speaker's speaking position in the venue in the speaking position localization step comprises:
    taking the position identifier corresponding to the speaker's region in the video signal as the speaking position identifier;
    taking the region corresponding to the speaking position identifier as the speaking position, and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier.
  7. The method according to claim 1, characterized in that the audio/video capture device comprises at least one microphone.
  8. The method according to claim 7, characterized in that the speaking position localization step comprises:
    determining the distance between the speaker and the microphone according to the speaking position;
    adjusting the audio capture gain of the microphone according to the distance.
  9. The method according to claim 1, characterized in that the speaking position localization step comprises:
    after determining the speaker's speaking position in the venue, controlling the audio/video capture device to track and stay aimed at the speaking position.
  10. The method according to claim 1, characterized in that the method comprises:
    after determining that the speech has ended, resetting the audio/video capture device.
  11. The method according to claim 10, characterized in that determining that the speech has ended comprises:
    obtaining a second audio signal captured by the audio/video capture device;
    extracting a second voice feature from the second audio signal;
    if the second voice feature is determined to satisfy a preset closing-remark feature condition, determining that the speech has ended.
  12. The method according to claim 1, characterized in that the speaking behavior feature comprises a hand-raising behavior feature and a standing-up behavior feature.
  13. A behavior-recognition-based voice localization apparatus, characterized in that the apparatus comprises:
    a signal acquisition module, configured to, upon receiving a specific voice signal, obtain time information of the receipt of the specific voice signal, and a video signal captured by a video capture device during the period corresponding to the time information;
    a behavior feature matching module, configured to analyze N user behavior features in the video signal, and match the N user behavior features against preset standard behavior features to obtain a matching result;
    a speaker determination module, configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker;
    a speaking position localization module, configured to analyze the speaker's position in the venue from the video signal, determine the speaker's speaking position in the venue, and control an audio/video capture device to aim at the speaker's speaking position in the venue.
  14. An electronic device, characterized by comprising:
    a processor; and
    a memory storing computer-readable instructions that, when executed by the processor, implement the method according to any one of claims 1 to 12.
  15. A computer-readable storage medium on which a computer program is stored, the computer program implementing the method according to any one of claims 1 to 12 when executed by a processor.
PCT/CN2018/092791 2018-06-01 2018-06-26 Behavior-recognition-based voice localization method and apparatus WO2019227552A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810557504.1 2018-06-01
CN201810557504.1A CN109031201A (zh) 2018-06-01 2018-12-18 Behavior-recognition-based voice localization method and apparatus

Publications (1)

Publication Number Publication Date
WO2019227552A1 true WO2019227552A1 (zh) 2019-12-05

Family

ID=64612185

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092791 WO2019227552A1 (zh) 2018-06-01 2018-06-26 Behavior-recognition-based voice localization method and apparatus

Country Status (2)

Country Link
CN (1) CN109031201A (zh)
WO (1) WO2019227552A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862316A (zh) * 2019-01-29 2019-06-07 安徽理工大学 Automatic monitoring method and apparatus based on image analysis technology
CN111343408B (zh) * 2020-01-22 2021-02-09 北京翼鸥教育科技有限公司 Hand-raising initiation and response method and interaction system for multi-party video activities
CN112788278B (zh) * 2020-12-30 2023-04-07 北京百度网讯科技有限公司 Video stream generation method, apparatus, device, and storage medium
CN113242505A (zh) * 2021-05-18 2021-08-10 苏州朗捷通智能科技有限公司 Audio control system and control method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1422494A (zh) * 2000-12-05 2003-06-04 皇家菲利浦电子有限公司 Method and apparatus for predicting events in video conferencing and other applications
US20050210105A1 (en) * 2004-03-22 2005-09-22 Fuji Xerox Co., Ltd. Conference information processing apparatus, and conference information processing method and storage medium readable by computer
CN103581608A (zh) * 2012-07-20 2014-02-12 Polycom通讯技术(北京)有限公司 Speaker detection system, speaker detection method, and audio/video conferencing system
CN107369449A (zh) * 2017-07-14 2017-11-21 上海木爷机器人技术有限公司 Effective speech recognition method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103686074A (zh) * 2013-11-20 2014-03-26 南京熊猫电子股份有限公司 Method for localizing moving targets in video surveillance
KR101642084B1 (ko) * 2014-05-29 2016-07-22 경희대학교 산학협력단 Apparatus and method for face detection using multiple sound source localization
CN104409075B (zh) * 2014-11-28 2018-09-04 深圳创维-Rgb电子有限公司 Speech recognition method and system
CN106251334B (zh) * 2016-07-18 2019-03-01 华为技术有限公司 Camera parameter adjustment method, directing camera, and system
CN106603878B (zh) * 2016-12-09 2019-09-06 奇酷互联网络科技(深圳)有限公司 Voice localization method, apparatus, and system


Also Published As

Publication number Publication date
CN109031201A (zh) 2018-12-18

Similar Documents

Publication Publication Date Title
WO2019227552A1 (zh) Behavior-recognition-based voice localization method and apparatus
US10887690B2 (en) Sound processing method and interactive device
KR102481454B1 (ko) Hands-free device having a directional interface
JP2021086154A (ja) Speech recognition method, apparatus, device, and computer-readable storage medium
US9263044B1 (en) Noise reduction based on mouth area movement recognition
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
US8614733B2 (en) Apparatus, system, and method of preventing leakage of information
US20150088515A1 (en) Primary speaker identification from audio and video data
US20160019886A1 (en) Method and apparatus for recognizing whisper
US11164571B2 (en) Content recognizing method and apparatus, device, and computer storage medium
US20140241702A1 (en) Dynamic audio perspective change during video playback
WO2021093380A1 (zh) Noise processing method, apparatus, and system
CN109032345B (zh) 设备控制方法、装置、设备、服务端和存储介质
CN109947387B (zh) Audio capture method, audio playback method, system, device, and storage medium
CN109361995B (zh) Volume adjustment method and apparatus for an electrical appliance, electrical appliance, and medium
US10325600B2 (en) Locating individuals using microphone arrays and voice pattern matching
CN112601045A (zh) Speech control method, apparatus, device, and storage medium for video conferences
WO2021120190A1 (zh) Data processing method and apparatus, electronic device, and storage medium
CN110611861B (zh) Directional sound emission control method and apparatus, sound emission device, medium, and electronic device
KR101508092B1 (ko) Method and system for supporting video conferences
CN112487246A (zh) Method and apparatus for identifying a speaker in multi-person video
CN113853529A (zh) Apparatus and associated methods for spatial audio capture
CN107680592A (zh) Mobile terminal speech recognition method, mobile terminal, and storage medium
WO2016078415A1 (zh) Terminal sound pickup control method, terminal, and terminal sound pickup control system
JP7400364B2 (ja) Speech recognition system and information processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920558

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920558

Country of ref document: EP

Kind code of ref document: A1