WO2017210991A1 - Method, Device and System for Voice Filtering (一种语音过滤的方法、装置及系统) - Google Patents

Method, Device and System for Voice Filtering (一种语音过滤的方法、装置及系统)

Info

Publication number
WO2017210991A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
intercom
voice stream
voiceprint
voiceprint feature
Prior art date
Application number
PCT/CN2016/093963
Other languages
English (en)
French (fr)
Inventor
李鹏博 (LI Pengbo)
刘苗 (LIU Miao)
王煜辰 (WANG Yuchen)
Original Assignee
ZTE Corporation (中兴通讯股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2017210991A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast, for computer conferences, e.g. chat rooms

Definitions

  • This document relates to, but is not limited to, audio processing technology, and in particular, to a method, device and system for voice filtering.
  • Embodiments of the present invention provide a method, device and system for voice filtering, which can filter out sounds and environmental noise unrelated to the current conference and thereby improve conference quality.
  • a method for voice filtering includes:
  • the extracted voiceprint features are matched with the voiceprint features corresponding to the sending end in the voiceprint library, and the target voice stream whose voiceprint features match successfully is sent.
  • the method further includes:
  • before receiving the original voice stream, the method further includes:
  • the method further includes:
  • the voiceprint feature corresponding to the sending end in the voiceprint library is cleared.
  • a device for voice filtering comprising: a voice stream processing unit, a voice separation unit, a voiceprint feature extraction unit, and a voiceprint feature matching unit;
  • the voice stream processing unit is configured to receive an original voice stream sent by the sending end, and send, to the sending end, a target voice stream with a voiceprint feature matching success;
  • the voice separation unit is configured to perform voice separation on the original voice stream according to an orientation of each sound source in the original voice stream, to obtain a target voice stream corresponding to each orientation;
  • the voiceprint feature extraction unit is configured to extract voiceprint features in all target voice streams obtained after voice separation;
  • the voiceprint feature matching unit is configured to match the extracted voiceprint features with the voiceprint features corresponding to the sending end in the voiceprint library, and to send the target voice stream whose voiceprint features match successfully to the voice stream processing unit.
  • the device further includes: a sound source positioning unit, configured to locate the orientation of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source in the original voice stream reaches the sending end.
  • the voice stream processing unit is further configured to receive a voice stream sample sent by the sender;
  • the voiceprint feature extraction unit is further configured to extract the voiceprint features in the voice stream sample, use them as the voiceprint features corresponding to the sending end, and save them to the voiceprint library.
  • the device further includes:
  • the data clearing unit is configured to: after receiving the exit request sent by the sending end, clear the voiceprint feature corresponding to the sending end in the voiceprint library.
  • a system for voice filtering includes a client, a server, and the above-mentioned voice filtering device.
  • the client is configured to interact with the voice filtering device through the server;
  • the server is configured to establish communication for the interaction between the client and the device;
  • the device is configured to establish a connection with the client through the server.
  • the device is disposed in the server or the client.
  • the client includes one or more intercom terminals.
  • The technical solution provided by the embodiments of the present invention includes: receiving an original voice stream; performing voice separation on the original voice stream according to the orientation of each sound source in the stream, to obtain a target voice stream corresponding to each orientation; extracting the voiceprint features in all target voice streams obtained after voice separation; matching the extracted voiceprint features with the voiceprint features corresponding to the sending end in the voiceprint library; and sending the target voice stream whose voiceprint features match successfully.
  • In this way, unrelated voice streams can be filtered out of the voice transmission and only the correct voice stream is retained, thereby shielding interference and improving both conference quality and meeting efficiency.
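  • As a sketch of this four-step flow (receive, separate by orientation, extract voiceprints, match and send), the following Python outline may help. It is illustrative only: the helper callbacks (`locate`, `separate`, `extract_voiceprint`) and the 0.8 cosine-similarity threshold are assumptions, not part of the patented solution.

```python
import numpy as np

def filter_voice_stream(original_stream, voiceprint_library, sender_id,
                        locate, separate, extract_voiceprint,
                        threshold=0.8):
    """Steps 101-104: locate sound sources, separate by orientation,
    extract voiceprint features, and keep only streams matching the
    sender's enrolled voiceprint. Callbacks and threshold are
    illustrative assumptions."""
    orientations = locate(original_stream)                    # step 102: localization
    target_streams = separate(original_stream, orientations)  # step 102: separation
    enrolled = voiceprint_library[sender_id]                  # enrolled voiceprint
    kept = []
    for stream in target_streams:
        feature = extract_voiceprint(stream)                  # step 103
        sim = float(np.dot(feature, enrolled) /
                    (np.linalg.norm(feature) * np.linalg.norm(enrolled)))
        if sim >= threshold:                                  # step 104: match
            kept.append(stream)                               # send this stream on
    return kept
```

  • Only streams whose voiceprint is close enough to the sender's enrolled feature survive; everything else (other speakers, background noise) is dropped before forwarding.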
  • FIG. 1 is a schematic flowchart of a method for voice filtering according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a device for voice filtering according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of a system for voice filtering according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a method for voice filtering according to an embodiment of the present invention; as shown in FIG. 1, the method includes:
  • Step 101 Receive an original voice stream.
  • the original voice stream may be sent by a client; in a network conference call application, the client may include multiple intercom terminals, and the original voice stream may be sent by one of the intercom terminals in the client;
  • Step 102 Perform voice separation on the original voice stream according to the orientation of each sound source in the original voice stream, and obtain a target voice stream corresponding to each orientation;
  • Using the time difference and/or intensity difference with which each sound source in the original voice stream reaches the sending end, the orientation of each sound source in the original voice stream can be accurately located; voice separation is then performed on the original voice stream according to the orientation of each sound source, and the voice streams belonging to the same orientation are taken as one target voice stream.
  • Acquiring the time difference and/or intensity difference is a conventional technique for those skilled in the art; for example, with sound source localization based on a microphone array, two voice signals from different orientations can be separated.
  • The method of voice separation is likewise a common technique for those skilled in the art; the voice streams separated into the same orientation are determined to be a target voice stream.
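  • As the paragraphs above note, microphone-array localization from arrival-time differences is a standard technique. A minimal sketch of one building block, estimating the time difference of arrival (TDOA) between two microphones by cross-correlation, might look like this; the 8 kHz sample rate and the synthetic 5-sample delay are assumptions for illustration only.

```python
import numpy as np

def estimate_tdoa(mic_a, mic_b, sample_rate):
    """Estimate the time difference of arrival between two microphone
    signals as the lag that maximizes their cross-correlation.
    A positive result means the source reaches mic_a later."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)   # lag in samples
    return lag / sample_rate                   # lag in seconds

# A delayed copy of the same signal should yield the imposed delay.
rng = np.random.default_rng(0)
sig = rng.standard_normal(1000)
delayed = np.concatenate([np.zeros(5), sig])[:1000]   # lags by 5 samples
tdoa = estimate_tdoa(delayed, sig, sample_rate=8000)  # recovers the 5-sample delay
```

  • From such pairwise TDOAs (and/or intensity differences), the orientation of each source relative to the array geometry can be computed.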
  • Step 103 Extract voiceprint features in all target voice streams obtained after voice separation
  • Step 104 Match the extracted voiceprint features with the voiceprint features corresponding to the sending end in the voiceprint library, and send the target voice stream whose voiceprint features match successfully.
  • the target voice stream whose voiceprint features match successfully may be sent to the client.
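  • The patent does not fix a particular voiceprint feature or matching rule. As a loose illustration only, the sketch below uses a toy spectral-envelope feature and a cosine-similarity match; real systems would use MFCC-, i-vector- or embedding-based voiceprints, and the 0.9 threshold here is an arbitrary assumption.

```python
import numpy as np

def toy_voiceprint(signal, n_bands=16):
    """A toy 'voiceprint': average log energy in n_bands frequency
    bands of the signal's power spectrum. A stand-in for real
    voiceprint features, not a production technique."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))

def match(feature, enrolled, threshold=0.9):
    """Cosine-similarity match against an enrolled voiceprint."""
    sim = float(np.dot(feature, enrolled) /
                (np.linalg.norm(feature) * np.linalg.norm(enrolled)))
    return sim >= threshold

t = np.arange(1024) / 1024
enrolled = toy_voiceprint(np.sin(2 * np.pi * 5 * t))   # enrolled sample
other = toy_voiceprint(np.sin(2 * np.pi * 300 * t))    # different source
```

  • A stream whose feature matches the enrolled voiceprint is forwarded; one that fails the match is treated as a filtered (discarded) stream.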
  • the voice filtering method in the embodiment of the present invention can be applied to various scenarios, such as a network conference call.
  • the following is an example based on voice filtering in a web conference call.
  • the client includes a plurality of intercom terminals that communicate with the server and are mainly used to perform voice intercom transactions;
  • through the address-book friend ID or communication group ID in the intercom terminal, a user can initiate a voice intercom to the server, receive intercom invitations from the server, apply to speak, play the voices of other intercom members, display the number and list of intercom members participating in the conference, and withdraw from the conference.
  • any intercom terminal may send an intercom request to the server; optionally, according to the chat object selected on the intercom terminal, the terminal enters a chat room, an intercom ID is created for this intercom, and a conference room is created according to the intercom access number and the intercom ID;
  • after the conference room is successfully created, a voice link is established with the intercom terminal and the intercom message is updated; the intercom message may include: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list.
  • the server queries the intercom terminals that need to participate in the conference and sends an intercom invitation to each of them; after an intercom terminal accepts the invitation, it queries the server for the conference room members, and the server sends the current number of conference room members and the member list to the intercom terminal;
  • after receiving the server's response, an intercom terminal that agrees to participate in the intercom calls the server using the intercom access number and the intercom ID, requesting to join the already created conference room; once the voice link between the server and the intercom terminal is successfully established, the current intercom message is updated;
  • the intercom message includes: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list.
  • an intercom terminal that has joined the conference room sends an identity save request to the server; after the request succeeds, the terminal sends a voice sample to the server, and the server processes the received voice to obtain its voice stream;
  • this voice stream is used as a voice stream sample, the voiceprint features in the sample are extracted and saved to the voiceprint library, and each voiceprint feature corresponds to a member ID in the intercom terminal.
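  • The enrollment step above (extract features from the voice stream sample and save them under the member ID) can be sketched as a small library object; the class and method names are purely illustrative, and the feature extractor is passed in rather than fixed.

```python
class VoiceprintLibrary:
    """Per-member voiceprint store: a feature is saved under each
    intercom member ID on enrollment and cleared when the member
    exits the conference room. Storage layout is an assumption,
    not the patented design."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._features = {}          # member ID -> voiceprint feature

    def enroll(self, member_id, voice_stream_sample):
        self._features[member_id] = self._extractor(voice_stream_sample)

    def lookup(self, member_id):
        return self._features.get(member_id)

    def clear(self, member_id):
        """Called when the member's exit request is received."""
        self._features.pop(member_id, None)
```

  • Keeping the library keyed by member ID is what lets the later matching step check a speaking terminal's streams against exactly one enrolled voiceprint, and lets an exit request remove only that member's data.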
  • an intercom terminal that needs to speak sends an intercom request to the server, and after the server grants the request, the intercom terminal starts to speak; the server receives the original voice stream generated by the speech;
  • from the time difference and/or intensity difference with which each sound source in the original voice stream reaches the intercom terminal, the server locates the orientation of each sound source in the stream; optionally, the server performs voice separation on the original voice stream according to the orientation of each sound source, to obtain a target voice stream corresponding to each orientation;
  • during voice separation, the voice streams belonging to the same sound source are combined and separated from the voice streams that do not belong to it, forming a member voice stream and a filtered voice stream (the same sound source refers to the voice stream whose voiceprint features match the voiceprint features corresponding to the member ID successfully; a voice stream whose voiceprint features fail to match those of the member ID is a filtered voice stream; the member voice stream means the voice stream matching the member ID's voiceprint features);
  • the voiceprint features in the separated target voice streams are extracted and matched with the voiceprint features corresponding to the speaking intercom terminal's member ID in the voiceprint library; the target voice stream that matches successfully is sent to the other intercom terminals in the conference room, so that all members in the conference room hear the speech.
  • the method of the embodiment of the present invention further includes: after receiving the exit request sent by the sending end, clearing the voiceprint feature corresponding to the sending end in the voiceprint library.
  • when an intercom member wants to end speaking, the intercom terminal sends a speech end request to the server; after receiving it, the server ends that intercom member's speech, saves the member's end-of-speech state, and accepts speech requests from new intercom members;
  • when an intercom member wants to exit the conference room, the intercom terminal sends an exit request to the server; after receiving it, the server clears the intercom ID and intercom member ID corresponding to the member, together with the voiceprint features corresponding to that member ID in the voiceprint library, and the intercom member exits successfully.
  • the server may further detect whether an intercom member is still speaking according to a preset time for receiving the voice stream; for example, when the server does not receive the original voice stream sent by the intercom member within a preset time, e.g. 5 minutes, it directly ends the intercom member's speech, saves the member's end-of-speech state, and accepts speech requests from new intercom members. This prevents the situation in which an intercom member is interrupted or drops off the network without the speech being ended in time, leaving other intercom members unable to apply to speak.
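  • The preset-time check described above (end a speaker's turn when no voice stream arrives within, e.g., 5 minutes) could be sketched as a watchdog; the injectable clock and all names below are assumptions for illustration.

```python
import time

SPEECH_TIMEOUT = 300  # preset time: 5 minutes, as in the example above

class SpeakerWatchdog:
    """Ends a member's speech if no voice stream arrives within the
    preset time, so a dropped or interrupted speaker cannot block
    other members from applying to speak. `clock` is injectable
    for testing; this class is an illustrative sketch only."""

    def __init__(self, timeout=SPEECH_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_stream = {}       # member ID -> last stream arrival time

    def on_voice_stream(self, member_id):
        self._last_stream[member_id] = self._clock()

    def expired_speakers(self):
        """Members whose speech should be ended for inactivity."""
        now = self._clock()
        return [m for m, t in self._last_stream.items()
                if now - t > self._timeout]
```

  • The server would poll `expired_speakers()` (or schedule a timer per speaker), end each expired member's speech, save the end-of-speech state, and accept new speech requests.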
  • when the intercom member exits, the server clears the intercom ID and intercom member ID corresponding to the member, together with the voiceprint features corresponding to that member ID in the voiceprint library, and the intercom member exits successfully.
  • an embodiment of the present invention further provides a computer storage medium storing computer-executable instructions for performing the above voice filtering method.
  • as shown in FIG. 2, the voice filtering device 21 includes: a voice stream processing unit 210, a voice separation unit 212, a voiceprint feature extraction unit 213, and a voiceprint feature matching unit 214;
  • the voice stream processing unit 210 is configured to receive the original voice stream sent by the sending end, and send the target voice stream with the voiceprint feature matching successfully to the sending end;
  • the voice separation unit 212 is configured to perform voice separation on the original voice stream according to the orientation of each sound source in the original voice stream, to obtain a target voice stream corresponding to each orientation;
  • the voiceprint feature extraction unit 213 is configured to extract voiceprint features in all the target voice streams obtained after the voice separation;
  • the voiceprint feature matching unit 214 is configured to match the extracted voiceprint features with the voiceprint features corresponding to the sending end in the voiceprint library 215, and to send the target voice stream whose voiceprint features match successfully to the voice stream processing unit 210.
  • the voice filtering device of the embodiment of the present invention can be applied to various scenarios, such as a network conference call.
  • the client includes a plurality of intercom terminals that communicate with the server and are mainly configured to perform voice intercom transactions; the sending end may be one of the intercom terminals in the client;
  • through the address-book friend ID or communication group ID in the intercom terminal, a user can initiate a voice intercom to the server, receive intercom invitations from the server, apply to speak, play the voices of other intercom members, display the number and list of intercom members participating in the conference, and withdraw from the conference;
  • any intercom terminal may be selected to send an intercom request to the server; optionally, according to the chat object selected on the intercom terminal, the terminal enters a chat room, an intercom ID is created for this intercom, and a conference room is then created according to the intercom access number and the intercom ID.
  • the intercom message may include: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list;
  • the server queries the intercom terminals that need to participate in the conference and sends an intercom invitation to each of them; after an intercom terminal accepts the invitation, it queries the server for the conference room members, and the server sends the current number of conference room members and the member list to the intercom terminal;
  • after receiving the server's response, an intercom terminal that agrees to participate in the intercom calls the server using the intercom access number and the intercom ID, requesting to join the already created conference room; once the voice link between the server and the intercom terminal is successfully established, the current intercom message is updated;
  • the intercom message includes: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list.
  • an intercom terminal that has joined the conference room sends an identity save request to the server; after the request succeeds, the terminal sends a voice sample to the server, and the server processes the received voice to obtain its voice stream;
  • this voice stream is sent as a voice stream sample to the voice stream processing unit 210;
  • the voice stream processing unit 210 sends the voice stream sample to the voiceprint feature extraction unit 213, which extracts the voiceprint features in the sample and saves them to the voiceprint library 215;
  • each voiceprint feature corresponds to a member ID in the intercom terminal.
  • the voice filtering device 21 further includes a sound source positioning unit 211, configured to locate the orientation of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source in the original voice stream reaches the sending end.
  • an intercom terminal that needs to speak sends an intercom request to the server, and after the server grants the request, the intercom terminal starts to speak; after receiving the original voice stream generated by the speech, the server sends it to the voice stream processing unit 210, which forwards it to the sound source positioning unit 211;
  • the sound source positioning unit 211 locates the orientation of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the intercom terminal; optionally, the voice separation unit 212 then performs voice separation on the original voice stream according to the orientation of each sound source;
  • the voice separation combines the voice streams belonging to the same sound source and separates them from the voice streams that do not belong to it, forming a member voice stream and a filtered voice stream (the same sound source refers to the voice stream whose voiceprint features match the voiceprint features corresponding to the member ID successfully; a voice stream whose voiceprint features fail to match those of the member ID is a filtered voice stream; the member voice stream means the voice stream matching the member ID's voiceprint features);
  • the voiceprint feature extraction unit 213 then extracts the voiceprint features in the separated target voice streams and sends them to the voiceprint feature matching unit 214; the voiceprint feature matching unit 214 matches these voiceprint features with the voiceprint features corresponding to the speaking intercom terminal's member ID in the voiceprint library 215, and sends the successfully matched target voice stream to the voice stream processing unit 210, which forwards it through the message forwarding unit to the other intercom terminals in the conference room, so that all members in the conference room hear the speech.
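  • The final forwarding step, delivering the successfully matched target voice stream to every other intercom terminal in the conference room, reduces to a simple fan-out; `send` stands in for the message forwarding unit's transport, and all names are illustrative assumptions.

```python
def forward_to_room(matched_stream, speaker_id, room_members, send):
    """After a successful voiceprint match, the voice stream processing
    unit hands the stream to message forwarding, which delivers it to
    every intercom terminal in the conference room except the speaker's
    own. `send(member_id, stream)` abstracts the transport."""
    for member_id in room_members:
        if member_id != speaker_id:
            send(member_id, matched_stream)
```

  • Because only the matched stream reaches this step, other speakers' voices and environmental noise picked up at the speaking terminal never enter the conference room.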
  • the device further includes: a data clearing unit 216, configured to clear, after receiving the exit request sent by the sending end, the voiceprint features corresponding to the sending end in the voiceprint library.
  • when an intercom member wants to end speaking, the intercom terminal sends a speech end request to the server; after receiving it, the server ends the intercom member's speech, saves the member's end-of-speech state, and accepts speech requests from new intercom members.
  • when an intercom member wants to exit the conference room, the intercom terminal sends an exit request to the server; after receiving it, the server forwards the exit request to the data clearing unit 216;
  • the data clearing unit 216 clears the intercom ID and intercom member ID corresponding to the member, together with the voiceprint features corresponding to that member ID in the voiceprint library, and the intercom member exits successfully.
  • the server may further detect whether the intercom member is still speaking according to a preset time for receiving the voice stream; for example, when the server does not receive the original voice stream sent by the intercom member within a preset time, e.g. 5 minutes, it directly ends the intercom member's speech, saves the member's end-of-speech state, and accepts speech requests from new intercom members.
  • when the intercom member then wants to exit the conference room, the intercom terminal sends an exit request to the server; after receiving it, the server forwards the exit request to the data clearing unit 216, which clears the intercom ID and intercom member ID corresponding to the member, together with the voiceprint features corresponding to that member ID in the voiceprint library, and the intercom member exits successfully.
  • FIG. 3 is a schematic structural diagram of a voice filtering system according to an embodiment of the present invention.
  • the system includes a client 31, a server 32, and the above-described voice filtering device 21, and the client 31 interacts with the voice filtering device 21 through the server 32.
  • the voice filtering system of the embodiment of the present invention can be applied to various scenarios, such as a network conference call.
  • the client 31 includes a plurality of intercom terminals 311, and the intercom terminals 311 and the intercom terminal access management unit 321, the call control unit 322, and the message forwarding unit 323 in the server 32 can communicate with each other.
  • through the address-book friend ID or communication group ID in the intercom terminal 311, a user can initiate a voice intercom to the server 32, receive intercom invitations from the server 32, apply to speak, play the voices of other intercom members, display the number and list of intercom members participating in this conference, and withdraw from this conference.
  • the intercom terminal access management unit 321 is connected to the intercom terminals 311 and is mainly configured to implement intercom member management, ensuring that intercom members log in to the server 32 normally; after all intercom terminals 311 have logged in through the intercom terminal access management unit 321 and normal access to the server 32 is confirmed, any intercom terminal 311 may be selected to send an intercom request to the call control unit 322, which calls the conference bridge unit 324;
  • optionally, according to the chat object selected on the intercom terminal 311, the terminal enters a chat room, an intercom ID is created for this intercom, and the conference bridge unit 324 is then called through the call control unit 322 according to the intercom access number and the intercom ID.
  • after receiving the intercom request, the conference bridge unit 324 creates a conference room, establishes a voice link with the intercom terminal 311, and notifies the intercom management unit 325 to update the intercom message; the intercom message here may include: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list.
  • the intercom management unit 325 queries, according to the intercom message, the intercom terminals 311 that need to participate in the conference and sends them an intercom invitation through the message forwarding unit 323; after an intercom terminal 311 accepts the invitation, it queries the intercom management unit 325 for the conference room members;
  • the intercom management unit 325 responds to the intercom terminal 311 with the current number of conference room members and the member list; after receiving the response, an intercom terminal 311 that agrees to participate in the intercom uses the intercom access number and the intercom ID to initiate a call to the conference bridge unit 324 through the call control unit 322, requesting to join the already created conference room; once the voice link between the conference bridge unit 324 and the intercom terminal 311 is successfully established, the intercom management unit 325 is notified to update the current intercom message;
  • the intercom message includes: the intercom ID, intercom member IDs, group ID, total number of conference room members, and/or the member list.
  • an intercom terminal 311 that has joined the conference room sends an identity save request to the intercom management unit 325; after the request succeeds, it starts sending speech to the intercom management unit 325, which processes the received speech;
  • the voice stream of the speech is obtained and sent as a voice stream sample to the voice stream processing unit 210 in the voice filtering device 21; the voice stream processing unit 210 then sends the voice stream sample to the voiceprint feature extraction unit 213;
  • the voiceprint feature extraction unit 213 extracts the voiceprint features in the voice stream sample and saves them to the voiceprint library 215;
  • each voiceprint feature corresponds to a member ID in the intercom terminal.
  • an intercom terminal 311 that needs to speak sends an intercom request to the intercom management unit 325;
  • after the intercom management unit 325 grants the intercom request, the intercom terminal 311 starts to speak, and the intercom management unit 325 receives the original voice stream generated by the speech;
  • the original voice stream is sent to the voice stream processing unit 210 in the voice filtering device 21, which forwards it to the sound source positioning unit 211;
  • the sound source positioning unit 211 locates the orientation of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the intercom terminal; optionally, the voice separation unit 212 performs voice separation on the original voice stream according to the orientation of each sound source; the voice separation combines the voice streams belonging to the same sound source and separates them from the voice streams that do not belong to it, forming a member voice stream and a filtered voice stream (the same sound source refers to the voice stream whose voiceprint features match the voiceprint features corresponding to the member ID successfully; a voice stream whose voiceprint features fail to match those of the member ID is a filtered voice stream; the member voice stream here means the voice stream matching the member ID's voiceprint features);
  • the voiceprint feature extraction unit 213 then extracts the voiceprint features in the separated target voice streams and sends them to the voiceprint feature matching unit 214; the voiceprint feature matching unit 214 matches these voiceprint features with the voiceprint features corresponding to the speaking intercom member's ID in the voiceprint library 215, and sends the successfully matched target voice stream to the voice stream processing unit 210, which forwards it through the message forwarding unit 323 to the other intercom terminals in the conference room, so that all members in the conference room hear the speech.
  • The voice filtering device 21 further includes a data clearing unit 216, configured to clear, after receiving an exit request sent by the intercom terminal 311, the voiceprint features in the voiceprint library that correspond to the member IDs of the intercom terminal 311.
  • When an intercom member at the intercom terminal 311 finishes speaking, the terminal sends a speech end request to the intercom management unit 325, which ends that member's turn, records the end-of-speech state, and accepts speech requests from new members.
  • When an intercom member wants to exit the conference room, the terminal sends an exit request to the intercom management unit 325, which forwards it to the data clearing unit 216. On receiving the exit request, the data clearing unit 216 clears the member's intercom ID, member ID, and the voiceprint features corresponding to that member ID in the voiceprint library, and the member exits successfully.
  • The intercom management unit 325 may also detect whether a member is still speaking according to a preset voice stream reception time; for example, if it receives no original voice stream from the member within a preset time such as 5 minutes, it ends the member's turn directly and accepts speech requests from new members.
  • The voice filtering device 21 may be deployed in the server 32. In that case, voice separation, voiceprint feature extraction, and voiceprint feature matching are performed remotely for the intercom terminals 311 in the client 31, and successfully matched target voice streams are sent to the other intercom terminals 311 in the client 31 through the server 32.
  • The voice filtering device 21 may also be deployed in the client 31. In that case, the intercom terminal 311 performs voice separation, voiceprint feature extraction, and voiceprint feature matching locally, and sends successfully matched target voice streams directly to the other intercom terminals 311 in the client 31 over the voice link.
  • The intercom terminal access management unit 321, call control unit 322, message forwarding unit 323, call conference bridge unit 324, and intercom management unit 325 may each be implemented by a central processing unit (CPU), microprocessor (MPU), digital signal processor (DSP), or field programmable gate array (FPGA) located in the server.
  • The voice stream processing unit 210, sound source locating unit 211, voice separation unit 212, voiceprint feature extraction unit 213, voiceprint feature matching unit 214, voiceprint library 215, and data clearing unit 216 may each be implemented by a CPU, MPU, DSP, or FPGA in the voice filtering device 21.
  • When the voice filtering device 21 is deployed in the client 31, it may be implemented by a CPU, MPU, DSP, or FPGA in the client 31.
  • When the voice filtering device 21 is deployed in the server, it may be implemented by a CPU, MPU, DSP, or FPGA in the server.
  • Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
  • The computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • Each module/unit in the foregoing embodiments may be implemented in hardware, for example by an integrated circuit realizing its corresponding function, or as a software function module, for example as programs/instructions stored in a memory and executed by a processor to realize its corresponding function. The invention is not limited to any specific combination of hardware and software.
  • The above technical solution filters out the voices of unrelated persons and environmental noise from voice transmission, improving voice quality and the conference quality of conference calls.


Abstract

A voice filtering method, device, and system, including: receiving an original voice stream; separating the original voice stream according to the direction of each sound source in it, to obtain a target voice stream corresponding to each direction; extracting the voiceprint features of all target voice streams obtained by the separation; matching the extracted voiceprint features against the voiceprint features in a voiceprint library that correspond to the sending end; and sending the target voice streams whose voiceprint features match successfully. Because only voiceprint-matched target voice streams are sent after the separation, embodiments of the present invention filter out the voices of unrelated persons and environmental noise from voice transmission, improving voice quality and the conference quality of conference calls.

Description

Voice filtering method, device and system

Technical Field

This document relates to, but is not limited to, audio processing technology, and in particular to a voice filtering method, device, and system.

Background

In current network-based conference calls, because participants are in different environments, a conference is often disturbed by the voices of unrelated bystanders and by environmental noise, reducing conference efficiency. The related art provides no way to filter the voices of unrelated persons and environmental noise out of voice transmission, which degrades conference quality and efficiency.
Summary of the Invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.

Embodiments of the present invention provide a voice filtering method, device, and system that can filter out sounds unrelated to the current conference and environmental noise, improving conference quality.

The technical solution of the embodiments of the present invention is implemented as follows:
According to one aspect of the embodiments of the present invention, a voice filtering method is provided, the method including:

receiving an original voice stream;

separating the original voice stream according to the direction of each sound source in the original voice stream, to obtain a target voice stream corresponding to each direction;

extracting voiceprint features from all target voice streams obtained by the voice separation;

matching the extracted voiceprint features against the voiceprint features in a voiceprint library that correspond to the sending end, and sending the target voice streams whose voiceprint features match successfully.
Optionally, after the receiving of the original voice stream, the method further includes:

locating the direction of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the sending end.

Optionally, before the receiving of the original voice stream, the method further includes:

receiving a voice stream sample;

extracting the voiceprint features from the voice stream sample, taking them as the voiceprint features corresponding to the sending end, and saving them to the voiceprint library.

Optionally, the method further includes:

after receiving an exit request sent by the sending end, clearing the voiceprint features corresponding to the sending end from the voiceprint library.
According to another aspect of the embodiments of the present invention, a voice filtering device is provided, the device including: a voice stream processing unit, a voice separation unit, a voiceprint feature extraction unit, and a voiceprint feature matching unit, where:

the voice stream processing unit is configured to receive the original voice stream sent by the sending end, and to send to the sending end the target voice streams whose voiceprint features match successfully;

the voice separation unit is configured to separate the original voice stream according to the direction of each sound source in it, obtaining a target voice stream corresponding to each direction;

the voiceprint feature extraction unit is configured to extract the voiceprint features from all target voice streams obtained by the voice separation;

the voiceprint feature matching unit is configured to match the extracted voiceprint features against the voiceprint features in the voiceprint library that correspond to the sending end, and to send the successfully matched target voice streams to the voice stream processing unit.

Optionally, the device further includes: a sound source locating unit, configured to locate the direction of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the sending end.

Optionally, the voice stream processing unit is further configured to receive a voice stream sample sent by the sending end;

the voiceprint feature extraction unit is further configured to extract the voiceprint features from the voice stream sample, take them as the voiceprint features corresponding to the sending end, and save them to the voiceprint library.

Optionally, the device further includes:

a data clearing unit, configured to clear the voiceprint features corresponding to the sending end from the voiceprint library after receiving an exit request sent by the sending end.
According to yet another aspect of the embodiments of the present invention, a voice filtering system is provided, the system including a client, a server, and the voice filtering device described above, where:

the client is configured to interact with the voice filtering device through the server;

the server is configured to establish communication for the interaction between the client and the device;

the device is configured to establish a connection with the client through the server.

Optionally, the device is deployed in the server or in the client.

Optionally, the client includes one or more intercom terminals.
Compared with the related art, the technical solution provided by the embodiments of the present invention includes: receiving an original voice stream; separating it according to the direction of each sound source in it, to obtain a target voice stream for each direction; extracting the voiceprint features of all target voice streams obtained by the separation; matching the extracted voiceprint features against the voiceprint features in the voiceprint library that correspond to the sending end; and sending the target voice streams whose voiceprint features match successfully. In this way, unrelated voice streams can be filtered out of the transmission and only the correct voice streams kept, shielding interference and improving conference quality and efficiency.

Other aspects will become apparent upon reading and understanding the drawings and detailed description.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a voice filtering method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a voice filtering device according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a voice filtering system according to an embodiment of the present invention.
Detailed Description of the Embodiments

Embodiments of the present application are described in detail below with reference to the accompanying drawings. Note that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another arbitrarily.
FIG. 1 is a schematic flowchart of a voice filtering method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

Step 101: receive an original voice stream.

Here, the original voice stream may be sent by a client. In a network conference call application, the client may include multiple intercom terminals, and the original voice stream may be sent by one of them.

Step 102: separate the original voice stream according to the direction of each sound source in it, obtaining a target voice stream corresponding to each direction.

Here, differences in environment and in obstacles easily cause each sound source in the original voice stream to reach the stream's sending end with a different time difference and/or intensity difference. Using the time difference and/or intensity difference with which each sound source reaches the sending end as the basis for sound source localization, the direction of each sound source in the original voice stream can be located accurately. The original voice stream is then separated according to these directions, and the voice streams belonging to the same direction are taken as a target voice stream.

Note that obtaining the time difference and/or intensity difference is a customary technique for those skilled in the art; for example, microphone-array-based sound source localization can separate two voice signals arriving from different directions. Voice separation is likewise a customary technique; after separation, streams with identical voiceprint features allow the voice stream of one direction to be confirmed as a target voice stream.
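The arrival-time-difference localization described above can be illustrated with a toy two-microphone sketch in Python. It is a minimal illustration, not the patent's implementation: `estimate_delay` brute-forces the cross-correlation lag between two channels, and `bearing_from_delay` converts that lag to a bearing under a far-field assumption; the sample rate, microphone spacing, and speed of sound are assumed values.

```python
import math

def estimate_delay(sig_a, sig_b, max_lag):
    """Find the lag (in samples) at which sig_b best aligns with sig_a,
    by brute-force cross-correlation over [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def bearing_from_delay(lag, fs, mic_distance, speed_of_sound=343.0):
    """Convert an inter-microphone delay to a source bearing in degrees,
    assuming a far-field source (plane-wave arrival)."""
    x = (lag / fs) * speed_of_sound / mic_distance
    x = max(-1.0, min(1.0, x))  # clamp to the physically valid range
    return math.degrees(math.asin(x))

# Toy case: the same pulse reaches microphone B 3 samples after microphone A.
mic_a = [0.0] * 20
mic_a[5] = 1.0
mic_b = [0.0] * 20
mic_b[8] = 1.0
lag = estimate_delay(mic_a, mic_b, max_lag=10)               # -> 3
angle = bearing_from_delay(lag, fs=16000, mic_distance=0.1)  # roughly 40 degrees
```

A real system would use more robust estimators (such as GCC-PHAT) and more than two microphones, but the geometry is the same: the measured delay pins each source to a bearing.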
Step 103: extract the voiceprint features from all target voice streams obtained by the voice separation.

Step 104: match the extracted voiceprint features against the voiceprint features in the voiceprint library that correspond to the sending end, and send the target voice streams whose voiceprint features match successfully.

Optionally, the successfully matched target voice streams may be sent to the client.
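Step 104 can be sketched as a similarity test against the voiceprint library entries registered for the sending end. This is an illustrative sketch only: the cosine-similarity measure, the 0.8 threshold, and the function names are assumptions, not the matcher specified by the patent.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filter_streams(separated, voiceprint_db, sender_member_ids, threshold=0.8):
    """Keep only the target streams whose extracted feature matches a
    voiceprint registered for one of the sending end's member IDs."""
    kept = []
    for stream_id, feature in separated:
        if any(cosine_similarity(feature, voiceprint_db[mid]) >= threshold
               for mid in sender_member_ids):
            kept.append(stream_id)
    return kept

voiceprint_db = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
separated = [("stream-1", [0.9, 0.1, 0.0]),   # close to alice's voiceprint
             ("stream-2", [0.0, 0.0, 1.0])]   # matches nobody: filtered out
print(filter_streams(separated, voiceprint_db, ["alice", "bob"]))  # ['stream-1']
```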
In practice, the voice filtering method of this embodiment of the present invention can be applied in many scenarios, such as network conference calls. The following description takes voice filtering in a network conference call as an example.

In a network conference call, the client includes multiple intercom terminals that communicate with a server and are mainly used to execute voice intercom transactions. Optionally, through a contact friend identity code (ID) or communication group ID in the intercom terminal, a terminal can initiate a voice intercom with the server, receive intercom invitations from the server, request the floor, play the speech of other intercom members, display the number and list of intercom members participating in the current conference, and exit the current conference intercom.
After multiple intercom terminals have all logged in to the server normally, any one of them may send an intercom request to the server. Optionally, a chat room is entered according to the chat partner selected by that terminal, an intercom ID is created for this intercom, and a conference room is then created from the intercom access number and the intercom ID. Once the conference room is established, a voice link is set up with that terminal and the intercom message is updated; the intercom message may include: the intercom ID, intercom member IDs, group ID, total number of people in the conference room, and/or member list. The server then queries the intercom terminals that should attend the conference and sends them intercom invitations. After accepting an invitation, a terminal queries the server for the conference room members, and the server responds with the current head count and member list. On receiving the response, a terminal that agrees to join uses the intercom access number and intercom ID to call the server and request admission to the conference room that has been created; the voice link between the server and the terminal is established, and the current intercom message (intercom ID, intercom member IDs, group ID, total head count, and/or member list) is updated. A terminal that has joined the conference room sends an identity save request to the server and, after the request succeeds, sends a segment of speech to the server. The server processes the received speech to obtain its voice stream, uses that stream as a voice stream sample, extracts the voiceprint features from the sample, and saves them to the voiceprint library; the voiceprint features correspond to each member ID of the intercom terminal.
In the conference room, a terminal that needs to speak sends an intercom request to the server; once the server grants it, the terminal starts speaking. The server receives the original voice stream produced by the speech and, from the time difference and/or intensity difference with which each sound source in it reaches the intercom terminal, can locate the direction of each sound source. Optionally, the original voice stream is separated according to those directions, obtaining a target voice stream for each direction. In other words, streams belonging to the same sound source and streams not belonging to the same source can be combined separately into a member voice stream and a filtered voice stream (a member voice stream is a same-source stream whose voiceprint features successfully match the member ID; a stream whose voiceprint features fail to match the member ID is a filtered voice stream). The voiceprint features of the separated target voice streams are then extracted and matched against the voiceprint features in the voiceprint library that correspond to the terminal's intercom ID; the successfully matched target voice streams are sent to the other intercom terminals in the conference room, and all members hear the speech.
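The text leaves the voiceprint features themselves abstract; in practice they are often spectral statistics such as MFCCs. As a self-contained stand-in, the sketch below computes normalized band energies with a naive DFT; the band edges and the choice of feature are assumptions for illustration only.

```python
import cmath
import math

def band_energy_features(samples, fs, bands):
    """Toy 'voiceprint': normalized magnitude summed over frequency bands,
    computed with a naive DFT (a stand-in for real features such as MFCCs)."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(samples))
        mags.append(abs(acc))
    feats = []
    for lo, hi in bands:
        k_lo, k_hi = int(lo * n / fs), int(hi * n / fs)
        feats.append(sum(mags[k_lo:k_hi]))
    total = sum(feats) or 1.0
    return [f / total for f in feats]

# A 1 kHz tone sampled at 8 kHz should put its energy in the middle band.
fs, n = 8000, 64
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n)]
feats = band_energy_features(tone, fs, [(0, 500), (500, 1500), (1500, 4000)])
```

Real speaker-verification features would be computed per frame, with windowing and a mel filter bank; this only shows the shape of the extraction step (audio in, fixed-length feature vector out).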
In an embodiment of the present invention, the method further includes: after receiving an exit request sent by the sending end, clearing the voiceprint features corresponding to the sending end from the voiceprint library.

Here, when the intercom member at a terminal finishes speaking, the terminal sends a speech end request to the server; on receiving it, the server ends that member's turn, saves the member's end-of-speech state, and accepts speech requests from new members. When the member wants to leave the conference room, the terminal sends an exit request to the server; on receiving it, the server clears the member's intercom ID, member ID, and the voiceprint features corresponding to that member ID in the voiceprint library, and the member exits successfully.
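The registration and exit bookkeeping described above can be sketched as a small in-memory library keyed by member ID. The class and method names here are hypothetical, chosen only to mirror the flow (an identity save request leads to `register`, an exit request leads to `clear`):

```python
class VoiceprintLibrary:
    """Toy voiceprint library: member ID -> intercom ID and voiceprint feature."""

    def __init__(self):
        self._features = {}      # member ID -> voiceprint feature
        self._intercom_ids = {}  # member ID -> intercom ID

    def register(self, member_id, intercom_id, feature):
        """Store a member's voiceprint once the identity save request succeeds."""
        self._features[member_id] = feature
        self._intercom_ids[member_id] = intercom_id

    def lookup(self, member_id):
        return self._features.get(member_id)

    def clear(self, member_id):
        """Handle an exit request: drop the member's intercom ID, member ID,
        and the voiceprint feature stored for that member ID."""
        self._features.pop(member_id, None)
        self._intercom_ids.pop(member_id, None)

lib = VoiceprintLibrary()
lib.register("member-1", "intercom-42", [0.1, 0.9])
```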
In an embodiment of the present invention, the server may also detect, according to a preset voice stream reception time, whether the member is still speaking. For example, if the server receives no original voice stream from the member within a preset time such as 5 minutes, it ends the member's turn directly, saves the end-of-speech state, and accepts speech requests from new members. This prevents the situation where a member drops off the network or leaves the seat without ending the turn in time, blocking other members from requesting the floor. When the member exits the conference room, the server clears the member's intercom ID, member ID, and the corresponding voiceprint features in the voiceprint library, and the member exits successfully.
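The inactivity rule (ending a member's turn when no audio arrives within a preset time such as 5 minutes) can be sketched as a watchdog driven by a monotonic clock. Injecting the clock keeps the timeout testable; the class is illustrative, not part of the patent.

```python
import time

class SpeakerWatchdog:
    """End a member's floor automatically if no audio arrives within
    `timeout` seconds (the text suggests e.g. 5 minutes = 300 s)."""

    def __init__(self, timeout=300.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_audio = None
        self.speaker = None

    def grant_floor(self, member_id):
        self.speaker = member_id
        self.last_audio = self.clock()

    def on_audio(self):
        """Call whenever an original voice stream arrives from the speaker."""
        self.last_audio = self.clock()

    def poll(self):
        """Return the member whose floor was just revoked, if any."""
        if self.speaker and self.clock() - self.last_audio > self.timeout:
            ended, self.speaker = self.speaker, None
            return ended
        return None
```

With the floor freed by `poll`, the server can immediately accept a new member's speech request, which is exactly the recovery behavior the paragraph above describes.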
An embodiment of the present invention further provides a computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the voice filtering method described above.
FIG. 2 is a schematic structural diagram of a voice filtering device 21 according to an embodiment of the present invention. As shown in FIG. 2, the voice filtering device 21 includes: a voice stream processing unit 210, a voice separation unit 212, a voiceprint feature extraction unit 213, and a voiceprint feature matching unit 214.

The voice stream processing unit 210 is configured to receive the original voice stream sent by the sending end, and to send to the sending end the target voice streams whose voiceprint features match successfully.

The voice separation unit 212 is configured to separate the original voice stream according to the direction of each sound source in it, obtaining a target voice stream corresponding to each direction.

The voiceprint feature extraction unit 213 is configured to extract the voiceprint features from all target voice streams obtained by the voice separation.

The voiceprint feature matching unit 214 is configured to match the extracted voiceprint features against the voiceprint features in the voiceprint library 215 that correspond to the sending end, and to send the successfully matched target voice streams to the voice stream processing unit 210.
In practice, the voice filtering device of this embodiment of the present invention can be applied in many scenarios, such as network conference calls.

The following description takes voice filtering in a network conference call as an example. In a network conference call, the client includes multiple intercom terminals that communicate with a server and are mainly configured to execute voice intercom transactions; one of these intercom terminals can include the sending end. Optionally, through a contact friend ID or communication group ID in the intercom terminal, a terminal can initiate a voice intercom with the server, receive intercom invitations from the server, request the floor, play the speech of other intercom members, display the number and list of intercom members participating in the current conference, and exit the current conference intercom.

After multiple intercom terminals have all logged in to the server normally, any one of them may send an intercom request to the server. Optionally, a chat room is entered according to the chat partner selected by that terminal, an intercom ID is created for this intercom, and a conference room is then created from the intercom access number and the intercom ID. Once the conference room is established, a voice link is set up with that terminal and the intercom message is updated; the intercom message may include: the intercom ID, intercom member IDs, group ID, total number of people in the conference room, and/or member list. The server then queries the intercom terminals that should attend the conference and sends them intercom invitations. After accepting an invitation, a terminal queries the server for the conference room members, and the server responds with the current head count and member list. On receiving the response, a terminal that agrees to join uses the intercom access number and intercom ID to call the server and request admission to the conference room that has been created; the voice link between the server and the terminal is established, and the current intercom message (intercom ID, intercom member IDs, group ID, total head count, and/or member list) is updated. A terminal that has joined the conference room sends an identity save request to the server and, after the request succeeds, sends a segment of speech to the server. The server processes the received speech to obtain its voice stream and sends it as a voice stream sample to the voice stream processing unit 210; on receiving the sample, the voice stream processing unit 210 forwards it to the voiceprint feature extraction unit 213, which extracts the voiceprint features from the sample and saves them in the voiceprint library 215. Each voiceprint feature corresponds to a member ID of the intercom terminal.
In an embodiment of the present invention, the voice filtering device 21 further includes a sound source locating unit 211, configured to locate the direction of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the sending end.

Here, in the conference room, an intercom terminal that needs to speak sends an intercom request to the server; once the server grants the request, the terminal starts speaking. After receiving the original voice stream produced by the speech, the server sends it to the voice stream processing unit 210, which forwards it to the sound source locating unit 211. The sound source locating unit 211 can locate the direction of each sound source in the original voice stream from the time difference and/or intensity difference with which it reaches the intercom terminal. Optionally, the voice separation unit 212 separates the original voice stream according to these directions; the separation combines streams belonging to the same sound source and streams not belonging to the same source separately, forming a member voice stream and filtered voice streams (a member voice stream is a same-source stream whose voiceprint features successfully match a member ID; a stream whose voiceprint features fail to match the member ID is a filtered voice stream). The voiceprint feature extraction unit 213 then extracts the voiceprint features of the separated target voice streams and sends them to the voiceprint feature matching unit 214, which matches them against the voiceprint features in the voiceprint library 215 that correspond to the terminal's intercom ID. Successfully matched target voice streams are sent to the voice stream processing unit 210 and then forwarded, through the message forwarding unit, to the other intercom terminals in the conference room, where all members hear the speech.
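Once each frame's source direction is known, the separation step amounts to grouping frames whose bearings agree. Below is a minimal sketch under assumed parameters (a 10 degree tolerance and running-mean clustering); production systems would instead use beamforming or mask-based separation:

```python
def separate_by_bearing(frames, tol=10.0):
    """Group (bearing_deg, frame) pairs into per-source streams: a frame joins
    the first stream whose running-mean bearing is within `tol` degrees."""
    streams = []  # each stream: {"bearing": running mean, "frames": [...]}
    for bearing, frame in frames:
        for s in streams:
            if abs(bearing - s["bearing"]) <= tol:
                s["frames"].append(frame)
                s["bearing"] += (bearing - s["bearing"]) / len(s["frames"])
                break
        else:
            streams.append({"bearing": bearing, "frames": [frame]})
    return streams

# Frames arriving from roughly 0 degrees and roughly 45 degrees form two streams.
frames = [(0.0, "f1"), (2.0, "f2"), (45.0, "f3"), (44.0, "f4"), (1.0, "f5")]
streams = separate_by_bearing(frames)
```

Each resulting group is one candidate target voice stream; the voiceprint matching step then decides which group is the member voice stream and which are filtered out.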
Optionally, in an embodiment of the present invention, the device further includes a data clearing unit 216, configured to clear the voiceprint features corresponding to the sending end from the voiceprint library after receiving an exit request sent by the sending end.

Here, when the intercom member at the terminal finishes speaking, the terminal sends a speech end request to the server, which ends the member's turn, saves the member's end-of-speech state, and accepts speech requests from new members. When the member wants to leave the conference room, the terminal sends an exit request to the server, which forwards it to the data clearing unit 216; on receiving the exit request, the data clearing unit 216 clears the member's intercom ID, member ID, and the voiceprint features corresponding to that member ID in the voiceprint library, and the member exits successfully.

In an embodiment of the present invention, the server may also detect, according to a preset voice stream reception time, whether the member is still speaking; for example, if the server receives no original voice stream from the member within a preset time such as 5 minutes, it ends the member's turn directly, saves the end-of-speech state, and accepts speech requests from new members. When the member wants to exit the conference room, the terminal sends an exit request to the server, which forwards it to the data clearing unit 216; the data clearing unit 216 then clears the member's intercom ID, member ID, and the corresponding voiceprint features in the voiceprint library, and the member exits successfully.
FIG. 3 is a schematic structural diagram of a voice filtering system according to an embodiment of the present invention. As shown in FIG. 3, the system includes: a client 31, a server 32, and the voice filtering device 21 described above; the client 31 interacts with the voice filtering device 21 through the server 32.

In practice, the voice filtering system of this embodiment of the present invention can be applied in many scenarios, such as network conference calls.

The following description takes voice filtering in a network conference call as an example. In a network conference call, the client 31 includes multiple intercom terminals 311, all of which can communicate with the intercom terminal access management unit 321, the call control unit 322, and the message forwarding unit 323 in the server 32, and which are mainly configured to execute voice intercom transactions. Optionally, through a contact friend ID or communication group ID in the intercom terminal 311, a terminal can initiate a voice intercom with the server 32, receive intercom invitations from the server 32, request the floor, play the speech of other intercom members, display the number and list of intercom members participating in the current conference, and exit the current conference intercom.
The intercom terminal access management unit 321 is connected to the intercom terminals 311 and is mainly configured to manage intercom members and ensure that they log in to the server 32 normally. After multiple intercom terminals 311 have logged in to the intercom terminal access management unit 321 and the unit has confirmed that they are normally connected to the server 32, any one intercom terminal 311 may send an intercom request to the call control unit 322, through which the conference bridge unit 324 is called. Optionally, a chat room is entered according to the chat partner selected by that terminal 311, an intercom ID is created for this intercom, and the conference bridge unit 324 is then called through the call control unit 322 using the intercom access number and the intercom ID.

On receiving the intercom request, the conference bridge unit 324 creates a conference room, establishes a voice link with that intercom terminal 311, and notifies the intercom management unit 325 to update the intercom message; the intercom message may include: the intercom ID, intercom member IDs, group ID, total number of people in the conference room, and/or member list. Based on the intercom message, the intercom management unit 325 queries the intercom terminals 311 that should attend the conference and sends them intercom invitations through the message forwarding unit 323. After accepting an invitation, a terminal 311 queries the intercom management unit 325 for the conference room members, and the unit responds with the current head count and member list. On receiving the response, a terminal 311 that agrees to join uses the intercom access number and intercom ID to place a call to the conference bridge unit 324 through the call control unit 322, requesting admission to the conference room that has been created; the voice link between the conference bridge unit 324 and the terminal 311 is established, and the intercom management unit 325 is notified to update the current intercom message (intercom ID, intercom member IDs, group ID, total head count, and/or member list). A terminal 311 that has joined the conference room sends an identity save request to the intercom management unit 325 and, after the request succeeds, sends a segment of speech to it. The intercom management unit 325 processes the received speech to obtain its voice stream and sends it as a voice stream sample to the voice stream processing unit 210 in the voice filtering device 21, which forwards the sample to the voiceprint feature extraction unit 213; on receiving it, the voiceprint feature extraction unit 213 extracts the voiceprint features from the sample and saves them in the voiceprint library 215. Each voiceprint feature corresponds to a member ID of the intercom terminal.

In the conference room, an intercom terminal 311 that needs to speak sends an intercom request to the intercom management unit 325; once the unit grants the request, the terminal 311 starts speaking. On receiving the original voice stream produced by the speech, the intercom management unit 325 sends it to the voice stream processing unit 210 in the voice filtering device 21, which forwards it to the sound source locating unit 211. The sound source locating unit 211 can locate the direction of each sound source in the original voice stream from the time difference and/or intensity difference with which it reaches the intercom terminal. Optionally, the voice separation unit 212 separates the original voice stream according to these directions, combining streams belonging to the same sound source and streams not belonging to the same source separately into a member voice stream and filtered voice streams (a member voice stream is a same-source stream whose voiceprint features successfully match a member ID; a stream whose voiceprint features fail to match the member ID is a filtered voice stream). The voiceprint feature extraction unit 213 then extracts the voiceprint features of the separated target voice streams and sends them to the voiceprint feature matching unit 214, which matches them against the voiceprint features in the voiceprint library 215 that correspond to the terminal's intercom ID. Successfully matched target voice streams are sent to the voice stream processing unit 210 and then forwarded, through the message forwarding unit 323, to the other intercom terminals 311 in the conference room, where all members hear the speech.
In an embodiment of the present invention, the voice filtering device 21 further includes a data clearing unit 216, configured to clear, after receiving an exit request sent by the intercom terminal 311, the voiceprint features in the voiceprint library that correspond to the member IDs of that terminal 311.

Here, when the intercom member at a terminal 311 finishes speaking, the terminal 311 sends a speech end request to the intercom management unit 325, which ends the member's turn, saves the member's end-of-speech state, and accepts speech requests from new members. When the member wants to leave the conference room, the terminal sends an exit request to the intercom management unit 325, which forwards it to the data clearing unit 216. On receiving the exit request, the data clearing unit 216 clears the member's intercom ID, member ID, and the voiceprint features corresponding to that member ID in the voiceprint library, and the member exits successfully.

In an embodiment of the present invention, the intercom management unit 325 may also detect, according to a preset voice stream reception time, whether the member is still speaking; for example, if it receives no original voice stream from the member within a preset time such as 5 minutes, it ends the member's turn directly, saves the end-of-speech state, and accepts speech requests from new members. When the member wants to exit the conference room, the exit request is sent to the intercom management unit 325, which forwards it to the data clearing unit 216; the data clearing unit 216 then clears the member's intercom ID, member ID, and the corresponding voiceprint features in the voiceprint library, and the member exits successfully.
In an embodiment of the present invention, the voice filtering device 21 may be deployed in the server 32. When it is, voice separation, voiceprint feature extraction, and voiceprint feature matching are performed remotely for the intercom terminals 311 in the client 31, and the successfully matched target voice streams are sent through the server 32 to the other intercom terminals 311 in the client 31.

In an embodiment of the present invention, the voice filtering device 21 may also be deployed in the client 31. When it is, the intercom terminals 311 in the client 31 perform voice separation, voiceprint feature extraction, and voiceprint feature matching locally, and send the successfully matched target voice streams directly to the other intercom terminals 311 in the client 31 over the voice link.
In practice, the intercom terminal access management unit 321, the call control unit 322, the message forwarding unit 323, the conference bridge unit 324, and the intercom management unit 325 may each be implemented by a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), or a field programmable gate array (FPGA) located in the server.

The voice stream processing unit 210, the sound source locating unit 211, the voice separation unit 212, the voiceprint feature extraction unit 213, the voiceprint feature matching unit 214, the voiceprint library 215, and the data clearing unit 216 may each be implemented by a CPU, an MPU, a DSP, or an FPGA in the voice filtering device 21.

When the voice filtering device 21 is deployed in the client 31, it may be implemented by a CPU, an MPU, a DSP, or an FPGA in the client 31.

When the voice filtering device 21 is deployed in the server, it may be implemented by a CPU, an MPU, a DSP, or an FPGA in the server.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely optional embodiments of the present invention and are not intended to limit its scope of protection.

Those of ordinary skill in the art will understand that all or some of the steps of the above methods may be completed by a program instructing the relevant hardware (such as a processor), the program being stored in a computer-readable storage medium such as a read-only memory, magnetic disk, or optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit realizing its corresponding function, or as a software function module, for example as programs/instructions stored in a memory and executed by a processor. The present invention is not limited to any specific combination of hardware and software.

Although the embodiments disclosed in this application are as described above, they are presented only to aid understanding of the application and are not intended to limit it, including the specific implementation methods in the embodiments. Any person skilled in the art to which this application belongs may modify or vary the form and details of implementation without departing from the spirit and scope disclosed herein, but the scope of patent protection of this application remains as defined by the appended claims.
Industrial Applicability

The above technical solution filters out the voices of unrelated persons and environmental noise from voice transmission, improving voice quality and the conference quality of conference calls.

Claims (11)

  1. A voice filtering method, the method comprising:
    receiving an original voice stream;
    separating the original voice stream according to the direction of each sound source in the original voice stream, to obtain a target voice stream corresponding to each direction;
    extracting voiceprint features from all target voice streams obtained by the voice separation;
    matching the extracted voiceprint features against the voiceprint features in a voiceprint library that correspond to a sending end, and sending the target voice streams whose voiceprint features match successfully.
  2. The method according to claim 1, wherein after receiving the original voice stream the method further comprises:
    locating the direction of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the sending end.
  3. The method according to claim 1, wherein before receiving the original voice stream the method further comprises:
    receiving a voice stream sample;
    extracting the voiceprint features from the voice stream sample, taking them as the voiceprint features corresponding to the sending end, and saving them to the voiceprint library.
  4. The method according to any one of claims 1 to 3, further comprising:
    after receiving an exit request sent by the sending end, clearing the voiceprint features corresponding to the sending end from the voiceprint library.
  5. A voice filtering device, the device comprising: a voice stream processing unit, a voice separation unit, a voiceprint feature extraction unit, and a voiceprint feature matching unit, wherein:
    the voice stream processing unit is configured to receive an original voice stream sent by a sending end, and to send to the sending end the target voice streams whose voiceprint features match successfully;
    the voice separation unit is configured to separate the original voice stream according to the direction of each sound source in the original voice stream, to obtain a target voice stream corresponding to each direction;
    the voiceprint feature extraction unit is configured to extract voiceprint features from all target voice streams obtained by the voice separation;
    the voiceprint feature matching unit is configured to match the extracted voiceprint features against the voiceprint features in a voiceprint library that correspond to the sending end, and to send the successfully matched target voice streams to the voice stream processing unit.
  6. The device according to claim 5, further comprising: a sound source locating unit, configured to locate the direction of each sound source in the original voice stream according to the time difference and/or intensity difference with which each sound source reaches the sending end.
  7. The device according to claim 5, wherein:
    the voice stream processing unit is further configured to receive a voice stream sample sent by the sending end; and
    the voiceprint feature extraction unit is further configured to extract the voiceprint features from the voice stream sample, take them as the voiceprint features corresponding to the sending end, and save them to the voiceprint library.
  8. The device according to any one of claims 5 to 7, further comprising:
    a data clearing unit, configured to clear the voiceprint features corresponding to the sending end from the voiceprint library after receiving an exit request sent by the sending end.
  9. A voice filtering system, the system comprising a client, a server, and the voice filtering device according to any one of claims 5 to 8, wherein:
    the client is configured to interact with the voice filtering device through the server;
    the server is configured to establish communication for the interaction between the client and the device;
    the device is configured to establish a connection with the client through the server.
  10. The system according to claim 9, wherein the device is deployed in the server or in the client.
  11. The system according to claim 9 or 10, wherein the client comprises one or more intercom terminals.
PCT/CN2016/093963 2016-06-06 2016-08-08 一种语音过滤的方法、装置及系统 WO2017210991A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610395256.6 2016-06-06
CN201610395256.6A CN107464570A (zh) 2016-06-06 2016-06-06 一种语音过滤方法、装置及系统

Publications (1)

Publication Number Publication Date
WO2017210991A1 true WO2017210991A1 (zh) 2017-12-14

Family

ID=60545729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/093963 WO2017210991A1 (zh) 2016-06-06 2016-08-08 一种语音过滤的方法、装置及系统

Country Status (2)

Country Link
CN (1) CN107464570A (zh)
WO (1) WO2017210991A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10362394B2 (en) 2015-06-30 2019-07-23 Arthur Woodrow Personalized audio experience management and architecture for use in group audio communication
CN112929501A (zh) * 2021-01-25 2021-06-08 深圳前海微众银行股份有限公司 语音通话服务方法、装置、设备、介质及计算机程序产品
CN113022153A (zh) * 2021-01-25 2021-06-25 广州微体科技有限公司 一种智能便签打印机及其打印方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107843871B (zh) * 2017-11-06 2020-07-24 南京地平线机器人技术有限公司 声源定向方法、装置和电子设备
CN108495269B (zh) * 2018-03-22 2021-06-08 中国能源建设集团广东省电力设计研究院有限公司 海上风电场通信系统
CN109448460A (zh) * 2018-12-17 2019-03-08 广东小天才科技有限公司 一种背诵检测方法及用户设备
CN112614478B (zh) * 2020-11-24 2021-08-24 北京百度网讯科技有限公司 音频训练数据处理方法、装置、设备以及存储介质
CN113140223A (zh) * 2021-03-02 2021-07-20 广州朗国电子科技有限公司 一种会议语音数据处理方法、设备及存储介质
CN113064994A (zh) * 2021-03-25 2021-07-02 平安银行股份有限公司 会议质量评估方法、装置、设备及存储介质
CN116312564A (zh) * 2023-05-22 2023-06-23 安徽谱图科技有限公司 一种基于声纹技术的视频会议用啸叫抑制设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1584982A (zh) * 2003-08-04 2005-02-23 索尼株式会社 语音处理装置
CN102682771A (zh) * 2012-04-27 2012-09-19 厦门思德电子科技有限公司 一种适用于云平台的多语音控制方法
CN104936091A (zh) * 2015-05-14 2015-09-23 科大讯飞股份有限公司 基于圆形麦克风阵列的智能交互方法及系统
CN105141768A (zh) * 2015-08-31 2015-12-09 努比亚技术有限公司 多用户识别方法、装置及移动终端
CN105405439A (zh) * 2015-11-04 2016-03-16 科大讯飞股份有限公司 语音播放方法及装置


Also Published As

Publication number Publication date
CN107464570A (zh) 2017-12-12

Similar Documents

Publication Publication Date Title
WO2017210991A1 (zh) 一种语音过滤的方法、装置及系统
US8606249B1 (en) Methods and systems for enhancing audio quality during teleconferencing
WO2015172435A1 (zh) 远程会议中实现有序发言的方法及服务器
EP3049949B1 (en) Acoustic feedback control for conference calls
US11782674B2 (en) Centrally controlling communication at a venue
CN110769352B (zh) 一种信号处理方法、装置以及计算机存储介质
US20090310762A1 (en) System and method for instant voice-activated communications using advanced telephones and data networks
US11924370B2 (en) Method for controlling a real-time conversation and real-time communication and collaboration platform
CN112261346A (zh) 视频会议方法及系统、计算机可读存储介质
US9843683B2 (en) Configuration method for sound collection system for meeting using terminals and server apparatus
US9525979B2 (en) Remote control of separate audio streams with audio authentication
JP2019115049A (ja) 会議設定における参加者の符号化方法
JP2019176386A (ja) 通信端末及び会議システム
US11094328B2 (en) Conferencing audio manipulation for inclusion and accessibility
WO2017219546A1 (zh) 信息处理方法、终端和计算机存储介质
CN114979545A (zh) 多终端的通话方法和存储介质及电子设备
US11037567B2 (en) Transcription of communications
JP2017519379A (ja) オブジェクトベースの遠隔会議プロトコル
TW202042089A (zh) 語音指令處理方法與系統
WO2024004006A1 (ja) チャット端末、チャットシステム、およびチャットシステムの制御方法
WO2022092126A1 (ja) 秘匿性会話可能なWeb会議システム
US20230421620A1 (en) Method and system for handling a teleconference
CN116633908A (zh) 传输连接构建方法以及系统
WO2018017086A1 (en) Determining when participants on a conference call are speaking
KR20050030191A (ko) 컴퓨터 전화 통합을 이용한 다자간 화상회의 개설 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16904440

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16904440

Country of ref document: EP

Kind code of ref document: A1