WO2022007497A1 - Voice processing method and apparatus, system and storage medium - Google Patents

Voice processing method and apparatus, system and storage medium

Info

Publication number
WO2022007497A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
segment
voiceprint feature
voiceprint
character information
Prior art date
Application number
PCT/CN2021/093325
Other languages
French (fr)
Chinese (zh)
Inventor
李�瑞
贾巨涛
张伟伟
戴林
胡广绪
Original Assignee
珠海格力电器股份有限公司
珠海联云科技有限公司
Priority date
Filing date
Publication date
Application filed by 珠海格力电器股份有限公司 and 珠海联云科技有限公司
Publication of WO2022007497A1 publication Critical patent/WO2022007497A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22: Interactive procedures; Man-machine interfaces

Definitions

  • the embodiments of the present disclosure relate to the field of information technology, and in particular, to a voice processing method, apparatus, system, and storage medium.
  • However, related voice message devices cannot identify the current user's identity from his or her voice and cannot accurately classify the current message into the voice database of the corresponding user identity, so that when another family member obtains the message content, he or she may need to listen to all stored messages, which wastes time and degrades the customer experience.
  • the embodiments of the present disclosure provide a voice processing method, device, system, and storage medium.
  • an embodiment of the present disclosure provides a speech processing method, including:
  • Character information corresponding to the voiceprint feature is matched from the voiceprint database.
  • the method further includes:
  • Human voice detection is performed on the first voice segment after noise removal, and the part with human voice is used as the second voice segment.
  • the method further includes:
  • the second voice segment is input into a DNN model, and a first voiceprint feature vector corresponding to the second voice segment is obtained;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds a set threshold is used as the target voiceprint feature vector;
  • the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • the method further includes:
  • the third voice segment is saved in the voice database corresponding to the character information.
  • the method further includes:
  • an embodiment of the present disclosure provides a voice processing apparatus, including:
  • an acquisition module configured to acquire the first voice segment;
  • a processing module configured to extract the human voice part from the first voice segment as a second voice segment;
  • the processing module is further configured to determine the voiceprint feature corresponding to the second voice segment;
  • the determining module is configured to match character information corresponding to the voiceprint feature from the voiceprint database.
  • an embodiment of the present disclosure provides a speech processing system, including:
  • a microphone configured to acquire the first voice segment;
  • a processor configured to extract the human voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from the voiceprint database, the character information corresponding to the voiceprint feature.
  • the processor is specifically configured to perform denoising processing on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • the processor is further configured to input the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector;
  • the person information corresponding to the target voiceprint feature vector is used as the person information of the first voice segment.
  • the system further includes:
  • the microphone is further configured to acquire a third voice segment;
  • the processor is further configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the character information corresponding to the third voice segment; and save the third voice segment in the voice database corresponding to the character information.
  • the system further includes:
  • the processor is further configured to receive a triggering operation on target character information among a plurality of character information, and, based on the target character information, retrieve a fourth voice segment corresponding to the target character information from the voice database;
  • a loudspeaker configured to play the fourth voice segment.
  • an embodiment of the present disclosure provides a storage medium, wherein the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the voice processing method of any one of the above aspects.
  • In the voice processing solution, a first voice segment is acquired; a human voice part is extracted from the first voice segment as a second voice segment; a voiceprint feature corresponding to the second voice segment is determined; and the character information corresponding to the voiceprint feature is matched from the voiceprint database. With this method, the identity of the user can be identified from a voice message, so that messages can be accurately classified and stored in the voice database corresponding to that user. When other users retrieve messages, they can extract the target message according to a specified identity, which avoids wasting time and improves the customer experience.
  • FIG. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of another voice processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 1 , the method specifically includes:
  • The processor of the voice processing system receives, through the microphone, the first voice segment entered by the user. The processor performs voice activity detection on the first voice segment, extracts the part containing a human voice, and uses that part as the second voice segment.
  • the second speech segment is input into the pre-trained voiceprint recognition model, and the voiceprint feature corresponding to the second speech segment is extracted by using the voiceprint recognition model.
  • the sample voices of all members of the family are pre-entered in the voiceprint database, and all sample voices have been marked with their corresponding voiceprint feature labels.
  • The voiceprint feature labels can be entered by the user in the form of voice or typed text.
  • The character information corresponding to the matching voiceprint feature stored in the voiceprint database is used as the character information corresponding to the second voice segment, so as to identify the current user's identity.
  • In this voice processing method, a first voice segment is acquired; a human voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the character information corresponding to the voiceprint feature is matched from the voiceprint database, so that the user's identity can be recognized from the voiceprint feature of the voice.
  • FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 2 , the method specifically includes:
  • The user enters the first voice segment through the microphone of the voice message device. The processor receives the first voice segment and first performs denoising on it: long silent periods are identified and eliminated, and the noise in the first voice segment is removed. When the user records a voice, there may be loud background sounds in the surrounding environment, so these noises need to be removed to obtain the denoised first voice segment.
  • The denoised first voice segment is input into a human voice detection model, which identifies the parts where a person's voice is present; those parts of the first voice segment are extracted as the second voice segment.
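As an illustration of the human voice detection step, the sketch below keeps only frames whose energy exceeds a threshold. The real system uses a trained human voice detection model; this energy-based stand-in, the frame length, and the threshold value are assumptions made purely for illustration.

```python
# Hypothetical energy-based stand-in for the human voice detection model:
# frames whose mean squared energy exceeds a threshold are kept as the
# "second voice segment"; silent frames are discarded.

def extract_voiced(samples, frame_len=160, threshold=0.01):
    """Return the concatenation of frames judged to contain voice."""
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame and sum(x * x for x in frame) / len(frame) > threshold:
            voiced.extend(frame)  # keep this frame: energy above threshold
    return voiced

# 320 samples of silence followed by 320 samples of a loud signal.
signal = [0.0] * 320 + [0.5, -0.5] * 160
second_segment = extract_voiced(signal)
print(len(second_segment))  # only the loud half survives
```

A production system would work on real audio frames and a learned model, but the keep-or-drop structure per frame is the same.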
  • The DNN voiceprint recognition model first divides the second voice segment into frames and extracts the features of each frame of the voice segment; after computation, the first voiceprint feature vector corresponding to the second voice segment is obtained.
  • the average value of the voiceprint feature vectors of the multiple voices is calculated as the first voiceprint feature vector of the voice entered by the user.
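The averaging step above can be sketched as follows. The three-dimensional example embeddings are made up for illustration; a real DNN would produce much higher-dimensional vectors.

```python
# Element-wise mean of several per-utterance embedding vectors, used as the
# user's single voiceprint feature vector (values below are illustrative,
# not real DNN outputs).

def average_embedding(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# Hypothetical embeddings extracted from three utterances by the same user.
enrollments = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.2],
    [1.0, 0.0, 0.1],
]
voiceprint = average_embedding(enrollments)  # close to [0.9, 0.1, 0.1]
```

Averaging several enrollment utterances smooths out per-recording variation before the vector is stored in the voiceprint database.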
  • The similarity between the first voiceprint feature vector corresponding to the first voice segment entered by the current user and each voiceprint feature vector pre-stored in the voiceprint database is computed. If the similarity with a pre-stored voiceprint feature vector exceeds the set threshold (for example, 0.7), that pre-stored voiceprint feature vector is determined to be the target voiceprint feature vector, and the character information corresponding to the target voiceprint feature vector is used as the character information corresponding to the first voice segment, thereby identifying the current user's identity.
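The matching step can be sketched with cosine similarity and the 0.7 threshold mentioned above. The patent does not specify the similarity measure, so cosine similarity is an assumption, and the database contents and names below are illustrative.

```python
# Minimal sketch of voiceprint matching: compare the query vector against
# each enrolled vector; accept the best match only above the 0.7 threshold.
# Cosine similarity and the example vectors are illustrative assumptions.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_person(query, voiceprint_db, threshold=0.7):
    """Return the best-matching person's info, or None if below threshold."""
    best_name, best_sim = None, threshold
    for name, vector in voiceprint_db.items():
        sim = cosine_similarity(query, vector)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

voiceprint_db = {
    "father": [0.9, 0.1, 0.0],
    "mother": [0.1, 0.9, 0.2],
}
print(match_person([0.88, 0.12, 0.05], voiceprint_db))  # prints: father
print(match_person([0.0, 0.0, 1.0], voiceprint_db))     # prints: None
```

Returning None when no enrolled vector clears the threshold corresponds to the case where the speaker is not registered in the voiceprint database.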
  • In this method, a first voice segment is acquired and processed to obtain its corresponding voiceprint feature vector; by comparing the similarity between this vector and the voiceprint feature vectors stored in the voiceprint database, the character information corresponding to the first voice segment is determined. This makes it possible to identify the user from the voiceprint characteristics of the voice and to apply voiceprint recognition technology to voice messages.
  • It can effectively confirm and manage the identity of the person entering a voice message, and uses the voiceprint to distinguish messages.
  • The message content can be extracted according to a specified identity, and the target message can be accurately retrieved, improving the user experience.
  • FIG. 3 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 3 , the method specifically includes:
  • The user first sends a message instruction. After the system receives the instruction, it presents multiple identity options to the user, who can choose according to the actual situation. After selecting the message recipient, the user begins to record the message, which is the third voice segment.
  • the system provides options for who to leave a message to.
  • the options can include: father, mother, spouse, or son.
  • the system enters the message recording mode, and the user records the message content through the microphone.
  • The message system works together with intelligent terminal devices: the background server pushes messages to the intelligent terminal device (for example, a mobile terminal or a PC), prompting the user that another member has left a message for him or her.
  • Denoising and human voice detection are performed on the third voice segment; the processed third voice segment is then input into the pre-trained DNN voiceprint recognition model, and voiceprint feature extraction is performed to obtain the voiceprint feature vector corresponding to the third voice segment.
  • The similarity between the voiceprint feature vector corresponding to the third voice segment entered by the current user and each voiceprint feature vector pre-stored in the voiceprint database is computed. If the similarity exceeds the set threshold (for example, 0.8), the character information corresponding to the pre-stored voiceprint feature vector is used as the character information corresponding to the third voice segment, and the third voice segment is stored in the voice database corresponding to that character information.
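The classify-and-store step can be sketched as below, using the 0.8 threshold from the text. The data layout (a mapping from person to a list of message recordings) and the similarity measure are assumptions for illustration.

```python
# Sketch of filing a message under the person whose enrolled voiceprint it
# matches (threshold 0.8 as in the text). The dict layout and cosine
# similarity are illustrative assumptions, not the patent's specification.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def store_message(msg_vector, msg_audio, voiceprint_db, voice_db, threshold=0.8):
    """Store msg_audio under the matching person; return True on success."""
    for person, enrolled in voiceprint_db.items():
        if cosine_similarity(msg_vector, enrolled) > threshold:
            voice_db.setdefault(person, []).append(msg_audio)
            return True
    return False  # no enrolled voiceprint was similar enough

voiceprint_db = {"mother": [0.1, 0.9, 0.2]}
voice_db = {}
stored = store_message([0.12, 0.88, 0.18], "msg-001.wav", voiceprint_db, voice_db)
print(voice_db)  # the message is filed under "mother"
```

A real system would also handle the unmatched case, for example by prompting the speaker to enroll a voiceprint first.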
  • S35 Receive a triggering operation for the target person information in the plurality of person information.
  • When the user wants to listen to messages, the system determines the user's identity information by analyzing the current user's voice (i.e., the user's voiceprint feature vector) and, according to the family member relationship graph, displays all family member information related to the user. The user can then choose, according to the actual situation, which family member's message to listen to, and the system receives the trigger command for the target person selected by the user.
  • For example, a son's identity labels can include son and grandson; by the same token, a father's identity labels can include father, husband, and son.
  • The fourth voice segment corresponding to the target person, i.e., the target person's message to the current user, is retrieved from the voice database and played through the speaker for the current user to listen to.
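The listen-back flow described above can be sketched as follows. The per-person message store is a hypothetical layout, and playback is represented by simply returning the audio entries that would be handed to the speaker.

```python
# Sketch of message retrieval: list who has left messages, then fetch the
# "fourth voice segment" for the target person selected by the trigger.
# The dict-based voice database is an illustrative assumption.

def list_senders(voice_db):
    """People who have at least one stored message for the listener."""
    return sorted(person for person, msgs in voice_db.items() if msgs)

def retrieve_messages(voice_db, target_person):
    """Fetch the target person's messages for playback (empty if none)."""
    return voice_db.get(target_person, [])

voice_db = {"father": ["msg-001.wav", "msg-002.wav"], "mother": []}
print(list_senders(voice_db))   # only "father" has stored messages
fourth_segment = retrieve_messages(voice_db, "father")
print(fourth_segment)           # entries handed to the speaker to play
```

Filtering out members with no stored messages mirrors the described behavior of only offering the user senders who actually left something.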
  • In this voice processing method, the current user's voice segment is received; the corresponding voiceprint feature is extracted from the voice segment; the character information corresponding to the voiceprint feature is matched from the voiceprint database; and the segment is stored in the voice database corresponding to that character information. The voice message the user wants to listen to can likewise be determined and retrieved from the voice database according to the character information.
  • This method enables confirmation and management of the identity of the person entering a voice message, and uses the voiceprint to distinguish messages. When users obtain the message content of other members, they can extract it according to a specified identity, accurately retrieve the target message, and improve the user experience.
  • FIG. 4 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present disclosure, which specifically includes:
  • the obtaining module 401 is configured to acquire the first voice segment;
  • the processing module 402 is configured to extract the human voice part from the first speech segment as the second speech segment;
  • the processing module 402 is further configured to determine the voiceprint feature corresponding to the second voice segment
  • the determining module 403 is configured to match person information corresponding to the voiceprint feature from the voiceprint database.
  • the acquiring module is specifically configured to acquire a third voice segment; receive a triggering operation on target character information among the plurality of character information; and, based on the target character information, retrieve from the voice database the fourth voice segment corresponding to the target character information.
  • the processing module is specifically configured to perform denoising processing on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • the processing module is further configured to input the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the voiceprint feature vector is matched with the voiceprint feature vector stored in the voiceprint database, and the voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds a set threshold is used as the target voiceprint feature vector.
  • the processing module is further configured to save the third voice clip in a voice database corresponding to the character information, and play the fourth voice clip.
  • the determining module is specifically configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the character information corresponding to the third voice segment; and use the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  • The voice processing apparatus of the server provided in this embodiment may be the voice processing apparatus shown in FIG. 4, and may execute all the steps of the voice processing methods shown in FIG. 1 to FIG. 3, thereby achieving the technical effects of those methods; detailed descriptions are not repeated here.
  • FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure.
  • The various components in the voice processing system 500 shown in FIG. 5 are coupled together by a bus system 505.
  • the bus system 505 is configured to enable connection communication between these components.
  • the bus system 505 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 505 in FIG. 5 .
  • the memory 502 in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • Volatile memory may be Random Access Memory (RAM), which acts as an external cache.
  • Static RAM (SRAM)
  • Dynamic RAM (DRAM)
  • Synchronous DRAM (SDRAM)
  • Double Data Rate SDRAM (DDR SDRAM)
  • Enhanced SDRAM (ESDRAM)
  • Synchlink DRAM (SLDRAM)
  • Direct Rambus RAM (DR RAM)
  • memory 502 stores the following elements, executable units or data structures, or a subset thereof, or an extended set of them: an operating system 5021 and applications 5022.
  • the operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, etc., and is configured to implement various basic services and process hardware-based tasks.
  • the application program 5022 includes various application programs, such as a media player (Media Player), a browser (Browser), etc., and is set to implement various application services.
  • a program implementing the method of the embodiment of the present disclosure may be included in the application program 5022 .
  • The memory 502 stores programs or instructions configured to execute the methods of FIG. 1, FIG. 2, or FIG. 3, and the controller/processor 501 executes the specific steps of those methods;
  • For example, the processor 501 acquires the first voice segment through the microphone 503; the processor 501 extracts the human voice part from the first voice segment as a second voice segment; determines the voiceprint feature corresponding to the second voice segment; and matches, from the voiceprint database, the character information corresponding to the voiceprint feature.
  • The processor 501 performs denoising processing on the first voice segment to obtain a denoised first voice segment, performs human voice detection on the denoised first voice segment, and takes the part containing a human voice as the second voice segment.
  • the processor 501 inputs the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector;
  • the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • The microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the character information corresponding to the third voice segment, and saves the third voice segment in the voice database corresponding to the character information.
  • the processor 501 receives a triggering operation on target person information among the plurality of person information; and retrieves a fourth voice corresponding to the target person information from a voice database based on the target person information segment; the speaker 506 plays the fourth speech segment.
  • the methods disclosed in the above embodiments of the present disclosure may be applied to the processor 501 or implemented by the processor 501 .
  • the processor 501 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 501 or an instruction in the form of software.
  • The above-mentioned processor 501 can be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of the present disclosure can be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502, and completes the steps of the above method in combination with its hardware.
  • the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof.
  • The processing unit can be implemented in one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described in this disclosure, or a combination thereof.
  • the techniques described herein may be implemented by means of units that perform the functions described herein.
  • Software codes may be stored in memory and executed by a processor.
  • the memory can be implemented in the processor or external to the processor.
  • the voice processing system provided in this embodiment may be the voice processing system shown in FIG. 5 , which can perform all the steps of the voice processing method shown in FIG. 1-3 , thereby realizing the technical effect of the voice processing method shown in FIG. 1-3 .
  • Embodiments of the present disclosure also provide a storage medium (computer-readable storage medium).
  • the storage medium here stores one or more programs.
  • The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above types of memory.
  • One or more programs in the storage medium can be executed by one or more processors, so as to implement the above-mentioned speech processing method executed in the speech processing system.
  • the processor is configured to execute the voice processing program stored in the memory to realize the following steps of the voice processing method executed in the voice processing system:
  • Denoising processing is performed on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • The second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector; the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • A third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; based on the voiceprint feature, the character information corresponding to the third voice segment is determined; the third voice segment is saved into the voice database corresponding to the character information.
  • A triggering operation on target character information among the plurality of character information is received; based on the target character information, a fourth voice segment corresponding to the target character information is retrieved from the voice database; the fourth voice segment is played.
  • A software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice processing method and apparatus, a system and a storage medium, the method comprising: acquiring a first voice segment (S11); extracting a human voice part from the first voice segment as a second voice segment (S12); determining voiceprint features corresponding to the second voice segment (S13); and matching, from a voiceprint database, character information corresponding to the voiceprint features (S14). In the method, the identity of the user can be identified according to a voice message, so that the message can be accurately classified and stored in the voice database corresponding to the user. When other users acquire messages, a target message may be extracted according to a designated identity, which prevents wasted time and improves the customer experience.

Description

Voice processing method, apparatus, system and storage medium
This disclosure claims priority to Chinese patent application No. 202010666203.X, entitled "Voice processing method, apparatus, system and storage medium", filed with the China Patent Office on July 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of information technology, and in particular to a voice processing method, apparatus, system and storage medium.
Background
With the development of information technology, smart devices are increasingly used in people's home life, and their functions are becoming ever more comprehensive in order to make life more convenient. When a family member goes out, many people leave messages for other members in the form of notes, phone calls or voice-message devices, to inform them of certain matters that need attention.
However, related voice-message devices cannot identify the current user's identity from the user's voice, and therefore cannot accurately classify the current message into the voice database of the corresponding user. As a result, when another family member retrieves the messages, that member may have to listen to all of them, which wastes time and degrades the user experience.
Summary
In view of this, to solve the above technical problem that a user's identity cannot be identified from a voice message, embodiments of the present disclosure provide a voice processing method, apparatus, system and storage medium.
In a first aspect, an embodiment of the present disclosure provides a voice processing method, including:
acquiring a first voice segment;
extracting a human-voice part from the first voice segment as a second voice segment;
determining a voiceprint feature corresponding to the second voice segment; and
matching, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the method further includes:
performing denoising processing on the first voice segment to obtain the denoised first voice segment; and
performing human-voice detection on the denoised first voice segment, and taking the part containing a human voice as the second voice segment.
In a possible implementation, the method further includes:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
matching the first voiceprint feature vector against voiceprint feature vectors stored in the voiceprint database, and taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and
taking the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
In a possible implementation, the method further includes:
acquiring a third voice segment;
determining a voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, character information corresponding to the third voice segment; and
saving the third voice segment into a voice database corresponding to the character information.
In a possible implementation, the method further includes:
receiving a triggering operation on target character information among a plurality of pieces of character information;
retrieving, from a voice database based on the target character information, a fourth voice segment corresponding to the target character information; and
playing the fourth voice segment.
In a second aspect, an embodiment of the present disclosure provides a voice processing apparatus, including:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract a human-voice part from the first voice segment as a second voice segment;
the processing module being further configured to determine a voiceprint feature corresponding to the second voice segment; and
a determining module, configured to match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a third aspect, an embodiment of the present disclosure provides a voice processing system, including:
a microphone, configured to acquire a first voice segment; and
a processor, configured to extract a human-voice part from the first voice segment as a second voice segment; determine a voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the processor is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the part containing a human voice as the second voice segment.
In a possible implementation, the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against voiceprint feature vectors stored in the voiceprint database, taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
In a possible implementation, the system further includes:
the microphone being further configured to acquire a third voice segment; and
the processor being further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, character information corresponding to the third voice segment; and save the third voice segment into a voice database corresponding to the character information.
In a possible implementation, the system further includes:
the processor being further configured to receive a triggering operation on target character information among a plurality of pieces of character information, and retrieve, from a voice database based on the target character information, a fourth voice segment corresponding to the target character information; and
a speaker, configured to play the fourth voice segment.
In a fourth aspect, an embodiment of the present disclosure provides a storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the voice processing method of any one of the first aspect.
In the voice processing solution provided by the embodiments of the present disclosure, a first voice segment is acquired; a human-voice part is extracted from the first voice segment as a second voice segment; a voiceprint feature corresponding to the second voice segment is determined; and character information corresponding to the voiceprint feature is matched from a voiceprint database. In this way, the user's identity can be identified from a voice message, so that the message can be accurately classified and stored in the voice database corresponding to that user. When other users retrieve messages, they can extract the target message by a specified identity, which avoids wasted time and improves the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
To facilitate understanding of the embodiments of the present disclosure, some implementations are explained below with specific examples in conjunction with the accompanying drawings; these examples do not limit the embodiments of the present disclosure.
FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the method specifically includes:
S11: Acquire a first voice segment.
S12: Extract a human-voice part from the first voice segment as a second voice segment.
In a voice-message device, the processor of the voice processing system receives the first voice segment recorded by the user through the microphone, performs voice activity detection on the first voice segment, extracts the part containing a human voice, and takes that part as the second voice segment.
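The disclosure does not specify the voice-activity detector, but the step can be sketched with a simple frame-energy heuristic (frame size and threshold below are illustrative placeholders, not values from the disclosure):

```python
def extract_voiced(samples, frame_len=160, energy_threshold=0.01):
    """Keep only frames whose mean energy exceeds a threshold (toy VAD)."""
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_threshold:
            voiced.extend(frame)
    return voiced

# Leading/trailing silence (zeros) is dropped; the louder middle part is kept.
first_segment = [0.0] * 320 + [0.5, -0.5] * 80 + [0.0] * 320
second_segment = extract_voiced(first_segment)
```

A production system would use a trained VAD model rather than a fixed energy gate, but the input/output contract is the same: a raw segment in, only the voiced frames out.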
S13: Determine a voiceprint feature corresponding to the second voice segment.
The second voice segment is input into a pre-trained voiceprint recognition model, and the model is used to extract the voiceprint feature corresponding to the second voice segment.
S14: Match character information corresponding to the voiceprint feature from a voiceprint database.
Sample voices of all family members are pre-recorded in the voiceprint database, and every sample voice is marked with its corresponding voiceprint feature label. The labels can be entered by the user by voice or as typed text.
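Enrollment of labeled sample voiceprints might be organized as below; the database structure, label names and feature values are assumptions for illustration, not taken from the disclosure:

```python
voiceprint_db = {}  # identity label -> list of enrolled voiceprint feature vectors

def enroll(person_label, feature_vector):
    """Register one sample voiceprint under a family member's identity label."""
    voiceprint_db.setdefault(person_label, []).append(feature_vector)

# Hypothetical pre-recorded samples for two family members.
enroll("dad", [0.9, 0.1, 0.0])
enroll("mom", [0.1, 0.9, 0.2])
enroll("mom", [0.2, 0.8, 0.1])
```

Keeping a list per label allows several samples per member, which later matching or averaging steps can use.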
Further, by comparing the voiceprint feature extracted from the second voice segment with the voiceprint features stored in the voiceprint database, the character information corresponding to the matching stored voiceprint feature is taken as the character information of the second voice segment, thereby identifying the current user.
In the voice processing method provided by this embodiment of the present disclosure, a first voice segment is acquired; a human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and character information corresponding to the voiceprint feature is matched from a voiceprint database, so that the user's identity can be identified from the voiceprint feature of the voice.
FIG. 2 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure. As shown in FIG. 2, the method specifically includes:
S21: Acquire a first voice segment.
S22: Perform denoising processing on the first voice segment to obtain the denoised first voice segment.
In this embodiment, the user records the first voice segment through the microphone of the voice-message device, and the voice processor receives it. A voice activity detection model first denoises the first voice segment, identifying and eliminating long silent periods and removing the noise in it, because loud background sounds in the surrounding environment may be captured while the user is recording. The result is the denoised first voice segment.
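A minimal sketch of the denoising idea is a noise gate that estimates the noise floor from the quietest frame and zeroes frames near that floor; real systems use spectral subtraction or learned denoisers, and all numbers here are illustrative:

```python
def noise_gate(samples, frame_len=160, margin=4.0):
    """Zero out frames whose energy is below margin x the estimated noise floor."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    noise_floor = min(energies) if energies else 0.0
    out = []
    for frame, energy in zip(frames, energies):
        out.extend(frame if energy > margin * noise_floor else [0.0] * frame_len)
    return out

# Quiet background frames are silenced; the loud speech frame passes through.
cleaned = noise_gate([0.01, -0.01] * 80 + [0.5, -0.5] * 80 + [0.01, -0.01] * 80)
```

Note the gate only suppresses whole low-energy frames; it does not remove noise mixed into speech frames, which is why the disclosure pairs denoising with a separate human-voice detection step.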
S23: Perform human-voice detection on the denoised first voice segment, and take the part containing a human voice as the second voice segment.
The denoised first voice segment is input into a human-voice detection model, which identifies the parts where a person is speaking; those parts are extracted from the first voice segment as the second voice segment.
S24: Input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment.
The extracted second voice segment is input into a pre-trained DNN voiceprint recognition model. The model first splits the second voice segment into frames, extracts the features of each frame, and computes the first voiceprint feature vector corresponding to the second voice segment.
In some implementations, if the current user has recorded multiple voice clips, the average of the voiceprint feature vectors of those clips is used as the first voiceprint feature vector of the user's recording.
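The averaging described above is an element-wise mean over the clips' feature vectors; a small sketch (the vector values are made up):

```python
def average_embedding(vectors):
    """Element-wise mean of several voiceprint feature vectors of equal length."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Two hypothetical per-utterance embeddings collapse into one enrollment vector.
utterances = [[0.5, 0.25, 0.0], [0.5, 0.75, 0.5]]
enrollment_vector = average_embedding(utterances)
```

Averaging several utterances tends to smooth out per-recording variation (channel noise, phrasing), giving a steadier representation of the speaker.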
S25: Match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and take a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as the target voiceprint feature vector.
S26: Take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
The similarity between the first voiceprint feature vector of the first voice segment recorded by the current user and each voiceprint feature vector pre-stored in the voiceprint database is compared. If the similarity between the first voiceprint feature vector and a pre-stored voiceprint feature vector exceeds a set threshold (for example, 0.7), that pre-stored vector is determined to be the target voiceprint feature vector, and the character information corresponding to the target voiceprint feature vector is taken as the character information of the first voice segment, thereby identifying the current user.
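The disclosure does not name the similarity measure; assuming cosine similarity (a common choice for speaker embeddings) with the example threshold of 0.7, the matching step could look like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_person(query, db, threshold=0.7):
    """Return the best-matching person above the threshold, else None."""
    best_person, best_score = None, threshold
    for person, vector in db.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_person, best_score = person, score
    return best_person

# Hypothetical one-vector-per-person database.
db = {"dad": [1.0, 0.0, 0.0], "mom": [0.0, 1.0, 0.0]}
who = match_person([0.9, 0.1, 0.05], db)
```

Returning `None` below the threshold models the "unknown speaker" case, which a real device would have to handle (e.g. by prompting for enrollment).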
In the voice processing method provided by this embodiment of the present disclosure, a first voice segment is acquired and processed to obtain its voiceprint feature vector, which is compared for similarity against the voiceprint feature vectors stored in the voiceprint database to determine the character information of the first voice segment. The user's identity can thus be identified from the voiceprint feature of the voice. Applying voiceprint recognition to voice messages makes it possible to effectively confirm and manage the identity of the person recording a message and to distinguish messages by voiceprint, so that when a user retrieves other members' messages, the content can be extracted by a specified identity, accurately locating the target message and improving the user experience.
FIG. 3 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure. As shown in FIG. 3, the method specifically includes:
S31: Acquire a third voice segment.
In this embodiment, the user first issues a message-leaving instruction. After receiving the instruction, the system presents multiple identity options, from which the user can choose according to the actual situation. After the recipient is selected, the user starts to leave the message, and the system acquires the third voice segment recorded by the user.
For example, the user taps "I want to leave a message", and the system offers options for the recipient, which may include: dad, mom, spouse or son. After the user selects the recipient, the system enters message-recording mode, and the user records the message content through the microphone. Further, this messaging system can work together with smart terminal devices, pushing a notification from the backend server to a terminal (for example, a mobile phone or a PC) to remind the user that another member has left a message.
S32: Determine a voiceprint feature corresponding to the third voice segment.
The third voice segment is first denoised and subjected to human-voice detection, and the processed third voice segment is then input into the pre-trained DNN voiceprint recognition model for voiceprint feature extraction, yielding the voiceprint feature vector corresponding to the third voice segment.
S33: Determine character information corresponding to the third voice segment based on the voiceprint feature.
S34: Save the third voice segment into the voice database corresponding to the character information.
The similarity between the voiceprint feature vector of the third voice segment recorded by the current user and the voiceprint feature vectors pre-stored in the voiceprint database is compared. If the similarity exceeds a set threshold (for example, 0.8), the character information corresponding to the matching pre-stored voiceprint feature vector is taken as the character information of the third voice segment, and the third voice segment is saved into the voice database corresponding to that character information.
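Filing each classified message under the identified speaker, and retrieving it later by speaker and recipient, could be sketched as follows; the storage structure and names are assumptions, not taken from the disclosure:

```python
message_db = {}  # identified speaker -> list of (recipient, audio clip) entries

def save_message(speaker, recipient, audio_clip):
    """File a message in the voice database of the identified speaker."""
    message_db.setdefault(speaker, []).append((recipient, audio_clip))

def messages_for(recipient, speaker):
    """All clips a given speaker left for a given recipient."""
    return [clip for to, clip in message_db.get(speaker, []) if to == recipient]

# Hypothetical messages; in the device these would be audio segments.
save_message("mom", "son", "buy milk on the way home")
save_message("dad", "son", "call grandma tonight")
```

Keying storage by the identified speaker is what lets a listener later pull out only the messages from a designated person instead of playing everything.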
S35: Receive a triggering operation on target character information among a plurality of pieces of character information.
When the user wants to listen to messages, the system first obtains the user's identity from analysis of the current user's voice, and then, according to the family relationship graph, displays all family members related to that user. The user can select which family member's message to listen to according to the actual situation, and the system receives the triggering instruction for the target person selected by the user.
For example, when the user simply says the voice command "play my messages" without specifying whose messages to play, the system determines the user's identity from the user's voiceprint feature vector, displays all related family members according to the family relationship graph, and lets the user choose which family member's message to hear.
As another example, suppose the family members are a son, dad, mom, grandpa and grandma. The son's identity label is "son" relative to dad and mom, but "grandson" relative to grandpa and grandma, so his identity labels can include both "son" and "grandson"; likewise, dad's identity labels can include "dad", "husband" and "son". When a user says "play Dad's messages", the system first selects, based on the voice content, the voice databases carrying the "dad" identity label, namely those of dad and grandpa, and then identifies from the user's voiceprint that the user is the family's dad, thereby determining that the user wants to hear grandpa's messages.
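The disclosure describes this label resolution only at the level of the family relationship graph; one possible representation is a mapping from (speaker, listener) pairs to the label the listener would use for the speaker, so that a spoken label plus the listener's identified identity picks out the intended member:

```python
# Assumed representation of part of the example family's relationship graph:
# (speaker, listener) -> the label the listener uses for that speaker.
relationship = {
    ("dad", "son"): "dad",
    ("grandpa", "dad"): "dad",
    ("dad", "grandpa"): "son",
    ("grandpa", "son"): "grandpa",
}

def resolve_speaker(label, listener):
    """Which member(s) the listener means when asking for e.g. 'dad'."""
    return [speaker for (speaker, hearer), lab in relationship.items()
            if hearer == listener and lab == label]
```

With this structure, "dad" resolves to the family's dad when the son is listening, but to grandpa when the dad himself is listening, matching the example above.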
S36: Retrieve, from the voice database based on the target character information, a fourth voice segment corresponding to the target character information.
S37: Play the fourth voice segment.
According to the target person whose messages the current user has chosen to hear, the fourth voice segment corresponding to that person, i.e., the message the target person left for the current user, is retrieved from the voice database and played to the current user through the speaker.
In the voice processing method provided by this embodiment of the present disclosure, the current user's voice segment is received; the corresponding voiceprint feature is extracted from the voice segment; character information corresponding to the voiceprint feature is matched from the voiceprint database; and the message is stored in the voice database corresponding to that character information. The voice message the user wants to hear can likewise be determined based on character information and retrieved from the voice database. This method makes it possible to confirm and manage the identity of the person recording a voice message, to distinguish messages by voiceprint, and, when a user retrieves other members' messages, to extract the content by a specified identity, accurately locating the target message and improving the user experience.
FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure, specifically including:
an acquisition module 401, configured to acquire a first voice segment;
a processing module 402, configured to extract a human-voice part from the first voice segment as a second voice segment;
the processing module 402 being further configured to determine a voiceprint feature corresponding to the second voice segment; and
a determining module 403, configured to match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the acquisition module is specifically configured to acquire a third voice segment; receive a triggering operation on target character information among a plurality of pieces of character information; and retrieve, from the voice database based on the target character information, a fourth voice segment corresponding to the target character information.
In a possible implementation, the processing module is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, and to perform human-voice detection on the denoised first voice segment, taking the part containing a human voice as the second voice segment.
In a possible implementation, the processing module is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment, and to match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as the target voiceprint feature vector.
In a possible implementation, the processing module is further configured to save the third voice segment into the voice database corresponding to the character information, and to play the fourth voice segment.
In a possible implementation, the determining module is specifically configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, character information corresponding to the third voice segment; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
The voice processing apparatus provided in this embodiment may be the apparatus shown in FIG. 4, and can perform all steps of the voice processing methods of FIGS. 1-3, thereby achieving the technical effects of those methods. For details, refer to the descriptions of FIGS. 1-3; for brevity, they are not repeated here.
图5为本公开实施例提供的一种语音处理系统的结构示意图,图5所示的语音处理系统500包括:至少一个处理器501、存储器502、麦克风503、至少一个网络接口504、扬声器506。语音处理系统500中的各个组件通过总线系统505耦合在一起。可理解,总线系统505被设置为实现这些组件之间的连接通信。总线系统505除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图5中将各种总线都标为总线系统505。FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure. The voice processing system 500 shown in FIG. The various components in speech processing system 500 are coupled together by bus system 505 . It will be appreciated that the bus system 505 is configured to enable connection communication between these components. In addition to the data bus, the bus system 505 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 505 in FIG. 5 .
可以理解,本公开实施例中的存储器502可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。本文描述的存储器502旨在包括但不限于这些和任意其它适合类型的存储器。It will be appreciated that the memory 502 in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. Wherein, the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically programmable read-only memory (Erasable PROM, EPROM). Erase programmable read-only memory (Electrically EPROM, EEPROM) or flash memory. Volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synch link DRAM, SLDRAM) ) and direct memory bus random access memory (Direct Rambus RAM, DRRAM). The memory 502 described herein is intended to include, but not be limited to, these and any other suitable types of memory.
In some embodiments, the memory 502 stores the following elements, executable units, or data structures, or a subset or extended set thereof: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is configured to implement various basic services and to handle hardware-based tasks. The application programs 5022 include various applications, such as a media player and a browser, and are configured to implement various application services. A program implementing the method of an embodiment of the present disclosure may be included in the application programs 5022.
In the embodiments of the present disclosure, the memory 502 stores programs or instructions configured to perform the method of FIG. 1, FIG. 2, or FIG. 3, and the controller/processor 501 executes the specific steps of FIG. 1, FIG. 2, or FIG. 3;
for example, a first voice segment is acquired through the microphone 503; the processor 501 extracts the human voice part from the first voice segment as a second voice segment, determines the voiceprint feature corresponding to the second voice segment, and matches character information corresponding to the voiceprint feature from a voiceprint database.
In a possible implementation, the processor 501 performs denoising processing on the first voice segment to obtain the denoised first voice segment, performs human voice detection on the denoised first voice segment, and takes the part in which a human voice is present as the second voice segment.
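The voice-detection step above can be sketched with a simple energy-threshold voice activity detector. This is an illustrative stand-in only, not the detector the disclosure specifies: the frame length, the energy ratio, and the noise-floor estimate are all assumptions made for the example.

```python
import numpy as np

def extract_voice_part(signal, rate, frame_ms=20, energy_ratio=4.0):
    """Return the concatenated frames of `signal` whose short-time energy
    exceeds `energy_ratio` times an estimated noise floor (a crude VAD).

    Illustrative assumptions: 20 ms frames, and a noise floor taken as the
    10th percentile of per-frame energy.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # short-time energy per frame
    noise_floor = np.percentile(energy, 10) + 1e-12
    voiced = energy > energy_ratio * noise_floor  # frames with a "human voice"
    return frames[voiced].ravel()

# Usage: 1 s of low-level noise with a loud burst in the middle standing in
# for speech; only the burst should survive as the "second voice segment".
rate = 16000
rng = np.random.default_rng(0)
signal = 0.01 * rng.standard_normal(rate)
signal[6000:10000] += np.sin(2 * np.pi * 220 * np.arange(4000) / rate)
second_segment = extract_voice_part(signal, rate)
```

In a real system the denoising stage (e.g., spectral subtraction) would run before this detector; it is omitted here to keep the sketch short.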
In a possible implementation, the processor 501 inputs the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment, matches the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, takes a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector, and takes the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
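A minimal sketch of the matching step follows. The DNN embedding itself is outside the scope of the sketch, so the "first voiceprint feature vector" is supplied directly; similarity is taken here as cosine similarity, and the enrolled names and vectors are made up for illustration — none of these choices are stated by the disclosure.

```python
import numpy as np

def match_person(embedding, voiceprint_db, threshold=0.8):
    """Return the person whose stored voiceprint vector is most similar to
    `embedding`, provided the cosine similarity exceeds `threshold`;
    otherwise return None. `voiceprint_db` maps person info -> vector."""
    best_person, best_sim = None, threshold
    for person, vector in voiceprint_db.items():
        sim = np.dot(embedding, vector) / (
            np.linalg.norm(embedding) * np.linalg.norm(vector))
        if sim > best_sim:
            best_person, best_sim = person, sim
    return best_person

# Hypothetical database of two enrolled speakers (vectors are made up).
db = {"alice": np.array([1.0, 0.0, 0.2]),
      "bob":   np.array([0.0, 1.0, 0.1])}
query = np.array([0.9, 0.1, 0.2])   # embedding close to alice's voiceprint
matched = match_person(query, db)    # -> "alice"
```

Returning `None` when no stored vector clears the threshold corresponds to the case where the speaker is not yet enrolled in the voiceprint database.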
In a possible implementation, the microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the character information corresponding to the third voice segment, and saves the third voice segment into the voice database corresponding to the character information.
In a possible implementation, the processor 501 receives a triggering operation on target character information among a plurality of pieces of character information, and retrieves, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; the speaker 506 plays the fourth voice segment.
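The two implementations above — saving a segment under the identified person, then retrieving a segment for a selected target person — can be combined into one small store. The dictionary-backed design below is an illustrative assumption, not the storage layout of the disclosure.

```python
from collections import defaultdict

class VoiceArchive:
    """Stores voice segments keyed by the person identified from the
    segment's voiceprint, and retrieves them for playback on demand."""

    def __init__(self):
        self._by_person = defaultdict(list)

    def save(self, person_info, segment):
        # Corresponds to saving the third voice segment into the voice
        # database associated with the identified person.
        self._by_person[person_info].append(segment)

    def fetch(self, target_person):
        # Corresponds to retrieving the fourth voice segment for the
        # target person selected by the triggering operation.
        return self._by_person.get(target_person, [])

archive = VoiceArchive()
archive.save("grandma", b"\x01\x02")  # hypothetical raw audio bytes
archive.save("grandma", b"\x03\x04")
clips = archive.fetch("grandma")       # segments to be played by the speaker
```

A production system would persist the segments and index them by a stable person identifier rather than a display name, but the control flow is the same.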
The methods disclosed in the above embodiments of the present disclosure may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip with signal processing capability. In implementation, each step of the above methods may be completed by an integrated hardware logic circuit in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present disclosure may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software unit may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 502; the processor 501 reads the information in the memory 502 and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, or other electronic units configured to perform the functions described in the present disclosure, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by units that perform the functions described herein. Software code may be stored in a memory and executed by a processor. The memory may be implemented in the processor or external to the processor.
The voice processing system provided in this embodiment may be the voice processing system shown in FIG. 5, and can perform all the steps of the voice processing method shown in FIGS. 1-3, thereby achieving the technical effects of the voice processing method shown in FIGS. 1-3; for details, reference is made to the related descriptions of FIGS. 1-3, which, for brevity, are not repeated here.
An embodiment of the present disclosure further provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above types of memory.
The one or more programs in the storage medium can be executed by one or more processors, so as to implement the above voice processing method performed in the voice processing system.
The processor is configured to execute the voice processing program stored in the memory, so as to implement the following steps of the voice processing method performed in the voice processing system:
acquiring a first voice segment; extracting the human voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching character information corresponding to the voiceprint feature from a voiceprint database.
In a possible implementation, denoising processing is performed on the first voice segment to obtain the denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part in which a human voice is present is taken as the second voice segment.
In a possible implementation, the second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold is taken as a target voiceprint feature vector; the character information corresponding to the target voiceprint feature vector is taken as the character information of the first voice segment.
In a possible implementation, a third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; based on the voiceprint feature, the character information corresponding to the third voice segment is determined; and the third voice segment is saved into the voice database corresponding to the character information.
In a possible implementation, a triggering operation on target character information among a plurality of pieces of character information is received; a fourth voice segment corresponding to the target character information is retrieved from a voice database based on the target character information; and the fourth voice segment is played.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered to be beyond the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further describe in detail the purposes, technical solutions, and beneficial effects of the present disclosure. It should be understood that the above descriptions are only specific embodiments of the present disclosure and are not intended to limit the scope of protection of the present disclosure; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (12)

  1. A voice processing method, comprising:
    acquiring a first voice segment;
    extracting a human voice part from the first voice segment as a second voice segment;
    determining a voiceprint feature corresponding to the second voice segment; and
    matching character information corresponding to the voiceprint feature from a voiceprint database.
  2. The method according to claim 1, wherein extracting the human voice part from the first voice segment as the second voice segment comprises:
    performing denoising processing on the first voice segment to obtain the denoised first voice segment; and
    performing human voice detection on the denoised first voice segment, and taking the part in which a human voice is present as the second voice segment.
  3. The method according to claim 2, wherein determining the voiceprint feature corresponding to the second voice segment comprises:
    inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
    and wherein matching the character information corresponding to the voiceprint feature from the voiceprint database comprises:
    matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and
    taking the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  4. The method according to any one of claims 1-3, further comprising:
    acquiring a third voice segment;
    determining a voiceprint feature corresponding to the third voice segment;
    determining, based on the voiceprint feature, character information corresponding to the third voice segment; and
    saving the third voice segment into a voice database corresponding to the character information.
  5. The method according to claim 4, further comprising:
    receiving a triggering operation on target character information among a plurality of pieces of character information;
    retrieving, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; and
    playing the fourth voice segment.
  6. A voice processing apparatus, comprising:
    an acquisition module, configured to acquire a first voice segment;
    a processing module, configured to extract a human voice part from the first voice segment as a second voice segment,
    wherein the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment; and
    a determination module, configured to match character information corresponding to the voiceprint feature from a voiceprint database.
  7. A voice processing system, comprising:
    a microphone, configured to acquire a first voice segment; and
    a processor, configured to extract a human voice part from the first voice segment as a second voice segment, determine a voiceprint feature corresponding to the second voice segment, and match character information corresponding to the voiceprint feature from a voiceprint database.
  8. The system according to claim 7, wherein the processor is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, perform human voice detection on the denoised first voice segment, and take the part in which a human voice is present as the second voice segment.
  9. The system according to claim 8, wherein the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and take a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  10. The system according to any one of claims 7-9, wherein:
    the microphone is further configured to acquire a third voice segment; and
    the processor is further configured to determine a voiceprint feature corresponding to the third voice segment, determine, based on the voiceprint feature, character information corresponding to the third voice segment, and save the third voice segment into a voice database corresponding to the character information.
  11. The system according to claim 10, wherein the processor is further configured to receive a triggering operation on target character information among a plurality of pieces of character information, and to retrieve, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; and wherein the system further comprises:
    a speaker, configured to play the fourth voice segment.
  12. A storage medium storing one or more programs, wherein the one or more programs are executable by one or more processors so as to implement the voice processing method according to any one of claims 1-5.
PCT/CN2021/093325 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium WO2022007497A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010666203.X 2020-07-08
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
WO2022007497A1

Family

ID=72842801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093325 WO2022007497A1 (en) 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium

Country Status (2)

Country Link
CN (1) CN111816191A (en)
WO (1) WO2022007497A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258535A (en) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
US9804822B2 (en) * 2014-07-29 2017-10-31 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN109994118A (en) * 2019-04-04 2019-07-09 平安科技(深圳)有限公司 Speech cipher verification method, device, storage medium and computer equipment
CN110265037A (en) * 2019-06-13 2019-09-20 中信银行股份有限公司 Auth method, device, electronic equipment and computer readable storage medium
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN110544481A (en) * 2019-08-27 2019-12-06 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110970036B (en) * 2019-12-24 2022-07-12 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment


Also Published As

Publication number Publication date
CN111816191A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
US10083687B2 (en) Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
US10210870B2 (en) Method for verification and blacklist detection using a biometrics platform
US9542150B2 (en) Controlling audio players using environmental audio analysis
US6219407B1 (en) Apparatus and method for improved digit recognition and caller identification in telephone mail messaging
US7995732B2 (en) Managing audio in a multi-source audio environment
WO2022007497A1 (en) Voice processing method and apparatus, system and storage medium
CN107995360B (en) Call processing method and related product
US10270736B2 (en) Account adding method, terminal, server, and computer storage medium
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
US20180108358A1 (en) Voice Categorisation
CN108831477B (en) Voice recognition method, device, equipment and storage medium
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
JP2017203808A (en) Interaction processing program, interaction processing method, and information processing apparatus
WO2015149359A1 (en) Method for automatically adjusting volume, volume adjustment apparatus and electronic device
WO2020024415A1 (en) Voiceprint recognition processing method and apparatus, electronic device and storage medium
WO2020055465A1 (en) Inline responses to video or voice messages
CN109271480B (en) Voice question searching method and electronic equipment
US10726850B2 (en) Systems and methods of sound-based fraud protection
CN113051426A (en) Audio information classification method and device, electronic equipment and storage medium
CN111986680A (en) Method and device for evaluating spoken language of object, storage medium and electronic device
CN111785280A (en) Identity authentication method and device, storage medium and electronic equipment
CN117153185B (en) Call processing method, device, computer equipment and storage medium
CN114242120B (en) Audio editing method and audio marking method based on DTMF technology
CN113873085B (en) Voice start-up white generation method and related device
JP2019035897A (en) Determination device, determination method, and determination program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21836789; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21836789; Country of ref document: EP; Kind code of ref document: A1)