CN111816191A - Voice processing method, device, system and storage medium - Google Patents

Voice processing method, device, system and storage medium

Info

Publication number
CN111816191A
Authority
CN
China
Prior art keywords
voice
voiceprint
fragment
segment
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010666203.XA
Other languages
Chinese (zh)
Inventor
李�瑞
贾巨涛
张伟伟
戴林
胡广绪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202010666203.XA priority Critical patent/CN111816191A/en
Publication of CN111816191A publication Critical patent/CN111816191A/en
Priority to PCT/CN2021/093325 priority patent/WO2022007497A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention relates to a voice processing method, device, system and storage medium, wherein the method comprises the following steps: acquiring a first voice segment; extracting the human-voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching, from a voiceprint database, the person information corresponding to the voiceprint feature. In this way, the identity of the user can be identified from the voice message, and the message can be accurately classified and stored in the voice database corresponding to that user, so that when other users retrieve messages, the target message can be extracted according to a specified identity, avoiding wasted time and improving the user experience.

Description

Voice processing method, device, system and storage medium
Technical Field
Embodiments of the present invention relate to the field of information technologies, and in particular, to a method, an apparatus, a system, and a storage medium for speech processing.
Background
With the development of information technology, smart devices are increasingly used in family life, and, to make life more convenient, their functions are becoming ever more comprehensive. When family members go out, many of them leave reminders for the other members in the form of written notes, telephone calls, voice message devices and the like, informing the other family members of matters needing attention.
However, an existing voice message device cannot identify the identity of the current user from the user's voice, and cannot accurately classify the current message into the voice database under the corresponding user identity. As a result, when another family member wants to obtain message content, he or she may need to listen to all of the recorded messages, which wastes time and degrades the user experience.
Disclosure of Invention
In view of this, in order to solve the technical problem that the user identity cannot be identified according to the voice message, embodiments of the present invention provide a voice processing method, apparatus, system, and storage medium.
In a first aspect, an embodiment of the present invention provides a speech processing method, including:
acquiring a first voice segment;
extracting the human-voice part from the first voice segment as a second voice segment;
determining the voiceprint feature corresponding to the second voice segment;
and matching, from a voiceprint database, the person information corresponding to the voiceprint feature.
In one possible embodiment, the method further comprises:
denoising the first voice segment to obtain a denoised first voice segment;
and performing human-voice detection on the denoised first voice segment, and taking the portion containing human voice as the second voice segment.
In one possible embodiment, the method further comprises:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector;
and taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible embodiment, the method further comprises:
acquiring a third voice segment;
determining the voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, the person information corresponding to the third voice segment;
and storing the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the method further comprises:
receiving a trigger operation on target person information among a plurality of pieces of person information;
retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and playing the fourth voice segment.
In a second aspect, an embodiment of the present invention provides a speech processing apparatus, including:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract the human-voice part from the first voice segment as a second voice segment;
the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment;
and a determining module, configured to match, from a voiceprint database, the person information corresponding to the voiceprint feature.
In a third aspect, an embodiment of the present invention provides a speech processing system, including:
a microphone, configured to acquire a first voice segment;
a processor, configured to extract the human-voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the processor is specifically configured to denoise the first voice segment to obtain a denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the portion containing human voice as the second voice segment.
In a possible embodiment, the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible embodiment, the system further comprises:
the microphone is further configured to acquire a third voice segment;
and the processor is further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and store the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the system further comprises:
the processor is further configured to receive a trigger operation on target person information among a plurality of pieces of person information, and to retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and a loudspeaker, configured to play the fourth voice segment.
In a fourth aspect, an embodiment of the present invention provides a storage medium, wherein the storage medium stores one or more programs that are executable by one or more processors to implement the voice processing method of any one of the implementations of the first aspect.
According to the voice processing scheme provided by the embodiment of the invention, a first voice segment is acquired; the human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the person information corresponding to the voiceprint feature is matched from the voiceprint database. In this way, the identity of the user can be identified from the voice message, the message can be accurately classified and stored in the voice database corresponding to that user, and when other users retrieve messages, the target message can be extracted according to a specified identity, avoiding wasted time and improving the user experience.
Drawings
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart of another speech processing method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
and S11, acquiring the first voice fragment.
And S12, extracting a human voice part from the first voice fragment as a second voice fragment.
In the voice message leaving equipment, a processor of a voice processing system receives a first voice segment recorded by a user through a microphone, the processor performs voice activity detection processing on the first voice segment, extracts a voice part, and takes the voice segment with voice as a second voice segment.
S13, determining the voiceprint feature corresponding to the second voice segment.
The second voice segment is input into a pre-trained voiceprint recognition model, and the voiceprint feature corresponding to the second voice segment is extracted by the model.
S14, matching, from the voiceprint database, the person information corresponding to the voiceprint feature.
Sample voices of every member of the family are pre-recorded in the voiceprint database, and each sample voice is annotated with a corresponding voiceprint feature label; the labels can be entered by the user either by voice or as typed text.
Further, the extracted voiceprint feature of the second voice segment is compared with the voiceprint features stored in the voiceprint database, and the person information corresponding to the matching stored voiceprint feature is taken as the person information corresponding to the second voice segment, thereby identifying the identity of the current user.
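To make the enrollment of such a voiceprint database concrete, a minimal Python sketch is shown below. The `extract_embedding` stub and all names are illustrative assumptions; the embodiment only requires that each member's sample voices be stored under a person label.

```python
import numpy as np

def extract_embedding(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Placeholder for the voiceprint recognition model of S13.

    A real system would run a pre-trained DNN here; this stub only
    illustrates the interface (waveform in, fixed-length vector out).
    """
    raise NotImplementedError

def enroll_family_members(samples: dict[str, list[np.ndarray]],
                          sample_rate: int) -> dict[str, np.ndarray]:
    """Build the voiceprint database: person label -> enrolled voiceprint.

    Each member provides one or more sample utterances; their embeddings
    are averaged into a single enrolled vector, as in S24 below.
    """
    database = {}
    for person, utterances in samples.items():
        vectors = [extract_embedding(u, sample_rate) for u in utterances]
        database[person] = np.mean(vectors, axis=0)
    return database
```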
According to the voice processing method provided by the embodiment of the invention, a first voice segment is acquired; the human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the person information corresponding to the voiceprint feature is matched from the voiceprint database, so that the identity of the user can be identified from the voiceprint feature of the voice.
Fig. 2 is a schematic flow chart of another speech processing method according to an embodiment of the present invention, and as shown in fig. 2, the method specifically includes:
and S21, acquiring the first voice fragment.
S22, denoising the first voice segment to obtain the first voice segment with noise removed.
In the embodiment of the present invention, a user records a first voice segment through a microphone of a voice message apparatus, a voice processor receives the first voice segment, performs denoising processing on the first voice segment through a voice activity detection model, identifies and eliminates a long silence period from the first voice segment, and removes noise in the first voice segment.
And S23, carrying out voice detection on the first voice segment after the noise is removed, and taking the part with the voice as a second voice segment.
Inputting the denoised first voice segment into a voice detection model, identifying a part with character speaking voice by using the voice detection model, and extracting the part with the character speaking voice in the first voice segment to be used as a second voice segment.
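The embodiment does not prescribe a particular detection algorithm. A minimal energy-based sketch in NumPy follows; the frame length and threshold are assumed values, and a production system would typically use a trained voice activity detection model as described above.

```python
import numpy as np

def extract_voiced_part(signal: np.ndarray, sample_rate: int,
                        frame_ms: int = 30,
                        threshold_db: float = -35.0) -> np.ndarray:
    """Keep only frames whose energy exceeds a threshold (assumed heuristic).

    signal: mono waveform scaled to [-1, 1].
    Returns the concatenation of voiced frames (the "second voice segment").
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Frame energy in dB relative to full scale; silent frames fall below it.
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    energy_db = 20.0 * np.log10(rms + 1e-12)

    voiced = frames[energy_db > threshold_db]
    return voiced.reshape(-1) if len(voiced) else np.array([])
```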
S24, inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment.
The extracted second voice segment is input into a pre-trained DNN (Deep Neural Network) voiceprint recognition model. The model first performs a framing operation on the second voice segment, extracts the features of each frame, and computes the first voiceprint feature vector corresponding to the second voice segment.
Optionally, if the current user has recorded multiple voice samples, the average of the voiceprint feature vectors of the multiple samples is computed and used as the first voiceprint feature vector of the user's voice input.
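As a sketch of how the framing, per-frame feature extraction and averaging of S24 could fit together, the following PyTorch code defines a toy frame-level embedding network. The architecture and feature dimensions are invented for illustration; the embodiment only states that a pre-trained DNN produces the voiceprint feature vector.

```python
import torch
import torch.nn as nn

class VoiceprintDNN(nn.Module):
    """Toy frame-level embedding network (architecture is an assumption)."""

    def __init__(self, feat_dim: int = 40, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim) acoustic features of one utterance.
        frame_embeddings = self.net(frames)
        # Average frame-level embeddings into one utterance-level voiceprint.
        return frame_embeddings.mean(dim=0)

# Usage sketch: average embeddings over several utterances of the same user,
# mirroring the optional averaging step described in S24.
model = VoiceprintDNN()
utterances = [torch.randn(120, 40), torch.randn(95, 40)]  # stand-in features
with torch.no_grad():
    voiceprint = torch.stack([model(u) for u in utterances]).mean(dim=0)
```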
S25, matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector.
S26, taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
The first voiceprint feature vector corresponding to the first voice segment input by the current user is compared for similarity with the voiceprint feature vectors pre-stored in the voiceprint database. If the similarity between the first voiceprint feature vector and a pre-stored voiceprint feature vector exceeds a set threshold (for example, 0.7), that pre-stored voiceprint feature vector is determined to be the target voiceprint feature vector, and the person information corresponding to it is taken as the person information corresponding to the first voice segment, thereby identifying the identity of the current user.
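A minimal matching sketch for S25-S26 follows, using the example threshold of 0.7. The embodiment does not fix the similarity measure, so the use of cosine similarity here is an assumption.

```python
import numpy as np

def match_person(query: np.ndarray, database: dict[str, np.ndarray],
                 threshold: float = 0.7) -> str | None:
    """Return the person label whose enrolled voiceprint is most similar
    to the query vector, or None if no similarity exceeds the threshold."""
    best_person, best_score = None, threshold
    for person, enrolled in database.items():
        # Cosine similarity between query and enrolled voiceprint vectors.
        score = float(np.dot(query, enrolled) /
                      (np.linalg.norm(query) * np.linalg.norm(enrolled) + 1e-12))
        if score > best_score:
            best_person, best_score = person, score
    return best_person
```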
According to the voice processing method provided by the embodiment of the invention, a first voice segment is acquired and processed to obtain its corresponding voiceprint feature vector, which is compared for similarity against the voiceprint feature vectors stored in the voiceprint database to determine the person information corresponding to the first voice segment. The identity of the user can thus be identified from the voiceprint feature of the voice. Applying voiceprint recognition to voice messaging makes it possible to reliably confirm and manage the identity of whoever records a message and to distinguish messages by voiceprint, so that when a user retrieves the messages of other members, the content can be extracted according to a specified identity, the target message can be located accurately, and the user experience is improved.
Fig. 3 is a schematic flow chart of another speech processing method according to an embodiment of the present invention, and as shown in fig. 3, the method specifically includes:
and S31, acquiring a third voice fragment.
In the embodiment of the invention, a user firstly sends a message leaving instruction, a system provides a plurality of identity options for the user after receiving the instruction sent by the user, the user can select according to the actual situation, a message is left after selecting a message leaving object, and the system obtains a third voice fragment recorded by the user.
For example, the user clicks "i want to leave a message," and the system provides options to whom to leave a message, which may include: and after the user selects an object to be left, the system enters a message recording mode, and the user inputs message contents through a microphone. Further, the message leaving system can be combined with the intelligent terminal device, and sends messages to the intelligent terminal device (for example, a mobile phone terminal and a PC terminal) from the background server to prompt the user that other members leave messages for the intelligent terminal device.
S32, determining the voiceprint feature corresponding to the third voice segment.
The third voice segment is denoised and subjected to human-voice detection, and the processed third voice segment is then input into the pre-trained DNN (Deep Neural Network) voiceprint recognition model for voiceprint feature extraction, yielding the voiceprint feature vector corresponding to the third voice segment.
S33, determining, based on the voiceprint feature, the person information corresponding to the third voice segment.
S34, storing the third voice segment into the voice database corresponding to the person information.
The voiceprint feature vector corresponding to the third voice segment input by the current user is compared for similarity with the voiceprint feature vectors pre-stored in the voiceprint database. If the similarity exceeds a set threshold (for example, 0.8), the person information corresponding to the matching pre-stored voiceprint feature vector is taken as the person information corresponding to the third voice segment, and the third voice segment is stored into the voice database corresponding to that person information.
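As one way to realize S34, the sketch below stores each message in a SQLite table keyed by the identified sender. The schema and names are assumptions; the embodiment only requires that the message be saved into the voice database corresponding to the person information.

```python
import sqlite3
import time

def save_message(db_path: str, person: str, audio_bytes: bytes,
                 recipient: str) -> None:
    """Store one message clip under the identified speaker's identity."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               sender    TEXT,  -- person identified by voiceprint (S33)
               recipient TEXT,  -- member chosen when leaving the message
               ts        REAL,  -- recording time (epoch seconds)
               audio     BLOB   -- raw audio of the third voice segment
           )""")
    con.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
                (person, recipient, time.time(), audio_bytes))
    con.commit()
    con.close()
```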
S35, receiving a trigger operation on target person information among a plurality of pieces of person information.
When a user wants to listen to a message, the system analyzes the current user's voice to obtain the user's identity information and, according to the family relationship graph, displays all the family members related to that user. The user can then select whose message to listen to according to the actual situation, and the system receives the trigger instruction for the target person selected by the user.
For example, when a user directly speaks the voice command "listen to a message" without specifying whose message, the system determines the user's identity information from the user's voiceprint feature vector and, according to the family relationship graph, displays all the family members related to that user, from whom the user can choose.
As another example, suppose the family members are a child, dad, mom, grandpa and grandma. Relative to dad and mom the child's identity tag is "child", and relative to grandpa and grandma it is "grandchild", so the child's identity tags may include: child and grandchild; similarly, dad's identity tags may include: dad, husband and son. When a user says "listen to the message left by dad", the system first selects, according to the voice content, the voice databases bearing the "dad" identity tag, which in this family correspond to both dad and grandpa; it then identifies from the user's voiceprint feature that the current user is dad, and therefore determines that the message the user wants to hear is the one left by the user's own father, i.e., grandpa.
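The dad/grandpa disambiguation in this example can be expressed with a small relationship mapping. The encoding below is hypothetical; the embodiment describes a family relationship graph without fixing a data structure.

```python
# Hypothetical relationship graph: child -> parent links are enough to
# resolve a tag like "dad" relative to whoever is asking.
FATHER_OF = {
    "child": "dad",      # the child's father is dad
    "dad": "grandpa",    # dad's father is grandpa
}

def resolve_tag(listener: str, tag: str) -> str | None:
    """Resolve a relative identity tag ("dad") to a concrete member,
    given the listener's identity recovered from their voiceprint."""
    if tag == "dad":
        return FATHER_OF.get(listener)
    return None  # other tags (mom, grandchild, ...) handled similarly

# If the voiceprint identifies the listener as dad, "listen to the
# message left by dad" resolves to grandpa's messages.
assert resolve_tag("dad", "dad") == "grandpa"
```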
S36, retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information.
S37, playing the fourth voice segment.
According to the target person whose message the current user has chosen to hear, the fourth voice segment corresponding to that person is retrieved from the voice database; the fourth voice segment is the message voice left by the target person for the current user, and it is played to the current user through the loudspeaker.
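Continuing the SQLite sketch above, retrieval for S36 might look as follows; playback through the loudspeaker (S37) is device-specific and is left as a stub.

```python
import sqlite3

def fetch_messages(db_path: str, sender: str, recipient: str) -> list[bytes]:
    """Return all message clips left by `sender` for `recipient`,
    oldest first (the "fourth voice segment" of S36)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT audio FROM messages WHERE sender=? AND recipient=? ORDER BY ts",
        (sender, recipient)).fetchall()
    con.close()
    return [row[0] for row in rows]

def play(audio_bytes: bytes) -> None:
    """Device-specific playback through the loudspeaker (S37); stubbed here."""
    raise NotImplementedError
```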
The voice processing method provided by the embodiment of the invention receives the voice segment of the current user and extracts the corresponding voiceprint feature from it. The identity of whoever records a voice message can thus be confirmed and managed, and messages can be distinguished by voiceprint, so that when a user retrieves the message content of other members, it can be extracted according to a specified identity, the target message can be located accurately, and the user experience is improved.
Fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention, which specifically includes:
an obtaining module 401, configured to obtain a first voice segment;
a processing module 402, configured to extract a human voice portion from the first voice segment as a second voice segment;
the processing module 402 is further configured to determine a voiceprint feature corresponding to the second speech segment;
a determining module 403, configured to match the person information corresponding to the voiceprint feature from the voiceprint database.
In a possible embodiment, the obtaining module is specifically configured to obtain a third voice segment; receive a trigger operation on target person information among a plurality of pieces of person information; and retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information.
In a possible implementation manner, the processing module is specifically configured to perform denoising processing on the first speech segment to obtain a denoised first speech segment; and carrying out voice detection on the first voice segment after the noise is removed, and taking the part with the voice as a second voice segment.
In a possible embodiment, the processing module is further configured to input the second speech segment into a DNN model, so as to obtain a first voiceprint feature vector corresponding to the second speech segment; and matching the first voiceprint feature vector with the voiceprint feature vectors stored in the voiceprint database, and taking the voiceprint feature vector with the similarity exceeding a set threshold as a target voiceprint feature vector.
In a possible implementation manner, the processing module is further configured to store the third voice segment into the voice database corresponding to the person information, and to play the fourth voice segment.
In a possible embodiment, the determining module is specifically configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
The voice processing apparatus provided in this embodiment may be the voice processing apparatus shown in fig. 4, and may perform all the steps of the voice processing methods shown in fig. 1 to 3 so as to achieve their technical effects; for brevity, please refer to the descriptions related to fig. 1 to 3, which are not repeated here.
Fig. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention, and the speech processing system 500 shown in fig. 5 includes: at least one processor 501, memory 502, microphone 503, at least one network interface 504, speaker 506. The various components of the speech processing system 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection communications between these components. The bus system 505 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 505 in FIG. 5.
It is to be understood that the memory 502 in embodiments of the present invention may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application 5022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. The program for implementing the method according to the embodiment of the present invention may be included in the application program 5022.
In the embodiment of the present invention, the memory 502 stores the programs or instructions for executing the methods of fig. 1, fig. 2 or fig. 3, and the controller/processor 501 executes the specific steps of fig. 1, fig. 2 or fig. 3.
For example, the microphone 503 acquires a first voice segment; the processor 501 extracts the human-voice part from the first voice segment as a second voice segment, determines the voiceprint feature corresponding to the second voice segment, and matches, from the voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the processor 501 denoises the first voice segment to obtain a denoised first voice segment, performs human-voice detection on the denoised first voice segment, and takes the portion containing human voice as the second voice segment.
In a possible embodiment, the processor 501 inputs the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; matches the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and takes the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible implementation, the microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the person information corresponding to the third voice segment, and stores the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the processor 501 receives a trigger operation on target person information among a plurality of pieces of person information, and retrieves, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information; the speaker 506 plays the fourth voice segment.
The methods disclosed in the above-mentioned embodiments of the present invention may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware, or by instructions in the form of software, in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software elements in a decoding processor. The software elements may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and completes the steps of the methods in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The speech processing system provided in this embodiment may be the speech processing system shown in fig. 5, and may perform all the steps of the speech processing method shown in fig. 1-3, so as to achieve the technical effects of the speech processing method shown in fig. 1-3, and for brevity, it is not described herein again.
The embodiment of the invention further provides a storage medium (a computer-readable storage medium) storing one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state disk; it may also comprise a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the voice processing method described above as being performed in a voice processing system.
The processor is configured to execute the voice processing program stored in the memory, so as to implement the following steps of the voice processing method performed in the voice processing system:
acquiring a first voice segment; extracting the human-voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching, from the voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the first voice segment is denoised to obtain a denoised first voice segment; human-voice detection is performed on the denoised first voice segment, and the portion containing human voice is taken as the second voice segment.
In a possible implementation manner, the second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and the stored voiceprint feature vector whose similarity exceeds a set threshold is taken as a target voiceprint feature vector; the person information corresponding to the target voiceprint feature vector is taken as the person information of the first voice segment.
In one possible embodiment, a third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; the person information corresponding to the third voice segment is determined based on the voiceprint feature; and the third voice segment is stored into the voice database corresponding to the person information.
In one possible embodiment, a trigger operation on target person information among a plurality of pieces of person information is received; a fourth voice segment corresponding to the target person information is retrieved from the voice database based on the target person information; and the fourth voice segment is played.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method of speech processing, comprising:
acquiring a first voice segment;
extracting the human-voice part from the first voice segment as a second voice segment;
determining the voiceprint feature corresponding to the second voice segment;
and matching, from a voiceprint database, the person information corresponding to the voiceprint feature.
2. The method according to claim 1, wherein the extracting the human-voice part from the first voice segment as a second voice segment comprises:
denoising the first voice segment to obtain a denoised first voice segment;
and performing human-voice detection on the denoised first voice segment, and taking the portion containing human voice as the second voice segment.
3. The method of claim 2, wherein the determining the voiceprint feature corresponding to the second voice segment comprises:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
and wherein the matching, from the voiceprint database, of the person information corresponding to the voiceprint feature comprises:
matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector;
and taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
4. The method according to any one of claims 1-3, further comprising:
acquiring a third voice segment;
determining the voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, the person information corresponding to the third voice segment;
and storing the third voice segment into the voice database corresponding to the person information.
5. The method of claim 4, further comprising:
receiving a trigger operation on target person information among a plurality of pieces of person information;
retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and playing the fourth voice segment.
6. A speech processing apparatus, comprising:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract the human-voice part from the first voice segment as a second voice segment;
the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment;
and a determining module, configured to match, from a voiceprint database, the person information corresponding to the voiceprint feature.
7. A speech processing system, comprising:
a microphone, configured to acquire a first voice segment;
a processor, configured to extract the human-voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, the person information corresponding to the voiceprint feature.
8. The system according to claim 7, wherein the processor is specifically configured to denoise the first voice segment to obtain a denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the portion containing human voice as the second voice segment.
9. The system of claim 8, wherein the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
10. The system according to any one of claims 7-9, further comprising:
the microphone is further configured to acquire a third voice segment;
and the processor is further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and store the third voice segment into the voice database corresponding to the person information.
11. The system of claim 10, further comprising:
the processor is further configured to receive a trigger operation on target person information among a plurality of pieces of person information, and to retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and a loudspeaker, configured to play the fourth voice segment.
12. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the speech processing method of any one of claims 1 to 5.
CN202010666203.XA 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium Pending CN111816191A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium
PCT/CN2021/093325 WO2022007497A1 (en) 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN111816191A 2020-10-23

Family

ID=72842801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010666203.XA Pending CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Country Status (2)

Country Link
CN (1) CN111816191A (en)
WO (1) WO2022007497A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library
WO2022007497A1 (en) * 2020-07-08 2022-01-13 珠海格力电器股份有限公司 Voice processing method and apparatus, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN103258535A (en) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
KR102339657B1 (en) * 2014-07-29 2021-12-16 삼성전자주식회사 Electronic device and control method thereof
CN109994118B (en) * 2019-04-04 2022-10-11 平安科技(深圳)有限公司 Voice password verification method and device, storage medium and computer equipment
CN110265037B (en) * 2019-06-13 2022-09-30 中信银行股份有限公司 Identity verification method and device, electronic equipment and computer readable storage medium
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022007497A1 (en) * 2020-07-08 2022-01-13 珠海格力电器股份有限公司 Voice processing method and apparatus, system and storage medium
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library

Also Published As

Publication number Publication date
WO2022007497A1 (en) 2022-01-13

Similar Documents

Publication Publication Date Title
US10446140B2 (en) Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US6219407B1 (en) Apparatus and method for improved digit recognition and caller identification in telephone mail messaging
CN107995360B (en) Call processing method and related product
CN108920640B (en) Context obtaining method and device based on voice interaction
CN111785275A (en) Voice recognition method and device
CN102568478A (en) Video play control method and system based on voice recognition
CN111785279A (en) Video speaker identification method and device, computer equipment and storage medium
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
CN111816191A (en) Voice processing method, device, system and storage medium
CN106887231A (en) A kind of identification model update method and system and intelligent terminal
CN108364638A (en) A kind of voice data processing method, device, electronic equipment and storage medium
CN111739506A (en) Response method, terminal and storage medium
CN112397072B (en) Voice detection method and device, electronic equipment and storage medium
CN111986680A (en) Method and device for evaluating spoken language of object, storage medium and electronic device
CN109271480B (en) Voice question searching method and electronic equipment
CN110660385A (en) Command word detection method and electronic equipment
CN111785280B (en) Identity authentication method and device, storage medium and electronic equipment
CN113920996A (en) Voice interaction processing method and device, electronic equipment and storage medium
CN112861816A (en) Abnormal behavior detection method and device
CN114242120B (en) Audio editing method and audio marking method based on DTMF technology
CN113051902B (en) Voice data desensitizing method, electronic equipment and computer readable storage medium
US20240212702A1 (en) Manual-enrollment-free personalized denoise
TWI690814B (en) Text message processing device and method、computer storage medium and mobile terminal
CN108833656B (en) Call content early warning reminding method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201023)