CN111816191A - Voice processing method, device, system and storage medium - Google Patents

Voice processing method, device, system and storage medium

Info

Publication number
CN111816191A
Authority
CN
China
Prior art keywords
voice
voiceprint
fragment
segment
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010666203.XA
Other languages
Chinese (zh)
Inventor
李�瑞
贾巨涛
张伟伟
戴林
胡广绪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202010666203.XA priority Critical patent/CN111816191A/en
Publication of CN111816191A publication Critical patent/CN111816191A/en
Priority to PCT/CN2021/093325 priority patent/WO2022007497A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention relates to a voice processing method, device, system and storage medium, wherein the method comprises the following steps: acquiring a first voice segment; extracting the human-voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching, from a voiceprint database, the person information corresponding to the voiceprint feature. In this way, the identity of the user can be identified from the voice message, and the message can be accurately classified and stored in the voice database corresponding to that user, so that when other users retrieve messages, the target message can be extracted according to a specified identity, avoiding wasted time and improving the user experience.

Description

Voice processing method, device, system and storage medium
Technical Field
Embodiments of the present invention relate to the field of information technologies, and in particular, to a method, an apparatus, a system, and a storage medium for speech processing.
Background
With the development of information technology, smart devices are increasingly used in family life, and, to make life more convenient, their functions are becoming ever more comprehensive. When family members go out, many of them leave reminders for the other members in the form of written notes, telephone calls, voice message devices and the like, informing the other family members of matters needing attention.
However, an existing voice message device cannot identify the identity of the current user from the user's voice, and cannot accurately classify the current message into the voice database under the corresponding user identity. As a result, when another family member wants to obtain message content, he or she may need to listen to all of the recorded messages, which wastes time and degrades the user experience.
Disclosure of Invention
In view of this, in order to solve the technical problem that the user identity cannot be identified according to the voice message, embodiments of the present invention provide a voice processing method, apparatus, system, and storage medium.
In a first aspect, an embodiment of the present invention provides a speech processing method, including:
acquiring a first voice segment;
extracting the human-voice part from the first voice segment as a second voice segment;
determining the voiceprint feature corresponding to the second voice segment;
and matching, from a voiceprint database, the person information corresponding to the voiceprint feature.
In one possible embodiment, the method further comprises:
denoising the first voice segment to obtain a denoised first voice segment;
and performing human-voice detection on the denoised first voice segment, and taking the portion containing human voice as the second voice segment.
In one possible embodiment, the method further comprises:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector;
and taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible embodiment, the method further comprises:
acquiring a third voice segment;
determining the voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, the person information corresponding to the third voice segment;
and storing the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the method further comprises:
receiving a trigger operation on target person information among a plurality of pieces of person information;
retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and playing the fourth voice segment.
In a second aspect, an embodiment of the present invention provides a speech processing apparatus, including:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract the human-voice part from the first voice segment as a second voice segment;
the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment;
and a determining module, configured to match, from a voiceprint database, the person information corresponding to the voiceprint feature.
In a third aspect, an embodiment of the present invention provides a speech processing system, including:
a microphone, configured to acquire a first voice segment;
a processor, configured to extract the human-voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the processor is specifically configured to denoise the first voice segment to obtain a denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the portion containing human voice as the second voice segment.
In a possible embodiment, the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible embodiment, the system further comprises:
the microphone is further configured to acquire a third voice segment;
and the processor is further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and store the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the system further comprises:
the processor is further configured to receive a trigger operation on target person information among a plurality of pieces of person information, and to retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and a loudspeaker, configured to play the fourth voice segment.
In a fourth aspect, an embodiment of the present invention provides a storage medium, wherein the storage medium stores one or more programs that are executable by one or more processors to implement the voice processing method of any one of the implementations of the first aspect.
According to the voice processing scheme provided by the embodiment of the invention, a first voice segment is acquired; the human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the person information corresponding to the voiceprint feature is matched from the voiceprint database. In this way, the identity of the user can be identified from the voice message, the message can be accurately classified and stored in the voice database corresponding to that user, and when other users retrieve messages, the target message can be extracted according to a specified identity, avoiding wasted time and improving the user experience.
Drawings
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech processing method according to an embodiment of the present invention;
FIG. 3 is a flow chart of another speech processing method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
and S11, acquiring the first voice fragment.
And S12, extracting a human voice part from the first voice fragment as a second voice fragment.
In the voice message leaving equipment, a processor of a voice processing system receives a first voice segment recorded by a user through a microphone, the processor performs voice activity detection processing on the first voice segment, extracts a voice part, and takes the voice segment with voice as a second voice segment.
S13, determining the voiceprint feature corresponding to the second voice segment.
The second voice segment is input into a pre-trained voiceprint recognition model, and the voiceprint feature corresponding to the second voice segment is extracted by the model.
S14, matching, from the voiceprint database, the person information corresponding to the voiceprint feature.
Sample voices of every member of the family are pre-recorded in the voiceprint database, and each sample voice is annotated with a corresponding voiceprint feature label; the labels can be entered by the user either by voice or as typed text.
Further, the extracted voiceprint feature of the second voice segment is compared with the voiceprint features stored in the voiceprint database, and the person information corresponding to the matching stored voiceprint feature is taken as the person information corresponding to the second voice segment, thereby identifying the identity of the current user.
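To make the enrollment of such a voiceprint database concrete, a minimal Python sketch is shown below. The `extract_embedding` stub and all names are illustrative assumptions; the embodiment only requires that each member's sample voices be stored under a person label.

```python
import numpy as np

def extract_embedding(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Placeholder for the voiceprint recognition model of S13.

    A real system would run a pre-trained DNN here; this stub only
    illustrates the interface (waveform in, fixed-length vector out).
    """
    raise NotImplementedError

def enroll_family_members(samples: dict[str, list[np.ndarray]],
                          sample_rate: int) -> dict[str, np.ndarray]:
    """Build the voiceprint database: person label -> enrolled voiceprint.

    Each member provides one or more sample utterances; their embeddings
    are averaged into a single enrolled vector, as in S24 below.
    """
    database = {}
    for person, utterances in samples.items():
        vectors = [extract_embedding(u, sample_rate) for u in utterances]
        database[person] = np.mean(vectors, axis=0)
    return database
```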
According to the voice processing method provided by the embodiment of the invention, a first voice segment is acquired; the human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the person information corresponding to the voiceprint feature is matched from the voiceprint database, so that the identity of the user can be identified from the voiceprint feature of the voice.
Fig. 2 is a schematic flow chart of another speech processing method according to an embodiment of the present invention, and as shown in fig. 2, the method specifically includes:
and S21, acquiring the first voice fragment.
S22, denoising the first voice segment to obtain the first voice segment with noise removed.
In the embodiment of the present invention, a user records a first voice segment through a microphone of a voice message apparatus, a voice processor receives the first voice segment, performs denoising processing on the first voice segment through a voice activity detection model, identifies and eliminates a long silence period from the first voice segment, and removes noise in the first voice segment.
And S23, carrying out voice detection on the first voice segment after the noise is removed, and taking the part with the voice as a second voice segment.
Inputting the denoised first voice segment into a voice detection model, identifying a part with character speaking voice by using the voice detection model, and extracting the part with the character speaking voice in the first voice segment to be used as a second voice segment.
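The embodiment does not prescribe a particular detection algorithm. A minimal energy-based sketch in NumPy follows; the frame length and threshold are assumed values, and a production system would typically use a trained voice activity detection model as described above.

```python
import numpy as np

def extract_voiced_part(signal: np.ndarray, sample_rate: int,
                        frame_ms: int = 30,
                        threshold_db: float = -35.0) -> np.ndarray:
    """Keep only frames whose energy exceeds a threshold (assumed heuristic).

    signal: mono waveform scaled to [-1, 1].
    Returns the concatenation of voiced frames (the "second voice segment").
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Frame energy in dB relative to full scale; silent frames fall below it.
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    energy_db = 20.0 * np.log10(rms + 1e-12)

    voiced = frames[energy_db > threshold_db]
    return voiced.reshape(-1) if len(voiced) else np.array([])
```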
S24, inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment.
The extracted second voice segment is input into a pre-trained DNN (Deep Neural Network) voiceprint recognition model. The model first performs a framing operation on the second voice segment, extracts the features of each frame, and computes the first voiceprint feature vector corresponding to the second voice segment.
Optionally, if the current user has recorded multiple voice samples, the average of the voiceprint feature vectors of the multiple samples is computed and used as the first voiceprint feature vector of the user's voice input.
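As a sketch of how the framing, per-frame feature extraction and averaging of S24 could fit together, the following PyTorch code defines a toy frame-level embedding network. The architecture and feature dimensions are invented for illustration; the embodiment only states that a pre-trained DNN produces the voiceprint feature vector.

```python
import torch
import torch.nn as nn

class VoiceprintDNN(nn.Module):
    """Toy frame-level embedding network (architecture is an assumption)."""

    def __init__(self, feat_dim: int = 40, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim) acoustic features of one utterance.
        frame_embeddings = self.net(frames)
        # Average frame-level embeddings into one utterance-level voiceprint.
        return frame_embeddings.mean(dim=0)

# Usage sketch: average embeddings over several utterances of the same user,
# mirroring the optional averaging step described in S24.
model = VoiceprintDNN()
utterances = [torch.randn(120, 40), torch.randn(95, 40)]  # stand-in features
with torch.no_grad():
    voiceprint = torch.stack([model(u) for u in utterances]).mean(dim=0)
```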
S25, matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector.
S26, taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
The first voiceprint feature vector corresponding to the first voice segment input by the current user is compared for similarity with the voiceprint feature vectors pre-stored in the voiceprint database. If the similarity between the first voiceprint feature vector and a pre-stored voiceprint feature vector exceeds a set threshold (for example, 0.7), that pre-stored voiceprint feature vector is determined to be the target voiceprint feature vector, and the person information corresponding to it is taken as the person information corresponding to the first voice segment, thereby identifying the identity of the current user.
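A minimal matching sketch for S25-S26 follows, using the example threshold of 0.7. The embodiment does not fix the similarity measure, so the use of cosine similarity here is an assumption.

```python
import numpy as np

def match_person(query: np.ndarray, database: dict[str, np.ndarray],
                 threshold: float = 0.7) -> str | None:
    """Return the person label whose enrolled voiceprint is most similar
    to the query vector, or None if no similarity exceeds the threshold."""
    best_person, best_score = None, threshold
    for person, enrolled in database.items():
        # Cosine similarity between query and enrolled voiceprint vectors.
        score = float(np.dot(query, enrolled) /
                      (np.linalg.norm(query) * np.linalg.norm(enrolled) + 1e-12))
        if score > best_score:
            best_person, best_score = person, score
    return best_person
```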
According to the voice processing method provided by the embodiment of the invention, a first voice segment is acquired and processed to obtain its corresponding voiceprint feature vector, which is compared for similarity against the voiceprint feature vectors stored in the voiceprint database to determine the person information corresponding to the first voice segment. The identity of the user can thus be identified from the voiceprint feature of the voice. Applying voiceprint recognition to voice messaging makes it possible to reliably confirm and manage the identity of whoever records a message and to distinguish messages by voiceprint, so that when a user retrieves the messages of other members, the content can be extracted according to a specified identity, the target message can be located accurately, and the user experience is improved.
Fig. 3 is a schematic flow chart of another speech processing method according to an embodiment of the present invention, and as shown in fig. 3, the method specifically includes:
and S31, acquiring a third voice fragment.
In the embodiment of the invention, a user firstly sends a message leaving instruction, a system provides a plurality of identity options for the user after receiving the instruction sent by the user, the user can select according to the actual situation, a message is left after selecting a message leaving object, and the system obtains a third voice fragment recorded by the user.
For example, the user clicks "i want to leave a message," and the system provides options to whom to leave a message, which may include: and after the user selects an object to be left, the system enters a message recording mode, and the user inputs message contents through a microphone. Further, the message leaving system can be combined with the intelligent terminal device, and sends messages to the intelligent terminal device (for example, a mobile phone terminal and a PC terminal) from the background server to prompt the user that other members leave messages for the intelligent terminal device.
S32, determining the voiceprint feature corresponding to the third voice segment.
The third voice segment is denoised and subjected to human-voice detection, and the processed third voice segment is then input into the pre-trained DNN (Deep Neural Network) voiceprint recognition model for voiceprint feature extraction, yielding the voiceprint feature vector corresponding to the third voice segment.
S33, determining, based on the voiceprint feature, the person information corresponding to the third voice segment.
S34, storing the third voice segment into the voice database corresponding to the person information.
The voiceprint feature vector corresponding to the third voice segment input by the current user is compared for similarity with the voiceprint feature vectors pre-stored in the voiceprint database. If the similarity exceeds a set threshold (for example, 0.8), the person information corresponding to the matching pre-stored voiceprint feature vector is taken as the person information corresponding to the third voice segment, and the third voice segment is stored into the voice database corresponding to that person information.
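As one way to realize S34, the sketch below stores each message in a SQLite table keyed by the identified sender. The schema and names are assumptions; the embodiment only requires that the message be saved into the voice database corresponding to the person information.

```python
import sqlite3
import time

def save_message(db_path: str, person: str, audio_bytes: bytes,
                 recipient: str) -> None:
    """Store one message clip under the identified speaker's identity."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               sender    TEXT,  -- person identified by voiceprint (S33)
               recipient TEXT,  -- member chosen when leaving the message
               ts        REAL,  -- recording time (epoch seconds)
               audio     BLOB   -- raw audio of the third voice segment
           )""")
    con.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
                (person, recipient, time.time(), audio_bytes))
    con.commit()
    con.close()
```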
S35, receiving a trigger operation on target person information among a plurality of pieces of person information.
When a user wants to listen to a message, the system analyzes the current user's voice to obtain the user's identity information and, according to the family relationship graph, displays all the family members related to that user. The user can then select whose message to listen to according to the actual situation, and the system receives the trigger instruction for the target person selected by the user.
For example, when a user directly speaks the voice command "listen to a message" without specifying whose message, the system determines the user's identity information from the user's voiceprint feature vector and, according to the family relationship graph, displays all the family members related to that user, from whom the user can choose.
As another example, suppose the family members are a child, dad, mom, grandpa and grandma. Relative to dad and mom the child's identity tag is "child", and relative to grandpa and grandma it is "grandchild", so the child's identity tags may include: child and grandchild; similarly, dad's identity tags may include: dad, husband and son. When a user says "listen to the message left by dad", the system first selects, according to the voice content, the voice databases bearing the "dad" identity tag, which in this family correspond to both dad and grandpa; it then identifies from the user's voiceprint feature that the current user is dad, and therefore determines that the message the user wants to hear is the one left by the user's own father, i.e., grandpa.
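The dad/grandpa disambiguation in this example can be expressed with a small relationship mapping. The encoding below is hypothetical; the embodiment describes a family relationship graph without fixing a data structure.

```python
# Hypothetical relationship graph: child -> parent links are enough to
# resolve a tag like "dad" relative to whoever is asking.
FATHER_OF = {
    "child": "dad",      # the child's father is dad
    "dad": "grandpa",    # dad's father is grandpa
}

def resolve_tag(listener: str, tag: str) -> str | None:
    """Resolve a relative identity tag ("dad") to a concrete member,
    given the listener's identity recovered from their voiceprint."""
    if tag == "dad":
        return FATHER_OF.get(listener)
    return None  # other tags (mom, grandchild, ...) handled similarly

# If the voiceprint identifies the listener as dad, "listen to the
# message left by dad" resolves to grandpa's messages.
assert resolve_tag("dad", "dad") == "grandpa"
```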
S36, retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information.
S37, playing the fourth voice segment.
According to the target person whose message the current user has chosen to hear, the fourth voice segment corresponding to that person is retrieved from the voice database; the fourth voice segment is the message voice left by the target person for the current user, and it is played to the current user through the loudspeaker.
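Continuing the SQLite sketch above, retrieval for S36 might look as follows; playback through the loudspeaker (S37) is device-specific and is left as a stub.

```python
import sqlite3

def fetch_messages(db_path: str, sender: str, recipient: str) -> list[bytes]:
    """Return all message clips left by `sender` for `recipient`,
    oldest first (the "fourth voice segment" of S36)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT audio FROM messages WHERE sender=? AND recipient=? ORDER BY ts",
        (sender, recipient)).fetchall()
    con.close()
    return [row[0] for row in rows]

def play(audio_bytes: bytes) -> None:
    """Device-specific playback through the loudspeaker (S37); stubbed here."""
    raise NotImplementedError
```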
The voice processing method provided by the embodiment of the invention receives the voice segment of the current user and extracts the corresponding voiceprint feature from it. The identity of whoever records a voice message can thus be confirmed and managed, and messages can be distinguished by voiceprint, so that when a user retrieves the message content of other members, it can be extracted according to a specified identity, the target message can be located accurately, and the user experience is improved.
Fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention, which specifically includes:
an obtaining module 401, configured to obtain a first voice segment;
a processing module 402, configured to extract a human voice portion from the first voice segment as a second voice segment;
the processing module 402 is further configured to determine a voiceprint feature corresponding to the second speech segment;
a determining module 403, configured to match the person information corresponding to the voiceprint feature from the voiceprint database.
In a possible embodiment, the obtaining module is specifically configured to obtain a third voice segment; receive a trigger operation on target person information among a plurality of pieces of person information; and retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information.
In a possible implementation manner, the processing module is specifically configured to perform denoising processing on the first speech segment to obtain a denoised first speech segment; and carrying out voice detection on the first voice segment after the noise is removed, and taking the part with the voice as a second voice segment.
In a possible embodiment, the processing module is further configured to input the second speech segment into a DNN model, so as to obtain a first voiceprint feature vector corresponding to the second speech segment; and matching the first voiceprint feature vector with the voiceprint feature vectors stored in the voiceprint database, and taking the voiceprint feature vector with the similarity exceeding a set threshold as a target voiceprint feature vector.
In a possible implementation manner, the processing module is further configured to store the third voice segment into the voice database corresponding to the person information, and to play the fourth voice segment.
In a possible embodiment, the determining module is specifically configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
The voice processing apparatus provided in this embodiment may be the voice processing apparatus shown in fig. 4, and may perform all the steps of the voice processing methods shown in fig. 1 to 3 so as to achieve their technical effects; for brevity, please refer to the descriptions related to fig. 1 to 3, which are not repeated here.
Fig. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present invention, and the speech processing system 500 shown in fig. 5 includes: at least one processor 501, memory 502, microphone 503, at least one network interface 504, speaker 506. The various components of the speech processing system 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection communications between these components. The bus system 505 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 505 in FIG. 5.
It is to be understood that the memory 502 in embodiments of the present invention may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application 5022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. The program for implementing the method according to the embodiment of the present invention may be included in the application program 5022.
In the embodiment of the present invention, the memory 502 stores the programs or instructions for executing the methods of fig. 1, fig. 2 or fig. 3, and the controller/processor 501 executes the specific steps of fig. 1, fig. 2 or fig. 3.
For example, the microphone 503 acquires a first voice segment; the processor 501 extracts the human-voice part from the first voice segment as a second voice segment, determines the voiceprint feature corresponding to the second voice segment, and matches, from the voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the processor 501 denoises the first voice segment to obtain a denoised first voice segment, performs human-voice detection on the denoised first voice segment, and takes the portion containing human voice as the second voice segment.
In a possible embodiment, the processor 501 inputs the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; matches the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and takes the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
In one possible implementation, the microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the person information corresponding to the third voice segment, and stores the third voice segment into the voice database corresponding to the person information.
In one possible embodiment, the processor 501 receives a trigger operation on target person information among a plurality of pieces of person information, and retrieves, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information; the speaker 506 plays the fourth voice segment.
The methods disclosed in the above-mentioned embodiments of the present invention may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware, or by instructions in the form of software, in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software elements in a decoding processor. The software elements may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and completes the steps of the methods in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The speech processing system provided in this embodiment may be the speech processing system shown in fig. 5, and may perform all the steps of the speech processing method shown in fig. 1-3, so as to achieve the technical effects of the speech processing method shown in fig. 1-3, and for brevity, it is not described herein again.
The embodiment of the invention further provides a storage medium (a computer-readable storage medium) storing one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state disk; it may also comprise a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the voice processing method described above as being performed in a voice processing system.
The processor is configured to execute the voice processing program stored in the memory, so as to implement the following steps of the voice processing method performed in the voice processing system:
acquiring a first voice segment; extracting the human-voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching, from the voiceprint database, the person information corresponding to the voiceprint feature.
In a possible implementation manner, the first voice segment is denoised to obtain a denoised first voice segment; human-voice detection is performed on the denoised first voice segment, and the portion containing human voice is taken as the second voice segment.
In a possible implementation manner, the second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and the stored voiceprint feature vector whose similarity exceeds a set threshold is taken as a target voiceprint feature vector; the person information corresponding to the target voiceprint feature vector is taken as the person information of the first voice segment.
In one possible embodiment, a third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; the person information corresponding to the third voice segment is determined based on the voiceprint feature; and the third voice segment is stored into the voice database corresponding to the person information.
In one possible embodiment, a trigger operation on target person information among a plurality of pieces of person information is received; a fourth voice segment corresponding to the target person information is retrieved from the voice database based on the target person information; and the fourth voice segment is played.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method of speech processing, comprising:
acquiring a first voice segment;
extracting the human-voice part from the first voice segment as a second voice segment;
determining the voiceprint feature corresponding to the second voice segment;
and matching, from a voiceprint database, the person information corresponding to the voiceprint feature.
2. The method according to claim 1, wherein the extracting the human-voice part from the first voice segment as a second voice segment comprises:
denoising the first voice segment to obtain a denoised first voice segment;
and performing human-voice detection on the denoised first voice segment, and taking the portion containing human voice as the second voice segment.
3. The method of claim 2, wherein the determining the voiceprint feature corresponding to the second voice segment comprises:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
and wherein the matching, from the voiceprint database, of the person information corresponding to the voiceprint feature comprises:
matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector;
and taking the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
4. The method according to any one of claims 1-3, further comprising:
acquiring a third voice segment;
determining the voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, the person information corresponding to the third voice segment;
and storing the third voice segment into the voice database corresponding to the person information.
5. The method of claim 4, further comprising:
receiving a trigger operation on target person information among a plurality of pieces of person information;
retrieving, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and playing the fourth voice segment.
6. A speech processing apparatus, comprising:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract the human-voice part from the first voice segment as a second voice segment;
the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment;
and a determining module, configured to match, from a voiceprint database, the person information corresponding to the voiceprint feature.
7. A speech processing system, comprising:
a microphone, configured to acquire a first voice segment;
a processor, configured to extract the human-voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, the person information corresponding to the voiceprint feature.
8. The system according to claim 7, wherein the processor is specifically configured to denoise the first voice segment to obtain a denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the portion containing human voice as the second voice segment.
9. The system of claim 8, wherein the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking the stored voiceprint feature vector whose similarity exceeds a set threshold as a target voiceprint feature vector; and take the person information corresponding to the target voiceprint feature vector as the person information of the first voice segment.
10. The system according to any one of claims 7-9, further comprising:
the microphone is further configured to acquire a third voice segment;
and the processor is further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the person information corresponding to the third voice segment; and store the third voice segment into the voice database corresponding to the person information.
11. The system of claim 10, further comprising:
the processor is further configured to receive a trigger operation on target person information among a plurality of pieces of person information, and to retrieve, from the voice database and based on the target person information, a fourth voice segment corresponding to the target person information;
and a loudspeaker, configured to play the fourth voice segment.
12. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the speech processing method of any one of claims 1 to 5.
CN202010666203.XA 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium Pending CN111816191A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium
PCT/CN2021/093325 WO2022007497A1 (en) 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN111816191A 2020-10-23

Family

ID=72842801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010666203.XA Pending CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Country Status (2)

Country Link
CN (1) CN111816191A (en)
WO (1) WO2022007497A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library
WO2022007497A1 (en) * 2020-07-08 2022-01-13 珠海格力电器股份有限公司 Voice processing method and apparatus, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN103258535A (en) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
KR102339657B1 (en) * 2014-07-29 2021-12-16 삼성전자주식회사 Electronic device and control method thereof
CN109994118B (en) * 2019-04-04 2022-10-11 平安科技(深圳)有限公司 Voice password verification method and device, storage medium and computer equipment
CN110265037B (en) * 2019-06-13 2022-09-30 中信银行股份有限公司 Identity verification method and device, electronic equipment and computer readable storage medium
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN110970036A (en) * 2019-12-24 2020-04-07 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022007497A1 (en) * 2020-07-08 2022-01-13 珠海格力电器股份有限公司 Voice processing method and apparatus, system and storage medium
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library

Also Published As

Publication number Publication date
WO2022007497A1 (en) 2022-01-13

Similar Documents

Publication Publication Date Title
US10446140B2 (en) Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US6219407B1 (en) Apparatus and method for improved digit recognition and caller identification in telephone mail messaging
CN107995360B (en) Call processing method and related product
CN108920640B (en) Context obtaining method and device based on voice interaction
CN111785275A (en) Voice recognition method and device
CN102568478A (en) Video play control method and system based on voice recognition
CN111785279A (en) Video speaker identification method and device, computer equipment and storage medium
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
CN111816191A (en) Voice processing method, device, system and storage medium
CN106887231A (en) A kind of identification model update method and system and intelligent terminal
CN108364638A (en) A kind of voice data processing method, device, electronic equipment and storage medium
CN111739506A (en) Response method, terminal and storage medium
CN112397072B (en) Voice detection method and device, electronic equipment and storage medium
CN111986680A (en) Method and device for evaluating spoken language of object, storage medium and electronic device
CN109271480B (en) Voice question searching method and electronic equipment
CN110660385A (en) Command word detection method and electronic equipment
CN111785280B (en) Identity authentication method and device, storage medium and electronic equipment
CN113920996A (en) Voice interaction processing method and device, electronic equipment and storage medium
CN112861816A (en) Abnormal behavior detection method and device
CN114242120B (en) Audio editing method and audio marking method based on DTMF technology
CN113051902B (en) Voice data desensitizing method, electronic equipment and computer readable storage medium
US20240212702A1 (en) Manual-enrollment-free personalized denoise
TWI690814B (en) Text message processing device and method、computer storage medium and mobile terminal
CN108833656B (en) Call content early warning reminding method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201023)