WO2022007497A1 - Voice processing method and apparatus, system and storage medium - Google Patents

Voice processing method and apparatus, system and storage medium

Info

Publication number
WO2022007497A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
segment
voiceprint feature
voiceprint
character information
Prior art date
Application number
PCT/CN2021/093325
Other languages
French (fr)
Chinese (zh)
Inventor
李�瑞
贾巨涛
张伟伟
戴林
胡广绪
Original Assignee
珠海格力电器股份有限公司
珠海联云科技有限公司
Priority date
Filing date
Publication date
Application filed by 珠海格力电器股份有限公司 and 珠海联云科技有限公司
Publication of WO2022007497A1 publication Critical patent/WO2022007497A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22: Interactive procedures; Man-machine interfaces

Definitions

  • the embodiments of the present disclosure relate to the field of information technology, and in particular, to a voice processing method, apparatus, system, and storage medium.
  • However, related voice message devices cannot identify the current user's identity from his or her voice and cannot accurately classify the current message into the voice database of the corresponding user identity, so that when another family member obtains the message content, he or she may need to listen to all stored messages, which wastes time and degrades the customer experience.
  • the embodiments of the present disclosure provide a voice processing method, device, system, and storage medium.
  • an embodiment of the present disclosure provides a speech processing method, including:
  • Character information corresponding to the voiceprint feature is matched from the voiceprint database.
  • the method further includes:
  • Human voice detection is performed on the first voice segment after noise removal, and the part with human voice is used as the second voice segment.
  • the method further includes:
  • the second voice segment is input into a DNN model, and a first voiceprint feature vector corresponding to the second voice segment is obtained;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds a set threshold is used as the target voiceprint feature vector;
  • the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • the method further includes:
  • the third voice segment is saved in the voice database corresponding to the character information.
  • the method further includes:
  • an embodiment of the present disclosure provides a voice processing apparatus, including:
  • an acquisition module configured to acquire the first voice segment;
  • a processing module configured to extract the human voice part from the first voice segment as a second voice segment;
  • the processing module is further configured to determine the voiceprint feature corresponding to the second voice segment;
  • the determining module is configured to match character information corresponding to the voiceprint feature from the voiceprint database.
  • an embodiment of the present disclosure provides a speech processing system, including:
  • a microphone configured to acquire the first voice segment;
  • a processor configured to extract the human voice part from the first voice segment as a second voice segment; determine the voiceprint feature corresponding to the second voice segment; and match, from the voiceprint database, the character information corresponding to the voiceprint feature.
  • the processor is specifically configured to perform denoising processing on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • the processor is further configured to input the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector;
  • the person information corresponding to the target voiceprint feature vector is used as the person information of the first voice segment.
  • the system further includes:
  • the microphone is further configured to acquire a third voice segment;
  • the processor is further configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the character information corresponding to the third voice segment; and save the third voice segment in the voice database corresponding to the character information.
  • the system further includes:
  • the processor is further configured to receive a triggering operation on target character information among a plurality of character information, and, based on the target character information, retrieve a fourth voice segment corresponding to the target character information from the voice database;
  • a loudspeaker configured to play the fourth voice segment.
  • an embodiment of the present disclosure provides a storage medium, wherein the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the voice processing method of any one of the above aspects.
  • In the voice processing solution, a first voice segment is acquired; a human voice part is extracted from the first voice segment as a second voice segment; a voiceprint feature corresponding to the second voice segment is determined; and the character information corresponding to the voiceprint feature is matched from the voiceprint database. With this method, the identity of the user can be identified from a voice message, so that messages can be accurately classified and stored in the voice database corresponding to that user. When other users retrieve messages, they can extract the target message according to a specified identity, which avoids wasting time and improves the customer experience.
  • FIG. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of another voice processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of a speech processing system according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 1 , the method specifically includes:
  • The processor of the voice processing system receives, through the microphone, the first voice segment entered by the user. The processor performs voice activity detection on the first voice segment, extracts the part containing a human voice, and uses that part as the second voice segment.
  • the second speech segment is input into the pre-trained voiceprint recognition model, and the voiceprint feature corresponding to the second speech segment is extracted by using the voiceprint recognition model.
  • the sample voices of all members of the family are pre-entered in the voiceprint database, and all sample voices have been marked with their corresponding voiceprint feature labels.
  • The voiceprint feature labels can be entered by the user in the form of voice or typed text.
  • The character information corresponding to the matching voiceprint feature stored in the voiceprint database is used as the character information corresponding to the second voice segment, so as to identify the current user's identity.
  • In this voice processing method, a first voice segment is acquired; a human voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and the character information corresponding to the voiceprint feature is matched from the voiceprint database, so that the user's identity can be recognized from the voiceprint feature of the voice.
  • FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 2 , the method specifically includes:
  • The user enters the first voice segment through the microphone of the voice message device. The processor receives the first voice segment and first performs denoising on it: long silent periods are identified and eliminated, and the noise in the first voice segment is removed. When the user records a voice, there may be loud background sounds in the surrounding environment, so these noises need to be removed to obtain the denoised first voice segment.
  • The denoised first voice segment is input into a human voice detection model, which identifies the parts where a person's voice is present; those parts of the first voice segment are extracted as the second voice segment.
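As an illustration of the human voice detection step, the sketch below keeps only frames whose energy exceeds a threshold. The real system uses a trained human voice detection model; this energy-based stand-in, the frame length, and the threshold value are assumptions made purely for illustration.

```python
# Hypothetical energy-based stand-in for the human voice detection model:
# frames whose mean squared energy exceeds a threshold are kept as the
# "second voice segment"; silent frames are discarded.

def extract_voiced(samples, frame_len=160, threshold=0.01):
    """Return the concatenation of frames judged to contain voice."""
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame and sum(x * x for x in frame) / len(frame) > threshold:
            voiced.extend(frame)  # keep this frame: energy above threshold
    return voiced

# 320 samples of silence followed by 320 samples of a loud signal.
signal = [0.0] * 320 + [0.5, -0.5] * 160
second_segment = extract_voiced(signal)
print(len(second_segment))  # only the loud half survives
```

A production system would work on real audio frames and a learned model, but the keep-or-drop structure per frame is the same.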
  • The DNN voiceprint recognition model first divides the second voice segment into frames and extracts the features of each frame of the voice segment; after computation, the first voiceprint feature vector corresponding to the second voice segment is obtained.
  • the average value of the voiceprint feature vectors of the multiple voices is calculated as the first voiceprint feature vector of the voice entered by the user.
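The averaging step above can be sketched as follows. The three-dimensional example embeddings are made up for illustration; a real DNN would produce much higher-dimensional vectors.

```python
# Element-wise mean of several per-utterance embedding vectors, used as the
# user's single voiceprint feature vector (values below are illustrative,
# not real DNN outputs).

def average_embedding(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# Hypothetical embeddings extracted from three utterances by the same user.
enrollments = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.2],
    [1.0, 0.0, 0.1],
]
voiceprint = average_embedding(enrollments)  # close to [0.9, 0.1, 0.1]
```

Averaging several enrollment utterances smooths out per-recording variation before the vector is stored in the voiceprint database.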
  • The similarity between the first voiceprint feature vector corresponding to the first voice segment entered by the current user and each voiceprint feature vector pre-stored in the voiceprint database is computed. If the similarity with a pre-stored voiceprint feature vector exceeds the set threshold (for example, 0.7), that pre-stored voiceprint feature vector is determined to be the target voiceprint feature vector, and the character information corresponding to the target voiceprint feature vector is used as the character information corresponding to the first voice segment, thereby identifying the current user's identity.
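The matching step can be sketched with cosine similarity and the 0.7 threshold mentioned above. The patent does not specify the similarity measure, so cosine similarity is an assumption, and the database contents and names below are illustrative.

```python
# Minimal sketch of voiceprint matching: compare the query vector against
# each enrolled vector; accept the best match only above the 0.7 threshold.
# Cosine similarity and the example vectors are illustrative assumptions.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_person(query, voiceprint_db, threshold=0.7):
    """Return the best-matching person's info, or None if below threshold."""
    best_name, best_sim = None, threshold
    for name, vector in voiceprint_db.items():
        sim = cosine_similarity(query, vector)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

voiceprint_db = {
    "father": [0.9, 0.1, 0.0],
    "mother": [0.1, 0.9, 0.2],
}
print(match_person([0.88, 0.12, 0.05], voiceprint_db))  # prints: father
print(match_person([0.0, 0.0, 1.0], voiceprint_db))     # prints: None
```

Returning None when no enrolled vector clears the threshold corresponds to the case where the speaker is not registered in the voiceprint database.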
  • In this method, a first voice segment is acquired and processed to obtain its corresponding voiceprint feature vector; by comparing the similarity between this vector and the voiceprint feature vectors stored in the voiceprint database, the character information corresponding to the first voice segment is determined. This makes it possible to identify the user from the voiceprint characteristics of the voice and to apply voiceprint recognition technology to voice messages.
  • It can effectively confirm and manage the identity of the person entering a voice message, and uses the voiceprint to distinguish messages.
  • The message content can be extracted according to a specified identity, and the target message can be accurately retrieved, improving the user experience.
  • FIG. 3 is a schematic flowchart of another speech processing method provided by an embodiment of the present disclosure. As shown in FIG. 3 , the method specifically includes:
  • The user first sends a message instruction. After the system receives the instruction, it presents multiple identity options to the user, who can choose according to the actual situation. After selecting the message recipient, the user begins to record the message, which is the third voice segment.
  • the system provides options for who to leave a message to.
  • the options can include: father, mother, spouse, or son.
  • the system enters the message recording mode, and the user records the message content through the microphone.
  • The message system works together with intelligent terminal devices: the background server pushes messages to the intelligent terminal device (for example, a mobile terminal or a PC), prompting the user that another member has left a message for him or her.
  • Denoising and human voice detection are performed on the third voice segment; the processed third voice segment is then input into the pre-trained DNN voiceprint recognition model, and voiceprint feature extraction is performed to obtain the voiceprint feature vector corresponding to the third voice segment.
  • The similarity between the voiceprint feature vector corresponding to the third voice segment entered by the current user and each voiceprint feature vector pre-stored in the voiceprint database is computed. If the similarity exceeds the set threshold (for example, 0.8), the character information corresponding to the pre-stored voiceprint feature vector is used as the character information corresponding to the third voice segment, and the third voice segment is stored in the voice database corresponding to that character information.
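The classify-and-store step can be sketched as below, using the 0.8 threshold from the text. The data layout (a mapping from person to a list of message recordings) and the similarity measure are assumptions for illustration.

```python
# Sketch of filing a message under the person whose enrolled voiceprint it
# matches (threshold 0.8 as in the text). The dict layout and cosine
# similarity are illustrative assumptions, not the patent's specification.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def store_message(msg_vector, msg_audio, voiceprint_db, voice_db, threshold=0.8):
    """Store msg_audio under the matching person; return True on success."""
    for person, enrolled in voiceprint_db.items():
        if cosine_similarity(msg_vector, enrolled) > threshold:
            voice_db.setdefault(person, []).append(msg_audio)
            return True
    return False  # no enrolled voiceprint was similar enough

voiceprint_db = {"mother": [0.1, 0.9, 0.2]}
voice_db = {}
stored = store_message([0.12, 0.88, 0.18], "msg-001.wav", voiceprint_db, voice_db)
print(voice_db)  # the message is filed under "mother"
```

A real system would also handle the unmatched case, for example by prompting the speaker to enroll a voiceprint first.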
  • S35 Receive a triggering operation for the target person information in the plurality of person information.
  • When the user wants to listen to messages, the system determines the user's identity information by analyzing the current user's voice (i.e., the user's voiceprint feature vector) and, according to the family member relationship graph, displays all family member information related to the user. The user can then choose, according to the actual situation, which family member's message to listen to, and the system receives the trigger command for the target person selected by the user.
  • For example, a son's identity labels can include son and grandson; by the same token, a father's identity labels can include father, husband, and son.
  • The fourth voice segment corresponding to the target person, i.e., the target person's message to the current user, is retrieved from the voice database and played through the speaker for the current user to listen to.
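The listen-back flow described above can be sketched as follows. The per-person message store is a hypothetical layout, and playback is represented by simply returning the audio entries that would be handed to the speaker.

```python
# Sketch of message retrieval: list who has left messages, then fetch the
# "fourth voice segment" for the target person selected by the trigger.
# The dict-based voice database is an illustrative assumption.

def list_senders(voice_db):
    """People who have at least one stored message for the listener."""
    return sorted(person for person, msgs in voice_db.items() if msgs)

def retrieve_messages(voice_db, target_person):
    """Fetch the target person's messages for playback (empty if none)."""
    return voice_db.get(target_person, [])

voice_db = {"father": ["msg-001.wav", "msg-002.wav"], "mother": []}
print(list_senders(voice_db))   # only "father" has stored messages
fourth_segment = retrieve_messages(voice_db, "father")
print(fourth_segment)           # entries handed to the speaker to play
```

Filtering out members with no stored messages mirrors the described behavior of only offering the user senders who actually left something.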
  • In this voice processing method, the current user's voice segment is received; the corresponding voiceprint feature is extracted from the voice segment; the character information corresponding to the voiceprint feature is matched from the voiceprint database; and the segment is stored in the voice database corresponding to that character information. The voice message the user wants to listen to can likewise be determined and retrieved from the voice database according to the character information.
  • This method enables confirmation and management of the identity of the person entering a voice message, and uses the voiceprint to distinguish messages. When users obtain the message content of other members, they can extract it according to a specified identity, accurately retrieve the target message, and improve the user experience.
  • FIG. 4 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present disclosure, which specifically includes:
  • the obtaining module 401 is configured to acquire the first voice segment;
  • the processing module 402 is configured to extract the human voice part from the first speech segment as the second speech segment;
  • the processing module 402 is further configured to determine the voiceprint feature corresponding to the second voice segment
  • the determining module 403 is configured to match person information corresponding to the voiceprint feature from the voiceprint database.
  • the acquiring module is specifically configured to acquire a third voice segment; receive a triggering operation on target character information among the plurality of character information; and, based on the target character information, retrieve from the voice database the fourth voice segment corresponding to the target character information.
  • the processing module is specifically configured to perform denoising processing on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • the processing module is further configured to input the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the voiceprint feature vector is matched with the voiceprint feature vector stored in the voiceprint database, and the voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds a set threshold is used as the target voiceprint feature vector.
  • the processing module is further configured to save the third voice clip in a voice database corresponding to the character information, and play the fourth voice clip.
  • the determining module is specifically configured to determine the voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, the character information corresponding to the third voice segment; and use the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  • The voice processing apparatus of the server provided in this embodiment may be the voice processing apparatus shown in FIG. 4, and may execute all the steps of the voice processing methods shown in FIG. 1 to FIG. 3, thereby achieving the technical effects of those methods; detailed descriptions are not repeated here.
  • FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure.
  • The various components in the voice processing system 500 shown in FIG. 5 are coupled together by a bus system 505.
  • the bus system 505 is configured to enable connection communication between these components.
  • the bus system 505 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 505 in FIG. 5 .
  • the memory 502 in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • Volatile memory may be Random Access Memory (RAM), which acts as an external cache.
  • Static RAM (SRAM)
  • Dynamic RAM (DRAM)
  • Synchronous DRAM (SDRAM)
  • Double Data Rate SDRAM (DDR SDRAM)
  • Enhanced SDRAM (ESDRAM)
  • Synchlink DRAM (SLDRAM)
  • Direct Rambus RAM (DR RAM)
  • memory 502 stores the following elements, executable units or data structures, or a subset thereof, or an extended set of them: an operating system 5021 and applications 5022.
  • the operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, etc., and is configured to implement various basic services and process hardware-based tasks.
  • the application program 5022 includes various application programs, such as a media player (Media Player), a browser (Browser), etc., and is set to implement various application services.
  • a program implementing the method of the embodiment of the present disclosure may be included in the application program 5022 .
  • The memory 502 stores programs or instructions configured to execute the methods of FIG. 1, FIG. 2, or FIG. 3, and the controller/processor 501 executes the specific steps of those methods;
  • For example, the processor 501 acquires the first voice segment through the microphone 503; the processor 501 extracts the human voice part from the first voice segment as a second voice segment; determines the voiceprint feature corresponding to the second voice segment; and matches, from the voiceprint database, the character information corresponding to the voiceprint feature.
  • The processor 501 performs denoising processing on the first voice segment to obtain a denoised first voice segment, performs human voice detection on the denoised first voice segment, and takes the part containing a human voice as the second voice segment.
  • the processor 501 inputs the second speech segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second speech segment;
  • the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a stored voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector;
  • the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • The microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the character information corresponding to the third voice segment, and saves the third voice segment in the voice database corresponding to the character information.
  • the processor 501 receives a triggering operation on target person information among the plurality of person information; and retrieves a fourth voice corresponding to the target person information from a voice database based on the target person information segment; the speaker 506 plays the fourth speech segment.
  • the methods disclosed in the above embodiments of the present disclosure may be applied to the processor 501 or implemented by the processor 501 .
  • the processor 501 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 501 or an instruction in the form of software.
  • The above-mentioned processor 501 can be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of the present disclosure can be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502, and completes the steps of the above method in combination with its hardware.
  • the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof.
  • The processing unit can be implemented in one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described in this disclosure, or a combination thereof.
  • the techniques described herein may be implemented by means of units that perform the functions described herein.
  • Software codes may be stored in memory and executed by a processor.
  • the memory can be implemented in the processor or external to the processor.
  • the voice processing system provided in this embodiment may be the voice processing system shown in FIG. 5 , which can perform all the steps of the voice processing method shown in FIG. 1-3 , thereby realizing the technical effect of the voice processing method shown in FIG. 1-3 .
  • Embodiments of the present disclosure also provide a storage medium (computer-readable storage medium).
  • the storage medium here stores one or more programs.
  • The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above types of memory.
  • One or more programs in the storage medium can be executed by one or more processors, so as to implement the above-mentioned speech processing method executed in the speech processing system.
  • the processor is configured to execute the voice processing program stored in the memory to realize the following steps of the voice processing method executed in the voice processing system:
  • Denoising processing is performed on the first voice segment to obtain a denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part containing a human voice is used as the second voice segment.
  • The second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a voiceprint feature vector whose similarity with the first voiceprint feature vector exceeds the set threshold is used as the target voiceprint feature vector; the character information corresponding to the target voiceprint feature vector is used as the character information of the first voice segment.
  • A third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; based on the voiceprint feature, the character information corresponding to the third voice segment is determined; the third voice segment is saved into the voice database corresponding to the character information.
  • A triggering operation on target character information among the plurality of character information is received; based on the target character information, a fourth voice segment corresponding to the target character information is retrieved from the voice database; the fourth voice segment is played.
  • A software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice processing method and apparatus, a system and a storage medium, the method comprising: acquiring a first voice segment (S11); extracting a human voice part from the first voice segment as a second voice segment (S12); determining voiceprint features corresponding to the second voice segment (S13); and matching, from a voiceprint database, character information corresponding to the voiceprint features (S14). In the method, the identity of the user can be identified according to a voice message, so that the message can be accurately classified and stored in the voice database corresponding to the user. When other users acquire messages, a target message may be extracted according to a designated identity, which prevents wasted time and improves the customer experience.

Description

Voice processing method, apparatus, system and storage medium
This disclosure claims priority to Chinese patent application No. 202010666203.X, entitled "Voice processing method, apparatus, system and storage medium", filed with the China Patent Office on July 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of information technology, and in particular to a voice processing method, apparatus, system and storage medium.
Background
With the development of information technology, smart devices are increasingly used in people's home life, and their functions are becoming ever more comprehensive in order to make life more convenient. When a family member goes out, many people leave messages for other members in the form of notes, phone calls or voice-message devices, to inform them of certain matters that need attention.
However, related voice-message devices cannot identify the current user's identity from the user's voice, and therefore cannot accurately classify the current message into the voice database of the corresponding user. As a result, when another family member retrieves the messages, that member may have to listen to all of them, which wastes time and degrades the user experience.
Summary
In view of this, to solve the above technical problem that a user's identity cannot be identified from a voice message, embodiments of the present disclosure provide a voice processing method, apparatus, system and storage medium.
In a first aspect, an embodiment of the present disclosure provides a voice processing method, including:
acquiring a first voice segment;
extracting a human-voice part from the first voice segment as a second voice segment;
determining a voiceprint feature corresponding to the second voice segment; and
matching, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the method further includes:
performing denoising processing on the first voice segment to obtain the denoised first voice segment; and
performing human-voice detection on the denoised first voice segment, and taking the part containing a human voice as the second voice segment.
In a possible implementation, the method further includes:
inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
matching the first voiceprint feature vector against voiceprint feature vectors stored in the voiceprint database, and taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and
taking the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
In a possible implementation, the method further includes:
acquiring a third voice segment;
determining a voiceprint feature corresponding to the third voice segment;
determining, based on the voiceprint feature, character information corresponding to the third voice segment; and
saving the third voice segment into a voice database corresponding to the character information.
In a possible implementation, the method further includes:
receiving a triggering operation on target character information among a plurality of pieces of character information;
retrieving, from a voice database based on the target character information, a fourth voice segment corresponding to the target character information; and
playing the fourth voice segment.
In a second aspect, an embodiment of the present disclosure provides a voice processing apparatus, including:
an acquisition module, configured to acquire a first voice segment;
a processing module, configured to extract a human-voice part from the first voice segment as a second voice segment;
the processing module being further configured to determine a voiceprint feature corresponding to the second voice segment; and
a determining module, configured to match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a third aspect, an embodiment of the present disclosure provides a voice processing system, including:
a microphone, configured to acquire a first voice segment; and
a processor, configured to extract a human-voice part from the first voice segment as a second voice segment; determine a voiceprint feature corresponding to the second voice segment; and match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the processor is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, perform human-voice detection on the denoised first voice segment, and take the part containing a human voice as the second voice segment.
In a possible implementation, the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against voiceprint feature vectors stored in the voiceprint database, taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
In a possible implementation, the system further includes:
the microphone being further configured to acquire a third voice segment; and
the processor being further configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, character information corresponding to the third voice segment; and save the third voice segment into a voice database corresponding to the character information.
In a possible implementation, the system further includes:
the processor being further configured to receive a triggering operation on target character information among a plurality of pieces of character information, and retrieve, from a voice database based on the target character information, a fourth voice segment corresponding to the target character information; and
a speaker, configured to play the fourth voice segment.
In a fourth aspect, an embodiment of the present disclosure provides a storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the voice processing method of any one of the first aspect.
In the voice processing solution provided by the embodiments of the present disclosure, a first voice segment is acquired; a human-voice part is extracted from the first voice segment as a second voice segment; a voiceprint feature corresponding to the second voice segment is determined; and character information corresponding to the voiceprint feature is matched from a voiceprint database. In this way, the user's identity can be identified from a voice message, so that the message can be accurately classified and stored in the voice database corresponding to that user. When other users retrieve messages, they can extract the target message by a specified identity, which avoids wasted time and improves the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
To facilitate understanding of the embodiments of the present disclosure, some implementations are explained below with specific examples in conjunction with the accompanying drawings; these examples do not limit the embodiments of the present disclosure.
FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the method specifically includes:
S11: Acquire a first voice segment.
S12: Extract a human-voice part from the first voice segment as a second voice segment.
In a voice-message device, the processor of the voice processing system receives the first voice segment recorded by the user through the microphone, performs voice activity detection on the first voice segment, extracts the part containing a human voice, and takes that part as the second voice segment.
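The disclosure does not specify the voice-activity detector, but the step can be sketched with a simple frame-energy heuristic (frame size and threshold below are illustrative placeholders, not values from the disclosure):

```python
def extract_voiced(samples, frame_len=160, energy_threshold=0.01):
    """Keep only frames whose mean energy exceeds a threshold (toy VAD)."""
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_threshold:
            voiced.extend(frame)
    return voiced

# Leading/trailing silence (zeros) is dropped; the louder middle part is kept.
first_segment = [0.0] * 320 + [0.5, -0.5] * 80 + [0.0] * 320
second_segment = extract_voiced(first_segment)
```

A production system would use a trained VAD model rather than a fixed energy gate, but the input/output contract is the same: a raw segment in, only the voiced frames out.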
S13: Determine a voiceprint feature corresponding to the second voice segment.
The second voice segment is input into a pre-trained voiceprint recognition model, and the model is used to extract the voiceprint feature corresponding to the second voice segment.
S14: Match character information corresponding to the voiceprint feature from a voiceprint database.
Sample voices of all family members are pre-recorded in the voiceprint database, and every sample voice is marked with its corresponding voiceprint feature label. The labels can be entered by the user by voice or as typed text.
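Enrollment of labeled sample voiceprints might be organized as below; the database structure, label names and feature values are assumptions for illustration, not taken from the disclosure:

```python
voiceprint_db = {}  # identity label -> list of enrolled voiceprint feature vectors

def enroll(person_label, feature_vector):
    """Register one sample voiceprint under a family member's identity label."""
    voiceprint_db.setdefault(person_label, []).append(feature_vector)

# Hypothetical pre-recorded samples for two family members.
enroll("dad", [0.9, 0.1, 0.0])
enroll("mom", [0.1, 0.9, 0.2])
enroll("mom", [0.2, 0.8, 0.1])
```

Keeping a list per label allows several samples per member, which later matching or averaging steps can use.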
Further, by comparing the voiceprint feature extracted from the second voice segment with the voiceprint features stored in the voiceprint database, the character information corresponding to the matching stored voiceprint feature is taken as the character information of the second voice segment, thereby identifying the current user.
In the voice processing method provided by this embodiment of the present disclosure, a first voice segment is acquired; a human-voice part is extracted from the first voice segment as a second voice segment; the voiceprint feature corresponding to the second voice segment is determined; and character information corresponding to the voiceprint feature is matched from a voiceprint database, so that the user's identity can be identified from the voiceprint feature of the voice.
FIG. 2 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure. As shown in FIG. 2, the method specifically includes:
S21: Acquire a first voice segment.
S22: Perform denoising processing on the first voice segment to obtain the denoised first voice segment.
In this embodiment, the user records the first voice segment through the microphone of the voice-message device, and the voice processor receives it. A voice activity detection model first denoises the first voice segment, identifying and eliminating long silent periods and removing the noise in it, because loud background sounds in the surrounding environment may be captured while the user is recording. The result is the denoised first voice segment.
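A minimal sketch of the denoising idea is a noise gate that estimates the noise floor from the quietest frame and zeroes frames near that floor; real systems use spectral subtraction or learned denoisers, and all numbers here are illustrative:

```python
def noise_gate(samples, frame_len=160, margin=4.0):
    """Zero out frames whose energy is below margin x the estimated noise floor."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    noise_floor = min(energies) if energies else 0.0
    out = []
    for frame, energy in zip(frames, energies):
        out.extend(frame if energy > margin * noise_floor else [0.0] * frame_len)
    return out

# Quiet background frames are silenced; the loud speech frame passes through.
cleaned = noise_gate([0.01, -0.01] * 80 + [0.5, -0.5] * 80 + [0.01, -0.01] * 80)
```

Note the gate only suppresses whole low-energy frames; it does not remove noise mixed into speech frames, which is why the disclosure pairs denoising with a separate human-voice detection step.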
S23: Perform human-voice detection on the denoised first voice segment, and take the part containing a human voice as the second voice segment.
The denoised first voice segment is input into a human-voice detection model, which identifies the parts where a person is speaking; those parts are extracted from the first voice segment as the second voice segment.
S24: Input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment.
The extracted second voice segment is input into a pre-trained DNN voiceprint recognition model. The model first splits the second voice segment into frames, extracts the features of each frame, and computes the first voiceprint feature vector corresponding to the second voice segment.
In some implementations, if the current user has recorded multiple voice clips, the average of the voiceprint feature vectors of those clips is used as the first voiceprint feature vector of the user's recording.
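The averaging described above is an element-wise mean over the clips' feature vectors; a small sketch (the vector values are made up):

```python
def average_embedding(vectors):
    """Element-wise mean of several voiceprint feature vectors of equal length."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Two hypothetical per-utterance embeddings collapse into one enrollment vector.
utterances = [[0.5, 0.25, 0.0], [0.5, 0.75, 0.5]]
enrollment_vector = average_embedding(utterances)
```

Averaging several utterances tends to smooth out per-recording variation (channel noise, phrasing), giving a steadier representation of the speaker.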
S25: Match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and take a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as the target voiceprint feature vector.
S26: Take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
The similarity between the first voiceprint feature vector of the first voice segment recorded by the current user and each voiceprint feature vector pre-stored in the voiceprint database is compared. If the similarity between the first voiceprint feature vector and a pre-stored voiceprint feature vector exceeds a set threshold (for example, 0.7), that pre-stored vector is determined to be the target voiceprint feature vector, and the character information corresponding to the target voiceprint feature vector is taken as the character information of the first voice segment, thereby identifying the current user.
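The disclosure does not name the similarity measure; assuming cosine similarity (a common choice for speaker embeddings) with the example threshold of 0.7, the matching step could look like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_person(query, db, threshold=0.7):
    """Return the best-matching person above the threshold, else None."""
    best_person, best_score = None, threshold
    for person, vector in db.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_person, best_score = person, score
    return best_person

# Hypothetical one-vector-per-person database.
db = {"dad": [1.0, 0.0, 0.0], "mom": [0.0, 1.0, 0.0]}
who = match_person([0.9, 0.1, 0.05], db)
```

Returning `None` below the threshold models the "unknown speaker" case, which a real device would have to handle (e.g. by prompting for enrollment).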
In the voice processing method provided by this embodiment of the present disclosure, a first voice segment is acquired and processed to obtain its voiceprint feature vector, which is compared for similarity against the voiceprint feature vectors stored in the voiceprint database to determine the character information of the first voice segment. The user's identity can thus be identified from the voiceprint feature of the voice. Applying voiceprint recognition to voice messages makes it possible to effectively confirm and manage the identity of the person recording a message and to distinguish messages by voiceprint, so that when a user retrieves other members' messages, the content can be extracted by a specified identity, accurately locating the target message and improving the user experience.
FIG. 3 is a schematic flowchart of another voice processing method according to an embodiment of the present disclosure. As shown in FIG. 3, the method specifically includes:
S31: Acquire a third voice segment.
In this embodiment, the user first issues a message-leaving instruction. After receiving the instruction, the system presents multiple identity options, from which the user can choose according to the actual situation. After the recipient is selected, the user starts to leave the message, and the system acquires the third voice segment recorded by the user.
For example, the user taps "I want to leave a message", and the system offers options for the recipient, which may include: dad, mom, spouse or son. After the user selects the recipient, the system enters message-recording mode, and the user records the message content through the microphone. Further, this messaging system can work together with smart terminal devices, pushing a notification from the backend server to a terminal (for example, a mobile phone or a PC) to remind the user that another member has left a message.
S32: Determine a voiceprint feature corresponding to the third voice segment.
The third voice segment is first denoised and subjected to human-voice detection, and the processed third voice segment is then input into the pre-trained DNN voiceprint recognition model for voiceprint feature extraction, yielding the voiceprint feature vector corresponding to the third voice segment.
S33: Determine character information corresponding to the third voice segment based on the voiceprint feature.
S34: Save the third voice segment into the voice database corresponding to the character information.
The similarity between the voiceprint feature vector of the third voice segment recorded by the current user and the voiceprint feature vectors pre-stored in the voiceprint database is compared. If the similarity exceeds a set threshold (for example, 0.8), the character information corresponding to the matching pre-stored voiceprint feature vector is taken as the character information of the third voice segment, and the third voice segment is saved into the voice database corresponding to that character information.
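Filing each classified message under the identified speaker, and retrieving it later by speaker and recipient, could be sketched as follows; the storage structure and names are assumptions, not taken from the disclosure:

```python
message_db = {}  # identified speaker -> list of (recipient, audio clip) entries

def save_message(speaker, recipient, audio_clip):
    """File a message in the voice database of the identified speaker."""
    message_db.setdefault(speaker, []).append((recipient, audio_clip))

def messages_for(recipient, speaker):
    """All clips a given speaker left for a given recipient."""
    return [clip for to, clip in message_db.get(speaker, []) if to == recipient]

# Hypothetical messages; in the device these would be audio segments.
save_message("mom", "son", "buy milk on the way home")
save_message("dad", "son", "call grandma tonight")
```

Keying storage by the identified speaker is what lets a listener later pull out only the messages from a designated person instead of playing everything.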
S35: Receive a triggering operation on target character information among a plurality of pieces of character information.
When the user wants to listen to messages, the system first obtains the user's identity from analysis of the current user's voice, and then, according to the family relationship graph, displays all family members related to that user. The user can select which family member's message to listen to according to the actual situation, and the system receives the triggering instruction for the target person selected by the user.
For example, when the user simply says the voice command "play my messages" without specifying whose messages to play, the system determines the user's identity from the user's voiceprint feature vector, displays all related family members according to the family relationship graph, and lets the user choose which family member's message to hear.
As another example, suppose the family members are a son, dad, mom, grandpa and grandma. The son's identity label is "son" relative to dad and mom, but "grandson" relative to grandpa and grandma, so his identity labels can include both "son" and "grandson"; likewise, dad's identity labels can include "dad", "husband" and "son". When a user says "play Dad's messages", the system first selects, based on the voice content, the voice databases carrying the "dad" identity label, namely those of dad and grandpa, and then identifies from the user's voiceprint that the user is the family's dad, thereby determining that the user wants to hear grandpa's messages.
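The disclosure describes this label resolution only at the level of the family relationship graph; one possible representation is a mapping from (speaker, listener) pairs to the label the listener would use for the speaker, so that a spoken label plus the listener's identified identity picks out the intended member:

```python
# Assumed representation of part of the example family's relationship graph:
# (speaker, listener) -> the label the listener uses for that speaker.
relationship = {
    ("dad", "son"): "dad",
    ("grandpa", "dad"): "dad",
    ("dad", "grandpa"): "son",
    ("grandpa", "son"): "grandpa",
}

def resolve_speaker(label, listener):
    """Which member(s) the listener means when asking for e.g. 'dad'."""
    return [speaker for (speaker, hearer), lab in relationship.items()
            if hearer == listener and lab == label]
```

With this structure, "dad" resolves to the family's dad when the son is listening, but to grandpa when the dad himself is listening, matching the example above.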
S36: Retrieve, from the voice database based on the target character information, a fourth voice segment corresponding to the target character information.
S37: Play the fourth voice segment.
According to the target person whose messages the current user has chosen to hear, the fourth voice segment corresponding to that person, i.e., the message the target person left for the current user, is retrieved from the voice database and played to the current user through the speaker.
In the voice processing method provided by this embodiment of the present disclosure, the current user's voice segment is received; the corresponding voiceprint feature is extracted from the voice segment; character information corresponding to the voiceprint feature is matched from the voiceprint database; and the message is stored in the voice database corresponding to that character information. The voice message the user wants to hear can likewise be determined based on character information and retrieved from the voice database. This method makes it possible to confirm and manage the identity of the person recording a voice message, to distinguish messages by voiceprint, and, when a user retrieves other members' messages, to extract the content by a specified identity, accurately locating the target message and improving the user experience.
FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure, specifically including:
an acquisition module 401, configured to acquire a first voice segment;
a processing module 402, configured to extract a human-voice part from the first voice segment as a second voice segment;
the processing module 402 being further configured to determine a voiceprint feature corresponding to the second voice segment; and
a determining module 403, configured to match, from a voiceprint database, character information corresponding to the voiceprint feature.
In a possible implementation, the acquisition module is specifically configured to acquire a third voice segment; receive a triggering operation on target character information among a plurality of pieces of character information; and retrieve, from the voice database based on the target character information, a fourth voice segment corresponding to the target character information.
In a possible implementation, the processing module is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, and to perform human-voice detection on the denoised first voice segment, taking the part containing a human voice as the second voice segment.
In a possible implementation, the processing module is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment, and to match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, taking a stored voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as the target voiceprint feature vector.
In a possible implementation, the processing module is further configured to save the third voice segment into the voice database corresponding to the character information, and to play the fourth voice segment.
In a possible implementation, the determining module is specifically configured to determine a voiceprint feature corresponding to the third voice segment; determine, based on the voiceprint feature, character information corresponding to the third voice segment; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
The voice processing apparatus provided in this embodiment may be the apparatus shown in FIG. 4, and can perform all steps of the voice processing methods of FIGS. 1-3, thereby achieving the technical effects of those methods. For details, refer to the descriptions of FIGS. 1-3; for brevity, they are not repeated here.
图5为本公开实施例提供的一种语音处理系统的结构示意图,图5所示的语音处理系统500包括:至少一个处理器501、存储器502、麦克风503、至少一个网络接口504、扬声器506。语音处理系统500中的各个组件通过总线系统505耦合在一起。可理解,总线系统505被设置为实现这些组件之间的连接通信。总线系统505除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图5中将各种总线都标为总线系统505。FIG. 5 is a schematic structural diagram of a voice processing system according to an embodiment of the present disclosure. The voice processing system 500 shown in FIG. The various components in speech processing system 500 are coupled together by bus system 505 . It will be appreciated that the bus system 505 is configured to enable connection communication between these components. In addition to the data bus, the bus system 505 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 505 in FIG. 5 .
可以理解,本公开实施例中的存储器502可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。本文描述的存储器502旨在包括但不限于这些和任意其它适合类型的存储器。It will be appreciated that the memory 502 in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. Wherein, the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically programmable read-only memory (Erasable PROM, EPROM). Erase programmable read-only memory (Electrically EPROM, EEPROM) or flash memory. Volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synch link DRAM, SLDRAM) ) and direct memory bus random access memory (Direct Rambus RAM, DRRAM). The memory 502 described herein is intended to include, but not be limited to, these and any other suitable types of memory.
In some embodiments, the memory 502 stores the following elements, executable units, or data structures, or a subset or extended set thereof: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is configured to implement various basic services and to handle hardware-based tasks. The application programs 5022 include various applications, such as a media player and a browser, and are configured to implement various application services. A program implementing the method of an embodiment of the present disclosure may be included in the application programs 5022.
In the embodiments of the present disclosure, the memory 502 stores programs or instructions configured to perform the method of FIG. 1, FIG. 2, or FIG. 3, and the controller/processor 501 executes the specific steps of FIG. 1, FIG. 2, or FIG. 3;
for example, a first voice segment is acquired through the microphone 503; the processor 501 extracts the human voice part from the first voice segment as a second voice segment, determines the voiceprint feature corresponding to the second voice segment, and matches character information corresponding to the voiceprint feature from a voiceprint database.
In a possible implementation, the processor 501 performs denoising processing on the first voice segment to obtain the denoised first voice segment, performs human voice detection on the denoised first voice segment, and takes the part in which a human voice is present as the second voice segment.
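The voice-detection step above can be sketched with a simple energy-threshold voice activity detector. This is an illustrative stand-in only, not the detector the disclosure specifies: the frame length, the energy ratio, and the noise-floor estimate are all assumptions made for the example.

```python
import numpy as np

def extract_voice_part(signal, rate, frame_ms=20, energy_ratio=4.0):
    """Return the concatenated frames of `signal` whose short-time energy
    exceeds `energy_ratio` times an estimated noise floor (a crude VAD).

    Illustrative assumptions: 20 ms frames, and a noise floor taken as the
    10th percentile of per-frame energy.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # short-time energy per frame
    noise_floor = np.percentile(energy, 10) + 1e-12
    voiced = energy > energy_ratio * noise_floor  # frames with a "human voice"
    return frames[voiced].ravel()

# Usage: 1 s of low-level noise with a loud burst in the middle standing in
# for speech; only the burst should survive as the "second voice segment".
rate = 16000
rng = np.random.default_rng(0)
signal = 0.01 * rng.standard_normal(rate)
signal[6000:10000] += np.sin(2 * np.pi * 220 * np.arange(4000) / rate)
second_segment = extract_voice_part(signal, rate)
```

In a real system the denoising stage (e.g., spectral subtraction) would run before this detector; it is omitted here to keep the sketch short.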
In a possible implementation, the processor 501 inputs the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment, matches the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, takes a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector, and takes the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
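A minimal sketch of the matching step follows. The DNN embedding itself is outside the scope of the sketch, so the "first voiceprint feature vector" is supplied directly; similarity is taken here as cosine similarity, and the enrolled names and vectors are made up for illustration — none of these choices are stated by the disclosure.

```python
import numpy as np

def match_person(embedding, voiceprint_db, threshold=0.8):
    """Return the person whose stored voiceprint vector is most similar to
    `embedding`, provided the cosine similarity exceeds `threshold`;
    otherwise return None. `voiceprint_db` maps person info -> vector."""
    best_person, best_sim = None, threshold
    for person, vector in voiceprint_db.items():
        sim = np.dot(embedding, vector) / (
            np.linalg.norm(embedding) * np.linalg.norm(vector))
        if sim > best_sim:
            best_person, best_sim = person, sim
    return best_person

# Hypothetical database of two enrolled speakers (vectors are made up).
db = {"alice": np.array([1.0, 0.0, 0.2]),
      "bob":   np.array([0.0, 1.0, 0.1])}
query = np.array([0.9, 0.1, 0.2])   # embedding close to alice's voiceprint
matched = match_person(query, db)    # -> "alice"
```

Returning `None` when no stored vector clears the threshold corresponds to the case where the speaker is not yet enrolled in the voiceprint database.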
In a possible implementation, the microphone 503 acquires a third voice segment; the processor 501 determines the voiceprint feature corresponding to the third voice segment, determines, based on the voiceprint feature, the character information corresponding to the third voice segment, and saves the third voice segment into the voice database corresponding to the character information.
In a possible implementation, the processor 501 receives a triggering operation on target character information among a plurality of pieces of character information, and retrieves, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; the speaker 506 plays the fourth voice segment.
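The two implementations above — saving a segment under the identified person, then retrieving a segment for a selected target person — can be combined into one small store. The dictionary-backed design below is an illustrative assumption, not the storage layout of the disclosure.

```python
from collections import defaultdict

class VoiceArchive:
    """Stores voice segments keyed by the person identified from the
    segment's voiceprint, and retrieves them for playback on demand."""

    def __init__(self):
        self._by_person = defaultdict(list)

    def save(self, person_info, segment):
        # Corresponds to saving the third voice segment into the voice
        # database associated with the identified person.
        self._by_person[person_info].append(segment)

    def fetch(self, target_person):
        # Corresponds to retrieving the fourth voice segment for the
        # target person selected by the triggering operation.
        return self._by_person.get(target_person, [])

archive = VoiceArchive()
archive.save("grandma", b"\x01\x02")  # hypothetical raw audio bytes
archive.save("grandma", b"\x03\x04")
clips = archive.fetch("grandma")       # segments to be played by the speaker
```

A production system would persist the segments and index them by a stable person identifier rather than a display name, but the control flow is the same.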
The methods disclosed in the above embodiments of the present disclosure may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip with signal processing capability. In implementation, each step of the above methods may be completed by an integrated hardware logic circuit in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present disclosure may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software unit may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 502; the processor 501 reads the information in the memory 502 and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, or other electronic units configured to perform the functions described in the present disclosure, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by units that perform the functions described herein. Software code may be stored in a memory and executed by a processor. The memory may be implemented in the processor or external to the processor.
The voice processing system provided in this embodiment may be the voice processing system shown in FIG. 5, and can perform all the steps of the voice processing method shown in FIGS. 1-3, thereby achieving the technical effects of the voice processing method shown in FIGS. 1-3; for details, reference is made to the related descriptions of FIGS. 1-3, which, for brevity, are not repeated here.
An embodiment of the present disclosure further provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above types of memory.
The one or more programs in the storage medium can be executed by one or more processors, so as to implement the above voice processing method performed in the voice processing system.
The processor is configured to execute the voice processing program stored in the memory, so as to implement the following steps of the voice processing method performed in the voice processing system:
acquiring a first voice segment; extracting the human voice part from the first voice segment as a second voice segment; determining the voiceprint feature corresponding to the second voice segment; and matching character information corresponding to the voiceprint feature from a voiceprint database.
In a possible implementation, denoising processing is performed on the first voice segment to obtain the denoised first voice segment; human voice detection is performed on the denoised first voice segment, and the part in which a human voice is present is taken as the second voice segment.
In a possible implementation, the second voice segment is input into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; the first voiceprint feature vector is matched against the voiceprint feature vectors stored in the voiceprint database, and a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold is taken as a target voiceprint feature vector; the character information corresponding to the target voiceprint feature vector is taken as the character information of the first voice segment.
In a possible implementation, a third voice segment is acquired; the voiceprint feature corresponding to the third voice segment is determined; based on the voiceprint feature, the character information corresponding to the third voice segment is determined; and the third voice segment is saved into the voice database corresponding to the character information.
In a possible implementation, a triggering operation on target character information among a plurality of pieces of character information is received; a fourth voice segment corresponding to the target character information is retrieved from a voice database based on the target character information; and the fourth voice segment is played.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered to be beyond the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further describe in detail the purposes, technical solutions, and beneficial effects of the present disclosure. It should be understood that the above descriptions are only specific embodiments of the present disclosure and are not intended to limit the scope of protection of the present disclosure; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (12)

  1. A voice processing method, comprising:
    acquiring a first voice segment;
    extracting a human voice part from the first voice segment as a second voice segment;
    determining a voiceprint feature corresponding to the second voice segment; and
    matching character information corresponding to the voiceprint feature from a voiceprint database.
  2. The method according to claim 1, wherein extracting the human voice part from the first voice segment as the second voice segment comprises:
    performing denoising processing on the first voice segment to obtain the denoised first voice segment; and
    performing human voice detection on the denoised first voice segment, and taking the part in which a human voice is present as the second voice segment.
  3. The method according to claim 2, wherein determining the voiceprint feature corresponding to the second voice segment comprises:
    inputting the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment;
    and wherein matching the character information corresponding to the voiceprint feature from the voiceprint database comprises:
    matching the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and taking a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and
    taking the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  4. The method according to any one of claims 1-3, further comprising:
    acquiring a third voice segment;
    determining a voiceprint feature corresponding to the third voice segment;
    determining, based on the voiceprint feature, character information corresponding to the third voice segment; and
    saving the third voice segment into a voice database corresponding to the character information.
  5. The method according to claim 4, further comprising:
    receiving a triggering operation on target character information among a plurality of pieces of character information;
    retrieving, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; and
    playing the fourth voice segment.
  6. A voice processing apparatus, comprising:
    an acquisition module, configured to acquire a first voice segment;
    a processing module, configured to extract a human voice part from the first voice segment as a second voice segment,
    wherein the processing module is further configured to determine a voiceprint feature corresponding to the second voice segment; and
    a determination module, configured to match character information corresponding to the voiceprint feature from a voiceprint database.
  7. A voice processing system, comprising:
    a microphone, configured to acquire a first voice segment; and
    a processor, configured to extract a human voice part from the first voice segment as a second voice segment, determine a voiceprint feature corresponding to the second voice segment, and match character information corresponding to the voiceprint feature from a voiceprint database.
  8. The system according to claim 7, wherein the processor is specifically configured to perform denoising processing on the first voice segment to obtain the denoised first voice segment, perform human voice detection on the denoised first voice segment, and take the part in which a human voice is present as the second voice segment.
  9. The system according to claim 8, wherein the processor is further configured to input the second voice segment into a DNN model to obtain a first voiceprint feature vector corresponding to the second voice segment; match the first voiceprint feature vector against the voiceprint feature vectors stored in the voiceprint database, and take a voiceprint feature vector whose similarity to the first voiceprint feature vector exceeds a set threshold as a target voiceprint feature vector; and take the character information corresponding to the target voiceprint feature vector as the character information of the first voice segment.
  10. The system according to any one of claims 7-9, wherein:
    the microphone is further configured to acquire a third voice segment; and
    the processor is further configured to determine a voiceprint feature corresponding to the third voice segment, determine, based on the voiceprint feature, character information corresponding to the third voice segment, and save the third voice segment into a voice database corresponding to the character information.
  11. The system according to claim 10, wherein the processor is further configured to receive a triggering operation on target character information among a plurality of pieces of character information, and to retrieve, based on the target character information, a fourth voice segment corresponding to the target character information from a voice database; and wherein the system further comprises:
    a speaker, configured to play the fourth voice segment.
  12. A storage medium storing one or more programs, wherein the one or more programs are executable by one or more processors so as to implement the voice processing method according to any one of claims 1-5.
PCT/CN2021/093325 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium WO2022007497A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010666203.X 2020-07-08
CN202010666203.XA CN111816191A (en) 2020-07-08 2020-07-08 Voice processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
WO2022007497A1

Family

ID=72842801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093325 WO2022007497A1 (en) 2020-07-08 2021-05-12 Voice processing method and apparatus, system and storage medium

Country Status (2)

Country Link
CN (1) CN111816191A (en)
WO (1) WO2022007497A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium
CN112992154A (en) * 2021-05-08 2021-06-18 北京远鉴信息技术有限公司 Voice identity determination method and system based on enhanced voiceprint library

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258535A (en) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
US9804822B2 (en) * 2014-07-29 2017-10-31 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN109994118A (en) * 2019-04-04 2019-07-09 平安科技(深圳)有限公司 Speech cipher verification method, device, storage medium and computer equipment
CN110265037A (en) * 2019-06-13 2019-09-20 中信银行股份有限公司 Auth method, device, electronic equipment and computer readable storage medium
CN110489659A (en) * 2019-07-18 2019-11-22 平安科技(深圳)有限公司 Data matching method and device
CN110544481A (en) * 2019-08-27 2019-12-06 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN111105783A (en) * 2019-12-06 2020-05-05 中国人民解放军61623部队 Comprehensive customer service system based on artificial intelligence
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device
CN111816191A (en) * 2020-07-08 2020-10-23 珠海格力电器股份有限公司 Voice processing method, device, system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN110970036B (en) * 2019-12-24 2022-07-12 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment


Also Published As

Publication number Publication date
CN111816191A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
US10083687B2 (en) Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
US10210870B2 (en) Method for verification and blacklist detection using a biometrics platform
US9542150B2 (en) Controlling audio players using environmental audio analysis
US6219407B1 (en) Apparatus and method for improved digit recognition and caller identification in telephone mail messaging
US7995732B2 (en) Managing audio in a multi-source audio environment
WO2022007497A1 (en) Voice processing method and apparatus, system and storage medium
CN107995360B (en) Call processing method and related product
US10270736B2 (en) Account adding method, terminal, server, and computer storage medium
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
US20180108358A1 (en) Voice Categorisation
CN108831477B (en) Voice recognition method, device, equipment and storage medium
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
JP2017203808A (en) Interaction processing program, interaction processing method, and information processing apparatus
WO2015149359A1 (en) Method for automatically adjusting volume, volume adjustment apparatus and electronic device
WO2020024415A1 (en) Voiceprint recognition processing method and apparatus, electronic device and storage medium
WO2020055465A1 (en) Inline responses to video or voice messages
CN109271480B (en) Voice question searching method and electronic equipment
US10726850B2 (en) Systems and methods of sound-based fraud protection
CN113051426A (en) Audio information classification method and device, electronic equipment and storage medium
CN111986680A (en) Method and device for evaluating spoken language of object, storage medium and electronic device
CN111785280A (en) Identity authentication method and device, storage medium and electronic equipment
CN117153185B (en) Call processing method, device, computer equipment and storage medium
CN114242120B (en) Audio editing method and audio marking method based on DTMF technology
CN113873085B (en) Voice start-up white generation method and related device
JP2019035897A (en) Determination device, determination method, and determination program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21836789; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21836789; Country of ref document: EP; Kind code of ref document: A1)