CN117789728A - Speaker voice recognition method, system, electronic device and storage medium - Google Patents


Info

Publication number
CN117789728A
Authority
CN
China
Prior art keywords: clause, text, audio, speaker, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311824479.6A
Other languages
Chinese (zh)
Inventor: 郝竹林, 罗超, 张威, 陈文浩, 张启祥, 张泽, 任君, 周明康, 江小林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Network Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Network Technology Shanghai Co Ltd filed Critical Ctrip Travel Network Technology Shanghai Co Ltd
Priority to CN202311824479.6A
Publication of CN117789728A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a speaker voice recognition method, system, electronic device, and storage medium. The voice recognition method comprises the following steps: acquiring target audio, and performing voice recognition processing on the target audio to obtain a target text; performing sentence-breaking processing on the target text according to semantics to obtain at least two clause texts; performing voiceprint recognition on the clause audio corresponding to the clause text to obtain voiceprint information of the clause audio; and determining whether the speaker corresponding to the clause audio is the main speaker according to the voiceprint information of the clause audio and the degree of association between the clause text and the current scene. By recognizing the audio's voiceprint information and its relevance to the scene topic, the method judges whether the speaker corresponding to the audio is the main speaker, solving the problem of interference from bystander speakers and improving interaction comfort in voice dialogues.

Description

Speaker voice recognition method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method, a system, an electronic device, and a storage medium for recognizing speech of a speaker.
Background
Intelligent voice customer service must recognize the main speaker's information for a guest or hotel while shielding bystander speech and played-back audio. This requires several components: 1. a speech recognition system supporting word-level timestamp and confidence output; 2. a natural-language sentence-breaking module supporting word-level input; 3. a voiceprint recognition system operating at the clause level; 4. context caching. On the basis of the first three, the service must also handle simultaneous intelligent voice dialogues from many users, caching each user's temporary main-speaker encoding result at the clause level. These technologies are individually mature, but the problem of recognizing scene-specific wording in the OTA (Online Travel Agency) industry remains, and existing methods do not address main-speaker information extraction, partial side-speaker speech shielding, or shielding of played-back audio.
The current mainstream technique for detecting and handling bystander speech is voice denoising with a DNS (Deep Noise Suppression) network. The DNS network is trained by taking produced real speech as its input and the same speech denoised by conventional techniques as its target output. When the DNS network is deployed to production, raw production speech is fed directly into it, and its output is passed to the speech recognition model of the voice telephone robot in place of the original audio of the current voice dialogue. However, speech processed by the DNS network has lost salient main-speaker information, and because the DNS network is trained only on simulated data, with no real paired data before and after denoising, it lacks adaptability to real production data and performs poorly across diverse scenes. The noisiness of each voice dialogue scene differs, making the DNS output difficult to match with a real speech recognition system.
In summary, for main-speaker information extraction, side-speaker speech shielding, and played-back-audio shielding in the OTA industry at a low 8 kHz sampling rate, the main technical difficulties are as follows:
1. The speech recognition system supporting word-level timestamp and confidence output, the text sentence-breaking module, and the voiceprint system all require a certain amount of pre-annotated sample data for the target telephone robot's scenes. The data required by the text sentence-breaking module is mostly input as short phrases, because guests and hotels in voice dialogues tend to speak tersely, so directly applying a general-purpose system yields low accuracy.
2. Intelligent voice customer service in an OTA environment must overcome a noisy voice environment at a low 8 kHz sampling rate.
3. The DNS network is trained only on simulated data and lacks the support of real production data. The noisiness of each voice dialogue robot's environment differs, so the network is difficult to match with a noise-robust speech recognition system; each environment would need its own DNS network, and migration is poor.
4. OTA intelligent voice customer service faces tens of millions of voice service requests and responses.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that the main speaker's voice information cannot be effectively recognized while a bystander is speaking, and provides a speaker voice recognition method, system, electronic device, and storage medium.
The invention solves the technical problems by the following technical scheme:
in a first aspect, a method for recognizing a speaker's voice is provided, the method comprising the steps of:
acquiring target audio, and performing voice recognition processing on the target audio to obtain a target text;
performing sentence breaking processing on the target text according to semantics to obtain at least two clause texts;
performing voiceprint recognition on clause audio corresponding to the clause text to obtain voiceprint information of the clause audio;
and determining whether the speaker corresponding to the clause audio is a main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene.
Optionally, the step of performing sentence breaking processing on the target text according to semantics to obtain at least two clause texts specifically includes:
inputting the target text into a text sentence-breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
Optionally, obtaining clause audio corresponding to the clause text according to the following steps:
acquiring time stamps of words in the target text;
determining the time stamp of the clause text according to the time stamp of the word;
splitting the target audio according to the timestamp of the clause text to obtain the clause audio corresponding to the clause text.
Optionally, the step of determining whether the speaker corresponding to the clause audio is the main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene specifically includes:
judging whether the voiceprint information of the clause audio is consistent with prestored voiceprint information;
if they are consistent, acquiring the degree of association between the clause text and the current scene;
and if the association degree characterizes that the clause text is associated with the current scene, determining that the speaker corresponding to the clause audio is a main speaker.
In a second aspect, there is provided a speech recognition system for recognizing a speaker, the speech recognition system comprising: the system comprises a text recognition module, a text sentence breaking module, a voiceprint recognition module and a speaker judgment module;
the text recognition module is used for acquiring target audio and performing voice recognition processing on the target audio to obtain a target text;
the text sentence breaking module is used for breaking sentences of the target text according to semantics to obtain at least two clause texts;
the voiceprint recognition module is used for carrying out voiceprint recognition on the clause audio corresponding to the clause text to obtain voiceprint information of the clause audio;
the speaker judging module is used for determining whether the speaker corresponding to the clause audio is a main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene.
Optionally, the text sentence breaking module is further configured to input the target text into a text sentence breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
Optionally, the voice recognition system further comprises an audio splitting module, configured to obtain the timestamp of each word in the target text, determine the timestamp of the clause text according to the timestamps of the words, and split the target audio according to the timestamp of the clause text to obtain the clause audio corresponding to the clause text.
Optionally, the speaker judgment module is specifically configured to judge whether the voiceprint information of the clause audio is consistent with the prestored voiceprint information, obtain a degree of association between the clause text and the current scene if the voiceprint information is consistent with the prestored voiceprint information, and determine that the speaker corresponding to the clause audio is the main speaker if the degree of association characterizes that the clause text is associated with the current scene.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the above-described speaker's voice recognition method when executing the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the above-described speaker's speech recognition method.
On the basis of conforming to the common knowledge in the art, the optional conditions can be arbitrarily combined to obtain the preferred embodiments of the invention.
The invention has the following positive effects: the speaker voice recognition method, system, electronic device, and storage medium judge whether the speaker corresponding to the audio is the main speaker by recognizing the audio's voiceprint information and its relevance to the scene topic, solving the problem of dialogue interference caused by bystanders speaking and played-back audio during voice dialogues of intelligent voice customer service in the OTA industry, and improving interaction comfort in voice dialogues.
Drawings
Fig. 1 is a flowchart of a method for recognizing a speaker's voice according to embodiment 1 of the present invention;
fig. 2 is a flowchart of a method for obtaining clause audio corresponding to clause text according to embodiment 1 of the present invention;
FIG. 3 is a flowchart of a method for determining whether a speaker corresponding to a clause audio is a master speaker according to voiceprint information of the clause audio and association degree of the clause text and a current scene provided in embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of a speaker voice recognition system according to embodiment 2 of the present invention;
fig. 5 is a schematic structural diagram of an electronic device for implementing a method for recognizing a speaker's voice according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
The present embodiment provides a method for recognizing voice of a speaker, as shown in fig. 1, the method for recognizing voice includes the following steps:
s11, acquiring target audio, and performing voice recognition processing on the target audio to obtain a target text;
In a specific implementation, Fbank (filterbank) features may be used as the acoustic feature extraction method for the speech recognition that converts the target audio into the corresponding text, so as to obtain the target text.
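As an illustrative sketch of the Fbank feature extraction mentioned above (the patent does not specify frame sizes or filter counts, so the 25 ms / 10 ms framing and 40 mel bins below are common assumptions for 8 kHz telephony audio):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=8000, frame_len=0.025, frame_shift=0.010,
          n_fft=256, n_mels=40):
    """Log-Mel filterbank (Fbank) features: frame the waveform, take the
    power spectrum, and apply a triangular mel filterbank."""
    win, hop = int(sr * frame_len), int(sr * frame_shift)  # 200 / 80 samples
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop: i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    return np.log(power @ fb.T + 1e-10)

# One second of 8 kHz audio -> 98 frames x 40 mel bins.
t = np.arange(8000) / 8000.0
feats = fbank(np.sin(2 * np.pi * 440.0 * t))
```

The resulting frame-by-filter matrix would be the input to the acoustic model; production systems typically also apply mean normalization, which is omitted here.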
S12, performing sentence breaking processing on the target text according to semantics to obtain at least two clause texts;
s13, carrying out voiceprint recognition on the clause audio corresponding to the clause text to obtain voiceprint information of the clause audio;
in this embodiment, voiceprint recognition is performed on each clause audio to obtain voiceprint information of each clause audio. In a specific example, in the process of voiceprint recognition, audio waveform sampling processing is performed on the sentence audio at a sampling rate of 4 times, and the processing is consistent with step S12, so as to improve the recognition speed and the recognition effect.
S14, determining whether the speaker corresponding to the clause audio is a main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene.
In a specific example, Fbank is used as the feature extraction method to judge whether a speaker is the same as the current round's main speaker. The speaker judgment module scores the clause text of each semantic clause for scene relevance, with associated set to 1 and not associated set to 0. Whether the speaker corresponding to each clause audio is the current main speaker is then judged by combining the clause's relevance with its voiceprint information: if so, the speaker corresponding to the clause audio is judged to be the main speaker; if not, the speaker is judged not to be the main speaker and is considered a side speaker.
In this embodiment, whether the speaker corresponding to the audio is the main speaker is judged by recognizing the audio's voiceprint information and its relevance to the scene topic, solving the problem of bystander-speech interference and improving interaction comfort in voice dialogues.
In an optional embodiment, the step S12 specifically includes:
inputting the target text into a text sentence-breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
In a specific implementation, the text sentence-breaking model may be coded autoregressively, with both encoding and decoding implemented as a classical 12-layer multi-head attention (MHA) structure; the attention dimension is set to 768, and the encoder and decoder are connected by cross-attention. The encoding is constrained to 4× sampling to guarantee the model training effect. The model can recognize separating clause identifiers and thereby perform sentence-breaking processing on the target text to obtain at least two clause texts.
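The full 12-layer autoregressive encoder-decoder is not reproduced here; the following is only a minimal numpy sketch of one multi-head attention block with the stated attention dimension 768 (the 12-head split is an assumption — the patent specifies the layer count and dimension, not the head count), showing how cross-attention lets decoder states attend over encoder states:

```python
import numpy as np

def multi_head_attention(q, k, v, n_heads=12):
    """One scaled dot-product multi-head attention block (d_model = 768 per
    the description; 12 heads is an assumption).  Calling it with q = decoder
    states and k = v = encoder states realises the cross-attention connection
    between encoding and decoding."""
    d_model = q.shape[-1]
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    # Randomly initialised projections stand in for learned weights.
    wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) * 0.02
                      for _ in range(4))

    def project(x, w):  # (T, d_model) -> (n_heads, T, d_head)
        y = x @ w
        return y.reshape(y.shape[0], n_heads, d_head).transpose(1, 0, 2)

    qh, kh, vh = project(q, wq), project(k, wk), project(v, wv)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = (attn @ vh).transpose(1, 0, 2).reshape(q.shape[0], d_model)
    return out @ wo

# Cross-attention: 5 decoder positions attend over 9 encoder positions.
dec = np.random.default_rng(1).standard_normal((5, 768))
enc = np.random.default_rng(2).standard_normal((9, 768))
ctx = multi_head_attention(dec, enc, enc)
```

In the described model, twelve such layers (plus feed-forward sublayers and residual connections, omitted here) would be stacked on each side, with the decoder's cross-attention consuming the encoder's final states.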
In a specific example, the target text is input into a text sentence-breaking model to perform sentence-breaking processing, and the output result of the text sentence-breaking model contains [ SEP ], so that at least two clause texts can be obtained according to the [ SEP ].
For example, Input1: order please process as soon as possible can
Output2: order please process as soon as possible [SEP] can
Here [SEP] is the separating clause identifier, used to separate words with dissimilar meanings and thus divide the text semantically; from it, the first clause text "order please process as soon as possible" and the second clause text "can" are obtained.
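Extracting the clause texts from the [SEP]-delimited model output amounts to a simple split; a minimal sketch (the function name is illustrative, and the example string is the machine-translated one from the description):

```python
def split_clauses(text, sep="[SEP]"):
    """Split the sentence-breaking model's output into clause texts on the
    separating clause identifier."""
    return [c.strip() for c in text.split(sep) if c.strip()]

# The example from the description: two clauses separated by [SEP].
clauses = split_clauses("order please process as soon as possible [SEP] can")
```

Each element of `clauses` then maps one-to-one onto a clause audio segment in the later splitting step.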
In an alternative embodiment, as shown in fig. 2, the clause audio corresponding to the clause text is obtained according to the following steps:
s21, obtaining time stamps of words in the target text;
In a specific implementation, frame-level tag information output by CTC (Connectionist Temporal Classification), such as word tags and blank tags, can be mapped back to the original frames so that the timestamp of each word in the target text is obtained by decoding.
In a specific example, the text [can] is voice information spoken by hotel operations support staff on the hotel side, while the text [order please process as soon as possible] is robot audio automatically played by some device on the hotel side (i.e., bystander voice); the staff member speaks in that background voice environment:
Input: order please process as soon as possible can
Output1: [order] [please] [as soon as possible] [process] [can]
Output2: [1.120,1.640][1.640,1.840][1.840,2.680][2.680,3.400][3.400,4.320]
Output1 is the sequence of words obtained by splitting; Output2 gives the timestamp of each word, and since 4× sampling is required in this embodiment, all timestamps are multiples of 40 ms.
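The recovery of word timestamps from frame-level CTC tags described in S21 can be sketched as follows; the blank symbol, per-word (rather than per-character) tags, and the 40 ms frame duration are assumptions consistent with the Output2 timestamps above:

```python
def ctc_word_timestamps(frame_labels, frame_dur=0.04, blank="_"):
    """Collapse frame-level CTC tags (word tags plus blank tags) back into
    (word, start, end) spans; 0.04 s matches the 40 ms frame stride that the
    Output2 timestamps above are multiples of."""
    words, start, prev = [], None, blank
    for i, lab in enumerate(list(frame_labels) + [blank]):
        if lab != prev:
            if prev != blank and start is not None:
                words.append((prev, round(start * frame_dur, 3),
                              round(i * frame_dur, 3)))
            start = i if lab != blank else None
            prev = lab
    return words

# Toy frame sequence: blank, two frames of "order", blank, one frame of "can".
spans = ctc_word_timestamps(["_", "order", "order", "_", "can"])
```

A real decoder would produce these tags per acoustic frame; the sentinel blank appended at the end simply closes any word still open when the audio ends.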
S22, determining the time stamp of the clause text according to the time stamp of the word;
in one specific example, timestamp information of clause text is determined using the timestamp of each word and [ SEP ] information.
For example, Input1: order please process as soon as possible can
Output2: order please process as soon as possible [SEP] can
The timestamp of the first clause text "order please process as soon as possible" ends at the timestamp of the word "process", and the timestamp of the second clause text "can" is the timestamp of the word "can".
S23, splitting the target audio according to the timestamp of the clause text to obtain clause audio corresponding to the clause text.
In this embodiment, according to the timestamp information of the clause text, the target audio is disassembled, so as to obtain the clause audio corresponding to the clause text.
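Steps S22-S23 amount to slicing the target audio at the clause timestamps; a minimal sketch, assuming 8 kHz audio held in a numpy array and the clause spans from the example above:

```python
import numpy as np

def split_audio(audio, clause_spans, sr=8000):
    """Cut the target audio into clause audio using each clause text's
    [start, end] timestamps in seconds (8 kHz telephony audio assumed)."""
    return [audio[int(round(s * sr)): int(round(e * sr))]
            for s, e in clause_spans]

# Spans taken from the timestamp example above: the first clause runs from
# 1.12 s to 3.40 s, the second ("can") from 3.40 s to 4.32 s.
audio = np.zeros(5 * 8000)
clause_audio = split_audio(audio, [(1.12, 3.40), (3.40, 4.32)])
```

Each resulting segment is then passed independently to the voiceprint recognition of step S13.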
In an alternative embodiment, as shown in fig. 3, the step S14 specifically includes:
s141, judging whether the voiceprint information of the clause audio is consistent with prestored voiceprint information; if so, step S142 is executed, and if not, the flow ends.
In a specific example, the obtained voiceprint information is compared with pre-stored voiceprint information to determine whether the voiceprint information is consistent.
S142, acquiring the association degree of the clause text and the current scene;
in this embodiment, if the voiceprint information is consistent with the prestored voiceprint information, the next step of judgment is performed, that is, the association degree between the clause text and the current scene is obtained. For example, an association of 1 may be set, and a non-association of 0.
S143, if the association degree represents that the clause text is associated with the current scene, determining that the speaker corresponding to the clause audio is a main speaker.
In this embodiment, whether the clause audio is the valid audio of the main speaker is determined according to the association degree of the clause text and the current scene.
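The two-stage judgment of S141-S143 can be sketched as follows; the patent does not specify how voiceprint "consistency" is computed, so cosine similarity over speaker embeddings with an illustrative 0.7 threshold stands in for it, and relevance is the 1/0 association degree described above:

```python
import numpy as np

def is_main_speaker(clause_emb, main_emb, relevance, sim_threshold=0.7):
    """A clause is attributed to the main speaker only if its voiceprint
    embedding matches the pre-stored one (S141) AND its text is associated
    with the current scene (S142-S143, association degree 1/0).  Cosine
    similarity and the 0.7 threshold are illustrative assumptions."""
    cos = float(np.dot(clause_emb, main_emb) /
                (np.linalg.norm(clause_emb) * np.linalg.norm(main_emb)))
    return cos >= sim_threshold and relevance == 1

main_vp = np.array([1.0, 0.0, 1.0])   # pre-stored main-speaker voiceprint
same_vp = np.array([0.9, 0.1, 1.1])   # close voiceprint -> same speaker
side_vp = np.array([-1.0, 1.0, 0.0])  # distant voiceprint -> side speaker
```

Note that a matching voiceprint whose clause text is irrelevant to the scene (relevance 0) is still rejected, mirroring the early exit in fig. 3.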
Example 2
The present embodiment provides a voice recognition system for recognizing a speaker, as shown in fig. 4, the voice recognition system includes: a text recognition module 201, a text sentence breaking module 202, a voiceprint recognition module 204, and a speaker determination module 205;
the text recognition module 201 is configured to obtain a target audio, and perform speech recognition processing on the target audio to obtain a target text;
the text sentence breaking module 202 is configured to perform sentence breaking processing on the target text according to semantics, so as to obtain at least two clause texts;
the voiceprint recognition module 204 is configured to perform voiceprint recognition on clause audio corresponding to the clause text, so as to obtain voiceprint information of the clause audio;
the speaker judgment module 205 is configured to determine whether a speaker corresponding to the clause audio is a dominant speaker according to voiceprint information of the clause audio and a degree of association between the clause text and a current scene.
In this embodiment, as shown in fig. 4, the voice recognition system solves, through the text recognition module, text sentence-breaking module, voiceprint recognition module, and speaker judgment module, the problem of dialogue interference caused by bystanders speaking and played-back audio during voice dialogues of intelligent voice customer service in the OTA industry, thereby improving interaction comfort in voice dialogues.
In an optional implementation manner, the text sentence breaking module is further configured to input the target text into a text sentence breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
In this embodiment, the text sentence-breaking module is dedicated to breaking sentences of text in a voice dialogue scene and can split the speech recognition result into multiple clause texts; in a specific embodiment, it requires production data of the intelligent telephone voice robot for scene adaptation support. The text sentence-breaking model in this module uses autoregressive model coding; both encoding and decoding adopt a classical 12-layer multi-head attention (MHA) structure, the attention dimension is set to 768, and the encoder and decoder are connected by cross-attention. The encoding is constrained to 4× sampling to ensure the training effect of the text sentence-breaking model. The model can recognize separating clause identifiers, thereby performing sentence-breaking processing on the target text to obtain at least two clause texts.
In an alternative embodiment, as shown in fig. 4, the speech recognition system further includes an audio splitting module 203, configured to obtain a timestamp of each word in the target text, determine a timestamp of the clause text according to the timestamp of the word, and split the target audio according to the timestamp of the clause text to obtain clause audio corresponding to the clause text.
In this embodiment, the text recognition module maps frame-level tag information output by CTC, such as word tags and blank tags, back to the original frames to decode the timestamp of each word in the target text; the audio splitting module determines the timestamp information of the clause text from each word's timestamp and the clause identifiers, and then splits the target audio according to that timestamp information to obtain the clause audio corresponding to the clause text.
In an optional implementation manner, the speaker judgment module is specifically configured to judge whether the voiceprint information of the clause audio is consistent with the prestored voiceprint information, obtain a degree of association between the clause text and the current scene if the voiceprint information is consistent with the prestored voiceprint information, and determine that the speaker corresponding to the clause audio is the main speaker if the degree of association characterizes that the clause text is associated with the current scene.
In this embodiment, the speaker judgment module judges the speaker's voiceprint information, determining whether the speaker is the same as the current round's main speaker by checking whether the clause audio's voiceprint information is consistent with the pre-stored voiceprint information. If so, the clause audio's recognized content is used to judge the relevance of its clause text to the current scene; if not, the process ends. The speaker judgment module thus eliminates interference from side speakers, improving the scene relevance of the main speaker's clause audio in the speech recognition system, the accuracy of voice dialogues, and the interaction comfort of speaker voice recognition.
Example 3
The present embodiment provides an electronic device 50, as shown in fig. 5, including a memory 52, a processor 51, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the speaker's voice recognition method as described in embodiment 1. The schematic structural diagram of one electronic device 50 shown in fig. 5 is merely an example, and should not limit the functions and the application scope of the embodiment of the present invention.
As shown in fig. 5, the electronic device 50 may be embodied in the form of a general purpose computing device, which may be a server device, for example. Components of electronic device 50 may include, but are not limited to: the at least one processor 51, the at least one memory 52, a bus 53 connecting the different system components, including the memory 52 and the processor 51.
The bus 53 includes a data bus, an address bus, and a control bus.
Memory 52 may include volatile memory such as Random Access Memory (RAM) 521 and/or cache memory 522, and may further include Read Only Memory (ROM) 523.
Memory 52 may also include a program tool 525 (or utility) having a set (at least one) of program modules 524, such program modules 524 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 51 executes various functional applications and data processing, such as the speaker's voice recognition method in the above-described embodiment 1, by running a computer program stored in the memory 52.
The electronic device 50 may also communicate with one or more external devices 55 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 55. Also, the electronic device 50 may communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, through a network adapter 56. As shown, the network adapter 56 communicates with the other modules of the electronic device 50 via the bus 53. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 50, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
Example 4
The present embodiment provides a computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the speaker's voice recognition method as described in embodiment 1.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation manner, the present invention may also be realized in the form of a program product comprising program code for causing a terminal device to carry out the speech recognition method of a speaker as in the above-mentioned embodiments, when said program product is run on the terminal device.
Wherein the program code for carrying out the invention may be written in any combination of one or more programming languages, which program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on the remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (10)

1. A method for speaker speech recognition, the method comprising the steps of:
acquiring target audio, and performing voice recognition processing on the target audio to obtain a target text;
performing sentence breaking processing on the target text according to semantics to obtain at least two clause texts;
performing voiceprint recognition on clause audio corresponding to the clause text to obtain voiceprint information of the clause audio;
and determining whether the speaker corresponding to the clause audio is a main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene.
2. The method for recognizing speech according to claim 1, wherein the step of performing sentence breaking processing on the target text according to semantics to obtain at least two clause texts comprises:
inputting the target text into a text sentence-breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
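A text sentence-breaking model of the kind claim 2 describes is commonly framed as token classification: for each token, predict the probability that a clause boundary follows it. The sketch below assumes such boundary probabilities are already produced by the semantic model and only shows how they yield clause texts; the threshold of 0.5 is an assumption, not a value from the patent.

```python
def break_by_boundary_probs(tokens: list, boundary_probs: list,
                            threshold: float = 0.5) -> list:
    # Cut the token sequence after every token whose predicted boundary
    # probability exceeds the threshold; any trailing tokens form a final clause.
    clauses, current = [], []
    for tok, p in zip(tokens, boundary_probs):
        current.append(tok)
        if p >= threshold:
            clauses.append(" ".join(current))
            current = []
    if current:
        clauses.append(" ".join(current))
    return clauses
```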
3. The method of claim 2, wherein the clause audio corresponding to the clause text is obtained according to the steps of:
acquiring time stamps of words in the target text;
determining the time stamp of the clause text according to the time stamp of the word;
splitting the target audio according to the timestamp of the clause text to obtain the clause audio corresponding to the clause text.
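The splitting steps of claim 3 can be sketched as follows. The word timestamps are assumed to be hypothetical `(start, end)` pairs in seconds (as an ASR system typically emits), and the audio is modeled as a plain sample list with a given sample rate.

```python
def clause_span(word_times: list) -> tuple:
    # Timestamp of a clause text: start of its first word, end of its last word.
    return word_times[0][0], word_times[-1][1]

def split_audio(samples: list, sample_rate: int, clauses_word_times: list) -> list:
    # Slice the target audio at each clause's (start, end) timestamps to obtain
    # the clause audio corresponding to each clause text.
    pieces = []
    for word_times in clauses_word_times:
        start, end = clause_span(word_times)
        pieces.append(samples[int(start * sample_rate):int(end * sample_rate)])
    return pieces
```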
4. A method of speech recognition according to any one of claims 1-3, wherein the step of determining whether the speaker corresponding to the clause audio is the dominant speaker according to the voiceprint information of the clause audio and the association degree of the clause text with the current scene specifically comprises:
judging whether the voiceprint information of the clause audio is consistent with prestored voiceprint information;
if the voiceprint information is consistent with the prestored voiceprint information, acquiring the association degree of the clause text and the current scene;
and if the association degree characterizes that the clause text is associated with the current scene, determining that the speaker corresponding to the clause audio is a main speaker.
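The two-stage decision of claim 4 can be sketched with cosine similarity as the voiceprint-consistency check. The similarity measure and both thresholds are assumptions for illustration; the patent does not specify how consistency or association degree is computed.

```python
import math

def cosine(a, b):
    # Cosine similarity between two voiceprint embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_main_speaker(clause_vp, enrolled_vp, association_degree,
                    vp_threshold=0.8, assoc_threshold=0.5):
    # Stage 1: the clause voiceprint must be consistent with the prestored one.
    if cosine(clause_vp, enrolled_vp) < vp_threshold:
        return False
    # Stage 2: the clause text must be associated with the current scene.
    return association_degree >= assoc_threshold
```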
5. A speech recognition system for recognizing a speaker, the speech recognition system comprising: the system comprises a text recognition module, a text sentence breaking module, a voiceprint recognition module and a speaker judgment module;
the text recognition module is used for acquiring target audio and performing voice recognition processing on the target audio to obtain a target text;
the text sentence breaking module is used for breaking sentences of the target text according to semantics to obtain at least two clause texts;
the voiceprint recognition module is used for carrying out voiceprint recognition on the clause audio corresponding to the clause text to obtain voiceprint information of the clause audio;
the speaker judging module is used for determining whether the speaker corresponding to the clause audio is a main speaker according to the voiceprint information of the clause audio and the association degree of the clause text and the current scene.
6. The speech recognition system of claim 5, wherein the text sentence breaking module is specifically configured to input the target text into a text sentence breaking model to obtain at least two clause texts; the text sentence breaking model is used for extracting semantic features of the target text and carrying out sentence breaking processing on the target text according to the semantic features.
7. The speech recognition system of claim 6, further comprising an audio splitting module configured to obtain a timestamp of each term in the target text, determine a timestamp of the clause text based on the timestamp of the term, and split the target audio according to the timestamp of the clause text to obtain clause audio corresponding to the clause text.
8. The speech recognition system of any one of claims 5-7, wherein the speaker determination module is specifically configured to determine whether voiceprint information of the clause audio is consistent with prestored voiceprint information, obtain a degree of association of the clause text with a current scene if the voiceprint information is consistent with the prestored voiceprint information, and determine that a speaker corresponding to the clause audio is a master speaker if the degree of association characterizes that the clause text is associated with the current scene.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for speech recognition of a speaker according to any one of claims 1 to 4 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method for speech recognition of a speaker according to any one of claims 1 to 4.
CN202311824479.6A 2023-12-27 2023-12-27 Speaker voice recognition method, system, electronic device and storage medium Pending CN117789728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311824479.6A CN117789728A (en) 2023-12-27 2023-12-27 Speaker voice recognition method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN117789728A true CN117789728A (en) 2024-03-29

Family

ID=90383072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311824479.6A Pending CN117789728A (en) 2023-12-27 2023-12-27 Speaker voice recognition method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN117789728A (en)

Similar Documents

Publication Publication Date Title
US20210210076A1 (en) Facilitating end-to-end communications with automated assistants in multiple languages
US10672391B2 (en) Improving automatic speech recognition of multilingual named entities
CN109147768A (en) A kind of audio recognition method and system based on deep learning
CN107943786B (en) Chinese named entity recognition method and system
CN111613212A (en) Speech recognition method, system, electronic device and storage medium
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
CN113674742B (en) Man-machine interaction method, device, equipment and storage medium
Kopparapu Non-linguistic analysis of call center conversations
WO2019031268A1 (en) Information processing device and information processing method
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
US8442831B2 (en) Sound envelope deconstruction to identify words in continuous speech
CN111210821A (en) Intelligent voice recognition system based on internet application
CN112562682A (en) Identity recognition method, system, equipment and storage medium based on multi-person call
CN116686045A (en) End-to-port language understanding without complete transcripts
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
Mussakhojayeva et al. KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus.
CN116665674A (en) Internet intelligent recruitment publishing method based on voice and pre-training model
CN111949778A (en) Intelligent voice conversation method and device based on user emotion and electronic equipment
CN112087726B (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
CN117711376A (en) Language identification method, system, equipment and storage medium
CN113129895A (en) Voice detection processing system
CN116644765A (en) Speech translation method, speech translation device, electronic device, and storage medium
CN111949777A (en) Intelligent voice conversation method and device based on crowd classification and electronic equipment
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN117789728A (en) Speaker voice recognition method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination