CN109785846B - Role recognition method and device for mono voice data

Publication number: CN109785846B (application CN201910012155.XA)
Authority: CN (China)
Filing date: 2019-01-07; published as CN109785846A on 2019-05-21; granted as CN109785846B on 2024-05-28
Legal status: Active
Inventors: 顾艳梅 (Gu Yanmei), 马骏 (Ma Jun), 王少军 (Wang Shaojun)
Assignee: Ping An Technology Shenzhen Co Ltd

Abstract

The invention relates to the field of artificial intelligence, and discloses a role recognition method and device for mono voice data. The method comprises the following steps: performing voice recognition on the voice data to obtain recording information and recording duration of the voice data, where the recording duration is the time length of the recording information; extracting the voice features of a speaker from the recording information based on a universal background model; determining a role judgment threshold for the speaker according to the recording duration; comparing the voice features of the speaker with the pre-stored voice features of a target role to obtain a similarity comparison result; and determining whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold. This technical scheme addresses the problems that the prior art is not suitable for real-time speaker recognition, has low accuracy on recordings of short duration, and has no error correction mechanism for role recognition.

Description

Role recognition method and device for mono voice data
Technical Field
The invention relates to the technical field of voice recognition in artificial intelligence, in particular to a role recognition method and device for mono voice data.
Background
In recent years, artificial intelligence (AI) technology has developed rapidly and is widely applied in fields such as retail, transportation, logistics, medical care, and education; as research continues to deepen, its range of applications grows ever wider.
The prior art provides a speaker identification method and device, which mainly comprises the following steps: 1) receiving a voice signal of a speaker; 2) acquiring a fundamental frequency value of the voice signal; 3) acquiring the vocal tract length of the speaker based on the voice signal; and 4) identifying the category of the speaker at least according to the fundamental frequency value and the vocal tract length.
However, this solution has the following drawbacks: (1) recordings of short duration cannot be judged accurately; (2) there is no mechanism to correct erroneous role recognition results; and (3) the framework is not suitable for real-time speaker recognition.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a role recognition method and device for mono voice data, to solve the problems that the prior art is not suitable for real-time speaker recognition, has low accuracy on recordings of short duration, and has no error correction mechanism for role recognition.
In one aspect, an embodiment of the present invention provides a role recognition method for mono voice data, comprising: performing voice recognition on voice data to obtain recording information and recording duration of the voice data, where the recording duration is the time length of the recording information; extracting the voice features of a speaker from the recording information based on a universal background model; determining a role judgment threshold for the speaker according to the recording duration; comparing the voice features of the speaker with the pre-stored voice features of a target role to obtain a similarity comparison result; and determining whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold.
Optionally, extracting the voice features of the speaker from the recording information based on the universal background model includes: judging whether the recording duration is greater than a preset time threshold; if the recording duration is greater than or equal to the preset time threshold, extracting the voice features of the speaker from the recording information based on the universal background model; and if the recording duration is less than the preset time threshold, duplicating the recording information several times so that the time length of the duplicated recording information is greater than or equal to the preset time threshold, and extracting the voice features of the speaker from the duplicated recording information based on the universal background model.
Optionally, after determining whether the speaker is the target role according to the relationship between the similarity comparison result and the role judgment threshold, the method further includes: selecting at least three pieces of recording information whose speaker roles have been recognized; comparing the recording information with the longest recording duration among the selected recordings against each of the remaining selected recordings to obtain corresponding comparison results; and correcting the speaker role recognition results of the remaining recordings according to the comparison results.
Optionally, correcting the speaker role recognition results of the remaining recordings according to the comparison results includes: judging whether the comparison result between the recording with the longest recording duration and a first recording among the remaining recordings shows a similarity above a similarity threshold; if the similarity is greater than or equal to the similarity threshold but the speaker role recognized for the first recording differs from that of the longest recording, correcting the speaker role of the first recording to match that of the longest recording; and if the similarity is less than the similarity threshold and the recognized roles differ, or the similarity is greater than or equal to the similarity threshold and the recognized roles are the same, leaving the speaker role of the first recording unchanged.
Optionally, after extracting the voice features of the speaker from the recording information based on the universal background model, the method further includes: performing channel compensation on the extracted voice features of the speaker using a channel compensation algorithm.
On the other hand, an embodiment of the present invention provides a role recognition device for mono voice data, comprising: a voice recognition module, configured to perform voice recognition on voice data to obtain recording information and recording duration of the voice data, where the recording duration is the time length of the recording information; a feature extraction module, configured to extract the voice features of a speaker from the recording information based on a universal background model; a threshold determining module, configured to determine a role judgment threshold for the speaker according to the recording duration; a feature comparison module, configured to compare the voice features of the speaker with the pre-stored voice features of a target role to obtain a similarity comparison result; and a role judgment module, configured to determine whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold.
In another aspect, an embodiment of the present invention further provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the above role recognition method for mono voice data.
In still another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the above role recognition method for mono voice data.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
According to the role recognition method for mono voice data provided by the embodiment of the present invention, mono voice data is recognized using automatic speech recognition to obtain the recording information and recording duration of the voice data. Because the voice data is recognized using real-time ASR, each received piece of voice data contains only one speaker role. The voice features of the speaker are then extracted from the recording information based on the universal background model, and a role judgment threshold is determined for the speaker according to the recording duration. The voice features of the speaker are then compared with the pre-stored voice features of the target role to obtain a similarity comparison result, and whether the speaker in the voice data is the target role is determined according to the relationship between the similarity comparison result and the role judgment threshold.
Further, in extracting the voice features of the speaker from the recording information based on the universal background model, recording information with a short recording duration (less than a preset time threshold) is duplicated several times so that the time length of the duplicated recording information is greater than or equal to the preset time threshold; the voice features of the speaker extracted from the duplicated (longer) recording information are therefore easier to recognize.
Further, by selecting at least three pieces of recording information whose speaker roles have been recognized, the recording with the longest recording duration is compared against each of the recordings with relatively shorter durations to obtain corresponding comparison results, and the speaker role recognition results of the remaining recordings are corrected according to these comparison results. Because role recognition on recordings of longer duration is relatively accurate, this error correction mechanism can revise the recognition results and thereby further improve the accuracy of role recognition.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of one embodiment of the role recognition method for mono voice data of the present application;
FIG. 2 is a flow chart of another embodiment of the role recognition method for mono voice data of the present application;
FIG. 3 is a flow chart of another embodiment of the role recognition method for mono voice data of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of the role recognition device for mono voice data of the present application;
FIG. 5 is a schematic diagram of one embodiment of a computer device of the present application.
Detailed Description
For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flow chart illustrating an embodiment of the role recognition method for mono voice data of the present application. Referring to fig. 1, the method includes:
Step 101, performing real-time voice recognition on voice data to obtain recording information and recording duration of the voice data; the recording duration is the time length of the recording information.
In this embodiment, automatic speech recognition (ASR) may be used to recognize, in real time, voice data received from a mono channel; each time a complete sentence is recognized, the voice data is processed to obtain the recording information and recording duration. The recording information is the sound content of the voice data (specifically, the recognized complete sentence), including the speaker's voice and noise. The recording duration is the time length of the recording information, i.e., the duration of the recognized complete sentence.
It should be noted that a complete sentence recognized from the voice data by ASR is spoken by a single speaker, and sentences spoken by different speakers are recognized as separate complete sentences. Multiple sentences spoken continuously by the same speaker may be recognized either as one complete sentence (i.e., one with a longer recording duration) or as several sentences (each with a shorter recording duration).
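As a minimal sketch of the per-sentence output described above, each recognized complete sentence could be represented by a small structure; the `RecordingInfo` type, its field names, and the `ivector` slot reused in later sketches are illustrative assumptions rather than part of the original disclosure:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RecordingInfo:
    """One complete sentence segmented from a mono channel by real-time ASR."""
    audio: np.ndarray                     # raw waveform samples of the sentence
    text: str                             # recognized transcript
    duration: float                       # recording duration in seconds
    ivector: Optional[np.ndarray] = None  # voice features extracted in step 102
    role: Optional[str] = None            # speaker role assigned in step 105
```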
Step 102, extracting the voice features of the speaker from the recording information based on the universal background model.
Specifically, the universal background model (UBM) is trained in advance by a training module on a large amount of training data. It captures the voice features of speakers along multiple dimensions, against which the voice feature information of different speakers can be mapped. In this embodiment, the voice features are i-vector features. The i-vector model employs total variability space estimation and i-vector estimation; projection onto the total variability space makes the resulting vector more discriminative with respect to both speaker information and channel information.
In this step, the voice features of the speaker are extracted from the recording information based on the multi-dimensional i-vector features of the universal background model. Those skilled in the art will understand that recording information of short duration contains fewer speaker voice features, so recognizing the speaker's role from the features extracted from such recordings is more difficult.
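As a rough numpy sketch of this extraction step, the function below computes an i-vector from acoustic frames under a diagonal-covariance UBM. The UBM parameters and total variability matrix T are assumed to have been trained offline; this is an illustrative reconstruction of standard i-vector estimation, not the patent's exact implementation:

```python
import numpy as np

def extract_ivector(frames, ubm_weights, ubm_means, ubm_vars, T):
    """Estimate the i-vector (posterior mean of the total variability
    factor) of one recording.

    frames:      (num_frames, D) acoustic features, e.g. MFCCs
    ubm_weights: (C,) UBM mixture weights
    ubm_means:   (C, D) UBM component means
    ubm_vars:    (C, D) diagonal covariances of the UBM
    T:           (C * D, R) total variability matrix
    returns:     (R,) i-vector
    """
    C, D = ubm_means.shape
    R = T.shape[1]

    # Frame-level posterior responsibilities gamma_t(c) under the UBM.
    log_prob = (
        -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_vars).sum(-1)
        - 0.5 * np.log(2.0 * np.pi * ubm_vars).sum(-1)
        + np.log(ubm_weights)
    )
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)          # (num_frames, C)

    # Zeroth-order and centered first-order Baum-Welch statistics.
    N = gamma.sum(axis=0)                              # (C,)
    F = gamma.T @ frames - N[:, None] * ubm_means      # (C, D)

    # Posterior of the latent factor: w = L^-1 * sum_c T_c' Sigma_c^-1 F_c,
    # with precision L = I + sum_c N_c * T_c' Sigma_c^-1 T_c.
    T_c = T.reshape(C, D, R)
    L = np.eye(R)
    b = np.zeros(R)
    for c in range(C):
        T_sig = T_c[c] / ubm_vars[c][:, None]          # Sigma_c^-1 T_c
        L += N[c] * (T_c[c].T @ T_sig)
        b += T_sig.T @ F[c]
    return np.linalg.solve(L, b)
```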
Step 103, determining a role judgment threshold for the speaker according to the recording duration.
Specifically, the voice features of speakers extracted from recordings of different durations differ greatly. For example, recording information with a long duration contains relatively rich information, so the speaker's voice features extracted from it can fully characterize the speaker's identity, whereas recording information with a short duration contains relatively little information, so the extracted voice features are insufficient to fully characterize the speaker's identity.
Consequently, in the subsequent role recognition based on the extracted voice features, similarity comparison results obtained from recordings of different durations cannot be judged against the same standard. That is, even when similarity comparison yields the same result, i-vector features extracted from recordings of different durations differ in how well they characterize the speaker's role, so the role judgment threshold needs to be updated in real time when determining the recognition result. The same similarity comparison result may therefore lead to different role recognition results.
In this embodiment, different role judgment thresholds (i.e., different judgment standards) are therefore determined for different recording durations. For example, a larger role judgment threshold is set for recordings of long duration and a smaller one for recordings of short duration; that is, a lower bar is set for recordings of short duration, from which fewer speaker voice features are extracted.
In practical application, the role judgment threshold is updated according to the recording duration of the recording information currently to be recognized.
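A minimal sketch of such a duration-dependent threshold follows; the breakpoints and values are illustrative assumptions, since the text only specifies that shorter recordings get lower thresholds:

```python
def role_judgment_threshold(duration_s: float) -> float:
    """Map recording duration (seconds) to a role judgment threshold,
    expressed here as a percentage similarity."""
    if duration_s >= 10.0:
        return 80.0   # long recording: rich features, strict standard
    if duration_s >= 3.0:
        return 70.0   # medium recording
    return 60.0       # short recording: sparse features, lenient standard
```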
And 104, comparing the voice characteristics of the speaker with the voice characteristics of the pre-stored target character to obtain a similarity comparison result.
Specifically, the pre-stored sound characteristics of the target character may be collected in advance and stored in a database for use in the decision of character recognition. In practical application, there may be multiple target roles, so that different accounts (or different identities, etc.) may be used to store the sound features of each target role in the corresponding account.
The sound characteristic of the target character is also an i-vector characteristic. Therefore, the similarity comparison result can be obtained by comparing the voice characteristics of the speaker extracted from the recording information with the voice characteristics of the pre-stored target character. The similarity comparison result may be a specific score (for example, 80 points), a percentage (for example, 60%) representing the similarity, or the like.
As described above, the i-vector feature model uses the global variance space to include the variance between speakers and the variance between channels. Therefore, in practical application, the method further comprises the following steps: and performing channel compensation on the extracted voice characteristics of the speaker by using a channel compensation algorithm. The channel compensation algorithm is a Probabilistic LINEAR DISCRIMINANT ANALYSIS (PLDA for short).
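As a simplified sketch of the scoring step, the snippet below length-normalizes the two i-vectors and uses cosine scoring mapped to a percentage; this is a common stand-in for the PLDA scoring named above, which would additionally model within-speaker (channel) variability learned from training data:

```python
import numpy as np

def similarity_percent(ivec_speaker: np.ndarray, ivec_target: np.ndarray) -> float:
    """Compare two i-vectors and return a similarity in [0, 100]."""
    a = ivec_speaker / np.linalg.norm(ivec_speaker)   # length normalization
    b = ivec_target / np.linalg.norm(ivec_target)
    cosine = float(np.dot(a, b))                      # in [-1, 1]
    return (cosine + 1.0) / 2.0 * 100.0               # map to a percentage
```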
Step 105, determining whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold.
Specifically, the role judgment threshold for the speaker in the recording information is determined in step 103, and the similarity comparison result is compared against this threshold to determine whether the speaker is the target role.
For example, suppose the role judgment threshold is a score of 60: a result of 60 or above means the speaker is determined to be the target role, and a result below 60 means the speaker is determined not to be. If the similarity comparison result obtained in step 104 is 80, the speaker is determined to be the target role.
For another example, suppose the role judgment threshold is 80%: a result of 80% or above means the speaker is determined to be the target role, and a result below 80% means the speaker is determined not to be. If the similarity comparison result obtained in step 104 is 60%, the speaker is determined not to be the target role.
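Combining the sketches above, the decision itself reduces to a single comparison (both helper functions are the illustrative assumptions introduced earlier):

```python
def is_target_role(similarity: float, duration_s: float) -> bool:
    """Step 105: compare the similarity comparison result against the
    duration-dependent role judgment threshold from step 103."""
    return similarity >= role_judgment_threshold(duration_s)
```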
This embodiment can be applied to analyze mono voice data involving multiple speakers, thereby realizing role recognition for each of them.
Fig. 2 is a flow chart of another embodiment of the role recognition method for mono voice data of the present application. As shown in fig. 2, building on the embodiment of fig. 1, step 102 specifically includes:
Step 1021: judging whether the recording duration is greater than a preset time threshold.
Step 1022: if the recording duration is greater than or equal to the preset time threshold, extracting the voice features of the speaker from the recording information based on the universal background model.
Step 1023: if the recording duration is less than the preset time threshold, duplicating the recording information several times so that the time length of the duplicated recording information is greater than or equal to the preset time threshold, and extracting the voice features of the speaker from the duplicated recording information based on the universal background model.
Specifically, the preset time threshold may be set differently for different application scenarios; for example, it may be set to 1 second.
For recording information whose recording duration is greater than or equal to the preset time threshold, the voice features of the speaker are extracted directly from the recording information based on the universal background model.
For recording information whose recording duration is less than the preset time threshold, the (short) recording information is duplicated several times to obtain recording information of longer duration (at least equal to the preset time threshold), from which the speaker's voice features are then extracted based on the universal background model. Because the duplicated recording information contains more frames of the speaker's voice, it facilitates the subsequent recognition of the speaker's role.
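A minimal sketch of this duplication step, assuming the 1-second threshold from the example above:

```python
import numpy as np

def pad_by_duplication(audio: np.ndarray, duration_s: float,
                       min_duration_s: float = 1.0) -> np.ndarray:
    """Tile a short recording until it reaches the preset time threshold;
    recordings that are already long enough are returned unchanged."""
    if duration_s >= min_duration_s:
        return audio
    repeats = int(np.ceil(min_duration_s / duration_s))
    return np.tile(audio, repeats)
```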
The role recognition method for mono voice data provided by this embodiment can recognize the speaker's role in voice data in real time, and improves the accuracy of role recognition on recordings of short duration.
However, in practical applications, since the role judgment threshold is derived from training data, the results produced by this role recognition method may contain errors when the training data is insufficient or the training algorithm is not accurate enough. The inventors therefore further studied and propose a scheme for correcting the role recognition results obtained by the above method.
Fig. 3 is a flow chart of another embodiment of the role recognition method for mono voice data of the present application. As shown in fig. 3, after the speaker role recognition results of multiple pieces of voice data have been obtained via step 105, the method further includes:
Step 106, selecting at least three pieces of recording information whose speaker roles have been recognized.
Specifically, N pieces of recording information (N ≥ 3, i.e., at least three) are selected arbitrarily from the recording information whose speaker roles have been recognized.
Step 107, comparing the recording information with the longest recording duration among the selected recordings against each of the remaining selected recordings to obtain corresponding comparison results.
Specifically, the recording with the longest duration is compared against each of the remaining recordings, yielding N-1 comparison results. For example, the similarity of the speakers' voice features in each pair of recordings may be compared using the method described in step 104, giving the N-1 comparison results.
Step 108, correcting the speaker role recognition results of the remaining recordings according to the comparison results.
Step 108 specifically includes:
Step 1081, judging whether the comparison result between the recording with the longest recording duration and a first recording among the remaining recordings shows a similarity above a similarity threshold.
The similarity threshold may be preset; it is the same kind of value as the similarity itself, e.g., both are scores or both are percentages.
Step 1082, if the similarity is greater than or equal to the similarity threshold but the speaker role recognized for the first recording differs from that of the longest recording, correcting the speaker role of the first recording to match that of the longest recording.
Step 1083, if the similarity is less than the similarity threshold and the recognized roles differ, or the similarity is greater than or equal to the similarity threshold and the recognized roles are the same, leaving the speaker role of the first recording unchanged.
Specifically, according to the judgment in step 1081, if the similarity between the longest recording and the first recording is high (i.e., greater than or equal to the similarity threshold), the speaker roles of the two recordings should be the same. If the original recognition results differ, the speaker role of the first recording (i.e., the shorter one) is corrected to match that of the longest recording; if they are already the same, the result is left unchanged.
Conversely, if the similarity between the longest recording and the first recording is low (i.e., less than the similarity threshold), the speaker roles of the two recordings should be different. If the original recognition results differ, the result for the first recording is left unchanged; if they are the same, the speaker role of the first recording (i.e., the shorter one) is corrected so that it no longer matches that of the longest recording.
Each of the remaining recordings is judged and corrected following steps 1081-1083 in the same way, which is not repeated here.
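A sketch of this correction pass over a batch of recognized recordings follows; it reuses the illustrative `RecordingInfo` and `similarity_percent` helpers from earlier, and the handling of the low-similarity/same-label case (clearing the label for re-recognition) is an assumption, since the text only requires that the labels end up different:

```python
def correct_roles(recordings: list, similarity_threshold: float = 70.0) -> None:
    """Steps 106-108: error-correct speaker role labels in place, trusting
    the recording with the longest duration."""
    assert len(recordings) >= 3, "select at least three recognized recordings"
    longest = max(recordings, key=lambda r: r.duration)
    for rec in recordings:
        if rec is longest:
            continue
        sim = similarity_percent(longest.ivector, rec.ivector)
        if sim >= similarity_threshold and rec.role != longest.role:
            # High similarity, different labels: overwrite with the longer
            # recording's (more reliable) label.
            rec.role = longest.role
        elif sim < similarity_threshold and rec.role == longest.role:
            # Low similarity, identical labels: the short recording's label
            # is suspect; clear it so it no longer matches.
            rec.role = None
        # In the other two cases the existing label is kept (step 1083).
```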
Because role recognition of the speaker in recordings of longer duration is relatively accurate, the error correction mechanism of steps 106-108 can revise the recognition results and thereby further improve the accuracy of role recognition.
Fig. 4 is a schematic structural diagram of an embodiment of the role recognition device for mono voice data of the present application. Referring to fig. 4, the role recognition device 4 includes:
A voice recognition module 41, configured to perform voice recognition on voice data to obtain recording information and recording duration of the voice data; the recording duration is the time length of the recording information.
A feature extraction module 42, configured to extract the voice features of the speaker from the recording information based on the universal background model.
A threshold determining module 43, configured to determine a role judgment threshold for the speaker according to the recording duration.
A feature comparison module 44, configured to compare the voice features of the speaker with the pre-stored voice features of the target role to obtain a similarity comparison result.
A role judgment module 45, configured to determine whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold.
The feature extraction module 42 includes: a duration judging unit 421, configured to judge whether the recording duration is greater than a preset time threshold; and a feature extraction processing unit 422, configured to extract the voice features of the speaker from the recording information based on the universal background model if the recording duration is greater than or equal to the preset time threshold. The feature extraction processing unit 422 is further configured to duplicate the recording information several times if the recording duration is less than the preset time threshold, so that the time length of the duplicated recording information is greater than or equal to the preset time threshold, and to extract the voice features of the speaker from the duplicated recording information based on the universal background model.
The role recognition device 4 further includes: a recording information selecting module 46, configured to select at least three pieces of recording information whose speaker roles have been recognized; a recording information comparison module 47, configured to compare the recording with the longest recording duration among the selected recordings against each of the remaining selected recordings to obtain corresponding comparison results; and a role judgment correction module 48, configured to correct the speaker role recognition results of the remaining recordings according to the comparison results.
The role judgment correction module 48 includes:
A comparison judging unit 481, configured to judge whether the comparison result between the recording with the longest recording duration and a first recording among the remaining recordings shows a similarity above a similarity threshold.
A correction processing unit 482, configured to correct the speaker role recognized for the first recording to match that of the longest recording if the similarity is greater than or equal to the similarity threshold but the two recognition results differ.
The correction processing unit 482 is further configured to leave the speaker role of the first recording unchanged if the similarity is less than the similarity threshold and the two recognition results differ, or if the similarity is greater than or equal to the similarity threshold and the two recognition results are the same.
The role recognition device 4 further includes a channel compensation module, configured to perform channel compensation on the extracted voice features of the speaker using a channel compensation algorithm.
For the specific processing of each module and unit in the role recognition device of this embodiment, reference may be made to the method embodiments above, which are not repeated here.
Fig. 5 is a schematic structural diagram of an embodiment of a computer device of the present application. The computer device may include a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the role recognition method for mono voice data provided by the embodiments of the present application can be implemented.
The computer device may be a server, for example a cloud server, or an electronic device such as a smartphone, a smartwatch, or a tablet computer; this embodiment does not limit the specific form of the computer device.
Fig. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in FIG. 5, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processing unit 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 52 may be stored in memory 28. Such program modules 52 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 52 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN) and/or a public network such as the Internet via the network adapter 20. As shown in fig. 5, the network adapter 20 communicates with other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in fig. 5, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes programs stored in the memory 28 to perform various functional applications and data processing, for example, implementing the role recognition method for mono voice data provided by the embodiments of the present application.
An embodiment of the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program can implement the role recognition method for mono voice data provided by the embodiments of the present application.
The non-transitory computer-readable storage medium described above may employ any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (8)

1. A role recognition method for mono voice data, comprising:
performing voice recognition on voice data to obtain recording information and recording duration of the voice data, where the recording duration is the time length of the recording information;
extracting the voice features of a speaker from the recording information based on a universal background model;
determining a role judgment threshold for the speaker according to the recording duration;
comparing the voice features of the speaker with the pre-stored voice features of a target role to obtain a similarity comparison result;
determining whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold;
selecting at least three pieces of recording information whose speaker roles have been recognized;
comparing the recording information with the longest recording duration among the selected recordings against each of the remaining selected recordings to obtain corresponding comparison results; and
correcting the speaker role recognition results of the remaining recordings according to the comparison results.
2. The method of claim 1, wherein extracting the voice features of the speaker from the recording information based on the universal background model comprises:
judging whether the recording duration is greater than a preset time threshold;
if the recording duration is greater than or equal to the preset time threshold, extracting the voice features of the speaker from the recording information based on the universal background model; and
if the recording duration is less than the preset time threshold, duplicating the recording information several times so that the time length of the duplicated recording information is greater than or equal to the preset time threshold, and extracting the voice features of the speaker from the duplicated recording information based on the universal background model.
3. The method of claim 2, wherein correcting the speaker role recognition results of the remaining recordings according to the comparison results comprises:
judging whether the comparison result between the recording with the longest recording duration and a first recording among the remaining recordings shows a similarity above a similarity threshold;
if the similarity is greater than or equal to the similarity threshold but the speaker role recognized for the first recording differs from that of the longest recording, correcting the speaker role of the first recording to match that of the longest recording; and
if the similarity is less than the similarity threshold and the recognized roles differ, or the similarity is greater than or equal to the similarity threshold and the recognized roles are the same, leaving the speaker role of the first recording unchanged.
4. The method of claim 1, further comprising, after extracting the voice features of the speaker from the recording information based on the universal background model:
performing channel compensation on the extracted voice features of the speaker using a channel compensation algorithm.
5. A role recognition device for mono voice data, comprising:
a voice recognition module, configured to perform voice recognition on voice data to obtain recording information and recording duration of the voice data, where the recording duration is the time length of the recording information;
a feature extraction module, configured to extract the voice features of a speaker from the recording information based on a universal background model;
a threshold determining module, configured to determine a role judgment threshold for the speaker according to the recording duration;
a feature comparison module, configured to compare the voice features of the speaker with the pre-stored voice features of a target role to obtain a similarity comparison result;
a role judgment module, configured to determine whether the speaker in the voice data is the target role according to the relationship between the similarity comparison result and the role judgment threshold;
a recording information selecting module, configured to select at least three pieces of recording information whose speaker roles have been recognized;
a recording information comparison module, configured to compare the recording information with the longest recording duration among the selected recordings against each of the remaining selected recordings to obtain corresponding comparison results; and
a role judgment correction module, configured to correct the speaker role recognition results of the remaining recordings according to the comparison results.
6. The device of claim 5, wherein the feature extraction module comprises:
a duration judging unit, configured to judge whether the recording duration is greater than a preset time threshold; and
a feature extraction processing unit, configured to extract the voice features of the speaker from the recording information based on the universal background model if the recording duration is greater than or equal to the preset time threshold;
wherein the feature extraction processing unit is further configured to duplicate the recording information several times if the recording duration is less than the preset time threshold, so that the time length of the duplicated recording information is greater than or equal to the preset time threshold, and to extract the voice features of the speaker from the duplicated recording information based on the universal background model.
7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-4 when executing the computer program.
8. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-4.