Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a schematic flow chart of a method for recognizing multi-dialect mixed speech according to an embodiment of the present invention. The method is applicable to situations where existing dialect recognition subsystems, each corresponding to one dialect, are used to effectively recognize multi-dialect mixed speech.
It can be understood that most existing speech recognition technologies recognize speech in a single language and can achieve high recognition accuracy; for multi-dialect mixed speech, however, they cannot perform effective recognition or guarantee high accuracy. The invention aims to reuse existing single-language speech recognition technology: a plurality of single-dialect recognition subsystems recognize the multi-dialect mixed speech file in turn, using individual words as recognition units; the semantic words recognized by each single-dialect recognition subsystem are combined according to the corresponding timeline information to obtain a plurality of word segmentation sequences covering the whole mixed speech file; a full-text parsing subsystem then scores all the word segmentation sequences, and one or more of them are selected as the recognition result of the whole file according to the scoring result.
It should be noted that, in the embodiments of the present invention, Standard Mandarin, the regional dialects of Chinese, and the languages of other countries are all regarded as dialects; accordingly, multi-dialect mixed speech may be understood as speech containing at least one of these dialects. A scenario of mixed speech in multiple languages may be, for example, an utterance such as "my boss is good" spoken partly in Chinese and partly in English, or several speakers from different regions each conversing in a local dialect. In addition, the method for recognizing multi-dialect mixed speech according to the embodiment of the present invention can be performed by a multi-dialect mixed speech recognition system, which includes a plurality of dialect recognition subsystems, each performing speech recognition for a single type of dialect.
As shown in Fig. 1, the method for recognizing multi-dialect mixed speech provided in this embodiment specifically includes the following steps:
S101, taking the initial speech to be recognized as the target speech, and acquiring a semantic text obtained by processing the target speech by at least one dialect recognition subsystem, together with the timeline information corresponding to the semantic text.
Wherein the initial speech to be recognized is speech including at least one dialect, i.e. the multi-dialect mixed speech. The dialect recognition subsystem can be understood as a subsystem capable of performing speech recognition of a single type of dialect for the initial speech to be recognized; optionally, the dialect recognition subsystem takes a single word as a recognition unit, that is, a single recognition result is a word.
It can be understood that, to ensure effective recognition of multi-dialect mixed speech with high accuracy, the multi-dialect mixed speech recognition system should include dialect recognition subsystems covering as many types of dialects as possible; optionally, the types of dialects covered by the subsystems at least include the types of dialects contained in the initial speech to be recognized. Alternatively, the system may be continuously expanded as new dialect recognition subsystems are produced, thereby continuously broadening its coverage of multi-dialect mixed speech recognition.
The semantic text can be understood as the standard generalized written text, obtained by a dialect recognition subsystem after processing the target speech, that best expresses the content of the processed part of the target speech; alternatively, the semantic text may be a single word or a compound of two or more words. The timeline information may be understood as the cut-off point, on the audio time axis of the initial speech to be recognized, of the processed part of the target speech corresponding to the semantic text; equivalently, it is the cumulative audio duration of the initial speech that has been processed up to and including the part corresponding to the semantic text.
It can be understood that, since each dialect recognition subsystem recognizes the target speech using a single word as the recognition unit, when the total audio duration of the initial speech to be recognized exceeds the audio duration of one word, a dialect recognition subsystem recognizes the target speech multiple times. To avoid repeatedly recognizing the processed part of the target speech, the corresponding timeline information may be recorded after each semantic text is recognized and used as the time starting point of the next recognition; in a subsequent step, the semantic texts may be concatenated in order according to their respective timeline information, thereby ensuring semantic continuity.
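The word-by-word loop described above, with timeline information carried forward as the next starting point, can be sketched as follows. The `StubSubsystem` and its `recognize_next` call are hypothetical stand-ins for a real dialect recognition subsystem, introduced only for illustration:

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    text: str        # semantic text of one recognized word
    end_time: float  # timeline information: cumulative audio processed (s)

def recognize_sequentially(subsystem, audio, total_duration):
    """Run one dialect recognition subsystem over the audio word by word,
    using each word's timeline information as the next starting point."""
    words, start = [], 0.0
    while start < total_duration:
        result = subsystem.recognize_next(audio, start)
        if result is None:   # no semantic text: this branch stops here
            break
        text, end_time = result
        words.append(RecognizedWord(text, end_time))
        start = end_time     # avoid re-recognizing the processed part
    return words

class StubSubsystem:
    """Toy stand-in that 'recognizes' pre-scripted words with end times."""
    def __init__(self, script):
        self.script = dict(script)  # start time -> (text, end_time)
    def recognize_next(self, audio, start):
        return self.script.get(start)

mandarin = StubSubsystem({0.0: ("my", 2.0),
                          2.0: ("boss", 4.0),
                          4.0: ("particularly good", 6.0)})
words = recognize_sequentially(mandarin, audio=None, total_duration=6.0)
```

The loop stops either when the whole audio duration is covered or when the subsystem yields no semantic text, matching the two termination cases described above.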
Specifically, in this step each dialect recognition subsystem in the multi-dialect mixed speech recognition system processes the target speech in turn, so that a semantic text is obtained from at least one dialect recognition subsystem; optionally, when the target speech is the initial speech to be recognized, the semantic text is a first semantic text corresponding to the initial speech to be recognized.
It is understood that when the dialect recognition subsystems process the target speech in turn, not every subsystem obtains a corresponding semantic text. For example, if the first dialect (i.e., the first word to be recognized) in the target speech is Mandarin Chinese and the target speech is processed by the dialect recognition subsystem for English, that subsystem will most likely output no semantic text.
S102, adding each semantic text and its timeline information to the historical word segmentation set of the corresponding dialect recognition subsystem.
The historical word segmentation set can be understood as a set that stores the first semantic text obtained by the corresponding dialect recognition subsystem from the initial speech to be recognized, together with all subsequent semantic texts following that first semantic text.
It can be understood that, once the first semantic texts obtained by the dialect recognition subsystems from the initial speech to be recognized are available, all possible final recognition results for the whole initial speech, each taking one of these first semantic texts as its first recognized word, are determined. Since each first semantic text corresponds to one dialect recognition subsystem, a historical word segmentation set can be created in each corresponding subsystem to store all the semantic texts belonging to the possible final recognition results that start with that first semantic text.
It should be noted that, for any first semantic text, the dialect recognition subsystems that produce its subsequent semantic texts may include the subsystem that produced the first semantic text itself, as well as other dialect recognition subsystems.
For example, assume the content of the initial speech to be recognized is "my (Mandarin) boss (English) particularly good (Tianjin dialect)", with a total audio duration of 6s, and assume the dialect recognition subsystems correspond to Mandarin, Northeastern Mandarin, Tianjin dialect, and English. Each subsystem processes the initial speech to be recognized; suppose the Mandarin and Northeastern Mandarin subsystems obtain a first semantic text. Say the Mandarin subsystem obtains the first semantic text "my" with timeline information 2s, meaning the audio it processed for "my" is the first 2s of the total duration; the Northeastern Mandarin subsystem obtains the first semantic text "me" with timeline information 1s; and the Tianjin dialect and English subsystems recognize no first semantic text. Thus "my" and "me", with their respective timeline information 2s and 1s, are stored in the historical word segmentation sets of the Mandarin and Northeastern Mandarin subsystems, respectively.
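The state of the historical word segmentation sets after this first pass might be represented as below; the subsystem names and the plain-dict representation are illustrative assumptions, not the patent's data format:

```python
# Each dialect subsystem keeps its own historical word segmentation set;
# after the first pass over the 6 s example only two subsystems hold a
# first semantic text, stored here as (text, timeline) pairs.
history_sets = {
    "mandarin":  [("my", 2.0)],  # first 2 s recognized as "my"
    "northeast": [("me", 1.0)],  # first 1 s recognized as "me"
    "tianjin":   [],             # no first semantic text obtained
    "english":   [],
}

# Unprocessed audio per subsystem: total duration minus the last timeline
# point; subsystems with empty sets drop out of further processing.
total = 6.0
active = {name: total - entries[-1][1]
          for name, entries in history_sets.items() if entries}
```

The `active` mapping reflects the situation used in the later steps: 4s of unprocessed audio remain on the Mandarin branch and 5s on the Northeastern Mandarin branch.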
S103, obtaining the unprocessed target speech corresponding to each dialect recognition subsystem, taking each in turn as a new target speech, and returning to execute the operation of acquiring the semantic text and timeline information corresponding to the target speech, until no dialect recognition subsystem has any corresponding unprocessed target speech.
Wherein the unprocessed target speech refers to a portion of the target speech that has not been processed.
It can be understood that, since each dialect recognition subsystem recognizes the target speech using a single word as the recognition unit, when the total audio duration of the initial speech to be recognized exceeds the audio duration of one word, each subsystem may still have unprocessed target speech after a processing pass. In that case, each piece of unprocessed target speech is taken as a new target speech and the operation of obtaining the semantic text and timeline information is executed again; this repeats until no subsystem has any corresponding unprocessed target speech.
It should be noted that, while cyclically executing the operation of acquiring the semantic text and timeline information of the target speech, a dialect recognition subsystem may process a target speech without obtaining any semantic text. In that case, the unprocessed target speech corresponding to that subsystem is the original target speech (equivalent to the subsystem not having processed it, or the processing result being invalid), and subsequent processing of that target speech on this branch stops immediately; that is, only target speech that yields a semantic text proceeds to the next operation.
For example, in the above example, since the Tianjin dialect and English subsystems obtained no first semantic text, the first semantic text of the initial speech to be recognized cannot be Tianjin dialect or English, so subsequent recognition proceeds only from the first semantic texts in Mandarin and Northeastern Mandarin. Because the timeline information of "my" is 2s and that of "me" is 1s, both smaller than the total audio duration of 6s, the last 4s and the last 5s of the audio are the unprocessed target speech of the Mandarin and Northeastern Mandarin subsystems, respectively. These two unprocessed segments are determined in turn as new target speech (i.e., there are now two target speeches), and the operation of acquiring semantic texts and timeline information is executed again; that is, the semantic texts and corresponding timeline information obtained by at least one dialect recognition subsystem processing each of the two new target speeches are acquired. These steps loop until no dialect recognition subsystem has any corresponding unprocessed target speech.
S104, for the historical word segmentation set corresponding to each dialect recognition subsystem, combining the semantic texts in the set with their corresponding timeline information to form at least one word segmentation sequence, and forming the word segmentation sequence set of the corresponding dialect recognition subsystem from these word segmentation sequences.
A word segmentation sequence can be understood as a sequence of semantic texts that takes the first semantic text in a historical word segmentation set as the first recognized word and concatenates several semantic texts based on their timeline information. One word segmentation sequence is one possible recognition result for the whole initial speech to be recognized. The word segmentation sequence set includes all such sequences and represents the set of all possible recognition results for the whole initial speech to be recognized.
It can be understood that, for a multi-dialect mixed speech recognition system comprising n dialect recognition subsystems, there are at most n^m possible recognition results for an initial speech to be recognized (i.e., at most n^m word segmentation sequences), where n and m are positive integers greater than or equal to 1 and m is the maximum number of processing passes of a dialect recognition subsystem. Although the number of all possible recognition results is large, many candidate word segmentation sequences are never actually formed as processing proceeds, since the processing itself is a screening process; in the above example, by the end of processing the only sequence actually formed may be "my - boss - particularly good" (where "boss" corresponds to the English segment and "particularly good" to the Tianjin dialect segment).
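The n^m upper bound can be checked by enumerating every way of assigning one of the n subsystems to each of the m word positions; the figures below are illustrative, not taken from the patent:

```python
from itertools import product

# n dialect subsystems, at most m processing passes (recognized words):
# before any pruning, each of the m word positions could in principle be
# produced by any of the n subsystems, giving at most n**m candidate
# word segmentation sequences.
n, m = 4, 3
candidates = list(product(range(n), repeat=m))
assert len(candidates) == n ** m  # 4**3 == 64
```

In practice the screening described above prunes almost all of these candidates, leaving only sequences whose timelines and dialect assignments are actually consistent with the audio.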
S105, determining the recognition result of the initial speech to be recognized from the word segmentation sequence set corresponding to each dialect recognition subsystem.
It can be understood that after acquiring a set of all possible recognition results corresponding to the entire initial speech to be recognized, that is, the segmentation sequence set, one or more segmentation sequences that can best express the content expressed by the initial speech to be recognized may be screened out from the segmentation sequence set according to a preset rule as a final recognition result of the initial speech to be recognized.
Optionally, the word segmentation sequences in the sets corresponding to the dialect recognition subsystems are gathered into a full-text parsing subsystem, which scores each word segmentation sequence based on a preset scoring rule;
and at least one word segmentation sequence is selected as the recognition result of the initial speech to be recognized based on the full-text parsing subsystem's scoring results.
The full-text parsing subsystem is a semantic analysis subsystem that performs further semantic analysis on the word segmentation sequences so as to screen and judge each of them. Optionally, the full-text parsing subsystem is a full-text natural language processing (NLP) subsystem; optionally, the full-text NLP subsystem may be configured to analyze the semantic coherence of each word segmentation sequence and score it according to the preset rule.
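A minimal sketch of this scoring-and-selection step is given below. Since the patent leaves the preset scoring rule unspecified, `toy_score` is a placeholder assumption standing in for the full-text parsing subsystem's rule:

```python
def select_best(sequences, score_fn, top_k=1):
    """Score every word segmentation sequence with the full-text parsing
    subsystem's rule (`score_fn`, a stand-in here) and keep the top-k
    as the recognition result(s)."""
    ranked = sorted(sequences, key=score_fn, reverse=True)
    return ranked[:top_k]

# Placeholder scoring rule (an assumption, not the patent's): prefer
# sequences that cover the audio with fewer, longer words.
def toy_score(sequence):
    return -len(sequence)

candidates = [
    ["my", "boss", "particularly good"],
    ["me", "bo", "ss", "particular", "ly good"],
]
best = select_best(candidates, toy_score)
```

Any callable mapping a word segmentation sequence to a number can serve as the scoring rule, so a real NLP coherence model could be dropped in without changing the selection step.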
The embodiment of the present invention builds on existing dialect recognition subsystems, each corresponding to one dialect: the multi-dialect mixed speech file is processed in blocks to obtain the possible word combinations of the whole file, and all combinations are finally input into the full-text parsing subsystem to be scored so that the best can be selected as the speech recognition result, thereby achieving effective recognition of multi-dialect mixed speech files while ensuring high recognition accuracy.
Further, as an optional refinement of the first embodiment, the dialect recognition subsystem preferably includes: a speech-to-text component and a semantic parsing component.
Wherein the speech-to-text component can be understood as the component of a dialect recognition subsystem that converts the processed portion of the input target speech into an initial text; the semantic parsing component can be understood as the component that further parses the initial text obtained by the speech-to-text component to obtain the corresponding standard generalized written text. Optionally, the speech-to-text component can convert dialects other than Mandarin in the target speech into corresponding transliterated texts, and the semantic parsing component can convert those transliterated texts into standard generalized written Mandarin texts. Optionally, the semantic parsing component is the NLP component of the corresponding dialect recognition subsystem.
Further, for each dialect recognition subsystem, the step of processing the target speech by the dialect recognition subsystem to obtain a semantic text and timeline information corresponding to the semantic text includes:
performing speech recognition on the target speech through the speech-to-text component of the dialect recognition subsystem to obtain a speech text corresponding to the target speech and the timeline information corresponding to that speech text;
performing semantic parsing on the speech text through the semantic parsing component of the dialect recognition subsystem; if a semantic text corresponding to the speech text is obtained, determining it as the semantic text corresponding to the target speech and determining the timeline information as the timeline information corresponding to that semantic text; if no semantic text can be obtained, judging that the dialect of the speech text does not match the dialect of the processed target speech, and discarding the speech text and the timeline information.
In this optional embodiment, the dialect recognition subsystem is further refined, and the steps by which it processes the target speech to obtain the semantic text and corresponding timeline information are provided, laying a foundation for obtaining the semantic text and timeline information.
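The two-stage step just described, speech-to-text followed by semantic parsing with a discard on mismatch, can be sketched as follows. Both stage callables are assumed stand-ins for the real components, and the transliteration "wo de" is an illustrative example:

```python
def process_step(subsystem, audio, start):
    """One processing step of a dialect recognition subsystem: a
    speech-to-text stage followed by a semantic parsing stage."""
    stt = subsystem["speech_to_text"](audio, start)
    if stt is None:
        return None
    raw_text, end_time = stt
    semantic = subsystem["semantic_parse"](raw_text)
    if semantic is None:
        # Dialect mismatch: discard the speech text and timeline info.
        return None
    return semantic, end_time

# Stub components for a hypothetical Mandarin subsystem: the
# transliterated text "wo de" parses to the written text "my",
# with the processed audio ending at 2.0 s.
mandarin = {
    "speech_to_text": lambda audio, start: ("wo de", start + 2.0),
    "semantic_parse": {"wo de": "my"}.get,
}
result = process_step(mandarin, audio=None, start=0.0)
```

A subsystem for the wrong dialect would return `None` from either stage, which is exactly the discard case described above.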
Fig. 2 is a flowchart illustrating a method for recognizing multi-dialect mixed speech according to an embodiment of the present invention.
Example two
Fig. 3 is a schematic flow chart of a method for recognizing multi-dialect mixed speech according to a second embodiment of the present invention, which further optimizes the first embodiment. In this embodiment, adding each semantic text and timeline information to the historical word segmentation set of the corresponding dialect recognition subsystem is embodied as: judging whether the target speech corresponding to each semantic text is the initial speech to be recognized; if so, determining the semantic text as a first semantic text of the dialect recognition subsystem that generated it, and adding a binary information group formed by the first semantic text and the timeline information into the historical word segmentation set of the subsystem that generated the first semantic text; if not, determining the adjacent semantic text corresponding to the semantic text based on the target speech, and adding a ternary information group consisting of the semantic text, the adjacent semantic text, and the timeline information into the historical word segmentation set containing the first semantic text corresponding to the semantic text.
In this embodiment, forming at least one word segmentation sequence from the semantic texts in each historical word segmentation set combined with the corresponding timeline information, and forming the word segmentation sequence set of the corresponding dialect recognition subsystem, is embodied as: for the historical word segmentation set corresponding to each dialect recognition subsystem, acquiring the first semantic text in the set; for each ternary information group in the set, determining its adjacent ternary information group based on the adjacent semantic text and timeline information it contains; arranging the mutually adjacent ternary information groups in the order of their timeline information to form at least one ternary information group sequence; and, for each ternary information group sequence, taking the corresponding semantic texts out of its ternary information groups in order and forming a word segmentation sequence headed by the first semantic text.
As shown in Fig. 3, the method for recognizing multi-dialect mixed speech provided in this embodiment specifically includes the following steps:
s201, taking the initial voice to be recognized as target voice, and acquiring semantic text and time line information corresponding to the semantic text, wherein the semantic text is obtained by processing the target voice by at least one dialect recognition subsystem.
Optionally, if the categories of the dialects contained in the initial speech to be recognized are known, the dialect recognition subsystem corresponding to each known dialect is determined based on those categories;
and the semantic texts obtained by each determined dialect recognition subsystem processing the target speech, together with the timeline information corresponding to those semantic texts, are acquired.
It can be understood that when the types of dialects contained in the initial speech to be recognized are known, only the semantic texts and corresponding timeline information obtained from the subsystems corresponding to the known dialects need be acquired, which greatly reduces the subsequent processing load.
S202, judging whether the target speech corresponding to each semantic text is the initial speech to be recognized; if so, executing S203; otherwise, executing S204.
It can be understood that, by judging whether the target speech corresponding to each semantic text is the initial speech to be recognized, it can be determined whether the semantic text is the first semantic text corresponding to each dialect recognition subsystem.
S203, determining the semantic text as a first semantic text of the dialect recognition subsystem that generated it, and adding a binary information group consisting of the first semantic text and the timeline information into the historical word segmentation set of the subsystem that generated the first semantic text; then executing S205.
It will be appreciated that every possible recognition result for the whole initial speech to be recognized must take one of the first semantic texts as its first recognized word.
S204, determining the adjacent semantic text corresponding to the semantic text based on the target speech, and adding a ternary information group consisting of the semantic text, the adjacent semantic text, and the timeline information into the historical word segmentation set containing the first semantic text corresponding to this semantic text; then executing S205.
Wherein the adjacent semantic text can be understood as a previous semantic text associated with the semantic text.
It can be understood that the timeline information of different semantic texts, or even of different combinations of semantic texts, may coincide; therefore, if the word segmentation sequences for the subsequent semantic texts of a first semantic text were formed from timeline information alone, spurious word segmentation sequences could be formed, increasing the processing load. By also recording the adjacent semantic text when a semantic text and its timeline information are added to the historical word segmentation set containing its first semantic text, the word segmentation sequences can be determined based on both the adjacent semantic text and the timeline information, eliminating the formation of spurious sequences.
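The bookkeeping of S203 and S204 can be sketched as follows; the tuple layouts and subsystem name are illustrative assumptions:

```python
def add_to_history(history, subsystem, text, end_time, prev_text=None):
    """Add a recognized word to a subsystem's historical word segmentation
    set: a first semantic text (no predecessor) is stored as a binary
    group (text, timeline); any later one as a ternary group
    (text, adjacent_text, timeline), so that sequences can later be
    chained on both timeline continuity and the recorded predecessor."""
    entry = (text, end_time) if prev_text is None else (text, prev_text, end_time)
    history.setdefault(subsystem, []).append(entry)

history = {}
add_to_history(history, "mandarin", "my", 2.0)                    # binary group (S203)
add_to_history(history, "mandarin", "boss", 4.0, prev_text="my")  # ternary group (S204)
add_to_history(history, "mandarin", "particularly good", 6.0, prev_text="boss")
```

Recording the predecessor in each ternary group is what lets the later chaining step reject pairings whose timelines happen to line up but whose words were never actually adjacent.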
S205, obtaining the unprocessed target speech corresponding to each dialect recognition subsystem, taking each in turn as a new target speech, and returning to execute the operation of acquiring the semantic text and timeline information corresponding to the target speech, until no dialect recognition subsystem has any corresponding unprocessed target speech.
S206, judging whether any dialect recognition subsystem still has corresponding unprocessed target speech; if not, executing S207; otherwise, returning to execute S205.
S207, for the historical word segmentation set corresponding to each dialect recognition subsystem, acquiring the first semantic text in the set.
S208, for each ternary information group in the historical word segmentation set, determining its adjacent ternary information group based on the adjacent semantic text and timeline information in the group.
If two ternary information groups are such that their timeline information is continuous and the semantic text in the earlier group is the adjacent semantic text in the later group, the two groups are mutually adjacent ternary information groups.
Alternatively, all possible adjacent ternary information groups of each ternary information group can first be determined based on the timeline information in the group, and the actual adjacent ternary information group can then be determined from among them based on the adjacent semantic text contained in the group.
S209, arranging the mutually adjacent ternary information groups according to the order of their timeline information to form at least one ternary information group sequence.
It will be appreciated that the ternary information groups can readily be concatenated according to the adjacent ternary information groups determined for each group. Although each historical word segmentation set contains only one first semantic text, that first semantic text may have a plurality of adjacent semantic texts, so more than one ternary information group sequence may be obtained from a single historical word segmentation set.
S210, for each ternary information group sequence, taking the corresponding semantic texts out of its ternary information groups in order, and forming a word segmentation sequence headed by the first semantic text.
It is understood that each ternary information group sequence uniquely corresponds to one word segmentation sequence.
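Steps S207 to S210 might be sketched as below. Treating "the timeline information is continuous" simply as "the later group's end time is greater" is a simplifying assumption, as is matching adjacency by the recorded predecessor text:

```python
def build_sequences(first, triples):
    """Chain ternary groups (text, adjacent_text, end_time) into word
    segmentation sequences headed by the first semantic text `first`
    (a binary group (text, end_time)). Two groups are treated as
    adjacent when the later group's adjacent_text equals the earlier
    group's text and its end time comes later."""
    sequences = []

    def extend(seq, last_text, last_end):
        nexts = [t for t in triples if t[1] == last_text and t[2] > last_end]
        if not nexts:                 # no adjacent group: sequence complete
            sequences.append(seq)
            return
        for text, _, end in nexts:    # branch on every adjacent group
            extend(seq + [text], text, end)

    extend([first[0]], first[0], first[1])
    return sequences

triples = [("boss", "my", 4.0), ("particularly good", "boss", 6.0)]
seqs = build_sequences(("my", 2.0), triples)
```

With the example's groups this yields the single sequence "my - boss - particularly good"; when a first semantic text has several adjacent groups, the recursion branches and one historical set yields several sequences, matching the observation above.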
S211, gathering the word segmentation sequences in the set corresponding to each dialect recognition subsystem into the full-text parsing subsystem, and scoring each word segmentation sequence through the full-text parsing subsystem based on a preset scoring rule.
S212, selecting at least one word segmentation sequence as the recognition result of the initial speech to be recognized based on the full-text parsing subsystem's scoring results.
The embodiment of the present invention builds on existing dialect recognition subsystems, each corresponding to one dialect: the multi-dialect mixed speech file is processed in blocks to obtain the possible word combinations of the whole file, and all combinations are finally input into the full-text parsing subsystem to be scored so that the best can be selected as the speech recognition result, thereby achieving effective recognition of multi-dialect mixed speech files while ensuring high recognition accuracy.
Example three
Fig. 4 is a schematic structural diagram of an apparatus for recognizing multi-dialect mixed speech according to a third embodiment of the present invention. This embodiment is applicable to cases where effective recognition of multi-dialect mixed speech is implemented based on existing dialect recognition subsystems corresponding to the respective dialects. The apparatus may be implemented by software and/or hardware, and specifically includes: a semantic obtaining module 301, a semantic adding module 302, an unprocessed obtaining module 303, a sequence forming module 304, and a result determining module 305. Wherein:
a semantic obtaining module 301, configured to take the initial speech to be recognized as the target speech and acquire a semantic text obtained by processing the target speech by at least one dialect recognition subsystem, together with the timeline information corresponding to the semantic text, where the types of dialects covered by the dialect recognition subsystems at least include the types of dialects contained in the initial speech to be recognized;
a semantic adding module 302, configured to add each semantic text and its timeline information to the historical word segmentation set of the corresponding dialect recognition subsystem;
an unprocessed obtaining module 303, configured to obtain the unprocessed target speech corresponding to each dialect recognition subsystem, take each in turn as a new target speech, and return to execute the operation of acquiring the semantic text and timeline information corresponding to the target speech until no dialect recognition subsystem has any corresponding unprocessed target speech;
a sequence forming module 304, configured to form, for the historical word segmentation set corresponding to each dialect recognition subsystem, at least one word segmentation sequence from the semantic texts in the set combined with the corresponding timeline information, and to form the word segmentation sequence set of the corresponding subsystem from these sequences;
a result determining module 305, configured to determine the recognition result of the initial speech to be recognized from the word segmentation sequence sets corresponding to the dialect recognition subsystems.
On the basis of the above embodiments, the dialect recognition subsystem includes: a voice-text conversion component and a semantic analysis component;
correspondingly, in the semantic obtaining module 301, for each dialect recognition subsystem, the step of processing the target speech by the dialect recognition subsystem to obtain a semantic text and timeline information corresponding to the semantic text includes:
performing speech recognition on the target speech through the voice-text conversion component in the dialect recognition subsystem to obtain a voice text corresponding to the target speech and timeline information corresponding to the voice text;
performing semantic analysis on the voice text through the semantic analysis component in the dialect recognition subsystem; if a semantic text corresponding to the voice text is obtained, determining that semantic text as the semantic text corresponding to the target speech, and determining the timeline information as the timeline information corresponding to the semantic text; if no semantic text corresponding to the voice text can be obtained, judging that the dialect of the voice text does not match the dialect handled by the dialect recognition subsystem that processed the target speech, and discarding the voice text and the timeline information.
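The two-stage processing of a single dialect recognition subsystem described above can be sketched as follows. This is a minimal illustration outside the claims, not the patented implementation: the component interfaces (`stt_fn`, `analyze_fn`) and the `Timeline` layout are assumptions, since the specification names the components but not their APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Timeline:
    start: float  # offset in seconds from the start of the initial speech
    end: float


class DialectSubsystem:
    """One dialect recognition subsystem: voice-text conversion
    followed by semantic analysis (hypothetical interface)."""

    def __init__(self,
                 stt_fn: Callable[[object], Tuple[str, Timeline]],
                 analyze_fn: Callable[[str], Optional[str]]):
        self.stt_fn = stt_fn          # voice-text conversion component
        self.analyze_fn = analyze_fn  # semantic analysis component

    def process(self, target_speech) -> Optional[Tuple[str, Timeline]]:
        # Step 1: speech recognition yields a voice text and its timeline.
        voice_text, timeline = self.stt_fn(target_speech)
        # Step 2: semantic analysis; a failure is taken to mean the
        # dialect does not match, so the voice text is discarded.
        semantic_text = self.analyze_fn(voice_text)
        if semantic_text is None:
            return None
        return semantic_text, timeline
```

A subsystem whose semantic analysis fails simply yields nothing, which models the discard branch described above.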
On the basis of the foregoing embodiments, the semantic adding module 302 includes:
a speech judging unit, configured to judge, for each semantic text, whether the target speech corresponding to the semantic text is the initial speech to be recognized;
a binary adding unit, configured to, if the target speech corresponding to the semantic text is the initial speech to be recognized, determine the semantic text as the first semantic text of the dialect recognition subsystem that generated it, and add a binary information group consisting of the first semantic text and its timeline information to the history word segmentation set corresponding to that dialect recognition subsystem;
and a ternary adding unit, configured to, if the target speech corresponding to the semantic text is not the initial speech to be recognized, determine the adjacent semantic text corresponding to the semantic text based on the target speech, and add a ternary information group consisting of the semantic text, the adjacent semantic text and the timeline information to the history word segmentation set in which the first semantic text corresponding to the semantic text is located.
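The binary/ternary bookkeeping above can be illustrated with a small helper. The tuple layouts are assumptions made for this sketch: the specification names the information groups but does not fix a concrete representation.

```python
def add_to_history(history, semantic_text, timeline,
                   is_initial, adjacent_text=None):
    """Append a binary or ternary information group to a history
    word segmentation set (represented here as a plain list)."""
    if is_initial:
        # First semantic text of the subsystem: binary group.
        history.append((semantic_text, timeline))
    else:
        # Later text: ternary group linking it to the semantic text
        # of the immediately preceding target speech.
        history.append((semantic_text, adjacent_text, timeline))
```

The adjacent text thus records which earlier recognition result each later fragment follows, which is what allows sequences to be chained back together.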
On the basis of the above embodiments, the sequence forming module 304 includes:
a first-text obtaining unit, configured to obtain, for each dialect recognition subsystem, the first semantic text in the history word segmentation set corresponding to that subsystem;
an adjacency determining unit, configured to determine, for each ternary information group in the history word segmentation set, the adjacent ternary information group of that ternary information group based on the adjacent semantic text and the timeline information in the ternary information group;
a group sequence forming unit, configured to arrange each ternary information group and its adjacent ternary information groups in the order given by the timeline information in the ternary information groups, so as to form at least one ternary information group sequence;
and a word segmentation sequence forming unit, configured to take out, for each ternary information group sequence, the corresponding semantic texts from the ternary information groups of the sequence in order, and to form a word segmentation sequence with the first semantic text as the head of the sequence.
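Under the same assumed tuple layouts, the chaining performed by the four units above can be sketched as a walk from the first semantic text along matching adjacent texts; where several groups are adjacent to the same text, the walk branches into separate sequences ordered by timeline start. This is an illustrative reconstruction, not the claimed algorithm.

```python
def form_word_sequences(first_text, ternary_groups):
    """Build word segmentation sequences headed by first_text.

    ternary_groups: list of (text, adjacent_text, (start, end)).
    Assumes the adjacency relation is acyclic, as it is derived
    from a forward-moving timeline."""
    sequences = []

    def extend(seq):
        # Find every ternary group adjacent to the last text in seq.
        successors = [g for g in ternary_groups if g[1] == seq[-1]]
        if not successors:
            sequences.append(seq)  # no continuation: sequence complete
            return
        # Branch on each successor, earliest timeline start first.
        for g in sorted(successors, key=lambda g: g[2][0]):
            extend(seq + [g[0]])

    extend([first_text])
    return sequences
```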
On the basis of the above embodiments, the result determining module 305 includes:
a sequence gathering unit, configured to gather the word segmentation sequences in the word segmentation sequence set corresponding to each dialect recognition subsystem into the full-text recognition subsystem, so that each word segmentation sequence is scored by the full-text recognition subsystem based on a preset scoring rule;
and a result determining unit, configured to select at least one word segmentation sequence as the recognition result of the initial speech to be recognized based on the scoring result of the full-text recognition subsystem for each word segmentation sequence.
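The scoring-and-selection step can be sketched generically. The preset scoring rule itself is not specified in the text, so `score_fn` is abstract here and the toy rule below is purely an assumption for illustration.

```python
def select_results(sequences, score_fn, top_k=1):
    """Score every word segmentation sequence with the full-text
    recognition subsystem's rule (abstracted as score_fn) and
    return the top_k highest-scoring sequences."""
    return sorted(sequences, key=score_fn, reverse=True)[:top_k]


# Toy scoring rule for illustration only: prefer longer sequences,
# i.e. candidates that account for more of the speech.
def toy_score(sequence):
    return len(sequence)
```

In practice the full-text recognition subsystem would presumably apply a fluency- or language-model-style rule over the whole candidate text rather than this length heuristic.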
The apparatus for recognizing multi-dialect mixed speech provided by the embodiment of the present invention can execute the method for recognizing multi-dialect mixed speech provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing the method.
Example four
Fig. 5 is a schematic structural diagram of a multi-dialect mixed speech recognition system according to a fourth embodiment of the present invention. As shown in fig. 5, the multi-dialect mixed speech recognition system includes a processor 40, a memory 41, an input device 42 and an output device 43; the number of processors 40 in the multi-dialect mixed speech recognition system may be one or more, and one processor 40 is taken as an example in fig. 5; the processor 40, the memory 41, the input device 42 and the output device 43 in the multi-dialect mixed speech recognition system may be connected by a bus or in other ways, and fig. 5 takes connection by a bus as an example.
The memory 41, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for recognizing multi-dialect mixed speech in the embodiment of the present invention (for example, the semantic obtaining module 301, the semantic adding module 302, the unprocessed obtaining module 303, the sequence forming module 304, and the result determining module 305 in the apparatus for recognizing multi-dialect mixed speech). The processor 40 executes various functional applications and data processing of the multi-dialect mixed speech recognition system by running the software programs, instructions and modules stored in the memory 41, that is, implements the above-described method for recognizing multi-dialect mixed speech.
The memory 41 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 41 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memories remotely located relative to the processor 40, and these remote memories may be connected to the multi-dialect mixed speech recognition system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the multi-dialect mixed speech recognition system. The output device 43 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method for recognizing multi-dialect mixed speech, the method including:
taking the initial speech to be recognized as a target speech, and obtaining the semantic text produced by at least one dialect recognition subsystem processing the target speech, together with the timeline information corresponding to the semantic text, wherein the types of dialects covered by the dialect recognition subsystems at least include the types of dialects contained in the initial speech to be recognized;
adding each semantic text and its timeline information to the history word segmentation set of the corresponding dialect recognition subsystem;
obtaining the unprocessed target speech corresponding to each dialect recognition subsystem, taking each in turn as a new target speech, and returning to the operation of obtaining the semantic text and timeline information corresponding to the target speech, until no unprocessed target speech remains for any dialect recognition subsystem;
forming, for the history word segmentation set corresponding to each dialect recognition subsystem, at least one word segmentation sequence by combining the semantic texts in the history word segmentation set with their corresponding timeline information, and forming the word segmentation sequence set of the corresponding dialect recognition subsystem based on the word segmentation sequences;
and determining the recognition result of the initial speech to be recognized from the word segmentation sequence set corresponding to each dialect recognition subsystem.
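The method steps above can be tied together in one driver loop. Everything here is a hedged sketch: the subsystem interface (`process` returning a semantic text, its timeline, and any still-unprocessed remainder of the speech) and the two helper callbacks are assumptions introduced for illustration, not the claimed implementation.

```python
def recognize(initial_speech, subsystems, form_sequences_fn, pick_result_fn):
    """Drive the recognition method end to end (illustrative only).

    Each subsystem is assumed to expose process(speech) returning
    (semantic_text, timeline, remainder) on success, where remainder
    is the unprocessed tail of the speech (None when fully consumed),
    or None when the dialect does not match."""
    histories = {i: [] for i, _ in enumerate(subsystems)}
    pending = {i: [initial_speech] for i, _ in enumerate(subsystems)}

    # Keep processing until no subsystem has unprocessed target speech.
    while any(pending.values()):
        for i, subsystem in enumerate(subsystems):
            queue, pending[i] = pending[i], []
            for speech in queue:
                result = subsystem.process(speech)
                if result is None:
                    continue  # dialect mismatch: discard
                text, timeline, remainder = result
                histories[i].append((text, timeline))
                if remainder is not None:
                    pending[i].append(remainder)  # new target speech

    # Form per-subsystem word segmentation sequences, then select.
    candidates = [seq for i in histories
                  for seq in form_sequences_fn(histories[i])]
    return pick_result_fn(candidates)
```

The loop makes the "return to the obtaining operation" step explicit: each unprocessed remainder is fed back as a new target speech until every subsystem's queue is empty.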
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the method for recognizing multi-dialect mixed speech provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the apparatus for recognizing a multi-dialect mixed speech, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.