CN107016994B - Voice recognition method and device - Google Patents


Info

Publication number
CN107016994B
Authority
CN
China
Prior art keywords
voice
recognized
named entity
recognition
name
Prior art date
Legal status
Active
Application number
CN201610057651.3A
Other languages
Chinese (zh)
Other versions
CN107016994A (en)
Inventor
李宏言
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201610057651.3A priority Critical patent/CN107016994B/en
Publication of CN107016994A publication Critical patent/CN107016994A/en
Application granted granted Critical
Publication of CN107016994B publication Critical patent/CN107016994B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L2015/086 Recognition of spelled words

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method and device. The method comprises the following steps: performing Chinese-character-based voice recognition on the named entity speech to be recognized, so as to recognize a Chinese character sequence serving as the Chinese character recognition result of that speech; performing pinyin-based voice recognition on the same speech, so as to recognize a pinyin sequence serving as its pinyin recognition result; determining the similarity between each candidate named entity in a specific named entity list and the named entity speech to be recognized according to the recognized Chinese character sequence and pinyin sequence; and determining the voice recognition result of the named entity speech from the specific named entity list according to those similarities. The method and device improve the accuracy of named entity voice recognition.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method and an apparatus for speech recognition.
Background
Existing speech recognition techniques typically recognize speech using a speech recognition network consisting of a language model and an acoustic model. The acoustic model is generated by running a training algorithm over a training speech database; during recognition, the feature parameters of the speech to be recognized are matched against the acoustic model to obtain a recognition result. The language model is generated by performing grammatical and semantic analysis on a training text database and training based on a statistical model; combining grammatical and semantic knowledge, it can describe the internal relations between words.
Named Entities (NEs) are specific names with entity significance, such as person names, place names, organization names, and song titles, and may also include times, dates, and numeric phrases. The recognition accuracy for named entities in existing speech recognition systems is low, yet in some scenarios further recognition of named entities, such as song titles and contact names, is often required. One reason is that named entities are typically short (e.g., a song title of only one or two characters), so it is difficult to recognize them effectively with the language model and the acoustic model, resulting in low recognition accuracy. Moreover, many named entities are easily confused with one another; for example, "Henan" and "Netherlands" sound similar in Chinese, and it is difficult to identify which is meant without context. Still other named entities do not follow ordinary language rules, e.g., song titles taken from popular internet phrases such as "what to do". Both of these cases further increase the difficulty of speech recognition for particular types of named entities.
Disclosure of Invention
It is an object of the present application to improve the accuracy of recognition of named entity speech.
According to an embodiment of the present application, there is provided a method of speech recognition, the method of speech recognition comprising the steps of:
performing voice recognition on the voice of the named entity to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the voice of the named entity to be recognized;
performing voice recognition on the named entity voice to be recognized by using voice recognition based on pinyin so as to recognize a pinyin sequence serving as a pinyin recognition result of the named entity voice to be recognized;
determining the similarity between each candidate named entity in a specific named entity list and the voice of the named entity to be identified according to the recognized Chinese character sequence and the recognized pinyin sequence;
and determining a voice recognition result of the named entity voice to be recognized from the specific named entity list according to the similarity between each candidate named entity and the named entity voice to be recognized.
According to an embodiment of the present application, there is provided a person name voice search method, comprising the following steps:
matching the voice command to be recognized with a pre-stored voice command template so as to obtain the name voice to be recognized in the voice command to be recognized;
carrying out voice recognition on the name voice to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the name voice to be recognized;
performing voice recognition on the name voice to be recognized by using voice recognition based on pinyin so as to recognize a pinyin sequence serving as a pinyin recognition result of the name voice to be recognized;
determining the similarity between each candidate name in the specific name list and the voice of the name to be recognized according to the recognized Chinese character sequence and the recognized pinyin sequence;
and determining a voice recognition result of the voice of the name to be recognized from the specific name list according to the similarity between each candidate name and the voice of the name to be recognized.
According to an embodiment of the present application, there is provided a song voice search method including:
matching the voice command to be recognized with a pre-stored voice command template so as to obtain the song name voice to be recognized in the voice command to be recognized;
performing voice recognition on the song title voice to be recognized by utilizing voice recognition based on Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the song title voice to be recognized;
performing voice recognition on the song name voice to be recognized by using voice recognition based on pinyin to recognize a pinyin sequence as a result of the pinyin recognition of the song name voice to be recognized;
determining the similarity between each candidate song name in a specific song name list and the voice of the song name to be identified according to the recognized Chinese character sequence and the recognized pinyin sequence;
and determining the voice recognition result of the song name voice to be recognized from the specific song name list according to the similarity between each candidate song name and the song name voice to be recognized.
According to an embodiment of the present application, there is provided a method for establishing a communication connection through voice, including:
matching the voice command to be recognized with a pre-stored voice command template so as to obtain the name voice to be recognized in the voice command to be recognized;
carrying out voice recognition on the name voice to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the name voice to be recognized;
performing voice recognition on the name voice to be recognized by using voice recognition based on pinyin so as to recognize a pinyin sequence serving as a pinyin recognition result of the name voice to be recognized;
determining the similarity between each name in the user address list and the voice of the name to be recognized according to the recognized Chinese character sequence and the recognized pinyin sequence;
determining a voice recognition result of the voice of the name to be recognized from the user address list according to the similarity between each name and the voice of the name to be recognized;
and initiating a communication connection to the user in the address list corresponding to the name determined as the voice recognition result.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the embodiment of the application, on the basis of obtaining the recognition result of the Chinese character form by performing conventional voice recognition on the voice of the named entity to be recognized, pinyin recognition is further performed to obtain the recognition result of the pinyin form, and the final voice recognition result of the named entity to be recognized is determined in the specific named entity list according to the recognized Chinese character recognition result and the pinyin recognition result, instead of only depending on the recognition result of the Chinese character form to determine the final voice recognition result in the specific named entity list, so that the accuracy of recognizing the voice of the named entity is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of a method of speech recognition provided in one embodiment of the present application;
FIG. 2 is a schematic diagram of a current general architecture for speech recognition;
FIG. 3 is a flowchart illustrating an embodiment of determining similarity between a candidate named entity and a speech of a named entity to be recognized;
FIG. 4 is a flow diagram of a method of speech recognition according to another embodiment of the present application;
FIG. 5 is a flowchart of a person name voice search method according to an embodiment of the present application;
FIG. 6 is a flowchart of a song voice search method according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for establishing a communication connection through voice according to an embodiment of the present application;
FIG. 8 is a block diagram of a speech recognition device according to an embodiment of the present application;
FIG. 9 is a detailed block diagram of a similarity determination unit according to an embodiment of the present application;
FIG. 10 is a block diagram of a speech recognition device according to another embodiment of the present application;
FIG. 11 is a block diagram of a named-person voice search apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram of a song voice search apparatus according to an embodiment of the present application;
fig. 13 is a block diagram of an apparatus for establishing a communication connection through voice according to an embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "computer device" or "computer" in this context refers to an intelligent electronic device that can execute predetermined processes such as numerical calculation and/or logic calculation by running predetermined programs or instructions, and may include a processor and a memory, wherein the processor executes a pre-stored instruction stored in the memory to execute the predetermined processes, or the predetermined processes are executed by hardware such as ASIC, FPGA, DSP, or a combination thereof. Computer devices include, but are not limited to, servers, personal computers, laptops, tablets, smart phones, and the like.
The computer equipment comprises user equipment and network equipment. The user equipment includes but is not limited to computers, smart phones, PDAs, etc.; the network equipment includes but is not limited to a single network server, a server group consisting of multiple network servers, or a cloud composed of a large number of computers or network servers based on Cloud Computing, where cloud computing is a form of distributed computing: a virtual supercomputer composed of a collection of loosely coupled computers. The computer equipment can run independently to implement the application, or can access a network and implement the application through interaction with other computer equipment in the network. The network in which the computer equipment is located includes but is not limited to the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, etc.
It should be noted that the user equipment, the network device, the network, etc. are only examples, and other existing or future computer devices or networks may also be included in the scope of the present application, if applicable, and are included by reference.
The methods discussed below, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Prior to detailing the processes of the embodiments of the present application, a brief description of prior art speech recognition is provided. FIG. 2 is a schematic diagram of a prior art speech recognition architecture. As shown in fig. 2, a speech database and a text database are generally built from a large amount of speech data and text data respectively; an acoustic model is trained on speech features extracted from the speech data, and a language model is trained on the text data. When input speech to be recognized is received, features are extracted from the speech and syllables are recognized through the acoustic model; a dictionary is then queried for the possible mappings between syllables and text, decoding is performed with the language model, and the text corresponding to the speech is output through a corresponding search algorithm.
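The decoding flow described above can be sketched as follows. The stub acoustic model, dictionary, and language model are hypothetical stand-ins for illustration only, not the patent's actual components.

```python
from itertools import product

def decode(speech_features, acoustic_model, dictionary, language_model):
    """Map speech features to text: acoustic model -> syllables ->
    dictionary lookup -> language-model scoring of character sequences."""
    syllables = acoustic_model(speech_features)      # e.g. ["zhang", "san"]
    candidates = [dictionary[s] for s in syllables]  # homophones per syllable
    best_text, best_score = None, float("-inf")
    for chars in product(*candidates):               # brute-force search
        text = "".join(chars)
        score = language_model(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text

# Toy usage with stub models:
am = lambda feats: ["zhang", "san"]
d = {"zhang": ["张", "章"], "san": ["三"]}
lm = lambda text: {"张三": 0.9, "章三": 0.4}.get(text, 0.0)
print(decode(None, am, d, lm))  # 张三
```

A real decoder would use a beam or WFST search rather than enumerating every combination; the brute-force loop only illustrates the syllable-to-text mapping being scored by the language model.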
The present application is described in further detail below with reference to the attached figures.
The embodiments of the application are generally applied after the named entity speech to be recognized has been located within the speech to be recognized. For example, for a song-search application in a smart speaker product, in order to search for a song the user may issue voice commands such as "I want to listen to ...". The formats of all commands a user may issue are stored as command templates, such as "I want to listen to [song]". When a user issues a voice command, such as "I want to listen to the song of Zhang San", the command first undergoes preliminary voice recognition, i.e., recognition by the acoustic model and language model in fig. 2, and is matched against the stored command templates. In general, the preliminary recognition will not make errors on the general vocabulary in the template, such as "I want to listen to", but only on the named-entity part, "the song of Zhang San": the speech and text used to train the acoustic model and language model mainly cover general vocabulary and are rarely trained on specialized vocabulary such as person names and song titles, so it is difficult to recognize which words make up the speech of "the song of Zhang San". The general vocabulary in the user's voice command is therefore recognized through preliminary voice recognition and matched with a stored command template, so as to locate the named entity speech to be recognized; matching "I want to listen to the song of Zhang San" against the template "I want to listen to [song]" isolates "the song of Zhang San" as the named entity speech. The subsequent process of the embodiments of the present application is then used to identify the named entity corresponding to that speech, i.e., to decide which of the homophonous candidates (for example, "Zhang San" written with different Chinese characters) it actually is.
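The template-matching step can be sketched as below. The template strings and the regular-expression representation are illustrative assumptions; the patent only specifies that commands are matched against pre-stored templates to isolate the named-entity speech.

```python
import re

# Hypothetical command templates; the slot captures the named-entity part.
TEMPLATES = [
    re.compile(r"^i want to listen to (?P<entity>.+)$"),
    re.compile(r"^play (?P<entity>.+)$"),
]

def extract_entity(preliminary_text: str):
    """Match preliminarily recognized text against the stored templates and
    return the named-entity portion, or None if no template matches."""
    for template in TEMPLATES:
        m = template.match(preliminary_text)
        if m:
            return m.group("entity")
    return None

print(extract_entity("i want to listen to the song of zhang san"))
# the song of zhang san
```

The extracted portion is what the Chinese-character and pinyin recognizers of steps S110 and S120 then operate on.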
Referring to fig. 1, in step S110, the named entity speech to be recognized is speech-recognized using chinese character-based speech recognition to recognize a chinese character sequence as a result of the chinese character recognition of the named entity speech to be recognized.
Chinese character-based speech recognition means that the Chinese character sequences of the texts are used when training the language model with the text data in the text database. That is, in the speech recognition architecture shown in fig. 2, the language model in fig. 2 is trained on the Chinese character sequences of the texts in the text database.
And recognizing the voice of the named entity to be recognized by utilizing the voice recognition based on the Chinese characters, wherein the output recognition result is a string of Chinese character sequences. For example, for the speech of the named entity "three by three", the recognition result is output as the Chinese character sequence "three by three".
Referring to fig. 1, in step S120, the speech of the named entity to be recognized is speech-recognized using pinyin-based speech recognition to recognize a pinyin sequence as a result of the pinyin recognition of the speech of the named entity to be recognized.
Pinyin-based speech recognition means that the pinyin sequences of the texts are used when training the language model with the text data in the text database. That is, in the speech recognition architecture shown in fig. 2, the language model in fig. 2 is trained on the pinyin sequences of the texts in the text database.
Chinese pinyin is the internationally accepted standard for the Latin transcription of Mandarin Chinese and is mainly used to transcribe the pronunciation of Chinese characters. Pinyin uses the 26 internationally common Latin letters, which are divided into initials and finals. The phonetic units of Chinese mainly include syllables and phonemes. One Chinese character in Chinese corresponds to one syllable; a syllable can be formed by an initial plus a final, or by a final alone. Phonemes are the smallest units of speech, divided according to the natural (physical and physiological) properties of speech.
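The decomposition of a syllable into an initial and a final can be sketched as follows. This is a minimal illustration: the initials list follows the standard pinyin scheme, but real pinyin segmentation has more edge cases than this sketch handles.

```python
# Pinyin initials; two-letter initials must be tried before their
# one-letter prefixes ("zh" before "z", etc.).
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable: str):
    """Split one pinyin syllable into (initial, final); a zero-initial
    syllable such as "an" is a final alone, so the initial is ''."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

def to_phonemes(syllables):
    """Flatten a syllable sequence into a phoneme (initial/final) sequence."""
    phonemes = []
    for s in syllables:
        ini, fin = split_syllable(s)
        if ini:
            phonemes.append(ini)
        phonemes.append(fin)
    return phonemes

print(to_phonemes(["zhang", "san"]))  # ['zh', 'ang', 's', 'an']
```

This mirrors the document's example: the syllable sequence "zhang san" corresponds to the phoneme sequence "zh ang s an".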
The embodiments of the application establish a speech recognition network based on the pinyin scheme. The pinyin-based speech recognition network is composed of an acoustic model and a pinyin-based language model. The acoustic model may be the same as that in the aforementioned Chinese character-based speech recognition network. The pinyin-based language model may be a syllable-based language model or a phoneme-based language model. Accordingly, step S120 has the following specific embodiments:
in a first embodiment, the pinyin-based speech recognition is syllable recognition. The pinyin sequence is a syllable sequence.
In this first embodiment, step 120 is specifically to perform syllable recognition on the named entity speech to be recognized, so as to recognize a syllable sequence as a result of the syllable recognition of the named entity speech to be recognized.
That is, the syllable recognition network composed of the acoustic model and the syllable-based language model is used for carrying out syllable recognition on the named entity voice to be recognized so as to recognize the syllable sequence which is the syllable recognition result of the named entity voice to be recognized. For example, for the speech of the named entity "zhang san", the syllable recognition is performed by the syllable recognition network, and then the syllable sequence "zhang san" is output.
In a second embodiment, the pinyin-based speech recognition is phoneme recognition, and the pinyin sequence is a phoneme sequence. In this second embodiment, step S120 specifically performs phoneme recognition on the named entity speech to be recognized, so as to recognize a phoneme sequence as the phoneme recognition result of the named entity speech to be recognized.
That is, phoneme recognition is performed on the named entity speech to be recognized by using a phoneme recognition network composed of an acoustic model and the phoneme-based language model, so as to recognize a phoneme sequence as the phoneme recognition result of the named entity speech to be recognized. For example, for the speech of the named entity "zhang san", the phoneme sequence "zh ang s an" is output after phoneme recognition through the phoneme recognition network.
In a third embodiment, based on the second embodiment, the step S120 may further include:
and performing tone recognition on the final phoneme in the recognized phoneme sequence to recognize a tone sequence serving as a tone recognition result of the named entity voice to be recognized.
There are four tones in Mandarin, commonly called the four tones: yinping (first tone), e.g. bā; yangping (second tone), e.g. bá; shangsheng (third tone), e.g. bǎ; and qusheng (fourth tone), e.g. bà. In speech recognition technology, a neutral tone (fifth tone) is also generally added. The finals in the recognized phoneme sequence are identified, and the recognized tones are added into the phoneme sequence to obtain a tone sequence, which serves as the tone recognition result of the named entity speech to be recognized. Specifically, the tone of each identified final is marked after that final, so that after marking, the tone sequence serving as the tone recognition result is obtained. For example, performing tone recognition on the phoneme sequence "zh ang s an" obtained by phoneme recognition yields the tone sequence "zh ang1 s an1".
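The tone-marking step can be sketched as follows. The set of finals and the per-final tone labels are assumed to come from upstream recognizers; only the interleaving of tones into the phoneme sequence is shown.

```python
def mark_tones(phonemes, tones, finals):
    """Attach tone digits (1-5, where 5 is the neutral tone) to the
    finals of a phoneme sequence, in order of appearance."""
    tone_iter = iter(tones)
    marked = []
    for p in phonemes:
        if p in finals:                      # only finals carry tone
            marked.append(f"{p}{next(tone_iter)}")
        else:                                # initials pass through unchanged
            marked.append(p)
    return marked

print(mark_tones(["zh", "ang", "s", "an"], [1, 1], finals={"ang", "an"}))
# ['zh', 'ang1', 's', 'an1']
```

This reproduces the example above: "zh ang s an" with first-tone labels on both finals becomes "zh ang1 s an1".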
In a fourth embodiment, the pinyin-based speech recognition includes syllable recognition and phoneme recognition, and the pinyin sequences include syllable sequences and phoneme sequences.
In the fourth embodiment, step S120 specifically includes:
performing syllable recognition on the named entity voice to be recognized so as to recognize a syllable sequence serving as a syllable recognition result of the named entity voice to be recognized; and
and performing phoneme recognition on the named entity voice to be recognized so as to recognize a phoneme sequence which is a phoneme recognition result of the named entity voice to be recognized.
The specific description of performing syllable recognition on the named entity speech to be recognized and performing phoneme recognition on the named entity speech to be recognized can also refer to the description in the first embodiment and the second embodiment.
In a fifth embodiment based on the fourth embodiment, step S120 further includes:
and performing tone recognition on the final phoneme in the recognized phoneme sequence to recognize a tone sequence serving as a tone recognition result of the named entity voice to be recognized.
For a detailed description of this step, reference may be made to the description of tone recognition of the final phoneme in the recognized phoneme sequence in the third embodiment, which is not repeated herein.
Referring to fig. 1, in step S130, the similarity between each candidate named entity in the specific named entity list and the speech of the named entity to be recognized is determined according to the recognized Chinese character sequence and the recognized pinyin sequence.
The similarity, i.e., the degree to which a candidate named entity matches the named entity speech to be recognized, may be calculated with various metrics. In one embodiment, the similarity between each candidate named entity and the named entity speech is determined from the edit distance between the Chinese character sequence corresponding to the candidate named entity and the recognized Chinese character sequence, together with the edit distance between the pinyin sequence corresponding to the candidate named entity and the recognized pinyin sequence.
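One way to combine the two edit distances into a single similarity score is sketched below. The patent does not fix a formula at this point; the equal weights and the 1/(1 + d) mapping are illustrative assumptions only.

```python
def similarity(d_hanzi: int, d_pinyin: int,
               w_hanzi: float = 0.5, w_pinyin: float = 0.5) -> float:
    """Map a weighted sum of the Chinese-character and pinyin edit
    distances to a similarity in (0, 1]: smaller distance, higher score."""
    return 1.0 / (1.0 + w_hanzi * d_hanzi + w_pinyin * d_pinyin)

# The candidate with the highest similarity is taken as the recognition
# result (hypothetical distances for two candidates):
candidates = {"zhang san de ge": (2, 2), "li si de ge": (4, 6)}
best = max(candidates, key=lambda c: similarity(*candidates[c]))
print(best)  # zhang san de ge
```

Any monotonically decreasing function of the distances would serve; the point is only that steps S131 and S132 feed two distances into one ranking over the candidate list.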
As shown in fig. 3, step S130 specifically includes the following steps:
step S131, determining the edit distance between the Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the recognized Chinese character sequence, and taking the edit distance as the Chinese character sequence edit distance between each candidate named entity and the speech of the named entity to be recognized.
The edit distance algorithm (EDA, also known as Levenshtein distance) measures the degree of matching between two strings: the edit distance is the minimum number of editing operations required to convert one string into the other, where the permitted editing operations are replacing one character with another, inserting a character, and deleting a character. The edit distance algorithm is used to calculate the edit distance between the Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the recognized Chinese character sequence. In the calculation of the Chinese character sequence edit distance, the "characters" are specifically Chinese characters. For example, if the candidate named entity is "zhang san de ge" (Zhang San's song) and the recognized Chinese character sequence reads "zhang shang ge", converting the recognized sequence into the candidate requires replacing one character ("shang" with "san") and inserting one character ("de"), so the Chinese character sequence edit distance between them is 2.
Step S132, determining the edit distance between the pinyin sequence corresponding to each candidate named entity in the specific named entity list and the recognized pinyin sequence, and taking the edit distance as the pinyin sequence edit distance between each candidate named entity and the voice of the named entity to be recognized.
As in step S131, an edit distance algorithm is used to calculate the edit distance between the pinyin sequence corresponding to each candidate named entity in the specific named entity list and the recognized pinyin sequence, which serves as the pinyin sequence edit distance between each candidate named entity and the speech of the named entity to be recognized.
Corresponding to the first implementation manner of step S120, the pinyin sequence edit distance is the syllable sequence edit distance between each candidate named entity in the specific named entity list and the recognition result of the speech of the named entity to be recognized. An edit distance algorithm is used to calculate the edit distance between the syllable sequence corresponding to each candidate named entity and the recognized syllable sequence, which serves as the syllable sequence edit distance between each candidate named entity and the speech of the named entity to be recognized; that is, the "characters" in the aforementioned edit distance algorithm are here syllables. For example, if the syllable sequence corresponding to the candidate named entity is "zhang san de ge" and the recognized syllable sequence is "zhang shang ge", then converting "zhang shang ge" into "zhang san de ge" requires changing "shang" into "san" and inserting "de", a change of 2 syllables, so the edit distance is 2.
Corresponding to the second implementation manner of step S120, the pinyin sequence edit distance is the phoneme sequence edit distance between each candidate named entity in the specific named entity list and the recognition result of the speech of the named entity to be recognized. An edit distance algorithm is used to calculate the edit distance between the phoneme sequence corresponding to each candidate named entity and the recognized phoneme sequence, which serves as the phoneme sequence edit distance between each candidate named entity and the speech of the named entity to be recognized; that is, the "characters" in the aforementioned edit distance algorithm are here phonemes. For example, if the phoneme sequence corresponding to the candidate named entity is "zh ang s an d e g e" and the recognized phoneme sequence is "zh ang sh ang g e", then converting "zh ang sh ang g e" into "zh ang s an d e g e" requires changing "sh" into "s", changing "ang" into "an", and inserting "d" and "e", a change of 4 phonemes, so the edit distance is 4.

Corresponding to the third implementation manner of step S120, the pinyin sequence edit distance includes the phoneme sequence edit distance and the tone sequence edit distance between each candidate named entity and the recognition result of the speech of the named entity to be recognized, and then
calculating the edit distance between the phoneme sequence corresponding to each candidate named entity in the specific named entity list and the recognized phoneme sequence by using an edit distance algorithm, to serve as the phoneme sequence edit distance between each candidate named entity and the speech of the named entity to be recognized; and

calculating the edit distance between the tone sequence corresponding to each candidate named entity in the specific named entity list and the recognized tone sequence by using an edit distance algorithm, to serve as the tone sequence edit distance between each candidate named entity and the speech of the named entity to be recognized.
The edit distance between the phoneme sequence corresponding to each candidate named entity in the specific named entity list and the recognized phoneme sequence is calculated as described above. In the calculation of the edit distance between the tone sequence corresponding to each candidate named entity and the recognized tone sequence, the "characters" in the edit distance algorithm are here tones. For example, if the tone sequence corresponding to the candidate named entity is that of "zhang1 san1" and the recognized tone sequence is that of "zhang1 san2", then converting "zhang1 san2" into "zhang1 san1" only requires changing the tone of "san", so the edit distance is 1.
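Since the dynamic-programming edit distance operates on arbitrary sequences, the syllable, phoneme, and tone variants differ only in how the input is tokenized. The following sketch (function name and token lists are illustrative, reusing the example values from the text) shows all three:

```python
def edit_distance(a, b):
    """Levenshtein distance over arbitrary token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# Syllable tokens: "zhang shang ge" -> "zhang san de ge" takes 2 edits.
print(edit_distance("zhang shang ge".split(), "zhang san de ge".split()))  # → 2

# Phoneme tokens: 2 substitutions (sh->s, ang->an) + 2 insertions (d, e).
print(edit_distance("zh ang sh ang g e".split(), "zh ang s an d e g e".split()))  # → 4

# Tone tokens for "zhang1 san2" vs. "zhang1 san1": 1 edit.
print(edit_distance(["1", "2"], ["1", "1"]))  # → 1
```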
Corresponding to the fourth implementation manner of step S120, if the pinyin sequence edit distance includes the syllable sequence edit distance and the phoneme sequence edit distance between each candidate named entity and the recognition result of the speech of the named entity to be recognized, then
calculating the edit distance between the syllable sequence corresponding to each candidate named entity in the specific named entity list and the recognized syllable sequence by using an edit distance algorithm, to serve as the syllable sequence edit distance between each candidate named entity and the speech of the named entity to be recognized; and

calculating the edit distance between the phoneme sequence corresponding to each candidate named entity in the specific named entity list and the recognized phoneme sequence by using an edit distance algorithm, to serve as the phoneme sequence edit distance between each candidate named entity and the speech of the named entity to be recognized. Both edit distances are calculated as described above.
Corresponding to the fifth implementation manner of step S120, the pinyin sequence edit distance includes the syllable sequence edit distance, the phoneme sequence edit distance, and the tone sequence edit distance between each candidate named entity and the named entity to be recognized, and then
calculating the edit distance between the syllable sequence corresponding to each candidate named entity in the specific named entity list and the recognized syllable sequence by using an edit distance algorithm, to serve as the syllable sequence edit distance between each candidate named entity and the speech of the named entity to be recognized;

calculating the edit distance between the phoneme sequence corresponding to each candidate named entity in the specific named entity list and the recognized phoneme sequence by using an edit distance algorithm, to serve as the phoneme sequence edit distance between each candidate named entity and the speech of the named entity to be recognized; and

calculating the edit distance between the tone sequence corresponding to each candidate named entity in the specific named entity list and the recognized tone sequence by using an edit distance algorithm, to serve as the tone sequence edit distance between each candidate named entity and the speech of the named entity to be recognized. All three edit distances are calculated as described above.
Step S133, calculating the total edit distance between each candidate named entity and the speech of the named entity to be recognized according to the Chinese character sequence edit distance and the pinyin sequence edit distance between each candidate named entity and the speech of the named entity to be recognized.
The overall edit distance may be a weighted average edit distance, an average edit distance, a weighted sum of edit distances, a sum of edit distances, and the like.
If the total edit distance is the weighted average edit distance, the preset weights corresponding to the Chinese character sequence edit distance and the pinyin sequence edit distance can be preset. When the speech recognition of the named entity speech to be recognized is performed, weighting processing can be performed on each candidate named entity in the specific named entity list and the Chinese character sequence editing distance and the pinyin sequence editing distance of the named entity to be recognized according to the preset weight, and the obtained weighted average value is used as the total editing distance between each candidate named entity in the specific named entity list and the speech of the named entity to be recognized.
As a specific case of the total edit distance, when the predetermined weights are all equal, the weighted average edit distance reduces to a simple average edit distance.
In addition, the total editing distance can be equal to the weighted sum or sum of the Chinese character sequence editing distance and the pinyin sequence editing distance of the candidate named entity and the named entity to be identified.
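The combination step can be sketched as follows; the function name, example distances, and weight values are illustrative assumptions. One function covers the weighted average, plain average, weighted sum, and plain sum described above:

```python
def total_edit_distance(distances, weights=None, average=True):
    """Combine per-representation edit distances (Chinese character,
    syllable, phoneme, tone, ...) into one total.
    weights=None -> equal weights; average=False -> (weighted) sum."""
    if weights is None:
        weights = [1.0] * len(distances)
    s = sum(w * d for w, d in zip(weights, distances))
    return s / sum(weights) if average else s

# e.g. Chinese character edit distance 2 and syllable edit distance 4:
print(total_edit_distance([2, 4]))                  # average → 3.0
print(total_edit_distance([2, 4], [0.75, 0.25]))    # weighted average → 2.5
print(total_edit_distance([2, 4], average=False))   # plain sum → 6.0
```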
In the case that the total edit distance is a weighted average, corresponding to the first or second implementation manner of step S120, the Chinese character sequence edit distance and the syllable sequence edit distance (or the phoneme sequence edit distance) between each candidate named entity and the recognition result of the named entity to be recognized are weighted according to their respective corresponding weights, and the resulting weighted average is used as the total edit distance between each candidate named entity in the specific named entity list and the speech of the named entity to be recognized.
In the case that the total edit distance is a weighted average, corresponding to the third implementation manner of step S120, the Chinese character sequence edit distance, the phoneme sequence edit distance, and the tone sequence edit distance between each candidate named entity and the recognition result of the named entity to be recognized are weighted according to their respective corresponding weights, and the resulting weighted average is used as the total edit distance between each candidate named entity and the speech of the named entity to be recognized.

In the case that the total edit distance is a weighted average, corresponding to the fourth implementation manner of step S120, the Chinese character sequence edit distance, the syllable sequence edit distance, and the phoneme sequence edit distance between each candidate named entity and the recognition result of the named entity to be recognized are weighted according to their respective corresponding weights, and the resulting weighted average is used as the total edit distance between each candidate named entity and the speech of the named entity to be recognized.

In the case that the total edit distance is a weighted average, corresponding to the fifth implementation manner of step S120, the Chinese character sequence edit distance, the syllable sequence edit distance, the phoneme sequence edit distance, and the tone sequence edit distance between each candidate named entity and the recognition result of the named entity to be recognized are weighted according to their respective corresponding weights, and the resulting weighted average is used as the total edit distance between each candidate named entity and the speech of the named entity to be recognized.
Step S134, taking the reciprocal of the sum of the total editing distance of each candidate named entity and the to-be-recognized named entity voice and a preset constant obtained through calculation as the similarity of each candidate named entity and the to-be-recognized named entity voice.
Since a smaller edit distance means a higher similarity, the reciprocal of the sum of the total edit distance between each candidate named entity and the speech of the named entity to be recognized and a predetermined constant is taken as the similarity. Because the total edit distance may be 0, a constant needs to be set in advance so that the sum of the total edit distance and the predetermined constant serves as the denominator of the similarity. The predetermined constant is preferably set to 1, giving a similarity of 1/(d + 1), where d is the total edit distance between the candidate named entity and the named entity to be recognized. For example, if the total edit distance between a candidate named entity and the named entity to be recognized is 1, the similarity between them is 1/(1 + 1) = 1/2.
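The preceding formula is small enough to state directly; the function name is an illustrative assumption:

```python
def similarity(total_distance, c=1.0):
    """Similarity of a candidate to the recognized speech: 1 / (d + c).
    The constant c > 0 (here the preferred value 1) keeps the
    denominator nonzero when the total edit distance d is 0."""
    return 1.0 / (total_distance + c)

print(similarity(1))  # → 0.5: a total edit distance of 1 gives similarity 1/2
print(similarity(0))  # → 1.0: an exact match gives the maximum similarity
```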
Referring to fig. 1, in step S140, a speech recognition result of the speech of the named entity to be recognized is determined from the specific named entity list according to the similarity between the candidate named entities and the speech of the named entity to be recognized.
Specifically, the candidate named entity in the specific named entity list with the greatest similarity to the recognition result of the speech of the named entity to be recognized is used as the speech recognition result. Equivalently, the candidate named entity in the specific named entity list with the smallest total edit distance to the recognition result is taken as the speech recognition result of the speech of the named entity to be recognized.
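Putting steps S131 through S140 together, the whole selection can be sketched end to end. This is a minimal illustration under stated assumptions: equal weights, only the Chinese character and syllable distances, and example entries reusing the "张三的歌" values from the text; the function names and data layout are not from the patent:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token tuples)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def best_candidate(recognized, candidates, weights=(0.5, 0.5)):
    """Pick the candidate with the highest similarity 1 / (d + 1),
    where d is the weighted average of the Chinese character and
    syllable edit distances. recognized and each candidate are
    (hanzi_string, syllable_tuple) pairs."""
    def sim(cand):
        d = (weights[0] * edit_distance(cand[0], recognized[0]) +
             weights[1] * edit_distance(cand[1], recognized[1])) / sum(weights)
        return 1.0 / (d + 1.0)
    return max(candidates, key=sim)

recognized = ("章三歌", tuple("zhang shang ge".split()))
candidates = [
    ("张三的歌", tuple("zhang san de ge".split())),
    ("李四的歌", tuple("li si de ge".split())),
]
print(best_candidate(recognized, candidates)[0])  # → 张三的歌
```

Maximizing 1/(d + 1) and minimizing d select the same candidate, which is why the two formulations in the text are equivalent.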
According to the embodiment of the application, on the basis of obtaining the recognition result of the Chinese character form by performing conventional voice recognition on the named entity to be recognized, pinyin recognition is further performed to obtain the recognition result of the pinyin form, and the final voice recognition result of the named entity to be recognized is determined in the specific named entity list according to the recognized Chinese character recognition result and the pinyin recognition result, so that the accuracy of the voice recognition on the named entity is improved.
In addition, in order to further improve the accuracy of the named entity speech recognition, the language model used in the chinese character-based speech recognition may be generated by jointly training a chinese character sequence corresponding to each candidate named entity in the specific named entity list and a chinese character sequence of a text in a general training text library.
In the general architecture of Chinese character-based speech recognition (as shown in FIG. 2), the language model used in the system is trained only on the Chinese character sequences of texts in the general training text library. Since the texts in the general training text library generally contain few named entities, such as names of people and names of places, such a speech recognition architecture is not accurate when recognizing named entities. In the embodiment of the present application, however, the language model may be trained jointly on the Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the Chinese character sequences of texts in the general training text library, so as to further improve the accuracy of named entity speech recognition.
In addition, in order to further improve the accuracy of the named entity speech recognition, the language model for syllable recognition may be generated by training a syllable sequence obtained by syllable expansion of each candidate named entity in the specific named entity list and a syllable sequence obtained by syllable expansion of a text in a general training text library. The language model for phoneme recognition may be generated by training a phoneme sequence obtained by performing phoneme expansion on each candidate named entity in the specific named entity list and a phoneme sequence obtained by performing phoneme expansion on a text in a universal training text library. Thus, compared with a syllable sequence training language model obtained by performing syllable expansion only by using texts in a general training text library or a phoneme sequence training language model obtained by performing phoneme expansion only by using texts in a general training text library, the accuracy of named entity speech recognition is further improved because each candidate named entity in the specific named entity list is added during training.
Referring to fig. 4, based on any of the above embodiments, optionally, the speech recognition method 1 further includes a step S100 of obtaining a to-be-recognized named entity speech included in the to-be-recognized speech.
In practical application scenarios, a user typically speaks a complete voice command, not just a named entity. For example, the user utters "I want to listen to Zhang San's songs." Therefore, it is necessary to identify which part of the speech uttered by the user is the speech of the named entity to be recognized.
As mentioned above, in one embodiment, the initial speech recognition may be performed on the speech to be recognized including the speech of the named entity to be recognized, and the result of the recognition may be matched with a command template stored in advance, so as to determine which part of the speech is the speech of the named entity to be recognized.
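The template-matching step can be sketched with simple pattern matching over the initial recognition result. The templates, function name, and "{entity}" slot syntax below are hypothetical illustrations, not the patent's stored command templates:

```python
import re

# Hypothetical command templates; "{entity}" marks the named-entity slot.
TEMPLATES = [
    "我要听{entity}的歌",   # "I want to listen to {entity}'s songs"
    "给{entity}打电话",     # "call {entity}"
]

def extract_entity(initial_transcript):
    """Match the initial recognition result against the stored templates
    and return the text span covering the named entity, or None."""
    for tpl in TEMPLATES:
        # Escape the literal parts, then open up the entity slot.
        pattern = "^" + re.escape(tpl).replace(re.escape("{entity}"), "(.+?)") + "$"
        m = re.match(pattern, initial_transcript)
        if m:
            return m.group(1)
    return None

print(extract_entity("我要听张三的歌"))  # → 张三
print(extract_entity("给李四打电话"))    # → 李四
```

The extracted span then delimits which portion of the audio is re-recognized as the named entity speech in the subsequent steps.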
As shown in fig. 5, an embodiment of the present application provides a name voice search method 2, including: s200, matching the voice command to be recognized with a pre-stored voice command template so as to obtain the name voice to be recognized in the voice command to be recognized; s210, carrying out voice recognition on the name voice to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the name voice to be recognized; s220, performing voice recognition on the name voice to be recognized by utilizing the voice recognition based on the pinyin so as to recognize a pinyin sequence serving as a pinyin recognition result of the name voice to be recognized; s230, determining the similarity between each candidate name in the specific name list and the voice of the name to be recognized according to the recognized Chinese character sequence and the recognized pinyin sequence; s240, determining a voice recognition result of the voice of the name to be recognized from the specific name list according to the similarity between the candidate names and the voice of the name to be recognized.
Compared with fig. 4, the embodiment of fig. 5 is merely a scheme in which the named entity is embodied as a person's name, and thus the detailed implementation of each step is not described again. Here, the specific name list may be a list of all employees of a company; with the embodiment of fig. 5, company employees can be searched by voice through simple voice interaction, which may be used in situations such as automatic transfer of company telephone calls.
As shown in fig. 6, an embodiment of the present application provides a song voice search method 3, including: s300, matching the voice command to be recognized with a pre-stored voice command template so as to obtain the song name voice to be recognized in the voice command to be recognized; s310, carrying out voice recognition on the song name voice to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence which is a Chinese character recognition result of the song name voice to be recognized; s320, performing voice recognition on the song name voice to be recognized by using voice recognition based on pinyin to recognize a pinyin sequence serving as a pinyin recognition result of the song name voice to be recognized; s330, determining the similarity between each candidate song name in the specific song name list and the song name voice to be identified according to the identified Chinese character sequence and the pinyin sequence; s340, determining a voice recognition result of the song name voice to be recognized from the specific song name list according to the similarity between each candidate song name and the song name voice to be recognized.
Compared with fig. 4, the embodiment of fig. 6 is merely a scheme in which the named entity is embodied as a song title, and thus the detailed implementation of each of its steps is omitted. The scheme can be used for song search in smart speaker products. Here, the specific song title list may be a list of the titles of all songs stored in the speaker. Through the embodiment of fig. 6, songs in the speaker can be searched through simple voice interaction, thereby realizing automatic voice-activated song requests.
As shown in fig. 7, an embodiment of the present application provides a method 5 for establishing a communication connection through voice, including: s200, matching the voice command to be recognized with a pre-stored voice command template so as to obtain the name voice to be recognized in the voice command to be recognized; s210, carrying out voice recognition on the name voice to be recognized by utilizing the voice recognition based on the Chinese characters so as to recognize a Chinese character sequence serving as a Chinese character recognition result of the name voice to be recognized; s220, performing voice recognition on the name voice to be recognized by utilizing the voice recognition based on the pinyin so as to recognize a pinyin sequence serving as a pinyin recognition result of the name voice to be recognized; s230, determining the similarity between each name in the user address list and the voice of the name to be recognized according to the recognized Chinese character sequence and the recognized pinyin sequence; s240, determining a voice recognition result of the voice of the name to be recognized from the user address list according to the similarity between the candidate names and the voice of the name to be recognized; and S250, initiating communication connection to the determined user in the user address list as the voice recognition result.
Steps S200-S240 of the embodiment of fig. 7 are similar to those of the embodiment of fig. 5, and are not repeated herein. Step S250 may include initiating a call connection request to the user in the determined user address book as the voice recognition result or sending a short message to the user in the determined user address book as the voice recognition result.
The scheme can be used, for example, in vehicle-mounted automatic voice communication products. Here, the user address book may be an address book stored in the user terminal. This achieves the effect that, while driving, the driver can place a call or send a short message simply by speaking a sentence, without dialing the mobile phone by hand.
As shown in fig. 8, an embodiment of the present application provides a speech recognition apparatus 4, where the apparatus 4 includes:
a first recognition unit 410, configured to perform voice recognition on a named entity voice to be recognized by using voice recognition based on a chinese character to recognize a chinese character sequence as a result of the recognition of the chinese character of the named entity voice to be recognized;
a second recognition unit 420, configured to perform speech recognition on the named entity speech to be recognized by using speech recognition based on pinyin, so as to recognize a pinyin sequence as a result of the pinyin recognition on the named entity speech to be recognized;
a similarity determining unit 430, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate named entity in the specific named entity list and the speech of the named entity to be recognized;
a recognition result determining unit 440, configured to determine, according to the similarity between each candidate named entity and the to-be-recognized named entity voice, a voice recognition result of the to-be-recognized named entity voice from the specific named entity list.
Optionally, the language model used in the chinese character-based speech recognition is generated by jointly training a chinese character sequence corresponding to each candidate named entity in the specific named entity list and a chinese character sequence of a text in a universal training text library.
Optionally, the pinyin-based speech recognition is syllable recognition, and the pinyin sequence includes a syllable sequence. The second identification unit is further configured to: and performing syllable recognition on the named entity voice to be recognized so as to recognize a syllable sequence which is a syllable recognition result of the named entity voice to be recognized.
Optionally, the pinyin-based speech recognition is a phoneme recognition, and the pinyin sequence includes a phoneme sequence. The second identification unit is further configured to: and performing phoneme recognition on the named entity voice to be recognized so as to recognize a phoneme sequence which is a phoneme recognition result of the named entity voice to be recognized.
Optionally, the pinyin-based speech recognition includes syllable recognition and phoneme recognition, and the pinyin sequence includes a syllable sequence and a phoneme sequence. The second identification unit is further configured to: performing syllable recognition on the named entity voice to be recognized so as to recognize a syllable sequence serving as a syllable recognition result of the named entity voice to be recognized; and performing phoneme recognition on the named entity voice to be recognized so as to recognize a phoneme sequence serving as a phoneme recognition result of the named entity voice to be recognized.
Optionally, the second identification unit is further configured to:
and performing tone recognition on the final phoneme in the recognized phoneme sequence to recognize a tone sequence serving as a tone recognition result of the named entity voice to be recognized.
Alternatively, as shown in fig. 9, the similarity determining unit 430 includes:
a Chinese character sequence edit distance determining subunit 431, configured to determine an edit distance between a Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the recognized Chinese character sequence, so as to serve as a Chinese character sequence edit distance between each candidate named entity and the to-be-recognized named entity;
a pinyin sequence editing distance determining subunit 432, configured to determine an editing distance between a pinyin sequence corresponding to each candidate named entity in the specific named entity list and the recognized pinyin sequence, so as to serve as an editing distance between the pinyin sequence of each candidate named entity and the voice of the named entity to be recognized;
a total edit distance determining subunit 433, configured to calculate a total edit distance between each candidate named entity and the to-be-identified named entity according to a chinese character sequence edit distance and a pinyin sequence edit distance between each candidate named entity and the to-be-identified named entity;
a similarity determining subunit 434, configured to use the reciprocal of the sum of the total edit distance between each candidate named entity and the to-be-identified named entity speech, which is obtained through calculation, and a predetermined constant as the similarity between each candidate named entity and the to-be-identified named entity speech.
Optionally, the language model for syllable recognition is generated by training a syllable sequence obtained by syllable expansion of each candidate named entity in the specific named entity list and a syllable sequence obtained by syllable expansion of a text in a general training text library.
Optionally, the language model for phoneme recognition is generated by training a phoneme sequence obtained by performing phoneme expansion on each candidate named entity in the specific named entity list and a phoneme sequence obtained by performing phoneme expansion on a text in a universal training text library.
Optionally, as shown in fig. 10, the apparatus 4 further includes:
an obtaining unit 400, configured to obtain a named entity voice to be recognized included in the voice to be recognized.
Referring to fig. 11, according to an embodiment of the present application, there is provided a name voice search apparatus 6 including:
a to-be-recognized name voice acquiring unit 610, configured to match the to-be-recognized voice command with a pre-stored voice command template, so as to acquire a to-be-recognized name voice in the to-be-recognized voice command;
a first to-be-recognized name speech recognition unit 620, configured to perform speech recognition on a name speech to be recognized by using speech recognition based on a Chinese character, so as to recognize a Chinese character sequence as a Chinese character recognition result of the name speech to be recognized;
a second to-be-recognized name voice recognition unit 630, configured to perform voice recognition on the to-be-recognized name voice by using pinyin-based voice recognition, so as to recognize a pinyin sequence as a pinyin recognition result of the to-be-recognized name voice;
a to-be-recognized name similarity determining unit 640, configured to determine similarity between each candidate name in the specific name list and the voice of the to-be-recognized name according to the recognized Chinese character sequence and the recognized pinyin sequence;
a to-be-recognized name voice recognition result determining unit 650, configured to determine a voice recognition result of the to-be-recognized name voice from the specific name list according to the similarity between each candidate name and the to-be-recognized name voice.
Referring to fig. 12, according to an embodiment of the present application, there is provided a song speech search apparatus 7, including:
a song name speech acquiring unit 710, configured to match a speech command to be recognized against a pre-stored speech command template, so as to extract the song name speech to be recognized from the speech command;
a first song name speech recognition unit 720, configured to perform Chinese-character-based speech recognition on the song name speech to be recognized, so as to obtain a Chinese character sequence as the Chinese character recognition result of the song name speech to be recognized;
a second song name speech recognition unit 730, configured to perform pinyin-based speech recognition on the song name speech to be recognized, so as to obtain a pinyin sequence as the pinyin recognition result of the song name speech to be recognized;
a song name similarity determining unit 740, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, the similarity between each candidate song name in a specific song name list and the song name speech to be recognized;
a song name recognition result determining unit 750, configured to determine the speech recognition result of the song name speech to be recognized from the specific song name list according to the similarity between each candidate song name and the song name speech to be recognized.
Referring to fig. 13, according to an embodiment of the present application, there is provided an apparatus 8 for establishing a communication connection by speech, including:
a to-be-recognized name speech acquiring unit 610, configured to match a speech command to be recognized against a pre-stored speech command template, so as to extract the name speech to be recognized from the speech command;
a first to-be-recognized name speech recognition unit 620, configured to perform Chinese-character-based speech recognition on the name speech to be recognized, so as to obtain a Chinese character sequence as the Chinese character recognition result of the name speech to be recognized;
a second to-be-recognized name speech recognition unit 630, configured to perform pinyin-based speech recognition on the name speech to be recognized, so as to obtain a pinyin sequence as the pinyin recognition result of the name speech to be recognized;
a to-be-recognized name similarity determining unit 640, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, the similarity between each name in a user address book and the name speech to be recognized;
a to-be-recognized name recognition result determining unit 650, configured to determine the speech recognition result of the name speech to be recognized from the user address book according to the similarity between each name in the address book and the name speech to be recognized;
a communication connection initiating unit 660, configured to initiate a communication connection to the user in the address book who is determined as the speech recognition result.
Optionally, the communication connection initiating unit is further configured to initiate a call connection request to, or send a short message to, the user in the address book who is determined as the speech recognition result.
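Taken together, the units of apparatus 8 describe a pipeline: match the command against a template, run Chinese-character and pinyin recognition on the extracted name speech, score the address book by similarity, and dial the best match. The sketch below illustrates that flow at the text level; the address book, the regex template, the recognizer outputs passed in as strings, and all function names are illustrative assumptions rather than part of the patent.

```python
import re

# Hypothetical address book and pinyin lexicon (illustrative only).
ADDRESS_BOOK = {"张伟": "138-0000-0001", "王芳": "138-0000-0002"}
PINYIN = {"张伟": "zhang wei", "王芳": "wang fang"}

# A pre-stored command template: "打电话给<name>" ("call <name>").
TEMPLATE = re.compile(r"^打电话给(?P<name>.+)$")

def edit_distance(a, b):
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def dial(command_text, hanzi_result, pinyin_result):
    """Match the command template, then pick the closest address-book entry.

    hanzi_result / pinyin_result stand in for the outputs of the two
    recognizers applied to the extracted name speech.
    """
    if TEMPLATE.match(command_text) is None:
        return None  # not a dialing command
    def similarity(name):
        total = (edit_distance(name, hanzi_result)
                 + edit_distance(PINYIN[name], pinyin_result))
        return 1.0 / (total + 1.0)
    best = max(ADDRESS_BOOK, key=similarity)
    return best, ADDRESS_BOOK[best]

# "章伟" is a homophone mis-recognition of "张伟"; the pinyin channel
# still matches exactly, so the correct contact wins.
print(dial("打电话给章伟", "章伟", "zhang wei"))  # ('张伟', '138-0000-0001')
```

The pinyin channel is what makes the homophone case robust: a character-level mismatch in the hanzi hypothesis costs only one unit of distance when the pronunciation still agrees.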
It is noted that the present application may be implemented in software and/or a combination of software and hardware; for example, the various means of the present application may be implemented using application-specific integrated circuits (ASICs) or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored in a computer-readable recording medium, such as RAM, a magnetic or optical drive, a diskette, and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform the various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
While exemplary embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the claims.

Claims (28)

1. A speech recognition method, the method comprising the steps of:
performing Chinese-character-based speech recognition on a named entity speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the named entity speech to be recognized;
performing pinyin-based speech recognition on the named entity speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the named entity speech to be recognized;
determining, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate named entity in a specific named entity list and the named entity speech to be recognized; and
determining a speech recognition result of the named entity speech to be recognized from the specific named entity list according to the similarity between each candidate named entity and the named entity speech to be recognized.
2. The method of claim 1, wherein the language model used in the Chinese-character-based speech recognition is generated by training jointly on the Chinese character sequences corresponding to the candidate named entities in the specific named entity list and the Chinese character sequences of texts in a universal training text corpus.
3. The method of claim 1, wherein the pinyin-based speech recognition is syllable recognition and the pinyin sequence comprises a syllable sequence, and
wherein the step of performing pinyin-based speech recognition on the named entity speech to be recognized comprises:
performing syllable recognition on the named entity speech to be recognized, so as to obtain a syllable sequence as a syllable recognition result of the named entity speech to be recognized.
4. The method of claim 1, wherein the pinyin-based speech recognition is phoneme recognition and the pinyin sequence comprises a phoneme sequence, and
wherein the step of performing pinyin-based speech recognition on the named entity speech to be recognized comprises:
performing phoneme recognition on the named entity speech to be recognized, so as to obtain a phoneme sequence as a phoneme recognition result of the named entity speech to be recognized.
5. The method of claim 1, wherein the pinyin-based speech recognition comprises both syllable recognition and phoneme recognition and the pinyin sequence comprises a syllable sequence and a phoneme sequence, and
wherein the step of performing pinyin-based speech recognition on the named entity speech to be recognized further comprises:
performing syllable recognition on the named entity speech to be recognized, so as to obtain a syllable sequence as a syllable recognition result of the named entity speech to be recognized; and
performing phoneme recognition on the named entity speech to be recognized, so as to obtain a phoneme sequence as a phoneme recognition result of the named entity speech to be recognized.
6. The method of claim 4 or 5, wherein the step of performing pinyin-based speech recognition on the named entity speech to be recognized further comprises:
performing tone recognition on the finals in the recognized phoneme sequence, so as to obtain a tone sequence as a tone recognition result of the named entity speech to be recognized.
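Claim 6 attaches tone information to the finals (the vowel-bearing phonemes) of the recognized sequence. A minimal sketch of that decomposition, assuming tone-numbered pinyin input such as "zhang1"; the abbreviated initial table and the function names are illustrative, not from the patent:

```python
# Illustrative decomposition of tone-numbered pinyin into initials, finals,
# and a tone sequence attached to the finals, as described in claim 6.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syl):
    """'zhang1' -> ('zh', 'ang', 1); the tone rides on the final."""
    tone = int(syl[-1]) if syl[-1].isdigit() else 0
    body = syl.rstrip("012345")
    for ini in INITIALS:
        if body.startswith(ini) and len(body) > len(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable, e.g. 'an4'

def tone_sequence(syllables):
    """Tone recognition result for the finals of a syllable sequence."""
    return [split_syllable(s)[2] for s in syllables]

print([split_syllable(s) for s in ["zhang1", "wei3"]])
# [('zh', 'ang', 1), ('w', 'ei', 3)]
```

A real system would perform tone classification on the audio itself; this sketch only shows where the tone attaches in the phoneme representation.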
7. The method of claim 1, wherein determining the similarity between each candidate named entity in the specific named entity list and the named entity speech to be recognized according to the recognized Chinese character sequence and the recognized pinyin sequence comprises:
determining an edit distance between the Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the recognized Chinese character sequence, as the Chinese character sequence edit distance between that candidate named entity and the named entity speech to be recognized;
determining an edit distance between the pinyin sequence corresponding to each candidate named entity in the specific named entity list and the recognized pinyin sequence, as the pinyin sequence edit distance between that candidate named entity and the named entity speech to be recognized;
calculating a total edit distance between each candidate named entity and the named entity speech to be recognized from its Chinese character sequence edit distance and pinyin sequence edit distance; and
taking the reciprocal of the sum of the calculated total edit distance and a predetermined constant as the similarity between each candidate named entity and the named entity speech to be recognized.
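The similarity of claim 7 is straightforward to state in code. In this sketch the total edit distance is an unweighted sum and the predetermined constant is 1; the claim fixes only the reciprocal form, so both of those choices are assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

C = 1.0  # predetermined constant; keeps the similarity finite at distance 0

def similarity(cand_hanzi, cand_pinyin, rec_hanzi, rec_pinyin):
    total = (edit_distance(cand_hanzi, rec_hanzi)
             + edit_distance(cand_pinyin, rec_pinyin))
    return 1.0 / (total + C)

# Homophone example: "章伟" differs from the recognized "张伟" by one
# character but shares its pinyin exactly, so it still scores highly.
print(similarity("章伟", "zhang wei", "张伟", "zhang wei"))  # 0.5
```

An exact match scores 1/C, and every additional character or pinyin edit lowers the score monotonically, which is all the ranking in claim 1 requires.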
8. The method of claim 3 or 5, wherein the language model for syllable recognition is generated by training on syllable sequences obtained by syllable expansion of the candidate named entities in the specific named entity list and syllable sequences obtained by syllable expansion of texts in a universal training text corpus.
9. The method of claim 4 or 5, wherein the language model for phoneme recognition is generated by training on phoneme sequences obtained by phoneme decomposition of the candidate named entities in the specific named entity list and phoneme sequences obtained by phoneme decomposition of texts in a universal training text corpus.
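Claims 8 and 9 build language-model training data by expanding each candidate named entity into syllables, and further into phonemes. A sketch of that expansion step; the four-character lexicon and the naive initial/final split below stand in for a real pronunciation dictionary and are purely illustrative:

```python
# Sketch of preparing LM training sequences per claims 8-9.
LEXICON = {"张": "zhang", "伟": "wei", "王": "wang", "芳": "fang"}

def to_syllables(entity):
    """Syllable expansion: '张伟' -> ['zhang', 'wei']."""
    return [LEXICON[ch] for ch in entity]

def to_phonemes(syllable):
    """Naive initial/final split: 'zhang' -> ['zh', 'ang']."""
    for ini in ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"):
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]  # zero-initial syllable

names = ["张伟", "王芳"]
syllable_corpus = [to_syllables(n) for n in names]
phoneme_corpus = [[p for s in to_syllables(n) for p in to_phonemes(s)]
                  for n in names]
print(syllable_corpus)  # [['zhang', 'wei'], ['wang', 'fang']]
print(phoneme_corpus)   # [['zh', 'ang', 'w', 'ei'], ['w', 'ang', 'f', 'ang']]
```

The resulting sequences would then be pooled with similarly expanded general-corpus text and fed to an n-gram (or other) language-model trainer, which the patent does not further specify here.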
10. The method of claim 1, further comprising:
acquiring the named entity speech to be recognized contained in a speech to be recognized.
11. A person name speech search method, comprising the steps of:
matching a speech command to be recognized against a pre-stored speech command template, so as to extract the name speech to be recognized from the speech command;
performing Chinese-character-based speech recognition on the name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the name speech to be recognized;
performing pinyin-based speech recognition on the name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the name speech to be recognized;
determining, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate name in a specific name list and the name speech to be recognized; and
determining a speech recognition result of the name speech to be recognized from the specific name list according to the similarity between each candidate name and the name speech to be recognized.
12. A song speech search method, comprising the steps of:
matching a speech command to be recognized against a pre-stored speech command template, so as to extract the song name speech to be recognized from the speech command;
performing Chinese-character-based speech recognition on the song name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the song name speech to be recognized;
performing pinyin-based speech recognition on the song name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the song name speech to be recognized;
determining, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate song name in a specific song name list and the song name speech to be recognized; and
determining a speech recognition result of the song name speech to be recognized from the specific song name list according to the similarity between each candidate song name and the song name speech to be recognized.
13. A method for establishing a communication connection by speech, comprising:
matching a speech command to be recognized against a pre-stored speech command template, so as to extract the name speech to be recognized from the speech command;
performing Chinese-character-based speech recognition on the name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the name speech to be recognized;
performing pinyin-based speech recognition on the name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the name speech to be recognized;
determining, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each name in a user address book and the name speech to be recognized;
determining a speech recognition result of the name speech to be recognized from the user address book according to the similarity between each name in the address book and the name speech to be recognized; and
initiating a communication connection to the user in the address book who is determined as the speech recognition result.
14. The method of claim 13, wherein initiating the communication connection comprises initiating a call connection request to, or sending a short message to, the user in the address book who is determined as the speech recognition result.
15. A speech recognition apparatus, the apparatus comprising:
a first recognition unit, configured to perform Chinese-character-based speech recognition on a named entity speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the named entity speech to be recognized;
a second recognition unit, configured to perform pinyin-based speech recognition on the named entity speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the named entity speech to be recognized;
a similarity determining unit, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate named entity in a specific named entity list and the named entity speech to be recognized; and
a recognition result determining unit, configured to determine a speech recognition result of the named entity speech to be recognized from the specific named entity list according to the similarity between each candidate named entity and the named entity speech to be recognized.
16. The apparatus of claim 15, wherein the language model used in the Chinese-character-based speech recognition is generated by training jointly on the Chinese character sequences corresponding to the candidate named entities in the specific named entity list and the Chinese character sequences of texts in a universal training text corpus.
17. The apparatus of claim 15, wherein the pinyin-based speech recognition is syllable recognition and the pinyin sequence comprises a syllable sequence, and
wherein the second recognition unit is further configured to:
perform syllable recognition on the named entity speech to be recognized, so as to obtain a syllable sequence as a syllable recognition result of the named entity speech to be recognized.
18. The apparatus of claim 15, wherein the pinyin-based speech recognition is phoneme recognition and the pinyin sequence comprises a phoneme sequence, and
wherein the second recognition unit is further configured to:
perform phoneme recognition on the named entity speech to be recognized, so as to obtain a phoneme sequence as a phoneme recognition result of the named entity speech to be recognized.
19. The apparatus of claim 15, wherein the pinyin-based speech recognition comprises both syllable recognition and phoneme recognition and the pinyin sequence comprises a syllable sequence and a phoneme sequence, and
wherein the second recognition unit is further configured to:
perform syllable recognition on the named entity speech to be recognized, so as to obtain a syllable sequence as a syllable recognition result of the named entity speech to be recognized; and
perform phoneme recognition on the named entity speech to be recognized, so as to obtain a phoneme sequence as a phoneme recognition result of the named entity speech to be recognized.
20. The apparatus of claim 18 or 19, wherein the second recognition unit is further configured to:
perform tone recognition on the finals in the recognized phoneme sequence, so as to obtain a tone sequence as a tone recognition result of the named entity speech to be recognized.
21. The apparatus of claim 15, wherein the similarity determining unit comprises:
a Chinese character sequence edit distance determining subunit, configured to determine an edit distance between the Chinese character sequence corresponding to each candidate named entity in the specific named entity list and the recognized Chinese character sequence, as the Chinese character sequence edit distance between that candidate named entity and the named entity speech to be recognized;
a pinyin sequence edit distance determining subunit, configured to determine an edit distance between the pinyin sequence corresponding to each candidate named entity in the specific named entity list and the recognized pinyin sequence, as the pinyin sequence edit distance between that candidate named entity and the named entity speech to be recognized;
a total edit distance determining subunit, configured to calculate a total edit distance between each candidate named entity and the named entity speech to be recognized from its Chinese character sequence edit distance and pinyin sequence edit distance; and
a similarity determining subunit, configured to take the reciprocal of the sum of the calculated total edit distance and a predetermined constant as the similarity between each candidate named entity and the named entity speech to be recognized.
22. The apparatus of claim 17 or 19, wherein the language model for syllable recognition is generated by training on syllable sequences obtained by syllable expansion of the candidate named entities in the specific named entity list and syllable sequences obtained by syllable expansion of texts in a universal training text corpus.
23. The apparatus of claim 18 or 19, wherein the language model for phoneme recognition is generated by training on phoneme sequences obtained by phoneme decomposition of the candidate named entities in the specific named entity list and phoneme sequences obtained by phoneme decomposition of texts in a universal training text corpus.
24. The apparatus of claim 15, further comprising:
an acquiring unit, configured to acquire the named entity speech to be recognized contained in a speech to be recognized.
25. A person name speech search apparatus, comprising:
a to-be-recognized name speech acquiring unit, configured to match a speech command to be recognized against a pre-stored speech command template, so as to extract the name speech to be recognized from the speech command;
a first to-be-recognized name speech recognition unit, configured to perform Chinese-character-based speech recognition on the name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the name speech to be recognized;
a second to-be-recognized name speech recognition unit, configured to perform pinyin-based speech recognition on the name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the name speech to be recognized;
a to-be-recognized name similarity determining unit, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate name in a specific name list and the name speech to be recognized; and
a to-be-recognized name recognition result determining unit, configured to determine a speech recognition result of the name speech to be recognized from the specific name list according to the similarity between each candidate name and the name speech to be recognized.
26. A song speech search apparatus, comprising:
a song name speech acquiring unit, configured to match a speech command to be recognized against a pre-stored speech command template, so as to extract the song name speech to be recognized from the speech command;
a first song name speech recognition unit, configured to perform Chinese-character-based speech recognition on the song name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the song name speech to be recognized;
a second song name speech recognition unit, configured to perform pinyin-based speech recognition on the song name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the song name speech to be recognized;
a song name similarity determining unit, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each candidate song name in a specific song name list and the song name speech to be recognized; and
a song name recognition result determining unit, configured to determine a speech recognition result of the song name speech to be recognized from the specific song name list according to the similarity between each candidate song name and the song name speech to be recognized.
27. An apparatus for establishing a communication connection by speech, comprising:
a to-be-recognized name speech acquiring unit, configured to match a speech command to be recognized against a pre-stored speech command template, so as to extract the name speech to be recognized from the speech command;
a first to-be-recognized name speech recognition unit, configured to perform Chinese-character-based speech recognition on the name speech to be recognized, so as to obtain a Chinese character sequence as a Chinese character recognition result of the name speech to be recognized;
a second to-be-recognized name speech recognition unit, configured to perform pinyin-based speech recognition on the name speech to be recognized, so as to obtain a pinyin sequence as a pinyin recognition result of the name speech to be recognized;
a to-be-recognized name similarity determining unit, configured to determine, according to the recognized Chinese character sequence and the recognized pinyin sequence, a similarity between each name in a user address book and the name speech to be recognized;
a to-be-recognized name recognition result determining unit, configured to determine a speech recognition result of the name speech to be recognized from the user address book according to the similarity between each name in the address book and the name speech to be recognized; and
a communication connection initiating unit, configured to initiate a communication connection to the user in the address book who is determined as the speech recognition result.
28. The apparatus of claim 27, wherein the communication connection initiating unit is further configured to initiate a call connection request to, or send a short message to, the user in the address book who is determined as the speech recognition result.
CN201610057651.3A 2016-01-27 2016-01-27 Voice recognition method and device Active CN107016994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610057651.3A CN107016994B (en) 2016-01-27 2016-01-27 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN107016994A CN107016994A (en) 2017-08-04
CN107016994B true CN107016994B (en) 2020-05-08

Family

ID=59438960



Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107707745A (en) * 2017-09-25 2018-02-16 百度在线网络技术(北京)有限公司 Method and apparatus for extracting information
CN107741928B (en) * 2017-10-13 2021-01-26 四川长虹电器股份有限公司 Method for correcting error of text after voice recognition based on domain recognition
CN109754784B (en) * 2017-11-02 2021-01-29 华为技术有限公司 Method for training filtering model and method for speech recognition
TWI660340B (en) * 2017-11-03 2019-05-21 財團法人資訊工業策進會 Voice controlling method and system
CN109785842B (en) * 2017-11-14 2023-09-05 蔚来(安徽)控股有限公司 Speech recognition error correction method and speech recognition error correction system
CN110176227B (en) * 2018-03-26 2023-07-14 腾讯科技(深圳)有限公司 Voice recognition method and related device
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109446521B (en) * 2018-10-18 2023-08-25 京东方科技集团股份有限公司 Named entity recognition method, named entity recognition device, electronic equipment and machine-readable storage medium
CN109658938B (en) * 2018-12-07 2020-03-17 百度在线网络技术(北京)有限公司 Method, device and equipment for matching voice and text and computer readable medium
CN109963022A (en) * 2018-12-27 2019-07-02 广州云趣信息科技有限公司 People's method and process are originally looked in a kind of voice call based on soundex algorithm
CN109918485B (en) * 2019-01-07 2020-11-27 口碑(上海)信息技术有限公司 Method and device for identifying dishes by voice, storage medium and electronic device
CN109688271A (en) * 2019-01-16 2019-04-26 深圳壹账通智能科技有限公司 The method, apparatus and terminal device of contact information input
CN109817201B (en) * 2019-03-29 2021-03-26 北京金山安全软件有限公司 Language learning method and device, electronic equipment and readable storage medium
CN111862961A (en) * 2019-04-29 2020-10-30 京东数字科技控股有限公司 Method and device for recognizing voice
CN110335608B (en) * 2019-06-17 2023-11-28 平安科技(深圳)有限公司 Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN110176237A (en) * 2019-07-09 2019-08-27 北京金山数字娱乐科技有限公司 A kind of audio recognition method and device
CN112825112B (en) * 2019-11-20 2024-05-31 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal
CN111145734A (en) * 2020-02-28 2020-05-12 北京声智科技有限公司 Voice recognition method and electronic equipment
CN111597815A (en) * 2020-05-22 2020-08-28 北京慧闻科技(集团)有限公司 Multi-embedded named entity identification method, device, equipment and storage medium
CN111681660B (en) * 2020-06-05 2023-06-13 北京有竹居网络技术有限公司 Speech recognition method, apparatus, electronic device, and computer-readable medium
CN112183106B (en) * 2020-09-03 2024-05-14 广发证券股份有限公司 Semantic understanding method and device based on phoneme association and deep learning
CN112364212A (en) * 2020-11-04 2021-02-12 北京致远互联软件股份有限公司 Voice name recognition method based on approximate voice recognition
CN112542167B (en) * 2020-12-02 2021-10-22 上海卓繁信息技术股份有限公司 Non-contact voice question-answering method and system
CN113838456B (en) * 2021-09-28 2024-05-31 中国科学技术大学 Phoneme extraction method, voice recognition method, device, equipment and storage medium
CN114758649B (en) * 2022-04-06 2024-04-19 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium

Citations (14)

Publication number Priority date Publication date Assignee Title
CN1112698A (en) * 1994-05-23 1995-11-29 北京超凡电子科技有限公司 Phonetic correcting method of Chinese speech recognition system
CN101000764A (en) * 2006-12-18 2007-07-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
CN101067780A (en) * 2007-06-21 2007-11-07 腾讯科技(深圳)有限公司 Character inputting system and method for intelligent equipment
CN101334704A (en) * 2008-06-27 2008-12-31 中国科学院软件研究所 Multichannel Chinese input method facing to mobile equipment
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN101833381A (en) * 2010-05-19 2010-09-15 北京友录在线科技发展有限公司 Pinyin reverse-tracing method for address list of handheld device
CN102063282A (en) * 2009-11-18 2011-05-18 盛大计算机(上海)有限公司 Chinese speech input system and method
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice
US8521539B1 (en) * 2012-03-26 2013-08-27 Nuance Communications, Inc. Method for chinese point-of-interest search
CN103456297A (en) * 2012-05-29 2013-12-18 中国移动通信集团公司 Method and device for matching based on voice recognition
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN104049963A (en) * 2013-03-16 2014-09-17 上海能感物联网有限公司 Method for controlling electromechanical equipment operation by use of Chinese speech
CN104813275A (en) * 2012-09-27 2015-07-29 谷歌公司 Methods and systems for predicting a text
CN105206274A (en) * 2015-10-30 2015-12-30 北京奇艺世纪科技有限公司 Voice recognition post-processing method and device as well as voice recognition system

Also Published As

Publication number Publication date
CN107016994A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
CN107016994B (en) Voice recognition method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107590135B (en) Automatic translation method, device and system
CN107195296B (en) Voice recognition method, device, terminal and system
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
CN111710333B (en) Method and system for generating speech transcription
EP3282368A1 (en) Parallel processing-based translation method and apparatus
US11450313B2 (en) Determining phonetic relationships
EP2862164B1 (en) Multiple pass automatic speech recognition
KR102191425B1 (en) Apparatus and method for learning foreign language based on interactive character
JP5480760B2 (en) Terminal device, voice recognition method and voice recognition program
WO2017071182A1 (en) Voice wakeup method, apparatus and system
KR102390940B1 (en) Context biasing for speech recognition
US20150112679A1 (en) Method for building language model, speech recognition method and electronic apparatus
US20140358544A1 (en) Systems and methods for adaptive proper name entity recognition and understanding
WO2017127296A1 (en) Analyzing textual data
US11074909B2 (en) Device for recognizing speech input from user and operating method thereof
JPWO2011096015A1 (en) Recognition dictionary creation device and speech recognition device
TW201517015A (en) Method for building acoustic model, speech recognition method and electronic apparatus
JP2016062069A (en) Speech recognition method and speech recognition apparatus
CN111192572A (en) Semantic recognition method, device and system
CN102970618A (en) Video on demand method based on syllable identification
US11615787B2 (en) Dialogue system and method of controlling the same
EP3005152A1 (en) Systems and methods for adaptive proper name entity recognition and understanding
WO2014033855A1 (en) Speech search device, computer-readable storage medium, and audio search method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant