CN109243430B - Voice recognition method and device - Google Patents


Info

Publication number
CN109243430B
CN109243430B (application CN201710537548.3A)
Authority
CN
China
Prior art keywords
user
word
language model
group
users
Prior art date
Legal status
Active
Application number
CN201710537548.3A
Other languages
Chinese (zh)
Other versions
CN109243430A (en
Inventor
郑宏
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710537548.3A
Publication of CN109243430A
Application granted; publication of CN109243430B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Abstract

The embodiment of the invention provides a speech recognition method and apparatus. The method includes: receiving a user's speech input and recognizing it to obtain candidate speech recognition results; ranking the candidate speech recognition results using a language recognition model of the user, where the language recognition model is obtained from a general language model and a personal language model corresponding to the user, the personal language model being a language model built from the user's historical text input data; and obtaining the final speech recognition result from the ranked candidate speech recognition results. The embodiment of the invention can effectively improve the accuracy of speech recognition results.

Description

Voice recognition method and device
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a voice recognition method and a voice recognition device.
Background
Speech recognition technology converts human speech into computer-readable input. It is widely used in fields such as voice dialing, voice navigation, and automatic device control, so improving the accuracy of speech recognition is an important problem.
In the prior art, a speech model is generally used to recognize speech input by a user, and an input speech feature sequence is converted into a character sequence. The speech model generally includes an acoustic model and a language model, corresponding to the computation of speech-to-syllable probabilities and syllable-to-character probabilities, respectively.
In researching the prior art, the applicant found that the prior art uses the same speech recognition model for different users. However, different users differ in pronunciation characteristics and language usage habits, so the prior art cannot provide accurate, personalized speech recognition results. Although there are prior-art methods that recognize a user's speech with the user's personal acoustic model, such methods consider only the user's pronunciation characteristics, such as the dialect to which the user belongs, and still cannot provide a more accurate, personalized speech recognition result.
Disclosure of Invention
The embodiment of the invention aims to provide a speech recognition method and apparatus that rank candidate speech recognition results using a general language model together with a personal language model corresponding to the user, so as to obtain more accurate, personalized speech recognition results.
Therefore, the embodiment of the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a speech recognition method, including: receiving a speech input of a user and recognizing the speech input to obtain candidate speech recognition results; ranking the candidate speech recognition results using a language recognition model of the user, where the language recognition model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from historical text input data of the user; and obtaining a final speech recognition result from the ranked candidate speech recognition results.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, including: a recognition unit configured to receive a speech input of a user and recognize the speech input to obtain candidate speech recognition results; a ranking unit configured to rank the candidate speech recognition results using a language recognition model of the user, where the language recognition model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from historical text input data of the user; and a result obtaining unit configured to obtain a final speech recognition result from the ranked candidate speech recognition results.
In a third aspect, an embodiment of the present invention provides an apparatus for speech recognition, comprising a memory, one or more processors, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: receiving a speech input of a user and recognizing the speech input to obtain candidate speech recognition results; ranking the candidate speech recognition results using a language recognition model of the user, where the language recognition model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from historical text input data of the user; and obtaining a final speech recognition result from the ranked candidate speech recognition results.
In a fourth aspect, embodiments of the present invention provide a machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform the speech recognition method described in the first aspect.
The speech recognition method and apparatus provided by the embodiment of the invention can receive a user's speech input, recognize it to obtain candidate speech recognition results, rank the candidates using the user's language recognition model, and obtain the final speech recognition result from the ranked candidates. Because the user's language recognition model is obtained from the general language model and the user's personal language model, the ranking considers not only general language usage habits but also the influence of the user's personalized language usage habits on the candidate results. Results that better match the user's personal language habits are ranked first, effectively improving the accuracy of the speech recognition result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech recognition method according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating an apparatus for speech recognition according to an example embodiment;
FIG. 5 is a block diagram illustrating a server according to an example embodiment.
Detailed Description
The embodiment of the invention aims to provide a speech recognition method and apparatus that rank candidate speech recognition results using a general language model together with a personal language model corresponding to the user, so as to obtain more accurate, personalized speech recognition results.
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
A speech recognition method according to an exemplary embodiment of the present invention will be described with reference to fig. 1 to 2.
Referring to fig. 1, a flowchart of a speech recognition method according to an embodiment of the present invention is shown. As shown in fig. 1, the method may include:
s101, receiving voice input of a user, recognizing the voice input, and obtaining a candidate voice recognition result.
When recognizing the user's speech input, an acoustic model from the prior art may be used to recognize the input and obtain candidate speech recognition results. The candidate speech recognition results are generally the top N best recognition results, where N is a positive integer whose value can be set according to experience or need.
S102, the candidate voice recognition results are ranked by using the language recognition model of the user.
It should be noted that the prior art generally uses a general speech model to recognize a user's speech input, without considering the pronunciation characteristics and language usage habits of different users. For example, users differ in vocabulary, in commonly used words, and in region-specific expressions. The personalized acoustic-model recognition methods in the prior art account for different users' pronunciation characteristics by collecting acoustic features and building a personal acoustic model to improve recognition accuracy, but they do not consider the user's language usage habits and still cannot provide a more accurate, personalized speech recognition result.
In the embodiment of the invention, a personal language model corresponding to the user may be built in advance. This personal language model is a language model built from the user's historical text input data, and it can effectively measure how probable a given sentence is.
In specific implementation, the personal language model corresponding to the user can be established in the following ways:
A. historical text input data of a user is obtained.
In a specific implementation, the user's text input data can be collected in various ways and used as the corpus for training the language model.
B. And obtaining word characteristics and/or word combination characteristics of the user according to the historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations.
The statistical frequency of a word is the number of times the word occurs in the whole corpus; the statistical frequency of a word combination is the number of times the combination occurs in the whole corpus, where a word combination is a combination of two or more words. For example, when the user Zhang San types, he particularly likes to enter "I le go" and rarely enters "I go". By collecting Zhang San's word combinations and counting their occurrences in the corpus, when Zhang San later speaks, the candidate speech recognition result "I le go" is ranked ahead of the candidate "I go". As another example, when the user Li Si types, he often enters "stack" and rarely enters "fight". By collecting word features, the candidate speech recognition result "stack" will be ranked ahead of the candidate "fight" in subsequent speech recognition.
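The word-feature and word-combination-feature statistics described above can be sketched in a few lines (a minimal illustration with toy data; the function and variable names are assumptions, not from the patent):

```python
from collections import Counter

def extract_features(corpus):
    """Count word features and word-combination (bigram) features
    from a user's segmented historical text input."""
    word_counts = Counter()    # word -> occurrences in the corpus
    bigram_counts = Counter()  # (word, word) -> occurrences
    for sentence in corpus:    # each sentence is a list of segmented words
        word_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))
    return word_counts, bigram_counts

# Toy corpus standing in for one user's historical text input.
corpus = [["I", "le", "go"], ["I", "le", "go"], ["I", "go"]]
words, bigrams = extract_features(corpus)
print(words["I"])             # 3
print(bigrams[("le", "go")])  # 2
```

Here the more frequent combination ("le", "go") would later push the matching candidate ahead during ranking.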
C. And training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
When training the personal language model, an N-Gram language model training method, a Recurrent Neural Network (RNN) language model training method, a Long Short-Term Memory (LSTM) language model training method, or the like may be used. Training a trigram N-Gram language model is described below as an example.
(1) Perform word segmentation on each sentence in the corpus. For example, segmenting the sentence ABC yields (A, B, C).
(2) Calculate the probability of each of the words A, B, and C occurring in the corpus. For example:
P(A) = number of times A appears in the corpus / total number of words in the corpus
(3) Calculate the conditional probability of the word B occurring after the word A.
P(B|A) = number of occurrences of AB in the corpus / number of occurrences of A in the corpus
(4) Calculate the conditional probability of the word C occurring after the word combination AB.
P(C|AB) = number of occurrences of ABC in the corpus / number of occurrences of AB in the corpus
(5) Calculate the probability of the sentence ABC occurring in the corpus.
P(ABC) = P(A)P(B|A)P(C|AB)
In this way, a personal language model for the user can be obtained by training on all of the user's corpus data, and this personal language model can effectively measure the probability of a sentence occurring.
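Steps (1) to (5) above can be sketched as a tiny trigram model (a minimal illustration without smoothing; the class and method names are assumptions):

```python
from collections import Counter

class TrigramLM:
    def __init__(self, corpus):
        # corpus: list of sentences, each a list of segmented words
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.trigrams = Counter()
        self.total = 0
        for s in corpus:
            self.unigrams.update(s)
            self.bigrams.update(zip(s, s[1:]))
            self.trigrams.update(zip(s, s[1:], s[2:]))
            self.total += len(s)

    def p_word(self, a):           # step (2): P(A)
        return self.unigrams[a] / self.total

    def p_bigram(self, a, b):      # step (3): P(B|A)
        return self.bigrams[(a, b)] / self.unigrams[a]

    def p_trigram(self, a, b, c):  # step (4): P(C|AB)
        return self.trigrams[(a, b, c)] / self.bigrams[(a, b)]

    def p_sentence(self, s):       # step (5): P(ABC) = P(A)P(B|A)P(C|AB)
        a, b, c = s
        return self.p_word(a) * self.p_bigram(a, b) * self.p_trigram(a, b, c)

lm = TrigramLM([["A", "B", "C"], ["A", "B", "D"]])
# P(A) = 2/6, P(B|A) = 2/2, P(C|AB) = 1/2, so P(ABC) = 1/6
print(lm.p_sentence(["A", "B", "C"]))
```

A production model would add smoothing for unseen N-grams; the sketch omits it to keep the five steps visible.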
In the specific implementation of the present invention, after obtaining N candidate speech recognition results, the candidate speech recognition results may be ranked using the language recognition model of the user. The language identification model of the user is obtained through a universal language model and a personal language model corresponding to the user, and the universal language model is a model obtained by training text input corpora of all users.
In some embodiments, the ranking the candidate speech recognition results using the language recognition model of the user comprises: performing linear interpolation by using the general language model and the personal language model corresponding to the user to obtain a language identification model of the user; and calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability.
For example, the weight may be preset, and the language identification model of the user is obtained by the following formula:
user's language identification model is a multiplied by personal language model + b multiplied by general language model
Wherein 0< a <1, 0< b <1, and a + b ═ 1.
For example, a is 0.7 and b is 0.3.
And calculating the probability of each candidate voice recognition result by using the obtained language recognition model, and sorting the candidate voice recognition results in a descending order according to the probability.
In some other embodiments, ranking the candidate speech recognition results using the user's language recognition model includes: calculating the probability of each candidate speech recognition result with the general language model, and calculating the probability of each candidate with the personal language model corresponding to the user; then performing linear interpolation on the two probabilities and ranking the candidate speech recognition results according to the interpolated result.
For example, the weight may be preset, and the result of linear interpolation is obtained by the following formula:
final probability value = a × probability value from the personal language model + b × probability value from the general language model
where 0 < a < 1, 0 < b < 1, and a + b = 1.
For example, assume a = 0.7 and b = 0.3. Zhang San's personal language model assigns the candidate speech recognition result "I le go" a probability of 0.00038683, the general language model assigns the sentence a probability of 0.00023453, and the final probability of the sentence obtained by linear interpolation is 0.7 × 0.00038683 + 0.3 × 0.00023453 = 0.00034114.
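The interpolation in this example can be checked directly (weights and probability values taken from the example above; the function name is an assumption):

```python
def interpolate(p_personal, p_general, a=0.7, b=0.3):
    """Linear interpolation of the personal and general model
    probabilities; the weights a and b must sum to 1."""
    assert abs(a + b - 1.0) < 1e-9
    return a * p_personal + b * p_general

final = interpolate(0.00038683, 0.00023453)
print(round(final, 8))  # 0.00034114
```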
In some embodiments, the method further comprises: acquiring a group language model corresponding to the user; the group language model is used for describing the language characteristics of the group to which the user belongs; the ranking the candidate speech recognition results using the language recognition model of the user comprises: and sequencing the candidate voice recognition results by utilizing a general language model, a personal language model corresponding to the user and a group language model corresponding to the user. The specific implementation can be seen in the embodiment shown in fig. 2.
And S103, obtaining a final voice recognition result by using the sorted candidate voice recognition results.
In a specific implementation, the final speech recognition result can be obtained from the top-ranked candidate. For example, the user says "Wangli makes a phone call to me", where the name "Wangli" has several homophonous written forms. If the general language model is used alone, the recognition result writes the name in its most common form. However, the name the user frequently types with the input method is written "Wang Li". Applying the method of this application, when the candidate results are re-scored and ranked by an N-best reordering algorithm, "Wang Li makes a call to me" is ranked ahead of the other form. The resulting personalized recognition result is more accurate than the general result and better conforms to the user's language usage characteristics.
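The N-best re-scoring step can be sketched as follows (a minimal illustration; the candidate strings and probability values are toy placeholders standing in for the interpolated model's output):

```python
def rerank(candidates, user_lm_prob):
    """Re-score the N-best candidates with the user's language
    recognition model and sort in descending probability order."""
    return sorted(candidates, key=user_lm_prob, reverse=True)

# Toy interpolated probabilities for two homophonous candidates.
probs = {"Wang Li calls me": 4e-4, "Wangli calls me": 2e-4}
ranked = rerank(list(probs), probs.get)
print(ranked[0])  # the form matching the user's typing habit wins
```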
In this embodiment of the invention, a user's speech input may be received and recognized to obtain candidate speech recognition results, the candidates may be ranked using the user's language recognition model, and the final speech recognition result may be obtained from the ranked candidates. Because the user's language recognition model is obtained from the general language model and the user's personal language model, the influence of both general language usage habits and the user's personalized language usage habits on the candidate results is comprehensively considered; results that better match the user's personal language habits are ranked first, effectively improving the accuracy of the speech recognition result.
Referring to fig. 2, a flowchart of a speech recognition method according to another embodiment of the present invention is provided. Unlike the embodiment shown in fig. 1, this embodiment also considers the influence of the group language model corresponding to the user on the recognition result. Through this group language model, personalized expressions that the user rarely uses but that are common in the user's similar group can still be recognized, compensating for gaps in the user's own corpus and improving the accuracy of speech recognition.
S201, establishing a personal language model corresponding to the user.
In specific implementation, the personal language model corresponding to the user can be established in the following ways: acquiring historical text input data of a user; obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user. Specific implementations may be realized with reference to the embodiment shown in fig. 1.
S202, establishing language models of all groups.
The group language model is used for describing language features of a group to which the user belongs. Each user has a group language model corresponding to the user, and the corresponding relationship between the user and the group language model may be pre-stored, and in S205, the group language model corresponding to the current user may be obtained by using the pre-stored corresponding relationship.
Wherein, the establishment of each group language model comprises the following steps:
S202A, calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity. The similar user group set comprises users with similarity greater than a set threshold.
In a specific implementation, S202A may further include: acquiring word feature vectors of different users; respectively taking each user of the different users as a current user, calculating cosine distances of word feature vectors of the current user and word feature vectors of other users, and taking the cosine distances as the similarity of the current user and the other users; and adding each user with the similarity degree with the current user larger than a set threshold into a similar user group set corresponding to the current user.
For example, the user's word feature vector can be obtained from the user's 1-Gram personal language model; the vector holds a probability value for each word and represents how frequently the user uses each word. Since similar users use words with similar frequencies, the similarity of user groups can be measured by the similarity of their word feature vectors. For example, doctor A's vocabulary is similar to doctor B's but differs from those of hockey coach C and truck driver D, and each user's word features can be represented in a vector of vocabulary size. The similarity between doctor A's word feature vector and doctor B's will be greater than that between doctor A's vector and hockey coach C's. The similarity can be calculated as the cosine distance between the vectors.
Wherein the cosine distance of the two vectors a and b can be calculated using the following formula:
cosθ=(a·b)/‖a‖‖b‖
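The formula above can be computed as follows (a minimal sketch; the word-frequency vectors are toy values chosen to mirror the doctor/coach example):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy word-usage frequency vectors over a 3-word vocabulary.
doctor_a = [0.4, 0.4, 0.2]
doctor_b = [0.5, 0.3, 0.2]
coach_c = [0.1, 0.1, 0.8]
print(cosine_similarity(doctor_a, doctor_b) >
      cosine_similarity(doctor_a, coach_c))  # True
```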
It should be noted that if the similarity between two users' word feature vectors is greater than the set threshold, the two users may be considered similar, and the set of users whose similarity with the current user exceeds the threshold can serve as the similar user group corresponding to the current user. If the current user has little text input corpus, the similar user group can compensate for this scarcity and supply personalized expressions that the user rarely uses but that are common in the similar group, making the resulting speech recognition more accurate.
S202B, obtaining word characteristics and/or word combination characteristics corresponding to the similar user group by using the text input of each user of the similar user group set, wherein the word characteristics comprise words and the statistical frequency of the words, and the word combination characteristics comprise word combinations and the statistical frequency of the word combinations.
S202C, training by using the word features and/or word combination features corresponding to the similar user groups to obtain a group language model.
It should be noted that the method for training to obtain the group language model is similar to the method for obtaining the personal language model, but the input corpus is different, and the input corpus of the group language model may be the corpus of some or all users in the similar user group.
S203, establishing a universal language model.
S201, S202, and S203 need not be executed in the order listed; they may be executed in a different order or in parallel.
S204, receiving the voice input of the user, and recognizing the voice input to obtain a candidate voice recognition result.
S205, sorting the candidate voice recognition results by using a general language model, a personal language model corresponding to the user and a group language model corresponding to the user.
In some embodiments, said ranking the candidate speech recognition results using a generic language model, a personal language model corresponding to the user, and a group language model corresponding to the user comprises: performing linear interpolation by using a general language model and a personal language model and a group language model corresponding to the user to obtain a language identification model of the user; and calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability.
For example, the weight may be preset, and the language identification model of the user is obtained by the following formula:
user's language identification model x personal language model + y x general language model + z x group language model
Wherein 0< x <1, 0< y <1, 0< z <1, and x + y + z is 1.
For example, x has a value of 0.5, b has a value of 0.3, and z has a value of 0.2.
In specific implementation, the obtained final language recognition model can be used for calculating the probability of each candidate voice recognition result, and the candidate voice recognition results are sorted in a descending order according to the probability.
In some embodiments, said ranking the candidate speech recognition results using a generic language model, a personal language model corresponding to the user, and a group language model corresponding to the user comprises: calculating the probability of each candidate voice recognition result by using a general language model, calculating the probability of each candidate voice recognition result by using a personal language model corresponding to the user, and calculating the probability of each candidate voice recognition result by using a group language model corresponding to the user; and performing linear interpolation on the probability calculated by using the general language model, the probability calculated by using the personal language model corresponding to the user and the probability calculated by using the group language model corresponding to the user, and sequencing the candidate voice recognition results according to the results of the linear interpolation.
For example, the weight may be preset, and the result of linear interpolation is obtained by the following formula:
final probability value = x × probability value from the personal language model + y × probability value from the general language model + z × probability value from the group language model
where 0 < x < 1, 0 < y < 1, 0 < z < 1, and x + y + z = 1.
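The three-model interpolation extends the two-model case directly (weights from the example above; the function name and probability values are assumptions):

```python
def interpolate3(p_personal, p_general, p_group, x=0.5, y=0.3, z=0.2):
    """Linear interpolation over the personal, general, and group
    language model probabilities; x + y + z must equal 1."""
    assert abs(x + y + z - 1.0) < 1e-9
    return x * p_personal + y * p_general + z * p_group

# Toy probabilities for one candidate sentence under the three models.
final = interpolate3(4e-4, 2e-4, 3e-4)
print(round(final, 5))  # 0.00032
```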
And S206, obtaining a final voice recognition result by using the sorted candidate voice recognition results.
In the embodiment of the invention, the influence of the group language model corresponding to the user on the recognition result is considered. Through this group language model, personalized expressions that the user rarely uses but that are common in the user's similar group can be recognized, compensating for gaps in the user's own corpus and improving the accuracy of speech recognition.
Referring to fig. 3, a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention is shown.
A speech recognition apparatus 300 comprising:
The recognition unit 301 is configured to receive a user's speech input and recognize it to obtain candidate speech recognition results. The specific implementation of the recognition unit 301 can refer to step 101 in the embodiment shown in fig. 1.
A sorting unit 302, configured to sort the candidate speech recognition results by using a language recognition model of a user; the language identification model of the user is obtained through a general language model and a personal language model corresponding to the user, wherein the personal language model corresponding to the user is a language model established by using historical text input data of the user. The specific implementation of the sorting unit 302 can be implemented with reference to step 102 in the embodiment shown in fig. 1.
A result obtaining unit 303, configured to obtain a final speech recognition result by using the sorted candidate speech recognition results. The specific implementation of the result obtaining unit 303 can be implemented with reference to step 103 of the embodiment shown in fig. 1.
In some embodiments, the sorting unit 302 is specifically configured to: performing linear interpolation by using a general language model and a personal language model corresponding to the user to obtain a language identification model of the user; and calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability.
In some embodiments, the sorting unit 302 is specifically configured to: calculating the probability of each candidate voice recognition result by using a general language model and calculating the probability of each candidate voice recognition result by using a personal language model corresponding to the user; and performing linear interpolation on the probability calculated by using the universal language model and the probability calculated by using the personal language model corresponding to the user, and sequencing the candidate voice recognition results according to the result of the linear interpolation.
In some embodiments, the apparatus further includes a personal language model establishing unit, configured to establish a personal language model corresponding to the user, where the personal language model establishing unit is specifically configured to: acquiring historical text input data of a user; obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
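The "word features" and "word combination features" described above (each word or adjacent word pair together with its statistical frequency) can be sketched as simple frequency counting; this is an illustrative sketch only, assuming whitespace tokenization and adjacent-pair combinations, neither of which the text specifies.

```python
from collections import Counter

def extract_features(history_lines):
    """Count word features (word, frequency) and word-combination
    features (adjacent word pair, frequency) from a user's
    historical text input data."""
    words, pairs = Counter(), Counter()
    for line in history_lines:
        tokens = line.split()  # placeholder tokenizer
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return words, pairs

# Hypothetical history for one user.
history = ["open the door", "open the window"]
words, pairs = extract_features(history)
```

Here `words["open"]` is 2 and `pairs[("open", "the")]` is 2; counts of this kind would then serve as training statistics for the personal language model.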
In some embodiments, the apparatus further includes a group language model obtaining unit configured to obtain a group language model corresponding to the user; the group language model is used for describing the language characteristics of the group to which the user belongs;
the sorting unit is further configured to: and sequencing the candidate voice recognition results by utilizing a general language model, a personal language model corresponding to the user and a group language model corresponding to the user.
In some embodiments, the apparatus further comprises a group language model building unit, the group language model building unit being specifically configured to: calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity, wherein the similar user group set comprises all users with the similarity larger than a set threshold; acquiring word characteristics and/or word combination characteristics corresponding to a similar user group by using text input of each user in the similar user group set, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by using the word features and/or the word combination features corresponding to the similar user groups to obtain a group language model.
In some embodiments, the group language model building unit is specifically configured to: acquiring word feature vectors of different users; respectively taking each user of the different users as a current user, calculating cosine distances of word feature vectors of the current user and word feature vectors of other users, and taking the cosine distances as the similarity of the current user and the other users; and adding each user with the similarity degree with the current user larger than a set threshold into a similar user group set corresponding to the current user.
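The cosine-based grouping described above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the user IDs, vectors, and threshold value are hypothetical, and (as in the text) cosine similarity is used directly as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word feature vectors,
    used here as the similarity between two users."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_group(current_vec, other_users, threshold=0.8):
    """Users whose similarity to the current user exceeds the set threshold."""
    return [uid for uid, vec in other_users.items()
            if cosine(current_vec, vec) > threshold]

# Hypothetical word feature vectors for two other users.
vectors = {"user_b": [1.0, 2.0, 0.0], "user_c": [0.0, 0.0, 3.0]}
group = similar_group([2.0, 4.0, 0.0], vectors)  # user_b's vector is collinear
```

In this toy example only `user_b` (similarity 1.0) joins the current user's similar user group; `user_c` (similarity 0.0) does not.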
The arrangement of each unit or module of the device of the present invention can be implemented by referring to the methods shown in fig. 1 to 2, which are not described herein again.
Referring to fig. 4, a block diagram for a speech recognition device is shown in accordance with an exemplary embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: processing components 402, memory 404, power components 406, multimedia components 408, audio components 410, input/output (I/O) interfaces 412, sensor components 414, and communication components 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the device 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply components 406 provide power to the various components of device 400. The power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 400 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor component 414 can detect the open/closed state of the device 400 and the relative positioning of components, such as the display and keypad of the apparatus 400; the sensor component 414 can also detect a change in the position of the apparatus 400 or a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
Specifically, the embodiment of the present invention provides a speech recognition apparatus 400, which comprises a memory 404 and one or more programs, wherein the one or more programs are stored in the memory 404, and the one or more programs configured to be executed by the one or more processors 420 comprise instructions for: receiving voice input of a user, and identifying the voice input to obtain a candidate voice identification result; sorting the candidate voice recognition results by using a language recognition model of a user; the language identification model of the user is obtained through a general language model and a personal language model corresponding to the user, wherein the personal language model corresponding to the user is a language model established by using historical text input data of the user; and obtaining a final voice recognition result by using the sorted candidate voice recognition results.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: performing linear interpolation by using a general language model and a personal language model corresponding to the user to obtain a language identification model of the user; and calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: calculating the probability of each candidate voice recognition result by using a general language model and calculating the probability of each candidate voice recognition result by using a personal language model corresponding to the user; and performing linear interpolation on the probability calculated by using the universal language model and the probability calculated by using the personal language model corresponding to the user, and sequencing the candidate voice recognition results according to the result of the linear interpolation.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: acquiring historical text input data of a user; obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: acquiring a group language model corresponding to the user; the group language model is used for describing the language characteristics of the group to which the user belongs; and sequencing the candidate voice recognition results by utilizing a general language model, a personal language model corresponding to the user and a group language model corresponding to the user.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity, wherein the similar user group set comprises users with the similarity larger than a set threshold; obtaining word characteristics and/or word combination characteristics corresponding to a similar user group by utilizing text input of each user of the similar user group set, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by using the word features and/or the word combination features corresponding to the similar user groups to obtain a group language model.
Further, the processor 420 is specifically configured to execute the one or more programs including instructions for: acquiring word feature vectors of different users; respectively taking each user of the different users as a current user, calculating cosine distances of word feature vectors of the current user and word feature vectors of other users, and taking the cosine distances as the similarity of the current user and the other users; and adding the users with the similarity greater than a set threshold value with the current user into a similar user group set corresponding to the current user.
A machine-readable medium, which may be, for example, a non-transitory computer-readable storage medium, having instructions thereon which, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a speech recognition method, the method comprising: receiving voice input of a user, and recognizing the voice input to obtain candidate voice recognition results; sorting the candidate voice recognition results by using a language recognition model of the user; the language recognition model of the user is obtained through a general language model and a personal language model corresponding to the user, wherein the personal language model corresponding to the user is a language model established by using historical text input data of the user; and obtaining a final voice recognition result by using the sorted candidate voice recognition results.
Optionally, the sorting the candidate speech recognition results by using the language recognition model of the user includes: performing linear interpolation by using a general language model and a personal language model corresponding to the user to obtain a language identification model of the user; and calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability.
Optionally, the sorting the candidate speech recognition results by using the language recognition model of the user includes: calculating the probability of each candidate voice recognition result by using a general language model and calculating the probability of each candidate voice recognition result by using a personal language model corresponding to the user; and performing linear interpolation on the probability calculated by using the universal language model and the probability calculated by using the personal language model corresponding to the user, and sequencing the candidate voice recognition results according to the result of the linear interpolation.
Optionally, the method further comprises: acquiring historical text input data of a user; obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
Optionally, the method further comprises: acquiring a group language model corresponding to the user; the group language model is used for describing the language characteristics of the group to which the user belongs; the ranking the candidate speech recognition results using the language recognition model of the user comprises: and sequencing the candidate voice recognition results by utilizing a general language model, a personal language model corresponding to the user and a group language model corresponding to the user.
Optionally, the obtaining a group language model corresponding to the user includes:
pre-establishing language models of each group;
and acquiring a group language model corresponding to the user according to the corresponding relation between the user and the group language model.
Optionally, the pre-establishing each group language model includes: calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity, wherein the similar user group set comprises all users with the similarity larger than a set threshold; obtaining word characteristics and/or word combination characteristics corresponding to a similar user group by utilizing text input of each user of the similar user group set, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by using the word features and/or the word combination features corresponding to the similar user groups to obtain a group language model.
Optionally, the calculating the similarity between different users, and obtaining a similar user group set according to the calculated similarity includes: acquiring word feature vectors of different users; respectively taking each user of the different users as a current user, calculating cosine distances of word feature vectors of the current user and word feature vectors of other users, and taking the cosine distances as the similarity of the current user and the other users; and adding the users with the similarity greater than a set threshold value with the current user into a similar user group set corresponding to the current user.
Fig. 5 is a schematic structural diagram of a server in an embodiment of the present invention. The server 500 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors), memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 522 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the server 500.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input-output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort. The foregoing is directed to embodiments of the present invention, and it is understood that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the invention.

Claims (14)

1. A speech recognition method, comprising:
receiving voice input of a user, and identifying the voice input to obtain a candidate voice identification result;
performing linear interpolation by using a general language model, a personal language model corresponding to the user and a group language model corresponding to the user to obtain a language identification model of the user; the personal language model is a language model established by using historical text input data of the user and is used for effectively measuring the probability of occurrence of a sentence, and the group language model is used for describing the language characteristics of a group to which the user belongs; the group language model training corpus is a corpus of some or all users in a similar user group corresponding to the user; the similar user group comprises users of which the similarity of the word feature vectors is greater than a set threshold value;
calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sequencing each candidate voice recognition result according to the calculated probability;
and obtaining a final voice recognition result by using the sorted candidate voice recognition results.
2. The method of claim 1, further comprising:
acquiring historical text input data of a user;
obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations;
and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
3. The method of claim 1, further comprising:
pre-establishing language models of each group;
acquiring a group language model corresponding to a user according to the corresponding relation between the user and the group language model;
wherein, pre-establishing each group language model comprises:
calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity, wherein the similar user group set comprises users with the similarity larger than a set threshold;
obtaining word characteristics and/or word combination characteristics corresponding to a similar user group by utilizing text input of each user of the similar user group set, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations;
and training by using the word features and/or the word combination features corresponding to the similar user groups to obtain a group language model.
4. The method according to claim 3, wherein the calculating the similarity between different users and the obtaining the similar user group set according to the calculated similarity comprises:
acquiring word feature vectors of different users;
respectively taking each user of the different users as a current user, calculating cosine distances of word feature vectors of the current user and word feature vectors of other users, and taking the cosine distances as the similarity of the current user and the other users;
and adding the users with the similarity greater than a set threshold value with the current user into a similar user group set corresponding to the current user.
5. A speech recognition apparatus, comprising:
the recognition unit is used for receiving voice input of a user, recognizing the voice input and obtaining a candidate voice recognition result;
the model obtaining unit is used for carrying out linear interpolation by utilizing a general language model, a personal language model corresponding to the user and a group language model corresponding to the user to obtain a language identification model of the user; the personal language model is a language model established by using historical text input data of the user and is used for effectively measuring the probability of occurrence of a sentence, and the group language model is used for describing the language characteristics of a group to which the user belongs; the group language model training corpus is a corpus of some or all users in a similar user group corresponding to the user; the similar user group comprises users of which the similarity of the word feature vectors is greater than a set threshold value;
the sorting unit is used for calculating the probability of each candidate voice recognition result by using the language recognition model of the user and sorting each candidate voice recognition result according to the calculated probability;
and the result obtaining unit is used for obtaining a final voice recognition result by using the sorted candidate voice recognition results.
6. The apparatus of claim 5, further comprising:
a personal language model establishing unit for establishing a personal language model corresponding to the user;
wherein the personal language model building unit is specifically configured to: acquiring historical text input data of a user; obtaining word characteristics and/or word combination characteristics of the user according to historical text input data of the user, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by utilizing the word characteristics and/or the word combination characteristics of the user to obtain a personal language model corresponding to the user.
7. The apparatus of claim 5, further comprising:
a group language model building unit, the group language model building unit is specifically configured to: calculating the similarity among different users, and acquiring a similar user group set according to the calculated similarity, wherein the similar user group set comprises all users with the similarity larger than a set threshold; acquiring word characteristics and/or word combination characteristics corresponding to a similar user group by using text input of each user in the similar user group set, wherein the word characteristics comprise words and statistical frequency of the words, and the word combination characteristics comprise word combinations and statistical frequency of the word combinations; and training by using the word features and/or the word combination features corresponding to the similar user groups to obtain a group language model.
8. The apparatus according to claim 7, wherein the group language model building unit is further specifically configured to:
acquire word feature vectors of different users; take each of the different users in turn as a current user, calculate the cosine distance between the word feature vector of the current user and the word feature vector of each other user, and take that cosine distance as the similarity between the current user and the other user; and add each user whose similarity with the current user is greater than a set threshold to the similar user group set corresponding to the current user.
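The cosine-distance grouping in claim 8 can be sketched as follows, representing each user's word feature vector as a word-to-frequency dict; the user ids, vocabularies, and 0.8 threshold are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word-frequency vectors (dicts: word -> frequency)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similar_group(current_vec, other_vecs, threshold):
    """Return the ids of users whose word feature vector has similarity with the
    current user's vector greater than the set threshold."""
    return {uid for uid, vec in other_vecs.items() if cosine(current_vec, vec) > threshold}

vectors = {
    "u1": {"soccer": 3, "goal": 2},
    "u2": {"soccer": 1, "goal": 1},
    "u3": {"stock": 4, "bond": 2},
}
others = {k: v for k, v in vectors.items() if k != "u1"}
group = similar_group(vectors["u1"], others, 0.8)
# u2 shares u1's vocabulary and joins the group; u3 does not
```

Computing all pairwise similarities is quadratic in the number of users, so a deployment would typically batch this offline, which is consistent with claim 11's pre-established group models.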
9. An apparatus for voice recognition, comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving a voice input of a user, and recognizing the voice input to obtain candidate voice recognition results;
performing linear interpolation on a general language model, a personal language model corresponding to the user, and a group language model corresponding to the user to obtain a language recognition model of the user; wherein the personal language model is a language model built from the historical text input data of the user and is used to measure the probability that a sentence occurs, and the group language model describes the language characteristics of the group to which the user belongs; the training corpus of the group language model is the corpus of some or all users in the similar user group corresponding to the user, the similar user group comprising users whose word feature vector similarity is greater than a set threshold;
calculating the probability of each candidate voice recognition result by using the language recognition model of the user, and sorting the candidate voice recognition results according to the calculated probabilities;
and obtaining a final voice recognition result from the sorted candidate voice recognition results.
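Claim 9's linear interpolation step combines three component models into one. A minimal sketch in which each model is a function mapping a sentence to a probability; the interpolation weights and toy probabilities below are hypothetical, since the claim fixes no values:

```python
def interpolate(models, weights):
    """Linearly interpolate language models: P(s) = sum_i w_i * P_i(s).
    Weights should sum to 1 so the combination remains a probability."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def user_model(sentence):
        return sum(w * m(sentence) for w, m in zip(weights, models))
    return user_model

# Toy component models returning constant probabilities for illustration:
general_lm  = lambda s: 0.2
personal_lm = lambda s: 0.8
group_lm    = lambda s: 0.5

user_lm = interpolate([general_lm, personal_lm, group_lm], [0.5, 0.3, 0.2])
# 0.5*0.2 + 0.3*0.8 + 0.2*0.5 = 0.44
```

In practice the weights would be tuned (e.g. on held-out user data) to balance the general model's coverage against the personal and group models' specificity.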
10. The apparatus of claim 9, wherein the one or more processors are further configured to execute the one or more programs including instructions for:
acquiring historical text input data of the user;
obtaining word features and/or word combination features of the user from the historical text input data, wherein the word features comprise words and their statistical frequencies, and the word combination features comprise word combinations and their statistical frequencies;
and training a personal language model corresponding to the user by using the word features and/or the word combination features of the user.
11. The apparatus of claim 9, wherein the one or more processors are further configured to execute the one or more programs including instructions for:
pre-establishing a language model for each group;
and acquiring the group language model corresponding to the user according to the correspondence between the user and the group language models.
12. The apparatus of claim 11, wherein the one or more processors are further configured to execute the one or more programs including instructions for:
calculating the similarity between different users, and obtaining a similar user group set according to the calculated similarities, wherein the similar user group set comprises users whose similarity is greater than a set threshold;
obtaining word features and/or word combination features corresponding to the similar user group from the text input of each user in the similar user group set, wherein the word features comprise words and their statistical frequencies, and the word combination features comprise word combinations and their statistical frequencies;
and training a group language model by using the word features and/or word combination features corresponding to the similar user group.
13. The apparatus of claim 11, wherein the one or more processors are further configured to execute the one or more programs including instructions for:
acquiring word feature vectors of different users;
taking each of the different users in turn as a current user, calculating the cosine distance between the word feature vector of the current user and the word feature vector of each other user, and taking that cosine distance as the similarity between the current user and the other user;
and adding each user whose similarity with the current user is greater than a set threshold to the similar user group set corresponding to the current user.
14. A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the voice recognition method of any one of claims 1 to 4.
CN201710537548.3A 2017-07-04 2017-07-04 Voice recognition method and device Active CN109243430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710537548.3A CN109243430B (en) 2017-07-04 2017-07-04 Voice recognition method and device


Publications (2)

Publication Number Publication Date
CN109243430A (en) 2019-01-18
CN109243430B (en) 2022-03-01

Family

ID=65083290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710537548.3A Active CN109243430B (en) 2017-07-04 2017-07-04 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN109243430B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627452B (en) * 2019-02-28 2023-05-23 百度在线网络技术(北京)有限公司 Voice decoding method and device and terminal equipment
CN110502126B (en) * 2019-05-28 2023-12-29 华为技术有限公司 Input method and electronic equipment
CN110120221A (en) * 2019-06-06 2019-08-13 上海蔚来汽车有限公司 The offline audio recognition method of user individual and its system for vehicle system
CN112242142B (en) * 2019-07-17 2024-01-30 北京搜狗科技发展有限公司 Voice recognition input method and related device
CN112490516B (en) * 2019-08-23 2021-12-07 上海汽车集团股份有限公司 Power battery maintenance mode generation system and method
CN110992939B (en) * 2019-12-18 2023-06-27 广州市百果园信息技术有限公司 Language model training method, decoding method, device, storage medium and equipment
CN111145756B (en) * 2019-12-26 2022-06-14 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN111554276B (en) * 2020-05-15 2023-11-03 深圳前海微众银行股份有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN111651599B (en) * 2020-05-29 2023-05-26 北京搜狗科技发展有限公司 Method and device for ordering voice recognition candidate results
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment
CN114327355A (en) * 2021-12-30 2022-04-12 科大讯飞股份有限公司 Voice input method, electronic device and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577386A (en) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Method and device for dynamically loading language model based on user input scene
CN105096940A (en) * 2015-06-30 2015-11-25 百度在线网络技术(北京)有限公司 Method and device for voice recognition
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2409560B (en) * 2003-12-23 2007-07-25 Ibm Interactive speech recognition model
US9043205B2 (en) * 2012-06-21 2015-05-26 Google Inc. Dynamic language model
US9190057B2 (en) * 2012-12-12 2015-11-17 Amazon Technologies, Inc. Speech model retrieval in distributed speech recognition systems
US9190055B1 (en) * 2013-03-14 2015-11-17 Amazon Technologies, Inc. Named entity recognition with personalized models


Also Published As

Publication number Publication date
CN109243430A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN109243430B (en) Voice recognition method and device
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN107247519B (en) Input method and device
CN105489220B (en) Voice recognition method and device
CN109961791B (en) Voice information processing method and device and electronic equipment
EP3171279A1 (en) Method and device for input processing
CN107564526B (en) Processing method, apparatus and machine-readable medium
US11335348B2 (en) Input method, device, apparatus, and storage medium
CN110874145A (en) Input method and device and electronic equipment
CN112562675A (en) Voice information processing method, device and storage medium
CN111160047A (en) Data processing method and device and data processing device
CN113936697B (en) Voice processing method and device for voice processing
CN109521888B (en) Input method, device and medium
CN109887492B (en) Data processing method and device and electronic equipment
CN111104807A (en) Data processing method and device and electronic equipment
CN109799916B (en) Candidate item association method and device
CN111831132A (en) Information recommendation method and device and electronic equipment
CN111739535A (en) Voice recognition method and device and electronic equipment
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN110858099B (en) Candidate word generation method and device
CN114154395A (en) Model processing method and device for model processing
CN113807540A (en) Data processing method and device
CN113589954A (en) Data processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant