CN108682423A

CN108682423A - Speech recognition method and device

Info

Publication number: CN108682423A
Application number: CN201810504702.1A
Authority: CN
Inventors: 任阳
Original assignee: Beijing Racing Current Network Information Technology Co Ltd
Current assignee: Beijing Racing Current Network Information Technology Co Ltd
Priority date: 2018-05-24
Filing date: 2018-05-24
Publication date: 2018-10-19

Abstract

The present invention provides a speech recognition method and device. The technical solution is: receive a voice signal; determine the original text information corresponding to the voice signal, and obtain the multiple segments of text information contained in the current information display interface; perform a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface; determine the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determine the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information. The present invention can adaptively adjust the speech recognition result according to the scene the user is in, improving the accuracy of speech recognition.

Description

A speech recognition method and device

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and device.

Background art

Speech recognition technology, also known as automatic speech recognition (ASR), aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes, or character strings.

Current speech recognition technology analyzes the input signal mainly by means of an acoustic model, a language model, and a dictionary, searching for the word string that outputs the signal with the maximum probability (weight). For example, if the voice signal input by the user sounds like "liudehua", and the word string with the maximum output probability for that signal is "Liu Dehua", then the user's voice signal is finally converted into the text "Liu Dehua", rather than into other similarly pronounced text written with different characters.

Existing speech recognition technology already meets the vast majority of application needs, but it still causes problems in specific scenes or contexts. For example, suppose the user is watching a film starring an actor whose name is also pronounced "liudehua" but is written with different characters, and wants to find information related to that actor (videos, documents, web pages, and so on). Because the pronunciation is "liudehua", the most probable output is still "Liu Dehua", so "Liu Dehua" is used as the final speech conversion result, and this result cannot meet the user's need in this scene.

Summary of the invention

In view of this, the purpose of the present invention is to provide a speech recognition method that can adaptively adjust the speech recognition result according to the scene the user is in, improving the accuracy of speech recognition.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A speech recognition method, comprising:

receiving a voice signal;

determining the original text information corresponding to the voice signal;

obtaining the multiple segments of text information contained in the current information display interface;

performing a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface;

determining the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determining the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

A speech recognition device, comprising: a receiving unit, a recognition unit, an acquiring unit, and a processing unit;

the receiving unit is configured to receive the user's voice signal;

the recognition unit is configured to determine, when the receiving unit receives the user's voice signal, the original text information corresponding to the user's voice signal;

the acquiring unit is configured to obtain, when the receiving unit receives the user's voice signal, the multiple segments of text information contained in the current information display interface from the information display module;

the processing unit is configured to perform a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface; and to determine the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determine the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

As can be seen from the above technical solution, the present invention first uses prior-art speech recognition technology to identify the original text information corresponding to the user's voice signal. This original text information and each segment of text information contained in the information display interface the user is currently browsing are converted into pinyin character strings of the same form and then matched, in order to find the segment of text information in the current information display interface that is closest to the original text information. The final text information corresponding to the user's voice signal is then determined according to the distance between that segment of text information and the original text information. The present invention can adaptively adjust the speech recognition result according to the scene the user is in, improving the accuracy of speech recognition.

Description of the drawings

Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the speech recognition device according to an embodiment of the present invention.

Detailed description of the embodiments

In order to make the purpose, technical solution, and advantages of the present invention clearer, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and embodiments.

The present invention is applicable to equipment that includes a speech recognition module and an information display application module. The information display application module can be an application module with a video playback function, an audio playback function, and/or a text display function, such as a video player. The information display application module can provide the text information contained in its current information display interface to the speech recognition module at the request of the speech recognition module, or it can actively push the text information contained in its current information display interface to the speech recognition module. The information display application module and the speech recognition module can be different function modules of the same application, or function modules belonging to different applications.

The method of the present invention implements the functions of the above speech recognition module.

The implementation of the present invention is described in detail below.

Referring to Fig. 1, Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step 101: receive a voice signal.

In the embodiment of the present invention, the user's voice signal is picked up by a sound pickup device and sent to the speech recognition module. The sound pickup device can be a microphone. The user inputs the voice signal through the microphone, which converts the sound made by the user into a voice signal and sends it to the speech recognition module of the equipment.

Step 102: determine the original text information corresponding to the voice signal.

In the embodiment of the present invention, existing speech recognition technology is used to determine the original text information corresponding to the user's voice signal.

There are many existing speech recognition technologies, and any of them can be used to recognize the user's voice signal and determine the corresponding text information. For ease of distinction, the text information determined by this recognition is called the original text information corresponding to the user's voice signal, because a more accurate text information corresponding to the user's voice signal (called the final text information) is subsequently found based on this text information in combination with the specific scene the user is in.

Step 103: obtain the multiple segments of text information contained in the current information display interface.

In the embodiment of the present invention, the information display interface can be provided by the information display application module. The information display application module can provide the text information contained in its current information display interface to the speech recognition module at the request of the speech recognition module, or it can actively push that text information to the speech recognition module. The speech recognition module can therefore obtain the multiple segments of text information contained in the current information display interface either by sending a request to the information display application module or by receiving an active push from it.

It should be noted that the information display interface in the present invention contains a large amount of text information, which is displayed in a certain format that divides it into multiple segments of text information. For example, multiple video names may be displayed as a list, in which each list item corresponds to one video name and that video name is one independent segment of text information. As another example, multiple video names may be displayed separated by separators, in which the text between two separators corresponds to one video name and that video name is one independent segment of text information. In the present invention, the segments of text information in the current information display interface are independent of one another. The speech recognition module needs to obtain these mutually independent segments and perform the pinyin-based distance calculation between each of them, one by one, and the original text information corresponding to the voice signal.
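The separator-based segmentation described above can be sketched as follows. This is only an illustration: the function name `split_segments`, the `|` separator, and the sample video names are assumptions, since the text does not specify a separator or a data format.

```python
def split_segments(raw: str, separator: str = "|") -> list[str]:
    """Split separator-delimited interface text into independent segments.

    The separator character is a hypothetical choice; the source only says
    that separators divide the displayed video names.
    """
    # Drop surrounding whitespace and ignore empty pieces between separators.
    return [part.strip() for part in raw.split(separator) if part.strip()]


# Hypothetical interface text containing three video names.
segments = split_segments("Video A | Video B | Video C")
print(segments)
```

Each returned element is then treated as one independent segment of text information in the distance calculation of step 104.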

Take a television system as an example. Assume that the information display application module is a video player and the speech recognition module is a function module embedded in the video player. When the user browses TV or film information in the main interface of the video player, the video player can encapsulate all the data contained in the current information display interface into a map and hand the map to the binder of the television system. The binder of the television system can forward this data to the speech recognition module, so that the speech recognition module obtains the multiple segments of text information contained in the current information display interface. In addition, the relevant information of the current information display interface can be obtained from the stack information of the operating system, specifically from the information at the top of the stack: the name of the application package is parsed from the stack-top information (each package name in the operating system is unique), and the text information of the current information display interface is stored in a file named after this package.

It should be noted that step 103 may also be performed before step 101 or step 102.

Step 104: perform a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface.

In the embodiment of the present invention, the original text information and each segment of text information contained in the current information display interface are converted into pinyin character strings, and the distance is calculated on the converted pinyin character strings.

There are two ways to convert text information into a pinyin character string. One is to convert it into a pinyin character string without tones, for example converting "Liu Dehua" into "liudehua". The other is to convert it into a pinyin character string with tones, in the form pinyin + tone, where the tone is denoted by the digit 1, 2, 3, or 4, representing the first, second, third, and fourth tone respectively. For example, "Liu Dehua" is converted into "liu2de2hua2", where "liu2" indicates that the tone of "liu" is 2, "de2" indicates that the tone of "de" is 2, and "hua2" indicates that the tone of "hua" is 2.
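The two conversion modes can be illustrated with a minimal sketch. The lookup table `PINYIN_TABLE`, the function `to_pinyin`, and the Chinese characters used for "Liu Dehua" are assumptions made for this example; a real system would rely on a complete pinyin dictionary rather than a three-entry table.

```python
# Hypothetical lookup table: character -> (syllable, tone number).
# Only the three characters needed for the example are included.
PINYIN_TABLE = {
    "刘": ("liu", 2),
    "德": ("de", 2),
    "华": ("hua", 2),
}


def to_pinyin(text: str, with_tone: bool) -> str:
    """Convert text into a pinyin character string, with or without tone digits."""
    parts = []
    for ch in text:
        syllable, tone = PINYIN_TABLE[ch]  # raises KeyError for unknown characters
        parts.append(f"{syllable}{tone}" if with_tone else syllable)
    return "".join(parts)


print(to_pinyin("刘德华", with_tone=False))  # liudehua
print(to_pinyin("刘德华", with_tone=True))   # liu2de2hua2
```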

Based on the above two ways of converting text information into pinyin character strings, there are at least the following three methods for performing the pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface:

Method 1:

Convert the original text information into a first pinyin character string without tones and a second pinyin character string with tones;

Convert each segment of text information contained in the current information display interface into a to-be-matched pinyin character string, where the to-be-matched pinyin character string is a pinyin character string without tones;

For each segment of text information contained in the current information display interface, calculate the distance between its to-be-matched pinyin character string and the first pinyin character string, and the distance between its to-be-matched pinyin character string and the second pinyin character string, and take the sum of these two distances as the distance between that segment of text information and the original text information.
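A minimal sketch of the Method 1 distance, assuming the minimum edit distance is used to compare two pinyin character strings. The function names `edit_distance` and `method1_distance` and the sample strings are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def method1_distance(candidate: str, first: str, second: str) -> int:
    """Sum of the toneless candidate's distances to the first (toneless)
    and second (toned) pinyin character strings of the original text."""
    return edit_distance(candidate, first) + edit_distance(candidate, second)


# Toneless candidate compared with both forms of the original text information.
print(method1_distance("liudehua", "liudehua", "liu2de2hua2"))  # 3
```

In this example the toneless comparison contributes 0 and the toned comparison contributes 3, one for each tone digit that the toneless candidate lacks.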

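Method 2:

Convert each segment of text information contained in the current information display interface into a to-be-matched pinyin character string, where the to-be-matched pinyin character string is a pinyin character string with tones. The remaining steps are the same as in Method 1: the to-be-matched pinyin character string is compared with the first pinyin character string and the second pinyin character string of the original text information, and the sum of the two distances is taken as the distance between that segment of text information and the original text information.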

Method 3:

Convert each segment of text information contained in the current information display interface into a first to-be-matched pinyin character string without tones and a second to-be-matched pinyin character string with tones;

For each segment of text information contained in the current information display interface, calculate the distance between its first to-be-matched pinyin character string and the first pinyin character string, and the distance between its second to-be-matched pinyin character string and the second pinyin character string, and take the sum of these two distances as the distance between that segment of text information and the original text information.
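Method 3 pairs like with like: toneless against toneless and toned against toned. The sketch below assumes the minimum edit distance as the string distance; `method3_distance` and the sample strings (including a candidate whose last syllable has tone 4) are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def method3_distance(cand_plain: str, cand_toned: str,
                     orig_plain: str, orig_toned: str) -> int:
    """Toneless-to-toneless distance plus toned-to-toned distance."""
    return edit_distance(cand_plain, orig_plain) + edit_distance(cand_toned, orig_toned)


# Same syllables, but the last syllable differs in tone (2 versus 4).
print(method3_distance("liudehua", "liu2de2hua4", "liudehua", "liu2de2hua2"))  # 1
```

Because the toned strings are compared with each other, a segment that differs from the original text only in tone is penalized by the tone difference alone, rather than by every tone digit as in Method 1.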

In the above three methods, the minimum edit distance algorithm can be used to calculate the distance between two pinyin character strings. Specifically, the minimum edit distance of the two pinyin character strings is calculated using the minimum edit distance algorithm, and this minimum edit distance is taken as the distance between the two pinyin character strings.
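For reference, the minimum edit distance can be computed with the standard dynamic-programming (Levenshtein) recurrence. The sketch below assumes unit cost for insertion, deletion, and substitution; the text does not state the cost model, so that is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum edit distance between two strings with unit operation costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("liuxue", "liudehua"))  # 4
```

Only two rows of the table are kept at any time, so the memory use is proportional to the length of the second string.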

Step 105: determine the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determine the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

From all the distance results calculated in step 104, namely the distance between the original text information and each segment of text information contained in the current information display interface, the segment of text information with the shortest distance to the original text information can be found. The pinyin character string of this segment has the highest similarity to the pinyin character string of the original text information, so this segment is the most likely to be the text conversion result the user wants for the input voice signal.
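Selecting the closest segment amounts to taking the minimum over the calculated distances. In the sketch below, `closest_segment` is a hypothetical helper and the distances in the example are made-up values, not results from the source.

```python
def closest_segment(distances: dict[str, int]) -> tuple[str, int]:
    """Return the segment with the shortest distance and that distance.

    `distances` maps each segment of text information to its distance from
    the original text information, as calculated in step 104.
    """
    segment = min(distances, key=distances.get)
    return segment, distances[segment]


# Hypothetical distances for three segments on the interface.
print(closest_segment({"segment A": 7, "segment B": 2, "segment C": 5}))
```

If several segments share the shortest distance, this sketch returns the first one encountered; the text does not say how ties should be resolved.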

Because the search always returns the segment of text information with the shortest distance to the original text information, however large that distance is, the distance between the segment found and the original text information may still be large. For example, the segment "Liu Xue" is converted into "liuxue" and the original text information "Liu Dehua" is converted into "liudehua". In this case the segment and the original text information do not actually match, and using "Liu Xue" as the text information corresponding to the user's voice signal would clearly be wrong.

To solve this problem, in the present invention, after the segment of text information with the shortest distance to the original text information has been found in the current information display interface, the final text information corresponding to the user's voice signal is further determined according to the distance between that segment and the original text information. The specific method is: determine a distance metric value according to that segment of text information and the original text information; if the distance between that segment and the original text information is less than the distance metric value, use that segment as the final text information corresponding to the user's voice signal; otherwise, use the original text information as the final text information corresponding to the user's voice signal.
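The decision rule above can be sketched as a single comparison. The function name `final_text` and the numeric values in the example are illustrative assumptions.

```python
def final_text(original: str, segment: str, distance: float, metric: float) -> str:
    """Pick the closest segment only if its distance is below the distance metric value."""
    return segment if distance < metric else original


# Hypothetical values: a distance of 4 against a distance metric value of 3.0
# rejects the segment, so the original text information is kept.
print(final_text("Liu Dehua", "Liu Xue", distance=4, metric=3.0))  # Liu Dehua
```

The comparison is strict, following the "less than" wording of the text, so a distance equal to the distance metric value falls back to the original text information.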

In the present invention, a function f(x) relating string length to distance metric value can be set in advance, following the principle that the longer the string, the larger the distance metric value. For example, f(x) = ax + b, where a and b are preset adjustment coefficients that can be set according to actual needs or experience. In the simplest case, f(x) can be defined by assigning a corresponding distance metric value to each specific string length, that is, by enumerating the distance metric value for every string length.

The method for determining a distance metric value according to that segment of text information and the original text information can be as follows: determine the length L1 of the toneless pinyin character string converted from that segment of text information and the length L2 of the toneless pinyin character string converted from the original text information, take the maximum of L1 and L2, determine the distance metric value corresponding to this maximum length according to the function, and use it as the distance metric value.
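A minimal sketch of the linear form f(x) = ax + b applied to max(L1, L2). The coefficient values a = 0.25 and b = 1.0 are assumptions chosen only for illustration; the text leaves them to actual needs or experience.

```python
A = 0.25  # hypothetical adjustment coefficient a
B = 1.0   # hypothetical adjustment coefficient b


def distance_metric(segment_pinyin: str, original_pinyin: str,
                    a: float = A, b: float = B) -> float:
    """Evaluate f(x) = a*x + b at x = max(L1, L2), the longer toneless pinyin length."""
    longest = max(len(segment_pinyin), len(original_pinyin))
    return a * longest + b


# L1 = 6 for "liuxue", L2 = 8 for "liudehua", so f(8) = 0.25 * 8 + 1.0.
print(distance_metric("liuxue", "liudehua"))  # 3.0
```

With these assumed coefficients, the "Liu Xue" example gets a distance metric value of 3.0, so a minimum edit distance of 4 between "liuxue" and "liudehua" would not be below it and the original text information would be kept.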

The speech recognition method of the embodiment of the present invention has been described in detail above. The present invention also provides a speech recognition device, which is described in detail below with reference to Fig. 2:

Referring to Fig. 2, Fig. 2 is a schematic structural diagram of the speech recognition device according to an embodiment of the present invention. The equipment in which the speech recognition device is located further includes an information display application module. As shown in Fig. 2, the device includes: a receiving unit 201, a recognition unit 202, an acquiring unit 203, and a processing unit 204; wherein,

the receiving unit 201 is configured to receive the user's voice signal;

the recognition unit 202 is configured to determine, when the receiving unit 201 receives the voice signal, the original text information corresponding to the voice signal;

the acquiring unit 203 is configured to obtain, when the receiving unit 201 receives the voice signal, the multiple segments of text information contained in the current information display interface;

the processing unit 204 is configured to perform a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface; and to determine the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determine the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

In the device shown in Fig. 2,

the processing unit 204, when performing the pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface, is configured to:

convert each segment of text information contained in the current information display interface into a to-be-matched pinyin character string, where the to-be-matched pinyin character string is a pinyin character string without tones or a pinyin character string with tones;

In the device shown in Fig. 2,

the processing unit 204 calculates the distance between two pinyin character strings based on the minimum edit distance algorithm, specifically: the minimum edit distance of the two pinyin character strings is calculated using the minimum edit distance algorithm, and this minimum edit distance is taken as the distance between the two pinyin character strings.

In the device shown in Fig. 2,

the processing unit 204, when determining the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information, is configured to: determine a distance metric value according to that segment of text information and the original text information; if the distance between that segment of text information and the original text information is less than the distance metric value, use that segment of text information as the final text information corresponding to the user's voice signal; otherwise, use the original text information as the final text information corresponding to the user's voice signal.

The device shown in Fig. 2 further includes a configuration unit 205;

the configuration unit 205 is configured to set in advance a function relating string length to distance metric value, following the principle that the longer the string, the larger the distance metric value;

the processing unit 204, when determining a distance metric value according to that segment of text information and the original text information, is configured to: determine the length L1 of the toneless pinyin character string converted from that segment of text information and the length L2 of the toneless pinyin character string converted from the original text information, take the maximum of L1 and L2, determine the distance metric value corresponding to this maximum length according to the function, and use it as the distance metric value.

Practice has shown that applying the method of the present invention can greatly improve the success rate of speech recognition. In particular, because the method brings the specific environment the user is in into the speech recognition process, it can considerably improve the user experience.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A speech recognition method, characterized in that the method comprises:

receiving a voice signal;

performing a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface;

determining the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determining the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

2. The method according to claim 1, characterized in that

the method for performing the pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface is:

converting the original text information into a first pinyin character string without tones and a second pinyin character string with tones;

converting each segment of text information contained in the current information display interface into a to-be-matched pinyin character string, where the to-be-matched pinyin character string is a pinyin character string without tones or a pinyin character string with tones;

for each segment of text information contained in the current information display interface, calculating the distance between its to-be-matched pinyin character string and the first pinyin character string, and the distance between its to-be-matched pinyin character string and the second pinyin character string, and taking the sum of these two distances as the distance between that segment of text information and the original text information.

3. The method according to claim 1, characterized in that

each segment of text information contained in the current information display interface is converted into a first to-be-matched pinyin character string without tones and a second to-be-matched pinyin character string with tones;

for each segment of text information contained in the current information display interface, the distance between its first to-be-matched pinyin character string and the first pinyin character string, and the distance between its second to-be-matched pinyin character string and the second pinyin character string, are calculated, and the sum of these two distances is taken as the distance between that segment of text information and the original text information.

4. The method according to claim 2 or 3, characterized in that

the distance between two pinyin character strings is calculated based on the minimum edit distance algorithm, specifically: the minimum edit distance of the two pinyin character strings is calculated using the minimum edit distance algorithm, and this minimum edit distance is taken as the distance between the two pinyin character strings.

5. The method according to claim 1, 2 or 3, characterized in that

the method for determining the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information is: determining a distance metric value according to that segment of text information and the original text information; if the distance between that segment of text information and the original text information is less than the distance metric value, using that segment of text information as the final text information corresponding to the user's voice signal; otherwise, using the original text information as the final text information corresponding to the user's voice signal.

6. The method according to claim 5, characterized in that

a function relating string length to distance metric value is set in advance, following the principle that the longer the string, the larger the distance metric value;

the method for determining a distance metric value according to that segment of text information and the original text information is: determining the length L1 of the toneless pinyin character string converted from that segment of text information and the length L2 of the toneless pinyin character string converted from the original text information, taking the maximum of L1 and L2, determining the distance metric value corresponding to this maximum length according to the function, and using it as the distance metric value.

7. A speech recognition device, characterized in that the device comprises: a receiving unit, a recognition unit, an acquiring unit, and a processing unit;

the receiving unit is configured to receive the user's voice signal;

the recognition unit is configured to determine, when the receiving unit receives the user's voice signal, the original text information corresponding to the user's voice signal;

the acquiring unit is configured to obtain, when the receiving unit receives the user's voice signal, the multiple segments of text information contained in the current information display interface from the information display module;

the processing unit is configured to perform a pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface; and to determine the segment of text information contained in the current information display interface that has the shortest distance to the original text information, and determine the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information.

8. The speech recognition device according to claim 7, characterized in that

the processing unit, when performing the pinyin-based distance calculation between the original text information and each segment of text information contained in the current information display interface, is configured to:

9. The speech recognition device according to claim 7, characterized in that

10. The speech recognition device according to claim 8 or 9, characterized in that

the processing unit calculates the distance between two pinyin character strings based on the minimum edit distance algorithm, specifically: the minimum edit distance of the two pinyin character strings is calculated using the minimum edit distance algorithm, and this minimum edit distance is taken as the distance between the two pinyin character strings.

11. The speech recognition device according to claim 7, 8 or 9, characterized in that

the processing unit, when determining the final text information corresponding to the user's voice signal according to the distance between that segment of text information and the original text information, is configured to: determine a distance metric value according to that segment of text information and the original text information; if the distance between that segment of text information and the original text information is less than the distance metric value, use that segment of text information as the final text information corresponding to the user's voice signal; otherwise, use the original text information as the final text information corresponding to the user's voice signal.

12. The speech recognition device according to claim 11, characterized in that it further comprises a configuration unit;

the configuration unit is configured to set in advance a function relating string length to distance metric value, following the principle that the longer the string, the larger the distance metric value;

the processing unit, when determining a distance metric value according to that segment of text information and the original text information, is configured to: determine the length L1 of the toneless pinyin character string converted from that segment of text information and the length L2 of the toneless pinyin character string converted from the original text information, take the maximum of L1 and L2, determine the distance metric value corresponding to this maximum length according to the function, and use it as the distance metric value.