CN106384593B - Voice information conversion method, information generation method, and device - Google Patents
- Publication number
- CN106384593B CN106384593B CN201610801720.7A CN201610801720A CN106384593B CN 106384593 B CN106384593 B CN 106384593B CN 201610801720 A CN201610801720 A CN 201610801720A CN 106384593 B CN106384593 B CN 106384593B
- Authority
- CN
- China
- Prior art keywords
- information
- target voice
- text conversion
- client
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/06—Message adaptation to terminal or network requirements
- H04L51/066—Format adaptation, e.g. format conversion or compression
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The embodiments of the present application disclose a voice information conversion method, an information generation method, and corresponding devices, relating to the field of computer technology and applied to an electronic device. The voice information conversion method includes: receiving target voice information; and, when an information conversion condition is met, performing speech recognition on the target voice information to obtain text conversion information, so that a client displays the text conversion information based on the display position of the target voice information. With the solutions provided by the embodiments of the present application, speech can be converted into text.
Description
Technical field
The present application relates to the field of computer technology, and in particular to a voice information conversion method, an information generation method, and corresponding devices.
Background technique
With the rapid development of computer technology, various communication clients for user terminals have emerged. Users can communicate with friends through these clients.
When communicating through such a client, a user may exchange text messages or voice messages, which is very convenient. However, in some situations (for example, while attending a meeting, in a noisy environment, or when the user does not want others to overhear), it is inconvenient to play back a received voice message. In such cases it is desirable to convert the received voice message into text and display it to the user.
In view of the above, a voice information conversion method is needed to convert voice information into text information.
Summary of the invention
The embodiments of the present application disclose a voice information conversion method, an information generation method, and corresponding devices, so as to convert voice information into text information.
To achieve the above objective, an embodiment of the present application discloses a voice information conversion method applied to an electronic device. The method includes:
receiving target voice information;
when an information conversion condition is met, performing speech recognition on the target voice information to obtain text conversion information, so that a client displays the text conversion information based on the display position of the target voice information.
To achieve the above objective, an embodiment of the present application discloses an information generation method. The method includes:
receiving target voice information sent by a source client;
performing speech recognition on the target voice information according to a first user voice library of a first user corresponding to the target voice information;
sending the recognition result to the source client;
receiving correction information for the recognition result sent by the source client;
updating the first user voice library according to the correction information.
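The receive-recognize-correct-update loop above can be sketched minimally as follows. This is an illustrative reduction, not the patent's implementation: the user voice library is stood in for by a plain dict keyed on a pinyin-like pronunciation string, and the function names (`handle_voice`, `apply_correction`) are hypothetical.

```python
def handle_voice(user_library, pronunciation):
    # Server side: recognize the incoming voice using the sender's personal
    # library and return the result to the source client.
    return user_library.get(pronunciation, "<unknown>")

def apply_correction(user_library, pronunciation, corrected_text):
    # The correction information sent back by the source client overwrites
    # (or creates) the entry, so the user's unique pronunciation maps to the
    # right text on subsequent messages.
    user_library[pronunciation] = corrected_text

lib = {"wa2 er": "蛙儿"}                  # an initially wrong entry
first = handle_voice(lib, "wa2 er")       # recognition result sent to the client
apply_correction(lib, "wa2 er", "孩子")   # client's correction is folded back in
second = handle_voice(lib, "wa2 er")
```

After the correction is applied, the same pronunciation resolves to the corrected text, which is the training effect the method describes.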
To achieve the above objective, an embodiment of the present application discloses a voice information conversion device applied to an electronic device. The device includes:
an information receiving module, configured to receive target voice information;
a speech recognition module, configured to perform speech recognition on the target voice information when an information conversion condition is met, obtaining text conversion information so that a client displays the text conversion information based on the display position of the target voice information.
To achieve the above objective, an embodiment of the present application discloses an information generation device. The device includes:
an information receiving module, configured to receive target voice information sent by a source client;
a speech recognition module, configured to perform speech recognition on the target voice information according to a first user voice library of a first user corresponding to the target voice information;
a result sending module, configured to send the recognition result to the source client;
a correction information receiving module, configured to receive correction information for the recognition result sent by the source client;
a voice library update module, configured to update the first user voice library according to the correction information.
As can be seen from the above, in the solutions provided by the embodiments of the present application, after target voice information is received, speech recognition is performed on it to obtain text conversion information when the information conversion condition is met, and the client can then display the text conversion information based on the display position of the target voice information. Thus, with the solutions provided by the embodiments of the present application, voice information can be converted into text information.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a first voice information conversion method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a second voice information conversion method provided by an embodiment of the present application;
Fig. 3 is a schematic flowchart of a third voice information conversion method provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of a fourth voice information conversion method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of a fifth voice information conversion method provided by an embodiment of the present application;
Fig. 6 is a schematic flowchart of a sixth voice information conversion method provided by an embodiment of the present application;
Fig. 7 is a schematic flowchart of an information generation method provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a voice information conversion device provided by an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an information generation device provided by an embodiment of the present application;
Fig. 10a is a schematic diagram of a first voice information conversion effect provided by an embodiment of the present application;
Fig. 10b is a schematic diagram of a second voice information conversion effect provided by an embodiment of the present application.
Detailed description of embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Concepts involved in the embodiments of the present application are first introduced below.
1. Voice library
A voice library stores correspondences between voice information and text information, and contains at least one such correspondence. The text information corresponding to a piece of voice information may be in a single language or in multiple languages; the present application does not limit this.
In addition, a voice library may be one built into the system, or one trained during use.
For example, a piece of voice information with the pronunciation "hai zi" can correspond to the word "child" (孩子), while in the Sichuan dialect the same pronunciation can correspond to "shoes" (鞋子).
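As a rough sketch of the correspondence store described above, assuming purely for illustration that pronunciations can be keyed as pinyin-like strings (a real system would match acoustic features, and the class and method names here are invented):

```python
class VoiceLibrary:
    """A store of pronunciation -> text correspondences (illustrative only)."""

    def __init__(self, name, pairs=None):
        self.name = name
        self.pairs = dict(pairs or {})

    def add_pair(self, pronunciation, text):
        # A trained library grows by adding correspondences during use.
        self.pairs[pronunciation] = text

    def lookup(self, pronunciation):
        # Returns None when the library holds no correspondence for this sound.
        return self.pairs.get(pronunciation)

# The same sound can map to different text in different libraries:
mandarin = VoiceLibrary("Mandarin", {"hai2 zi5": "孩子"})   # "child"
sichuan = VoiceLibrary("Sichuan", {"hai2 zi5": "鞋子"})     # "shoes" in dialect
```

The two instances illustrate why per-dialect and per-user libraries matter: one pronunciation key resolves differently depending on which library is consulted.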
2. Voice library set
A voice library set contains at least one voice library. These libraries can be divided into standard voice libraries, common voice libraries, categorized voice libraries, and so on.
A standard voice library is a library for a standard form of a language, for example a library for Mandarin Chinese, a library for British English, or a library for American English.
A common voice library is a library selected according to the user's own situation; it may be a library chosen by the client according to the geographical location of the current device, or a library chosen by the client according to the current language setting.
A categorized voice library is a library obtained according to some classification basis.
For example, by language type, libraries can be divided into a Mandarin Chinese library, a Sichuan-dialect library, a Cantonese library, an English library, a German library, a French library, a Russian library, and so on; by professional field, libraries can be divided into a legal library, a computer library, an economics library, and so on.
In addition, existing methods for converting speech to text usually support only the recognition of Mandarin. In practice, however, many users speak with an accent, have their own distinctive pronunciations of certain words, or frequently mispronounce some words. Communication between users of different countries in different languages is also becoming more frequent, and such communication usually takes place over wireless networks, where bandwidth and network conditions are complex. As a result, existing speech-to-text methods cannot effectively recognize all speech and convert it smoothly into text, and cannot meet the needs of most users.
Since speech has personal characteristics, that is, each user differs in accent, pronunciation, clarity, speaking rate, and tone, a voice library can be established for each user, referred to as a user voice library, and the voice library set may contain the user voice library of each user.
Specifically, when establishing a voice library for a user, an initial library can first be set. Then, during use, the initial library is continually corrected according to the user's voice information, recognition results, and correction information, to obtain the user voice library for that user, so that each user's unique pronunciations can be mapped to the correct Chinese text or English words.
The initial library can be selected manually. For example, if the user normally speaks Mandarin, the initial library can be the Mandarin Chinese library; if the user normally speaks the Sichuan dialect, the initial library can be the Sichuan-dialect library. The initial library may also be a default voice library; the present application does not limit this.
Fig. 1 is a schematic flowchart of a first voice information conversion method provided by an embodiment of the present application. The method is applied to an electronic device.
Specifically, the electronic device serving as the execution subject of this embodiment may be a server or a user terminal. Further, when the electronic device is a user terminal, various clients are usually installed on the terminal; in that case, the execution subject of the solution may also be understood as a client.
Specifically, the method includes:
S101: receiving target voice information.
S102: when an information conversion condition is met, performing speech recognition on the target voice information to obtain text conversion information, so that a client displays the text conversion information based on the display position of the target voice information.
The text conversion information includes text in at least one form. For example, it may contain only Chinese, or both Chinese and English, or multiple forms such as Chinese, English, French, and German.
Specifically, when performing speech recognition on the target voice information to obtain text conversion information once the information conversion condition is met, the recognition may be carried out directly after the target voice information is received, or only after an information conversion instruction for the target voice information is detected.
That is, receiving the target voice information itself may be considered to meet the information conversion condition, or the condition may be considered met only after an information conversion instruction is received.
In addition, the text information obtained by performing speech recognition on the target voice information may contain consecutively repeated words, or words that obviously do not match normal speech habits. Therefore, after the text information is obtained, it can first be corrected according to preset correction rules, and the text conversion information of the target voice information is then obtained from the corrected text. Of course, the text information may also be used directly as the text conversion information of the target voice information; the present application takes the former as an example and does not limit this.
Specifically, the preset correction rules may include filtering out repeated words and replacing words that do not match speech habits with words that do. The present application does not limit the specific content of the preset correction rules.
For example, "I I like" is corrected to "I like";
"Your kiddo is very lovely" is corrected to "Your child is very lovely".
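A minimal sketch of such correction rules, assuming whitespace-delimited text for simplicity (the function name and the habit-word table are illustrative, not from the patent):

```python
def apply_correction_rules(text, habit_fixes=None):
    # Rule 1: collapse immediately repeated words ("I I like" -> "I like").
    deduped = []
    for word in text.split():
        if not deduped or deduped[-1] != word:
            deduped.append(word)
    result = " ".join(deduped)
    # Rule 2: replace words that do not match normal speech habits with
    # their standard counterparts, using a preset lookup table.
    for habit, standard in (habit_fixes or {}).items():
        result = result.replace(habit, standard)
    return result

print(apply_correction_rules("I I like it"))  # -> I like it
print(apply_correction_rules("your kiddo is very lovely",
                             {"kiddo": "child"}))  # -> your child is very lovely
```

Chinese text would need character-level rather than whitespace handling, but the two rules (deduplication, then habit-word substitution) follow the examples above directly.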
The following describes different cases according to whether the execution subject is a server or a client.
Assume that a first client and a second client are communicating.
Case one: the execution subject is a server.
The first client sends the target voice information to the server; after receiving it, the server forwards the target voice information to the second client, and the second client displays it on its display interface.
In one situation, the server starts performing speech recognition on the target voice information to obtain text conversion information as soon as it receives the information. When the server later receives an information conversion instruction from the second client, it sends the text conversion information to the second client, which displays it based on the display position of the target voice information. Because recognition starts as soon as the voice information is received, the server can respond quickly once the second client's instruction arrives; the delay in delivering the text conversion information is small, and the user experience is good.
In another situation, the server does not start recognition immediately after receiving the target voice information, but only after receiving the information conversion instruction from the second client. In this way, the server performs recognition only when the client actually needs the conversion, which effectively saves server resources.
Case two: the execution subject is a server.
The first client sends the target voice information directly to the second client, while the first client or the second client also sends the target voice information to the server. After receiving the target voice information, the second client displays it on its display interface.
As in case one, the server may either start speech recognition as soon as it receives the target voice information, so that it can respond quickly with the text conversion information once the second client's information conversion instruction arrives, or defer recognition until the instruction is received, so that recognition is performed only when the client actually needs the conversion, saving server resources.
Case three: the execution subject is a server.
The first client sends the target voice information to the second client, and the second client displays it on its display interface.
When the second client needs voice-to-text conversion, it sends an information conversion instruction carrying the target voice information to the server. After receiving the instruction, the server parses the target voice information out of it, performs speech recognition to obtain the text conversion information, and sends the result to the second client, which displays it.
Case four: the execution subject is a client.
The first client sends the target voice information directly to the second client, which displays it on its display interface. In this case the voice communication does not pass through a server, so other devices cannot obtain the target voice information, which improves the security of the communication to a certain extent.
In one situation, the second client starts performing speech recognition to obtain the text conversion information as soon as it receives the target voice information; after receiving an information conversion instruction, it directly displays the text conversion information based on the display position of the target voice information.
In another situation, after receiving the target voice information the second client only displays it, without starting recognition; only after receiving an information conversion instruction does it perform speech recognition to obtain the text conversion information of the target voice information, and then display it based on the display position of the target voice information.
In cases one, two, and three, the second client may send the information conversion instruction to the server after receiving a long-press on the target voice information, or after receiving the user's selection of a voice conversion option. The present application does not limit the condition that triggers the second client to send the information conversion instruction to the server.
In case four, a long-press on the target voice information, or the user's selection of a voice conversion option, may itself be regarded as the second client having received an information conversion instruction.
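The trigger logic discussed above amounts to a small predicate over two policies. A sketch, with hypothetical event names standing in for the client's actual UI events:

```python
def meets_switch_condition(mode, event=None):
    # "eager": merely receiving the voice message meets the condition,
    # so recognition can start immediately on receipt.
    if mode == "eager":
        return True
    # "on-demand": only an explicit conversion instruction meets it,
    # e.g. a long-press on the message bubble or a menu selection.
    if mode == "on-demand":
        return event in ("long_press", "convert_menu_selected")
    return False
```

The eager policy trades server or client compute for low display latency; the on-demand policy does the reverse, matching the two situations described for each case above.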
In addition, when the client displays the text conversion information based on the display position of the target voice information, it may display the text in a preset direction relative to that position within the page that displays the target voice information, for example below or above it. The distance between the text conversion information and the target voice information can be set according to actual needs; the present application does not limit the specific display position, which can be determined according to the actual situation. Fig. 10a and Fig. 10b are schematic diagrams showing the text conversion information displayed below and above the display position of the target voice information, respectively.
As can be seen from the above, in the solution provided by this embodiment, after the target voice information is received, speech recognition is performed on it to obtain text conversion information when the information conversion condition is met, and the client can then display the text conversion information based on the display position of the target voice information. Thus, with the solution provided by this embodiment, voice information can be converted into text information.
In a specific implementation of the present application, referring to Fig. 2, a schematic flowchart of a second voice information conversion method is provided. Compared with the previous embodiment, in this embodiment, performing speech recognition on the target voice information to obtain text conversion information when the information conversion condition is met, so that the client displays the text conversion information based on the display position of the target voice information (S102), includes:
S102A: when the information conversion condition is met, performing voice segmentation on the target voice information to obtain the voice segments contained in it.
Specifically, voice segmentation may be performed using amplitude information of the sound in the voice information, for example by detecting pauses indicated by the amplitude, or according to a fixed data length. The present application does not limit the specific segmentation method.
In short, segmenting voice information can be understood simply as cutting it into pieces, each resulting segment corresponding to a small piece of voice information. That is, a voice segment is a piece of voice information whose data length is smaller than that of the target voice information.
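One way to sketch the pause-based variant of segmentation, assuming the voice information is available as a sequence of amplitude samples (the threshold and gap length below are arbitrary illustrative values, not from the patent):

```python
def segment_by_pauses(samples, threshold=0.05, min_gap=3):
    # Cut the amplitude sequence wherever `min_gap` consecutive samples stay
    # below `threshold` (a pause); each resulting piece is one "voice segment".
    segments, current, quiet = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet += 1
            if quiet >= min_gap and current:
                segments.append(current)
                current = []
        else:
            current.append(s)
            quiet = 0
    if current:
        segments.append(current)
    return segments

pieces = segment_by_pauses([0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8])
# two segments: [0.5, 0.6] and [0.7, 0.8]
```

The fixed-data-length alternative mentioned above would simply slice the sample list into equal chunks instead of looking for quiet runs.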
S102B: for each voice segment obtained, selecting a voice library from the voice library set according to a preset library selection rule, and performing speech recognition on the segment with the selected library to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information.
The preset library selection rule may be a rule specifying the order in which voice libraries are tried. During recognition, a library is first selected according to the order specified in the rule and used to recognize the voice segment; if the recognition result shows a low recognition rate, the next library in the order is selected and the segment is recognized again, until the recognition rate meets a preset requirement.
For example, suppose the preset library selection rule is: first select the user voice library, and select the standard voice library if the result from the user voice library does not meet the preset requirement. Then, when recognizing a voice segment, recognition is first performed with the current user's user voice library. If the result shows a recognition rate above a preset threshold, that result is taken as the speech recognition result of the segment; otherwise, the segment is recognized with the standard voice library, and that result is taken as the speech recognition result of the segment.
Of course, the preset rule may also be the reverse: first select the standard voice library, and select the user voice library if the result from the standard library does not meet the preset requirement.
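The try-in-order-with-threshold behaviour described above can be sketched like this, with each library modelled as a callable returning a (text, confidence) pair; all names and the toy confidence values are illustrative assumptions:

```python
def recognize_with_fallback(segment, ordered_libraries, min_confidence=0.8):
    # Try each library in the order given by the selection rule; accept the
    # first result that clears the threshold, else fall back to the best seen.
    best_text, best_conf = "", 0.0
    for recognize in ordered_libraries:
        text, confidence = recognize(segment)
        if confidence >= min_confidence:
            return text
        if confidence > best_conf:
            best_text, best_conf = text, confidence
    return best_text

def make_library(table):
    # Toy recognizer: high confidence when the segment's key is in the table.
    def recognize(segment):
        return (table.get(segment, ""), 0.9 if segment in table else 0.1)
    return recognize

user_lib = make_library({"wa2 er": "孩子"})       # personal pronunciation
standard_lib = make_library({"hai2 zi5": "孩子"})  # standard pronunciation
```

With the user library first, a personal pronunciation is resolved immediately; a standard pronunciation falls through to the standard library, which is exactly the ordering behaviour the rule specifies.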
Specifically, the preset library selection rule may be determined according to at least one of the following:
- the category of the target group to which the first user who sent the target voice information was speaking: for example, if the group's category is law, the topics discussed there are likely to be law-related, so a legal voice library is preferred; if the category is IT, an IT voice library is preferred;
- the language of the target group's name: for example, if the name is in Chinese, the Mandarin Chinese library is preferred; if in English, the English library is preferred;
- attribute information of the first user, which may include gender, age, and so on: for example, if the first user's attributes are gender female and age 5, a female-voice library and a child-voice library are preferred;
- the geographical location of the source client that sent the target voice information, which to a certain extent reflects the language the user is likely to use: for example, if the source client is in Beijing, the user is likely to speak Mandarin, so the standard library can be preferred; if in Britain, the user is likely to speak English, so the English library can be preferred. The location can be obtained from the source client's IP address, mobile signal, GPS information, and so on;
- among the voice segments corresponding to the first user's text information and/or voice information stored before the target voice information was received, the voice libraries to which the first preset number of most frequently occurring voice segments belong;
- among the voice segments corresponding to the target group's text information and/or voice information stored before the target voice information was received, the voice libraries to which the second preset number of most frequently occurring voice segments belong;
- a library selection order set by the user.
It should be noted that the above is merely illustrative; in practical applications, the specific content of the preset speech library selection rule is not limited thereto.
In one implementation of the present application, the preset speech library selection rule determined in the above manner can be combined with user speech libraries when performing speech recognition. Specifically, a group generally contains multiple users, and the speech library set may include a user speech library for each user. When performing speech recognition, the user speech library of the user who sent the voice information may be used first; if the recognition result does not meet the requirement, the speech library determined from the group's information is selected next for recognition; if the recognition result still does not meet the requirement, a common speech library, a standard speech library, or the like may be selected for speech recognition. For example, when performing speech recognition on voice information sent by user A in a law-related group, the user speech library of user A may be used first, and if the recognition result does not meet the requirement, the law-category speech library is selected for speech recognition.
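The fallback order just described (user library first, then the group-determined library, then common/standard libraries) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `SpeechLibrary` and its confidence scores are hypothetical stand-ins for a real recognition engine.

```python
class SpeechLibrary:
    """Hypothetical stand-in for a speech library; `scores` maps a
    voice segment to a (text, confidence) recognition result."""
    def __init__(self, name, scores):
        self.name = name
        self.scores = scores

    def recognize(self, segment):
        return self.scores.get(segment, ("", 0.0))


def recognize_with_fallback(segment, libraries, min_confidence=0.8):
    """Try each library in priority order (user library first, then the
    group-derived library, then common/standard libraries); return the
    first result that meets the requirement, else the best result seen."""
    best_text, best_conf, best_lib = "", -1.0, None
    for lib in libraries:
        text, conf = lib.recognize(segment)
        if conf >= min_confidence:
            return text, lib.name
        if conf > best_conf:
            best_text, best_conf, best_lib = text, conf, lib.name
    return best_text, best_lib
```

In the user-A example above, the user library would be consulted first and the law-category library would only be tried when the user library's result falls below the requirement.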
In addition, in this step, selecting a speech library from the speech library set and performing speech recognition on a voice word segment with the selected library is carried out for each individual voice word segment obtained: the above process is repeated for each of them.
As can be seen from the above, in the solution provided by this embodiment, voice word segmentation is first performed on the target voice information, and a speech library is then selected for each voice word segment to perform speech recognition. This improves the accuracy of speech recognition, especially when the target voice information contains voice information in different languages or from different users.
It should be understood that even for the same user, the accent may vary: sometimes the user communicates in Mandarin, and sometimes in dialect. In view of this, in a specific implementation of the present application, referring to Fig. 3, a flow diagram of a third voice information conversion method is provided. Compared with the previous embodiment, in this embodiment, performing speech recognition on the target voice information to obtain text conversion information when the information conversion condition is met, so that the client displays the text conversion information based on the display position of the target voice information (S102), comprises:
S102C: when the information conversion condition is met, obtaining the first target voice segment of the target voice information according to a preset voice segment determination rule.
Specifically, the preset voice segment determination rule may be: detect the first pause in the target voice information, and determine the voice information from the start position of the target voice information to the detected pause position as the first target voice segment of the target voice information.
When detecting a pause in the target voice information, the amplitude information of the voice may be referred to, and of course other information may also be referred to. Methods for detecting pauses in voice information belong to the prior art and are not described in detail here.
In addition, the preset voice segment determination rule may also be: starting from the start position of the target voice information, select voice information of a preset length as the first target voice segment of the target voice information.
The preset length may be a preset fixed value, or a value determined according to the length of the target voice information.
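The first of the two segment-determination rules (cut the message at the first pause) can be sketched as follows, assuming the pause is detected from amplitude alone. The silence threshold and minimum pause length are illustrative values, not taken from the specification.

```python
def first_voice_segment(samples, silence_threshold=0.02, min_pause=3):
    """Return the samples from the start of the message up to the first
    pause, where a pause is `min_pause` consecutive samples whose
    absolute amplitude falls below `silence_threshold`."""
    run = 0
    for i, s in enumerate(samples):
        if abs(s) < silence_threshold:
            run += 1
            if run >= min_pause:
                # Cut just before the pause began.
                return samples[:i - run + 1]
        else:
            run = 0
    return samples  # no pause found: the whole message is one segment
```

A real detector would work on frames rather than raw samples and might also consult spectral features, as the text notes.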
S102D: using each speech library in the speech library set in turn, performing speech recognition on the target voice segment.
S102E: determining the speech library with the highest recognition rate as the target speech library.
S102F: performing speech recognition, with the target speech library, on the part of the target voice information other than the target voice segment, to obtain a first recognition result.
S102G: obtaining the text conversion information of the target voice information according to the first recognition result and a second recognition result, so that the client displays the text conversion information based on the display position of the target voice information.
Here, the second recognition result is the result of performing speech recognition on the target voice segment using the target speech library.
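Steps S102D and S102E amount to a trial pass of every library over the first segment, keeping the best scorer. A minimal sketch, where each recognizer is a hypothetical function returning a recognition rate in [0, 1]:

```python
def select_target_library(first_segment, recognizers):
    """`recognizers` maps library name -> function(segment) -> recognition
    rate in [0, 1] (S102D); the library scoring highest on the first
    target voice segment becomes the target library (S102E) used for the
    rest of the message (S102F)."""
    rates = {name: fn(first_segment) for name, fn in recognizers.items()}
    return max(rates, key=rates.get)
```

The library names and rates below are invented purely for illustration.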
Those skilled in the art will understand that determining the target voice segment may involve error, and using each speech library in the speech library set to perform speech recognition on the target voice segment may also involve error; therefore, the target speech library finally selected may not be the optimal speech library. In view of this, in an optional implementation of the present application, when obtaining the text conversion information of the target voice information according to the first recognition result and the second recognition result, a first recognition rate for the target speech library may be obtained according to the first recognition result and the second recognition result, and it is then judged whether the first recognition rate is less than a preset recognition rate threshold. If so, speech recognition is performed on the target voice information using a default speech library, and the text conversion information of the target voice information is obtained according to the recognition result for the default speech library.
Specifically, the default speech library may include one speech library or multiple speech libraries. For example, the default speech library may include a standard speech library, an English speech library, a user speech library, and so on. In addition, usage priority information may be specified for the speech libraries included in the default speech library. For example, it may be specified that the priority of the user speech library is higher than that of the standard speech library, and the priority of the standard speech library is higher than that of the English speech library. That is, the user speech library is selected first; if the recognition result is unsatisfactory, the standard speech library is selected for speech recognition; if that recognition result is also unsatisfactory, the English speech library is selected instead. If this recognition result meets the requirement, it is taken as the final recognition result; if it is still unsatisfactory, the best of the recognition results corresponding to the target speech library, the standard speech library, and the English speech library is taken as the final recognition result. The present application is only illustrated by the above example; in practical applications, users may set the priority of each speech library in the default speech library according to their own needs, which is not limited here, so that each user can have a personalized speech library recognition order.
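The threshold check and the priority walk through the default libraries described above might look like the following sketch. The threshold value and the library functions (each returning a text plus a recognition rate) are assumptions for illustration only.

```python
def recognize_with_default_fallback(message, target_lib, default_libs,
                                    threshold=0.7):
    """If the target library's recognition rate falls below `threshold`,
    walk the default libraries in their priority order (e.g. user ->
    standard -> English); if none meets the threshold, keep the best
    result seen across all libraries tried."""
    text, rate = target_lib(message)
    if rate >= threshold:
        return text
    candidates = [(rate, text)]
    for lib in default_libs:          # priority order
        text, rate = lib(message)
        if rate >= threshold:
            return text
        candidates.append((rate, text))
    return max(candidates)[1]         # best result among all tried
```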
As can be seen from the above, in the solution provided by this embodiment, the speech library used for speech recognition is determined according to the first target voice segment of the target voice information, rather than a fixed speech library being used, which helps to improve the accuracy of speech recognition.
The above voice information conversion method is further described in detail below through a specific example.
Suppose a Chinese user from Sichuan is having a voice conversation with a Japanese user. The Sichuan user sends, in a Sichuan accent, a voice message saying roughly "the shoes feel quite good to wear and fit physical engineering, very good" (in the Sichuan accent, the word for "shoes" sounds like the Mandarin word for "child"). After receiving the voice message, the Japanese user selects voice conversion, and the default language of the terminal used by the Japanese user is Japanese.
When recognizing "the shoes feel quite good to wear and fit physical engineering", it is found that the recognition rate is highest when the user speech library (Sichuan accent) is used, so this part of the voice is considered to be Chinese, and the Chinese recognition result obtained is "the shoes feel quite good to wear and fit physical engineering". Considering that the Japanese user may not understand it, English and Japanese recognition results may be provided as well. Specifically, the Chinese recognition result may first be converted into an English recognition result, and the English recognition result then converted into a Japanese recognition result; of course, the English and Japanese recognition results may also be obtained directly from the Chinese recognition result.
When recognizing "very good", it is found that the recognition rate is highest when the English speech library is used, so this part of the voice is considered to be English. The English recognition result "very good" can be obtained directly, and the Chinese recognition result ("fine") and the Japanese recognition result are then obtained from the English recognition result.
Finally, the Japanese recognition result, the English recognition result, and the Chinese recognition result are provided to the Japanese user together.
In addition, when the voice information contains Japanese speech, the recognition rate is higher when the Japanese speech library is used, so it can be determined that this part of the voice corresponds to Japanese; the Japanese recognition result is obtained directly and provided to the Japanese user, without conversion between other languages.
In a specific implementation of the present application, referring to Fig. 4, a flow diagram of a fourth voice information conversion method is provided. Compared with the previous embodiment, in this embodiment, the voice information conversion method further comprises:
S103: sending the text conversion information to the first client, wherein the first client is the client that sent the target voice information.
S104: receiving the correction information for the text conversion information sent by the first client, and updating the text conversion information according to the correction information.
As described above, the execution subject of this embodiment may be a server or a client.
For a server, updating the text conversion information according to the correction information sent by the first client helps other clients that request information conversion to obtain a correct conversion result. In addition, when the server's speech library set contains user speech libraries, the server may also update the user speech library according to the correction information, which helps to improve the accuracy of subsequent speech recognition.
It is worth noting that, to improve the security of users' communication information, the server may store the voice information, text information, and so on exchanged between the first client and the second client only for a period of time, and delete them once the set time is reached. As for the user speech library, it stores only the word segment information determined from the user's voice information, not the user's complete voice information, so the security of the user's communication information can be well guaranteed.
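The time-limited storage policy can be sketched as a small store that purges messages older than a set TTL. The class and its interface are illustrative, not part of the specification; real deployments would persist and purge on the server side.

```python
import time

class MessageStore:
    """Messages between two clients are kept only for `ttl` seconds;
    after that they are deleted. (The user speech library would store
    only word-segment info, never the full audio.)"""
    def __init__(self, ttl):
        self.ttl = ttl
        self.messages = []            # list of (timestamp, payload)

    def add(self, payload, now=None):
        self.messages.append((now if now is not None else time.time(),
                              payload))

    def purge(self, now=None):
        """Drop every message older than the TTL."""
        now = now if now is not None else time.time()
        self.messages = [(t, p) for t, p in self.messages
                         if now - t < self.ttl]
```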
For a client, obtaining the correction information from the first client allows the client's recognition result to be corrected, so that the user sees an accurate recognition result, which helps to improve the user experience. In addition, when the client's speech library set also contains a user speech library, the client may update the user speech library according to the correction information, which likewise helps to improve the accuracy of subsequent speech recognition.
In a specific implementation of the present application, in the case where the electronic device is a server, the voice information conversion method may further comprise:
judging whether the text conversion information before updating has been sent to other clients; if so, sending a correction prompt to the second client that has received the text conversion information, and determining, according to the second client's feedback on the correction prompt, whether to send the updated text conversion information to the second client.
Specifically, if the second client's feedback indicates that no update according to the correction information is needed, the server does not need to send the updated text conversion information to the second client; if the second client does need the update, the server can directly send the updated text conversion information to the second client, and the second client displays the updated text conversion information to the user.
As described above, speech recognition can be performed using a user speech library. Thus, when performing speech recognition directly on the target voice information, a first user speech library corresponding to the first user may be used to perform speech recognition on the target voice information directly, wherein the first user is the user who sent the target voice information;
alternatively, when performing speech recognition on the target voice information, the first user speech library is used to perform the speech recognition.
Since the speech recognition uses the user speech library corresponding to the first user, and the user speech library may contain errors, to ensure high accuracy when the user speech library is subsequently reused for speech recognition, the first user speech library may also be further updated according to the correction information after the correction information is received.
As can be seen from the above, in the solution provided by this embodiment, after the text conversion information is obtained, it is sent to the first client that sent the target voice information, the correction information sent by the first client is received, and the text conversion information is updated according to the correction information, which makes the text conversion information of the target voice information more accurate.
In a specific implementation of the present application, referring to Fig. 5, a flow diagram of a fifth voice information conversion method is provided. Compared with the previous embodiment, in this embodiment, performing speech recognition on the target voice information to obtain text conversion information when the information conversion condition is met, so that the client displays the text conversion information based on the display position of the target voice information, comprises:
S102H: when the information conversion condition is met, obtaining the frequencies of the audio frames contained in the target voice information.
S102I: dividing the target voice information into at least one audio section according to the obtained frequencies.
It should be understood that the target voice information may contain voice information from multiple users, and the audio frequencies of different users generally differ. When performing speech recognition on the target voice information, the target voice information can therefore first be divided into multiple audio sections according to the frequencies of the audio frames. Specifically, audio frames whose frequencies fall within a certain frequency range can be grouped into one audio section; one frequency range can be understood as corresponding to one user, and different frequency ranges correspond to different users.
Specifically, adjacent audio frames whose frequencies fall within one frequency range can be grouped into one audio section, and adjacent audio frames whose frequencies fall within another frequency range grouped into another audio section.
S102J: based on the frequency ranges of the divided audio sections, selecting a corresponding speech library for each frequency range from the speech library set, thereby determining the speech library corresponding to each audio section.
Suppose one frequency range corresponds to multiple audio sections; then a speech library can be selected from the speech library set for just one of those audio sections, and the selected speech library then used as the corresponding speech library of each of those audio sections when performing speech recognition.
S102K: performing speech recognition on each audio section using the speech library corresponding to that audio section.
Specifically, for each audio section, if the recognition result obtained with its corresponding speech library does not meet the requirement, other speech libraries can be further selected for recognition, for example, the speech library determined from the group information, a common speech library, a standard speech library, and so on, which is not limited in the present application.
S102L: obtaining the text conversion information of the target voice information according to the recognition results of the audio sections.
As can be seen from the above, in the solution provided by this embodiment, the target voice information is divided into different audio sections according to frequency, and a speech library is then selected for each audio section to perform speech recognition. In this way the speech library can be determined according to the specifics of each audio section, thereby obtaining a better speech recognition result.
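Step S102I's grouping of adjacent frames by frequency range can be sketched as follows, assuming each audio frame has already been reduced to a single dominant-frequency value in Hz. The band boundaries are illustrative.

```python
def split_by_frequency(frames, bands):
    """Group consecutive audio frames whose dominant frequency falls in
    the same band (S102I); each band is assumed to correspond to one
    speaker. `frames` is a list of dominant frequencies in Hz, `bands`
    a list of (low, high) ranges. Returns [band_index, [frequencies]]
    sections in order."""
    def band_of(freq):
        for i, (lo, hi) in enumerate(bands):
            if lo <= freq < hi:
                return i
        return None  # frame outside every known band

    sections = []
    for freq in frames:
        b = band_of(freq)
        if sections and sections[-1][0] == b:
            sections[-1][1].append(freq)   # same band as previous frame
        else:
            sections.append([b, [freq]])   # new section starts
    return sections
```

Note how a speaker returning later (the final 125 Hz frame in the test) opens a new section, matching the "adjacent frames" wording of the text.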
In a specific implementation of the present application, referring to Fig. 6, a flow diagram of a sixth voice information conversion method is provided. Compared with the previous embodiment, in this embodiment, the voice information conversion method further comprises:
S105: receiving a meeting summary generation instruction.
S106: obtaining the text information, and the text conversion information corresponding to the voice information, for generating the meeting summary.
It should be noted that the speech recognition for the voice information may already have been completed before the corresponding text conversion information is obtained, in which case the text conversion information obtained after speech recognition can be obtained directly; if the speech recognition for the voice information has not yet been completed, speech recognition can be performed first, and the text conversion information then obtained from the recognition result.
S107: generating the meeting summary according to the obtained text information and the obtained text conversion information, in a preset meeting summary format.
The preset meeting summary format may include information such as the meeting time, the meeting duration, the participants, the speakers, the meeting minutes, and the meeting keywords.
Specifically, the meeting time and the meeting duration can be determined from the earliest and latest sending times of the text information and voice information used to generate the meeting summary.
The participants can be determined from information such as the users included in the group.
The meeting keywords can be determined by performing keyword extraction on the obtained text information and the obtained text conversion information. The extracted keywords may be numerous; a certain number of keywords can be selected as the final meeting keywords according to rules such as sorting by frequency of occurrence, and the extracted keywords can also be filtered according to a preset filtering rule, for example, filtering out function words such as "的" and "了", with the filtered keywords determined as the final meeting keywords.
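The keyword extraction just described (a frequency sort plus a stop-word filter) can be sketched as follows. The English stop list stands in for the function-word filter in the text, and the sort order is assumed to be most-frequent-first.

```python
from collections import Counter

# Analogue of filtering out function words such as "的" / "了".
STOPWORDS = {"the", "of", "and", "a", "to"}

def meeting_keywords(texts, count=3, stopwords=STOPWORDS):
    """Extract candidate keywords from the meeting's text information and
    text conversion information, filter them against a preset stop list,
    and keep the `count` most frequent as the final meeting keywords."""
    words = Counter()
    for text in texts:
        for w in text.lower().split():
            if w not in stopwords:
                words[w] += 1
    return [w for w, _ in words.most_common(count)]
```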
Specifically, generating the meeting summary according to the obtained text information and the obtained text conversion information, in the preset meeting summary format, comprises:
determining the user names and IP addresses of the users who sent the obtained text information and the obtained text conversion information;
determining the speakers according to the determined user names and IP addresses, and the frequencies of the audio frames in the voice information used to generate the meeting summary;
generating the meeting summary according to the obtained text information, the obtained text conversion information, and the speakers, in the preset meeting summary format.
The meeting summary may be in pure-text form or in multimedia form, that is, it may contain information such as voice information, pictures, video, and text.
Specifically, in one implementation of the present application, in addition to generating the meeting summary, a voice backup and/or a text backup may be generated along with the meeting summary, to facilitate later proofreading by staff.
Furthermore, correction information for the meeting summary may also be received, and the meeting summary updated according to the correction information.
As can be seen from the above, in the solution provided by this embodiment, there is no need for staff to manually edit and generate the meeting summary after the meeting ends, which relieves the workload of staff and improves work efficiency.
Fig. 7 is a flow diagram of an information generating method provided by an embodiment of the present application. The method comprises:
S701: receiving the target voice information sent by the source client.
S702: performing speech recognition on the target voice information according to a first user speech library corresponding to a first user of the target voice information.
S703: sending the recognition result to the source client.
S704: receiving the correction information for the recognition result sent by the source client.
S705: updating the first user speech library according to the correction information.
Specifically, the initial speech library of the first user speech library may be a preset standard speech library.
As can be seen from the above, in the solution provided by this embodiment, the target voice information sent by the source client is received and speech recognition is performed on it; the recognition result is then sent back to the source client, the source client corrects the recognition result, and the speech library is updated according to the correction information sent by the source client. In this way a personalized speech library can be generated for the user, which helps to obtain accurate recognition results when the user speech library is later used to perform speech recognition on the user's voice information.
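The S701–S705 loop, in which corrections from the source client refine a per-user library seeded from a standard library, can be sketched as follows. The mapping-based `UserSpeechLibrary` is a deliberately simplified stand-in for a real acoustic/language model.

```python
class UserSpeechLibrary:
    """Hypothetical per-user library: maps audio segments to text and is
    refined from the corrections the source client sends back
    (S704/S705). Seeded from a preset standard mapping, per the text."""
    def __init__(self, standard):
        self.mapping = dict(standard)     # start from the standard library

    def recognize(self, segment):
        return self.mapping.get(segment, "<unknown>")

    def apply_correction(self, segment, corrected_text):
        # S705: the user's correction overrides the seeded entry.
        self.mapping[segment] = corrected_text
```

After a few correction rounds, the library reflects the user's own pronunciation (e.g. the Sichuan-accented "shoes" example earlier).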
Corresponding to the above voice information conversion method, an embodiment of the present application provides a voice information conversion device. Fig. 8 is a structural schematic diagram of a voice information conversion device provided by an embodiment of the present application. The device is applied to an electronic device and comprises:
an information receiving module 801, configured to receive target voice information;
a speech recognition module 802, configured to perform speech recognition on the target voice information to obtain text conversion information when the information conversion condition is met, so that the client displays the text conversion information based on the display position of the target voice information.
Specifically, the speech recognition module 802 may be specifically configured to perform voice word segmentation on the target voice information to obtain the voice word segments contained in the target voice information, and, for each obtained voice word segment, select a speech library from the speech library set according to the preset speech library selection rule and perform speech recognition on the voice word segment with the selected speech library to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information.
Specifically, the preset speech library selection rule may be determined according to at least one of the following pieces of information:
the category of the target group to which the first user who sent the target voice information belongs;
the language type of the target group name;
the user attribute information of the first user;
the geographical location of the source client that sent the target voice information;
among the voice word segments corresponding to the text information and/or voice information of the first user stored before the target voice information was received, the speech libraries to which the first preset number of voice word segments ranked highest by frequency of occurrence belong;
among the voice word segments corresponding to the text information and/or voice information of the target group stored before the target voice information was received, the speech libraries to which the second preset number of voice word segments ranked highest by frequency of occurrence belong;
a speech library selection order set by the user.
Specifically, the speech recognition module 802 comprises:
a voice segment obtaining submodule, configured to obtain the first target voice segment of the target voice information according to a preset voice segment determination rule;
a first speech recognition submodule, configured to perform speech recognition on the target voice segment using each speech library in the speech library set in turn;
a speech library determination submodule, configured to determine the speech library with the highest recognition rate as the target speech library;
a second speech recognition submodule, configured to perform speech recognition, using the target speech library, on the part of the target voice information other than the target voice segment, to obtain a first recognition result;
a first information obtaining submodule, configured to obtain the text conversion information of the target voice information according to the first recognition result and a second recognition result, wherein the second recognition result is the result of performing speech recognition on the target voice segment using the target speech library.
Specifically, the information obtaining submodule may comprise:
a recognition rate calculation unit, configured to obtain a first recognition rate for the target speech library according to the first recognition result and the second recognition result;
a recognition rate judging unit, configured to judge whether the first recognition rate is less than a preset recognition rate threshold;
an information obtaining unit, configured to, when the judgment result of the recognition rate judging unit is yes, perform speech recognition on the target voice information using a default speech library, and obtain the text conversion information of the target voice information according to the recognition result for the default speech library.
Specifically, the speech recognition module 802 is specifically configured to, after the target voice information is received, directly perform speech recognition on the target voice information to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information; or
is specifically configured to monitor whether an information conversion instruction for the target voice information is received, and if so, perform speech recognition on the target voice information to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information.
Specifically, the device may further comprise:
a result sending module, configured to send the text conversion information to the first client, wherein the first client is the client that sent the target voice information;
a result updating module, configured to receive the correction information for the text conversion information sent by the first client, and update the text conversion information according to the correction information.
Specifically, in the case where the electronic device is a server, the device may further comprise:
a result judging module, configured to judge whether the text conversion information before updating has been sent to other clients;
a prompt information sending module, configured to, when the judgment result of the result judging module is yes, send a correction prompt to the second client that has received the text conversion information, and determine, according to the second client's feedback on the correction prompt, whether to send the updated text conversion information to the second client.
Specifically, the speech recognition module 802 is specifically configured to directly perform speech recognition on the target voice information using a first user speech library corresponding to a first user to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information, wherein the first user is the user who sent the target voice information; or
the speech recognition module 802 is specifically configured to perform speech recognition on the target voice information using the first user speech library to obtain text conversion information, so that the client displays the text conversion information based on the display position of the target voice information;
the device may further comprise:
a speech library updating module, configured to update the first user speech library according to the correction information.
Specifically, the speech recognition module 802 may comprise:
a frequency obtaining submodule, configured to obtain, when the information conversion condition is met, the frequencies of the audio frames contained in the target voice information;
an audio section dividing submodule, configured to divide the target voice information into at least one audio section according to the obtained frequencies;
a speech library selection submodule, configured to select, from the speech library set, a corresponding speech library for each frequency range based on the frequency ranges of the divided audio sections, thereby determining the speech library corresponding to each audio section;
a third speech recognition submodule, configured to perform speech recognition on each audio section using the speech library corresponding to that audio section;
a second information obtaining submodule, configured to obtain the text conversion information of the target voice information according to the recognition results of the audio sections.
Specifically, the voice information conversion device may further include:
an instruction receiving module, configured to receive a meeting summary generation instruction;
an information obtaining module, configured to obtain the text information and the text conversion information corresponding to the voice information used for generating a meeting summary;
a summary generation module, configured to generate the meeting summary from the obtained text information and text conversion information according to a preset meeting summary format.
Specifically, the summary generation module may include:
an information determining submodule, configured to determine the user names and IP addresses of the users who sent the obtained text information and text conversion information;
a speaker determining submodule, configured to determine the speakers according to the determined user names and IP addresses and the frequency of each audio frame in the voice information used for generating the meeting summary;
a summary generating submodule, configured to generate the meeting summary from the obtained text information, the obtained text conversion information, and the determined speakers, according to the preset meeting summary format.
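The summary assembly described above can be sketched minimally as follows. The summary format, the speaker identifiers (user name plus IP address), and the entry structure are assumptions; the patent leaves the preset meeting summary format unspecified.

```python
# Hedged sketch: assemble a meeting summary from chronological entries, where
# each entry pairs a determined speaker with typed text or voice-derived text.
# The "name@ip" speaker format and the layout are illustrative only.

def generate_summary(entries, title="Meeting Summary"):
    """entries: list of (speaker, text) tuples in chronological order."""
    lines = [title, "=" * len(title)]
    for speaker, text in entries:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

entries = [("alice@192.168.1.2", "Let's review Q3."),
           ("bob@192.168.1.7", "Revenue grew 12%.")]
print(generate_summary(entries))
```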
As can be seen from the above, in the solutions provided by the above embodiments, after the target voice information is received and the information conversion condition is met, speech recognition is performed on the target voice information to obtain text conversion information, so that the client can display the text conversion information based on the display location of the target voice information. The solutions provided by the embodiments of the present application can therefore convert voice information into text information.
Corresponding to the information generation method described above, an embodiment of the present application further provides an information generation device.
Fig. 9 is a structural schematic diagram of an information generation device provided by an embodiment of the present application. The device includes:
an information receiving module 901, configured to receive target voice information sent by a source client;
a speech recognition module 902, configured to perform speech recognition on the target voice information according to the first user speech library of the first user corresponding to the target voice information;
a result sending module 903, configured to send the recognition result to the source client;
a correction information receiving module 904, configured to receive correction information for the recognition result sent by the source client;
a speech library update module 905, configured to update the first user speech library according to the correction information.
Specifically, the initial speech library of the first user speech library is a preset standard speech library.
As can be seen from the above, in the solution provided by this embodiment, the target voice information sent by the source client is received and speech recognition is performed on it; the recognition result is then sent to the source client, which corrects it; the speech library is then updated according to the correction information sent by the source client. In this way a personalized speech library can be generated for each user, which helps obtain accurate recognition results when that user's voice information is later recognized with this user speech library.
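The per-user feedback loop just described can be sketched as follows. The "library" here is deliberately simplified to a word-substitution map seeded from a shared standard library; a real system would adapt acoustic or language models instead, and all names below are hypothetical.

```python
# Minimal sketch of the personalized speech-library feedback loop: recognize,
# send the result to the source client, receive a correction, update the
# per-user library so future recognitions improve.

STANDARD_LIBRARY = {}  # preset standard library shared by all new users

class UserSpeechLibrary:
    def __init__(self):
        self.corrections = dict(STANDARD_LIBRARY)

    def recognize(self, raw_transcript):
        # Apply learned per-user word corrections to a baseline transcript.
        return " ".join(self.corrections.get(w, w) for w in raw_transcript.split())

    def update(self, recognized, corrected):
        # Learn word-level fixes from the client's correction information.
        for a, b in zip(recognized.split(), corrected.split()):
            if a != b:
                self.corrections[a] = b

lib = UserSpeechLibrary()
first = lib.recognize("recognise the speach")   # result sent to source client
lib.update(first, "recognise the speech")        # client returns a correction
second = lib.recognize("check the speach")       # learned fix now applied
```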
Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for related details, refer to the corresponding parts of the method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
Those of ordinary skill in the art will appreciate that all or part of the steps in the above method embodiments can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its protection scope.
Claims (17)
1. A voice information conversion method, applied to an electronic device, characterized in that the method includes:
receiving target voice information;
when an information conversion condition is met, performing speech recognition on the target voice information to obtain text conversion information, so that a client displays the text conversion information based on the display location of the target voice information;
wherein the step of performing speech recognition on the target voice information to obtain the text conversion information includes:
obtaining a first target voice segment of the target voice information according to a preset voice segment determination rule; performing speech recognition on the target voice segment using each speech library in a speech library set; determining the speech library with the highest recognition rate as a target speech library; performing speech recognition on the part of the target voice information other than the target voice segment using the target speech library to obtain a first recognition result; and obtaining the text conversion information of the target voice information according to the first recognition result and a second recognition result, wherein the second recognition result is the result of performing speech recognition on the target voice segment using the target speech library.
2. The method according to claim 1, characterized in that obtaining the text conversion information of the target voice information according to the first recognition result and the second recognition result includes:
obtaining a first recognition rate for the target speech library according to the first recognition result and the second recognition result;
judging whether the first recognition rate is less than a preset recognition rate threshold;
if so, performing speech recognition on the target voice information using a preset default speech library, and obtaining the text conversion information of the target voice information according to the recognition result of the preset default speech library.
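Claim 2's fallback check reduces to a simple threshold comparison. The threshold value 0.8 and the `default_recognize` stub below are illustrative assumptions; the patent only says the threshold is preset.

```python
# Hedged sketch of claim 2: when the chosen target library's recognition rate
# falls below a preset threshold, discard its result and re-recognize the
# audio with the preset default speech library.

def final_text(target_rate, target_text, default_recognize, audio, threshold=0.8):
    if target_rate < threshold:
        return default_recognize(audio)  # fall back to the default library
    return target_text

kept = final_text(0.9, "target result", lambda a: "default result", "audio")
replaced = final_text(0.5, "target result", lambda a: "default result", "audio")
```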
3. The method according to claim 1, characterized in that performing speech recognition on the target voice information to obtain the text conversion information when the information conversion condition is met includes:
after the target voice information is received, directly performing speech recognition on the target voice information to obtain the text conversion information; or
monitoring whether an information conversion instruction for the target voice information is received, and if so, performing speech recognition on the target voice information to obtain the text conversion information.
4. The method according to claim 3, characterized in that the method further includes:
sending the text conversion information to a first client, wherein the first client is the client that sent the target voice information;
receiving correction information for the text conversion information sent by the first client, and updating the text conversion information according to the correction information.
5. The method according to claim 4, characterized in that, when the electronic device is a server, the method further includes:
judging whether the text conversion information before the update has been sent to other clients;
if so, sending a correction prompt to each second client that has received the text conversion information, and determining, according to the second client's feedback on the correction prompt, whether to send the updated text conversion information to the second client.
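The server-side propagation in claim 5 can be sketched as follows. The client identifiers and the `ask_client` feedback function are hypothetical stand-ins for the correction prompt and its feedback result.

```python
# Sketch of claim 5: for every client that already received the pre-update
# text, send a correction prompt and, depending on that client's feedback,
# decide whether to deliver the updated text conversion information.

def propagate_correction(sent_to, updated_text, ask_client):
    """sent_to: clients that already received the old text.
    ask_client(client) -> True if the client wants the corrected text."""
    delivered = []
    for client in sent_to:
        if ask_client(client):  # prompt the client and read its feedback
            delivered.append((client, updated_text))
    return delivered

result = propagate_correction(["c1", "c2"], "fixed text",
                              ask_client=lambda c: c == "c1")
# Only c1 opted in, so only c1 receives the updated text.
```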
6. The method according to claim 4, characterized in that:
directly performing speech recognition on the target voice information to obtain the text conversion information includes: directly performing speech recognition on the target voice information using a first user speech library corresponding to a first user to obtain the text conversion information, wherein the first user is the user who sent the target voice information;
performing speech recognition on the target voice information to obtain the text conversion information includes: performing speech recognition on the target voice information using the first user speech library to obtain the text conversion information;
and the method further includes: updating the first user speech library according to the correction information.
7. The method according to any one of claims 1-6, characterized in that the method further includes:
receiving a meeting summary generation instruction;
obtaining the text information and the text conversion information corresponding to the voice information used for generating a meeting summary;
generating the meeting summary from the obtained text information and text conversion information according to a preset meeting summary format.
8. The method according to claim 7, characterized in that generating the meeting summary from the obtained text information and text conversion information according to the preset meeting summary format includes:
determining the user names and IP addresses of the users who sent the obtained text information and text conversion information;
determining speakers according to the determined user names and IP addresses and the frequency of each audio frame in the voice information used for generating the meeting summary;
generating the meeting summary from the obtained text information, the obtained text conversion information, and the speakers, according to the preset meeting summary format.
9. The method according to any one of claims 1-6, characterized in that the client displaying the text conversion information based on the display location of the target voice information includes:
the client displaying the text conversion information in a preset direction relative to the display location of the target voice information, in the page displaying the target voice information.
10. A voice information conversion device, applied to an electronic device, characterized in that the device includes:
an information receiving module, configured to receive target voice information;
a speech recognition module, configured to perform speech recognition on the target voice information to obtain text conversion information when an information conversion condition is met, so that a client displays the text conversion information based on the display location of the target voice information;
wherein the speech recognition module includes: a voice segment obtaining submodule, configured to obtain a first target voice segment of the target voice information according to a preset voice segment determination rule; a first speech recognition submodule, configured to perform speech recognition on the target voice segment using each speech library in a speech library set; a speech library determining submodule, configured to determine the speech library with the highest recognition rate as a target speech library; a second speech recognition submodule, configured to perform speech recognition on the part of the target voice information other than the target voice segment using the target speech library to obtain a first recognition result; and a first information obtaining submodule, configured to obtain the text conversion information of the target voice information according to the first recognition result and a second recognition result, wherein the second recognition result is the result of performing speech recognition on the target voice segment using the target speech library.
11. The device according to claim 10, characterized in that the first information obtaining submodule includes:
a recognition rate calculating unit, configured to obtain a first recognition rate for the target speech library according to the first recognition result and the second recognition result;
a recognition rate judging unit, configured to judge whether the first recognition rate is less than a preset recognition rate threshold;
an information obtaining unit, configured to, when the judgment result of the recognition rate judging unit is yes, perform speech recognition on the target voice information using a preset default speech library, and obtain the text conversion information of the target voice information according to the recognition result of the preset default speech library.
12. The device according to claim 10, characterized in that:
the speech recognition module is specifically configured to, after the target voice information is received, directly perform speech recognition on the target voice information to obtain the text conversion information, so that the client displays the text conversion information based on the display location of the target voice information;
or is specifically configured to monitor whether an information conversion instruction for the target voice information is received, and if so, perform speech recognition on the target voice information to obtain the text conversion information, so that the client displays the text conversion information based on the display location of the target voice information.
13. The device according to claim 12, characterized in that the device further includes:
a result sending module, configured to send the text conversion information to a first client, wherein the first client is the client that sent the target voice information;
a result update module, configured to receive correction information for the text conversion information sent by the first client, and update the text conversion information according to the correction information.
14. The device according to claim 13, characterized in that, when the electronic device is a server, the device further includes:
a result judgment module, configured to judge whether the text conversion information before the update has been sent to other clients;
a prompt sending module, configured to, when the judgment result of the result judgment module is yes, send a correction prompt to each second client that has received the text conversion information, and determine, according to the second client's feedback on the correction prompt, whether to send the updated text conversion information to the second client.
15. The device according to claim 13, characterized in that:
the speech recognition module is specifically configured to directly perform speech recognition on the target voice information using a first user speech library corresponding to a first user to obtain the text conversion information, so that the client displays the text conversion information based on the display location of the target voice information, wherein the first user is the user who sent the target voice information; or
the speech recognition module is specifically configured to perform speech recognition on the target voice information using the first user speech library to obtain the text conversion information, so that the client displays the text conversion information based on the display location of the target voice information;
and the device further includes: a speech library update module, configured to update the first user speech library according to the correction information.
16. The device according to any one of claims 10-15, characterized in that the device further includes:
an instruction receiving module, configured to receive a meeting summary generation instruction;
an information obtaining module, configured to obtain the text information and the text conversion information corresponding to the voice information used for generating a meeting summary;
a summary generation module, configured to generate the meeting summary from the obtained text information and text conversion information according to a preset meeting summary format.
17. The device according to claim 16, characterized in that the summary generation module includes:
an information determining submodule, configured to determine the user names and IP addresses of the users who sent the obtained text information and text conversion information;
a speaker determining submodule, configured to determine speakers according to the determined user names and IP addresses and the frequency of each audio frame in the voice information used for generating the meeting summary;
a summary generating submodule, configured to generate the meeting summary from the obtained text information, the obtained text conversion information, and the speakers, according to the preset meeting summary format.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610801720.7A CN106384593B (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
CN201910311351.7A CN110060687A (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610801720.7A CN106384593B (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910311351.7A Division CN110060687A (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106384593A CN106384593A (en) | 2017-02-08 |
CN106384593B true CN106384593B (en) | 2019-11-01 |
Family
ID=57938973
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610801720.7A Active CN106384593B (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
CN201910311351.7A Pending CN110060687A (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910311351.7A Pending CN110060687A (en) | 2016-09-05 | 2016-09-05 | A kind of conversion of voice messaging, information generating method and device |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN106384593B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107147564A (en) * | 2017-05-09 | 2017-09-08 | 胡巨鹏 | Real-time speech recognition error correction system and identification error correction method based on cloud server |
CN107316637A (en) * | 2017-05-31 | 2017-11-03 | 广东欧珀移动通信有限公司 | Audio recognition method and Related product |
CN106991961A (en) * | 2017-06-08 | 2017-07-28 | 无锡职业技术学院 | A kind of artificial intelligence LED dot matrix display screens control device and its control method |
CN107342088B (en) * | 2017-06-19 | 2021-05-18 | 联想(北京)有限公司 | Method, device and equipment for converting voice information |
CN107436748B (en) * | 2017-07-13 | 2020-06-30 | 普联技术有限公司 | Method and device for processing third-party application message, terminal equipment and readable medium |
CN107704447A (en) * | 2017-08-23 | 2018-02-16 | 海信集团有限公司 | A kind of Chinese word cutting method, Chinese word segmentation device and terminal |
CN107465834A (en) * | 2017-09-07 | 2017-12-12 | 深圳支点电子智能科技有限公司 | Mobile terminal and Related product with adaptive identifying ability |
CN107731229B (en) * | 2017-09-29 | 2021-06-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for recognizing speech |
CN109587429A (en) * | 2017-09-29 | 2019-04-05 | 北京国双科技有限公司 | Audio-frequency processing method and device |
CN107995101B (en) * | 2017-11-30 | 2021-03-23 | 上海掌门科技有限公司 | Method and equipment for converting voice message into text message |
CN108428446B (en) * | 2018-03-06 | 2020-12-25 | 北京百度网讯科技有限公司 | Speech recognition method and device |
CN108831473B (en) * | 2018-03-30 | 2021-08-17 | 联想(北京)有限公司 | Audio processing method and device |
CN110392158A (en) * | 2018-04-19 | 2019-10-29 | 成都野望数码科技有限公司 | A kind of message treatment method, device and terminal device |
CN108647267A (en) * | 2018-04-28 | 2018-10-12 | 广东金贝贝智能机器人研究院有限公司 | One kind being based on internet big data robot Internet of things system |
CN108682420B (en) * | 2018-05-14 | 2023-07-07 | 平安科技(深圳)有限公司 | Audio and video call dialect recognition method and terminal equipment |
CN109039509A (en) * | 2018-07-16 | 2018-12-18 | 广州辉群智能科技有限公司 | A kind of method and broadcasting equipment of voice control broadcasting equipment |
CN109237740A (en) * | 2018-07-31 | 2019-01-18 | 珠海格力电器股份有限公司 | A kind of control method of electric appliance, device, storage medium and electric appliance |
CN109036406A (en) * | 2018-08-01 | 2018-12-18 | 深圳创维-Rgb电子有限公司 | A kind of processing method of voice messaging, device, equipment and storage medium |
CN109036410A (en) * | 2018-08-30 | 2018-12-18 | Oppo广东移动通信有限公司 | Audio recognition method, device, storage medium and terminal |
CN109036424A (en) * | 2018-08-30 | 2018-12-18 | 出门问问信息科技有限公司 | Audio recognition method, device, electronic equipment and computer readable storage medium |
CN109600299B (en) * | 2018-11-19 | 2021-06-25 | 维沃移动通信有限公司 | Message sending method and terminal |
CN111277589A (en) * | 2020-01-19 | 2020-06-12 | 腾讯云计算(北京)有限责任公司 | Conference document generation method and device |
CN111756930A (en) * | 2020-06-28 | 2020-10-09 | 维沃移动通信有限公司 | Communication control method, communication control device, electronic apparatus, and readable storage medium |
CN111816183A (en) * | 2020-07-15 | 2020-10-23 | 前海人寿保险股份有限公司 | Voice recognition method, device and equipment based on audio and video recording and storage medium |
CN112417095A (en) * | 2020-11-17 | 2021-02-26 | 维沃软件技术有限公司 | Voice message processing method and device |
CN112672213A (en) * | 2020-12-18 | 2021-04-16 | 努比亚技术有限公司 | Video information processing method and device and computer readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5555320A (en) * | 1992-11-27 | 1996-09-10 | Kabushiki Kaisha Toshiba | Pattern recognition system with improved recognition rate using nonlinear transformation |
CN1737902A (en) * | 2005-09-12 | 2006-02-22 | 周运南 | Text-to-speech interchanging device |
CN101377726A (en) * | 2007-08-31 | 2009-03-04 | 西门子(中国)有限公司 | Input method combining speech recognition with stroke recognition and terminal thereof |
CN101453499A (en) * | 2008-11-07 | 2009-06-10 | 康佳集团股份有限公司 | Mobile phone syllable conversion device and method thereof |
CN101923858A (en) * | 2009-06-17 | 2010-12-22 | 劳英杰 | Real-time and synchronous mutual translation voice terminal |
CN202026434U (en) * | 2011-04-29 | 2011-11-02 | 广东九联科技股份有限公司 | Voice conversion STB (set top box) |
CN102436812A (en) * | 2011-11-01 | 2012-05-02 | 展讯通信(上海)有限公司 | Conference recording device and conference recording method using same |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923854B (en) * | 2010-08-31 | 2012-03-28 | 中国科学院计算技术研究所 | Interactive speech recognition system and method |
CN102572372B (en) * | 2011-12-28 | 2018-10-16 | 中兴通讯股份有限公司 | The extracting method and device of meeting summary |
CN102750949B (en) * | 2012-07-16 | 2015-04-01 | 深圳市车音网科技有限公司 | Voice recognition method and device |
CN103680498A (en) * | 2012-09-26 | 2014-03-26 | 华为技术有限公司 | Speech recognition method and speech recognition equipment |
CN103903611B (en) * | 2012-12-24 | 2018-07-03 | 联想(北京)有限公司 | A kind of recognition methods of voice messaging and equipment |
CN103903621A (en) * | 2012-12-26 | 2014-07-02 | 联想(北京)有限公司 | Method for voice recognition and electronic equipment |
CN103106900B (en) * | 2013-02-28 | 2016-05-04 | 用友网络科技股份有限公司 | Speech recognition equipment and audio recognition method |
CN103489444A (en) * | 2013-09-30 | 2014-01-01 | 乐视致新电子科技(天津)有限公司 | Speech recognition method and device |
CN104916283A (en) * | 2015-06-11 | 2015-09-16 | 百度在线网络技术(北京)有限公司 | Voice recognition method and device |
CN105096940B (en) * | 2015-06-30 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for carrying out speech recognition |
CN105244026B (en) * | 2015-08-24 | 2019-09-20 | 北京意匠文枢科技有限公司 | A kind of method of speech processing and device |
CN105913845A (en) * | 2016-04-26 | 2016-08-31 | 惠州Tcl移动通信有限公司 | Mobile terminal voice recognition and subtitle generation method and system and mobile terminal |
- 2016
  - 2016-09-05 CN CN201610801720.7A patent/CN106384593B/en active Active
  - 2016-09-05 CN CN201910311351.7A patent/CN110060687A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN106384593A (en) | 2017-02-08 |
CN110060687A (en) | 2019-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106384593B (en) | A kind of conversion of voice messaging, information generating method and device | |
US11315546B2 (en) | Computerized system and method for formatted transcription of multimedia content | |
US9070369B2 (en) | Real time generation of audio content summaries | |
CN106331893B (en) | Real-time caption presentation method and system | |
CN105895103B (en) | Voice recognition method and device | |
CN105719649B (en) | Audio recognition method and device | |
US8515728B2 (en) | Language translation of visual and audio input | |
US20180307667A1 (en) | Travel guide generating method and system | |
WO2021129439A1 (en) | Voice recognition method and related product | |
US20160189713A1 (en) | Apparatus and method for automatically creating and recording minutes of meeting | |
CN107102990A (en) | The method and apparatus translated to voice | |
CN111128183B (en) | Speech recognition method, apparatus and medium | |
US20160189107A1 (en) | Apparatus and method for automatically creating and recording minutes of meeting | |
CN112399269B (en) | Video segmentation method, device, equipment and storage medium | |
US20160189103A1 (en) | Apparatus and method for automatically creating and recording minutes of meeting | |
US10089898B2 (en) | Information processing device, control method therefor, and computer program | |
JP2015212732A (en) | Sound metaphor recognition device and program | |
CN112738557A (en) | Video processing method and device | |
US20190213998A1 (en) | Method and device for processing data visualization information | |
CN110781346A (en) | News production method, system, device and storage medium based on virtual image | |
JP7107229B2 (en) | Information processing device, information processing method, and program | |
CN105161112B (en) | Audio recognition method and device | |
CN113409791A (en) | Voice recognition processing method and device, electronic equipment and storage medium | |
CN116958342A (en) | Method for generating actions of virtual image, method and device for constructing action library | |
CN113539234B (en) | Speech synthesis method, device, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |