CN112242142B - Voice recognition input method and related device - Google Patents

Voice recognition input method and related device

Info

Publication number
CN112242142B
CN112242142B (application CN201910647006.0A)
Authority
CN
China
Prior art keywords
input
target
user
voice recognition
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910647006.0A
Other languages
Chinese (zh)
Other versions
CN112242142A (en)
Inventor
王丹
崔欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201910647006.0A
Publication of CN112242142A
Application granted
Publication of CN112242142B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Abstract

The application discloses a voice recognition input method and a related device. The method comprises: learning the input habits and/or user information of a target user to obtain personalized data of the target user; after the target user performs voice input, obtaining the input voice data and a user identifier; looking up the personalized data of the target user according to the user identifier; and recognizing the input voice data in combination with the personalized data to obtain a target speech recognition result. Because the personalized data obtained by learning the target user's personalized input habits and/or user information assist the speech recognition input, the target speech recognition result conforms to the target user's input habits and user information, which reduces the modification cost of voice recognition input and improves both its effect and the user experience.

Description

Voice recognition input method and related device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for voice recognition input.
Background
With the development of speech technology, speech recognition is being applied in more and more fields, including the input-method field, where voice recognition input has gradually become an important input mode. The typical workflow of voice recognition input is: after a user performs voice input, speech recognition technology recognizes the user's input voice data to obtain a speech recognition result, which is then displayed as the input content.
However, the inventors found that different users differ in their input habits and user information, and the speech recognition result obtained by this workflow cannot accommodate those differences. After a result is obtained, users therefore still need to modify it according to their own input habits and information, which increases the cost of voice recognition input, makes its effect less than ideal, and degrades the user experience.
Disclosure of Invention
The technical problem addressed by the present application is to provide a voice recognition input method and a related device whose results conform to the input habits and user information of a target user, thereby reducing the modification cost of voice recognition input and improving its effect and the user experience.
In a first aspect, embodiments of the present application provide a method for speech recognition input, the method comprising:
obtaining input voice data and user identification of a target user;
acquiring personalized data of the target user according to the user identifier, wherein the personalized data are obtained by learning the input habits and/or user information of the target user;
and recognizing the input voice data in combination with the personalized data to obtain a target speech recognition result.
Optionally, the input habits include input behavior and/or historical input, and the user information includes user portrait information.
Optionally, the recognizing the input voice data in combination with the personalized data to obtain the target speech recognition result includes:
recognizing the input voice data to obtain a voice recognition result;
and obtaining the target voice recognition result based on the voice recognition result and the personalized data.
Optionally, the recognizing the input voice data to obtain a speech recognition result is specifically: recognizing the input voice data to obtain a plurality of speech recognition results;
correspondingly, the obtaining the target voice recognition result based on the voice recognition result and the personalized data comprises the following steps:
obtaining an acoustic model score and a language model score for each of the speech recognition results;
and based on a plurality of the voice recognition results, combining the personalized data, the acoustic model score and the language model score to obtain the target voice recognition result.
Optionally, the obtaining the target speech recognition result based on the plurality of speech recognition results in combination with the personalized data, the acoustic model score, and the language model score includes:
Obtaining a plurality of updated speech recognition results based on a plurality of the speech recognition results in combination with the personalized data;
obtaining a language model score of each updated speech recognition result as a target language model score;
and based on a plurality of updated voice recognition results, combining the acoustic model score and the target language model score to obtain the target voice recognition result.
Optionally, the recognizing the input voice data in combination with the personalized data to obtain the target speech recognition result is specifically:
and recognizing the input voice data by combining the personalized data to directly obtain the target voice recognition result.
Optionally, after the target voice recognition result is obtained, the method further includes:
and displaying the target voice recognition result to the target user.
In a second aspect, embodiments of the present application provide a device for speech recognition input, the device comprising:
a first obtaining unit, configured to obtain input voice data of a target user and a user identifier;
a second obtaining unit, configured to obtain personalized data of the target user according to the user identifier, wherein the personalized data are obtained by learning the input habits and/or user information of the target user;
and a third obtaining unit, configured to recognize the input voice data in combination with the personalized data to obtain a target speech recognition result.
Optionally, the input habits include input behavior and/or historical input, and the user information includes user portrait information.
Optionally, the third obtaining unit includes:
a recognition subunit, configured to recognize the input voice data to obtain a voice recognition result;
a first obtaining subunit, configured to obtain the target speech recognition result based on the speech recognition result in combination with the personalized data.
Optionally, the recognition subunit is specifically configured to: recognize the input voice data to obtain a plurality of speech recognition results;
correspondingly, the first obtaining subunit includes:
a first obtaining module, configured to obtain an acoustic model score and a language model score of each of the speech recognition results;
and the second obtaining module is used for obtaining the target voice recognition result by combining the personalized data, the acoustic model score and the language model score based on a plurality of voice recognition results.
Optionally, the second obtaining module includes:
the first obtaining submodule is used for obtaining a plurality of updated voice recognition results based on the voice recognition results and the personalized data;
A second obtaining sub-module, configured to obtain a language model score of each of the updated speech recognition results as a target language model score;
and the third obtaining sub-module is used for obtaining the target voice recognition result by combining the acoustic model score and the target language model score based on a plurality of updated voice recognition results.
Optionally, the third obtaining unit is specifically configured to:
and recognizing the input voice data by combining the personalized data to directly obtain the target voice recognition result.
Optionally, the third obtaining unit further includes:
and the display unit is used for displaying the target voice recognition result to the target user.
In a third aspect, embodiments of the present application provide an apparatus for speech recognition input, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring input voice data and user identification of a target user;
acquiring personalized data of the target user according to the user identifier, wherein the personalized data are obtained by learning the input habits and/or user information of the target user;
and recognizing the input voice data in combination with the personalized data to obtain a target speech recognition result.
In a fourth aspect, embodiments of the present application provide a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a method of speech recognition input as described in one or more of the first aspects above.
Compared with the prior art, the application has at least the following advantages:
by adopting the technical solution of the embodiments of the present application, the input habits and/or user information of the target user are learned to obtain personalized data of the target user; after the target user performs voice input, the input voice data and a user identifier are obtained; the personalized data of the target user are looked up according to the user identifier; and the input voice data are recognized in combination with the personalized data to obtain a target speech recognition result. Because the personalized data obtained by learning the target user's personalized input habits and/or user information assist the speech recognition input, the target speech recognition result conforms to the target user's input habits and user information, which reduces the modification cost of voice recognition input and improves both its effect and the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of a system framework for an application scenario in an embodiment of the present application;
FIG. 2 is a flowchart of a method for speech recognition input according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a device for voice recognition input according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for speech recognition input according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Different users differ in input habits, user information, and the like; that is, each user has personalized input habits and user information. The current approach to voice recognition input directly recognizes the user's input voice data with speech recognition technology and displays the resulting speech recognition result to the user. A result obtained this way does not conform to the user's personalized input habits and user information; in other words, the current approach does not support the differences between users. Consequently, after the speech recognition result is obtained, it still needs to be modified based on each user's input habits, user information, and the like, which increases the cost of voice recognition input, makes its effect less than ideal, and degrades the user experience.
To solve this problem, in the embodiments of the present application, the input habits and/or user information of the target user are learned to obtain personalized data of the target user; after the target user performs voice input, the input voice data and a user identifier are obtained; the personalized data of the target user are looked up according to the user identifier; and the input voice data are recognized in combination with the personalized data to obtain a target speech recognition result. Because the personalized data assist the speech recognition input, the target speech recognition result conforms to the target user's input habits and user information, which reduces the modification cost of voice recognition input and improves both its effect and the user experience.
For example, the embodiments of the present application may be applied to the scenario shown in fig. 1, which includes a client 101 and a processor 102, both running on a user terminal 100. The target user performs voice input through the client 101; the client 101 sends the target user's input voice data and user identifier to the processor 102; and the processor 102 performs voice recognition input using the method of the embodiments of the present application to obtain a target speech recognition result, which is displayed to the target user through the client 101.
It will be appreciated that, although in the above application scenario the actions of the embodiments of the present application are described as being performed by the processor 102, these actions may also be performed by the client 101, or partly by the client 101 and partly by the processor 102. The embodiments of the present application do not limit the execution subject, as long as the actions disclosed in the embodiments are performed.
It is understood that the above scenario is only one example of a scenario provided in the embodiments of the present application, and the embodiments of the present application are not limited to this scenario.
Specific implementation manners of the voice recognition input method and the related device in the embodiments of the present application are described in detail below by way of embodiments with reference to the accompanying drawings.
Exemplary method
Referring to fig. 2, a flow chart of a method for speech recognition input in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the steps of:
step 201: input voice data and user identification of a target user are obtained.
It will be appreciated that any user may be the target user, and that voice recognition input presupposes that the target user performs voice input through the client so that the processor obtains the input voice data. Because the target user has personalized input habits and user information, in order for the speech recognition result obtained by voice recognition input to conform to them, a user identifier that uniquely identifies the target user must also be obtained, so that the target user's personalized input habits and user information can be determined from it.
Step 202: and obtaining personalized data of the target user according to the user identifier, wherein the personalized data is obtained by learning the input habit and/or user information of the target user.
It should be noted that, because each target user has personalized input habits and user information, these differ between target users; the existing voice recognition input approach does not support these differences, and its speech recognition results do not conform to them. Learning the input habits and/or user information of the target user is therefore considered, in order to obtain personalized data of the target user that can assist speech recognition input. The input habits may be the target user's input behavior, the target user's historical input, or a combination of the two; the user information may be the target user's user portrait information. Thus, in some implementations of the embodiments of the present application, the input habits include input behavior and/or historical input, and the user information includes user portrait information. For example, the input behavior may be the target user's modification behavior after the existing voice recognition input approach produces a speech recognition result; the historical input may be historical user words, address-book words, and the like; the user portrait information may be the user's age, gender, language, location, and so on. The embodiments of the present application do not limit the input mode through which the input behavior, historical input, and user portrait information are obtained; it may be keyboard input, voice input, and the like. It should be noted that the information (including but not limited to user equipment information, user information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in the present application are authorized by the user or fully authorized by all parties, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
As an example, the target user voices the pronunciation "wochuqule" through the client, and the existing voice recognition input approach produces the speech recognition result "I went out.". The target user modifies this result to "I went out", deleting the ending period. The target user then voices the pronunciation "haoea", the existing approach produces "So hungry.", and the target user modifies it to "So hungry". By learning the target user's modification behavior on these speech recognition results, the personalized data of the target user are obtained as the rule "delete the ending punctuation of the speech recognition result". As another example, the target user inputs "Heyang Road" through the client; by learning this historical user word, the personalized data of the target user are obtained as "Heyang Road". As yet another example, the target user enters "He Huaxin" in the address book through the client; by learning this address-book word, the personalized data of the target user are obtained as "He Huaxin".
It will be appreciated that the learned personalized data of the target user are stored in correspondence with the target user's user identifier. Therefore, after the user identifier of the target user is obtained in step 201, the personalized data of the target user are looked up according to that identifier, which is step 202.
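For concreteness, the learning and lookup described in steps 201 and 202 may be organized as in the following minimal Python sketch. All names here (PersonalizedData, PersonalizationStore, learn_from_modification, the user identifier "user-42") are illustrative assumptions rather than part of the patent; the sketch learns the "delete end punctuation" rule from one observed modification and stores personalized data keyed by the user identifier.

    # Minimal sketch of steps 201-202: an in-memory personalized-data store
    # keyed by user identifier. All names are illustrative assumptions.
    from dataclasses import dataclass, field

    END_PUNCTUATION = ("。", "！", "？", ".", "!", "?")

    @dataclass
    class PersonalizedData:
        rules: set = field(default_factory=set)          # learned modification rules
        user_words: set = field(default_factory=set)     # historical user words
        contact_words: set = field(default_factory=set)  # address-book words

    class PersonalizationStore:
        def __init__(self):
            self._by_user = {}  # user identifier -> PersonalizedData

        def data_for(self, user_id: str) -> PersonalizedData:
            """Step 202: look up personalized data by the user identifier."""
            return self._by_user.setdefault(user_id, PersonalizedData())

        def learn_from_modification(self, user_id: str, asr_result: str, user_edit: str):
            """Learn an input-behavior rule from how the user edited an ASR result."""
            data = self.data_for(user_id)
            # The patent's example: the user strips the final punctuation mark,
            # so the rule "delete end punctuation" is learned.
            if asr_result.endswith(END_PUNCTUATION) and user_edit == asr_result[:-1]:
                data.rules.add("delete_end_punctuation")

        def learn_user_word(self, user_id: str, word: str):
            self.data_for(user_id).user_words.add(word)

        def learn_contact_word(self, user_id: str, word: str):
            self.data_for(user_id).contact_words.add(word)

    store = PersonalizationStore()
    store.learn_from_modification("user-42", "I went out.", "I went out")
    store.learn_user_word("user-42", "Heyang Road")
    store.learn_contact_word("user-42", "He Huaxin")
    print(store.data_for("user-42").rules)  # {'delete_end_punctuation'}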
Step 203: and combining the personalized data to identify the input voice data to obtain a target voice identification result.
It can be understood that, in order to avoid the problems of existing speech recognition input, after the personalized data of the target user are obtained in step 202, they are combined as auxiliary information when the target user's input voice data are recognized, so that a speech recognition result conforming to the target user's personalized input habits and user information can be obtained; this result is recorded as the target speech recognition result.
It should be noted that, in the embodiments of the present application, step 203 may adopt at least the following two implementations:
In a first optional implementation of step 203, when the target user's input voice data are recognized, the target user's personalized data are combined directly during recognition, so that a target speech recognition result conforming to the target user's personalized input habits and user information is obtained directly. Thus, in an optional implementation of the embodiments of the present application, step 203 may specifically be: recognizing the input voice data in combination with the personalized data to directly obtain the target speech recognition result.
As an example, the target user voices the pronunciation "jintianzaodian" through the client, so the input voice data are the pronunciation "jintianzaodian"; assume the personalized data of the target user are the rule "delete the ending punctuation of the speech recognition result". Recognizing the input voice data in combination with these personalized data directly yields the target speech recognition result "today's early hours", with the ending period already removed.
As another example, the target user voices the pronunciation "heyanglu" through the client, so the input voice data are the pronunciation "heyanglu"; assume the personalized data of the target user are the historical user word "Heyang Road". Recognizing the input voice data "heyanglu" in combination with this historical user word directly yields the target speech recognition result "Heyang Road" written as the user intends it.
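Continuing the store sketch above, a hedged sketch of this first implementation follows. A real decoder is replaced by a toy decode() over pre-scored candidates, and the biasing scheme (a fixed score bonus for each personalized word found in a candidate) is an assumption, since the patent does not specify how the personalized data enter decoding.

    # Toy sketch of the first implementation of step 203: personalized data
    # bias the choice among decoder candidates, so the target result is
    # produced directly. Reuses store/END_PUNCTUATION from the sketch above;
    # the bonus value and candidate scores are illustrative assumptions.
    def decode(candidates, personalized):
        """candidates: list of (transcript, base_score); lower scores win."""
        def biased_score(item):
            transcript, score = item
            bonus = sum(1.0 for w in personalized.user_words | personalized.contact_words
                        if w in transcript)
            return score - bonus  # personalized words make a candidate cheaper

        best, _ = min(candidates, key=biased_score)
        # Apply the learned input-behavior rule as part of producing the output.
        if "delete_end_punctuation" in personalized.rules and best.endswith(END_PUNCTUATION):
            best = best[:-1]
        return best

    # The raw homophone candidate has a slightly better base score, but the
    # user's historical word "Heyang Road" wins after biasing.
    candidates = [("River-yang Road", 4.1), ("Heyang Road", 4.3)]
    print(decode(candidates, store.data_for("user-42")))  # -> Heyang Road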
In a second optional implementation of step 203, when the target user's input voice data are recognized, a speech recognition result that does not yet conform to the target user's personalized input habits and user information is first obtained using the existing recognition approach, and the personalized data are then combined with that result to obtain the target speech recognition result. Thus, in another optional implementation of the embodiments of the present application, step 203 may include, for example, the following steps:
Step A: recognizing the input voice data to obtain a voice recognition result;
Step B: obtaining the target speech recognition result based on the speech recognition result in combination with the personalized data.
As an example, the target user voices the pronunciation "jintianzaodianxiaban" through the client, so the input voice data are the pronunciation "jintianzaodianxiaban"; assume the personalized data of the target user are the rule "delete the ending punctuation of the speech recognition result". First, recognizing the pronunciation yields the speech recognition result "Leave work early today."; then, based on this result and the personalized rule, the ending period is deleted to obtain the target speech recognition result "Leave work early today".
As another example, the target user voices the pronunciation "heyanglu" through the client, so the input voice data are the pronunciation "heyanglu"; assume the personalized data of the target user are the historical user word "Heyang Road". First, recognizing the pronunciation "heyanglu" yields the speech recognition result "Heyang Road" written with the homophonous character for "river"; then, based on this result and the historical user word, the target speech recognition result "Heyang Road" as the user writes it is obtained.
It should be noted that applying existing speech recognition to the target user's input voice data may produce a plurality of different speech recognition results. The target speech recognition result may be obtained directly from these results in combination with the personalized data; alternatively, a scoring mechanism may be introduced to help judge the recognition accuracy of each result, with the acoustic model score and the language model score being of primary concern. In the latter case, the acoustic model score and language model score of each speech recognition result must first be determined, and the target speech recognition result is then obtained from the plurality of results in combination with the personalized data and both scores. Thus, in some implementations of the embodiments of the present application, step A is specifically: recognizing the input voice data to obtain a plurality of speech recognition results; correspondingly, step B may include, for example, the following steps:
step B1: obtaining an acoustic model score and a language model score for each of the speech recognition results;
Step B2: and based on a plurality of the voice recognition results, combining the personalized data, the acoustic model score and the language model score to obtain the target voice recognition result.
In implementing step B2, a plurality of updated speech recognition results are first obtained by combining the personalized data of the target user with the plurality of speech recognition results. Because updating a result changes its language model score, the language model score of each updated result must be obtained anew as the target language model score; finally, the target speech recognition result is obtained from the plurality of updated results in combination with the acoustic model scores and the target language model scores. Thus, in some implementations of the embodiments of the present application, step B2 may include, for example, the following steps:
step B21: obtaining a plurality of updated speech recognition results based on a plurality of the speech recognition results in combination with the personalized data;
step B22: obtaining a language model score of each updated speech recognition result as a target language model score;
Step B23: and based on a plurality of updated voice recognition results, combining the acoustic model score and the target language model score to obtain the target voice recognition result.
As an example, the target user voices the pronunciation "woshuohehuaxin" through the client, so the input voice data are the pronunciation "woshuohehuaxin"; assume the personalized data of the target user are the address-book word "He Huaxin" and the rule "delete the ending punctuation of the speech recognition result". Recognizing the input voice data yields the plurality of speech recognition results shown in Table 1 below, where each row represents one result; the acoustic model score (AM) and language model score (LM) of each result are obtained, see the AM and LM data in Table 1. Combining the personalized data "He Huaxin" and "delete the ending punctuation" with these results yields the plurality of updated speech recognition results shown in Table 2 below, where each row represents one updated result; the target language model score (LM_C) of each updated result is obtained, see the LM_C data in Table 2. Under the calculation used for the acoustic model score (AM) and target language model score (LM_C) in Table 2, the smaller the sum of AM and LM_C, the more accurate the updated result; combining AM and LM_C over the plurality of updated results therefore yields the target speech recognition result "I said He Huaxin".
Table 1: the plurality of speech recognition results with their acoustic model scores (AM) and language model scores (LM)
Table 2: the plurality of updated speech recognition results with their acoustic model scores (AM) and target language model scores (LM_C)
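The Table 1/Table 2 walkthrough corresponds to steps B1 through B23 and to the selection rule that the smallest AM + LM_C sum wins. The following self-contained Python sketch illustrates that procedure; rescore(), toy_lm(), the corrections mapping, and all scores are assumptions for illustration, and a real system would compute LM_C with an actual language model.

    # Hedged sketch of steps B1-B23. AM and LM behave like negative log
    # scores here, so a smaller AM + LM_C sum indicates a better result.
    END_PUNCTUATION = ("。", "！", "？", ".", "!", "?")

    def rescore(candidates, corrections, rules, lm_score):
        """candidates: list of (transcript, am_score, lm_score) triples.

        B21: rewrite each candidate using the personalized data.
        B22: rescore the rewritten text -> target language model score LM_C.
        B23: return the candidate minimizing AM + LM_C.
        """
        updated = []
        for text, am, _lm in candidates:
            for wrong, right in corrections.items():   # e.g. homophone fixes
                text = text.replace(wrong, right)
            if "delete_end_punctuation" in rules and text.endswith(END_PUNCTUATION):
                text = text[:-1]
            updated.append((text, am, lm_score(text)))  # lm_score(text) is LM_C
        return min(updated, key=lambda t: t[1] + t[2])[0]

    def toy_lm(text):
        """Stand-in language model: shorter text gets a lower (better) score."""
        return 0.1 * len(text)

    # Toy usage mirroring the "woshuohehuaxin" example: the address-book word
    # "He Huaxin" corrects a homophone candidate, and the trailing period is cut.
    candidates = [("I said He Huaxin.", 3.2, 1.8), ("I said He Huaxing.", 3.1, 1.9)]
    corrections = {"He Huaxing": "He Huaxin"}
    print(rescore(candidates, corrections, {"delete_end_punctuation"}, toy_lm))
    # -> "I said He Huaxin"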
It should be further noted that, in order for the target user to see the target speech recognition result obtained by voice recognition input, the result needs to be displayed to the target user after step 203. Thus, in some implementations of the embodiments of the present application, after step 203, the method may further include the step of: displaying the target speech recognition result to the target user.
In implementation, for the first optional implementation of step 203, since the target speech recognition result is obtained directly, it may be displayed directly to the target user. For the second optional implementation, since an intermediate speech recognition result is obtained before the target result, either the target result may be displayed directly, or the intermediate result may be displayed first and then replaced by the target result; the specific behavior is determined by the display strategy.
Through the various implementations provided by this embodiment, the input habits and/or user information of the target user are learned to obtain personalized data of the target user; after the target user performs voice input, the input voice data and a user identifier are obtained; the personalized data of the target user are looked up according to the user identifier; and the input voice data are recognized in combination with the personalized data to obtain a target speech recognition result. Because the personalized data assist the speech recognition input, the target speech recognition result conforms to the target user's input habits and user information, which reduces the modification cost of voice recognition input and improves both its effect and the user experience.
Exemplary apparatus
Referring to fig. 3, a schematic structural diagram of a device for speech recognition input in an embodiment of the present application is shown. In this embodiment, the apparatus may specifically include, for example:
a first obtaining unit 301, configured to obtain input voice data of a target user and a user identifier;
a second obtaining unit 302, configured to obtain, according to the user identifier, personalized data of the target user, where the personalized data is obtained by learning input habits and/or user information of the target user;
a third obtaining unit 303, configured to recognize the input voice data in combination with the personalized data to obtain a target speech recognition result.
In an optional implementation of the embodiments of the present application, the input habits include input behavior and/or historical input, and the user information includes user portrait information.
In an optional implementation manner of the embodiment of the present application, the third obtaining unit 303 includes:
a recognition subunit, configured to recognize the input voice data to obtain a voice recognition result;
a first obtaining subunit, configured to obtain the target speech recognition result based on the speech recognition result in combination with the personalized data.
In an optional implementation of the embodiments of the present application, the recognition subunit is specifically configured to: recognize the input voice data to obtain a plurality of speech recognition results;
correspondingly, the first obtaining subunit includes:
a first obtaining module, configured to obtain an acoustic model score and a language model score of each of the speech recognition results;
and the second obtaining module is used for obtaining the target voice recognition result by combining the personalized data, the acoustic model score and the language model score based on a plurality of voice recognition results.
In an optional implementation manner of the embodiment of the present application, the second obtaining module includes:
the first obtaining submodule is used for obtaining a plurality of updated voice recognition results based on the voice recognition results and the personalized data;
a second obtaining sub-module, configured to obtain a language model score of each of the updated speech recognition results as a target language model score;
and the third obtaining sub-module is used for obtaining the target voice recognition result by combining the acoustic model score and the target language model score based on a plurality of updated voice recognition results.
In an optional implementation manner of the embodiment of the present application, the third obtaining unit 303 is specifically configured to:
and recognizing the input voice data by combining the personalized data to directly obtain the target voice recognition result.
In an optional implementation manner of the embodiment of the present application, the third obtaining unit 303 further includes:
and the display unit is used for displaying the target voice recognition result to the target user.
Through the various implementations provided by this embodiment, the input habits and/or user information of the target user are learned to obtain personalized data of the target user; the first obtaining unit obtains the input voice data and a user identifier after the target user performs voice input; the second obtaining unit looks up the personalized data of the target user according to the user identifier; and the third obtaining unit recognizes the input voice data in combination with the personalized data to obtain a target speech recognition result. Because the personalized data assist the speech recognition input, the target speech recognition result conforms to the target user's input habits and user information, which reduces the modification cost of voice recognition input and improves both its effect and the user experience.
Fig. 4 is a block diagram illustrating an apparatus 400 for speech recognition according to an example embodiment. For example, apparatus 400 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power supply component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
Memory 404 is configured to store various types of data to support operations at device 400. Examples of such data include instructions for any application or method operating on the apparatus 400, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 406 provides power to the various components of the apparatus 400. The power supply components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with it. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operating mode, such as a shooting mode or a video mode. Each front or rear camera may be a fixed optical lens system or have focal-length and optical-zoom capabilities.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 further includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 414 includes one or more sensors for providing status assessments of various aspects of the apparatus 400. For example, the sensor assembly 414 may detect the on/off state of the apparatus 400 and the relative positioning of components such as its display and keypad, and may also detect a change in position of the apparatus 400 or one of its components, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in its temperature. The sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 404, including instructions executable by processor 420 of apparatus 400 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium whose instructions, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a method of voice recognition input, the method comprising:
acquiring input voice data and user identification of a target user;
acquiring personalized data of the target user according to the user identifier, wherein the personalized data are obtained by learning the input habits and/or user information of the target user;
and recognizing the input voice data in combination with the personalized data to obtain a target speech recognition result.
Fig. 5 is a schematic structural diagram of a server in an embodiment of the present application. The server 500 may vary considerably in configuration or performance and may include one or more central processing units (CPU) 522 (e.g., one or more processors), memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 542 or data 544. The memory 532 and the storage medium 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts between the embodiments, reference may be made to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. The terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the present application in any way. Although the present application has been disclosed above with reference to preferred embodiments, they are not intended to limit it. Any person skilled in the art may, using the methods and technical content disclosed above, make many possible variations and modifications to the technical solution of the present application, or modify it into equivalent embodiments, without departing from the scope of the technical solution. Therefore, any simple modification, equivalent variation, or adaptation of the above embodiments made according to the technical substance of the present application, without departing from the content of its technical solution, still falls within the protection scope of the technical solution of the present application.

Claims (6)

1. A method of speech recognition input, comprising:
obtaining input voice data and user identification of a target user;
obtaining personalized data of the target user according to a correspondence between the user identifier and the personalized data of the target user, wherein the personalized data are obtained by learning the input habits of the target user; the input habits comprise input behavior and historical input, and user information comprises user portrait information; the input behavior comprises: modification behavior of the target user on a speech recognition result after the speech recognition result is obtained;
recognizing the input voice data to obtain a plurality of voice recognition results;
obtaining an acoustic model score and a language model score for each of the speech recognition results;
obtaining a plurality of updated speech recognition results based on a plurality of the speech recognition results in combination with the personalized data;
obtaining a language model score of each updated speech recognition result as a target language model score;
and based on a plurality of updated voice recognition results, combining the acoustic model score and the target language model score to obtain the target voice recognition result.
2. The method of claim 1, further comprising, after said obtaining the target speech recognition result:
and displaying the target voice recognition result to the target user.
3. An apparatus for speech recognition input, comprising:
the first obtaining unit is used for obtaining input voice data of a target user and a user identifier;
a second obtaining unit, configured to obtain personalized data of the target user according to a correspondence between the user identifier and the personalized data of the target user, wherein the personalized data are obtained by learning the input habits of the target user, the input habits comprise input behavior and historical input, and user information comprises user portrait information; the input behavior comprises: modification behavior of the target user on a speech recognition result after the speech recognition result is obtained;
a third obtaining unit for obtaining a target voice recognition result by recognizing the input voice data in combination with the personalized data;
the third obtaining unit includes:
a recognition subunit, configured to recognize the input voice data to obtain a voice recognition result;
a first obtaining subunit, configured to obtain the target speech recognition result based on the speech recognition result in combination with the personalized data;
wherein the recognition subunit is specifically configured to: recognize the input voice data to obtain a plurality of speech recognition results;
correspondingly, the first obtaining subunit includes:
a first obtaining module, configured to obtain an acoustic model score and a language model score of each of the speech recognition results;
a second obtaining module, configured to obtain the target speech recognition result based on a plurality of the speech recognition results in combination with the personalized data, the acoustic model score, and the language model score, where the second obtaining module includes:
the first obtaining submodule is used for obtaining a plurality of updated voice recognition results based on the voice recognition results and the personalized data;
a second obtaining sub-module, configured to obtain a language model score of each of the updated speech recognition results as a target language model score;
and the third obtaining sub-module is used for obtaining the target voice recognition result by combining the acoustic model score and the target language model score based on a plurality of updated voice recognition results.
4. The device according to claim 3, wherein the third obtaining unit further comprises:
And the display unit is used for displaying the target voice recognition result to the target user.
5. An apparatus for speech recognition input, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring input voice data and user identification of a target user;
obtaining personalized data of the target user according to a correspondence between the user identifier and the personalized data of the target user, wherein the personalized data are obtained by learning the input habits of the target user; the input habits comprise input behavior and historical input, and user information comprises user portrait information; the input behavior comprises: modification behavior of the target user on a speech recognition result after the speech recognition result is obtained;
recognizing the input voice data to obtain a plurality of voice recognition results;
obtaining an acoustic model score and a language model score for each of the speech recognition results;
Obtaining a plurality of updated speech recognition results based on a plurality of the speech recognition results in combination with the personalized data;
obtaining a language model score of each updated speech recognition result as a target language model score;
and based on a plurality of updated voice recognition results, combining the acoustic model score and the target language model score to obtain the target voice recognition result.
6. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of speech recognition input of any of claims 1 to 2.
CN201910647006.0A 2019-07-17 2019-07-17 Voice recognition input method and related device Active CN112242142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910647006.0A CN112242142B (en) 2019-07-17 2019-07-17 Voice recognition input method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910647006.0A CN112242142B (en) 2019-07-17 2019-07-17 Voice recognition input method and related device

Publications (2)

Publication Number Publication Date
CN112242142A CN112242142A (en) 2021-01-19
CN112242142B true CN112242142B (en) 2024-01-30

Family

ID=74167713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910647006.0A Active CN112242142B (en) 2019-07-17 2019-07-17 Voice recognition input method and related device

Country Status (1)

Country Link
CN (1) CN112242142B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103035240A (en) * 2011-09-28 2013-04-10 苹果公司 Speech recognition repair using contextual information
CN103699530A (en) * 2012-09-27 2014-04-02 百度在线网络技术(北京)有限公司 Method and equipment for inputting texts in target application according to voice input information
CN104346160A (en) * 2013-08-09 2015-02-11 联想(北京)有限公司 Method for processing information and electronic equipment
WO2016011159A1 (en) * 2014-07-15 2016-01-21 JIBO, Inc. Apparatus and methods for providing a persistent companion device
CN106653007A (en) * 2016-12-05 2017-05-10 苏州奇梦者网络科技有限公司 Speech recognition system
CN107544726A (en) * 2017-07-04 2018-01-05 百度在线网络技术(北京)有限公司 Method for correcting error of voice identification result, device and storage medium based on artificial intelligence
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
CN109243430A (en) * 2017-07-04 2019-01-18 北京搜狗科技发展有限公司 A kind of audio recognition method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7865362B2 (en) * 2005-02-04 2011-01-04 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8019777B2 (en) * 2006-03-16 2011-09-13 Nexify, Inc. Digital content personalization method and system
US8260809B2 (en) * 2007-06-28 2012-09-04 Microsoft Corporation Voice-based search processing
US8589157B2 (en) * 2008-12-05 2013-11-19 Microsoft Corporation Replying to text messages via automated voice search techniques
US8990085B2 (en) * 2009-09-30 2015-03-24 At&T Intellectual Property I, L.P. System and method for handling repeat queries due to wrong ASR output by modifying an acoustic, a language and a semantic model
US20130103628A1 (en) * 2011-10-20 2013-04-25 Sidebar, Inc. User activity dashboard for depicting behaviors and tuning personalized content guidance
KR102413282B1 (en) * 2017-08-14 2022-06-27 삼성전자주식회사 Method for performing personalized speech recognition and user terminal and server performing the same


Also Published As

Publication number Publication date
CN112242142A (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN105489220B (en) Voice recognition method and device
EP3171270A1 (en) Method and device for information push
CN109961791B (en) Voice information processing method and device and electronic equipment
US20170032287A1 (en) Method and device for providing ticket information
EP3958110A1 (en) Speech control method and apparatus, terminal device, and storage medium
US9959487B2 (en) Method and device for adding font
US11335348B2 (en) Input method, device, apparatus, and storage medium
US20160314164A1 (en) Methods and devices for sharing cloud-based business card
CN108304078B (en) Input method and device and electronic equipment
CN106657543B (en) Voice information processing method and device
CN111177521A (en) Method and device for determining query term classification model
CN109408796B (en) Information processing method and device and electronic equipment
CN112784151B (en) Method and related device for determining recommended information
CN109901726B (en) Candidate word generation method and device and candidate word generation device
CN109799916B (en) Candidate item association method and device
CN112242142B (en) Voice recognition input method and related device
CN108108356B (en) Character translation method, device and equipment
CN109725736B (en) Candidate sorting method and device and electronic equipment
CN110213062B (en) Method and device for processing message
CN113946228A (en) Statement recommendation method and device, electronic equipment and readable storage medium
CN112331194A (en) Input method and device and electronic equipment
CN109151155B (en) Communication processing method, device and machine readable medium
CN112990240B (en) Method and related device for determining vehicle type
CN109408623B (en) Information processing method and device
CN113127613B (en) Chat information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant