CN112951232A

CN112951232A - Voice input method, device, equipment and computer readable storage medium

Info

Publication number: CN112951232A
Application number: CN202110232353.4A
Authority: CN
Inventors: 张学明
Original assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Current assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2021-06-11

Abstract

The invention discloses a voice input method, which comprises the following steps: when detecting that the information input function is started, starting a voice input function and acquiring first voice information; converting the first voice information into text information, and determining a focus position corresponding to the information input function; and outputting the text information to the focus position. The invention also discloses a voice input device, equipment and a computer readable storage medium. The invention realizes quick input of text information by voice and improves the efficiency of information input.

Description

Voice input method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of speech recognition technologies, and in particular, to a speech input method, apparatus, device, and computer-readable storage medium.

Background

At present, characters are usually entered into a text box on an intelligent device such as a television in the traditional way: the user operates a virtual keyboard generated by the device, either with a remote controller or by touch. Some intelligent devices provide an interaction function through which a user can connect a mobile terminal such as a mobile phone to the device, complete the character input on the mobile terminal, and synchronize the characters to the device. However, the input efficiency of these approaches is low. On devices such as televisions in particular, input with a remote controller is cumbersome; when searching for a program, the user usually has to enter the full spelling of the program name, so the information that can be input is limited; and when search content is entered repeatedly, the previous search content must be cleared manually.

Disclosure of Invention

The invention mainly aims to provide a voice input method, a voice input device, voice input equipment and a computer readable storage medium, and aims to solve the technical problem that the traditional information input mode is low in input efficiency at present.

To achieve the above object, the present invention provides a voice input method, including the steps of:

when detecting that the information input function is started, starting a voice input function and acquiring first voice information;

converting the first voice information into text information, and determining a focus position corresponding to the information input function;

and outputting the text information to the focus position.

Optionally, the step of starting the voice input function when the start of the information input function is detected includes:

when detecting that the information input function is started, outputting prompt information for starting the voice input function and acquiring a starting instruction;

and starting a voice input function according to the starting instruction.

Optionally, the step of converting the first voice information into text information includes:

analyzing the first voice information to obtain a conversion instruction corresponding to the first voice information;

and converting the first voice information into text information according to the conversion instruction.

Optionally, the step of analyzing the first voice information to obtain a conversion instruction corresponding to the first voice information includes:

extracting keyword information from the first voice information;

analyzing the keyword information to determine whether the first voice information needs to be optimized when the first voice information is converted into text information;

and if the first voice information does not need to be optimized, generating a conversion instruction for converting the first voice information into corresponding text information.

Optionally, after the step of analyzing the keyword information to determine whether the first speech information needs to be optimized when the first speech information is converted into text information, the method includes:

if the first voice information needs to be optimized, generating an information optimization instruction for optimizing the first voice information according to the keyword information;

and generating a conversion instruction for converting the first voice information into text information based on the information optimization instruction.

Optionally, after the step of outputting the text information to the focal position, the method further includes:

acquiring a change instruction of the text information;

changing the text information according to the change instruction to obtain target text information;

and outputting the target text information to the focus position.

Optionally, the change instruction includes a re-input instruction and a modification instruction, and the step of changing the text information according to the change instruction to obtain the target text information includes:

if the change instruction is a re-input instruction, acquiring second voice information and converting the second voice information to obtain target text information;

and if the change instruction is a modification instruction, modifying the text information according to the modification instruction to obtain target text information.

Further, to achieve the above object, the present invention also provides a voice input device including:

the voice input module is used for starting a voice input function and acquiring first voice information when detecting that the equipment starts the information input function;

the voice recognition module is used for converting the first voice information into text information and determining a focus position corresponding to the information input function;

and the text output module is used for outputting the text information to the focus position.

Further, to achieve the above object, the present invention also provides a voice input apparatus including: a memory, a processor, and a voice input program that is stored on the memory and can run on the processor, wherein the voice input program, when executed by the processor, implements the steps of the voice input method described above.

In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a voice input program stored thereon, which when executed by a processor, implements the steps of the voice input method as described above.

The embodiment of the invention provides a voice input method, a voice input device, voice input equipment and a computer readable storage medium. In the prior art, information is input through a virtual keyboard by using a remote controller, touch operation or the like, so the information input efficiency is low. In the embodiment of the invention, when it is detected that the information input function is started, a voice input function is started and first voice information is acquired; the first voice information is converted into text information, and a focus position corresponding to the information input function is determined; and the text information is output to the focus position. The quick input of text information is thus completed by voice, which solves the problem of slow information input in the traditional input mode and improves the information input efficiency.

Drawings

FIG. 1 is a schematic diagram of the hardware structure of an implementation of a voice input device according to an embodiment of the present invention;

FIG. 2 is a flowchart of a first embodiment of a voice input method according to the present invention;

FIG. 3 is a schematic diagram of a prompt message in the first embodiment of the voice input method according to the present invention;

FIG. 4 is a schematic diagram of another prompt message in the first embodiment of the voice input method according to the present invention;

FIG. 5 is a functional block diagram of a voice input device according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module", "component", and "unit" may be used interchangeably.

The voice input device (also called terminal, device or terminal device) in the embodiment of the invention can be a PC, and can also be a mobile terminal device with display and voice functions, such as a smart phone, a smart television, a tablet computer, a portable computer and the like.

As shown in fig. 1, the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Optionally, the terminal may further include a camera, a Radio Frequency (RF) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors may include light sensors, motion sensors, and other sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes) and can detect the magnitude and direction of gravity when the mobile terminal is stationary; it can be used for applications that recognize the attitude of the mobile terminal (such as horizontal and vertical screen switching, related games, and magnetometer attitude calibration) and for vibration recognition related functions (such as a pedometer and tapping). Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.

Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice input program.

In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to invoke a voice input program stored in the memory 1005, which when executed by the processor, implements operations in the voice input method provided by the embodiments described below.

Based on the hardware structure of the equipment, the embodiment of the voice input method is provided.

Referring to fig. 2, in a first embodiment of the voice input method of the present invention, the voice input method includes:

step S10, when detecting the start of the information input function, starting the voice input function and acquiring the first voice information;

The voice input method of the present invention is applied to an intelligent terminal device having voice and display functions, such as a television, and is described below using a smart television (a television for short) as an example. The input function of the television is monitored, and when it is detected that the user has started the information input function, the voice input function is started and the voice information input by the user is obtained. In this embodiment, the information input function started by the user may be one preset on the television or one provided by an application installed on the television, which is not specifically limited herein.

To detect the input function of the television, the focus of the television's cursor is tracked and monitored, and when the focus falls into an area with the input function, such as an input text box, the voice input function is started. The voice input function may be started in the background of the television, so that the user can choose whether to use it as needed; when the user chooses to use the voice input function, the voice information input by the user is acquired.

Step S10 is refined into steps A1-A2:

step A1, when detecting that the information input function is started, outputting prompt information for starting the voice input function and obtaining a starting instruction;

and step A2, starting the voice input function according to the starting instruction.

Further, when it is detected that the television has started the information input function, the voice input function is started in the background of the television, and prompt information for starting the voice input function is output and displayed on the display screen of the television. The prompt information may take the form of a pop-up dialog box displaying a message similar to "The voice input function has been started. Use it?", as shown in fig. 3, which is a schematic diagram of the prompt information dialog box in this embodiment. The prompt information may include selection buttons for the user; for example, as shown in fig. 3, the prompt information pops up as a dialog box with "use / do not use" buttons below the prompt content. A start instruction input by the user is then obtained, the voice input function is started in the foreground of the television according to the start instruction, and the voice input function interacts with the user to obtain the voice information input by the user. The start instruction may be triggered by the user selecting the "use" button in the prompt information with a remote controller, or it may be a voice instruction of the user; for example, when a spoken input such as "use" or "start" is obtained, the voice input function is started in the foreground of the television and the voice information input by the user is obtained.

If the start instruction obtained from the user indicates that the voice input function is not to be used, the voice input function keeps running in the background of the television, the focus of the television continues to be monitored, and the prompt information for starting the voice input function is output and displayed again when the focus is detected to fall into an area with the information input function again. When it is detected that the user has chosen not to use the voice input function several times in succession, an option for closing the prompt information, similar to "Do not prompt again during this power-on session", may be added to the next prompt information, as shown in fig. 4. That is, when the user is detected to have triggered the instruction not to use the voice input function several times in succession, an option to stop displaying the prompt information is added to the prompt information shown in fig. 3. If an instruction corresponding to that option is obtained, the focus of the television may continue to be monitored, but the prompt information for starting the voice input function is no longer output or displayed when the focus subsequently falls into an area with the input function. Because the option for closing the prompt information appears only after the prompt information has been displayed several times in succession, the user can customize this setting.

Further, after the prompt information for starting the voice input function has been closed, when the focus of the television is detected to fall into an area with the information input function, a button for starting the voice input function is displayed in the input area, for example a "voice input" start button on the input text box. When the user wants to start the voice input function in the foreground of the television, the start instruction can be triggered through the start button or through a voice instruction, so as to call up the voice input function running in the background of the television. It should be noted that the background start is driven by the cursor focus: when the focus is in an area with the information input function, for example an input text box, the voice input function is started in the background, and it is brought to the foreground according to an instruction triggered by the user. The voice instruction used for the foreground start may be preset by the television or may be customized by the user.
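The start-up and prompt behaviour described for steps A1-A2 can be sketched as a small state machine. This is only an illustrative sketch, not the patented implementation: the class name `VoiceInputController`, the choice strings, and the `decline_threshold` value are all assumptions (the description says only "several times").

```python
from dataclasses import dataclass
from enum import Enum, auto


class VoiceState(Enum):
    OFF = auto()         # voice input not started
    BACKGROUND = auto()  # started in the background, waiting for the user
    FOREGROUND = auto()  # interacting with the user and capturing speech


@dataclass
class VoiceInputController:
    # Hypothetical threshold: the description only says "several times".
    decline_threshold: int = 3
    state: VoiceState = VoiceState.OFF
    consecutive_declines: int = 0
    prompts_disabled: bool = False

    def on_focus_enter_input_area(self) -> dict:
        """Focus fell into an area with the information input function."""
        self.state = VoiceState.BACKGROUND
        if self.prompts_disabled:
            # Prompt was closed: show only a "voice input" start button.
            return {"show_prompt": False, "show_start_button": True}
        return {
            "show_prompt": True,
            "show_start_button": False,
            # Offer the "do not prompt again" option after repeated declines.
            "offer_disable_option": self.consecutive_declines >= self.decline_threshold,
        }

    def on_user_choice(self, choice: str) -> VoiceState:
        """Handle 'use', 'not_use', or 'disable_prompt' (remote or voice)."""
        if choice == "use":
            self.consecutive_declines = 0
            self.state = VoiceState.FOREGROUND
        elif choice == "not_use":
            self.consecutive_declines += 1
            self.state = VoiceState.BACKGROUND
        elif choice == "disable_prompt":
            self.prompts_disabled = True
            self.state = VoiceState.BACKGROUND
        else:
            raise ValueError(f"unknown choice: {choice!r}")
        return self.state
```

In this sketch the voice input function stays in the background whenever the user declines, which mirrors the description: the focus keeps being monitored and only the prompt behaviour changes.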

Step S20, converting the first voice information into text information, and determining a focus position corresponding to the information input function;

When it is detected that the user has started the voice input function, the voice information input by the user is acquired and converted into text information. Meanwhile, the focus of the television is monitored to determine the focus position corresponding to the currently started information input function. In this embodiment, the focus position refers to the position where the cursor of the television is focused. For example, when the user performs a search on the television, the search function provides an input text box; when the user moves the cursor into the input text box, the voice input function is triggered to start in the background, and when the acquired voice information has been converted into text information, the position where the cursor is focused is determined again so as to determine the output position of the text information.

Further, when the voice information input by the user is converted into text information, the voice information can be recognized through a voice recognition technology preset on the television to obtain the corresponding text information. To support this, when the television is started for the first time, voice information of the user can be collected, for example by having the user read aloud specific keywords or sentences displayed on the television, so as to extract the acoustic features of the user and establish an acoustic model of the user. This improves the accuracy of voice recognition and, in turn, the accuracy of conversion between voice information and text information.

Step S30, outputting the text information to the focus position.

After the position of the cursor focus, namely the focus position, is determined, the converted text information is output to the focus position and displayed to the user so that the user can carry out the next operation.
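Steps S20 and S30 can be illustrated with a minimal sketch in which the focus position is modelled as a cursor index in a text box. The `TextBox` class, the `voice_input` function, and the `fake_recognizer` stand-in are hypothetical names introduced for illustration; the description does not specify a recognition engine or a text box API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextBox:
    """Hypothetical input text box; `cursor` models the focus position."""
    content: str = ""
    cursor: int = 0

    def insert_at_focus(self, text: str) -> None:
        # Step S30: output the text at the focus position, then advance the cursor.
        self.content = self.content[:self.cursor] + text + self.content[self.cursor:]
        self.cursor += len(text)


def voice_input(audio: bytes, recognize: Callable[[bytes], str], box: TextBox) -> str:
    # Step S20: convert the first voice information into text information.
    text = recognize(audio)
    # The focus position is read at output time, because the cursor may
    # have moved while the user was speaking.
    box.insert_at_focus(text)
    return text


def fake_recognizer(audio: bytes) -> str:
    # Stand-in for the television's speech recognition engine (assumption).
    return audio.decode("utf-8")
```

For example, with a box containing "ab" and the cursor between the two letters, recognizing "123" would leave the box holding "a123b".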

After step S30, steps B1-B3 are included:

step B1, acquiring a change instruction of the text information;

step B2, changing the text information according to the change instruction to obtain target text information;

and step B3, outputting the target text information to the focus position.

Further, after the text information is output to the focus position of the cursor and displayed to the user, a change instruction of the user for the text information is acquired, and the text information is changed according to the acquired change instruction. Chinese characters are complex and varied, with many homophones and near-homophones, and program names often use homophonic puns, such as 'hip-hop' and 'unappreciable', which increases the difficulty of voice recognition. Therefore, after the text information converted from the voice information is displayed to the user, the change instruction of the user is obtained, the text information is changed according to the change instruction to obtain target text information, and the target text information is output to the focus position of the television cursor and displayed to the user.

Step B2 is refined into steps B21-B22:

step B21, if the change instruction is a re-input instruction, acquiring second voice information and converting the second voice information to obtain target text information;

and step B22, if the change instruction is a modification instruction, modifying the text information according to the modification instruction to obtain target text information.

Furthermore, the change instruction of the user for the text information comprises a re-input instruction and a modification instruction. When the change instruction triggered by the user is a re-input instruction, the voice information of the user is acquired again and converted to obtain the target text information. When the change instruction triggered by the user is a modification instruction, the output text information is modified according to the modification instruction to obtain the target text information.
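The two branches of step B2 can be sketched as a simple dispatch. The `ChangeInstruction` structure and its fields are assumptions made for illustration; the description does not say how a modification instruction is encoded, so this sketch models it as replacing one fragment of the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChangeInstruction:
    kind: str                      # "re_input" or "modify"
    audio: Optional[bytes] = None  # second voice information, for "re_input"
    old: str = ""                  # fragment to replace, for "modify"
    new: str = ""                  # replacement fragment, for "modify"


def apply_change(text: str, instr: ChangeInstruction,
                 recognize: Callable[[bytes], str]) -> str:
    """Step B2: return the target text information."""
    if instr.kind == "re_input":
        # Step B21: discard the old text and convert the second voice information.
        if instr.audio is None:
            raise ValueError("re_input requires second voice information")
        return recognize(instr.audio)
    if instr.kind == "modify":
        # Step B22: edit the existing text; here, replace the first match only.
        return text.replace(instr.old, instr.new, 1)
    raise ValueError(f"unknown change instruction: {instr.kind!r}")
```

The returned target text would then be written back to the focus position, as in step B3.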

In this embodiment, when it is detected that the information input function is started, the voice input function is started and first voice information is acquired; the first voice information is converted into text information, and a focus position corresponding to the information input function is determined; and the text information is output to the focus position. The quick input of text information is thus completed by voice, which solves the problem of slow information input in the traditional input mode and improves the information input efficiency.

Further, on the basis of the above-described embodiments of the present invention, a second embodiment of the voice input method of the present invention is proposed.

This embodiment refines step S20 of the first embodiment, and includes steps C1-C2:

step C1, analyzing the first voice information to obtain a conversion instruction corresponding to the first voice information;

and step C2, converting the first voice information into text information according to the conversion instruction.

In this embodiment, taking the television in the above embodiment as an example, when voice information input by a user is converted into text information, the voice information is first analyzed to obtain a conversion instruction corresponding to the voice information. Because language expression is diverse, the generated text information does not necessarily correspond exactly to the voice information input by the user. In the process of analyzing the voice information, the sentence pattern and grammar information of the corresponding sentence can be extracted from it, so as to predict the input intention of the user. For example, if the voice information input by the user is "input 123456", the sentence is parsed and the keyword "input" is recognized as an action command that does not need to be converted into text information. The text information in the generated conversion instruction is therefore "123456", and the finally generated and output text information is "123456".

Step C1 is refined into steps C11-C13:

step C11, extracting keyword information from the first voice information;

step C12, analyzing the keyword information to determine whether the first voice information needs to be optimized when converting the first voice information into text information;

step C13, if the first voice information does not need to be optimized, a conversion instruction for converting the first voice information into corresponding text information is generated.

Specifically, when the voice information input by the user is analyzed, keyword information is first extracted from it. Taking the above voice information "input 123456" as an example, the extracted keyword information is "input" and "text information 123456". The extracted keyword information is then analyzed to determine whether the voice information needs to be optimized when it is converted into text information. If no optimization is needed, a conversion instruction corresponding to the voice information input by the user is generated directly.
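The "input 123456" example can be sketched as follows, operating on the already recognized utterance. This is a minimal sketch under stated assumptions: the `ACTION_KEYWORDS` list, the `Keywords` tuple, and the dictionary form of the conversion instruction are hypothetical; the description gives only "input" as an example of an action command.

```python
from typing import NamedTuple, Optional

# Hypothetical action keywords; the description gives only "input" as an example.
ACTION_KEYWORDS = ("input", "enter", "type")


class Keywords(NamedTuple):
    action: Optional[str]  # action command such as "input", or None
    payload: str           # the text information the user wants entered


def extract_keywords(recognized: str) -> Keywords:
    """Step C11: split a recognized utterance into action keyword and payload."""
    words = recognized.strip().split(maxsplit=1)
    if words and words[0].lower() in ACTION_KEYWORDS:
        payload = words[1] if len(words) > 1 else ""
        return Keywords(action=words[0].lower(), payload=payload)
    # No leading action keyword: the whole utterance is the payload.
    return Keywords(action=None, payload=recognized.strip())


def make_conversion_instruction(recognized: str) -> dict:
    """Step C13: the no-optimization path; the action keyword is not emitted."""
    kw = extract_keywords(recognized)
    return {"text": kw.payload}
```

Treating only a leading word as an action command is a design choice of this sketch; it avoids stripping the word "input" when it appears inside the content the user actually wants to enter.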

After the step C12, the method also comprises the steps C14-C15:

step C14, if the first voice information needs to be optimized, generating an information optimization instruction for optimizing the first voice information according to the keyword information;

and step C15, generating a conversion instruction for converting the first voice information into text information based on the information optimization instruction.

Further, when inputting voice information, the user may use a simplified expression or an expression with a similar meaning. For example, the voice information "input 123456" may be spoken as "input 1 to 6". In this case, the keyword information extracted from the voice information is "input" and "text information 1 to 6"; the content that the user really wants to input is "123456", so the voice information input by the user needs to be optimized. Therefore, an information optimization instruction is generated first, and the simplified expression of the user is converted into the corresponding complete text content according to the information optimization instruction. A conversion instruction is then generated based on the information optimization instruction; the text information included in this conversion instruction is obtained by optimizing the text information corresponding to the voice information input by the user, that is, it is the text information that the user wants to input. Optimizing the voice information thus includes expanding, supplementing, and changing the text information corresponding to the voice information: when the user uses a simplified expression, the voice information needs to be expanded and supplemented, and when the user uses an approximate expression, the voice information needs to be changed, so as to obtain the text information that the user actually wants to input.
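The "1 to 6" case can be sketched as a single expansion rule. The regular expression and the function names are assumptions introduced for illustration; the description does not enumerate which simplified expressions are recognized, so this sketch covers only the numeric range form.

```python
import re

# Hypothetical pattern for one kind of simplified expression: "<a> to <b>".
_RANGE = re.compile(r"^\s*(\d+)\s+to\s+(\d+)\s*$")


def needs_optimization(payload: str) -> bool:
    """Step C12: decide whether the payload is a simplified expression."""
    return _RANGE.match(payload) is not None


def optimize(payload: str) -> str:
    """Steps C14-C15: expand the simplified expression into the full text."""
    m = _RANGE.match(payload)
    if m is None:
        return payload  # nothing to optimize; convert as-is
    start, end = int(m.group(1)), int(m.group(2))
    step = 1 if end >= start else -1
    # Concatenate every integer in the range, inclusive of both ends.
    return "".join(str(n) for n in range(start, end + step, step))
```

A fuller system would presumably hold a set of such rules, one per kind of simplified or approximate expression, and apply whichever matches.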

In this embodiment, the first voice information input by the user is analyzed to obtain a conversion instruction corresponding to the first voice information, and the first voice information is converted according to the generated conversion instruction to obtain the corresponding text information. That is, keyword information is extracted from the first voice information and analyzed to determine whether the text information corresponding to the first voice information needs to be optimized. If so, an information optimization instruction is generated, a conversion instruction is generated based on the information optimization instruction, and the text information is optimized and converted according to the generated information optimization instruction and conversion instruction to obtain the text information corresponding to the first voice information. This improves the accuracy of conversion between voice information and text information.

In addition, referring to fig. 5, an embodiment of the present invention further provides a voice input device, where the voice input device includes:

the voice input module 10 is used for starting a voice input function and acquiring first voice information when detecting that the equipment starts the information input function;

the voice recognition module 20 is configured to convert the first voice information into text information, and determine a focus position corresponding to the information input function;

a text output module 30, configured to output the text information to the focus position.
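The three modules can be pictured as cooperating objects, as in the following sketch. The class names follow the module names above, but the constructor parameters (`capture`, `recognize`, `get_focus`, `buffer`) are hypothetical and stand in for whatever microphone, recognition engine, and display interfaces a real device would use.

```python
from typing import Callable, List, Optional, Tuple


class VoiceInputModule:  # corresponds to module 10
    def __init__(self, capture: Callable[[], bytes]):
        self._capture = capture

    def acquire(self, input_function_started: bool) -> Optional[bytes]:
        # Voice is captured only once the information input function has started.
        return self._capture() if input_function_started else None


class VoiceRecognitionModule:  # corresponds to module 20
    def __init__(self, recognize: Callable[[bytes], str],
                 get_focus: Callable[[], int]):
        self._recognize = recognize
        self._get_focus = get_focus

    def convert(self, audio: bytes) -> Tuple[str, int]:
        # Returns (text information, focus position).
        return self._recognize(audio), self._get_focus()


class TextOutputModule:  # corresponds to module 30
    def __init__(self, buffer: List[str]):
        self._buffer = buffer

    def output(self, text: str, focus: int) -> None:
        # Insert the characters of the text at the focus position.
        self._buffer[focus:focus] = list(text)
```

Passing the platform-specific pieces in as callables keeps each module testable on its own, which is one plausible reading of the module split in the description.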

Optionally, the voice input module 10 includes:

the detection unit is used for outputting prompt information for starting the voice input function and acquiring a starting instruction when detecting that the information input function is started;

and the starting unit is used for starting the voice input function according to the starting instruction.

Optionally, the speech recognition module 20 includes:

the voice analysis unit is used for analyzing the first voice information to obtain a conversion instruction corresponding to the first voice information;

and the information conversion unit is used for converting the first voice information into text information according to the conversion instruction.

Optionally, the voice parsing unit includes:

an information extraction subunit, configured to extract keyword information from the first voice information;

the analysis subunit is used for analyzing the keyword information to determine whether the first voice information needs to be optimized when the first voice information is converted into text information;

and the first instruction subunit is used for generating a conversion instruction for converting the first voice information into corresponding text information if the first voice information does not need to be optimized.

Optionally, the voice parsing unit further includes:

the supplementary instruction subunit is used for generating an information optimization instruction for optimizing the first voice information according to the keyword information if the first voice information needs to be optimized;

and the second instruction subunit is used for generating a conversion instruction for converting the first voice information into text information based on the information optimization instruction.

Optionally, the voice input device further includes:

the change instruction unit is used for acquiring a change instruction of the text information;

the information changing unit is used for changing the text information according to the changing instruction to obtain target text information;

and the text output unit is used for outputting the target text information to the focus position.

Optionally, the information changing unit includes:

the first changing subunit is used for acquiring second voice information and converting the second voice information to obtain the target text information if the change instruction is a re-input instruction;

and the second changing subunit is used for modifying the text information according to the change instruction to obtain the target text information if the change instruction is a modification instruction.
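The two branches of the change handling can be sketched as a simple dispatch. The instruction classes and the replace-a-substring form of the modification are assumptions chosen for illustration; the section does not specify how a modification instruction is encoded.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class ReInput:
    """Re-input instruction: discard the text and dictate again."""


@dataclass(frozen=True)
class Modify:
    """Modification instruction (hypothetical form: replace `old` with `new`)."""
    old: str
    new: str


ChangeInstruction = Union[ReInput, Modify]


def apply_change(text: str,
                 instruction: ChangeInstruction,
                 acquire_second_voice: Callable[[], bytes],
                 recognize: Callable[[bytes], str]) -> str:
    """Return the target text information for the given change instruction."""
    if isinstance(instruction, ReInput):
        # First branch: acquire second voice information and convert it.
        return recognize(acquire_second_voice())
    if isinstance(instruction, Modify):
        # Second branch: modify the existing text information.
        return text.replace(instruction.old, instruction.new)
    raise TypeError(f"unsupported change instruction: {instruction!r}")
```

The returned target text would then be output to the focus position in the same way as the original text information.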

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a voice input program is stored on the computer-readable storage medium, and when the voice input program is executed by a processor, the voice input program implements operations in the voice input method provided in the foregoing embodiment.

For the method executed by each program module, reference may be made to the method embodiments of the present invention; details are not repeated here.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity/action/object from another entity/action/object without necessarily requiring or implying any actual relationship or order between such entities/actions/objects. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, in that elements described as separate components may or may not be physically separate. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the voice input method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent processes made using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims

1. A voice input method, characterized by comprising the steps of:

outputting the text information to the focus position.

2. The voice input method of claim 1, wherein the step of activating the voice input function when activation of the information input function is detected comprises:

and starting a voice input function according to the starting instruction.

3. The voice input method of claim 1, wherein the step of converting the first voice information into text information comprises:

4. The voice input method according to claim 3, wherein the step of analyzing the first voice information to obtain a conversion instruction corresponding to the first voice information comprises:

extracting keyword information from the first voice information;

5. The voice input method of claim 4, wherein the step of analyzing the keyword information to determine whether the first voice information needs to be optimized when converting the first voice information into text information comprises:

6. The voice input method of claim 1, wherein the step of outputting the text information to the focus position is followed by:

acquiring a change instruction of the text information;

and outputting the target text information to the focus position.

7. The voice input method of claim 6, wherein the change instruction includes a re-input instruction and a modification instruction, and the step of changing the text information according to the change instruction to obtain the target text information includes:

8. A voice input apparatus, characterized in that the voice input apparatus comprises:

9. A voice input device, characterized by comprising: a memory, a processor, and a voice input program stored on the memory and executable on the processor, the voice input program, when executed by the processor, implementing the steps of the voice input method according to any one of claims 1 to 7.

10. A computer-readable storage medium, having stored thereon a voice input program which, when executed by a processor, implements the steps of the voice input method of any one of claims 1 to 7.