CN115798465A - Voice input method, system and readable storage medium - Google Patents


Info

Publication number
CN115798465A
CN115798465A (application CN202310072790.3A)
Authority
CN
China
Prior art keywords
voice
recognized
speech
content
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310072790.3A
Other languages
Chinese (zh)
Other versions
CN115798465B (en)
Inventor
吴天
丁国平
黄聪聪
熊阳
刘智鹏
占祥东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianchuang Optoelectronic Engineering Co ltd
Original Assignee
Tianchuang Optoelectronic Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianchuang Optoelectronic Engineering Co ltd filed Critical Tianchuang Optoelectronic Engineering Co ltd
Priority to CN202310072790.3A priority Critical patent/CN115798465B/en
Publication of CN115798465A publication Critical patent/CN115798465A/en
Application granted granted Critical
Publication of CN115798465B publication Critical patent/CN115798465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a voice input method, a voice input system and a readable storage medium. The method comprises: acquiring a first speech to be recognized input by a user; obtaining the number of characters of the first speech to be recognized from first speech-rate information and the speech duration, and analyzing the first speech to be recognized to obtain the corresponding first speech content; acquiring a second speech to be recognized input by the user, and judging whether its number of characters is greater than that of the first speech to be recognized; if not, judging whether the second speech content corresponding to the second speech to be recognized is sub-content of the first speech content; and if it is sub-content, updating the first speech content based on the second speech content and outputting the updated first speech content. By matching speech at different rates to different speech recognition models, the method avoids the recognition failures, such as inaccurate recognition, that arise when a single general-purpose recognition model is used.

Description

Voice input method, system and readable storage medium
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice input method, a voice input system and a readable storage medium.
Background
Most current speech recognition functions are implemented with speech recognition models trained on libraries of standard-rate speech. In practical application scenarios, however, users speak at different rates for a variety of reasons, such as the linguistic environment in which they grew up or physiological factors.
For users who speak quickly, a general-purpose speech recognition model may fail: recognition may be inaccurate or even impossible. This greatly hinders such users' communication through speech recognition technology and seriously degrades their experience.
Disclosure of Invention
The invention provides a voice input method, a voice input system and a readable storage medium to solve the technical problem that, for users who speak quickly, general-purpose speech recognition may fail, producing inaccurate results or no result at all.
In a first aspect, the present invention provides a speech input method, including: when a first voice input instruction is received, acquiring a first speech to be recognized input by a user, and extracting first speech feature information from it, where the first speech feature information includes first voiceprint information and first speech-rate information corresponding to the first voiceprint information; obtaining the number of characters of the first speech to be recognized from the first speech-rate information and the speech duration, and analyzing the first speech to be recognized with a pre-trained speech recognition model associated with that number of characters to obtain the corresponding first speech content; when a second voice input instruction is received, acquiring a second speech to be recognized input by a user, and judging whether its number of characters is greater than that of the first speech to be recognized; if not, judging whether the second speech content corresponding to the second speech to be recognized is sub-content of the first speech content, where sub-content of the first speech content is speech content whose first and last characters both occur in the first speech content and/or speech content whose number of characters in common with the first speech content exceeds a preset threshold; and if the second speech content is sub-content of the first speech content, updating the first speech content based on the second speech content and outputting the updated first speech content.
Further, before obtaining the number of characters of the first speech to be recognized from the first speech-rate information and the speech duration, the method further includes: judging whether the first speech to be recognized contains a meaningless speech segment, where meaningless segments include blank segments and lingering segments; if so, removing each meaningless segment based on its start and end times and splicing the remainder into a first target speech to be recognized that contains only meaningful segments; and acquiring the first speech-rate information and the speech duration from the first target speech to be recognized.
Further, analyzing the first speech to be recognized with a pre-trained speech recognition model associated with the number of characters to obtain the first speech content includes: training a neural network model on training utterances of different character counts and their corresponding transcripts to obtain at least one speech recognition model, where each model recognizes utterances whose character count falls within a particular range; selecting the model corresponding to the character count of the first speech to be recognized; and analyzing the first speech to be recognized with that model to obtain the first speech content.
Further, before judging whether the number of characters of the second speech to be recognized is greater than that of the first speech to be recognized, the method further includes: judging whether second voiceprint information in the second speech to be recognized is the same as the first voiceprint information; and if so, obtaining the number of characters of the second speech to be recognized from the first speech-rate information and the speech duration of the second speech to be recognized.
Further, after that voiceprint judgment, the method further includes: if the second voiceprint information differs from the first voiceprint information, directly outputting the first speech content corresponding to the first speech to be recognized.
Further, after judging whether the number of characters of the second speech to be recognized is greater than that of the first speech to be recognized, the method further includes: if it is greater, directly outputting the first speech content corresponding to the first speech to be recognized.
In a second aspect, the present invention provides a speech input system comprising: an obtaining module configured to, when a first voice input instruction is received, acquire a first speech to be recognized input by a user and extract first speech feature information from it, the first speech feature information including first voiceprint information and first speech-rate information corresponding to the first voiceprint information; an analysis module configured to obtain the number of characters of the first speech to be recognized from the first speech-rate information and the speech duration, and to analyze the first speech to be recognized with a pre-trained speech recognition model associated with that number of characters to obtain the corresponding first speech content; a first judging module configured to, when a second voice input instruction is received, acquire a second speech to be recognized input by a user and judge whether its number of characters is greater than that of the first speech to be recognized; a second judging module configured to, if it is not greater, judge whether the second speech content corresponding to the second speech to be recognized is sub-content of the first speech content, where sub-content of the first speech content is speech content whose first and last characters both occur in the first speech content and/or speech content whose number of characters in common with the first speech content exceeds a preset threshold; and an updating module configured to, if the second speech content is sub-content of the first speech content, update the first speech content based on the second speech content and output the updated first speech content.
In a third aspect, an electronic device is provided, comprising at least one processor and a memory communicatively coupled to the at least one processor, where the memory stores instructions executable by the at least one processor to enable it to perform the steps of the speech input method of any embodiment of the present invention.
In a fourth aspect, the present invention also provides a computer-readable storage medium storing a computer program whose instructions, when executed by a processor, cause the processor to perform the steps of the speech input method of any embodiment of the present invention.
With the voice input method, system and readable storage medium of the invention, inputs of different character counts are recognized by different speech recognition models, so that speech at different rates is matched to a suitable model; this solves the recognition failures, such as inaccurate or impossible recognition, that fast speakers may experience with general-purpose speech recognition. During input, whether the recognized content is correct can be inferred by default by judging whether the character count of the second speech to be recognized exceeds that of the first and whether the second speech content is sub-content of the first, which improves voice input efficiency.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a voice input method according to an embodiment of the present invention;
fig. 2 is a block diagram of a voice input system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions are described below completely with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present invention; all other embodiments derived from them by those skilled in the art without creative effort fall within the protection scope of the present invention.
Please refer to fig. 1, which shows a flow chart of a voice input method of the present application.
As shown in fig. 1, the voice input method specifically includes the following steps:
step S101, when a first voice input instruction is received, a first to-be-recognized voice input by a user is obtained, and first voice feature information in the first to-be-recognized voice is extracted, wherein the first voice feature information comprises first voiceprint information and first speech speed information corresponding to the first voiceprint information.
In this embodiment, when a first voice input instruction is received, after a first to-be-recognized voice input by a user is acquired, a preset voice feature extraction tool may be used to perform feature extraction on the first to-be-recognized voice, so as to obtain first voice feature information.
It should be noted that the voice input instruction may be triggered in several ways: by voice, by key press, or by touch. For a voice-triggered instruction, the terminal device obtains it by detecting a wake-up word or other speech input; for a key-triggered instruction, by detecting a key-press signal; and for a touch-triggered instruction, by detecting whether a touch signal occurs in a designated area, and so on.
In some embodiments, the input speech may be analyzed with the Praat speech analysis software as a front-end tool. Praat captures and analyzes the speech signal of the input and outputs the analysis result as a text report or a graphical plot, from which the speech-rate information of the input speech can be obtained.
Step S102, obtaining the number of characters of the first speech to be recognized according to the first speech speed information and the speech duration, and analyzing the first speech to be recognized based on a pre-trained speech recognition model associated with the number of characters to obtain first speech content corresponding to the first speech to be recognized.
In this embodiment, the number of characters is the number of valid characters in the input speech; the counts below refer to the original Chinese utterances. For example, if the input speech is "navigate to Victory Road", its character count may be 6. As another example, if the input speech is "navigate to the Yuntian Grand Hotel at 226 Victory Road", its character count may be 17.
Specifically, the number of characters of the input speech is obtained from its speech-rate information and speech duration; a speech recognition model corresponding to that character count is selected; and the input speech is recognized with that model to obtain its speech content.
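The estimate described above reduces to a product of rate and duration; a minimal sketch follows (the function name and the unit of characters per second are assumptions, not from the patent):

```python
def estimated_char_count(speech_rate_cps: float, duration_s: float) -> int:
    """Estimate the number of characters in an utterance as speech rate
    (characters per second, an assumed unit) times the speech duration."""
    return round(speech_rate_cps * duration_s)
```

For instance, a 2-second utterance at 3 characters per second yields an estimate of 6 characters.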
In some embodiments, a neural network model is trained on training utterances of different character counts and their corresponding transcripts to obtain at least one speech recognition model, each of which recognizes utterances whose character count falls within a particular range; the model corresponding to the character count of the first speech to be recognized is selected; and the first speech to be recognized is analyzed with that model to obtain the first speech content.
For example, a 2-second input "navigate to Victory Road" and a 2-second input "navigate to the Yuntian Grand Hotel at 226 Victory Road" are used as two training utterances and fed to the neural network model to obtain two speech recognition models: one recognizes inputs of 1-10 characters, the other inputs of 11-20 characters. Recognizing inputs of different character counts with different models matches speech at different rates to a suitable model and solves the recognition failures, such as inaccurate or impossible recognition, that fast speakers may experience with general-purpose recognition.
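The bucket-to-model mapping implied by this example could be sketched as follows; the registry, the model names, and the fallback behavior are hypothetical, since the patent only gives the 1-10 and 11-20 character ranges:

```python
# Hypothetical registry mapping character-count ranges to model identifiers.
MODEL_BUCKETS = [
    (1, 10, "asr_model_short"),   # e.g. a short navigation command
    (11, 20, "asr_model_long"),   # e.g. a 17-character full address
]

def select_model(char_count: int) -> str:
    """Pick the speech recognition model whose character-count range
    contains the estimated count of the utterance."""
    for low, high, name in MODEL_BUCKETS:
        if low <= char_count <= high:
            return name
    return "asr_model_long"  # fallback for out-of-range counts (assumed)
```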
In some embodiments, before obtaining the number of characters of the first speech to be recognized according to the first speech rate information and the speech duration, the method further includes:
judging whether the first speech to be recognized contains a meaningless speech segment, where meaningless segments include blank segments and lingering segments; if so, removing each meaningless segment based on its start and end times and splicing the remainder into a first target speech to be recognized containing only meaningful segments; and acquiring the first speech-rate information and speech duration from the first target speech to be recognized. Removing the meaningless segments and re-splicing prevents speech with no actual meaning from inflating the duration of the whole input, improving the accuracy of the character-count estimate.
It should be noted that a blank segment is one whose sound intensity is below a preset intensity, representing a stretch where the user is not speaking or speaks very quietly; a lingering segment is one where the user is vocalizing at no less than the preset intensity but the sound carries no semantic content. For example, in the 2-second input "navigate to 226 Victory Road, uh, uh… the Yuntian Grand Hotel", the filler "uh, uh…" is a lingering segment. It is removed based on its start and end times, and the remainder is re-spliced into the first target speech to be recognized containing only meaningful segments ("navigate to the Yuntian Grand Hotel at 226 Victory Road").
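The filtering-and-splicing step can be sketched like this; the segment representation, the intensity threshold, and the `has_semantics` flag (which in practice would come from a voice-activity or filler detector) are all assumptions:

```python
def splice_meaningful(segments, min_intensity_db=40.0):
    """Drop blank segments (mean intensity below the preset threshold) and
    lingering segments (audible but semantically empty), returning the kept
    (start_s, end_s) pairs and their total duration. Each input segment is
    a tuple (start_s, end_s, mean_intensity_db, has_semantics)."""
    kept = [(start, end) for start, end, db, semantic in segments
            if db >= min_intensity_db and semantic]
    return kept, sum(end - start for start, end in kept)
```

The total duration of the spliced result is what feeds the character-count estimate, so fillers such as "uh, uh…" no longer inflate it.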
Step S103, when a second voice input instruction is received, a second voice to be recognized input by the user is obtained, and whether the number of characters of the second voice to be recognized is larger than that of the first voice to be recognized is judged.
In this embodiment, when a second voice input instruction is received, the second speech to be recognized input by a user is acquired, and it is judged whether its second voiceprint information matches the first voiceprint information. If it does not, the first speech content corresponding to the first speech to be recognized is output directly: a different voiceprint indicates that a new user has taken over and that the previous user implicitly accepted the first speech content as correct. Confirming the first speech content through the next user's voiceprint in this way effectively improves the fluency of speech-content output when users switch.
In one application scenario, user A speaks "navigate to the Yuntian Hotel" after the wake-up word ("input"); the speech recognition model recognizes it and presents the content "navigate to the Yuntian Hotel" to the user. User B then speaks "Victory Road". Because user B's voiceprint differs from user A's, the content corresponding to user A's input is taken as correct. A list of Yuntian Hotels at multiple addresses is then displayed, for example the Yuntian Hotel on Victory Road and the Yuntian Hotel on Yunfei Road, and user B's input "Victory Road Yuntian Hotel" determines the final destination: "navigate to the Victory Road Yuntian Hotel".
In the prior art, after user A speaks "navigate to the Yuntian Hotel", the system generates an interactive prompt asking whether "navigate to the Yuntian Hotel" was meant, and the user must reply "yes" to confirm the content, which makes voice input inconvenient. With the method of this embodiment, user B speaks "Victory Road Yuntian Hotel", and the content input by user A is confirmed directly because user B's voiceprint differs from user A's.
It should be noted that, if the second voiceprint information matches the first voiceprint information, the number of characters of the second speech to be recognized is obtained from the first speech-rate information and the speech duration of the second speech to be recognized, which facilitates the subsequent voice input operations.
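Voiceprint matching is typically done by comparing speaker embeddings; a minimal sketch using cosine similarity follows (the embedding source and the 0.8 threshold are assumptions, as the patent does not specify how voiceprints are compared):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(voiceprint_a, voiceprint_b, threshold=0.8):
    """Treat two voiceprint embeddings as the same speaker when their
    cosine similarity reaches an assumed threshold."""
    return cosine_similarity(voiceprint_a, voiceprint_b) >= threshold
```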
Step S104, if the number of characters of the second speech to be recognized is not larger than that of the first speech to be recognized, judging whether a second speech content corresponding to the second speech to be recognized is a sub-speech content of the first speech content.
In this embodiment, sub-content of the first speech content is speech content whose first and last characters both occur in the first speech content and/or speech content whose number of characters in common with the first speech content exceeds a preset threshold. Judging whether the second speech content is sub-content of the first decides whether the first speech content can simply be output, sparing the user repeated confirmations and effectively improving voice input efficiency.
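The two sub-content conditions translate directly into code; in this sketch the default threshold of 4 follows the later example, and the shared-character count tallies characters of the second content that also occur in the first, which is one plausible reading of the definition:

```python
def is_sub_content(first: str, second: str, shared_threshold: int = 4) -> bool:
    """True when the second content's first and last characters both occur
    in the first content, and/or when more than `shared_threshold` of its
    characters also occur in the first content."""
    if not second:
        return False
    endpoints_present = second[0] in first and second[-1] in first
    shared = sum(1 for ch in second if ch in first)
    return endpoints_present or shared > shared_threshold
```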
In another application scenario, user A speaks "navigate to the Yuntian Hotel" after the wake-up word ("input"), and the recognition model presents the content "navigate to the Yun Hotel". Because this recognized content is wrong, user A speaks "to the Yuntian Hotel" again. The first character and the last character of the re-entered content both occur in the previously recognized content, so the re-entered content is judged to be sub-content of it, and the previous content is replaced to give "navigate to the Yuntian Hotel".
Alternatively, user A speaks "navigate to the Yuntian Hotel" after the wake-up word ("input"), and the recognition model presents the wrong content "navigate to the Yun Hotel", so user A speaks "Victory Road Yuntian Hotel" again. The re-entered content shares 5 characters with the previously recognized content, more than the preset threshold of 4, so it is judged to be sub-content of the previous content, which is replaced to give "navigate to the Victory Road Yuntian Hotel".
In some optional embodiments, if the number of characters of the second speech to be recognized is greater than the number of characters of the first speech to be recognized, the first speech content corresponding to the first speech to be recognized is directly output.
Because the second speech to be recognized has more characters than the first, its purpose differs from that of the first speech to be recognized, and the first speech can be taken as correctly recognized by default.
For example, in a scenario where text is entered by voice, if a short sentence is recognized correctly, the next sentence can be spoken directly; if it is recognized incorrectly, the correct words for the erroneous part are spoken to replace it.
Specifically, suppose a short sentence is "today the weather is sunny turning cloudy". If the recognized content is "today the weather is sunny turning cloudy" and the user's next input is "Xiaoming and his classmates went fishing outdoors", the recognized content is taken as correct by default and needs no further confirmation. If instead the recognized content is "today the weather is sunny turning rainy" and the user's next input is "sunny turning cloudy", then because two of its characters match characters in the previously recognized sentence, "sunny turning cloudy" is treated as sub-content of "today the weather is sunny turning rainy", and the content is finally replaced with "today the weather is sunny turning cloudy".
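The patent does not spell out how the first content is updated once a correction qualifies as sub-content; one hedged way to sketch the replacement, aligning on the longest matching block via Python's standard difflib, is shown below (the alignment rule is a heuristic assumption, not the patent's algorithm):

```python
from difflib import SequenceMatcher

def apply_correction(first: str, second: str) -> str:
    """Splice the correction `second` into `first`, anchored at the longest
    matching block. Assumes the correction covers a contiguous span of the
    first content; this alignment rule is a heuristic, not from the patent."""
    match = SequenceMatcher(None, first, second).find_longest_match(
        0, len(first), 0, len(second))
    if match.size == 0:
        return first  # nothing aligns: leave the first content untouched
    start = max(match.a - match.b, 0)  # where the correction is assumed to begin
    end = start + len(second)          # span it is assumed to cover
    return first[:start] + second + first[end:]
```

With the weather example, correcting "today the weather is sunny turning rainy" with "sunny turning cloudy" yields "today the weather is sunny turning cloudy".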
Step S105, if the second voice content corresponding to the second voice to be recognized is a sub-voice content of the first voice content, updating the first voice content based on the second voice content, and outputting the updated first voice content.
In this embodiment, if the second speech content corresponding to the second speech to be recognized is the sub-speech content of the first speech content, the first speech content is updated based on the second speech content, and the updated first speech content is output.
And if the second voice content corresponding to the second voice to be recognized is not the sub-voice content of the first voice content, directly outputting the first voice content corresponding to the first voice to be recognized.
For example, if the short sentence is "today the weather is sunny turning cloudy", the recognized content matches it, and the user's next input is "the time is morning", that input is not sub-content of "today the weather is sunny turning cloudy", so the recognized content is taken as correct by default.
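Putting the branching of steps S103 to S105 together, a minimal decision-flow sketch follows; the callable parameters are injected so the sketch stays self-contained, and the names are assumptions:

```python
def handle_second_input(first_content, second_content,
                        first_char_count, second_char_count,
                        is_sub_content, apply_correction):
    """Decision flow for the second utterance: a longer utterance implicitly
    confirms the first result; a shorter or equal one that qualifies as
    sub-content corrects it; anything else leaves the first result as-is."""
    if second_char_count > first_char_count:
        return first_content  # new sentence: first result confirmed by default
    if is_sub_content(first_content, second_content):
        return apply_correction(first_content, second_content)
    return first_content      # not a correction: output the first result
```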
In summary, the method of the present application recognizes inputs of different character counts with different speech recognition models, so that speech at different rates is matched to a suitable model, solving the recognition failures, such as inaccurate or impossible recognition, that fast speakers may experience with general-purpose recognition. During input, whether the recognized content is correct can be inferred by default by judging whether the character count of the second speech to be recognized exceeds that of the first and whether the second speech content is sub-content of the first, which improves voice input efficiency.
Referring to fig. 2, a block diagram of a speech input system of the present application is shown.
As shown in fig. 2, the speech input system 200 includes an obtaining module 210, an analyzing module 220, a first determining module 230, a second determining module 240, and an updating module 250.
The obtaining module 210 is configured to, when a first voice input instruction is received, obtain a first to-be-recognized voice input by a user and extract first voice feature information from it, where the first voice feature information includes first voiceprint information and first speech speed information corresponding to the first voiceprint information. The analysis module 220 is configured to obtain the number of characters of the first to-be-recognized voice according to the first speech speed information and the voice duration, and to analyze the first to-be-recognized voice based on a pre-trained speech recognition model associated with that number of characters, obtaining first voice content corresponding to the first to-be-recognized voice. The first determining module 230 is configured to, when a second voice input instruction is received, obtain a second to-be-recognized voice input by the user and determine whether its number of characters is greater than that of the first to-be-recognized voice. The second determining module 240 is configured to, if the number of characters of the second to-be-recognized voice is not greater than that of the first, determine whether the second voice content corresponding to the second to-be-recognized voice is sub-voice content of the first voice content, where sub-voice content of the first voice content is voice content whose first and last characters both exist in the first voice content and/or voice content whose number of characters in common with the first voice content exceeds a preset threshold. The updating module 250 is configured to, if the second voice content is sub-voice content of the first voice content, update the first voice content based on the second voice content and output the updated first voice content.
It should be understood that the modules depicted in fig. 2 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 2, and are not described again here.
In still other embodiments, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program instructions cause the processor to perform the voice input method of any of the above method embodiments.
as one embodiment, the computer-readable storage medium of the present invention stores computer-executable instructions configured to:
when a first voice input instruction is received, acquiring a first to-be-recognized voice input by a user, and extracting first voice feature information in the first to-be-recognized voice, wherein the first voice feature information comprises first voiceprint information and first speech speed information corresponding to the first voiceprint information;
obtaining the number of characters of the first speech to be recognized according to the first speech speed information and the speech duration, and analyzing the first speech to be recognized based on a pre-trained speech recognition model associated with the number of characters to obtain first speech content corresponding to the first speech to be recognized;
when a second voice input instruction is received, acquiring a second voice to be recognized input by a user, and judging whether the number of characters of the second voice to be recognized is larger than that of the first voice to be recognized;
if the number of the characters of the second voice to be recognized is not larger than that of the first voice to be recognized, judging whether second voice content corresponding to the second voice to be recognized is sub-voice content of the first voice content, wherein the sub-voice content of the first voice content is voice content with a first character and a last character both existing in the first voice content and/or voice content with the number of characters same as the characters in the first voice content larger than a preset threshold value;
and if the second voice content corresponding to the second voice to be recognized is the sub-voice content of the first voice content, updating the first voice content based on the second voice content, and outputting the updated first voice content.
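The stored instructions above describe a two-utterance decision: output the first result directly unless the second, shorter utterance is sub-content of it, in which case treat the second utterance as a correction and update the first result. A minimal sketch of that control flow follows; the character-level sub-content test, the window-alignment splice, and all names are illustrative assumptions, not the patent's implementation.

```python
import difflib

def is_sub_content(first: str, second: str, threshold: int = 3) -> bool:
    # Character-level reading of the sub-voice-content definition above.
    if not second:
        return False
    return (second[0] in first and second[-1] in first) or \
        sum(1 for ch in second if ch in first) > threshold

def splice_correction(first: str, second: str) -> str:
    """Replace the window of `first` most similar to `second` with `second`
    (a naive stand-in for a real alignment-based update)."""
    width = min(len(second), len(first))
    best_start, best_ratio = 0, -1.0
    for start in range(len(first) - width + 1):
        ratio = difflib.SequenceMatcher(
            None, first[start:start + width], second).ratio()
        if ratio > best_ratio:
            best_start, best_ratio = start, ratio
    return first[:best_start] + second + first[best_start + width:]

def decide_output(first_content, second_content, n_first, n_second):
    """Return (output_text, was_updated) for a second utterance following
    the first, mirroring the stored instruction sequence."""
    if n_second > n_first:
        return first_content, False          # new sentence: keep first result
    if not is_sub_content(first_content, second_content):
        return first_content, False          # unrelated: accept first result
    return splice_correction(first_content, second_content), True
```

For example, `decide_output("abc def ghi", "def", 11, 3)` reports an update but leaves the text unchanged, because the correction already appears verbatim in the first result.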
The computer-readable storage medium may include a program storage area and a data storage area: the program storage area may store an operating system and an application program required for at least one function, while the data storage area may store data created through use of the voice input system. Further, the computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice input system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor 310 and a memory 320, and may further include an input device 330 and an output device 340. The processor 310, the memory 320, the input device 330, and the output device 340 may be connected by a bus or by other means; connection by a bus is taken as the example in fig. 3. The memory 320 is the computer-readable storage medium described above. By running the non-volatile software programs, instructions, and modules stored in the memory 320, the processor 310 executes the various functional applications and data processing of the server, that is, implements the voice input method of the above method embodiment. The input device 330 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice input system. The output device 340 may include a display device such as a display screen.
The electronic device can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device is applied to a voice input system and used for a client, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
when a first voice input instruction is received, acquiring a first to-be-recognized voice input by a user, and extracting first voice characteristic information in the first to-be-recognized voice, wherein the first voice characteristic information comprises first voiceprint information and first speech speed information corresponding to the first voiceprint information;
obtaining the number of characters of the first speech to be recognized according to the first speech speed information and the speech duration, and analyzing the first speech to be recognized based on a pre-trained speech recognition model associated with the number of characters to obtain first speech content corresponding to the first speech to be recognized;
when a second voice input instruction is received, acquiring a second voice to be recognized input by a user, and judging whether the number of characters of the second voice to be recognized is larger than that of the first voice to be recognized;
if the number of the characters of the second voice to be recognized is not larger than that of the first voice to be recognized, judging whether second voice content corresponding to the second voice to be recognized is sub-voice content of the first voice content, wherein the sub-voice content of the first voice content is voice content with a first character and a last character both existing in the first voice content and/or voice content with the number of characters same as the characters in the first voice content larger than a preset threshold value;
and if the second voice content corresponding to the second voice to be recognized is the sub-voice content of the first voice content, updating the first voice content based on the second voice content, and outputting the updated first voice content.
Those of skill in the art will understand that the logic and/or steps illustrated in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection having one or more wires (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium could even be paper or another suitable medium upon which the program is printed, since the program can be captured electronically, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (9)

1. A speech input method, comprising:
when a first voice input instruction is received, acquiring a first to-be-recognized voice input by a user, and extracting first voice feature information in the first to-be-recognized voice, wherein the first voice feature information comprises first voiceprint information and first speech speed information corresponding to the first voiceprint information;
obtaining the number of characters of the first speech to be recognized according to the first speech speed information and the speech duration, and analyzing the first speech to be recognized based on a pre-trained speech recognition model associated with the number of characters to obtain first speech content corresponding to the first speech to be recognized;
when a second voice input instruction is received, acquiring a second voice to be recognized input by a user, and judging whether the number of characters of the second voice to be recognized is larger than that of the first voice to be recognized;
if the number of the characters of the second voice to be recognized is not larger than that of the first voice to be recognized, judging whether second voice content corresponding to the second voice to be recognized is sub-voice content of the first voice content, wherein the sub-voice content of the first voice content is voice content with a first character and a last character both existing in the first voice content and/or voice content with the number of characters same as the characters in the first voice content larger than a preset threshold value;
and if the second voice content corresponding to the second voice to be recognized is the sub-voice content of the first voice content, updating the first voice content based on the second voice content, and outputting the updated first voice content.
2. The method according to claim 1, wherein before obtaining the number of characters of the first speech to be recognized according to the first speech rate information and the speech duration, the method further comprises:
judging whether a nonsense voice section exists in the first voice to be recognized, wherein the nonsense voice section comprises a blank voice section and a drawn-out voice section;
if the nonsense speech segment exists in the first speech to be recognized, removing the nonsense speech segment from the first speech to be recognized based on the starting time and the ending time of the nonsense speech segment, and splicing the remainder into a first target speech to be recognized that comprises only meaningful speech segments;
and acquiring first speech speed information in the first target speech to be recognized and the speech duration of the first target speech to be recognized.
3. The method as claimed in claim 1, wherein the analyzing the first to-be-recognized speech based on a pre-trained speech recognition model associated with the number of characters to obtain a first speech content corresponding to the first to-be-recognized speech comprises:
training a neural network model based on training voices with different character numbers and training voice contents corresponding to the training voices to obtain at least one voice recognition model, wherein one voice recognition model is used for recognizing the training voices with the character numbers within a character number range;
selecting, according to the number of characters of the first voice to be recognized, the voice recognition model corresponding to that number of characters;
and analyzing the first to-be-recognized voice with the selected voice recognition model to obtain first voice content corresponding to the first to-be-recognized voice.
4. The voice input method according to claim 1, wherein before the determining whether the number of characters of the second voice to be recognized is larger than the number of characters of the first voice to be recognized, the method further comprises:
judging whether second voiceprint information in the second voice to be recognized is the same as the first voiceprint information;
and if the second voiceprint information in the second voice to be recognized is the same as the first voiceprint information, obtaining the character number of the second voice to be recognized based on the first voice speed information and the voice duration of the second voice to be recognized.
5. The speech input method according to claim 4, wherein after determining whether the second voiceprint information in the second speech to be recognized is the same as the first voiceprint information, the method further comprises:
and if the second voiceprint information in the second voice to be recognized is different from the first voiceprint information, directly outputting the first voice content corresponding to the first voice to be recognized.
6. The voice input method according to claim 1, wherein after determining whether the number of characters of the second voice to be recognized is larger than the number of characters of the first voice to be recognized, the method further comprises:
and if the number of the characters of the second speech to be recognized is larger than that of the characters of the first speech to be recognized, directly outputting the first speech content corresponding to the first speech to be recognized.
7. A speech input system, comprising:
the voice recognition method comprises the steps that an obtaining module is configured to obtain a first to-be-recognized voice input by a user when a first voice input instruction is received, and extract first voice feature information in the first to-be-recognized voice, wherein the first voice feature information comprises first voiceprint information and first speech speed information corresponding to the first voiceprint information;
the analysis module is configured to obtain the number of characters of the first to-be-recognized voice according to the first speech speed information and the voice duration, and analyze the first to-be-recognized voice based on a pre-trained voice recognition model associated with the number of characters to obtain first voice content corresponding to the first to-be-recognized voice;
the first judgment module is configured to acquire a second voice to be recognized input by a user when a second voice input instruction is received, and judge whether the number of characters of the second voice to be recognized is larger than that of the first voice to be recognized;
a second determining module, configured to determine whether a second voice content corresponding to a second voice to be recognized is a sub-voice content of the first voice content if the number of characters of the second voice to be recognized is not greater than the number of characters of the first voice to be recognized, where the sub-voice content of the first voice content is a voice content in which both a first character and a last character exist in the first voice content and/or a voice content in which the number of characters identical to the characters in the first voice content is greater than a preset threshold;
and the updating module is configured to update the first voice content based on the second voice content and output the updated first voice content if the second voice content corresponding to the second voice to be recognized is the sub-voice content of the first voice content.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
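The preprocessing in claims 2 and 4-5 — dropping blank and drawn-out segments before measuring speech duration, and gating the second utterance's character count on a voiceprint match — can be sketched as follows. The `Segment` representation and the simple equality test on voiceprints are stand-ins for whatever signal processing a real implementation would use.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float      # seconds
    end: float
    meaningful: bool  # False for blank or drawn-out (nonsense) segments

def splice_meaningful(segments):
    """Claim 2: remove nonsense segments by their start/end times and keep
    only meaningful speech, returning it with its total duration."""
    kept = [s for s in segments if s.meaningful]
    return kept, sum(s.end - s.start for s in kept)

def second_char_count(first_voiceprint, second_voiceprint,
                      first_rate_cps, second_duration_sec):
    """Claims 4-5: derive the second utterance's character count from the
    first speaker's stored speech rate, but only when the voiceprints
    match; otherwise return None (the first content is output as-is)."""
    if first_voiceprint != second_voiceprint:
        return None
    return round(first_rate_cps * second_duration_sec)

segs = [Segment(0.0, 1.0, True), Segment(1.0, 1.8, False), Segment(1.8, 3.0, True)]
kept, duration = splice_meaningful(segs)
print(len(kept), round(duration, 2))  # 2 2.2
```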
CN202310072790.3A 2023-02-07 2023-02-07 Voice input method, system and readable storage medium Active CN115798465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310072790.3A CN115798465B (en) 2023-02-07 2023-02-07 Voice input method, system and readable storage medium


Publications (2)

Publication Number Publication Date
CN115798465A true CN115798465A (en) 2023-03-14
CN115798465B CN115798465B (en) 2023-04-07

Family

ID=85430238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310072790.3A Active CN115798465B (en) 2023-02-07 2023-02-07 Voice input method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN115798465B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004240234A (en) * 2003-02-07 2004-08-26 Nippon Hoso Kyokai <Nhk> Server, system, method and program for character string correction training
JP2015215503A (en) * 2014-05-12 2015-12-03 日本電信電話株式会社 Voice recognition method, voice recognition device and voice recognition program
CN105810188A (en) * 2014-12-30 2016-07-27 联想(北京)有限公司 Information processing method and electronic equipment
CN108038687A (en) * 2017-11-21 2018-05-15 平安科技(深圳)有限公司 Method of commerce, server and computer-readable recording medium based on speech recognition
CN108520760A (en) * 2018-03-27 2018-09-11 维沃移动通信有限公司 A kind of audio signal processing method and terminal
CN110663079A (en) * 2017-05-24 2020-01-07 乐威指南公司 Method and system for correcting input generated using automatic speech recognition based on speech
US10573312B1 (en) * 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
CN111326140A (en) * 2020-03-12 2020-06-23 科大讯飞股份有限公司 Speech recognition result discrimination method, correction method, device, equipment and storage medium
CN112037775A (en) * 2020-09-08 2020-12-04 北京嘀嘀无限科技发展有限公司 Voice recognition method, device, equipment and storage medium
CN113393840A (en) * 2021-08-17 2021-09-14 硕广达微电子(深圳)有限公司 Mobile terminal control system and method based on voice recognition
CN114005438A (en) * 2021-12-31 2022-02-01 科大讯飞股份有限公司 Speech recognition method, training method of speech recognition model and related device
CN114360531A (en) * 2021-12-10 2022-04-15 上海小度技术有限公司 Speech recognition method, control method, model training method and device thereof
CN114360510A (en) * 2022-01-14 2022-04-15 腾讯科技(深圳)有限公司 Voice recognition method and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kyuyeon Hwang et al.: "Character-level incremental speech recognition with recurrent neural networks" *
Liang Yulong; Qu Dan; Li Zhen; Zhang Wenlin: "Uyghur speech recognition based on convolutional neural networks" *

Also Published As

Publication number Publication date
CN115798465B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN107919130B (en) Cloud-based voice processing method and device
CN107134279B (en) Voice awakening method, device, terminal and storage medium
CN107731228B (en) Text conversion method and device for English voice information
CN108829894B (en) Spoken word recognition and semantic recognition method and device
CN113327609B (en) Method and apparatus for speech recognition
CN109920415A (en) Nan-machine interrogation&#39;s method, apparatus, equipment and storage medium based on speech recognition
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN109920414A (en) Nan-machine interrogation&#39;s method, apparatus, equipment and storage medium
CN111191450B (en) Corpus cleaning method, corpus input device and computer readable storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN111341305A (en) Audio data labeling method, device and system
CN110544473B (en) Voice interaction method and device
CN112270168B (en) Method and device for predicting emotion style of dialogue, electronic equipment and storage medium
CN111292751B (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN108766431B (en) Automatic awakening method based on voice recognition and electronic equipment
CN109979440B (en) Keyword sample determination method, voice recognition method, device, equipment and medium
CN112102828A (en) Voice control method and system for automatically broadcasting content on large screen
CN112002349B (en) Voice endpoint detection method and device
CN108897517B (en) Information processing method and electronic equipment
CN114639386A (en) Text error correction and text error correction word bank construction method
CN116821290A (en) Multitasking dialogue-oriented large language model training method and interaction method
CN112151034B (en) Voice control method and device of equipment, electronic equipment and storage medium
CN110503943B (en) Voice interaction method and voice interaction system
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN115798465B (en) Voice input method, system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant