CN111261161A - Voice recognition method, device and storage medium - Google Patents

Voice recognition method, device and storage medium

Info

Publication number
CN111261161A
Authority
CN
China
Prior art keywords
voice
voice recognition
tail end
target
speech
Prior art date
Legal status
Granted
Application number
CN202010111854.2A
Other languages
Chinese (zh)
Other versions
CN111261161B (en)
Inventor
生士东
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010111854.2A priority Critical patent/CN111261161B/en
Publication of CN111261161A publication Critical patent/CN111261161A/en
Application granted granted Critical
Publication of CN111261161B publication Critical patent/CN111261161B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L2015/225 - Feedback of the input speech

Abstract

The present invention relates to the field of speech recognition technologies, and in particular to a speech recognition method, apparatus, and storage medium. While the voice of a voice object is being received, once the tail end silence in the voice reaches a first preset time, a target voice is obtained and uploaded to the voice recognition server, which recognizes the target voice in advance, before the voice has completely ended, and obtains a preprocessing result. When the tail end silence reaches a second preset time, i.e., the voice has completely ended, voice recognition is requested from the server, which can quickly determine and return the recognition result from the preprocessing result. The client therefore obtains the voice recognition result as soon as it confirms that the voice has completely ended, which reduces the time the client waits for the server's data processing result.

Description

Voice recognition method, device and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
With the continuous development of electronic devices, their control systems, such as voice control systems, an important component of such devices, have developed as well. As speech recognition technology has matured, a variety of speech recognition software has appeared, making communication between people and electronic devices simple and engaging. To avoid misoperation when a person controls an electronic device by voice, a wake-up word can be set: when the device receives the wake-up word that matches it, it begins receiving external voice control information and executes the corresponding operation according to that information.
Each time a user interacts with the electronic equipment by voice, the equipment receives the user's voice and transmits it to a server, which performs the voice recognition; the equipment then operates according to the user's instruction. Generally, the voice transmitted to the server must be complete audio data consisting of four parts: wake-up audio data, VAD (Voice Activity Detection) silent front-end data, audio data during speaking, and silent tail-end audio data after speaking. The wake-up audio data carries the wake-up word and is used to wake the electronic equipment. The VAD silent front-end data addresses the detection delay of the VAD front endpoint in the prior art (by the time the front endpoint is detected, the person has already been speaking for a moment); forward padding is added to ensure the accuracy of the overall recognition. The audio data during speaking may contain the control instruction the person issues to the electronic equipment. The silent tail-end audio data is used to determine the end of the utterance: because short silences can also occur while the person is speaking, a continuous silence must be observed for a period of time (e.g., 500 ms) before the utterance is considered finished. During voice interaction, therefore, the electronic equipment can upload the complete audio data only after the user has finished speaking and remained silent for that period, and the recognition result fed back by the server is available only after the server has recognized the audio data.
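For concreteness, the four-part upload described above can be pictured as a simple container. The sketch below is illustrative only; the field names and the 500 ms figure are assumptions drawn from this description, not from any concrete implementation:

```python
from dataclasses import dataclass

@dataclass
class CompleteAudioData:
    """Prior-art payload: all four parts must be assembled before
    the client contacts the speech recognition server."""
    wake_word_audio: bytes    # wake-up audio data (the wake-up word)
    vad_front_padding: bytes  # forward padding covering the VAD front-endpoint delay
    utterance_audio: bytes    # audio captured while the user speaks
    tail_silence: bytes       # continuous end silence (e.g. 500 ms) proving the utterance ended
```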
Disclosure of Invention
The invention provides a voice recognition method, apparatus, and storage medium, which can obtain a voice recognition result promptly once the voice of a voice object is determined to have ended, reducing the time the client waits for a response and improving the interactive experience.
In a first aspect, the present invention provides a speech recognition method, including:
receiving voice of a voice object, wherein the voice comprises at least one unit voice, and each unit voice comprises an instruction voice and tail end silence after the instruction voice is finished;
determining the current unit voice according to the time sequence of voice receiving;
for the current unit of speech, performing the following data processing operations:
when the duration of the tail end silence of the current unit voice reaches a first preset time, determining a target voice based on the received voice, and transmitting the target voice to a voice recognition server; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
when the duration of the tail end silence reaches a second preset time, sending a voice recognition request to the voice recognition server, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result;
and receiving the response data returned by the voice recognition server.
In a second aspect, the present invention provides another speech recognition method, the method comprising:
receiving target voice uploaded by a client, wherein the target voice is determined according to the received voice when, in the process of the client receiving the voice, the duration of the tail end silence of the current unit voice reaches a first preset time;
preprocessing the received target voice to obtain a preprocessing result;
receiving a voice recognition request sent by the client, wherein the voice recognition request is generated by the client when the duration of the tail end silence reaches a second preset time;
determining response data to the voice recognition request according to the preprocessing result;
and sending the response data to the client.
A third aspect provides a speech recognition apparatus, the apparatus comprising:
the voice receiving module is used for receiving voice of a voice object, wherein the voice comprises at least one unit voice, and each unit voice comprises an instruction voice and tail end silence after the instruction voice is finished;
the current unit voice determining module is used for determining the current unit voice according to the time sequence of voice receiving;
the data processing module is used for executing data processing operation on the current unit voice;
the data processing module comprises a target voice sending unit and a voice recognition request sending unit;
the target voice sending unit is used for determining a target voice based on the received voice and transmitting the target voice to a voice recognition server when the continuous time of the tail end silence of the current unit voice reaches a first preset time; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
the voice recognition request sending unit is configured to send a voice recognition request to the voice recognition server when the duration of the tail end silence reaches a second preset time, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result;
and the response data receiving module is used for receiving the response data returned by the voice recognition server.
In a fourth aspect, the present invention provides another speech recognition apparatus, comprising:
the target voice receiving module is used for receiving target voice uploaded by the client, wherein the target voice is determined according to the received voice when, in the process of the client receiving the voice, the duration of the tail end silence of the current unit voice reaches a first preset time;
the preprocessing module is used for preprocessing the received target voice to obtain a preprocessing result;
a voice recognition request receiving module, configured to receive a voice recognition request sent by the client, where the voice recognition request is generated by the client when the duration of the tail end silence reaches a second preset time;
the voice recognition request processing module is used for determining response data to the voice recognition request according to the preprocessing result;
and the response data sending module is used for sending the response data to the client.
A fifth aspect provides an electronic device comprising a processor and a memory, the memory having stored therein at least one instruction and at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the speech recognition method according to the first or second aspect.
A sixth aspect provides a computer storage medium having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program being loaded and executed by a processor to implement the speech recognition method according to the first or second aspect.
The voice recognition method, the voice recognition device and the storage medium have the following technical effects:
while the voice of the voice object is being received, once the tail end silence in the voice reaches the first preset time, the target voice is obtained and uploaded to the voice recognition server, which recognizes the target voice in advance, before the voice has completely ended, and obtains a preprocessing result. When the tail end silence reaches the second preset time, i.e., the voice has completely ended, voice recognition is requested from the server, which can quickly determine and return the recognition result from the preprocessing result. The client can therefore obtain the voice recognition result immediately upon confirming that the voice has completely ended, which shortens the time the client waits for the server's data processing result, improves the end-to-end response speed, and improves the user's experience of voice operation.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a block diagram of a speech recognition system according to an embodiment of the inventive concept;
FIG. 2 is a flow chart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a flow diagram illustrating one embodiment of performing data processing operations on a current speech unit provided by the present invention;
FIG. 4 is a flow chart illustrating another embodiment of a speech recognition method provided by the present invention;
FIG. 5 is a flow diagram illustrating one embodiment of pre-processing a target speech provided by the present invention;
FIG. 6 is a diagram illustrating an application scenario of the speech recognition method provided by the present invention;
FIG. 7 is a schematic diagram of a speech recognition interaction using a generic speech recognition method;
FIG. 8 is a schematic diagram of data processing logic when performing speech recognition interactions using a generic speech recognition method;
FIG. 9 is a schematic diagram of a speech recognition interaction using the speech recognition method provided by the present invention;
FIG. 10 is a schematic diagram of data processing logic for performing speech recognition interactions using the speech recognition method provided by the present invention;
FIG. 11 is a schematic structural diagram of an embodiment of a speech recognition apparatus provided in the present invention;
FIG. 12 is a schematic diagram of another embodiment of a speech recognition apparatus according to the present invention;
FIG. 13 is a schematic structural diagram of an embodiment of a client provided by the present invention;
FIG. 14 is a schematic block diagram of one embodiment of a server provided by the present invention;
FIG. 15 is an alternative structure diagram of the distributed system applied to the blockchain system according to the embodiment of the present invention;
fig. 16 is an alternative schematic diagram of a block structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The voice recognition scheme provided by the embodiment of the invention can realize the quick voice recognition by utilizing artificial intelligence and cloud computing.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers in place of human eyes to identify, track, and measure targets, and to further process the captured images into forms more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, i.e., the language people use every day, and is closely connected with the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specifically studies how computers simulate or realize human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications pervade all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Automatic driving technology generally comprises high-precision maps, environment perception, behavior decision, path planning, motion control, and other technologies; autonomous driving technology has broad application prospects.
with the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, resources in the "cloud" appear infinitely expandable: available at any time, used on demand, expanded at any time, and paid for according to use.
As the basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, in which multiple types of virtual resources are deployed for external clients to select and use.
According to logical function division, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS covers various kinds of business software, such as web portals and bulk SMS senders. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
Artificial intelligence cloud services are also commonly referred to as AIaaS (AI as a Service). This is a mainstream service model for artificial intelligence platforms: an AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI theme mall: all developers can access one or more of the platform's artificial intelligence services through an API, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services.
The scheme provided by the embodiments of the present application involves technologies such as artificial intelligence speech recognition, which are explained by the following embodiments.
Fig. 1 is a block diagram of a voice recognition system according to an embodiment of the inventive concept. Referring to fig. 1, the speech recognition system may include a client 200 and a speech recognition apparatus 100. However, this is only a preferred embodiment for achieving the object of the inventive concept, and some structural elements may of course be added or deleted as needed. For example, although only the voice recognition apparatus 100 is illustrated in fig. 1, according to the embodiment, a data processing apparatus may also be included, which performs data processing on the voice recognition result of the voice recognition apparatus 100 to obtain response data. Further, each constituent element of the speech recognition system shown in fig. 1 represents a functional element distinguished by function, and at least some of the constituent elements may be combined with each other in an actual physical environment. For example, the voice recognition apparatus 100 and the data processing apparatus may be built into the same server or server cluster, or may belong to different servers or server clusters.
In the voice recognition system, the client 200 is a terminal that receives a voice signal input by a user and provides response data 30 returned by the voice recognition apparatus 100, and the response data 30 may be a voice recognition result in the case of not including a data processing apparatus, and may be a data processing result for the voice recognition result in the case of including a data processing apparatus. In fig. 1, although the client 200 is illustrated as a smartphone, it may be implemented as any device, such as a smart speaker, a wearable smart device, or the like.
In the speech recognition system, the speech recognition apparatus 100 is a computing apparatus that takes speech data 10 as input and provides a recognition result. Here, speech data is meant in a general sense, including a wave file representing a speech signal in waveform, a spectrogram representing the wave file in the frequency domain, Mel-Frequency Cepstral Coefficients (MFCC), and the like. The computing apparatus may be a notebook, desktop, laptop, or smartphone, but is not limited thereto and may include any kind of device having an arithmetic unit.
According to an embodiment of the inventive concept, in order to provide end-to-end speech recognition, the speech recognition apparatus 100 may construct an acoustic model composed of a deep neural network and provide a recognition result of the speech data 10 using the constructed acoustic model. Here, the deep neural network may be, for example, a Recurrent Neural Network (RNN), a Bidirectional Recurrent Neural Network (BRNN), a Long Short-Term Memory (LSTM), a Bidirectional Long Short-Term Memory (BLSTM), a Gated Recurrent Unit (GRU), or a Bidirectional Gated Recurrent Unit (BGRU), but is not limited thereto.
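As a concrete illustration of such an acoustic model, the following PyTorch sketch builds a small bidirectional LSTM that maps acoustic feature frames to per-frame label scores. The layer sizes, feature dimension, and label count are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Minimal BLSTM acoustic model sketch: feature frames -> per-frame label scores."""
    def __init__(self, n_features: int = 40, hidden: int = 256, n_labels: int = 100):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_labels)  # 2x: forward + backward directions

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features) -> scores: (batch, time, n_labels)
        out, _ = self.blstm(frames)
        return self.proj(out)

model = BLSTMAcousticModel()
scores = model(torch.randn(2, 120, 40))  # two utterances of 120 feature frames each
```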
In addition, according to an embodiment of the present inventive concept, while the client 200 receives the voice of the voice object, the client uploads target voice to the voice recognition apparatus 100 before the duration of the tail end silence in the voice reaches the second preset time. The voice recognition apparatus 100 determines a to-be-processed voice from the received target voice and preprocesses it to obtain a preprocessing result. When the duration of the tail end silence reaches the second preset time, the client 200 sends a voice recognition request to the voice recognition apparatus 100, and the voice recognition apparatus 100 determines response data to the request according to the preprocessing result. Because the client 200 uploads audio before the voice has completely ended (the voice is deemed completely ended when the tail end silence reaches the second preset time), the voice recognition apparatus can process the voice in advance, and response data can be obtained immediately after the voice ends, improving the end-to-end response speed. A detailed description is given later with reference to fig. 2 to 10.
The speech recognition system related to the embodiment of the invention can be a distributed system formed by connecting a client, a plurality of nodes (any type of computing equipment in an access network, such as a server and a user terminal) through a network communication mode.
Taking a blockchain system as an example of a distributed system, refer to fig. 15, which is an optional structural diagram of the distributed system 1510 applied to a blockchain system according to an embodiment of the present invention. The system is formed by a plurality of nodes 1520 (computing devices of any form in the access network, such as servers and user terminals) and clients 1530. A peer-to-peer (P2P) network is formed between the nodes; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join and become a node; a node comprises a hardware layer, a middle layer, an operating system layer, and an application layer.
Referring to the functions of each node in the blockchain system shown in fig. 15, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) Application: deployed in the blockchain to implement specific services according to actual business requirements. It records data related to those functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
For example, the services implemented by the application include:
2.1) Wallet: provides electronic money transaction functions, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as an acknowledgement that the transaction is valid). Of course, the wallet also supports querying the electronic money remaining at an electronic money address.
2.2) Shared ledger: provides functions for storing, querying, and modifying account data. Record data of operations on the account data are sent to other nodes in the blockchain system; after the other nodes verify their validity, the record data are stored in a temporary block as an acknowledgement that the account data are valid, and a confirmation may be sent to the node that initiated the operation.
2.3) Smart contract: a computerized agreement that can enforce the terms of a contract, implemented as code deployed on the shared ledger and executed when certain conditions are met, completing transactions automatically according to actual business requirements; for example, querying the logistics status of goods purchased by a buyer, or transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods. Of course, smart contracts are not limited to contracts for trading; they may also process received information.
3) Blockchain: a series of blocks (Blocks) connected to each other in chronological order of generation. New blocks cannot be removed once added to the blockchain, and the blocks record the record data submitted by nodes in the blockchain system.
Referring to fig. 16, fig. 16 is an optional schematic diagram of a Block Structure (Block Structure) according to an embodiment of the present invention, where each Block includes a hash value of a transaction record stored in the Block (hash value of the Block) and a hash value of a previous Block, and the blocks are connected by the hash values to form a Block chain. The block may include information such as a time stamp at the time of block generation. A block chain (Blockchain), which is essentially a decentralized database, is a string of data blocks associated by using cryptography, and each data block contains related information for verifying the validity (anti-counterfeiting) of the information and generating a next block.
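The hash linkage described above can be illustrated with a minimal sketch; the field names here are assumptions for illustration, and a real blockchain node of course stores far more:

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    """Hash a block's contents, excluding its own hash field."""
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def new_block(records: list, prev: dict | None) -> dict:
    block = {
        "timestamp": time.time(),                    # time stamp at block generation
        "records": records,                          # record data submitted by nodes
        "prev_hash": prev["hash"] if prev else "0",  # hash value of the previous block
    }
    block["hash"] = block_hash(block)
    return block

genesis = new_block(["genesis"], None)
second = new_block(["tx: A->B"], genesis)
# Tampering with `genesis` would change its hash and break `second`'s link:
assert second["prev_hash"] == block_hash(genesis)
```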
One embodiment of the speech recognition method of the present invention is described below. Fig. 2 is a flow chart of one embodiment of the speech recognition method of the present invention. This specification provides the method steps as described in the embodiments or flow charts, but more or fewer steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one; in practice, the steps may be executed sequentially or in parallel (e.g., on parallel processors or in a multi-threaded environment) according to the methods described in the embodiments or figures. Specifically, as shown in fig. 2, the speech recognition method may be executed by a client, and includes:
s201: receiving voice of a voice object, wherein the voice comprises at least one unit voice, and each unit voice comprises an instruction voice and tail end silence after the instruction voice is finished.
The voice object may be a client user, and the voice refers to the speech spoken by that user. Here, the spoken voice includes not only the speech content itself but also a silent section after the speech content ends; the end of the user's speech is determined only after that silent section has lasted for a period of time. While speaking, the user may pause for several seconds after one utterance and then continue with the next, so the user's voice can be divided into at least one unit voice. Each unit voice comprises a section of speech content and the silent part after the last word of that section, where the speech content is the instruction voice and the silent part is the tail end silence.
S203: and determining the current unit voice according to the time sequence of voice receiving.
Specifically, the words spoken by the user gradually increase with the passage of time, so that the speech received by the client has a time sequence. In one possible embodiment, determining the current unit of speech based on the timing of speech reception includes: in the process of receiving the voice, determining the unit voice to which the currently received voice belongs, and taking the determined unit voice as the current unit voice.
S205: for the current unit of speech, performing the following data processing operations:
when the duration of the tail end silence of the current unit voice reaches a first preset time, determining a target voice based on the received voice, and transmitting the target voice to a voice recognition server; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
when the duration of the tail end silence reaches a second preset time, sending a voice recognition request to the voice recognition server, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result.
According to the concept of the embodiment of the invention, the received voice is processed in advance, before the voice has completely ended. Concretely, for each received unit voice, the duration of its tail end silence is judged, and a corresponding processing strategy is executed according to the judgment result.
Fig. 3 is a flowchart illustrating an embodiment of performing a data processing operation on a current speech unit according to the present invention, and referring to fig. 3, the data processing operation includes:
s301: and acquiring the duration of the tail end silence of the current unit voice.
S303: and judging whether the duration of the tail end silence reaches a first preset time, if so, executing the step S309, and if not, executing the step S305.
S305: and judging whether other unit voice is received or not, if not, continuously receiving the voice and returning to execute the step S301, and if so, executing the step S307.
The method specifically comprises: judging whether another unit voice is received when the duration of the tail end silence exceeds a third preset time but has not reached the first preset time, wherein the third preset time is less than the first preset time.
S307: and continuously receiving the voices of the other units until the duration of the tail end silence of the received voices of the other units reaches the first preset time, taking the voices of the other units meeting the requirement that the duration of the tail end silence reaches the first preset time as new current unit voices, and returning to execute the step S301 for the new current unit voices.
S309: determining the time corresponding to the time when the duration of the tail end silence reaches a first preset time as a first time point, acquiring the voice which is not uploaded to a voice recognition server before the first time point, determining the starting time of the first unit voice in the acquired voice as a second time point, determining a target voice according to the voice between the second time point and the first time point, and uploading the target voice to the voice recognition server. The process continues to step S311.
S311: judging whether other unit voices are received when the duration of the tail end silence exceeds a first preset time and does not reach a second preset time, executing step S313 if the other unit voices are received when the duration of the tail end silence exceeds the first preset time and does not reach the second preset time, and executing step S315 if the other unit voices are not received when the duration of the tail end silence exceeds the first preset time and does not reach the second preset time.
S313: determining that the corresponding time when the tail end silence reaches a second preset time is a second time point and the starting time of the first unit voice after the tail end silence is a third time point; determining a target voice according to the voice from the second time point to the third time point; uploading the target voice to a voice recognition server; and, taking the first unit voice with the mute tail end as the new current unit voice, and returning to execute the step S301 for the new current unit voice.
S315: and when the duration of the tail end silence reaches a second preset time, sending a voice recognition request to a voice recognition server.
It should be noted that the second preset time is later than the first preset time, and the third preset time is earlier than the first preset time. In particular, the third preset time may be 0 ms, corresponding to the start time of the tail end silence. The second preset time may be the same as the maximum tail end silence duration used in the prior art to determine that the voice has completely ended; for example, if in the prior art the voice is determined to have completely ended when the tail end silence lasts 500 ms, the second preset time may be set to 500 ms. The first preset time lies between the third preset time and the second preset time, with a value range of 50 ms to 300 ms, preferably 100 ms.
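A minimal sketch of this client-side control strategy follows, assuming the preferred values above (first preset time 100 ms, second preset time 500 ms, 10 ms audio frames) and a hypothetical `server` transport object; none of these names come from an actual API:

```python
FIRST_PRESET_MS = 100    # tail end silence: trigger early upload of the target voice
SECOND_PRESET_MS = 500   # tail end silence: the voice has completely ended

class ClientUploader:
    """Sketch of the two-threshold strategy of steps S301-S315."""
    def __init__(self, server):
        self.server = server
        self.pending = bytearray()   # received audio not yet uploaded
        self.silence_ms = 0          # duration of the current tail end silence

    def on_frame(self, frame: bytes, is_silence: bool, frame_ms: int = 10):
        self.pending += frame
        self.silence_ms = self.silence_ms + frame_ms if is_silence else 0

        if self.silence_ms == FIRST_PRESET_MS:
            # S309: first preset time reached -> upload everything buffered so far
            self.server.upload_target_speech(bytes(self.pending))
            self.pending.clear()
        elif self.silence_ms == SECOND_PRESET_MS:
            # S315: second preset time reached -> request the recognition result
            return self.server.send_recognition_request()
        # If speech resumes before 500 ms (S307/S313), silence_ms resets and the
        # new unit voice accumulates in `pending` until its own 100 ms threshold.
        return None
```

Note that the silence between 100 ms and 500 ms is never uploaded, which is why the audio transmitted to the server shrinks by up to 400 ms in the analysis of figs. 9 and 10 below.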
S207: and receiving the response data returned by the voice recognition server.
The voice recognition server determines response data to the voice recognition request according to the preprocessing result. In one possible embodiment, the voice recognition server takes the preprocessing result as the voice recognition result of the voice uploaded by the client. It can directly feed this recognition result back to the client as the response data to the voice recognition request, or transmit it to a data processing server; in the latter case, the data processing server obtains corresponding processing data based on the recognition result, takes the obtained processing data as the response data to the voice recognition request, and feeds it back to the client.
According to the above embodiment, while the voice of the voice object is being received, once the tail end silence in the voice reaches the first preset time, the target voice is obtained and uploaded to the voice recognition server, which recognizes the target voice in advance, before the voice has completely ended, and obtains a preprocessing result. When the tail end silence reaches the second preset time, i.e., the voice has completely ended, voice recognition is requested from the server, which can quickly determine and return the recognition result from the preprocessing result. The client can therefore obtain the recognition result immediately upon confirming that the voice has completely ended, which shortens the waiting time, improves the end-to-end response speed, and improves the user's experience of voice operation. The voice recognition method of this embodiment can be applied to voice translation or smart speaker control.
Another embodiment of the speech recognition method of the present invention is described below, with a speech recognition server as the execution subject. Fig. 4 is a schematic flow chart of this embodiment. This specification provides the method steps as described in the embodiment or flow chart, but more or fewer steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiment is merely one of many possible execution orders and does not represent the only one; in actual implementation, the system or client product may execute the steps sequentially or in parallel (e.g., on parallel processors or in a multi-threaded environment) according to the embodiments or the methods shown in the figures. Specifically, as shown in fig. 4, the method may include:
s401: receiving target voice uploaded by a client, wherein the target voice is determined according to the received voice when the duration of tail end silence of the current unit voice in the voice reaches a first preset time in the process that the client receives the voice.
The voice object provides voice through the client, the voice comprising at least one unit voice, each unit voice comprising an instruction voice and the tail end silence after the instruction voice ends. The client determines the current unit voice according to the time sequence of voice reception and performs data processing on the current unit voices one by one to determine the target voice corresponding to the current moment. The client's data processing of the current unit voice comprises the following:
(1) when the duration of the tail end silence of the current unit voice reaches a first preset time, determining a target voice based on the received voice, and transmitting the target voice to a voice recognition server; and the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result.
(2) When the duration of tail end silence of the current unit voice reaches a first preset time, determining that the time corresponding to the duration of the tail end silence reaching the first preset time is a first time point, acquiring the voice which is not uploaded to the voice recognition server before the first time point, determining the starting time of the first unit voice in the acquired voice as a second time point, determining a target voice according to the voice between the second time point and the first time point, and uploading the target voice to the voice recognition server.
(3) And if the other unit voices are received when the duration of the tail end silence of the current unit voice exceeds the third preset time but does not reach the first preset time, the other unit voices are continuously received until the duration of the tail end silence of the received other unit voices reaches the first preset time, and the other unit voices, of which the duration of the tail end silence reaches the first preset time, are taken as new current unit voices.
It should be noted that the flow of the client performing the data processing operation on the current unit voice is the same as that in the embodiment corresponding to fig. 3, and details may refer to the embodiment corresponding to fig. 3, which is not described herein again.
S403: and preprocessing the received target voice to obtain a preprocessing result.
Fig. 5 is a flowchart illustrating an embodiment of preprocessing a target speech according to the present invention. Referring to fig. 5, preprocessing the target voice to obtain a preprocessing result includes:
s501: and determining the voice to be processed according to the receiving time sequence of the target voice.
Specifically, the voice output by the voice object to the client has a time sequence; the client processes the current unit voice in that order to obtain target voices, and uploads each target voice to the voice recognition server as soon as it is obtained, so the target voices received by the server also have a time sequence. For any voice processing task of any client, the voice recognition server determines the voice to be processed according to the order in which the target voices arrive, where the voice to be processed is the sum of the currently received target voice and all target voices received before it. For example, suppose the voice recognition server receives three target voices, arranged in time order as a first, second, and third target voice: when the first target voice is received, the corresponding voice to be processed is the first target voice; when the second is received, the voice to be processed consists of the first and second target voices; and when the third is received, it consists of the first, second, and third target voices.
The voice recognition server determines the voice to be processed once every time the voice recognition server receives one target voice, and the number of the voice to be processed is the same as that of the target voice.
S503: and performing voice recognition on each voice to be processed to obtain a voice recognition result corresponding to each voice to be processed.
In one possible embodiment, an Automatic Speech Recognition (ASR) technique may be used to convert each to-be-processed speech into a text, and obtain a speech recognition result corresponding to the to-be-processed speech.
ASR is a technique for converting human speech into text, the basic principles of which include:
training (Training): and analyzing the voice characteristic parameters in advance, making a voice template and storing the voice template in a voice parameter library.
Identification (Recognition): and analyzing the voice to be recognized in the same way as during training to obtain voice parameters. Comparing it with reference templates in parameter library one by one, and finding out the template closest to speech characteristics by decision method to obtain recognition result.
Distortion measure (Distortion Measures): there is a criterion in making the comparison, which is to measure the "distortion measure" between the speech feature parameter vectors.
The main recognition framework is as follows: dynamic Time Warping (DTW) based on pattern matching and Hidden Markov Models (HMM) based on statistical models.
Because ASR is a current mature speech recognition technology and the speech recognition process is not the focus of the embodiments of the present invention, ASR will not be described here.
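Although ASR itself is outside the scope of the embodiments, a toy sketch of the DTW framework named above may help fix ideas; the sequences and cost function are purely illustrative:

```python
def dtw_distance(a: list[float], b: list[float]) -> float:
    """Toy dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distortion measure
            d[i][j] = cost + min(d[i - 1][j],        # stretch the template
                                 d[i][j - 1],        # stretch the utterance
                                 d[i - 1][j - 1])    # advance both
    return d[n][m]

# A time-stretched utterance still aligns closely with its template:
print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 3, 3, 2, 1]))  # 0.0
```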
S505: and according to the determination time of the voice to be processed, taking the voice recognition result corresponding to the newly determined voice to be processed as the preprocessing result.
S405: and receiving a voice recognition request sent by a client, wherein the voice recognition request is generated when the duration of the tail end silence reaches the second preset time.
S407: and determining response data to the voice recognition request according to the preprocessing result.
In one possible embodiment, the voice recognition server takes the preprocessing result as the voice recognition result of the voice uploaded by the client. It can directly feed this recognition result back to the client as the response data to the voice recognition request, or transmit it to a data processing server; in the latter case, the data processing server obtains corresponding processing data based on the recognition result, takes the obtained processing data as the response data to the voice recognition request, and feeds it back to the client.
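A minimal sketch of the server-side flow above (accumulating target voices in arrival order, recognizing each resulting voice to be processed, and answering the recognition request from the latest result) follows; `asr_decode` is a stand-in for whatever ASR engine is used, a hypothetical function rather than a specific library call:

```python
class PreprocessingSession:
    """Accumulates target voices in arrival order and keeps the recognition
    result of the most recently determined voice to be processed."""
    def __init__(self, asr_decode):
        self.asr_decode = asr_decode        # e.g. a function: bytes -> str
        self.to_be_processed = bytearray()  # sum of all target voices so far
        self.preprocessing_result = None    # latest recognition result

    def on_target_voice(self, segment: bytes):
        # S501: the new voice to be processed is all target voices received so far
        self.to_be_processed += segment
        # S503/S505: recognize it; the newest result becomes the preprocessing result
        self.preprocessing_result = self.asr_decode(bytes(self.to_be_processed))

    def on_recognition_request(self) -> str:
        # The response data is determined directly from the ready preprocessing result
        return self.preprocessing_result
```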
S409: and sending the response data to the client.
In the above embodiment, while the voice of the voice object is being received, once the tail end silence in the voice reaches the first preset time, the target voice is obtained and uploaded to the voice recognition server, so that the server recognizes the target voice in advance, before the voice has completely ended, and obtains a preprocessing result. When the tail end silence reaches the second preset time, i.e., the voice has completely ended, voice recognition is requested from the server, which can quickly determine and return the recognition result from the preprocessing result, so the client can obtain the recognition result immediately once the voice has completely ended. The method can be used for voice translation or smart speaker control, reduces the time the client waits for the server's data processing result, improves the end-to-end response speed, and improves the user's experience of voice operation.
Fig. 6 is a schematic diagram of an application scenario of the speech recognition method provided by the present invention. Please refer to fig. 6, which illustrates voice interaction with an AI smart device: a person, as the voice object, interacts with the device in the oneshot manner, and a cloud server with cloud computing capability acts as the voice recognition server to carry out the data processing. Oneshot refers to a recognition mode in which a person speaks the wake-up word and the command to the smart device in one continuous utterance. In the cloud speech recognition mode, four segments of audio data are normally sent to the speech recognition service: wake-up audio data, VAD silent front-end data, audio data during speaking, and silent tail-end audio data after speaking. After receiving the wake-up audio data, the smart device wakes up the voice recognition component and controls the media playing component to pause the current content; it continues to receive and recognize the speaker's voice command, keeping the media playing component paused throughout, and after the voice command is recognized it returns response data to the voice object and plays the updated content. In fig. 6, "jingle" is the wake-up audio data, "change the song" is the audio data during speaking, and "good, start playing …" below is the response data of the voice recognition server.
FIG. 7 is a schematic diagram of a speech recognition interaction using a generic speech recognition method. Referring to fig. 7, a generic AI smart device transmits one complete piece of audio data to the cloud: only when the silence after the end of speaking has lasted 500 ms and the voice is determined to have completely ended are the wake-up word, the VAD forward-padded audio data, the audio data during speaking, and the 500 ms of silence uploaded together, as complete audio data, to the voice recognition server through the network, and the voice recognition server performs audio decoding using the data processing method shown in fig. 8.
FIG. 8 is a schematic diagram of the data processing logic when performing speech recognition interactions using a generic speech recognition method. Referring to fig. 8, after receiving the complete audio data uploaded by the smart device, the speech recognition server performs audio decoding using ASR to obtain the corresponding recognition result. The decoder takes about 200 ms, and even ignoring the time consumed by network transmission, this decoding time is felt at the smart device: after uploading the complete audio data, the device must wait at least 200 ms before it can respond to the voice object. For the voice object, the cost after finishing speaking is therefore 500 ms of silence plus 200 ms of processing time, and the interaction experience is poor.
To address these defects of the prior art, the embodiment of the present invention keeps the existing framework for judging the end of voice but, through a designed control policy, changes when and how audio data is uploaded to the voice recognition server. The server thus has the preprocessing result data of the voice ready before the voice has completely ended and can issue the recognition result the moment the voice ends, improving the end-to-end data response. This is analyzed below with reference to figs. 9 and 10.
FIG. 9 is a schematic diagram of a speech recognition interaction using the speech recognition method provided by the present invention; FIG. 10 is a schematic diagram of the corresponding data processing logic. As shown in figs. 9 and 10, the original rule that 500 ms of silence after speaking marks the complete end of the voice is kept. When the silence reaches 100 ms (corresponding to the first preset time), the 100 ms of silent audio together with the audio not yet uploaded to the voice recognition server is determined as the target voice and uploaded to the server through the network module, so the server can start pre-decoding before the voice has completely ended. Taking pre-decoding to cost 200 ms, the server finishes pre-decoding when the silence reaches 300 ms; when the silence reaches 500 ms, the server can directly pull the pre-decoded result and issue it to the smart device. In this way, the smart device can respond to the voice object the moment the voice completely ends, saving the voice object 200 ms of waiting. In addition, because the audio data transmitted to the server is 400 ms shorter, both the network transmission time and the server's pre-decoding time are reduced to some extent.
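For concreteness, this control policy can be summarized in the following Python sketch. The VAD frame interface and the uploader object with its send_target_voice and request_recognition methods are assumptions made for illustration; the 100 ms and 500 ms thresholds are taken from the example above:

    FIRST_PRESET_MS = 100   # first preset time: early-upload threshold
    SECOND_PRESET_MS = 500  # second preset time: complete end-of-voice threshold

    def run_client(frames, uploader):
        """frames yields (chunk, is_silence, duration_ms) triples from a VAD;
        uploader is an assumed wrapper around the network module."""
        pending = []            # audio not yet uploaded to the recognition server
        silence_ms = 0
        uploaded_early = False
        for chunk, is_silence, duration_ms in frames:
            pending.append(chunk)
            if is_silence:
                silence_ms += duration_ms
            else:
                silence_ms, uploaded_early = 0, False  # new speech resets the timers
            if silence_ms >= FIRST_PRESET_MS and not uploaded_early:
                # target voice: everything not yet uploaded, including the
                # silence so far; the server can begin pre-decoding while
                # the tail end silence is still being timed
                uploader.send_target_voice(pending)
                pending, uploaded_early = [], True
            if silence_ms >= SECOND_PRESET_MS:
                # the voice has completely ended; the server answers from its
                # pre-decoded result, so no post-utterance decode wait remains
                return uploader.request_recognition()

Under the timings in the example, pre-decoding starts at 100 ms of silence and, costing about 200 ms, finishes at roughly 300 ms, well before the 500 ms end-of-voice decision; this is where the 200 ms saving comes from.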
An embodiment of the present invention further provides a speech recognition apparatus, which may be disposed in a client. Fig. 11 is a schematic structural diagram of an embodiment of the speech recognition apparatus provided by the present invention. Referring to fig. 11, the apparatus may include:
a voice receiving module 111, configured to receive a voice of a voice object, where the voice includes at least one unit voice, and each unit voice includes an instruction voice and a tail end silence after the instruction voice ends;
a current unit voice determining module 113, configured to determine a current unit voice according to a timing of voice reception;
a data processing module 115, configured to perform a data processing operation on the current unit voice;
the data processing module 115 includes a target voice transmitting unit 1151 and a voice recognition request transmitting unit 1153;
the target voice sending unit 1151 is configured to determine a target voice based on the received voice when the duration of the tail end silence of the current unit voice reaches a first preset time, and transmit the target voice to a voice recognition server; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
the voice recognition request sending unit 1153 is configured to send a voice recognition request to the voice recognition server when the duration of the tail end silence reaches the second preset time, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result;
a response data receiving module 117, configured to receive the response data returned by the speech recognition server.
Further, the target voice sending unit 1151 is further configured to: when the duration of the tail end silence of the current unit voice reaches the first preset time, take the moment at which the tail end silence reaches the first preset time as a first time point; acquire the voice not yet uploaded to the voice recognition server before the first time point; take the start time of the first unit voice in the acquired voice as a second time point; determine the target voice from the voice between the second time point and the first time point; and upload the target voice to the voice recognition server.
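A minimal sketch of this time-point bookkeeping, assuming the client keeps a buffer of (timestamp, chunk) pairs, might look as follows:

    def determine_target_voice(received, last_uploaded_ts, first_time_point):
        """Hypothetical helper; received is a chronologically ordered list of
        (timestamp_ms, chunk) pairs, and audio at or before last_uploaded_ts
        has already been sent to the voice recognition server."""
        # voice not yet uploaded before the first time point
        unsent = [(ts, chunk) for ts, chunk in received
                  if last_uploaded_ts < ts <= first_time_point]
        if not unsent:
            return None, []
        # second time point: start time of the first unit voice in that span
        second_time_point = unsent[0][0]
        # target voice: the audio between the second and first time points
        return second_time_point, [chunk for _, chunk in unsent]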
Further, the data processing module 115 further includes a first current unit voice determining unit and a second current unit voice determining unit, wherein:
the first current unit voice determining unit is configured to, when other unit voices are received while the duration of the tail end silence of the current unit voice exceeds a third preset time but has not reached the first preset time, continue receiving the other unit voices until the duration of the tail end silence of a received other unit voice reaches the first preset time, and to take that other unit voice as the new current unit voice.
The second current unit voice determining unit is configured to, when other unit voices are received while the duration of the tail end silence of the current unit voice exceeds the first preset time but has not reached the second preset time, take the moment corresponding to the tail end silence reaching the second preset time as a second time point and the start time of the first unit voice after the tail end silence as a third time point; determine a target voice from the voice between the second time point and the third time point; upload the target voice to the voice recognition server; and take the first unit voice after the tail end silence as the new current unit voice.
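These two edge cases can be pictured as a small dispatch on the silence duration. The sketch below is illustrative only; the helper names and the value of the third preset time are assumptions, since the embodiment does not fix them here:

    FIRST_PRESET_MS = 100
    SECOND_PRESET_MS = 500
    THIRD_PRESET_MS = 40    # assumed value for the third preset time

    def on_other_unit_voice(silence_ms, pending_audio, upload):
        """Hypothetical dispatch when another unit voice arrives while the
        current unit voice's tail end silence is still being timed;
        upload is a caller-supplied function."""
        if THIRD_PRESET_MS < silence_ms < FIRST_PRESET_MS:
            # case 1: too early to cut a target voice; keep receiving, and the
            # later unit voice whose tail end silence reaches the first preset
            # time becomes the new current unit voice
            return "keep-receiving"
        if FIRST_PRESET_MS <= silence_ms < SECOND_PRESET_MS:
            # case 2: the span between the boundary time points is closed off
            # as a target voice and uploaded; the newly arrived unit voice
            # becomes the new current unit voice
            upload(pending_audio)
            return "uploaded-and-restarted"
        return "voice-ended"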
The speech recognition apparatus in the above apparatus embodiment is based on the same inventive concept as the method embodiments corresponding to figs. 1-3.
An embodiment of the present invention further provides a speech recognition apparatus, which may be disposed in a voice recognition server. Fig. 12 is a schematic structural diagram of another embodiment of the speech recognition apparatus provided by the present invention. Referring to fig. 12, the apparatus may include:
the target voice receiving module 121, configured to receive a target voice uploaded by a client, where the target voice is determined according to the received voice when, during the client's reception of the voice, the duration of the tail end silence of the current unit voice in the voice reaches a first preset time;
the preprocessing module 123 is configured to preprocess the received target voice to obtain a preprocessing result;
a speech recognition request receiving module 125, configured to receive a speech recognition request sent by a client, where the speech recognition request is generated by the client when the duration of the tail end silence reaches the second preset time;
a voice recognition request processing module 127, configured to determine response data to the voice recognition request according to the preprocessing result;
a response data sending module 129, configured to send the response data to the client.
Further, the preprocessing module 123 is further configured to: determine voices to be processed according to the time sequence in which target voices are received, and perform voice recognition on each voice to be processed to obtain a voice recognition result corresponding to each voice to be processed; and, according to the determination times of the voices to be processed, take the voice recognition result corresponding to the most recently determined voice to be processed as the preprocessing result.
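On the server side, this "keep the most recent result" behavior can be sketched as below; the asr_decode callable is a stand-in, assumed for illustration, for whatever decoder the server actually uses:

    class Preprocessor:
        """Hypothetical server-side cache of pre-decoded results."""

        def __init__(self, asr_decode):
            self.asr_decode = asr_decode   # assumed ASR decoding callable
            self.latest_result = None

        def on_target_voice(self, audio):
            # each target voice is recognized in arrival order, before the
            # voice has completely ended on the client side
            self.latest_result = self.asr_decode(audio)

        def on_recognition_request(self):
            # the most recently determined result is the preprocessing result
            # returned as response data to the voice recognition request
            return self.latest_result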
The speech recognition apparatus in the above apparatus embodiment is based on the same inventive concept as the method embodiments corresponding to figs. 4-5.
An embodiment of the present invention provides an electronic device, which includes a processor and a memory, where at least one instruction and at least one program are stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the speech recognition method corresponding to fig. 1 to 3.
The memory may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to use of the apparatus, and the like. Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The embodiment of the present invention further provides a schematic structural diagram of a client, as shown in fig. 13; the client may be used to implement the speech recognition method provided in the foregoing embodiments. Specifically:
the client may include components such as RF (Radio Frequency) circuitry 1310, memory 1320 including one or more computer-readable storage media, input unit 1330, display unit 1340, sensors 1350, audio circuitry 1360, WiFi (wireless fidelity) module 1370, processor 1380 including one or more processing cores, and power supply 1390. Those skilled in the art will appreciate that the client architecture shown in fig. 13 does not constitute a limitation on the client, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components. Wherein:
The RF circuit 1310 may be used for receiving and transmitting signals during message transmission or a call; in particular, it receives downlink information from a base station and hands it to the one or more processors 1380 for processing, and it transmits uplink data to the base station. In general, the RF circuit 1310 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, the RF circuit 1310 may also communicate with networks and other clients via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.
The memory 1320 may be used to store software programs and modules, and the processor 1380 executes various functional applications and data processing by running the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to the use of the client, and the like. Further, the memory 1320 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 1320 may also include a memory controller to provide the processor 1380 and the input unit 1330 with access to the memory 1320.
The input unit 1330 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In particular, the input unit 1330 may include a touch-sensitive surface 1331 as well as other input devices 1332. The touch-sensitive surface 1331, also referred to as a touch display screen or touch pad, may collect touch operations by a user on or near it (such as operations performed with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch-sensitive surface 1331 may comprise two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 1380, and it can also receive and execute commands sent by the processor 1380. The touch-sensitive surface 1331 may be implemented as a resistive, capacitive, infrared, or surface acoustic wave type. In addition to the touch-sensitive surface 1331, the input unit 1330 may include other input devices 1332, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1340 may be used to display information input by or provided to the user and various graphical user interfaces of the client, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 1340 may include a display panel 1341, and optionally, the display panel 1341 may be configured in the form of an LCD (Liquid crystal display), an OLED (Organic Light-Emitting Diode), or the like. Further, touch-sensitive surface 1331 may overlay display panel 1341 and, upon detecting a touch operation on or near touch-sensitive surface 1331, communicate to processor 1380 to determine the type of touch event, and processor 1380 then provides a corresponding visual output on display panel 1341 based on the type of touch event. Touch-sensitive surface 1331 and display panel 1341 may be two separate components to implement input and output functions, although touch-sensitive surface 1331 may be integrated with display panel 1341 to implement input and output functions in some embodiments.
The client may also include at least one sensor 1350, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 1341 according to the ambient light, and a proximity sensor, which turns off the display panel 1341 and/or the backlight when the client is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and, when the device is stationary, the magnitude and direction of gravity; it can be used for applications that recognize the client's attitude (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection). Other sensors that may further be configured on the client, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described in detail here.
The audio circuit 1360, a speaker 1361, and a microphone 1362 may provide an audio interface between the user and the client. The audio circuit 1360 may transmit the electrical signal converted from received audio data to the speaker 1361, where it is converted into a sound signal and output; conversely, the microphone 1362 converts a collected sound signal into an electrical signal, which the audio circuit 1360 receives and converts into audio data. After the audio data is processed by the processor 1380, it is sent through the RF circuit 1310 to, for example, another client, or output to the memory 1320 for further processing. The audio circuit 1360 may also include an earphone jack to allow peripheral earphones to communicate with the client.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1370, the client can help the user send and receive e-mail, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 13 shows the WiFi module 1370, it is understood that it is not an essential part of the client and may be omitted as needed without changing the essence of the invention.
The processor 1380 is the control center of the client. It connects the various parts of the entire client using various interfaces and lines, and performs the client's functions and processes data by running or executing the software programs and/or modules stored in the memory 1320 and calling the data stored in the memory 1320, thereby monitoring the client as a whole. Optionally, the processor 1380 may include one or more processing cores; preferably, the processor 1380 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 1380.
The client also includes a power supply 1390 (such as a battery) that powers the various components. Preferably, the power supply is logically connected to the processor 1380 through a power management system, which manages charging, discharging, and power consumption. The power supply 1390 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
Although not shown, the client may further include a camera, a Bluetooth module, and the like, which are not described here. Specifically, in this embodiment, the display unit of the client is a touch screen display, and the client further includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method embodiments of the present invention.
An embodiment of the present invention further provides a storage medium, which may be disposed in the client and stores at least one instruction or at least one program for implementing the speech recognition method of the method embodiments; the at least one instruction or the at least one program is loaded and executed by the processor to implement the speech recognition method provided by the method embodiments.
Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network clients of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
With the voice recognition method, apparatus, and storage medium provided by the present invention, during reception of the voice of the voice object, when the tail end silence in the voice reaches the first preset time, the target voice is obtained and uploaded to the voice recognition server, which performs voice recognition on it before the voice has completely ended and obtains a preprocessing result. When the tail end silence reaches the second preset time, that is, when the voice has completely ended, voice recognition is requested from the server, which can quickly determine and issue the recognition result from the preprocessing result. The client thus obtains the recognition result as soon as it confirms that the voice has completely ended, which shortens the time the client waits for the server's data processing result, raises the end-to-end response speed, and improves the user's voice-operation experience.
An embodiment of the present invention provides an electronic device, which includes a processor and a memory, where at least one instruction and at least one program are stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the speech recognition method corresponding to fig. 4 to 5.
The memory may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to use of the apparatus, and the like. Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
Referring to fig. 14, the server 1400 is configured to implement the voice recognition method provided in the foregoing embodiment, and may include the voice recognition apparatus. The server 1400 may vary considerably in configuration and performance and may include one or more central processing units (CPUs) 1410 (e.g., one or more processors), memory 1430, and one or more storage media 1420 (e.g., one or more mass storage devices) storing applications 1423 or data 1422. The memory 1430 and the storage medium 1420 may be transient or persistent storage. The program stored on the storage medium 1420 may include one or more modules, each of which may include a series of instruction operations on the server. Further, the central processor 1410 may be configured to communicate with the storage medium 1420 and execute, on the server 1400, the series of instruction operations in the storage medium 1420. The server 1400 may also include one or more power supplies 1460, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1440, and/or one or more operating systems 1421, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
Embodiments of the present invention also provide a storage medium, which may be disposed in a server and stores at least one instruction and at least one program for implementing the speech recognition method of the method embodiments; the at least one instruction and the at least one program are loaded and executed by the processor to implement the speech recognition method corresponding to figs. 4 to 5.
Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
As with the client-side embodiments, the voice recognition method, apparatus, and storage medium described above shorten the time the client waits for the server's data processing result, raise the end-to-end response speed, and improve the user's voice-operation experience.
It should be noted that the order of the above embodiments of the present invention is for description only and does not indicate their relative merits. Particular embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results; in some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in this specification are described progressively; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and server embodiments are described briefly because they are substantially similar to the method embodiments; for relevant points, refer to the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method of speech recognition, the method comprising:
receiving voice of a voice object, wherein the voice comprises at least one unit voice, and each unit voice comprises an instruction voice and tail end silence after the instruction voice is finished;
determining the current unit voice according to the time sequence of voice receiving;
for the current unit of speech, performing the following data processing operations:
when the duration of the tail end silence of the current unit voice reaches a first preset time, determining a target voice based on the received voice, and transmitting the target voice to a voice recognition server; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
when the duration of the tail end silence reaches a second preset time, sending a voice recognition request to the voice recognition server, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result;
and receiving the response data returned by the voice recognition server.
2. The method of claim 1, wherein, when the duration of the tail end silence of the current unit voice reaches the first preset time, determining a target voice based on the received voice and transmitting the target voice to the voice recognition server comprises:
when the duration of the tail end silence of the current unit voice reaches the first preset time, taking the moment at which the tail end silence reaches the first preset time as a first time point; acquiring the voice not yet uploaded to the voice recognition server before the first time point; taking the start time of the first unit voice in the acquired voice as a second time point; determining the target voice from the voice between the second time point and the first time point; and uploading the target voice to the voice recognition server.
3. The method of claim 1, wherein the data processing operations performed on the current unit of speech further comprise:
if other unit voices are received while the duration of the tail end silence of the current unit voice exceeds a third preset time but has not reached the first preset time, continuing to receive the other unit voices until the duration of the tail end silence of a received other unit voice reaches the first preset time, and taking that other unit voice as the new current unit voice.
4. The method of claim 1, wherein the data processing operations performed on the current unit of speech further comprise:
if other unit voices are received while the duration of the tail end silence of the current unit voice exceeds the first preset time but has not reached the second preset time, taking the moment corresponding to the tail end silence reaching the second preset time as a second time point and the start time of the first unit voice after the tail end silence as a third time point; determining a target voice from the voice between the second time point and the third time point; uploading the target voice to the voice recognition server; and
taking the first unit voice after the tail end silence as the new current unit voice.
5. A method of speech recognition, the method comprising:
receiving a target voice uploaded by a client, wherein the target voice is determined according to the received voice when, during the client's reception of the voice, the duration of the tail end silence of the current unit voice in the voice reaches a first preset time;
preprocessing the received target voice to obtain a preprocessing result;
receiving a voice recognition request sent by the client, wherein the voice recognition request is generated when the duration of the tail end silence reaches a second preset time;
determining response data to the voice recognition request according to the preprocessing result;
and sending the response data to the client.
6. The method of claim 5, wherein the preprocessing the received target voice to obtain a preprocessing result comprises:
determining voices to be processed according to the time sequence in which target voices are received, and performing voice recognition on each voice to be processed to obtain a voice recognition result corresponding to each voice to be processed; and
according to the determination times of the voices to be processed, taking the voice recognition result corresponding to the most recently determined voice to be processed as the preprocessing result.
7. A speech recognition apparatus, characterized in that the apparatus comprises:
the voice receiving module is used for receiving voice of a voice object, wherein the voice comprises at least one unit voice, and each unit voice comprises an instruction voice and tail end silence after the instruction voice is finished;
the current unit voice determining module is used for determining the current unit voice according to the time sequence of voice receiving;
the data processing module is used for executing data processing operation on the current unit voice;
the data processing module comprises a target voice sending unit and a voice recognition request sending unit;
the target voice sending unit is configured to, when the duration of the tail end silence of the current unit voice reaches a first preset time, determine a target voice based on the received voice and transmit the target voice to a voice recognition server; the voice recognition server is used for preprocessing the target voice to obtain a preprocessing result;
the voice recognition request sending unit is configured to send a voice recognition request to the voice recognition server when the duration of the tail end silence reaches a second preset time, so that the voice recognition server determines response data to the voice recognition request according to the preprocessing result;
and the response data receiving module is used for receiving the response data returned by the voice recognition server.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
the target voice receiving module is configured to receive a target voice uploaded by a client, wherein the target voice is determined according to the received voice when, during the client's reception of the voice, the duration of the tail end silence of the current unit voice in the voice reaches a first preset time;
the preprocessing module is used for preprocessing the received target voice to obtain a preprocessing result;
a voice recognition request receiving module, configured to receive a voice recognition request sent by the client, wherein the voice recognition request is generated by the client when the duration of the tail end silence reaches a second preset time;
the voice recognition request processing module is used for determining response data to the voice recognition request according to the preprocessing result;
and the response data sending module is used for sending the response data to the client.
9. An electronic device, comprising a processor and a memory, wherein at least one instruction and at least one program are stored in the memory, and wherein the at least one instruction or the at least one program is loaded and executed by the processor to implement the speech recognition method according to any of claims 1-4 or claims 5-6.
10. A computer storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the speech recognition method of any of claims 1-4 or claims 5-6.
CN202010111854.2A 2020-02-24 2020-02-24 Voice recognition method, device and storage medium Active CN111261161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010111854.2A CN111261161B (en) 2020-02-24 2020-02-24 Voice recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111261161A true CN111261161A (en) 2020-06-09
CN111261161B CN111261161B (en) 2021-12-14

Family ID: 70946217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010111854.2A Active CN111261161B (en) 2020-02-24 2020-02-24 Voice recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111261161B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785277A (en) * 2020-06-29 2020-10-16 北京捷通华声科技股份有限公司 Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN111899737A (en) * 2020-07-28 2020-11-06 上海喜日电子科技有限公司 Audio data processing method, device, server and storage medium
CN111898524A (en) * 2020-07-29 2020-11-06 江苏艾什顿科技有限公司 5G edge computing gateway and application thereof
CN111916072A (en) * 2020-06-16 2020-11-10 深圳追一科技有限公司 Question-answering method and device based on voice recognition, computer equipment and storage medium
CN112466302A (en) * 2020-11-23 2021-03-09 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN113053382A (en) * 2021-03-30 2021-06-29 联想(北京)有限公司 Processing method and device
CN113241071A (en) * 2021-05-10 2021-08-10 湖北亿咖通科技有限公司 Voice processing method, electronic equipment and storage medium
CN113516967A (en) * 2021-08-04 2021-10-19 青岛信芯微电子科技股份有限公司 Voice recognition method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139849A (en) * 2015-07-22 2015-12-09 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
US20190074029A1 (en) * 2017-09-01 2019-03-07 Samsung Sds Co., Ltd. Voice data processing apparatus and voice data processing method for avoiding voice delay
CN110517673A (en) * 2019-07-18 2019-11-29 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111261161B (en) 2021-12-14

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40024299
Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant