CN112837688A - Voice transcription method, device, related system and equipment - Google Patents

Voice transcription method, device, related system and equipment

Info

Publication number
CN112837688A
Authority
CN
China
Prior art keywords
voice
text sequence
information
determining
voice data
Prior art date
Legal status
Granted
Application number
CN201911159513.6A
Other languages
Chinese (zh)
Other versions
CN112837688B (en)
Inventor
陈梦喆
陈谦
李博
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority application: CN201911159513.6A
PCT application: PCT/CN2020/128950 (published as WO2021098637A1)
Publication of CN112837688A
Application granted
Publication of CN112837688B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/18: Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice recognition method and device, a related system, and electronic equipment. The method comprises the following steps: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and comprises punctuation information. With this processing mode, on the basis of determining punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is also used to predict punctuation; exploiting the acoustic features better captures the speaker's intent and yields punctuation that better matches spoken delivery. The punctuation recognition accuracy for voice text can therefore be effectively improved.

Description

Voice transcription method, device, related system and equipment
Technical Field
The application relates to the technical field of data processing, and in particular to a voice interaction system and method, a voice transcription system and method, a voice recognition method and device, a method and device for constructing a voice text punctuation prediction model, a voice processing method, an ordering device, an intelligent sound box, voice transcription equipment, and electronic equipment.
Background
A voice transcription system is a voice processing system that transcribes speech into text. Such a system can, for example, automatically produce a conference summary, thereby improving conference efficiency, avoiding waste of manpower, material and financial resources, and reducing conference costs.
Real-time voice transcription systems typically output text without punctuation, which is costly for a user to read. To ensure that the text recognized by an automatic speech recognition (ASR) system offers a good on-screen reading experience, after the ASR system produces the decoding result for the voice data, punctuation marks need to be added to that result by a punctuation prediction model. Punctuation prediction is the task of deciding the punctuation for the current text. A typical punctuation prediction method predicts the punctuation marks that may appear in the spoken-language text decoded by the ASR system based on the text semantics alone.
However, in the process of implementing the invention, the inventor found that this technical scheme has at least the following problem: it considers only text semantics when predicting punctuation, yet the semantics of spoken material are often incomplete, so labeling purely by semantics frequently yields unsatisfactory results. In short, the existing scheme suffers from low punctuation recognition accuracy for voice text.
Disclosure of Invention
The application provides a voice transcription system to solve the problem in the prior art that punctuation marks of voice text cannot be recognized correctly. The application further provides a voice transcription method and device, a voice recognition method and device, a method and device for constructing a voice text punctuation prediction model, a voice interaction system, method and device, a voice processing method, an ordering device, an intelligent sound box, voice transcription equipment, and electronic equipment.
The application provides a voice transcription system, including:
the server is used for receiving the voice data to be transcribed sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; sending the second text sequence back to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server, and displaying the second text sequence.
The application also provides a voice transcription method, which comprises the following steps:
receiving voice data to be transcribed sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
and sending the second text sequence back to the client.
Optionally, the punctuation information includes punctuation information related to text semantic information of the voice data and the acoustic feature information.
Optionally, the acoustic feature information includes at least one of the following information:
a bottleneck feature, an fbank (filter bank) feature, a word duration, a post-word silence duration, and a pitch feature.
Optionally, the determining a first text sequence corresponding to the voice data includes:
determining the first text sequence through an acoustic model and a language model;
the determining acoustic feature information of the voice data includes:
and acquiring the acoustic characteristic information output by the acoustic model.
Optionally, the determining, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the speech data includes:
determining first punctuation information related to text semantic information of the speech data according to the first text sequence by a first punctuation prediction subnetwork included in a punctuation prediction model;
determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction subnetwork included in the punctuation prediction model;
and determining the second text sequence according to the second punctuation mark information and the first text sequence.
Optionally, the first punctuation prediction subnetwork comprises at least one Transformer layer;
the second punctuation prediction subnetwork comprises at least one Transformer layer.
Optionally, the determining, by the second punctuation mark prediction subnetwork included in the punctuation mark prediction model, the second punctuation mark information according to the first punctuation mark information and the acoustic feature information includes:
for each word in the first text sequence, determining acoustic feature information of the word;
and taking a word as a unit, taking paired data of the first punctuation mark information and the acoustic feature information corresponding to each word as input data of the second punctuation mark prediction subnetwork, and determining the second punctuation mark information of each word through the second punctuation mark prediction subnetwork.
Optionally, the acoustic feature information of the voice data includes acoustic feature information of a plurality of data frames in units of voice data frames;
the determining acoustic feature information of the word comprises:
and determining the acoustic characteristic information of the word from the acoustic characteristic information of the plurality of data frames according to the time information of the plurality of data frames related to the word.
Optionally, the acoustic model includes one of the following network structures: a deep feedforward sequential memory network (DFSMN) or a bidirectional long short-term memory network (BLSTM);
the determining the acoustic feature information of the word from the acoustic feature information of the plurality of data frames according to the time information of the plurality of data frames to which the word is related comprises:
and taking the acoustic feature information of the last data frame related to the word as the acoustic feature information of the word, the acoustic feature information of that last data frame incorporating the information of the plurality of data frames.
Optionally, the method further includes:
and learning to obtain the punctuation mark prediction model from the corresponding relation set between the voice data and the text sequence comprising the punctuation mark information.
The application also provides a voice transcription method, which comprises the following steps:
collecting voice data to be transcribed;
sending the voice data to a server;
receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information;
displaying the second text sequence;
wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
The present application further provides a voice transcription device, including:
the voice data receiving unit is used for receiving the voice data to be transcribed sent by the client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
and a second text sequence returning unit, configured to send the second text sequence back to the client.
The present application further provides a voice transcription device, including:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
the voice data sending unit is used for sending the voice data to a server;
a second text sequence receiving unit, configured to receive a second text sequence including punctuation information and corresponding to the voice data, returned by the server;
the second text sequence display unit is used for displaying the second text sequence;
wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a voice transcription method, wherein, after the device is powered on and the processor runs the program for the voice transcription method, the following steps are performed: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and comprises punctuation information; and sending the second text sequence back to the client.
The present application further provides a voice transcription device, including:
a processor; and
a memory for storing a program for implementing a voice transcription method, wherein, after the device is powered on and the processor runs the program for the voice transcription method, the following steps are performed: collecting voice data to be transcribed; sending the voice data to a server; receiving a second text sequence, returned by the server, which corresponds to the voice data and comprises punctuation information; and displaying the second text sequence; wherein the second text sequence is determined by the following steps: the server receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and comprises punctuation information; and sends the second text sequence back to the client.
The application also provides a method for constructing a prediction model of punctuation marks of a voice text, which comprises the following steps:
determining a corresponding relation set among words, acoustic feature information of the words related to voice data of the words and punctuation mark information of the words;
constructing a network structure of a voice text punctuation symbol prediction model;
and learning the punctuation mark prediction model from the corresponding relation set.
Optionally, the corresponding relationship set is determined in the following manner:
and determining a corresponding relation set among the words, the acoustic feature information of the words related to the voice data of the words and the word punctuation mark information according to the corresponding relation set between the voice data and the text sequence comprising the punctuation mark information.
The present application further provides an apparatus for constructing a voice text punctuation prediction model, including:
a data determining unit, configured to determine a set of correspondences among words, acoustic feature information of the words derived from the voice data of the words, and punctuation annotation information of the words;
a network construction unit, configured to construct a network structure of the voice text punctuation prediction model;
and a model training unit, configured to learn the punctuation prediction model from the correspondence set.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the method for constructing a voice text punctuation prediction model, wherein, after the device is powered on and the processor runs the program, the following steps are performed: determining a set of correspondences among words, acoustic feature information of the words derived from the voice data of the words, and punctuation annotation information of the words; constructing a network structure of the voice text punctuation prediction model; and learning the punctuation prediction model from the correspondence set.
The application also provides a voice recognition method, which comprises the following steps:
determining a first text sequence corresponding to voice data to be recognized;
determining acoustic feature information of the voice data;
and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The present application further provides a speech recognition apparatus, including:
the first text sequence generating unit is used for determining a first text sequence corresponding to the voice data to be recognized;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
and the second text sequence generating unit is used for determining a second text sequence which corresponds to the voice data and comprises punctuation mark information according to the first text sequence and the acoustic characteristic information.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing a speech recognition method, wherein, after the device is powered on and the processor runs the program for the speech recognition method, the following steps are performed: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and comprises punctuation information.
The present application further provides a voice interaction system, comprising:
the server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; returning the voice reply information to the client;
the client is used for determining the target voice data and sending the voice interaction request to the server; and receiving the voice reply information returned by the server side, and displaying the voice reply information.
The application also provides a voice interaction method, which comprises the following steps:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
determining voice reply information according to the second text sequence;
and returning the voice reply information to the client.
The application also provides a voice interaction method, which comprises the following steps:
determining target voice data;
sending a voice interaction request aiming at the target voice data to a server;
receiving voice reply information returned by the server;
displaying the voice reply information;
the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
The present application further provides a voice interaction apparatus, including:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and a voice reply information returning unit, configured to send the voice reply information back to the client.
The present application further provides a voice interaction apparatus, including:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the target voice data to a server;
the voice reply message receiving unit is used for receiving the voice reply message returned by the server;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the target voice data to a server; receiving voice reply information returned by the server; displaying the voice reply information; the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
The present application further provides a voice interaction system, comprising:
the server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; returning the voice instruction information to the client;
the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server and executing the voice instruction information.
The application also provides a voice interaction method, which comprises the following steps:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
determining voice instruction information according to the second text sequence;
and returning the voice instruction information to the client.
The application also provides a voice interaction method, which comprises the following steps:
determining target voice data;
sending a voice interaction request aiming at the voice data to a server;
receiving voice instruction information returned by the server;
executing the voice instruction information;
wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
The present application further provides a voice interaction apparatus, including:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and a voice instruction information returning unit, configured to send the voice instruction information back to the client.
The present application further provides a voice interaction apparatus, including:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
the voice instruction information execution unit is used for executing the voice instruction information;
wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the voice data to a server; receiving voice instruction information returned by the server; executing the voice instruction information; wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
Optionally, the device comprises: an intelligent sound box, an intelligent television, a subway voice ticket-purchasing device, or an ordering device.
The application also provides a voice processing method, which comprises the following steps:
collecting voice data to be transcribed;
determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
processing associated with the second text sequence is performed.
Optionally, if the speech processing condition is satisfied, executing the method;
the method further comprises the following steps:
if the voice processing condition is not satisfied, determining a first text sequence corresponding to the voice data; and determining a third text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence.
Optionally, the speech processing conditions include: the noise of the voice data acquisition environment is smaller than a noise threshold, or the noise of the voice data acquisition environment is larger than the noise threshold;
the method further comprises the following steps:
noise data of a speech data collection environment is determined.
Optionally, the method further includes:
determining a user-specified noise threshold;
storing the noise threshold.
Optionally, determining a target voice processing method specified by a user;
and if the target voice processing method is the method, the voice processing condition is satisfied.
Optionally, the method further includes:
and displaying the voice processing progress information.
Optionally, the progress information includes at least one of the following: voice data collection completed, determination of the first text sequence completed, determination of the acoustic feature information completed, and determination of the second text sequence completed.
The present application further provides an ordering device, comprising:
a voice acquisition device;
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic characteristic information of the voice ordering data; determining a second ordering text sequence which corresponds to the voice ordering data and comprises punctuation mark information according to the first ordering text sequence and the acoustic characteristic information; and determining ordering information according to the second ordering text sequence, so that the second user prepares meals according to the ordering information.
The application further provides an intelligent sound box, include:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply message and/or executing the voice command message.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
the voice recognition method provided by the embodiment of the application determines a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, and the self intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language; therefore, the recognition accuracy of the punctuation marks of the voice text can be effectively improved.
The voice interaction system provided by the embodiment of the application determines target voice data through the client and sends a voice interaction request aiming at the voice data to the server; the server side responds to the request and determines a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; the voice reply information is sent back to the client side, and the client side receives and displays the voice reply information; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, the intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language, and then the voice reply information is determined based on the text sequence comprising the more accurate punctuation; therefore, the accuracy of voice reply can be effectively improved.
The voice interaction system provided by the embodiment of the application determines target voice data through the client and sends a voice interaction request aiming at the voice data to the server; the server side responds to the request and determines a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; returning the voice instruction information to the client; the client executes the voice instruction information; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, the intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language, and then the voice instruction information is determined based on the text sequence comprising the more accurate punctuation; therefore, the accuracy of voice interaction can be effectively improved.
The voice transcription system provided by the embodiment of the application collects voice data through the client and sends the voice data to the server; the server determines a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; the second text sequence is sent back to the client side, and the client side receives and displays the second text sequence; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, and the self intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language; therefore, the accuracy of voice transcription can be effectively improved.
The method for constructing the voice text punctuation mark prediction model provided by the embodiment of the application determines a corresponding relation set among words, acoustic feature information of the words related to voice data of the words and annotation information of the punctuation marks of the words; constructing a network structure of a voice text punctuation symbol prediction model; learning from the corresponding relation set to obtain the punctuation mark prediction model; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, and the self intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language; therefore, the model accuracy can be effectively improved.
The voice processing method provided by the embodiment of the application collects voice data to be transcribed; determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; performing processing associated with the second text sequence; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, the intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation more in line with the spoken language, and then the processing related to the second text sequence is executed based on the text sequence comprising the more accurate punctuation; therefore, the accuracy of the correlation process can be effectively improved.
The ordering equipment provided by the embodiment of the application acquires the voice ordering data of the first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic characteristic information of the voice ordering data; determining a second ordering text sequence which corresponds to the voice ordering data and comprises punctuation mark information according to the first ordering text sequence and the acoustic characteristic information; determining ordering information according to the second ordering text sequence, so that a second user can prepare food according to the ordering information; the processing mode ensures that the acoustic characteristic information of the voice ordering data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice ordering data, the self-intention of an ordering person can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language, and then the ordering information (such as dish names, personal taste requirements and the like) is determined on the basis of the ordering text comprising the more accurate punctuation; therefore, the ordering accuracy can be effectively improved, and the user experience is improved.
According to the intelligent sound box provided by the embodiment of the application, the voice data of the first user are collected; determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information and/or voice instruction information according to the second text sequence; displaying the voice reply information and/or executing the voice instruction information; the processing mode ensures that the acoustic characteristic information of the voice data is comprehensively utilized to predict the punctuation information on the basis of determining the punctuation information according to the text semantic information of the voice data, the intention of a speaker can be better utilized after the acoustic characteristic information is utilized to obtain the punctuation which is more accordant with the spoken language, and then the voice reply information and/or the voice instruction information are determined on the basis of the second text sequence comprising the more accurate punctuation; therefore, the accuracy of voice reply and voice instruction can be effectively improved, and the user experience is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of a speech recognition method provided herein;
FIG. 2 is a schematic diagram of an application scenario of an embodiment of a speech recognition method provided in the present application;
FIG. 3 is a detailed flow chart of an embodiment of a speech recognition method provided herein;
FIG. 4 is a diagram of a model network architecture for an embodiment of a speech recognition method provided by the present application;
FIG. 5 is a detailed flow chart of an embodiment of a speech recognition method provided herein;
FIG. 6 is a schematic diagram of an embodiment of a speech recognition apparatus provided herein;
FIG. 7 is a schematic diagram of an embodiment of an electronic device provided herein;
FIG. 8 is a device interaction diagram of an embodiment of a voice interaction system provided by the present application;
FIG. 9 is a device interaction diagram of an embodiment of a voice interaction system provided by the present application;
FIG. 10 is a schematic diagram of device interaction of an embodiment of a speech transcription system provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The application provides a voice transcription system, a voice transcription method and device, a voice recognition method and device, a method and device for constructing a voice text punctuation prediction model, a voice interaction system, method and device, a voice processing method, an ordering device, an intelligent sound box, voice transcription equipment, and electronic equipment. Each scheme is described in detail in the following embodiments.
First embodiment
Please refer to fig. 1, which is a flowchart of an embodiment of the speech recognition method of the present application. The execution subject of the method is a speech recognition apparatus, which is usually deployed on a server but is not limited to it; it may be any device capable of implementing the speech recognition method. The speech recognition method provided by this embodiment comprises the following steps:
step S101: a first text sequence corresponding to the speech data to be recognized is determined.
In this embodiment, the first text sequence may be determined by an acoustic model (AM) and a language model (e.g., an N-gram language model). The acoustic model converts the input voice signal into posterior probability scores over acoustic modeling units (also called phonemes or pronunciation units); the language model predicts the prior probability that a given word sequence W = (w_1, ..., w_N) occurs, here in the chain-rule form and its N-gram approximation:
$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})$$

$$P(W) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
then, a decoding network is constructed through the joint acoustic model score and the language model score by a decoder, and a decoding result is obtained through the preferred path search, namely: a first text sequence.
The first text sequence may be a text sequence that does not include punctuation information, such as "shortage drugs are unavailable to patients the market is being illegally manipulated causing prices to soar in recent years the problem of domestic drug supply shortages has drawn wide attention … …".
In the process of implementing the present invention, the inventor found that the prior art considers only text semantics for punctuation prediction and ignores the input of the ASR system, namely the acoustic feature information. In fact, the semantics of a spoken corpus are sometimes incomplete. For spoken corpora, a large amount of punctuation information is hidden in the speech itself beyond the text semantics: pauses often mark the positions of punctuation marks, and the pause length helps distinguish a comma from a period; a change in tone often indicates a question mark; and so on. Labeling purely by semantics therefore often yields unsatisfactory results. Based on this consideration, the inventor proposes the following technical idea: on the basis of determining punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data should also be used for prediction. Exploiting the acoustic features better captures the speaker's intent and yields punctuation that better matches spoken delivery, thereby improving the punctuation recognition accuracy for voice text.
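To make this intuition concrete, here is a toy rule-based sketch of how pause and pitch cues might map to punctuation; it is purely illustrative (the thresholds are invented assumptions), and the application's actual approach is the learned model described below:

```python
def heuristic_punctuation(pause_s, pitch_rising):
    """Toy mapping from acoustic cues to a punctuation guess.

    pause_s:      silence duration after the word, in seconds.
    pitch_rising: whether the pitch contour rises toward the word end.
    """
    if pause_s < 0.2:
        return ""            # no real pause: usually no punctuation
    if pitch_rising:
        return "?"           # a rising tone often signals a question
    return "," if pause_s < 0.6 else "."  # longer pauses favor a period
```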
Please refer to fig. 2, which is a schematic view of a usage scenario of an embodiment of the speech recognition method of the present application. In this embodiment, six microphone arrays are deployed at a conference site together with a data collection device. Each microphone array sends its target sound-source signal to the data collection device, which forwards the target voice signal to the cloud; the speech recognition apparatus deployed in the cloud performs the voice transcription, and the data collection device receives and displays the transcription result. The transcription result comprises punctuation information related to both the text semantic information of the voice data and the acoustic feature information.
Step S103: acoustic feature information of the speech data is determined.
The acoustic feature information includes, but is not limited to, at least one of the following: bottleneck features, fbank features, word duration, post-word silence duration, pitch features, and the like.
In a specific implementation, the acoustic feature information of the voice data may be determined using an acoustic feature extraction method from the prior art, such as linear predictive coding (LPC), perceptual linear prediction (PLP), Tandem and bottleneck features, filter-bank-based fbank features, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and the like.
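As a concrete illustration (not part of the application itself), two of the listed features can be extracted with the open-source librosa library as follows; the 25 ms/10 ms framing and the 80-mel configuration are common choices assumed here for the sketch:

```python
import librosa
import numpy as np

def extract_fbank_and_pitch(wav_path, sr=16000):
    """Per-frame fbank (log-mel filter bank) and pitch features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)    # 25 ms analysis window
    hop = int(0.010 * sr)      # 10 ms frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=80)
    fbank = np.log(mel + 1e-10).T                      # (num_frames, 80)
    f0, _, _ = librosa.pyin(y, fmin=50.0, fmax=400.0,  # pitch track, one value per
                            sr=sr, hop_length=hop)     # frame (NaN where unvoiced)
    return fbank, f0
```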
In one example, step S103 may be implemented as follows: and acquiring the acoustic feature information output by the acoustic model in the step S101.
Step S105: and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The second text sequence is a text sequence of the voice data that includes punctuation information, for example: "Shortage drugs are unavailable to patients; the market is being illegally manipulated, causing prices to soar … … In recent years, the problem of domestic drug supply shortages has drawn wide attention. Recently, … …".
The punctuation information includes, but is not limited to, punctuation information related to both the text semantic information of the voice data and the acoustic feature information. In a specific implementation, the punctuation information may further include first punctuation information related only to the text semantic information of the voice data, and punctuation information related only to the acoustic feature information.
Please refer to fig. 3, which is a flowchart illustrating a speech recognition method according to an embodiment of the present application. In this embodiment, step S105 may include the following sub-steps:
Step S1051: determining, through a first punctuation prediction subnetwork included in a punctuation prediction model, first punctuation information related to the text semantic information of the voice data, according to the first text sequence.
In the method provided by the embodiment of the application, first punctuation information, namely punctuation information related to text semantic information of the voice data, is determined according to the first text sequence through a first punctuation prediction subnetwork included in a punctuation prediction model.
Step S1053: determining, through a second punctuation prediction subnetwork included in the punctuation prediction model, second punctuation information related to both the text semantic information of the voice data and the acoustic feature information, according to the first punctuation information and the acoustic feature information.
On the basis of determining the first punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is additionally used to predict the second punctuation information; exploiting the acoustic features better captures the speaker's intent and yields punctuation that better matches spoken delivery.
The punctuation information output by the punctuation prediction model may include only part of the first punctuation information: once the acoustic feature information of the speech data is taken into account, first punctuation marks that fit the text semantics but not the spoken delivery may be filtered out. The output may also include punctuation marks beyond the first punctuation information; such additional marks are punctuation marks related to the acoustic feature information.
Please refer to fig. 4, which is a diagram of a punctuation prediction model according to an embodiment of the speech recognition method of the present application. The punctuation prediction model comprises a first punctuation prediction subnetwork and a second punctuation prediction subnetwork. In this embodiment, the punctuation prediction model is built on a Transformer model and may include several Transformer layers. Its input is a string of words (also called tokens); after the words pass through the Transformer layers, the output is a punctuation classification for each word, and the concrete punctuation classes can be chosen according to actual needs, such as comma, period, question mark, exclamation mark, and so on. The input data of the first punctuation prediction subnetwork is the input of the whole model, i.e., a string of words, such as the words forming a text segment; its output is the first punctuation information. The input data of the second punctuation prediction subnetwork is, for each word, the pair of that word's first punctuation information and its acoustic feature information; its output is the second punctuation information.
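For illustration only, the following is a minimal PyTorch sketch of such a two-subnetwork structure. The class name, layer sizes, number of punctuation classes, and the choice to pass the first subnetwork's hidden states (as a carrier of the first punctuation information) into the second subnetwork are all assumptions made for this sketch, not details disclosed by the application.

```python
import torch
import torch.nn as nn

class PunctuationPredictor(nn.Module):
    """Sketch of a two-stage punctuation prediction model.

    Stage 1 predicts punctuation from text alone; stage 2 refines the
    prediction by pairing each word's stage-1 information with that
    word's acoustic feature vector. All sizes are illustrative.
    """
    def __init__(self, vocab_size, num_classes=5, d_model=256,
                 acoustic_dim=80, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer1 = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # First subnetwork: text -> first punctuation information.
        self.text_encoder = nn.TransformerEncoder(layer1, num_layers)
        self.first_head = nn.Linear(d_model, num_classes)
        # Second subnetwork: (first punctuation info, acoustic feature)
        # pairs -> second punctuation information.
        self.fuse = nn.Linear(d_model + acoustic_dim, d_model)
        layer2 = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(layer2, num_layers)
        self.second_head = nn.Linear(d_model, num_classes)

    def forward(self, tokens, word_acoustics):
        # tokens: (batch, seq_len) word ids
        # word_acoustics: (batch, seq_len, acoustic_dim), one vector per word
        h = self.text_encoder(self.embed(tokens))
        first_logits = self.first_head(h)        # per-word punctuation classes
        paired = torch.cat([h, word_acoustics], dim=-1)
        g = self.fusion_encoder(self.fuse(paired))
        second_logits = self.second_head(g)
        return first_logits, second_logits
```

Here the second subnetwork sees, for every word, the pair of stage-one information and that word's acoustic feature vector, matching the word-level pairing described above.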
It should be noted that the acoustic model of this embodiment outputs acoustic feature information per frame of speech data (the frame is its output unit), whereas the punctuation prediction model outputs first punctuation information per word (the word is its output unit). Since the input data of the second punctuation prediction subnetwork is the paired first punctuation information and acoustic feature information of each word, the first punctuation information and the acoustic feature information must be aligned word by word.
In a specific implementation, step S1053 may include the following sub-steps: 1) for each word in the first text sequence, determine the acoustic feature information of the word; 2) taking the word as the unit, use the paired first punctuation information and acoustic feature information of each word as the input data of the second punctuation prediction subnetwork, and determine the second punctuation information of each word through that subnetwork.
In this embodiment, the acoustic feature information of the voice data comprises the acoustic feature information of a plurality of data frames, with the voice data frame as the unit. The step of determining the acoustic feature information of a word may be implemented as follows: determine the word's acoustic feature information from the acoustic feature information of the data frames according to the time information of the data frames related to the word.
In this embodiment, the acoustic model may adopt a structure capable of recording long-term information, comprising one of the following network modules: the deep feedforward sequential memory network DFSMN or the bidirectional long short-term memory network BLSTM. With such an acoustic model, the acoustic feature information of the last frame of each word in fact contains the acoustic information of the whole word. To obtain a better punctuation recognition effect, this embodiment therefore splices the acoustic feature information of the last frame of each word into the Transformer model; the start time point and end time point of each word can be obtained during decoding by the acoustic model.
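For illustration only, the following sketch selects one acoustic vector per word by taking the word's last frame, as described above; the 10 ms frame shift and the data layout are assumptions made for this sketch.

```python
import numpy as np

def word_acoustic_features(frame_feats, word_end_times, frame_shift=0.01):
    """Pick one acoustic vector per word: the feature of the word's last
    frame, as suggested above for models (e.g. DFSMN/BLSTM) whose
    last-frame state summarizes the whole word.

    frame_feats: (num_frames, dim) frame-level features
    word_end_times: end time of each word in seconds, from decoding
    frame_shift: assumed 10 ms frame shift
    """
    feats = []
    for end in word_end_times:
        idx = min(int(round(end / frame_shift)) - 1, len(frame_feats) - 1)
        feats.append(frame_feats[max(idx, 0)])
    return np.stack(feats)            # shape: (num_words, dim)

# Usage: given decoder word timings [(word, start, end), ...]
# acoustics = word_acoustic_features(fbank, [end for _, _, end in timings])
```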
It should be noted that the punctuation prediction model may also be built on models other than a Transformer, making use of the acoustic feature information in whatever way suits that model's network characteristics.
Step S1055: determine the second text sequence according to the second punctuation information and the first text sequence.
After the second punctuation information and the first text sequence have been determined, the two can be spliced together to obtain the second text sequence.
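For illustration only, a trivial sketch of this splicing; the label convention (None for "no mark") is an assumption made for this sketch.

```python
def splice_punctuation(words, punct_labels):
    """Combine the first text sequence with per-word second punctuation
    information to form the second text sequence."""
    pieces = []
    for word, mark in zip(words, punct_labels):
        pieces.append(word)
        if mark:                      # e.g. ",", ".", "?", "!"
            pieces.append(mark)
    return "".join(pieces)            # no spaces, as in Chinese text

# splice_punctuation(["今天", "天气", "很好"], [None, None, "。"])
# -> "今天天气很好。"
```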
Please refer to fig. 5, which is a flowchart illustrating a speech recognition method according to an embodiment of the present application. In this embodiment, the method may further include the steps of:

Step S501: learn the punctuation prediction model from a set of correspondences between voice data and text sequences that include punctuation information.
In the method provided by this embodiment of the application, the punctuation prediction model is learned from the correspondence set by a supervised machine learning method. The voice data can first be converted into a text sequence by an existing voice recognition method, and punctuation can then be added to that text sequence manually, forming a text sequence that includes punctuation information, i.e., a punctuated text sequence of the voice data. The correspondence set serves as the training data.
In this embodiment, step S501 may include the following sub-steps:

Step S5011: determine, according to the correspondence set between the voice data and the punctuated text sequences, a correspondence set among words, the acoustic feature information of each word in the voice data to which it belongs, and the punctuation annotation information of the words.
The acoustic feature information of a word is the acoustic information of that word taken from the voice data to which the word belongs.
Table 1 shows, for this embodiment, the correspondence set among words, the acoustic feature information of each word in the voice data to which it belongs, and the punctuation annotation information of the words.
[Table 1 is shown as an image in the original publication; each row pairs a word with its acoustic feature information in a given piece of voice data and with the word's punctuation annotation.]

TABLE 1 Set of correspondences
As can be seen from table 1, the same word may have different acoustic feature information in different spoken contexts and therefore different punctuation classifications. For example, the punctuation mark annotated after the word "happy" in voice data 1 is a comma, while in voice data 2 it is a period.

Step S5013: learn the punctuation prediction model from the correspondence set among the words, the acoustic feature information of each word in its voice data, and the punctuation annotation information of the words.

After the correspondence set among the words, the per-word acoustic feature information, and the punctuation annotation information has been obtained, the punctuation prediction model can be learned from it. During model training, once the deviation between the predicted punctuation marks and the pre-annotated punctuation marks reaches the optimization target, training ends and the model parameters are stored for use in the prediction stage.
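For illustration only, a minimal training sketch for the PunctuationPredictor sketched earlier; the synthetic batch stands in for batches built from the correspondence set of Table 1, and all sizes, hyper-parameters, and the choice to supervise both stages with the same labels are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

model = PunctuationPredictor(vocab_size=30000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, 30000, (8, 32))       # (batch, words)
word_acoustics = torch.randn(8, 32, 80)         # per-word acoustic features
labels = torch.randint(0, 5, (8, 32))           # annotated punctuation classes

for step in range(100):
    first_logits, second_logits = model(tokens, word_acoustics)
    # Supervise both stages with the pre-annotated punctuation labels.
    loss = (criterion(second_logits.flatten(0, 1), labels.flatten())
            + criterion(first_logits.flatten(0, 1), labels.flatten()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "punct_model.pt")  # kept for the prediction stage
```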
As can be seen from the foregoing embodiments, the speech recognition method provided in the embodiments of the present application determines a first text sequence corresponding to the speech data to be recognized; determines acoustic feature information of the speech data; and determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the speech data and includes punctuation information. With this processing, on the basis of punctuation information determined from the text semantic information of the speech data, the acoustic feature information is further used to predict the punctuation information; using the acoustic feature information better captures the speaker's own intention and yields punctuation that more closely matches the spoken language. The recognition accuracy of punctuation in the speech text can therefore be effectively improved.
Second embodiment
In the foregoing embodiment, a speech recognition method is provided, and correspondingly, a speech recognition apparatus is also provided in the present application. The apparatus corresponds to an embodiment of the method described above.
Please refer to fig. 6, which is a schematic diagram of an embodiment of a speech recognition apparatus provided in the present application, and parts of this embodiment that are the same as the first embodiment are not repeated, please refer to corresponding parts in the first embodiment. The present application provides a speech recognition apparatus including:
a first text sequence generating unit 601 configured to determine a first text sequence corresponding to speech data to be recognized;
an acoustic feature information determining unit 603 configured to determine acoustic feature information of the voice data;
a second text sequence generating unit 605, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence corresponding to the voice data and including punctuation information.
Third embodiment
Please refer to fig. 7, which is a schematic diagram of an embodiment of an electronic device according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor 701 and a memory 702. The memory is used for storing a program implementing the speech recognition method; after the device is powered on and the program of the speech recognition method is run by the processor, the following steps are performed: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information.
Fourth embodiment
In the foregoing embodiment, a speech recognition method is provided, and correspondingly, the present application further provides a speech interaction system.
Please refer to fig. 8, which is a schematic device interaction diagram of an embodiment of the voice interaction system of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice interaction system, comprising: a server and a client.
The server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; returning the voice reply information to the client; the client is used for determining the target voice data and sending the voice interaction request to the server; and receiving the voice reply information returned by the server side, and displaying the voice reply information.
The voice reply information may be a text reply, a speech reply, or a reply in another form.
As can be seen from the foregoing embodiments, in the voice interaction system provided in the embodiments of the present application, the client determines target voice data and sends a voice interaction request for that voice data to the server; the server, in response to the request, determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and sends the voice reply information back to the client, which receives and displays it. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice reply information be determined from a text sequence with more accurate punctuation. The accuracy of the voice reply can therefore be effectively improved.
Fifth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; its execution subject includes, but is not limited to, a server side, and may also be other devices. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The voice interaction method provided by the application comprises the following steps:

Step 1: receive a voice interaction request, sent by a client, for target voice data;

Step 2: determine a first text sequence corresponding to the voice data;

Step 3: determine acoustic feature information of the voice data;

Step 4: determine, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information;

Step 5: determine voice reply information according to the second text sequence;

Step 6: send the voice reply information back to the client.
As can be seen from the foregoing embodiments, the voice interaction method provided in the embodiments of the present application receives a voice interaction request, sent by a client, for target voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and sends the voice reply information back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice reply information be determined from a text sequence with more accurate punctuation. The accuracy of the voice reply can therefore be effectively improved.
Sixth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction apparatus, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and a voice reply information returning unit, used for sending the voice reply information back to the client.
Seventh embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and runs the program of the voice interaction method through the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
Eighth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; its execution subject includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment. The voice interaction method provided by the application comprises the following steps:

Step 1: determine target voice data;

Step 2: send a voice interaction request for the target voice data to a server;

Step 3: receive voice reply information returned by the server;

Step 4: display the voice reply information.

The voice reply information is determined by the following steps: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines the voice reply information according to the second text sequence; and sends the voice reply information back to the client.
As can be seen from the foregoing embodiments, the voice interaction method provided in the embodiments of the present application determines target voice data; sends a voice interaction request for the target voice data to a server; receives the voice reply information returned by the server; and displays the voice reply information. The voice reply information is determined by the server as follows: it receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines the voice reply information according to the second text sequence; and sends the voice reply information back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice reply information be determined from a text sequence with more accurate punctuation. The accuracy of the voice reply can therefore be effectively improved.
Ninth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction apparatus, comprising:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the target voice data to a server;
the voice reply message receiving unit is used for receiving the voice reply message returned by the server;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
Tenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and runs the program of the voice interaction method through the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the target voice data to a server; receiving voice reply information returned by the server; displaying the voice reply information; the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
Eleventh embodiment
In the foregoing embodiment, a speech recognition method is provided, and correspondingly, the present application further provides a speech interaction system.
Please refer to fig. 9, which is a schematic device interaction diagram of an embodiment of the voice interaction system of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice interaction system, comprising: a server and a client.
The server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; returning the voice instruction information to the client; the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server and executing the voice instruction information.
In one example, the client is a smart speaker. It collects user voice data such as "Tmall Genie, turn the air conditioner temperature up"; from this the system can determine the voice instruction information "air conditioner: temperature > 25 degrees", and the smart speaker executes the instruction, adjusting the air conditioner to above 25 degrees.
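For illustration only, a toy sketch of how a server might map the punctuated second text sequence to voice instruction information for this example; the rule, the command dictionary format, and the function name are all invented for this sketch (a real system would use an NLU model rather than keyword rules).

```python
def to_instruction(second_text_sequence):
    """Toy rule-based mapping from a punctuated transcript to a device
    instruction; the command format here is an assumption."""
    if ("air conditioner" in second_text_sequence
            and "up" in second_text_sequence):
        return {"device": "air_conditioner", "action": "set_temperature",
                "comparator": ">", "value": 25}
    return None

# to_instruction("Tmall Genie, turn the air conditioner temperature up.")
# -> {"device": "air_conditioner", ...}
```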
As can be seen from the foregoing embodiments, in the voice interaction system provided in the embodiments of the present application, the client determines target voice data and sends a voice interaction request for that voice data to the server; the server, in response to the request, determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and sends the voice instruction information back to the client, which executes it. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice instruction information be determined from a text sequence with more accurate punctuation. The accuracy of voice interaction can therefore be effectively improved.
Twelfth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; its execution subject includes, but is not limited to, a server side, and may also be other devices. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The voice interaction method provided by the application comprises the following steps:

Step 1: receive a voice interaction request, sent by a client, for target voice data;

Step 2: determine a first text sequence corresponding to the voice data;

Step 3: determine acoustic feature information of the voice data;

Step 4: determine, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information;

Step 5: determine voice instruction information according to the second text sequence;

Step 6: send the voice instruction information back to the client.
As can be seen from the foregoing embodiments, the voice interaction method provided in the embodiments of the present application receives a voice interaction request, sent by a client, for target voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and sends the voice instruction information back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice instruction information be determined from a text sequence with more accurate punctuation. The accuracy of voice interaction can therefore be effectively improved.
Thirteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction apparatus, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and a voice instruction information returning unit, used for sending the voice instruction information back to the client.
Fourteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and runs the program of the voice interaction method through the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
Fifteenth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; its execution subject includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment. The voice interaction method provided by the application comprises the following steps:

Step 1: determine target voice data;

Step 2: send a voice interaction request for the voice data to a server;

Step 3: receive voice instruction information returned by the server;

Step 4: execute the voice instruction information.

The voice instruction information is determined by the following steps: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines the voice instruction information according to the second text sequence; and sends the voice instruction information back to the client.
As can be seen from the foregoing embodiments, the voice interaction method provided in the embodiments of the present application determines target voice data; sends a voice interaction request for the voice data to a server; receives the voice instruction information returned by the server; and executes the voice instruction information. The voice instruction information is determined by the server as follows: it receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determines the voice instruction information according to the second text sequence; and sends the voice instruction information back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; this better captures the speaker's own intention, yields punctuation that more closely matches the spoken language, and lets the voice instruction information be determined from a text sequence with more accurate punctuation. The accuracy of voice interaction can therefore be effectively improved.
Sixteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides a voice interaction apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction apparatus, comprising:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
the voice instruction information execution unit is used for executing the voice instruction information;
wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
Seventeenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the voice interaction method, and after the equipment is powered on and runs the program of the voice interaction method through the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the voice data to a server; receiving voice instruction information returned by the server; executing the voice instruction information; wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
Such devices include, but are not limited to: smart speakers, smart televisions, subway voice ticketing machines, food-ordering machines, and the like.
Eighteenth embodiment
In the foregoing embodiment, a speech recognition method is provided, and correspondingly, the present application further provides a speech transcription system.
Please refer to fig. 10, which is a schematic diagram of the device interaction of an embodiment of the speech transcription system of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice transcription system, comprising: a server and a client.
The server is used for receiving the voice data to be transcribed sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; sending the second text sequence back to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server, and displaying the second text sequence.
As shown in fig. 2, the client may be a voice capture device connected to multiple microphones deployed at a conference site.
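For illustration only, a minimal sketch of the client side of such a transcription system using the Python requests library; the endpoint URL, payload format, and response schema are assumptions made for this sketch, not a disclosed API.

```python
import requests

def transcribe(wav_bytes, server="http://example.com/transcribe"):
    """Hypothetical client call: post recorded audio to the server and
    return the punctuated second text sequence from its reply."""
    resp = requests.post(server, data=wav_bytes,
                         headers={"Content-Type": "audio/wav"})
    resp.raise_for_status()
    return resp.json()["second_text_sequence"]  # assumed response field

with open("meeting.wav", "rb") as f:
    print(transcribe(f.read()))      # display the punctuated transcript
```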
As can be seen from the foregoing embodiments, in the voice transcription system provided in the embodiments of the present application, the client collects voice data and sends it to the server; the server determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; and sends the second text sequence back to the client, which receives and displays it. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; using the acoustic feature information better captures the speaker's own intention and yields punctuation that more closely matches the spoken language. The accuracy of voice transcription can therefore be effectively improved.
Nineteenth embodiment
Corresponding to the voice transcription system, the application also provides a voice transcription method; its execution subject includes, but is not limited to, a server side, and may also be other devices. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The voice transcription method provided by the application comprises the following steps:

Step 1: receive voice data to be transcribed sent by a client;

Step 2: determine a first text sequence corresponding to the speech data.
In this embodiment, step 2 can be implemented as follows: determine the first text sequence through an acoustic model and a language model.

Step 3: determine acoustic feature information of the speech data.
In this embodiment, step 3 can be implemented as follows: acquire the acoustic feature information output by the acoustic model.
The acoustic feature information includes at least one of the following: Bottleneck features, fbank features, word duration, post-word silence duration, and pitch features.

Step 4: determine, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information.
The punctuation information includes, but is not limited to, punctuation information related to both the text semantic information of the speech data and the acoustic feature information.
In this embodiment, step 4 may include the following sub-steps: 1) determine, through a first punctuation prediction subnetwork included in a punctuation prediction model, first punctuation information related to the text semantic information of the speech data according to the first text sequence; 2) determine, through a second punctuation prediction subnetwork included in the punctuation prediction model, second punctuation information related to both the text semantic information and the acoustic feature information of the voice data, according to the first punctuation information and the acoustic feature information; 3) determine the second text sequence according to the second punctuation information and the first text sequence.
In this embodiment, the first punctuation prediction subnetwork comprises at least one Transformer layer, and the second punctuation prediction subnetwork comprises at least one Transformer layer.
In a specific implementation, the step of determining the second punctuation information according to the first punctuation information and the acoustic feature information through the second punctuation prediction subnetwork may include the following sub-steps: 1) for each word in the first text sequence, determine the acoustic feature information of the word; 2) taking the word as the unit, use the paired first punctuation information and acoustic feature information of each word as the input data of the second punctuation prediction subnetwork, and determine the second punctuation information of each word through that subnetwork.
In this embodiment, the acoustic feature information of the voice data comprises the acoustic feature information of a plurality of data frames, with the voice data frame as the unit. The step of determining the acoustic feature information of a word may be implemented as follows: determine the word's acoustic feature information from the acoustic feature information of the data frames according to the time information of the data frames related to the word.
In this embodiment, the acoustic model includes one of the following network modules: the deep feedforward sequential memory network DFSMN or the bidirectional long short-term memory network BLSTM. The step of determining the acoustic feature information of a word from the acoustic feature information of the data frames according to the time information of the frames related to the word may be implemented as follows: take the acoustic feature information of the last data frame related to the word as the acoustic feature information of the word, since with these structures the last frame's features subsume the acoustic information of all the word's frames.
In this embodiment, the method may further include the following step: learn the punctuation prediction model from the correspondence set between the voice data and the text sequences that include punctuation information.

Step 5: send the second text sequence back to the client.
As can be seen from the foregoing embodiments, the voice transcription method provided in the embodiments of the present application receives voice data to be transcribed sent by a client; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; and sends the second text sequence back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; using the acoustic feature information better captures the speaker's own intention and yields punctuation that more closely matches the spoken language. The accuracy of voice transcription can therefore be effectively improved.
Twentieth embodiment
In the foregoing embodiment, a voice transcription method is provided, and correspondingly, the present application further provides a voice transcription apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice transcription device, comprising:
the voice data receiving unit is used for receiving the voice data to be transcribed sent by the client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
and a second text sequence returning unit, used for sending the second text sequence back to the client.
Twenty-first embodiment
In the foregoing embodiment, a voice transcription method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing the voice transcription method, and after the equipment is powered on and runs the program of the voice transcription method through the processor, the following steps are executed: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
Twenty-second embodiment
Corresponding to the voice transcription system, the application also provides a voice transcription method; its execution subject includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment. The voice transcription method provided by the application comprises the following steps:

Step 1: collect voice data to be transcribed;

Step 2: send the voice data to a server;

Step 3: receive a second text sequence, returned by the server, that corresponds to the voice data and includes punctuation information;

Step 4: display the second text sequence.

The second text sequence is determined by the following steps: the server receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; and sends the second text sequence back to the client.
As can be seen from the foregoing embodiments, the voice transcription method provided in the embodiments of the present application collects voice data to be transcribed; sends the voice data to a server; receives the second text sequence, returned by the server, that corresponds to the voice data and includes punctuation information; and displays the second text sequence. The second text sequence is determined by the server as follows: it receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; and sends the second text sequence back to the client. With this processing, the punctuation information is predicted using the acoustic feature information on top of the punctuation determined from the text semantic information of the voice data; using the acoustic feature information better captures the speaker's own intention and yields punctuation that more closely matches the spoken language. The accuracy of voice transcription can therefore be effectively improved.
Twenty-third embodiment
In the foregoing embodiment, a voice transcription method is provided, and correspondingly, the present application further provides a voice transcription apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice transcription device, comprising:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
the voice data sending unit is used for sending the voice data to a server;
a second text sequence receiving unit, configured to receive a second text sequence including punctuation information and corresponding to the voice data, returned by the server;
the second text sequence display unit is used for displaying the second text sequence;
wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
Twenty-fourth embodiment
In the foregoing embodiment, a voice transcription method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory. The memory is used for storing a program implementing the voice transcription method; after the device is powered on and runs the program through the processor, the following steps are executed: collecting voice data to be transcribed; sending the voice data to a server; receiving, from the server, a second text sequence that corresponds to the voice data and includes punctuation information; and displaying the second text sequence; the second text sequence being determined by the server through the steps described above.
Such electronic devices include, but are not limited to, the voice data collection device shown in fig. 2, for example a mobile communication device.
Twenty-fifth embodiment
Corresponding to the above speech recognition method, the present application further provides a method for constructing a speech text punctuation prediction model. The execution subject of the method includes, but is not limited to, a server; it may be any other device capable of implementing the method. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The application provides a method for constructing a speech text punctuation prediction model, which comprises the following steps:
Step 1: determining a correspondence set among words, acoustic feature information of the words in the voice data to which the words belong, and punctuation mark annotation information of the words.
The correspondence set may be determined as follows: according to a set of correspondences between voice data and text sequences that include punctuation information, determine the correspondences among the words, the acoustic feature information of the words in the voice data to which they belong, and the punctuation mark annotation information of the words.
Step 2: constructing a network structure of the speech text punctuation prediction model.
Step 3: learning the punctuation prediction model from the correspondence set.
As can be seen from the foregoing, the method for constructing a speech text punctuation prediction model provided in the embodiments of the present application determines a correspondence set among words, acoustic feature information of the words in the voice data to which they belong, and punctuation mark annotation information of the words; constructs the network structure of the model; and learns the model from the correspondence set. Because the resulting model predicts punctuation from acoustic feature information in addition to text semantic information, and the acoustic feature information reflects the speaker's own intention, the predicted punctuation better matches the spoken language; the accuracy of the model can therefore be effectively improved.
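For illustration only, and not as part of the claimed subject matter, the following Python sketch shows one way the correspondence set of step 1 could be assembled. The `frame_features` (frame-level acoustic front end) and `force_align` (word-to-frame aligner) callables are caller-supplied assumptions, as is the punctuation inventory.

```python
# Step 1, sketched: derive (word, word acoustic features, punctuation label)
# triples from pairs of voice data and punctuated text sequences. The
# frame_features and force_align callables are caller-supplied assumptions.
PUNCT = set("，。？！,.?!")  # illustrative label inventory; "O" = no punctuation

def build_correspondence_set(pairs, frame_features, force_align):
    """pairs: iterable of (audio, punctuated_text) with space-separated words."""
    samples = []
    for audio, text in pairs:
        words, labels = [], []
        for token in text.split():
            if token and token[-1] in PUNCT:
                words.append(token[:-1])   # strip the mark off the word...
                labels.append(token[-1])   # ...and keep it as the annotation
            else:
                words.append(token)
                labels.append("O")
        feats = frame_features(audio)      # shape: [n_frames, feat_dim]
        spans = force_align(audio, words)  # [(first_frame, last_frame), ...]
        for word, (_, last), label in zip(words, spans, labels):
            # Use the last data frame related to the word as the word's
            # acoustic feature information (cf. the embodiments above).
            samples.append((word, feats[last], label))
    return samples
```

A minimal sketch of a possible network structure for steps 2 and 3 follows, written with PyTorch. The two-subnetwork layout mirrors the description in this application (a text-only punctuation subnetwork whose output is fused with per-word acoustic features); the layer sizes and the concatenation-based fusion are assumptions.

```python
# Steps 2 and 3, sketched with PyTorch: a text-only punctuation subnetwork
# whose output is fused with per-word acoustic features; layer sizes and
# concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

class PunctuationPredictor(nn.Module):
    def __init__(self, vocab_size, acoustic_dim, hidden=256, n_punct=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # First subnetwork: punctuation from text semantic information.
        self.text_rnn = nn.LSTM(hidden, hidden, batch_first=True,
                                bidirectional=True)
        self.first_punct = nn.Linear(2 * hidden, n_punct)
        # Second subnetwork: fuse first punctuation information with the
        # per-word acoustic feature information.
        self.second_punct = nn.Sequential(
            nn.Linear(n_punct + acoustic_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_punct),
        )

    def forward(self, word_ids, word_acoustic_feats):
        # word_ids: [batch, seq]; word_acoustic_feats: [batch, seq, acoustic_dim]
        h, _ = self.text_rnn(self.embed(word_ids))
        first = self.first_punct(h)                    # first punctuation info
        fused = torch.cat([first, word_acoustic_feats], dim=-1)
        return self.second_punct(fused)                # second punctuation info
```

Step 3 would then minimize per-word cross-entropy between the model's output and the punctuation mark annotations in the correspondence set.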
Twenty-sixth embodiment
In the above embodiment, a method for constructing a speech text punctuation prediction model is provided, and correspondingly, the present application also provides a device for constructing a speech text punctuation prediction model. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application further provides an apparatus for constructing a speech text punctuation prediction model, comprising:
the data determining unit is used for determining a correspondence set among words, acoustic feature information of the words in the voice data to which the words belong, and punctuation mark annotation information of the words;
the network construction unit is used for constructing a network structure of the speech text punctuation prediction model;
and the model training unit is used for learning the punctuation prediction model from the correspondence set.
Twenty-seventh embodiment
In the foregoing embodiment, a method for constructing a speech text punctuation prediction model was provided, and correspondingly, the present application further provides an electronic device. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant points, reference may be made to the descriptions of the method embodiments. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory. The memory is used for storing a program implementing the method for constructing a speech text punctuation prediction model; after the device is powered on and runs the program through the processor, the following steps are executed: determining a correspondence set among words, acoustic feature information of the words in the voice data to which the words belong, and punctuation mark annotation information of the words; constructing a network structure of the speech text punctuation prediction model; and learning the punctuation prediction model from the correspondence set.
Twenty-eighth embodiment
Corresponding to the above voice recognition method, the present application further provides a voice processing method. The execution subject of the method includes, but is not limited to, clients such as mobile communication devices, personal computers, tablets (PADs), iPads, and handheld RF scanners (RF guns), and may be any other device capable of implementing the method. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The voice processing method provided by the present application comprises the following steps:
Step 1: collecting voice data to be transcribed;
Step 2: determining a first text sequence corresponding to the voice data, and determining acoustic feature information of the voice data;
Step 3: determining, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information;
Step 4: performing processing associated with the second text sequence.
The processing associated with the second text sequence may be, for example, displaying the second text sequence, determining voice reply information according to the second text sequence, or determining voice instruction information according to the second text sequence.
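For illustration only, a small dispatch sketch of these alternatives follows; the reply and instruction handlers are caller-supplied placeholders, since this application does not fix their implementation.

```python
# Illustrative dispatch of the alternatives above; the reply and instruction
# handlers are caller-supplied placeholders.
from typing import Callable, Optional

def process_second_text_sequence(
    text: str,
    mode: str = "display",
    make_reply: Optional[Callable[[str], str]] = None,
    run_instruction: Optional[Callable[[str], None]] = None,
):
    """Perform the processing associated with the second text sequence."""
    if mode == "display":
        print(text)                    # display the second text sequence
    elif mode == "reply" and make_reply is not None:
        return make_reply(text)        # determine voice reply information
    elif mode == "instruction" and run_instruction is not None:
        run_instruction(text)          # determine and execute an instruction
```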
In one example, steps 1-4 above are executed if a speech processing condition is satisfied; accordingly, the method may further comprise the following steps: if the speech processing condition is not satisfied, determining a first text sequence corresponding to the voice data, and determining, according to the first text sequence, a third text sequence that corresponds to the voice data and includes punctuation information. Punctuation in the third text sequence includes punctuation associated with the text semantic information.
The speech processing conditions include, but are not limited to: the noise of the voice data collection environment being less than a noise threshold, or the noise of the voice data collection environment being greater than the noise threshold. Other conditions are also possible, for example that the currently available computing resources of the device are greater than a computing resource threshold.
In one example, the speech processing condition is that the noise of the voice data collection environment is less than a noise threshold; the method may then further comprise the step of determining noise data of the voice data collection environment. Mature existing techniques may be used to determine the noise data, for example measuring that the noise reaches x decibels. In this processing mode, when the environmental noise is low, punctuation is predicted by combining text semantic information and acoustic feature information; when the environmental noise is too high, acoustic feature information of sufficient quality cannot be extracted from the voice data, so punctuation is predicted from the text semantic information alone. Computing resources can thereby be effectively saved.
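For illustration only, the following sketch estimates the environment's noise level from a buffer captured while no one is speaking and gates the prediction path on it. The -30 dBFS threshold stands in for the unspecified "x decibels" and, like the two caller-supplied prediction functions, is an assumption.

```python
# Noise-gated path choice, sketched. The ambient buffer is assumed to be
# captured while no one is speaking and normalized to [-1, 1].
import math

import numpy as np

def noise_level_db(ambient: np.ndarray) -> float:
    """RMS level (dBFS) of a normalized ambient-noise buffer."""
    rms = float(np.sqrt(np.mean(np.square(ambient.astype(np.float64)))))
    return 20.0 * math.log10(max(rms, 1e-12))

def punctuate(first_text, audio, ambient, predict_full, predict_text_only,
              noise_threshold_db=-30.0):
    if noise_level_db(ambient) < noise_threshold_db:
        # Quiet environment: combine text semantics with acoustic features.
        return predict_full(first_text, audio)
    # Too noisy to extract reliable acoustic feature information: text only.
    return predict_text_only(first_text)
```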
In specific implementations, the method may further comprise the following steps: 1) determining a user-specified noise threshold; 2) storing the noise threshold. For example, a user interface for setting the noise threshold may be provided, so that a user can set or adjust the threshold according to actual needs.
In another example, the speech processing condition is that the currently available computing resources of the voice processing device are greater than a computing resource threshold; the method may then further comprise the step of determining the currently available computing resources of the device. Relatively mature existing techniques may be employed, such as determining available memory or CPU utilization. In this processing mode, when the device has ample computing resources, punctuation is predicted by combining text semantic information and acoustic feature information; when resources are scarce, punctuation is predicted from the text semantic information alone. The voice processing speed can thereby be effectively improved, improving user experience.
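For illustration only, a resource-gated variant follows, using the psutil library (third-party, must be installed) to read the currently available memory and CPU load; the thresholds are assumptions.

```python
# Resource-gated variant, sketched with psutil; thresholds are illustrative.
import psutil

def resources_available(mem_threshold=512 * 1024 * 1024, cpu_busy=80.0):
    mem_ok = psutil.virtual_memory().available > mem_threshold  # bytes free
    cpu_ok = psutil.cpu_percent(interval=0.1) < cpu_busy        # % utilization
    return mem_ok and cpu_ok

def punctuate(first_text, audio, predict_full, predict_text_only):
    if resources_available():
        return predict_full(first_text, audio)  # semantics + acoustics
    return predict_text_only(first_text)        # semantics only
```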
In yet another example, the method may further comprise the following steps: 1) determining a target voice processing method specified by a user; 2) if the target voice processing method is the present method, the speech processing condition is satisfied. In this processing mode, a user can select a suitable target method from several optional speech processing methods, such as the method provided by the embodiments of the present application or a method that predicts punctuation from text semantic information alone; if the user specifies the method provided by the embodiments of the present application, the speech processing condition is satisfied.
In one example, the method may further comprise the step of displaying voice processing progress information. In this processing mode, the user can perceive the processing progress in real time, for example completion of voice data collection, completion of determination of the first text sequence, completion of determination of the acoustic feature information, and completion of determination of the second text sequence; user experience can thereby be effectively improved.
As can be seen from the foregoing, the voice processing method provided in the embodiments of the present application collects voice data to be transcribed; determines a first text sequence corresponding to the voice data and acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; and performs processing associated with the second text sequence. Because punctuation is predicted from acoustic feature information in addition to text semantic information, and the acoustic feature information reflects the speaker's own intention, the punctuation better matches the spoken language, and the subsequent processing operates on a text sequence with more accurate punctuation; the accuracy of the associated processing can therefore be effectively improved.
Twenty-ninth embodiment
In the foregoing embodiment, a voice interaction method was provided; correspondingly, the present application further provides an ordering device. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant points, reference may be made to the descriptions of the method embodiments. The device embodiments described below are merely illustrative.
The ordering device of this embodiment includes: a voice collection device, a processor, and a memory. The memory is used for storing a program implementing the voice interaction method; after the device is powered on and runs the program through the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic feature information of the voice ordering data; determining, according to the first ordering text sequence and the acoustic feature information, a second ordering text sequence that corresponds to the voice ordering data and includes punctuation information; and determining ordering information according to the second ordering text sequence, so that a second user can prepare the meal according to the ordering information.
As can be seen from the foregoing, the ordering device provided in the embodiments of the present application collects voice ordering data of a first user; determines a first ordering text sequence and acoustic feature information of the voice ordering data; determines, according to these, a second ordering text sequence that includes punctuation information; and determines ordering information according to the second ordering text sequence so that a second user can prepare the meal accordingly. Because punctuation is predicted from acoustic feature information in addition to text semantic information, and the acoustic feature information reflects the ordering person's own intention, the punctuation better matches the spoken language, and the ordering information (such as dish names and personal taste requirements) is determined from an ordering text with more accurate punctuation; ordering accuracy can therefore be effectively improved, improving user experience.
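For illustration only, the following sketch shows how the predicted punctuation can make the downstream parsing of ordering information straightforward: the commas and periods delimit dish entries and taste notes. The splitting rule and the example strings are assumptions.

```python
# Illustration of why accurate punctuation helps downstream ordering logic:
# predicted commas and periods delimit dish entries and taste notes.
import re

def parse_order(second_order_text: str):
    """Split a punctuated ordering sentence into individual items."""
    items = re.split(r"[，,。.？?！!]", second_order_text)
    return [item.strip() for item in items if item.strip()]

# parse_order("Kung pao chicken, extra spicy. One bowl of rice.")
# -> ['Kung pao chicken', 'extra spicy', 'One bowl of rice']
```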
Thirtieth embodiment
In the above embodiment, a voice interaction method is provided, and correspondingly, the application further provides an intelligent sound box. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The intelligent sound box of this embodiment includes: a voice collection device, a processor, and a memory. The memory is used for storing a program implementing the voice interaction method; after the device is powered on and runs the program through the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply information and/or executing the voice instruction information.
As can be seen from the foregoing, the intelligent sound box provided in the embodiments of the present application collects voice data of a first user; determines a first text sequence and acoustic feature information of the voice data; determines, according to these, a second text sequence that includes punctuation information; determines voice reply information and/or voice instruction information according to the second text sequence; and displays the voice reply information and/or executes the voice instruction information. Because punctuation is predicted from acoustic feature information in addition to text semantic information, and the acoustic feature information reflects the speaker's own intention, the punctuation better matches the spoken language, and the reply and/or instruction are determined from a second text sequence with more accurate punctuation; the accuracy of voice replies and voice instructions can therefore be effectively improved, improving user experience.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.

Claims (42)

1. A voice transcription system, comprising:
the server is used for receiving the voice data to be transcribed sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; sending the second text sequence back to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server, and displaying the second text sequence.
2. A method of voice transcription, comprising:
receiving voice data to be transcribed sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
and sending the second text sequence back to the client.
3. The method of claim 2, wherein the punctuation information comprises punctuation information related to textual semantic information and the acoustic feature information of the speech data.
4. The method of claim 2, wherein determining a second text sequence corresponding to the speech data that includes punctuation information based on the first text sequence and the acoustic feature information comprises:
determining first punctuation information related to text semantic information of the speech data according to the first text sequence by a first punctuation prediction subnetwork included in a punctuation prediction model;
determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction subnetwork included in the punctuation prediction model;
and determining the second text sequence according to the second punctuation mark information and the first text sequence.
5. The method of claim 4, wherein the determining the second punctuation information from the first punctuation information and the acoustic feature information by the second punctuation prediction subnetwork included in the punctuation prediction model comprises:
for each word in the first text sequence, determining acoustic feature information of the word;
and, in units of words, taking the paired data of the first punctuation information and the acoustic feature information corresponding to each word as input data of the second punctuation prediction subnetwork, and determining the second punctuation information of each word through the second punctuation prediction subnetwork.
6. The method of claim 5, wherein
the acoustic feature information of the voice data comprises acoustic feature information of a plurality of data frames in units of voice data frames;
the determining acoustic feature information of the word comprises:
determining the acoustic feature information of the word from the acoustic feature information of the plurality of data frames according to time information of the plurality of data frames related to the word.
7. The method of claim 6, wherein
the determining the acoustic feature information of the word from the acoustic feature information of the plurality of data frames according to the time information of the plurality of data frames related to the word comprises:
taking the acoustic feature information of the last data frame related to the word as the acoustic feature information of the word, wherein the acoustic feature information of the last data frame comprises the acoustic feature information of the plurality of data frames.
8. The method of claim 4, further comprising:
and learning to obtain the punctuation mark prediction model from the corresponding relation set between the voice data and the text sequence comprising the punctuation mark information.
9. A method of voice transcription, comprising:
collecting voice data to be transcribed;
sending the voice data to a server;
receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information;
displaying the second text sequence;
wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
10. A speech transcription device, comprising:
the voice data receiving unit is used for receiving the voice data to be transcribed sent by the client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
and the second text sequence loopback unit is used for loopback the second text sequence to the client.
11. A speech transcription device, comprising:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
the voice data sending unit is used for sending the voice data to a server;
a second text sequence receiving unit, configured to receive a second text sequence including punctuation information and corresponding to the voice data, returned by the server;
the second text sequence display unit is used for displaying the second text sequence;
wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
12. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a voice transcription method, the apparatus performing the following steps after being powered on and running the program for the voice transcription method by the processor: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
13. A voice transcription apparatus, characterized by comprising:
a processor; and
a memory for storing a program for implementing a voice transcription method, the apparatus performing the following steps after being powered on and running the program for the voice transcription method by the processor: collecting voice data to be transcribed; sending the voice data to a server; receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information; displaying the second text sequence; wherein the second text sequence is determined by the steps of: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
14. A method for constructing a predictive model of punctuation of speech text, comprising:
determining a corresponding relation set among words, acoustic feature information of the words related to voice data of the words and punctuation mark information of the words;
constructing a network structure of a voice text punctuation symbol prediction model;
and learning the punctuation mark prediction model from the corresponding relation set.
15. A speech transcription device, comprising:
the data determining unit is used for determining a corresponding relation set among the words, the acoustic feature information of the words related to the voice data of the words and the mark information of the punctuation marks of the words;
the network construction unit is used for constructing a network structure of the voice text punctuation prediction model;
and the model training unit is used for obtaining the punctuation mark prediction model from the corresponding relation set learning.
16. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a voice transcription method, the apparatus performing the following steps after being powered on and running the program for the voice transcription method by the processor: determining a corresponding relation set among words, acoustic feature information of the words related to voice data of the words and punctuation mark information of the words; constructing a network structure of a voice text punctuation symbol prediction model; and learning the punctuation mark prediction model from the corresponding relation set.
17. A speech recognition method, comprising:
determining a first text sequence corresponding to voice data to be recognized;
determining acoustic feature information of the voice data;
and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
18. A speech recognition apparatus, comprising:
the first text sequence generating unit is used for determining a first text sequence corresponding to the voice data to be recognized;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
and the second text sequence generating unit is used for determining a second text sequence which corresponds to the voice data and comprises punctuation mark information according to the first text sequence and the acoustic characteristic information.
19. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a speech recognition method, the apparatus performing the following steps after being powered on and running the program for the speech recognition method by the processor: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
20. A voice interaction system, comprising:
the server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; returning the voice reply information to the client;
the client is used for determining the target voice data and sending the voice interaction request to the server; and receiving the voice reply information returned by the server side, and displaying the voice reply information.
21. A method of voice interaction, comprising:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
determining voice reply information according to the second text sequence;
and returning the voice reply information to the client.
22. A method of voice interaction, comprising:
determining target voice data;
sending a voice interaction request aiming at the target voice data to a server;
receiving voice reply information returned by the server;
displaying the voice reply information;
the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
23. A voice interaction apparatus, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and the voice reply message loopback unit is used for loopback the voice reply message to the client.
24. A voice interaction apparatus, comprising:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the target voice data to a server;
the voice reply message receiving unit is used for receiving the voice reply message returned by the server;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
25. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
26. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the target voice data to a server; receiving voice reply information returned by the server; displaying the voice reply information; the voice reply message is determined by adopting the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
27. A voice interaction system, comprising:
the server is used for receiving a voice interaction request aiming at target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; returning the voice instruction information to the client;
the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server and executing the voice instruction information.
28. A method of voice interaction, comprising:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
determining voice instruction information according to the second text sequence;
and returning the voice instruction information to the client.
29. A method of voice interaction, comprising:
determining target voice data;
sending a voice interaction request aiming at the voice data to a server;
receiving voice instruction information returned by the server;
executing the voice instruction information;
wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
30. A voice interaction apparatus, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by a client;
a first text sequence generating unit configured to determine a first text sequence corresponding to the voice data;
an acoustic feature information determination unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine, according to the first text sequence and the acoustic feature information, a second text sequence including punctuation information corresponding to the voice data;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and the voice instruction information loopback unit is used for loopback the voice instruction information to the client.
31. A voice interaction apparatus, comprising:
a voice data determination unit for determining target voice data;
the request sending unit is used for sending a voice interaction request aiming at the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
the voice instruction information execution unit is used for executing the voice instruction information;
wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
32. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
33. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the voice data to a server; receiving voice instruction information returned by the server; executing the voice instruction information; wherein the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
34. A method of speech processing, comprising:
collecting voice data to be transcribed;
determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data;
determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information;
processing associated with the second text sequence is performed.
35. The method of claim 34,
if the voice processing condition is satisfied, executing the method;
the method further comprises the following steps:
if the voice processing condition is not satisfied, determining a first text sequence corresponding to the voice data; and determining a third text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence.
36. The method of claim 35,
the speech processing conditions include: the noise of the voice data acquisition environment is smaller than a noise threshold, or the noise of the voice data acquisition environment is larger than the noise threshold;
the method further comprises the following steps:
noise data of a speech data collection environment is determined.
37. The method of claim 36, further comprising:
determining a user-specified noise threshold;
storing the noise threshold.
38. The method of claim 35, further comprising:
determining a target voice processing method specified by a user;
wherein if the target voice processing method is the method, the voice processing condition is satisfied.
39. The method of claim 34, further comprising:
and displaying the voice processing progress information.
40. The method of claim 39, wherein
the progress information includes at least one of: completion of voice data collection, completion of determination of the first text sequence, completion of determination of the acoustic feature information, and completion of determination of the second text sequence.
41. An ordering device, comprising:
a voice acquisition device;
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic characteristic information of the voice ordering data; determining a second ordering text sequence which corresponds to the voice ordering data and comprises punctuation mark information according to the first ordering text sequence and the acoustic characteristic information; and determining ordering information according to the second ordering text sequence, so that the second user prepares meals according to the ordering information.
42. An intelligent sound box, comprising:
a processor; and
a memory for storing a program for implementing the voice interaction method, wherein after the device is powered on and the program for implementing the voice interaction method is executed by the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic feature information; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply information and/or executing the voice instruction information.
CN201911159513.6A 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment Active CN112837688B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911159513.6A CN112837688B (en) 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment
PCT/CN2020/128950 WO2021098637A1 (en) 2019-11-22 2020-11-16 Voice transliteration method and apparatus, and related system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911159513.6A CN112837688B (en) 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment

Publications (2)

Publication Number Publication Date
CN112837688A true CN112837688A (en) 2021-05-25
CN112837688B CN112837688B (en) 2024-04-02

Family

ID=75922713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159513.6A Active CN112837688B (en) 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment

Country Status (2)

Country Link
CN (1) CN112837688B (en)
WO (1) WO2021098637A1 (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334997A (en) * 2001-04-17 2008-12-31 诺基亚有限公司 Phonetic recognition device independent unconnected with loudspeaker
US20090048843A1 (en) * 2007-08-08 2009-02-19 Nitisaroj Rattima System-effected text annotation for expressive prosody in speech synthesis and recognition
US20120245934A1 (en) * 2011-03-25 2012-09-27 General Motors Llc Speech recognition dependent on text message content
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN104142915A (en) * 2013-05-24 2014-11-12 腾讯科技(深圳)有限公司 Punctuation adding method and system
WO2014187096A1 (en) * 2013-05-24 2014-11-27 Tencent Technology (Shenzhen) Company Limited Method and system for adding punctuation to voice files
CN105243056A (en) * 2015-09-07 2016-01-13 饶志刚 Punctuation mark processing based Chinese syntax analysis method and apparatus
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN107767870A (en) * 2017-09-29 2018-03-06 百度在线网络技术(北京)有限公司 Adding method, device and the computer equipment of punctuation mark
CN108597517A (en) * 2018-03-08 2018-09-28 深圳市声扬科技有限公司 Punctuation mark adding method, device, computer equipment and storage medium
CN108845979A (en) * 2018-05-25 2018-11-20 科大讯飞股份有限公司 A kind of speech transcription method, apparatus, equipment and readable storage medium storing program for executing
CN109558576A (en) * 2018-11-05 2019-04-02 中山大学 A kind of punctuation mark prediction technique based on from attention mechanism
CN109448704A (en) * 2018-11-20 2019-03-08 北京智能管家科技有限公司 Construction method, device, server and the storage medium of tone decoding figure

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048714A (en) * 2022-01-14 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Method and device for standardizing reverse text

Also Published As

Publication number Publication date
CN112837688B (en) 2024-04-02
WO2021098637A1 (en) 2021-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant