CN116110392A - Correction method and device for voice recognition result

Correction method and device for voice recognition result

Info

Publication number
CN116110392A
Authority
CN
China
Prior art keywords
target
vector
user
robot
pinyin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310115537.1A
Other languages
Chinese (zh)
Inventor
张献涛
支涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunji Technology Co Ltd
Original Assignee
Beijing Yunji Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunji Technology Co Ltd
Priority to CN202310115537.1A
Publication of CN116110392A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to the technical field of speech recognition and provides a method and a device for correcting a speech recognition result. The method comprises the following steps: acquiring, in the current round of dialogue between a target robot and a target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence; generating a target pinyin sequence of the target user sentence; processing the target robot sentence, the target user sentence, and the target pinyin sequence separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; splicing the target user vector and the target pinyin vector to obtain a target spliced vector; and inputting the target robot vector and the target spliced vector into a speech correction model and outputting the corrected target user sentence. These technical means solve the prior-art problem of low accuracy in recognizing user speech.

Description

Correction method and device for voice recognition result
Technical Field
The disclosure relates to the technical field of speech recognition, and in particular to a method and a device for correcting a speech recognition result.
Background
With the deepening development of digitalization and intelligence in various fields, more and more intelligent devices play a role in daily life. Various intelligent voice devices, such as smart speakers, smartphones, and intelligent robots, already support voice dialogue. In a voice dialogue between a robot or an intelligent customer service agent and a user, the user's speech must be recognized using automatic speech recognition technology, but the recognition results contain many errors.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following technical problem: the accuracy of recognizing user speech is low.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a computer-readable storage medium for correcting a speech recognition result, so as to solve the problem in the prior art that the accuracy of recognizing the speech of a user is low.
In a first aspect, an embodiment of the present disclosure provides a method for correcting a speech recognition result, comprising: acquiring, in the current round of dialogue between a target robot and a target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence; generating a target pinyin sequence of the target user sentence; processing the target robot sentence, the target user sentence, and the target pinyin sequence separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; splicing the target user vector and the target pinyin vector to obtain a target spliced vector; and inputting the target robot vector and the target spliced vector into a speech correction model and outputting the corrected target user sentence.
In a second aspect, an embodiment of the present disclosure provides a device for correcting a speech recognition result, comprising: an acquisition module configured to acquire, in the current round of dialogue between a target robot and a target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence; a generation module configured to generate a target pinyin sequence of the target user sentence; a processing module configured to process the target robot sentence, the target user sentence, and the target pinyin sequence separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; a splicing module configured to splice the target user vector and the target pinyin vector to obtain a target spliced vector; and a correction module configured to input the target robot vector and the target spliced vector into a speech correction model and output the corrected target user sentence.
In a third aspect, an embodiment of the present disclosure provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: in the current round of dialogue between the target robot and the target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence are acquired; a target pinyin sequence of the target user sentence is generated; the target robot sentence, the target user sentence, and the target pinyin sequence are processed separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; the target user vector and the target pinyin vector are spliced to obtain a target spliced vector; and the target robot vector and the target spliced vector are input into a speech correction model to output the corrected target user sentence. By adopting these technical means, the problem of low accuracy in recognizing user speech in the prior art can be solved, and the accuracy of recognizing user speech can be improved.
Drawings
In order to describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for correcting a speech recognition result according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a device for correcting a speech recognition result according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations and techniques, in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A method and apparatus for correcting a speech recognition result according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 104, including but not limited to smartphones, tablets, and laptop and desktop computers; when the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices described above. The terminal devices 101, 102, and 103 may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module; the embodiments of the present disclosure do not limit this. Further, various applications may be installed on the terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, and shopping applications.
The server 104 may be a server providing various services, for example, a background server that receives requests sent by terminal devices with which it has established a communication connection; the background server may receive and analyze a request sent by a terminal device and generate a processing result. The server 104 may be a single server, a server cluster formed by multiple servers, or a cloud computing service center; the embodiments of the present disclosure do not limit this.
The server 104 may be hardware or software. When the server 104 is hardware, it may be various electronic devices that provide various services to the terminal devices 101, 102, and 103. When the server 104 is software, it may be a plurality of software or software modules providing various services to the terminal devices 101, 102, and 103, or may be a single software or software module providing various services to the terminal devices 101, 102, and 103, which is not limited by the embodiments of the present disclosure.
The network 105 may be a wired network using coaxial cable, twisted pair wire, and optical fiber connection, or may be a wireless network that can implement interconnection of various communication devices without wiring, for example, bluetooth (Bluetooth), near field communication (Near Field Communication, NFC), infrared (Infrared), etc., which are not limited by the embodiments of the present disclosure.
The user can establish a communication connection with the server 104 via the network 105 through the terminal devices 101, 102, and 103 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for correcting a speech recognition result according to an embodiment of the present disclosure. The method of correcting the speech recognition result of fig. 2 may be performed by the terminal device or the server of fig. 1. As shown in fig. 2, the method for correcting the speech recognition result includes:
s201, acquiring target robot sentences sent by the target robot and target user sentences which are identified by the target user and answer the target robot sentences in the current round of dialogue between the target robot and the target user;
s202, generating a target pinyin sequence of a target user statement;
s203, respectively processing the target robot sentence, the target user sentence and the target pinyin sequence by using a word vector model to obtain a target robot vector, a target user vector and a target pinyin vector;
s204, splicing the target user vector and the target pinyin vector to obtain a target spliced vector;
s205, inputting the target robot vector and the target splicing vector into a voice correction model, and outputting corrected target user sentences.
Typically, within one round of dialogue the robot sentence comes first and the user sentence follows. The word "target" in the embodiments of the present disclosure only distinguishes the use (inference) process from the training process of the speech correction model and has no other meaning. The robot may also be an intelligent customer service agent. The target pinyin sequence of the target user sentence may be obtained by annotating each character in the target user sentence with its pinyin. The word vector model may be a word2vec model. The target robot sentence, the target user sentence, and the target pinyin sequence are processed separately with the word vector model to obtain the target robot vector of the target robot sentence, the target user vector of the target user sentence, and the target pinyin vector of the target pinyin sequence. The speech correction model may be any neural network model commonly used in the acoustics field, and is not limited here.
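For illustration, a minimal sketch of this vectorization step is given below. The use of pypinyin for pinyin annotation and of gensim's word2vec implementation are assumptions; the disclosure only states that the word vector model may be a word2vec model, and the example sentences, toy corpus, and vector size are placeholders.

```python
# Minimal sketch of the vectorization step (illustration only). pypinyin and
# gensim are assumed to be installed; the embodiment only states that the
# word vector model "may be a word2vec model".
from pypinyin import lazy_pinyin, Style
from gensim.models import Word2Vec

robot_sentence = list("请问您需要什么")  # hypothetical robot sentence, per character
user_sentence = list("我要求送物")       # recognized user sentence, per character
pinyin_sequence = lazy_pinyin("我要求送物", style=Style.TONE3)
# pinyin_sequence -> ['wo3', 'yao1', 'qiu2', 'song4', 'wu4']

# One word2vec model over characters and pinyin tokens (toy corpus here).
w2v = Word2Vec([robot_sentence, user_sentence, pinyin_sequence],
               vector_size=64, min_count=1, window=2)

robot_vector = [w2v.wv[token] for token in robot_sentence]    # target robot vector
user_vector = [w2v.wv[token] for token in user_sentence]      # target user vector
pinyin_vector = [w2v.wv[token] for token in pinyin_sequence]  # target pinyin vector
```

In practice the word2vec model would be trained once over the whole training corpus rather than over a single dialogue round, but the lookup shown here is the same.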
According to the technical solution provided by this embodiment of the present disclosure, in the current round of dialogue between the target robot and the target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence are acquired; a target pinyin sequence of the target user sentence is generated; the target robot sentence, the target user sentence, and the target pinyin sequence are processed separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; the target user vector and the target pinyin vector are spliced to obtain a target spliced vector; and the target robot vector and the target spliced vector are input into the speech correction model to output the corrected target user sentence. By these technical means, the problem of low accuracy in recognizing user speech in the prior art can be solved, and the accuracy of recognizing user speech can be improved.
Before the target robot vector and the target spliced vector are input into the speech correction model and the corrected target user sentence is output, the method further comprises: acquiring a training corpus, wherein the training corpus comprises multiple rounds of dialogue between a robot and a user, and each round of dialogue comprises a robot sentence issued by the robot and a user sentence with which the user answers the robot sentence; correcting the user sentence in each round of dialogue according to the robot sentence in that round; generating a pinyin sequence of the user sentence in each round of dialogue; processing the robot sentence, the user sentence, and the pinyin sequence in each round of dialogue separately with the word vector model to obtain a robot vector, a user vector, and a pinyin vector; splicing the user vector and the pinyin vector in each round of dialogue to obtain the spliced vector of that round; and training the speech correction model by taking the robot vector and the spliced vector of each round of dialogue as input and the corrected user sentence of that round as output.
Correcting the user sentence in each round of dialogue means correcting the places where recognition erred. Splicing the user vector and the pinyin vector in each round of dialogue means stacking the two vectors. Training the speech correction model means learning the correspondence between robot vectors plus spliced vectors and corrected user sentences and storing it in the model.
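For concreteness, the assembly of training pairs described above might be sketched as follows; the function and helper names (build_training_pairs, vectorize) and the tuple layout are hypothetical, and the disclosure does not specify the architecture of the speech correction model itself.

```python
from pypinyin import lazy_pinyin, Style

# Hypothetical assembly of training pairs. Each dialogue round is a tuple of
# (robot sentence, recognized user sentence, manually corrected user sentence);
# `vectorize` stands for the word2vec lookup sketched earlier.
def build_training_pairs(dialogue_rounds, vectorize):
    pairs = []
    for robot_sent, user_sent, corrected_sent in dialogue_rounds:
        pinyin_seq = lazy_pinyin(user_sent, style=Style.TONE3)
        robot_vec = vectorize(list(robot_sent))
        user_vec = vectorize(list(user_sent))
        pinyin_vec = vectorize(pinyin_seq)
        spliced_vec = user_vec + pinyin_vec  # group-wise splicing (see below)
        # model input: (robot vector, spliced vector); training target:
        # the corrected user sentence of the same dialogue round
        pairs.append(((robot_vec, spliced_vec), corrected_sent))
    return pairs
```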
Before the training corpus is acquired, the method further comprises: acquiring a historical dialogue record of the robot and the user; segmenting the historical dialogue record using punctuation marks as the segmentation basis to obtain multiple rounds of dialogue; and constructing the training corpus from the multiple rounds of dialogue obtained by segmentation.
The historical dialogue record is segmented at periods, semicolons, question marks, and exclamation marks.
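A minimal sketch of this segmentation step follows; treating both the full-width and the ASCII forms of these marks as boundaries is an assumption.

```python
import re

# Split a raw history dialogue record at periods, semicolons, question marks,
# and exclamation marks (full-width and ASCII forms).
def split_into_rounds(record: str) -> list[str]:
    parts = re.split(r"[。；？！.;?!]", record)
    return [part.strip() for part in parts if part.strip()]

print(split_into_rounds("请问您需要什么？我要求送物。"))
# -> ['请问您需要什么', '我要求送物']
```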
Generating the pinyin sequence of the user sentence in each round of dialogue comprises: splitting the user sentence in each round of dialogue character by character to obtain the user character sequence corresponding to the user sentence in that round; and annotating each character in the user character sequence corresponding to the user sentence in each round of dialogue with its pinyin to obtain the pinyin sequence of the user sentence in that round.
It should be noted that the user character sequence preserves the original order of the characters in the user sentence. When each character is annotated with pinyin, the tone must be annotated as well; embodiments of the present disclosure use the digits 1, 2, 3, 4, and 5 to represent the first, second, third, and fourth tones and the neutral (light) tone, respectively. For example, the pinyin of "I" (我) is "wo3".
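As an illustration of this tone convention, pypinyin's Style.TONE3 appends the tone digit in exactly this way; the choice of library (and its neutral_tone_with_five option) is an assumption, not part of the disclosure.

```python
from pypinyin import lazy_pinyin, Style

# Tone digits 1-4 mark the four tones; 5 marks the neutral (light) tone.
print(lazy_pinyin("我", style=Style.TONE3))  # ['wo3']
print(lazy_pinyin("我们", style=Style.TONE3, neutral_tone_with_five=True))
# ['wo3', 'men5']: "们" carries the neutral tone
```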
Annotating each character in the user character sequence corresponding to the user sentence in each round of dialogue with its pinyin to obtain the pinyin sequence of the user sentence in that round comprises: when a polyphonic character is present, determining the pronunciation of the polyphonic character according to the words it forms with the surrounding characters in the user sentence in which it is located; and when a polyphonic character is present but its pronunciation cannot be determined, permuting and combining each pronunciation of the polyphonic character with the pinyin of the other characters in the user sentence in which it is located, in the order of the corresponding user character sequence, to obtain multiple pinyin sequences of that user sentence.
In "I request an item be delivered" (我要求送物), the characters "I" (我), "request" (求), "deliver" (送), and "item" (物) each have only one pronunciation: "wo3", "qiu2", "song4", and "wu4". But "要" is a polyphonic character that may be pronounced "yao1" or "yao4". By dictionary lookup it can be found that within the word "要求" it is pronounced "yao1", giving "yao1 qiu2". The final pinyin sequence is "wo3 yao1 qiu2 song4 wu4".
For "要送物" ("want an item delivered"), the exact pronunciation of "要" cannot be determined. In this case all pronunciations of the polyphonic character are kept, giving 2 × 1 × 1 = 2 combinations, and two pinyin sequences are finally obtained: "yao1 song4 wu4" and "yao4 song4 wu4".
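This enumeration of candidate pinyin sequences can be sketched with itertools.product; the per-character candidate lists below stand in for a pronunciation dictionary and are assumptions.

```python
from itertools import product

# Candidate pronunciations per character of "要送物"; the reading of the
# polyphone "要" cannot be resolved here, so both readings are kept.
candidates = [["yao1", "yao4"], ["song4"], ["wu4"]]

pinyin_sequences = [" ".join(combo) for combo in product(*candidates)]
print(pinyin_sequences)  # 2 * 1 * 1 = 2 combinations:
# ['yao1 song4 wu4', 'yao4 song4 wu4']
```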
Splicing the user vector and the pinyin vector in each round of dialogue to obtain the spliced vector of that round comprises: splicing the user vector and the pinyin vector sequentially, taking a group vector as a unit, to obtain the spliced vector of that round, wherein each user vector and each pinyin vector is one group vector; or splicing the user vector and the pinyin vector alternately, taking a word vector as a unit, to obtain the spliced vector of that round, wherein each user vector and each pinyin vector comprises multiple word vectors.
Take "biology" (生物) and its pinyin "sheng1 wu4" as an example: the vector of "生" is v1_1 and the vector of "物" is v1_2; the vector of "sheng1" is v2_1 and the vector of "wu4" is v2_2. They are finally spliced either alternately into v1_1 v2_1 v1_2 v2_2 or sequentially into v1_1 v1_2 v2_1 v2_2.
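Both splicing orders reduce to simple concatenation. A sketch with numpy placeholders for the four word vectors follows; the vector values and the dimension are illustrative only.

```python
import numpy as np

# Illustrative word vectors for "生物" and its pinyin "sheng1 wu4".
v1_1, v1_2 = np.full(4, 1.0), np.full(4, 2.0)  # vectors of "生" and "物"
v2_1, v2_2 = np.full(4, 3.0), np.full(4, 4.0)  # vectors of "sheng1" and "wu4"

# Sequential (group-wise) splicing: character vectors first, then pinyin vectors.
sequential = np.concatenate([v1_1, v1_2, v2_1, v2_2])

# Alternating splicing: each character vector followed by its pinyin vector.
alternating = np.concatenate([v1_1, v2_1, v1_2, v2_2])
```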
Inputting the target robot vector and the target spliced vector into the speech correction model and outputting the corrected target user sentence comprises: when the target user sentence has multiple target pinyin sequences, determining the output with the highest score among the multiple outputs of the speech correction model as the corrected target user sentence.
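A sketch of this selection step is given below; the scoring interface of the speech correction model is an assumption, since the disclosure does not specify how the model exposes its scores.

```python
# Hypothetical selection among the corrections produced for the multiple
# candidate pinyin sequences of one target user sentence. `model` is assumed
# to return a (corrected_sentence, score) pair for each spliced vector.
def pick_best_correction(model, robot_vector, spliced_candidates):
    outputs = [model(robot_vector, spliced) for spliced in spliced_candidates]
    best_sentence, _best_score = max(outputs, key=lambda out: out[1])
    return best_sentence
```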
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a device for correcting a speech recognition result according to an embodiment of the disclosure. As shown in fig. 3, the device for correcting the speech recognition result includes:
an obtaining module 301 configured to obtain, in the current round of dialogue between a target robot and a target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence;
a generation module 302 configured to generate a target pinyin sequence of the target user sentence;
a processing module 303 configured to process the target robot sentence, the target user sentence, and the target pinyin sequence separately with the word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector;
a splicing module 304 configured to splice the target user vector and the target pinyin vector to obtain a target spliced vector;
and a correction module 305 configured to input the target robot vector and the target spliced vector into the speech correction model and output the corrected target user sentence.
The search text may be a query expression entered by a user for retrieval, and it defines the retrieval conditions; the dialogue corpus to be searched is a data set within which the user searches for the required information. A search result is determined from the dialogue corpus set according to the matching score of each dialogue corpus, and the one or more dialogue corpora with the highest scores may be taken as the search result.
Typically, within one round of dialogue the robot sentence comes first and the user sentence follows. The word "target" in the embodiments of the present disclosure only distinguishes the use (inference) process from the training process of the speech correction model and has no other meaning. The robot may also be an intelligent customer service agent. The target pinyin sequence of the target user sentence may be obtained by annotating each character in the target user sentence with its pinyin. The word vector model may be a word2vec model. The target robot sentence, the target user sentence, and the target pinyin sequence are processed separately with the word vector model to obtain the target robot vector of the target robot sentence, the target user vector of the target user sentence, and the target pinyin vector of the target pinyin sequence. The speech correction model may be any neural network model commonly used in the acoustics field, and is not limited here.
According to the technical solution provided by this embodiment of the present disclosure, in the current round of dialogue between the target robot and the target user, the target robot sentence issued by the target robot and the recognized target user sentence with which the target user answers the target robot sentence are acquired; a target pinyin sequence of the target user sentence is generated; the target robot sentence, the target user sentence, and the target pinyin sequence are processed separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector; the target user vector and the target pinyin vector are spliced to obtain a target spliced vector; and the target robot vector and the target spliced vector are input into the speech correction model to output the corrected target user sentence. By these technical means, the problem of low accuracy in recognizing user speech in the prior art can be solved, and the accuracy of recognizing user speech can be improved.
Optionally, the correction module 305 is further configured to: acquire a training corpus, wherein the training corpus comprises multiple rounds of dialogue between a robot and a user, and each round of dialogue comprises a robot sentence issued by the robot and a user sentence with which the user answers the robot sentence; correct the user sentence in each round of dialogue according to the robot sentence in that round; generate a pinyin sequence of the user sentence in each round of dialogue; process the robot sentence, the user sentence, and the pinyin sequence in each round of dialogue separately with the word vector model to obtain a robot vector, a user vector, and a pinyin vector; splice the user vector and the pinyin vector in each round of dialogue to obtain the spliced vector of that round; and train the speech correction model by taking the robot vector and the spliced vector of each round of dialogue as input and the corrected user sentence of that round as output.
Correcting the user sentence in each round of dialogue means correcting the places where recognition erred. Splicing the user vector and the pinyin vector in each round of dialogue means stacking the two vectors. Training the speech correction model means learning the correspondence between robot vectors plus spliced vectors and corrected user sentences and storing it in the model.
Optionally, the correction module 305 is further configured to: acquire a historical dialogue record of the robot and the user; segment the historical dialogue record using punctuation marks as the segmentation basis to obtain multiple rounds of dialogue; and construct the training corpus from the multiple rounds of dialogue obtained by segmentation.
The historical dialogue record is segmented at periods, semicolons, question marks, and exclamation marks.
Optionally, the correction module 305 is further configured to: split the user sentence in each round of dialogue character by character to obtain the user character sequence corresponding to the user sentence in that round; and annotate each character in the user character sequence corresponding to the user sentence in each round of dialogue with its pinyin to obtain the pinyin sequence of the user sentence in that round.
It should be noted that the user character sequence preserves the original order of the characters in the user sentence. When each character is annotated with pinyin, the tone must be annotated as well; embodiments of the present disclosure use the digits 1, 2, 3, 4, and 5 to represent the first, second, third, and fourth tones and the neutral (light) tone, respectively. For example, the pinyin of "I" (我) is "wo3".
Optionally, the correction module 305 is further configured to: when a polyphonic character is present, determine the pronunciation of the polyphonic character according to the words it forms with the surrounding characters in the user sentence in which it is located; and when a polyphonic character is present but its pronunciation cannot be determined, permute and combine each pronunciation of the polyphonic character with the pinyin of the other characters in the user sentence in which it is located, in the order of the corresponding user character sequence, to obtain multiple pinyin sequences of that user sentence.
In "I request an item be delivered" (我要求送物), the characters "I" (我), "request" (求), "deliver" (送), and "item" (物) each have only one pronunciation: "wo3", "qiu2", "song4", and "wu4". But "要" is a polyphonic character that may be pronounced "yao1" or "yao4". By dictionary lookup it can be found that within the word "要求" it is pronounced "yao1", giving "yao1 qiu2". The final pinyin sequence is "wo3 yao1 qiu2 song4 wu4".
For "要送物" ("want an item delivered"), the exact pronunciation of "要" cannot be determined. In this case all pronunciations of the polyphonic character are kept, giving 2 × 1 × 1 = 2 combinations, and two pinyin sequences are finally obtained: "yao1 song4 wu4" and "yao4 song4 wu4".
Optionally, the correction module 305 is further configured to: splice the user vector and the pinyin vector in each round of dialogue sequentially, taking a group vector as a unit, to obtain the spliced vector of that round, wherein each user vector and each pinyin vector is one group vector; or splice the user vector and the pinyin vector in each round of dialogue alternately, taking a word vector as a unit, to obtain the spliced vector of that round, wherein each user vector and each pinyin vector comprises multiple word vectors.
Take "biology" (生物) and its pinyin "sheng1 wu4" as an example: the vector of "生" is v1_1 and the vector of "物" is v1_2; the vector of "sheng1" is v2_1 and the vector of "wu4" is v2_2. They are finally spliced either alternately into v1_1 v2_1 v1_2 v2_2 or sequentially into v1_1 v1_2 v2_1 v2_2.
Optionally, the correction module 305 is further configured to: when the target user sentence has multiple target pinyin sequences, determine the output with the highest score among the multiple outputs of the speech correction model as the corrected target user sentence.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. The processor 401 implements the steps of the above method embodiments when executing the computer program 403; alternatively, the processor 401 implements the functions of the modules/units in the above device embodiments when executing the computer program 403.
Illustratively, the computer program 403 may be divided into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to complete the present disclosure. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and these instruction segments describe the execution of the computer program 403 in the electronic device 4.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not meant to be limiting of the electronic device 4, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may also include an input-output device, a network access device, a bus, etc.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Further, the memory 402 may also include both internal storage units and external storage devices of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed or described in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other manners. For example, the apparatus/electronic device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions of actual implementations, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals according to legislation and patent practice.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A method for correcting a speech recognition result, comprising:
acquiring, in a current round of dialogue between a target robot and a target user, a target robot sentence issued by the target robot and a recognized target user sentence with which the target user answers the target robot sentence;
generating a target pinyin sequence of the target user statement;
respectively processing the target robot sentence, the target user sentence and the target pinyin sequence by using a word vector model to obtain a target robot vector, a target user vector and a target pinyin vector;
splicing the target user vector and the target pinyin vector to obtain a target spliced vector;
and inputting the target robot vector and the target spliced vector into a speech correction model, and outputting the corrected target user sentence.
2. The method of claim 1, wherein before the target robot vector and the target spliced vector are input into the speech correction model and the corrected target user sentence is output, the method further comprises:
acquiring a training corpus, wherein the training corpus comprises multiple rounds of dialogue between a robot and a user, and each round of dialogue comprises a robot sentence issued by the robot and a user sentence with which the user answers the robot sentence;
correcting the user sentence in each round of dialogue according to the robot sentence in that round of dialogue;
generating a pinyin sequence of the user sentence in each round of dialogue;
processing the robot sentence, the user sentence, and the pinyin sequence in each round of dialogue separately with a word vector model to obtain a robot vector, a user vector, and a pinyin vector;
splicing the user vector and the pinyin vector in each round of dialogue to obtain a spliced vector for that round of dialogue;
and training the speech correction model by taking the robot vector and the spliced vector in each round of dialogue as input and the corrected user sentence in that round of dialogue as output.
3. The method of claim 2, wherein before the training corpus is acquired, the method further comprises:
acquiring a historical dialogue record of the robot and the user;
segmenting the historical dialogue record using punctuation marks as the segmentation basis to obtain multiple rounds of dialogue;
and constructing the training corpus from the multiple rounds of dialogue obtained by segmentation.
4. The method of claim 2, wherein generating the pinyin sequence of the user sentence in each round of dialogue comprises:
splitting the user sentence in each round of dialogue character by character to obtain a user character sequence corresponding to the user sentence in that round of dialogue;
and annotating each character in the user character sequence corresponding to the user sentence in each round of dialogue with its pinyin to obtain the pinyin sequence of the user sentence in that round of dialogue.
5. The method of claim 4, wherein annotating each character in the user character sequence corresponding to the user sentence in each round of dialogue with its pinyin to obtain the pinyin sequence of the user sentence in that round of dialogue comprises:
when a polyphonic character is present, determining the pronunciation of the polyphonic character according to the words it forms with the surrounding characters in the user sentence;
and when a polyphonic character is present but its pronunciation cannot be determined, permuting and combining each pronunciation of the polyphonic character with the pinyin of the other characters in the user sentence in which it is located, in the order of the user character sequence corresponding to that user sentence, to obtain multiple pinyin sequences of that user sentence.
6. The method of claim 2, wherein splicing the user vector and the pinyin vector in each round of dialogue to obtain the spliced vector for each round of dialogue comprises:
splicing the user vector and the pinyin vector in each round of dialogue sequentially, taking a group vector as a unit, to obtain the spliced vector for that round of dialogue, wherein each user vector and each pinyin vector is one group vector; or
splicing the user vector and the pinyin vector in each round of dialogue alternately, taking a word vector as a unit, to obtain the spliced vector for that round of dialogue, wherein each user vector and each pinyin vector comprises multiple word vectors.
7. The method of claim 1, wherein inputting the target robot vector and the target spliced vector into the speech correction model and outputting the corrected target user sentence comprises:
when the target user sentence has multiple target pinyin sequences, determining the output with the highest score among the multiple outputs of the speech correction model as the corrected target user sentence.
8. A device for correcting a speech recognition result, comprising:
an acquisition module configured to acquire, in a current round of dialogue between a target robot and a target user, a target robot sentence issued by the target robot and a recognized target user sentence with which the target user answers the target robot sentence;
a generation module configured to generate a target pinyin sequence of the target user sentence;
a processing module configured to process the target robot sentence, the target user sentence, and the target pinyin sequence separately with a word vector model to obtain a target robot vector, a target user vector, and a target pinyin vector;
a splicing module configured to splice the target user vector and the target pinyin vector to obtain a target spliced vector;
and a correction module configured to input the target robot vector and the target spliced vector into a speech correction model and output the corrected target user sentence.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202310115537.1A 2023-02-08 2023-02-08 Correction method and device for voice recognition result Pending CN116110392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310115537.1A CN116110392A (en) 2023-02-08 2023-02-08 Correction method and device for voice recognition result


Publications (1)

Publication Number Publication Date
CN116110392A 2023-05-12

Family

ID=86263649


Country Status (1)

Country Link
CN (1) CN116110392A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination