CN113241061A - Method and device for processing voice recognition result, electronic equipment and storage medium - Google Patents

Method and device for processing voice recognition result, electronic equipment and storage medium

Info

Publication number
CN113241061A
CN113241061A
Authority
CN
China
Prior art keywords
target
target text
preset
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110533064.8A
Other languages
Chinese (zh)
Other versions
CN113241061B (en)
Inventor
王乾坤
杜春赛
姚佳立
徐文铭
杨晶生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110533064.8A priority Critical patent/CN113241061B/en
Publication of CN113241061A publication Critical patent/CN113241061A/en
Application granted granted Critical
Publication of CN113241061B publication Critical patent/CN113241061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a method and device for processing a speech recognition result, an electronic device, and a storage medium. One embodiment of the method comprises: determining at least one pronunciation corresponding to each character in a target text according to the target text and corresponding target audio, wherein the target text is obtained by performing speech recognition on the target audio; determining, according to the at least one pronunciation corresponding to each character in the target text, whether target content consistent with the pronunciation of a preset word exists in the target text; and, in a case that the target content exists in the target text, modifying the target content into the preset word. The method and device improve the accuracy of sound-character alignment and thereby improve the accuracy of error correction for speech-recognized text.

Description

Method and device for processing voice recognition result, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the technical field of speech recognition, and in particular to a method and device for processing a speech recognition result, an electronic device, and a storage medium.
Background
Automatic Speech Recognition (ASR) technology uses large-scale language pattern recognition and autonomous learning to predict conversational context and to perform centralized analysis and processing of the sound signals generated by various services, thereby enabling efficient speech-to-text transcription.
Speech-recognized text often contains errors and needs correction. For example, proper nouns such as personal names and technical terms are difficult to recognize and have a high error rate: they are often transcribed as common words and must be corrected back to the intended proper nouns. Prior-art error correction approaches suffer from low accuracy.
It is therefore necessary to provide a new technical solution for processing speech recognition results.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for processing a voice recognition result, electronic equipment and a storage medium.
In a first aspect, the present disclosure provides a method for processing a speech recognition result, including:
determining at least one pronunciation corresponding to each character in the target text according to the target text and the corresponding target audio, wherein the target text is obtained by performing voice recognition on the target audio;
determining whether target content consistent with the pronunciation of a preset word exists in the target text or not according to at least one pronunciation corresponding to each character in the target text;
and under the condition that the target content exists in the target text, modifying the target content into a preset word.
In some optional embodiments, determining at least one pronunciation corresponding to each word in the target text according to the target text and the corresponding target audio includes:
and inputting the target text and the target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text.
In some optional embodiments, in the case that the target content exists in the target text, modifying the target content into a preset word includes:
determining whether the target content needs to be modified or not according to the target content, the preset words and the related content of the target content in the target text;
and under the condition that the target content is determined to need to be modified, modifying the target content into a preset word.
In some optional embodiments, determining whether the target content needs to be modified according to the target content, the preset word and the related content of the target content in the target text includes:
and inputting the target content, the preset words and the related content of the target content in the target text into a pre-trained second machine learning model to obtain a judgment result of whether the target content needs to be modified.
In some optional embodiments, determining whether target content consistent with the pronunciation of the preset word exists in the target text according to at least one pronunciation corresponding to each word in the target text includes:
determining the frequency grade of the preset words, wherein the frequency grade represents the occurrence frequency or the occurrence probability of the preset words in the target text;
under the condition that the frequency level of the preset words is a first level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a first preset number of pronunciations corresponding to each word in the target text;
under the condition that the frequency level of the preset words is a second level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a second preset number of pronunciations corresponding to each word in the target text;
the first grade is higher than the second grade, and the first preset number is larger than the second preset number.
In some alternative embodiments, the predetermined words are from a predetermined set of words, and the pronunciations of the words in the predetermined set of words are stored in a dictionary tree manner.
In some optional embodiments, the target audio is an audio of the target conference, and the preset word is a hotword corresponding to the target conference.
In a second aspect, the present disclosure provides a device for processing a speech recognition result, including:
a sound-character alignment unit, configured to input a target text and corresponding target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text, wherein the target text is obtained by performing speech recognition on the target audio;
the matching unit is used for determining whether target content consistent with the pronunciation of a preset word exists in the target text according to at least one pronunciation corresponding to each character in the target text;
and the modifying unit is used for modifying the target content into preset words under the condition that the target content exists in the target text.
In some optional embodiments, the phonetic-word alignment unit is further configured to:
and inputting the target text and the target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text.
In some optional embodiments, the modifying unit is further configured to:
determining whether the target content needs to be modified or not according to the target content, the preset words and the related content of the target content in the target text;
and under the condition that the target content is determined to need to be modified, modifying the target content into a preset word.
In some optional embodiments, the modifying unit is further configured to:
and inputting the target content, the preset words and the related content of the target content in the target text into a pre-trained second machine learning model to obtain a judgment result of whether the target content needs to be modified.
In some optional embodiments, the matching unit is further configured to:
determining the frequency grade of the preset words, wherein the frequency grade represents the occurrence frequency or the occurrence probability of the preset words in the target text;
under the condition that the frequency level of the preset words is a first level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a first preset number of pronunciations corresponding to each word in the target text;
under the condition that the frequency level of the preset words is a second level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a second preset number of pronunciations corresponding to each word in the target text;
the first grade is higher than the second grade, and the first preset number is larger than the second preset number.
In some alternative embodiments, the predetermined words are from a predetermined set of words, and the pronunciations of the words in the predetermined set of words are stored in a dictionary tree manner.
In some optional embodiments, the target audio is an audio of the target conference, and the preset word is a hotword corresponding to the target conference.
In a third aspect, the present disclosure provides an electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any embodiment of the first aspect of the disclosure.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any one of the embodiments of the first aspect of the present disclosure.
According to the method, device, electronic device, and storage medium for processing a speech recognition result provided by embodiments of the present disclosure, the pronunciations corresponding to the characters in the target text are determined from the target text and the corresponding target audio, so the pronunciations of the characters can be obtained accurately; error correction is then performed on this basis in combination with the preset words, which improves the accuracy of error correction for speech-recognized text.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is a system architecture diagram of one embodiment of a speech recognition result processing system according to the present disclosure;
FIG. 2 is a flow diagram for one embodiment of a speech recognition result processing method according to the present disclosure;
FIG. 3 is a decomposed flow diagram of one embodiment of the pronunciation matching step according to the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of a speech recognition result processing apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the speech recognition result processing method, apparatus, terminal device, and storage medium of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a voice interaction application, a video conference application, a short video social application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a microphone and a speaker, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide a speech recognition result processing service) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, such as a background server that provides processing services for target text and target audio captured on the terminal devices 101, 102, 103. The background server can perform corresponding processing on the received target text, the target audio and the like.
In some cases, the speech recognition result processing method provided by the present disclosure may be executed by the terminal devices 101, 102, 103 and the server 105 together. For example, the step of "determining at least one pronunciation corresponding to each word in the target text according to the target text and the corresponding target audio" may be executed by the server 105, and the step of "determining whether there is target content in the target text that is consistent with the pronunciation of the preset word according to the at least one pronunciation corresponding to each word in the target text" may be executed by the terminal devices 101, 102, 103. The present disclosure is not limited thereto. Accordingly, the speech recognition result processing means may be provided in the terminal devices 101, 102, 103 and the server 105, respectively.
In some cases, the speech recognition result processing method provided by the present disclosure may be executed by the terminal devices 101, 102, and 103, and accordingly, the speech recognition result processing apparatus may also be disposed in the terminal devices 101, 102, and 103, and in this case, the system architecture 100 may not include the server 105.
In some cases, the speech recognition result processing method provided by the present disclosure may be executed by the server 105, and accordingly, the speech recognition result processing apparatus may also be disposed in the server 105, and in this case, the system architecture 100 may not include the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to fig. 2, a flow 200 of one embodiment of a speech recognition result processing method according to the present disclosure is shown, applied to the terminal device or the server in fig. 1, the flow 200 including the following steps:
step 201, determining at least one pronunciation corresponding to each character in the target text (i.e. performing sound-character alignment) according to the target text and the corresponding target audio, wherein the target text is obtained by performing voice recognition on the target audio.
In the present embodiment, the target text is the text to be processed, obtained by performing speech recognition on the target audio. Speech recognition technology uses language pattern recognition and autonomous learning to perform centralized analysis and processing of the sound signals generated by various services, thereby enabling efficient speech-to-text transcription. A speech recognition system can comprise three basic parts (feature extraction, pattern matching, and a reference model library) and operates in two stages: training and recognition. First, characteristic parameters of the recognition content are trained to obtain reference templates; then a test template is matched against the existing reference templates through a recognition decision to find the best-matching reference template, thereby forming the speech recognition result.
In this embodiment, the target text may correspond to any language, such as chinese, english, etc., and the disclosure is not limited thereto. The characters in this embodiment may be specific language units in the language corresponding to the target text, such as chinese characters, english words, and so on.
In one example, step 201 may be implemented as follows: and inputting the target text and the target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text. Here, the first machine learning model is an acoustic model. The acoustic model can be modeled by a hidden Markov model, can also be modeled by a deep learning model, and can also be modeled by other model structures, which is not limited by the disclosure. The first machine learning model may be obtained by performing machine learning training in advance through a training sample set (each training sample in the training sample set includes a text sample, an audio sample, and a correspondence relationship between characters in the text sample and phonemes in the audio sample).
In the above example, the first machine learning model may output a plurality of pronunciations corresponding to each character, for example, 5, 10, or 20 pronunciations per character. For example, assume that the content of the target text is "yellow river" and the target audio is an audio file with a duration of 2s. After the target text and the target audio are input into the first machine learning model, the model may first determine the audio segment corresponding to each word in the target text, for example, "yellow" corresponds to the 0s-1s segment of the target audio and "river" corresponds to the 1s-2s segment. The first machine learning model may then determine the possible pronunciations of each audio segment by analyzing the sound signals it contains. For example, by analyzing the sound signals contained in the 0s-1s audio segment, it may determine that the corresponding pronunciation could be "huang", "tan", or "hang". Accordingly, the first machine learning model may output a plurality of pronunciations corresponding to "yellow", namely "huang", "tan", and "hang". Similarly, the first machine learning model may also output a plurality of pronunciations corresponding to "river".
In the above example, the first machine learning model may further output a probability corresponding to each pronunciation, for example, the probability corresponding to "huang" is 0.6, the probability corresponding to "tan" is 0.2, and the probability corresponding to "hang" is 0.1.
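As a minimal illustration of the data produced in step 201, consider the following Python sketch. It assumes a hypothetical align_pronunciations interface, and the candidate pronunciations and probabilities for "river" are invented for the example; this is a sketch of the alignment output, not the patent's prescribed implementation.

    from dataclasses import dataclass

    @dataclass
    class PronunciationCandidate:
        pinyin: str   # one candidate pronunciation for a character
        prob: float   # acoustic-model probability of this pronunciation

    def align_pronunciations(text, audio):
        """Hypothetical wrapper around the pre-trained first machine
        learning model (an acoustic model): segment `audio` by character
        of `text` and return, per character, a probability-sorted list
        of PronunciationCandidate objects."""
        raise NotImplementedError  # stands in for the trained model

    # Illustrative output for the target text "yellow river" from the
    # example above: the 0s-1s segment aligns to "yellow", the 1s-2s
    # segment aligns to "river".
    alignment = [
        [PronunciationCandidate("huang", 0.6),
         PronunciationCandidate("tan", 0.2),
         PronunciationCandidate("hang", 0.1)],  # candidates for "yellow"
        [PronunciationCandidate("he", 0.7),
         PronunciationCandidate("ge", 0.2)],    # assumed candidates for "river"
    ]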
Step 202, determining whether target content consistent with the pronunciation of a preset word exists in the target text according to at least one pronunciation corresponding to each character in the target text.
In this embodiment, the preset word may be a word that is set in advance and is used as a standard for error correction. The preset words may be error-prone words collected in advance in the speech recognition field, or words related to the source scene of the target text, or the like.
In one example, the target text may be derived from a conference scene, the target audio may be audio of the target conference, and the preset word may be a hotword corresponding to the target conference. Wherein the target conference may be an online conference. The hotword corresponding to the target meeting may be a word related to the target meeting, such as a name of a participant of the target meeting, a word in a title of the target meeting, a word in a historical text of the target meeting (e.g., a meeting caption or meeting record, etc.), and so on. In these application scenarios, as the conference progresses, the hotword may change accordingly, that is, the preset word may also change.
In this embodiment, target content consistent with the pronunciation of the preset word may be searched for in the target text based on the at least one pronunciation corresponding to each word. When the target text and the preset words are both Chinese, the target content may be a word, or several Chinese characters that do not form a word.
In one example, the predetermined words may be from a predetermined set of words, and the pronunciations of the words in the predetermined set may be stored in a dictionary tree (trie). This improves the efficiency of looking up preset-word pronunciations and hence the efficiency of searching for the target content, as sketched below.
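A minimal sketch of such dictionary-tree storage follows, assuming each preset word is keyed by its pinyin syllable sequence; the class and method names are illustrative, not part of the disclosure.

    class PronunciationTrie:
        """Stores preset words keyed by their pronunciation (a pinyin
        syllable sequence), so a candidate pronunciation can be checked
        in time proportional to its length, independent of how many
        preset words are registered."""

        def __init__(self):
            self.children = {}  # syllable -> child node
            self.word = None    # preset word ending at this node, if any

        def insert(self, syllables, word):
            node = self
            for s in syllables:
                node = node.children.setdefault(s, PronunciationTrie())
            node.word = word

        def lookup(self, syllables):
            node = self
            for s in syllables:
                node = node.children.get(s)
                if node is None:
                    return None
            return node.word

    # Example: register the preset word "yellow river" by its pronunciation.
    trie = PronunciationTrie()
    trie.insert(["huang", "he"], "yellow river")
    assert trie.lookup(["huang", "he"]) == "yellow river"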
In one example, step 202 may further include the steps of:
step 2021, determining a frequency level of the preset word, wherein the frequency level represents the occurrence frequency or the occurrence probability of the preset word in the target text.
In the example of a conference scenario, the frequency levels of the preset words may be divided into a high frequency level and a low frequency level. The occurrence frequency or occurrence probability of preset words at the high frequency level in the target text is higher. In the first case, preset words at the high frequency level actually appear in the target text with high frequency; for example, the appearance frequency of each word in the target text may be counted, and a word whose appearance frequency is higher than a preset frequency may be determined as a preset word at the high frequency level. In the second case, preset words at the high frequency level have a high probability of appearing in the target text (without necessarily actually appearing); for example, in a meeting scene, the names of the participants generally have a high probability of appearing and may be determined as preset words at the high frequency level.
The occurrence frequency or occurrence probability of preset words at the low frequency level in the target text is low. In the first case, preset words at the low frequency level actually appear in the target text with low frequency; for example, the appearance frequency of each word in the target text may be counted, and a word whose appearance frequency is lower than a preset frequency may be determined as a preset word at the low frequency level. In the second case, preset words at the low frequency level have a low probability of appearing in the target text (without necessarily actually appearing); for example, in an online meeting scene, the names of the participants' communication contacts may appear, but with a lower probability than the participants' own names, and may therefore be determined as preset words at the low frequency level.
In this embodiment, the frequency level of the preset word may be determined according to specific situations, which is not limited in this disclosure.
Step 2022, in a case that the frequency level of the preset word is the first level, determining whether target content consistent with the pronunciation of the preset word exists in the target text according to a first preset number of pronunciations corresponding to each character in the target text.
Step 2023, determining whether the target text has target content consistent with the pronunciation of the preset word or not according to a second preset number of pronunciations corresponding to each character in the target text under the condition that the frequency level of the preset word is a second level.
In step 2022 and step 2023, the first level is higher than the second level, and the first preset number is greater than the second preset number. For example, for a preset word at the high frequency level in a conference scene, such as the name of a conference participant, a larger number of pronunciations corresponding to each word in the target text (e.g., the 5 highest-probability pronunciations) may be obtained, and pronunciation matching is performed on that basis. This avoids missing a match in which a lower-probability pronunciation happens to coincide with the pronunciation of the preset word, ensures the accuracy of pronunciation matching, and improves the accuracy of error correction for speech-recognized text. For a preset word at the low frequency level in a conference scene, such as the name of a participant's communication contact, a smaller number of pronunciations corresponding to each character (e.g., the 3 highest-probability pronunciations) may be obtained, and pronunciation matching is performed on that basis. This reduces the amount of data to be processed and improves processing speed. A sketch of this frequency-dependent matching follows.
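The following sketch illustrates steps 2021-2023 under stated assumptions: the level names, the per-level candidate counts (5 and 3, taken from the examples above), and the data layout (the per-character candidate lists from the earlier sketch, sorted by descending probability) are all illustrative.

    FIRST_PRESET_NUMBER = 5   # for first-level (high-frequency) preset words
    SECOND_PRESET_NUMBER = 3  # for second-level (low-frequency) preset words

    def find_target_content(alignment, preset_syllables, frequency_level):
        """Scan the per-character pronunciation candidates for a span of
        characters whose candidate pronunciations can spell out the preset
        word; return the (start, end) character span, or None."""
        k = (FIRST_PRESET_NUMBER if frequency_level == "first"
             else SECOND_PRESET_NUMBER)
        n = len(preset_syllables)
        for start in range(len(alignment) - n + 1):
            if all(preset_syllables[i] in
                   [c.pinyin for c in alignment[start + i][:k]]
                   for i in range(n)):
                return (start, start + n)
        return None

    # With the `alignment` example above, find_target_content(alignment,
    # ["huang", "he"], "first") returns (0, 2).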
Step 203, in a case that the target content exists in the target text, modifying the target content into the preset word.
In one example, the necessity of modifying the target content may be determined before modifying the target content into the preset word. In this example, step 203 may further include the steps of:
firstly, whether the target content needs to be modified or not can be determined according to the target content, the preset words and the related content of the target content in the target text.
Secondly, the target content can be modified into preset words under the condition that the target content needs to be modified.
Here, the relevant content of the target content in the target text is, for example, a sentence where the target content is located, a paragraph where the target content is located, a full text of the target text, and the like.
Here, whether modification is required may be determined from one or more angles, such as the similarity between the pronunciation of the target content and that of the preset word, the fluency of the sentence in which the target content is located, and the fluency of that sentence after the target content is replaced with the preset word. For example, it may be determined that the target content needs to be modified when the pronunciation similarity between the target content and the preset word is greater than a preset similarity threshold. As another example, it may be determined that the target content needs to be modified when the fluency of the sentence in which the target content is located is lower than a preset fluency threshold. As another example, it may be determined that the target content needs to be modified when the fluency of the sentence after replacement with the preset word is higher than the fluency of the original sentence. In addition, the necessity of modifying the target content can be scored comprehensively from these several angles at the same time, and whether the target content needs to be modified can be determined according to the comprehensive score, as sketched below.
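A sketch of this rule-based decision follows. The helper functions are toy stand-ins and the threshold is an assumed value; a production system would substitute real pronunciation-similarity and language-model fluency scores.

    def pronunciation_similarity(a, b):
        # Toy stand-in: overlap ratio of characters. A real system would
        # compare pinyin sequences, e.g., by edit distance.
        return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

    def fluency(sentence):
        # Toy stand-in for a language-model fluency score.
        return 0.0

    def should_modify(target_content, preset_word, sentence,
                      sim_threshold=0.8):
        """Combine the angles described above; any single strong signal
        suffices here, though a weighted comprehensive score would also
        fit the description."""
        sim = pronunciation_similarity(target_content, preset_word)
        replaced = sentence.replace(target_content, preset_word)
        return (sim > sim_threshold
                or fluency(replaced) > fluency(sentence))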
Here, the target content, the preset word and the related content of the target content in the target text may be input into a second machine learning model trained in advance, so as to obtain a determination result of whether the target content needs to be modified. The second machine learning model may be obtained by performing machine learning training in advance through a training sample set (the training samples in the training sample set include sample target content, sample preset words, relevant texts of the sample target content, and a label indicating whether the sample target content needs to be modified).
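As a sketch of how the second machine learning model might be invoked, the snippet below assumes a generic binary classifier and a [SEP]-style input convention; neither is mandated by the disclosure.

    def build_judgment_input(target_content, preset_word, related_content):
        # Concatenate the three fields so a sequence classifier can attend
        # to all of them; the separator token is an assumed convention.
        return f"{target_content} [SEP] {preset_word} [SEP] {related_content}"

    def needs_modification(classifier, target_content, preset_word,
                           related_content):
        """`classifier` is any pre-trained model exposing a predict()
        method that returns the probability that the target content
        should be modified; 0.5 is an assumed decision threshold."""
        score = classifier.predict(build_judgment_input(
            target_content, preset_word, related_content))
        return score > 0.5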
According to the speech recognition result processing method provided by this embodiment of the present disclosure, the pronunciations corresponding to the characters in the target text are determined from the target text and the corresponding target audio, so the pronunciations of the characters can be obtained more accurately; error correction is then performed on this basis in combination with the preset words, which improves the accuracy of error correction for speech-recognized text.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a speech recognition result processing apparatus, which corresponds to the method embodiment shown in fig. 2, and which is specifically applicable to various terminal devices.
As shown in fig. 4, the speech recognition result processing apparatus 400 of the present embodiment includes: a sound-character alignment unit 401, a matching unit 402, and a modification unit 403. The sound-character alignment unit 401 is configured to input a target text and corresponding target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text, wherein the target text is obtained by performing speech recognition on the target audio; the matching unit 402 is configured to determine, according to the at least one pronunciation corresponding to each character in the target text, whether target content consistent with the pronunciation of a preset word exists in the target text; and the modification unit 403 is configured to modify the target content into the preset word in a case that the target content exists in the target text.
In this embodiment, the specific processing of the phonetic-character alignment unit 401, the matching unit 402 and the modification unit 403 of the speech recognition result processing apparatus 400 and the technical effects thereof can refer to the related descriptions of step 201, step 202 and step 203 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional embodiments, the phonetic-word alignment unit 401 may further be configured to: and inputting the target text and the target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text.
In some optional embodiments, the modifying unit 403 may be further configured to: determining whether the target content needs to be modified or not according to the target content, the preset words and the related content of the target content in the target text; and under the condition that the target content is determined to need to be modified, modifying the target content into a preset word.
In some optional embodiments, the modifying unit 403 may be further configured to: and inputting the target content, the preset words and the related content of the target content in the target text into a pre-trained second machine learning model to obtain a judgment result of whether the target content needs to be modified.
In some optional embodiments, the matching unit 402 may further be configured to: determining the frequency grade of the preset words, wherein the frequency grade represents the occurrence frequency or the occurrence probability of the preset words in the target text; under the condition that the frequency level of the preset words is a first level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a first preset number of pronunciations corresponding to each word in the target text; under the condition that the frequency level of the preset words is a second level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a second preset number of pronunciations corresponding to each word in the target text; the first grade is higher than the second grade, and the first preset number is larger than the second preset number.
In some alternative embodiments, the predetermined words are from a predetermined set of words, and the pronunciations of the words in the predetermined set of words are stored in a dictionary tree manner.
In some optional embodiments, the target audio is an audio of the target conference, and the preset word is a hotword corresponding to the target conference.
It should be noted that, for details of implementation and technical effects of each unit in the speech recognition result processing apparatus provided in the embodiment of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing the terminal devices of the present disclosure is shown. The computer system 500 shown in fig. 5 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer system 500 may include a processing device (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage device 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer system 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, and the like; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the computer system 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer system 500 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the speech recognition result processing method as shown in the embodiment shown in fig. 2 and its alternative embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation on the unit itself, for example, a phonetic-to-word alignment unit may also be described as a "unit for inputting a target text and corresponding target audio into a pre-trained first machine-learned model, resulting in at least one pronunciation for each word in the target text".
The foregoing description is only of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims (10)

1. A method for processing a speech recognition result comprises the following steps:
determining at least one pronunciation corresponding to each character in a target text according to the target text and corresponding target audio, wherein the target text is obtained by performing voice recognition on the target audio;
determining whether target content consistent with the pronunciation of a preset word exists in the target text or not according to at least one pronunciation corresponding to each character in the target text;
and under the condition that the target content exists in the target text, modifying the target content into the preset words.
2. The method of claim 1, wherein the determining at least one pronunciation corresponding to each word in the target text according to the target text and the corresponding target audio comprises:
and inputting the target text and the target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text.
3. The method of claim 1, wherein the modifying the target content to the preset word if the target content exists in the target text comprises:
determining whether the target content needs to be modified or not according to the target content, the preset words and the related content of the target content in the target text;
and under the condition that the target content needs to be modified, modifying the target content into the preset words.
4. The method of claim 3, wherein the determining whether the target content needs to be modified according to the target content, the preset words and the related content of the target content in the target text comprises:
and inputting the target content, the preset words and the related content of the target content in the target text into a pre-trained second machine learning model to obtain a judgment result of whether the target content needs to be modified.
5. The method of claim 1, wherein the determining whether target content consistent with the pronunciation of a preset word exists in the target text according to at least one pronunciation corresponding to each word in the target text comprises:
determining a frequency grade of the preset words, wherein the frequency grade represents the occurrence frequency or the occurrence probability of the preset words in the target text;
under the condition that the frequency level of the preset words is a first level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to the pronunciations of a first preset number corresponding to each word in the target text;
under the condition that the frequency level of the preset words is a second level, determining whether target content consistent with the pronunciation of the preset words exists in the target text or not according to a second preset number of pronunciations corresponding to each word in the target text;
the first grade is higher than the second grade, and the first preset number is larger than the second preset number.
6. The method of any one of claims 1-5, wherein the predetermined words are from a predetermined set of words, the pronunciations of the words in the predetermined set of words being stored in a dictionary tree.
7. The method according to any one of claims 1-5, wherein the target audio is audio of a target conference, and the preset word is a hotword corresponding to the target conference.
8. A speech recognition result processing apparatus comprising:
a sound-character alignment unit, configured to input a target text and corresponding target audio into a pre-trained first machine learning model to obtain at least one pronunciation corresponding to each character in the target text, wherein the target text is obtained by performing speech recognition on the target audio;
the matching unit is used for determining whether target content consistent with the pronunciation of a preset word exists in the target text according to at least one pronunciation corresponding to each character in the target text;
and the modifying unit is used for modifying the target content into the preset words under the condition that the target content exists in the target text.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-7.
CN202110533064.8A 2021-05-17 2021-05-17 Method and device for processing voice recognition result, electronic equipment and storage medium Active CN113241061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110533064.8A CN113241061B (en) 2021-05-17 2021-05-17 Method and device for processing voice recognition result, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110533064.8A CN113241061B (en) 2021-05-17 2021-05-17 Method and device for processing voice recognition result, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113241061A true CN113241061A (en) 2021-08-10
CN113241061B CN113241061B (en) 2023-03-10

Family

ID=77134731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110533064.8A Active CN113241061B (en) 2021-05-17 2021-05-17 Method and device for processing voice recognition result, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113241061B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence
CN109741749A (en) * 2018-04-19 2019-05-10 北京字节跳动网络技术有限公司 A kind of method and terminal device of speech recognition
US20200335099A1 (en) * 2019-04-16 2020-10-22 International Business Machines Corporation Speech to text conversion engine for non-standard speech
CN110442870A (en) * 2019-08-02 2019-11-12 深圳市珍爱捷云信息技术有限公司 Text error correction method, device, computer equipment and storage medium
CN110544477A (en) * 2019-09-29 2019-12-06 北京声智科技有限公司 Voice recognition method, device, equipment and medium
CN110929514A (en) * 2019-11-20 2020-03-27 北京百分点信息科技有限公司 Text proofreading method and device, computer readable storage medium and electronic equipment
CN112115706A (en) * 2020-08-31 2020-12-22 北京字节跳动网络技术有限公司 Text processing method and device, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韦向峰 et al., "一种基于语义分析的汉语语音识别纠错方法" (A Chinese speech recognition error correction method based on semantic analysis), 《计算机科学》 (Computer Science) *

Also Published As

Publication number Publication date
CN113241061B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
US11164573B2 (en) Method and apparatus for controlling page
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
US11217236B2 (en) Method and apparatus for extracting information
EP3451328A1 (en) Method and apparatus for verifying information
CN111986655B (en) Audio content identification method, device, equipment and computer readable medium
CN110136715B (en) Speech recognition method and device
CN109582825B (en) Method and apparatus for generating information
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN109920431B (en) Method and apparatus for outputting information
CN109697978B (en) Method and apparatus for generating a model
CN108877779B (en) Method and device for detecting voice tail point
WO2020052061A1 (en) Method and device for processing information
CN107680584B (en) Method and device for segmenting audio
CN110245334B (en) Method and device for outputting information
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN113111658B (en) Method, device, equipment and storage medium for checking information
CN111968657B (en) Voice processing method and device, electronic equipment and computer readable medium
JP7182584B2 (en) A method for outputting information of parsing anomalies in speech comprehension
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN112837672A (en) Method and device for determining conversation affiliation, electronic equipment and storage medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113241061B (en) Method and device for processing voice recognition result, electronic equipment and storage medium
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium
CN111126078B (en) Translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant