CN114678018A - Voice recognition method, device, equipment, medium and product - Google Patents

Voice recognition method, device, equipment, medium and product

Info

Publication number
CN114678018A
CN114678018A
Authority
CN
China
Prior art keywords
text
voice
recognized
speech
transcription
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210152963.8A
Other languages
Chinese (zh)
Inventor
姚佳立
王心怡
杨晶生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210152963.8A priority Critical patent/CN114678018A/en
Publication of CN114678018A publication Critical patent/CN114678018A/en
Pending legal-status Critical Current

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; Speech recognition; Speech or voice processing techniques; Speech or audio coding or decoding
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/26: Speech to text systems
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a voice recognition method, apparatus, device, medium and product, relating to the technical field of voice recognition. The method includes: acquiring a voice to be recognized, and determining a first transcription text of the voice to be recognized according to the voice to be recognized; and, when a keyword in the first transcription text hits in an error keyword set, performing semantic repair on the first transcription text, wherein the error keyword set comprises a plurality of keywords corresponding to the same voice. By performing semantic repair on the first transcription text with the error keyword set, the method improves the accuracy of recognizing speech whose keywords contain professional terms and meets service requirements.

Description

Voice recognition method, device, equipment, medium and product
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, computer-readable storage medium, and computer program product.
Background
With the continuous development of speech recognition technology, Automatic Speech Recognition (ASR) is now widely used: for example, a speech recognition model transcribes speech into its corresponding text, which brings considerable convenience.
At present, in instant messaging applications, voice can be converted into text through a speech recognition model, so that a user can obtain the corresponding text without typing; in conference applications, the text corresponding to conference speech is automatically generated through a speech recognition model, which makes it convenient to generate conference records.
However, the speech to be recognized may contain professional terms. For example, in the speech "we debug this system", "debug" is a professional term; when a speech recognition model recognizes such speech, it often outputs "we eighth this system". Therefore, the recognition results obtained for speech containing professional terms are often inaccurate.
Disclosure of Invention
The purpose of the present disclosure is to provide a speech recognition method, apparatus, device, computer-readable storage medium and computer program product that can improve the accuracy of recognizing speech containing professional terms and meet service requirements.
In a first aspect, the present disclosure provides a speech recognition method, including:
acquiring a voice to be recognized;
determining a first transcription text of the voice to be recognized according to the voice to be recognized;
when a keyword in the first transcription text hits in an error keyword set, performing semantic repair on the first transcription text; the error keyword set comprises a plurality of keywords corresponding to the same voice.
In a second aspect, the present disclosure provides a speech recognition apparatus comprising:
the acquisition module is used for acquiring the voice to be recognized;
the text transcription module is used for determining a first transcription text of the voice to be recognized according to the voice to be recognized;
the semantic repair module is used for performing semantic repair on the first transcription text when a keyword in the first transcription text hits in the error keyword set; the error keyword set comprises a plurality of keywords corresponding to the same voice.
In a third aspect, the present disclosure provides a computer-readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of any one of the first aspects of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of any one of the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program product comprising instructions which, when run on a device, cause the device to perform the method according to any of the implementations of the first aspect.
It can be seen from the above technical solutions that the present disclosure has the following advantages:
According to the method and the device, after the transcription text of the voice to be recognized is obtained, semantic repair is performed on the transcription text by using the error keyword set: when a keyword in the transcription text hits in the error keyword set, semantic repair is performed on the transcription text, where the error keyword set comprises a plurality of keywords corresponding to the same voice. Therefore, after the transcription text is semantically repaired with the error keyword set, the accuracy of the repaired transcription text is improved.
Further, in the decoding process, after a plurality of candidate keywords of the speech to be recognized are obtained, the output probabilities of the candidate keywords are intervened with a preset keyword set. For example, the output probability of a target keyword among the candidate keywords is raised when the target keyword hits in the preset keyword set; the preset keyword set comprises keywords corresponding to a preset service scene, such as the professional terms of that scene. The transcription text of the speech to be recognized is then obtained based on the output probabilities of the candidate keywords. Therefore, the speech recognition method provided by the present disclosure yields a more accurate transcription text when recognizing speech containing professional terms, and can meet service requirements.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the embodiments are briefly described below.
Fig. 1 is a schematic diagram of a speech recognition system provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of a speech recognition method provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a recording interface according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The terms "first", "second" in the embodiments of the present disclosure are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Some technical terms involved in the embodiments of the present disclosure will be described first.
Speech recognition refers to converting speech into its corresponding text. Generally, a speech recognition model can be trained and obtained based on a sample corpus, and a speech to be recognized is recognized through the speech recognition model, so as to obtain a text corresponding to the speech to be recognized.
However, in some scenarios, professional terms may exist in the speech to be recognized, and the conventional speech recognition model recognizes the speech containing the professional terms, often resulting in a wrong recognition result. Taking a conference scene of software development as an example, in the scene, the professional terms may be "debug", "admin", and the like, the speech to be recognized may be the speech of "we debug this system", and in the process of recognizing the speech containing "debug" by the conventional speech recognition model, the speech of "debug" is decoded into "eighth", so as to obtain an erroneous recognition result, thereby reducing the accuracy of speech recognition.
In view of this, an embodiment of the present disclosure provides a speech recognition method, including: acquiring a voice to be recognized, and determining a first transcription text of the voice to be recognized according to the voice to be recognized; and, when a keyword in the first transcription text hits in an error keyword set, performing semantic repair on the first transcription text, wherein the error keyword set comprises a plurality of keywords corresponding to the same voice.
According to the method, after the first transcription text of the voice to be recognized is obtained, semantic repair is performed on the first transcription text with the error keyword set, which improves the accuracy of the repaired transcription text. Furthermore, in the decoding process, the output probabilities of the candidate keywords are intervened with a preset keyword set so as to raise the output probability of the target keyword among them, and the transcription text of the voice to be recognized is then obtained from those output probabilities. Thus, when the speech to be recognized contains a professional term, the accuracy of the transcription text obtained by recognizing it can be improved.
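The overall flow described above can be sketched as follows. This is an illustrative outline only: `recognize`, `error_keyword_set`, and `semantic_repair` are hypothetical stand-ins for the components of the method, not names from the actual implementation.

```python
def recognize_with_repair(speech, recognize, error_keyword_set, semantic_repair):
    """Transcribe the speech, then repair the text if it hits the error keyword set.

    error_keyword_set maps a possibly mis-decoded keyword to an alternative
    keyword that corresponds to the same voice (e.g. "eighth" -> "debug").
    """
    first_text = recognize(speech)  # first transcription text
    for wrong, correct in error_keyword_set.items():
        if wrong in first_text:
            # A keyword in the first transcription text hit the error
            # keyword set, so perform semantic repair.
            return semantic_repair(first_text, wrong, correct)
    return first_text
```

With a trivial repair function that simply substitutes the keyword, the running example "we eighth this system" would be repaired to "we debug this system".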
The voice recognition method provided by the embodiment of the disclosure can be applied to different APPs (applications). For example, the voice recognition method can be applied to video applications, and by the method, voices included in videos are recognized, so that subtitles are added to the videos; for another example, the voice recognition method can be applied to conference applications, and by using the method, voice in a conference process is recognized, and then a conference record (text corresponding to the voice in the conference process) is automatically generated. The embodiment of the present disclosure does not specifically limit the application scenario of the speech recognition method, and the above is only an exemplary description.
When the speech recognition method is applied to the above application, the speech recognition method is specifically implemented in the form of a computer program. In some embodiments, the computer program may be stand-alone, e.g. may be a stand-alone application with corresponding functionality. In other embodiments, the computer program may be a functional module or a plug-in, etc., that runs attached to an existing application.
The voice recognition method provided by the embodiment of the disclosure can be executed by the terminal alone, the server alone, or the terminal and the server cooperatively. When the speech recognition method is executed by a terminal alone, for example, a terminal of a speech recognition application, it indicates that the speech recognition application can be operated offline. For ease of understanding, the following description is exemplified in a case where the voice recognition method is cooperatively performed by the terminal and the server.
In order to make the technical solution of the present disclosure clearer and easier to understand, the following describes an architecture of a speech recognition system provided in an embodiment of the present disclosure with reference to the drawings.
Referring to the system architecture diagram of the speech recognition system 100 shown in fig. 1, the speech recognition system 100 includes a terminal 11 and a server 12, which are connected via a network. The terminal 11 includes, but is not limited to, a smart phone, a tablet computer, a notebook computer, a Personal Digital Assistant (PDA), or an intelligent wearable device. The server 12 may be a cloud server, such as a central server in a central cloud computing cluster or an edge server in an edge cloud computing cluster. Of course, the server 12 may also be a server in a local data center, i.e., a data center directly controlled by the user.
In some examples, the terminal 11 is configured to send the speech to be recognized to the server 12, and the server 12 is configured to recognize the speech to be recognized after acquiring the speech to be recognized, and determine the first transcribed text of the speech to be recognized. In the process of recognizing the speech to be recognized, the server 12 intervenes in the plurality of candidate keywords corresponding to the speech to be recognized by using the preset keyword set, so as to improve the output probability of the target keyword among the plurality of candidate keywords, where the target keyword may be a keyword of a specialized term, such as "debug", and thus, the accuracy of recognizing the speech containing the specialized term is improved.
Further, after the server 12 obtains the first transcription text, when a keyword in the first transcription text hits in the error keyword set, the server may further evaluate how reasonable the first transcription text is, where the error keyword set includes multiple keywords corresponding to the same voice. The multiple keywords corresponding to the same voice may be, for example, "eighth" and "debug", or "hack" and "gap". Taking the first transcription text "we eighth this system" as an example: the first transcription text includes the keyword "eighth" from the error keyword set, so the server 12 may replace "eighth" with "debug" to obtain a second transcription text "we debug this system", and then judge how reasonable the first and second transcription texts are. Here the second transcription text is more reasonable than the first, so the server 12 replaces the first transcription text with the second transcription text, thereby further improving the accuracy of recognizing speech containing professional terms.
In order to make the technical solution of the present disclosure clearer and easier to understand, the speech recognition method provided in the embodiments of the present disclosure is described below in terms of a terminal and a server. Referring to fig. 2, which is a flowchart of a speech recognition method provided in an embodiment of the present disclosure, the method includes:
S201: and the terminal collects the voice to be recognized.
In some examples, the terminal may collect the speech to be recognized based on a microphone. For the convenience of understanding, taking a conference scene as an example, a user can participate in a conference (e.g., a video conference) through a terminal, and during the conference, speak out a voice containing a professional term, and the terminal can collect the voice when the user speaks. In other examples, the terminal may further record the conference based on an operation triggered by the user, so as to obtain the video file. The video file includes sound information such as speech during a meeting.
It should be noted that the speech to be recognized collected by the terminal is not limited to the above example, and those skilled in the art may determine the speech to be recognized according to actual needs.
S202: and the terminal sends the voice to be recognized to the server.
The server can receive the voice to be recognized sent by the terminal so as to recognize the voice to be recognized.
S203: and the server determines a plurality of candidate keywords corresponding to the voice to be recognized according to the voice to be recognized.
The speech to be recognized may be speech corresponding to a sentence, or may be speech corresponding to a word. Taking the speech to be recognized as the speech corresponding to the sentence as an example, the speech to be recognized contains the professional term, such as the speech of "we debug this system", and the candidate keywords for the speech to be recognized may be "we", "eighth", "debug", "next", "this" and "system". In the decoding process, the server may determine a text corresponding to the speech to be recognized based on the output probability of each candidate keyword.
In some examples, the server pre-trains a speech recognition model capable of recognizing speech that contains professional terms. The professional terms differ across service scenes; accordingly, a corpus containing the professional terms of a preset service scene can be determined according to actual needs, and model training is then performed with this corpus to obtain the speech recognition model. The corpus comprises preset texts and the speech of the preset texts, where the preset texts include keywords corresponding to the preset service scene.
Taking a software development conference scene as the preset service scene, the server may obtain the preset texts based on user feedback, or may automatically generate them from preset keywords; the two approaches are described below.
In some examples, the server may generate a recognition result (e.g., a transcription text) of the speech to be recognized and transmit it to the terminal; the terminal may present the recognition result to the user and receive the user's verification of it. For example, the verification result may include whether the recognition result is accurate, together with a corrected version. Taking the speech "we debug this system" as an example, when the recognition result is the text "we eighth this system", the user may mark the recognition result as wrong and modify it into "we debug this system"; the modified text is then used as a preset text. Further, the terminal may record the wrong keyword pair "eighth" and "debug" to obtain the error keyword set.
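Collecting such wrong/correct keyword pairs from user corrections could be sketched as below, using Python's standard `difflib` to align the recognized text against the user's correction. The function name and the simple word-level alignment are illustrative assumptions; a real implementation would likely align at the token level with pronunciation constraints.

```python
import difflib


def update_error_keyword_set(error_set, recognized_text, corrected_text):
    """Record (wrong, correct) keyword pairs by diffing the recognition
    result against the user's corrected text."""
    recognized = recognized_text.split()
    corrected = corrected_text.split()
    matcher = difflib.SequenceMatcher(None, recognized, corrected)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            # The span the user replaced is a wrong keyword; the span they
            # wrote instead is the corresponding correct keyword.
            wrong = " ".join(recognized[i1:i2])
            correct = " ".join(corrected[j1:j2])
            error_set[wrong] = correct
    return error_set
```

For the running example, diffing "we eighth this system" against "we debug this system" records the pair ("eighth", "debug").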
In other examples, the server may further send a recognition result (e.g., a history transcription text) of the history speech to be recognized to the terminal, the terminal may present the recognition result to the user, and the terminal may further receive a verification result of the history transcription text by the user. For example, the user may modify the historical transcription text, mark whether the historical transcription text is accurate, or the like, and then use the modified historical transcription text as the preset text.
In other examples, the server may also automatically generate a preset text based on a preset keyword, for example, the preset keyword may be a professional term in a preset service scenario, so as to further increase the data scale of the preset text, and further increase the data scale of the corpus.
The server can receive voice of the preset text sent by the terminal. In some examples, the server may send the preset text to the terminal, the terminal presents the preset text to the user, prompts the user to speak the voice of the preset text, so as to record the voice of the preset text, and then sends the recorded voice of the preset text to the server.
As shown in fig. 3, the figure is a schematic diagram of a terminal recording interface provided in the embodiment of the present disclosure. The recording interface includes a keyword list 310, a text presentation area 320, a recording control 330, and an editing control 340. The user may click a keyword in the keyword list 310, the terminal may display a text (e.g., a preset text) including the keyword clicked by the user in the text display area 320 based on a click operation of the user on the keyword, then the user may speak a voice of the text displayed in the text display area 320 after clicking the recording control 330, the terminal starts recording based on the click operation of the user on the recording control 330, and then the voice of the recorded preset text is sent to the server.
Further, the user can also edit the text displayed in the text display area 320 by clicking the editing control 340, then the terminal performs subsequent recording processing based on the edited text, and sends the edited text and the voice of the text to the server together, so that the corpus in the preset service scene is richer.
Then, the server may perform additional training on the traditional speech recognition model by using the corpus in the preset service scene to obtain a new speech recognition model. Based on the method, the server can input the voice to be recognized into the new voice recognition model to obtain a plurality of candidate keywords corresponding to the voice to be recognized.
S204: when a target keyword in the candidate keywords hits in a preset keyword set, the server improves the output probability of the target keyword.
The preset keyword set comprises keywords corresponding to a preset service scene. In the above embodiment, the keywords corresponding to the preset scenario may be "debug", "admin", "gap", and the like, and the target keyword "debug" in the candidate keywords hits in the preset keyword set, so that the server may improve the output probability of the target keyword. As shown in table 1 below:
Table 1:

Candidate keyword                        eighth    debug
Original output probability              0.6       0.4
Output probability after intervention    0.6       0.7
As can be seen from Table 1, intervening in the candidate keywords with the preset keyword set, i.e. intervening in the target keyword, raises the output probability of the target keyword, so that in the decoding process the server can decode speech containing a professional term into text containing that term. For the keywords shown in Table 1, the server decodes the speech to be recognized into "we debug this system" instead of "we eighth this system", thereby improving the accuracy of recognizing speech that contains professional terms.
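The intervention of Table 1 can be sketched as an additive bias on candidates that hit the preset keyword set. The `boost` value of 0.3 merely reproduces the 0.4 to 0.7 change shown in Table 1 and is an illustrative assumption, not a value specified by the disclosure.

```python
def intervene(candidates, preset_keywords, boost=0.3):
    """Raise the output probability of any candidate keyword that hits the
    preset keyword set (the domain's professional terms).

    candidates: dict mapping candidate keyword -> original output probability.
    Returns a dict with the post-intervention output probabilities.
    """
    return {keyword: (prob + boost if keyword in preset_keywords else prob)
            for keyword, prob in candidates.items()}
```

Applied to Table 1, `intervene({"eighth": 0.6, "debug": 0.4}, {"debug", "admin", "gap"})` leaves "eighth" at 0.6 and lifts "debug" to 0.7, so "debug" wins the subsequent decoding step.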
S205: the server obtains a first transcription text of the voice to be recognized based on the output probabilities of the candidate keywords.
For the candidate keywords shown in Table 1, the server outputs, in the decoding process, the candidate keyword with the higher output probability, obtaining the keyword "debug"; performing similar processing on the other candidate keywords of the speech to be recognized yields the first transcription text.
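Selecting the highest-probability candidate at each position, as S205 describes, can be sketched as below. This is purely illustrative: a production decoder would score whole hypotheses over a lattice or beam rather than treating positions independently.

```python
def decode_transcript(position_candidates):
    """For each position in the utterance, output the candidate keyword with
    the highest (post-intervention) probability, and join the winners into
    the first transcription text.

    position_candidates: list of dicts, one per position, each mapping a
    candidate keyword -> its output probability.
    """
    return " ".join(max(candidates, key=candidates.get)
                    for candidates in position_candidates)
```

With the Table 1 probabilities in place, the utterance decodes to "we debug this system" rather than "we eighth this system".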
In the embodiment of the disclosure, the server intervenes the output probabilities of the candidate keywords by using a preset keyword set, and then obtains the transcribed text of the speech to be recognized based on the output probabilities of the candidate keywords. Therefore, the accuracy rate of recognizing the speech to be recognized containing the professional terms is improved, and the service requirement is met.
S206: and when the keywords in the first transcription text hit in the wrong keyword set, performing semantic repair on the first transcription text by the server.
The error keyword set comprises a plurality of keywords corresponding to the same voice. As described above, the keywords "eighth" and "debug" correspond to the same voice, so the error keyword set may include "debug" and "eighth"; as another example, the keywords "gap" and "change slope" correspond to the same voice, so the error keyword set may include "gap" and "change slope".
It should be noted that the manner of obtaining the error keyword set is not specifically limited in the present disclosure, and in some examples, the error keyword set may be obtained in a preset manner.
In order to further improve the accuracy of recognizing speech containing professional terms, the server may perform semantic repair on the first transcription text after obtaining it. Specifically, when a keyword in the first transcription text hits in the error keyword set, semantic repair is performed on the first transcription text. For example, the first transcription text may be "we eighth this system" and the error keyword set may include "eighth" and "debug"; since the first transcription text contains the keyword "eighth" from the error keyword set, the server performs semantic repair on it.
In some examples, the server may replace a keyword in the first transcription text with a keyword in the error keyword set to obtain a second transcription text, and then score the two texts respectively to obtain a first accuracy score for the first transcription text and a second accuracy score for the second transcription text. The scoring may be based on the accuracy of the transcription text: a higher accuracy indicates a more reasonable text, and a lower accuracy a less reasonable one. For example, the first transcription text is "we eighth this system"; after "eighth" is replaced by the keyword "debug" from the error keyword set, the second transcription text "we debug this system" is obtained. The server may then score the first and second transcription texts with a language model to obtain the first accuracy score and the second accuracy score.
The server then semantically repairs the first transcribed text based on the first accuracy score and the second accuracy score. For example, when the second accuracy score is greater than the first accuracy score, the first transcribed text is repaired to be the second transcribed text, that is, the first transcribed text is replaced by the second transcribed text, which indicates that the second transcribed text is more reasonable than the first transcribed text; when the first accuracy score is greater than or equal to the second accuracy score, the first transcribed text is still output, indicating that the first transcribed text is more reasonable relative to the second transcribed text.
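The replace-score-compare logic of this step can be sketched as follows, where `score` stands in for any language-model scoring callable that returns higher values for more reasonable text (a hypothetical interface, not the disclosure's actual model).

```python
def semantic_repair(first_text, wrong, correct, score):
    """Replace the mis-decoded keyword, score both versions, and keep the
    more reasonable one."""
    second_text = first_text.replace(wrong, correct)
    first_score = score(first_text)
    second_score = score(second_text)
    # Repair only when the replacement is strictly more reasonable, so a
    # correct recognition result is never turned into a wrong one.
    return second_text if second_score > first_score else first_text
```

When the scores tie, or the first transcription text scores higher, the original text is kept unchanged, matching the behavior described above.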
In the embodiment of the present disclosure, after obtaining the first transcription text, the server further performs semantic repair on it. This further ensures the accuracy of recognizing speech containing professional terms, and prevents the intervention with the preset keyword set from turning a correct recognition result into a wrong one.
S207: the server sends the semantically repaired first transcription text to the terminal.
The semantically repaired first transcription text is whichever of the first and second transcription texts scored higher, and may be, for example, the second transcription text.
In some examples, after determining the semantically repaired first transcription text, the server may send the semantically repaired first transcription text to the terminal, and the terminal may present the semantically repaired first transcription text. In other examples, the server may also generate a control instruction based on the semantically repaired first transcription text, where the control instruction is used to control a device corresponding to the preset service scenario. For example, the server may send the control instruction to the controlled device (e.g., projector, intelligent switch lamp), so as to control the controlled device. In some examples, the operation corresponding to the first transcribed text may be to turn on the projector, and the server may generate a control instruction to turn on the projector based on the first transcribed text, and then transmit the control instruction to turn on the projector to the projector, so that the projector is started based on the control instruction.
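Generating a control instruction from the repaired text could be as simple as the following phrase lookup. The trigger phrases and instruction names are hypothetical examples; the disclosure does not specify the mapping mechanism.

```python
def to_control_instruction(repaired_text, command_map):
    """Map a repaired transcription text to a device control instruction.

    command_map pairs a trigger phrase with the instruction to send to the
    controlled device (e.g. a projector or an intelligent switch lamp).
    Returns None when no phrase matches, i.e. the text is not a command.
    """
    for phrase, instruction in command_map.items():
        if phrase in repaired_text:
            return instruction
    return None
```

For example, with `{"turn on the projector": "PROJECTOR_ON"}`, the repaired text "please turn on the projector" yields the instruction that starts the projector.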
In other embodiments, the server may also directly send the first transcribed text to the terminal, or generate a control instruction based on the first transcribed text, that is, the server does not perform semantic repair on the first transcribed text.
Based on the above description, the embodiments of the present disclosure provide a speech recognition method. A corpus containing the professional terms of a preset scenario is obtained through targeted audio collection, and a new speech recognition model is obtained by further training a conventional speech recognition model on this corpus. After the speech recognition model produces a plurality of candidate keywords, their output probabilities are adjusted through a preset keyword set to obtain a recognition result. The method also performs semantic repair on the recognition result using an error keyword set, achieving post-hoc error correction. This further improves the accuracy of recognizing speech containing professional terms and meets the business requirements.
Fig. 4 is a schematic diagram illustrating a speech recognition apparatus according to an exemplary disclosed embodiment, and as shown in fig. 4, the speech recognition apparatus 400 includes:
an obtaining module 401, configured to obtain a speech to be recognized;
A text transcription module 402, configured to determine, according to the speech to be recognized, a first transcribed text of the speech to be recognized;
a semantic repair module 403, configured to perform semantic repair on the first transcribed text when a keyword in the first transcribed text hits in an error keyword set; the error keyword set comprises a plurality of keywords corresponding to the same voice.
Optionally, the semantic repair module 403 is specifically configured to replace a keyword in the first transcribed text with a keyword in the error keyword set to obtain a second transcribed text; evaluating a first accuracy score of the first transcribed text and a second accuracy score of the second transcribed text; replacing the first transcribed text with the second transcribed text when the second accuracy score is greater than the first accuracy score.
Optionally, the text transcription module 402 is specifically configured to determine, according to the speech to be recognized, a plurality of candidate keywords corresponding to the speech to be recognized; when a target keyword in the candidate keywords hits in a preset keyword set, improving the output probability of the target keyword; the preset keyword set comprises keywords corresponding to a preset service scene; and obtaining the first transcription text of the voice to be recognized based on the output probabilities of the candidate keywords.
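The probability-intervention step performed by the text transcription module can be sketched as follows. The boost factor, the renormalization, and the example probabilities are illustrative assumptions, not values from this disclosure.

```python
# Keywords corresponding to the preset business scenario. Illustrative.
PRESET_KEYWORDS = {"tensor"}

def intervene(candidates, boost=2.0):
    """Raise the output probability of candidate keywords that hit the
    preset keyword set, then renormalize so the values again form a
    probability distribution."""
    boosted = {w: (p * boost if w in PRESET_KEYWORDS else p)
               for w, p in candidates.items()}
    total = sum(boosted.values())
    return {w: p / total for w, p in boosted.items()}

# A generic acoustic model slightly prefers the common word "tenser";
# after intervention, the domain term "tensor" has the higher probability.
probs = intervene({"tensor": 0.4, "tenser": 0.6})
```

The transcription would then be assembled from the highest-probability candidates, so the domain term is selected instead of its common homophone.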
Optionally, the text transcription module 402 is specifically configured to input the speech to be recognized into a speech recognition model, so as to obtain a plurality of candidate keywords corresponding to the speech to be recognized; the voice recognition model is obtained based on corpus training corresponding to a preset service scene.
Optionally, the corpus corresponding to the preset service scene includes a preset text and a voice of the preset text, and the preset text includes a keyword corresponding to the preset service scene.
Optionally, the preset text is obtained through user feedback or automatically generated through the preset keyword.
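One way the "automatically generated" preset texts could be produced from the preset keywords is sketched below. The sentence templates and keywords are hypothetical assumptions, not content from this disclosure.

```python
# Hypothetical scenario keywords and sentence templates.
PRESET_KEYWORDS = ["OKR", "retrospective meeting"]
TEMPLATES = [
    "Please schedule the {kw} for next week.",
    "The {kw} has been updated.",
]

def generate_preset_texts(keywords, templates):
    """Embed every scenario keyword into every template sentence. Each
    generated text would then be paired with its recorded or synthesized
    speech to form one corpus entry for training the recognition model."""
    return [t.format(kw=kw) for kw in keywords for t in templates]

texts = generate_preset_texts(PRESET_KEYWORDS, TEMPLATES)
```

This yields one preset text per keyword-template pair, each containing a keyword corresponding to the preset service scenario.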
Optionally, the preset text is obtained by a user checking historical transcription texts.
Optionally, the apparatus further includes an instruction generating module, configured to generate a control instruction according to the first transcribed text, where the control instruction is used to control a device corresponding to the preset service scene to execute an operation corresponding to the first transcribed text.
The functions of the above modules have been elaborated in the method steps in the previous embodiment, and are not described herein again.
Referring now to fig. 5, a schematic structural diagram of an electronic device 500 suitable for implementing an embodiment of the present disclosure is shown. The electronic device may be the server 12, and the server 12 is used to implement the functions of the speech recognition apparatus 400 shown in fig. 4. The electronic device shown in fig. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. Various programs and data necessary for the operation of the electronic device 500 are also stored in the RAM 503. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the terminals and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a voice to be recognized; determining a first transcription text of the voice to be recognized according to the voice to be recognized; when the keywords in the first transcription text hit in an error keyword set, performing semantic repair on the first transcription text; the error keyword set comprises a plurality of keywords corresponding to the same voice.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not in some cases form a limitation of the module itself, for example, the first obtaining module may also be described as a "module for obtaining at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech recognition method, including: acquiring a voice to be recognized; determining a first transcription text of the voice to be recognized according to the voice to be recognized; and when a keyword in the first transcription text hits in an error keyword set, performing semantic repair on the first transcription text; the error keyword set comprises a plurality of keywords corresponding to the same voice.
Example 2 provides the method of example 1, the semantically repairing the first transcribed text, comprising:
replacing the keywords in the first transcription text by the keywords in the error keyword set to obtain a second transcription text;
evaluating a first accuracy score of the first transcribed text and a second accuracy score of the second transcribed text;
replacing the first transcribed text with the second transcribed text when the second accuracy score is greater than the first accuracy score.
Example 3 provides the method of example 1, the determining, from the speech to be recognized, the first transcribed text of the speech to be recognized, including:
Determining a plurality of candidate keywords corresponding to the voice to be recognized according to the voice to be recognized;
when a target keyword in the candidate keywords hits in a preset keyword set, improving the output probability of the target keyword; the preset keyword set comprises keywords corresponding to a preset service scene;
and obtaining the first transcription text of the voice to be recognized based on the output probabilities of the candidate keywords.
Example 4 provides the method of example 3, wherein determining a plurality of candidate keywords corresponding to the speech to be recognized according to the speech to be recognized includes:
inputting the voice to be recognized into a voice recognition model to obtain a plurality of candidate keywords corresponding to the voice to be recognized; the voice recognition model is obtained based on corpus training corresponding to a preset service scene.
Example 5 provides the method of example 4, where the corpus corresponding to the preset service scenario includes a preset text and a voice of the preset text, and the preset text includes a keyword corresponding to the preset service scenario.
According to one or more embodiments of the present disclosure, example 6 provides the method of example 5, where the preset text is obtained through user feedback or automatically generated through the preset keywords.
According to one or more embodiments of the present disclosure, example 7 provides the method of example 6, where the preset text is specifically obtained by a user checking historical transcription texts.
According to one or more embodiments of the present disclosure, example 8 provides the method of any one of examples 1 to 7, further comprising:
and generating a control instruction according to the first transcription text, wherein the control instruction is used for controlling equipment corresponding to the preset service scene to execute the operation corresponding to the first transcription text.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by interchanging the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (12)

1. A speech recognition method, comprising:
acquiring a voice to be recognized;
determining a first transcription text of the voice to be recognized according to the voice to be recognized;
when the keywords in the first transcription text hit in an error keyword set, performing semantic repair on the first transcription text; the error keyword set comprises a plurality of keywords corresponding to the same voice.
2. The method of claim 1, wherein the semantically repairing the first transcribed text comprises:
replacing the keywords in the first transcription text by the keywords in the error keyword set to obtain a second transcription text;
Evaluating a first accuracy score of the first transcribed text and a second accuracy score of the second transcribed text;
replacing the first transcribed text with the second transcribed text when the second accuracy score is greater than the first accuracy score.
3. The method of claim 1, wherein the determining a first transcribed text of the speech to be recognized according to the speech to be recognized comprises:
determining a plurality of candidate keywords corresponding to the voice to be recognized according to the voice to be recognized;
when a target keyword in the candidate keywords hits in a preset keyword set, improving the output probability of the target keyword; the preset keyword set comprises keywords corresponding to a preset service scene;
and obtaining the first transcription text of the voice to be recognized based on the output probabilities of the candidate keywords.
4. The method according to claim 3, wherein the determining a plurality of candidate keywords corresponding to the speech to be recognized according to the speech to be recognized comprises:
inputting the voice to be recognized into a voice recognition model to obtain a plurality of candidate keywords corresponding to the voice to be recognized; the voice recognition model is obtained based on corpus training corresponding to a preset service scene.
5. The method according to claim 4, wherein the corpus corresponding to the predetermined service scenario includes a predetermined text and a voice of the predetermined text, and the predetermined text includes a keyword corresponding to the predetermined service scenario.
6. The method according to claim 5, wherein the preset text is obtained through user feedback or automatically generated through the preset keyword.
7. The method according to claim 6, wherein the preset text is obtained by a user checking the historical transcription text.
8. The method according to any one of claims 1-7, further comprising:
and generating a control instruction according to the first transcribed text, wherein the control instruction is used for controlling equipment corresponding to the preset service scene to execute the operation corresponding to the first transcribed text.
9. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring the voice to be recognized;
the text transcription module is used for determining a first transcription text of the voice to be recognized according to the voice to be recognized;
the semantic repair module is used for performing semantic repair on the first transcription text when the keywords in the first transcription text are hit in the wrong keyword set; the error keyword set comprises a plurality of keywords corresponding to the same voice.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.
12. A computer program product, characterized in that it causes a computer to carry out the steps of the method according to any one of claims 1 to 8, when said computer program product is run on a computer.
CN202210152963.8A 2022-02-18 2022-02-18 Voice recognition method, device, equipment, medium and product Pending CN114678018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152963.8A CN114678018A (en) 2022-02-18 2022-02-18 Voice recognition method, device, equipment, medium and product


Publications (1)

Publication Number Publication Date
CN114678018A true CN114678018A (en) 2022-06-28



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination