WO2019232991A1 - Method for recognizing conference speech as text, electronic device, and storage medium - Google Patents

Method for recognizing conference speech as text, electronic device, and storage medium

Info

Publication number
WO2019232991A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech recognition
word
matching
preset
Prior art date
Application number
PCT/CN2018/108113
Other languages
English (en)
French (fr)
Inventor
王健宗
于夕畔
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019232991A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • the present application relates to the field of speech recognition technology, and in particular, to a method, an electronic device, and a storage medium for recognizing conference speech as text.
  • Automatic Speech Recognition (ASR) is a core technology in fields such as machine translation, robot control, and next-generation human-computer interfaces; it lets a computer take dictation of continuous speech spoken by different people, converting sound into text.
  • the user inputs speech using an external or built-in microphone on a terminal such as a personal computer, a notebook computer, a tablet computer, a dedicated learning terminal, or a smart phone, and speech-to-text conversion is completed by a speech recognition device.
  • a first aspect of the present application provides a method for recognizing conference speech as text, the method including:
  • converting the conference speech to be recognized into text by speech recognition technology, as initial speech recognition text; matching the initial speech recognition text with a preset text database to obtain matched speech recognition text; generating a speech recognition text draft having an editable state from the matched speech recognition text; and, when an editing operation is detected on the speech recognition text draft, generating a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text.
  • a second aspect of the present application provides an electronic device including a processor and a memory, where the processor is configured to implement the method for recognizing conference speech as text when executing computer-readable instructions stored in the memory.
  • a third aspect of the present application provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the method for recognizing conference speech as text.
  • in the method, electronic device, and storage medium for recognizing conference speech as text described in this application, the conference speech to be recognized is converted into text by speech recognition technology as the initial speech recognition text; the initial speech recognition text is matched with a preset text database to obtain matched speech recognition text; a speech recognition text draft having an editable state is generated from the matched speech recognition text; and, when an editing operation is detected on the speech recognition text draft, a speech recognition text having an uneditable state is generated from the edited speech recognition text, as the final speech recognition text.
  • that is, after preliminary recognition of the speech to be recognized, a first match with the preset text databases is performed, followed by a second, manual confirmation.
  • these two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
  • FIG. 1 is a flowchart of a method for recognizing a conference voice as text provided in Embodiment 1 of the present application.
  • FIG. 2 is a functional module diagram of a device for recognizing conference speech as text provided in Embodiment 2 of the present application.
  • FIG. 3 is a schematic diagram of an electronic device according to a third embodiment of the present application.
  • the method for recognizing conference speech as text in the embodiment of the present application is applied to one or more electronic devices.
  • the method for recognizing conference speech as text can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network.
  • the network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network.
  • the method for recognizing conference speech as text in the embodiment of the present application may be executed by a server or an electronic device; it may also be executed jointly by the server and the electronic device.
  • the function of recognizing conference speech as text provided by the method of the present application can be integrated directly on the electronic device, or a client for implementing the method of the present application can be installed on it.
  • alternatively, the method provided in this application can run on a device such as a server in the form of a Software Development Kit (SDK), with an interface for recognizing conference speech as text provided in the form of the SDK; an electronic device or other device can then recognize conference speech as text through the provided interface.
  • FIG. 1 is a flowchart of a method for recognizing a conference voice as text provided in Embodiment 1 of the present application. According to different requirements, the execution order in this flowchart can be changed, and some steps can be omitted.
  • the specific process of converting the conference speech to be recognized into text through speech recognition technology includes:
  • the grammar rule is the Viterbi algorithm.
  • for example, if the conference speech to be recognized is "你好" ("hello"), it is converted after feature extraction into a 39-dimensional acoustic feature vector, and the corresponding sub-words /n/ /i/ /h/ /ao/ are obtained through multiple HMM phoneme models; the sub-words are spliced into candidate characters according to a preset pronunciation dictionary, such as 你/尼 and 好/号; Viterbi decoding then yields the optimal sequence "你好", which is output as text.
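  • as an illustration of the Viterbi decoding step, the following is a minimal sketch over a toy HMM; the states, transition and emission probabilities, and the observation sequence are hypothetical stand-ins chosen so the example decodes to "你好", not values taken from the patent:

```python
import numpy as np

# Toy HMM: hidden states are candidate characters, observations are sub-words.
states = ["你", "尼", "好", "号"]
obs_seq = ["ni", "hao"]  # sub-words spliced from /n/ /i/ and /h/ /ao/

start_p = np.log([0.4, 0.1, 0.4, 0.1])           # prior over characters
trans_p = np.log([[0.1, 0.1, 0.6, 0.2],          # P(next char | char), hypothetical
                  [0.1, 0.1, 0.4, 0.4],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.25, 0.25, 0.25, 0.25]])
emit_p = {"ni":  np.log([0.9, 0.8, 1e-9, 1e-9]), # P(sub-word | char)
          "hao": np.log([1e-9, 1e-9, 0.9, 0.7])}

def viterbi(obs):
    """Return the most probable character sequence for the observed sub-words."""
    v = start_p + emit_p[obs[0]]                 # log-probabilities at time 0
    back = []
    for o in obs[1:]:
        scores = v[:, None] + trans_p            # score every transition
        back.append(scores.argmax(axis=0))       # best predecessor per state
        v = scores.max(axis=0) + emit_p[o]
    path = [int(v.argmax())]
    for ptr in reversed(back):                   # backtrace to the start
        path.append(int(ptr[path[-1]]))
    return [states[i] for i in reversed(path)]

print("".join(viterbi(obs_seq)))  # -> 你好
```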
  • At least three text databases may be set in advance, for example, a first text database, a second text database, and a third text database.
  • the first text database can be dedicated to storing multiple mood words, such as "嗯" ("um"), "啊" ("ah"), and "是吧" ("right?").
  • the mood words are irrelevant to the content of the meeting and easily impair the readability of the text after the speech is converted.
  • the second text database can be dedicated to storing multiple professional terms and their corresponding pinyin, such as "特征向量" ("feature vector"), "特征矩阵" ("feature matrix"), and "张量分析" ("tensor analysis").
  • the professional terms are relatively complex, and therefore recognition errors tend to recur in batches during speech recognition.
  • the third text database can be used to store multiple taboo or sensitive words, such as politically sensitive names and terms related to cults and superstition, pornography, gambling, drugs, firearms and ammunition, and abusive or otherwise illegal content; the appearance of taboo or sensitive words can easily cause adverse effects.
  • the present application may also set a fourth text database in advance according to the actual situation, dedicated to storing expressions such as personal names or place names. This application does not specifically limit the number of preset text databases or their contents.
  • the matching the initial speech recognition text with a preset text database specifically includes:
  • three independently running threads can be set in advance: a first thread, a second thread, and a third thread. The first thread executes readable instructions for matching the initial speech recognition text with the preset first text database to obtain a first matching result, the second thread executes readable instructions for matching the first matching result with the preset second text database to obtain a second matching result, and the third thread executes readable instructions for matching the second matching result with the preset third text database.
  • when the first thread finishes, the second thread starts immediately, and when the second thread finishes, the third thread starts immediately. Setting up three independently running threads to execute the different readable instructions helps increase matching speed and saves matching time.
  • in other embodiments, only one thread may be provided, which sequentially executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain a first matching result, for matching the first matching result with the preset second text database to obtain a second matching result, and for matching the second matching result with the preset third text database.
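  • as an illustration of the three-thread hand-off described above, here is a minimal sketch in which each stage waits for the previous one to finish before it starts; the three matching passes are simple placeholder functions standing in for the database matches, not the patent's actual implementation:

```python
import threading

def stage(match_fn, in_box, out_box, done_prev, done_self):
    """Wait for the previous stage, run one matching pass, signal the next."""
    done_prev.wait()
    out_box["text"] = match_fn(in_box["text"])
    done_self.set()

# Placeholder matching passes for the three databases (toy substitutions).
remove_mood_words = lambda t: t.replace("嗯 ", "")
fix_professional  = lambda t: t.replace("巨震", "矩阵")
strip_taboo_words = lambda t: t  # nothing taboo in this toy input

boxes = [{"text": "嗯 这是一个原始巨震"}, {}, {}, {}]
events = [threading.Event() for _ in range(4)]
events[0].set()  # the first thread may start immediately

fns = [remove_mood_words, fix_professional, strip_taboo_words]
threads = [threading.Thread(target=stage,
                            args=(fns[i], boxes[i], boxes[i + 1],
                                  events[i], events[i + 1]))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(boxes[3]["text"])  # matched speech recognition text
```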
  • matching the initial speech recognition text with the preset first text database includes: determining whether a first word matching a word in the preset first text database exists in the initial speech recognition text; when it is determined that such a first word exists, the matched first word in the initial speech recognition text is processed.
  • the processing of the matched first word in the initial speech recognition text may further include: determining, according to the pre-trained deep-learning-network-based mood word model, whether the matched first word is a mood word to be deleted; when the matched first word is determined to be a mood word to be deleted, removing the matched first word from the initial speech recognition text; when the matched first word is determined not to be a mood word to be deleted, retaining the matched first word in the initial speech recognition text.
  • for example, suppose the initial speech recognition text is "这个挺好用的" ("this works well") and the mood word "这个" ("this") is stored in the preset first text database; matching the initial speech recognition text with the preset first text database determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model then judges whether the matched first word "这个" is a mood word to be deleted; the model determines that "这个" in "这个挺好用的" is not a mood word to be deleted, so the matched first word is retained in the initial speech recognition text, and the first matching result is "这个挺好用的".
  • as another example, suppose the initial speech recognition text is "这个，我们要开会了" ("well, we are about to have a meeting") and the mood word "这个" is stored in the preset first text database; matching determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model determines that "这个" in "这个，我们要开会了" is a mood word to be deleted, so the matched first word is removed from the initial speech recognition text, and the first matching result is "我们要开会了" ("we are about to have a meeting").
  • the method for training the deep-learning-network-based mood word model may include: first obtaining a large number of texts containing words from the first text database;
  • the texts are then divided into positive sample texts and negative sample texts, where a positive sample text is a text in which the mood word needs to be retained and a negative sample text is a text in which the mood word needs to be deleted.
  • the first identifier is used to identify that a word in the sample needs to be retained, for example, it may be "1".
  • the second identifier is used to identify that a word in the sample needs to be deleted, for example, it may be "0".
  • the positive sample texts are input into the deep learning network for training, and it is determined whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; when the similarity between the output text and the input positive sample is greater than the preset similarity threshold, the training of the deep-learning-network-based mood word model ends.
  • the similarity between the output text and the input positive sample can be calculated by a template matching method; template matching is prior art and is not described in detail in this application.
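  • a minimal sketch of the labeling scheme and the similarity-based stopping rule described above; the deep learning network is abstracted behind a trivial placeholder model, and difflib's sequence matching stands in for the unspecified template matching method:

```python
import difflib

# Labeled samples from the patent's example: 1 = "这个" must be retained,
# 0 = "这个" is a mood word that must be deleted.
samples = [
    ("这个项目目前正在进行中", 1),
    ("这个人是谁", 1),
    ("这个，嗯，还在询问中", 0),
    ("这个可以这样的", 0),
]
positives = [text for text, label in samples if label == 1]

SIM_THRESHOLD = 0.95  # preset similarity threshold (hypothetical value)

def template_similarity(a: str, b: str) -> float:
    # Stand-in for the patent's unspecified template matching method.
    return difflib.SequenceMatcher(None, a, b).ratio()

class IdentityModel:
    """Trivial placeholder: echoes its input; a real system would use a DNN."""
    def fit(self, texts):
        pass
    def predict(self, text):
        return text

def train(model, texts, max_epochs=100):
    """Train until the model reproduces every positive sample above the threshold."""
    for epoch in range(max_epochs):
        model.fit(texts)  # placeholder training step
        if all(template_similarity(model.predict(t), t) > SIM_THRESHOLD
               for t in texts):
            return epoch  # stopping rule described in the patent
    return max_epochs

train(IdentityModel(), positives)
```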
  • matching the first matching result with the preset second text database includes: converting the words in the first matching result into a first pinyin; determining whether a second pinyin identical to the first pinyin exists in the preset second text database; and, when such a second pinyin exists, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
  • for example, suppose the first matching result is "这是一个原始巨震"; the words in the first matching result are converted into the first pinyin "zhe shi yi ge yuan shi ju zhen". The preset second text database stores the professional term "矩阵" ("matrix") and its corresponding second pinyin "ju zhen"; when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, the word "矩阵" corresponding to the second pinyin "ju zhen" is extracted as the word corresponding to the first pinyin, and the second matching result obtained is "这是一个原始矩阵" ("this is an original matrix").
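  • a minimal sketch of this pinyin-based correction, using the pypinyin library for the pinyin conversion (the patent does not name a converter); it assumes pure Chinese text so that characters and syllables align one-to-one:

```python
from pypinyin import lazy_pinyin  # one possible pinyin converter

# Preset second text database: professional term -> its pinyin (second pinyin).
second_db = {"矩阵": "ju zhen", "特征向量": "te zheng xiang liang"}

def correct_professional_terms(text: str) -> str:
    """Replace character spans whose pinyin matches a stored professional term."""
    syllables = lazy_pinyin(text)  # first pinyin, one syllable per character
    for term, term_pinyin in second_db.items():
        target = term_pinyin.split()
        n = len(target)
        for i in range(len(syllables) - n + 1):
            if syllables[i:i + n] == target:     # second pinyin == first pinyin
                text = text[:i] + term + text[i + n:]
                syllables = lazy_pinyin(text)    # re-align after the replacement
                break
    return text

print(correct_professional_terms("这是一个原始巨震"))  # -> 这是一个原始矩阵
```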
  • matching the second matching result with the preset third text database includes: determining whether a third word matching a word in the preset third text database exists in the second matching result; when it is determined that such a third word exists, removing the matched third word from the second matching result.
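  • a minimal sketch of this removal pass; the database entries are neutral placeholders, since the patent does not enumerate its taboo or sensitive word list:

```python
# Preset third text database (placeholder entries).
third_db = {"敏感词A", "敏感词B"}

def strip_taboo(text: str) -> str:
    """Remove every word from the third database that appears in the text."""
    for word in third_db:
        text = text.replace(word, "")
    return text

print(strip_taboo("会议纪要敏感词A整理完毕"))  # -> 会议纪要整理完毕
```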
  • a draft of the speech recognition text having an editable state can be generated first.
  • Having an editable state means that the user can perform editing operations on the generated speech recognition text draft.
  • the editing operation may include a confirmation operation and a modification operation.
  • the confirmation operation means that the user confirms that the draft of the speech recognition text is correct, that is, it is determined that the modified speech recognition text does not require any modification operation.
  • the modification operation means that the user confirms that the draft of the speech recognition text contains errors and that individual words or a small number of words need to be adjusted, that is, the matched speech recognition text still needs to be modified manually.
  • whether an editing operation is received on the speech recognition text draft is detected by detecting whether a touch operation is received on a preset button on the speech recognition text draft; when a touch operation on a preset button is detected, it is considered that an editing operation has been received on the speech recognition text draft; when no touch operation on a preset button is detected, it is considered that no editing operation has been received on the speech recognition text draft.
  • the preset button may be a confirm button or a modify button.
  • the button may be a virtual icon or a physical button.
  • the editing operation corresponding to the confirmation button is a confirmation operation
  • the editing operation corresponding to the modification button is a modification operation.
  • generating the speech recognition text having an uneditable state from the speech recognition text after the editing operation includes:
  • when the received editing operation is a confirmation operation, a speech recognition text having an uneditable state is generated directly;
  • when the received editing operation is a modification operation, the user's manual modification is received and the modified new content is saved; when a confirmation operation is received again, a speech recognition text having an uneditable state is generated.
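  • a minimal sketch of the editable-to-uneditable transition; the draft object and its method names are hypothetical, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class TranscriptDraft:
    """Hypothetical draft object mirroring the editable/uneditable states."""
    text: str
    editable: bool = True

    def modify(self, new_text: str) -> None:
        if not self.editable:
            raise RuntimeError("final transcript is not editable")
        self.text = new_text                 # save the user's new content

    def confirm(self) -> str:
        self.editable = False                # switch to the uneditable state
        return self.text                     # final speech recognition text

draft = TranscriptDraft("这是一个原始巨震")
draft.modify("这是一个原始矩阵")             # modification operation
final_text = draft.confirm()                 # confirmation operation
```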
  • the method may further include: storing, in association, the original word at the modified position and the new word as modified by the user; in subsequent speech recognition, converting the conference speech to be recognized into text according to the modified new word, as sketched below.
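  • a minimal sketch of this correction memory; the dictionary structure is an assumption, since the patent only specifies that the original and corrected words are stored in association:

```python
# Hypothetical correction memory: original (misrecognized) word -> user's fix.
corrections = {}

def remember_edit(original: str, corrected: str) -> None:
    corrections[original] = corrected        # store the pair in association

def apply_corrections(recognized_text: str) -> str:
    """Apply previously learned corrections to newly recognized text."""
    for original, corrected in corrections.items():
        recognized_text = recognized_text.replace(original, corrected)
    return recognized_text

remember_edit("巨震", "矩阵")
print(apply_corrections("巨震分析已经完成"))  # -> 矩阵分析已经完成
```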
  • the method may further include: storing in advance multiple forms corresponding to each word; the multiple forms may include, but are not limited to, simplified and traditional forms, forms with spaces inserted, and visually similar characters.
  • the matching of the initial speech recognition text with the preset text database further includes: matching the initial speech recognition text with the multiple forms corresponding to each word of the preset text database according to the environment in which the meeting takes place, to obtain speech recognition text that fits the meeting's environment.
  • the environment in which the meeting is located may include: the participants of the meeting, and the place where the meeting is held.
  • for example, when the participants of the meeting are Taiwanese, who customarily use traditional characters, the recognized initial speech recognition text may contain traditional and/or simplified characters, so it is necessary to match both the simplified and traditional forms of the words to obtain speech recognition text in the traditional characters that Taiwanese users are accustomed to.
  • as another example, when the meeting is held in the mainland, the initial speech recognition text is matched with each form of the words in the preset text database according to mainland habits, to obtain speech recognition text in the simplified characters that mainland users are accustomed to.
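  • a minimal sketch of matching against multiple stored forms; the form table is hypothetical, and a real system might generate the variants with a character-conversion library:

```python
# Hypothetical multi-form table for one mood word: simplified and traditional
# variants plus a form with an inserted space, as the patent describes.
forms = {"这个": ["这个", "這個", "这 个"]}

def match_any_form(text: str, word: str) -> bool:
    """True if the text contains any stored form of the word."""
    return any(f in text for f in forms.get(word, [word]))

print(match_any_form("這個會議現在開始", "这个"))  # -> True (traditional form)
```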
  • the method for recognizing conference speech as text described in this application converts the conference speech to be recognized into text by speech recognition technology as the initial speech recognition text; matches the initial speech recognition text with a preset text database to obtain matched speech recognition text; generates an editable speech recognition text draft based on the matched speech recognition text; and, when detecting that an editing operation has been received on the speech recognition text draft, generates a speech recognition text with an uneditable state from the edited speech recognition text as the final speech recognition text.
  • because there is a very small probability that automatic replacement and deletion will change originally correct text into incorrect text, the system presents the modified text to the user in a modifiable mode and marks the modified places.
  • the user can correct places where the system's automatic operations were wrong. That is, after preliminary recognition of the speech to be recognized, a first match with the preset text databases is performed, and then a second confirmation is performed manually.
  • these two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
  • FIG. 2 is a functional module diagram of a preferred embodiment of an apparatus for recognizing conference speech as text in this application.
  • the device 20 for recognizing conference speech as text runs in an electronic device.
  • the apparatus 20 for recognizing conference speech as text may include a plurality of function modules composed of instruction code segments.
  • the instruction code of the various instruction segments in the device 20 for recognizing conference speech as text may be stored in a memory and executed by at least one processor to perform the function of recognizing conference speech as text (see FIG. 1 and its related description).
  • the apparatus 20 for recognizing a conference voice as text of the electronic device may be divided into a plurality of functional modules according to functions performed by the apparatus 20.
  • the functional modules may include: an identification module 201, a matching module 202, a generation module 203, a detection module 204, an association module 205, and a setting module 206.
  • the module referred to in the present application refers to a series of computer-readable instruction segments capable of being executed by at least one processor and capable of performing fixed functions, which are stored in a memory. In some embodiments, functions of each module will be described in detail in subsequent embodiments.
  • the recognition module 201 is configured to convert a conference voice to be recognized into text by using a voice recognition technology as an initial voice recognition text.
  • the specific process of the recognition module 201 converting the conference speech to be recognized into text through speech recognition technology includes:
  • the grammar rule is the Viterbi algorithm.
  • for example, if the conference speech to be recognized is "你好" ("hello"), it is converted after feature extraction into a 39-dimensional acoustic feature vector, the corresponding sub-words /n/ /i/ /h/ /ao/ are obtained through multiple HMM phoneme models, the sub-words are spliced into candidate characters according to a preset pronunciation dictionary, such as 你/尼 and 好/号, and Viterbi decoding yields the optimal sequence "你好", which is output as text.
  • the matching module 202 is configured to match the initial speech recognition text with a preset text database to obtain a matched speech recognition text.
  • At least three text databases may be set in advance, for example, a first text database, a second text database, and a third text database.
  • the first text database can be dedicated to storing multiple mood words, such as "嗯" ("um"), "啊" ("ah"), and "是吧" ("right?").
  • the mood words are irrelevant to the content of the meeting and easily impair the readability of the text after the speech is converted.
  • the second text database can be dedicated to storing multiple professional terms, such as "特征向量" ("feature vector"), "特征矩阵" ("feature matrix"), and "张量分析" ("tensor analysis"); professional terms are relatively complex, so recognition errors tend to recur in batches during speech recognition.
  • the third text database can be used to store multiple taboo or sensitive words, such as politically sensitive names and terms related to cults and superstition, pornography, gambling, drugs, firearms and ammunition, and abusive or otherwise illegal content; the appearance of taboo or sensitive words can easily cause adverse effects.
  • the present application may also set a fourth text database in advance according to the actual situation, dedicated to storing expressions such as personal names or place names. This application does not specifically limit the number of preset text databases or their contents.
  • the matching module 202 matching the initial speech recognition text with a preset text database specifically includes:
  • three independently running threads can be set in advance: a first thread, a second thread, and a third thread. The first thread executes readable instructions for matching the initial speech recognition text with the preset first text database to obtain a first matching result, the second thread executes readable instructions for matching the first matching result with the preset second text database to obtain a second matching result, and the third thread executes readable instructions for matching the second matching result with the preset third text database.
  • when the first thread finishes, the second thread starts immediately, and when the second thread finishes, the third thread starts immediately. Setting up three independently running threads to execute the different readable instructions helps increase matching speed and saves matching time.
  • in other embodiments, only one thread may be provided, which sequentially executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain a first matching result, for matching the first matching result with the preset second text database to obtain a second matching result, and for matching the second matching result with the preset third text database.
  • the matching module 202 may further include a first matching sub-module 2020, a second matching sub-module 2022, and a third matching sub-module 2024.
  • the first matching sub-module 2020 matching the initial speech recognition text with the preset first text database includes: determining whether a first word matching a word in the preset first text database exists in the initial speech recognition text; when it is determined that such a first word exists, processing the matched first word in the initial speech recognition text.
  • the first matching sub-module 2020 is further configured to: determine, according to the pre-trained deep-learning-network-based mood word model, whether the matched first word is a mood word to be deleted; when the matched first word is determined to be a mood word to be deleted, remove the matched first word from the initial speech recognition text; when the matched first word is determined not to be a mood word to be deleted, retain the matched first word in the initial speech recognition text.
  • for example, suppose the initial speech recognition text is "这个挺好用的" ("this works well") and the mood word "这个" ("this") is stored in the preset first text database; matching the initial speech recognition text with the preset first text database determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model then judges whether the matched first word "这个" is a mood word to be deleted; the model determines that "这个" in "这个挺好用的" is not a mood word to be deleted, so the matched first word is retained in the initial speech recognition text, and the first matching result is "这个挺好用的".
  • as another example, suppose the initial speech recognition text is "这个，我们要开会了" ("well, we are about to have a meeting") and the mood word "这个" is stored in the preset first text database; matching determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model determines that "这个" in "这个，我们要开会了" is a mood word to be deleted, so the matched first word is removed from the initial speech recognition text, and the first matching result is "我们要开会了" ("we are about to have a meeting").
  • the process by which the first matching sub-module 2020 trains the deep-learning-network-based mood word model may include: first obtaining a large number of texts containing words from the first text database;
  • the texts are then divided into positive sample texts and negative sample texts, where a positive sample text is a text in which the mood word needs to be retained and a negative sample text is a text in which the mood word needs to be deleted.
  • the first identifier is used to identify that a word in the sample needs to be retained, for example, it may be "1".
  • the second identifier is used to identify that a word in the sample needs to be deleted, for example, it may be "0".
  • the positive sample texts are input into the deep learning network for training, and it is determined whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; when the similarity between the output text and the input positive sample is greater than the preset similarity threshold, the training of the deep-learning-network-based mood word model ends.
  • the similarity between the output text and the input positive sample can be calculated by a template matching method; template matching is prior art and is not described in detail in this application.
  • the second matching sub-module 2022 matching the first matching result with the preset second text database includes: converting the words in the first matching result into a first pinyin; determining whether a second pinyin identical to the first pinyin exists in the preset second text database; and, when such a second pinyin exists, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
  • for example, suppose the first matching result is "这是一个原始巨震"; the words in the first matching result are converted into the first pinyin "zhe shi yi ge yuan shi ju zhen". The preset second text database stores the professional term "矩阵" ("matrix") and its corresponding second pinyin "ju zhen"; when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, the word "矩阵" corresponding to the second pinyin "ju zhen" is extracted as the word corresponding to the first pinyin, and the second matching result obtained is "这是一个原始矩阵" ("this is an original matrix").
  • the third matching sub-module 2024 matching the second matching result with the preset third text database includes: determining whether a third word matching a word in the preset third text database exists in the second matching result; when it is determined that such a third word exists, removing the matched third word from the second matching result.
  • a generating module 203 is configured to generate a speech recognition text draft having an editable state according to the matched speech recognition text.
  • a draft of the speech recognition text having an editable state can be generated first.
  • Having an editable state means that the user can perform editing operations on the generated speech recognition text draft.
  • the editing operation may include a confirmation operation and a modification operation.
  • the confirmation operation means that the user confirms that the draft of the speech recognition text is correct, that is, it is determined that the modified speech recognition text does not require any modification operation.
  • the modification operation means that the user confirms that the draft of the speech recognition text contains errors and that individual words or a small number of words need to be adjusted, that is, the matched speech recognition text still needs to be modified manually.
  • the detection module 204 is configured to detect whether an editing operation is received on the speech recognition text draft.
  • the detection module 204 detecting whether an editing operation is received on the speech recognition text draft includes: detecting whether a touch operation is received on a preset button on the speech recognition text draft; when a touch operation on a preset button is detected, it is considered that an editing operation has been received on the speech recognition text draft; when no touch operation on a preset button is detected, it is considered that no editing operation has been received on the speech recognition text draft.
  • the preset button may be a confirm button or a modify button.
  • the button may be a virtual icon or a physical button.
  • the editing operation corresponding to the confirmation button is a confirmation operation
  • the editing operation corresponding to the modification button is a modification operation.
  • the generating module 203 is further configured to, when the detection module detects that an editing operation has been received on the speech recognition text draft, generate a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text.
  • the generating module 203 generating a speech recognition text having an uneditable state from the speech recognition text after the editing operation includes:
  • when the received editing operation is a confirmation operation, a speech recognition text having an uneditable state is generated directly;
  • when the received editing operation is a modification operation, the user's manual modification is received and the modified new content is saved; when a confirmation operation is received again, a speech recognition text having an uneditable state is generated.
  • the device 20 for recognizing conference speech as text may further include: an association module 205, configured to store, in association, the original word at the modified position and the new word as modified by the user.
  • the recognition module 201 is further configured to, in subsequent speech recognition, convert the conference speech to be recognized into text according to the modified new words.
  • the device 20 for recognizing conference speech as text may further include: a setting module 206, configured to store in advance multiple forms corresponding to each word; the multiple forms may include, but are not limited to, simplified and traditional forms, forms with spaces inserted, and visually similar characters.
  • the matching of the initial speech recognition text with the preset text database further includes: matching the initial speech recognition text with the multiple forms corresponding to each word of the preset text database according to the environment in which the meeting takes place, to obtain speech recognition text that fits the meeting's environment.
  • the environment in which the meeting is located may include: the participants of the meeting, and the place where the meeting is held.
  • for example, when the participants of the meeting are Taiwanese, who customarily use traditional characters, the recognized initial speech recognition text may contain traditional and/or simplified characters, so it is necessary to match both the simplified and traditional forms of the words to obtain speech recognition text in the traditional characters that Taiwanese users are accustomed to.
  • as another example, when the meeting is held in the mainland, the initial speech recognition text is matched with each form of the words in the preset text database according to mainland habits, to obtain speech recognition text in the simplified characters that mainland users are accustomed to.
  • the device for recognizing conference speech as text described in this application converts the conference speech to be recognized into text by speech recognition technology as the initial speech recognition text; matches the initial speech recognition text with a preset text database to obtain matched speech recognition text; generates an editable speech recognition text draft based on the matched speech recognition text; and, when detecting that an editing operation has been received on the speech recognition text draft, generates a speech recognition text with an uneditable state from the edited speech recognition text as the final speech recognition text.
  • because there is a very small probability that automatic replacement and deletion will change originally correct text into incorrect text, the system presents the modified text to the user in a modifiable mode and marks the modified places.
  • the user can correct places where the system's automatic operations were wrong. That is, after preliminary recognition of the speech to be recognized, a first match with the preset text databases is performed, and then a second confirmation is performed manually.
  • these two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
  • the above integrated unit implemented in the form of a software functional module may be stored in a non-volatile readable storage medium.
  • the above software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, or a network device) or a processor to execute parts of the methods described in the embodiments of this application.
  • FIG. 3 is a schematic diagram of an electronic device provided in Embodiment 3 of the present application.
  • the electronic device 3 includes a memory 31, at least one processor 32, computer-readable instructions 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.
  • when the at least one processor 32 executes the computer-readable instructions 33, the steps in the embodiments of the method for recognizing conference speech as text are implemented.
  • the computer-readable instructions 33 may be divided into one or more modules / units, and the one or more modules / units are stored in the memory 31 and processed by the at least one processor 32 Execute to complete this application.
  • the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 33 in the electronic device 3.
  • the electronic device 3 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server.
  • those skilled in the art will understand that the schematic FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on the electronic device 3; it may include more or fewer components than shown, combine certain components, or have different components.
  • for example, the electronic device 3 may further include input/output devices, network access devices, buses, and the like.
  • the at least one processor 32 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the processor 32 may be a microprocessor or any conventional processor; the processor 32 is the control center of the electronic device 3, connecting the various parts of the entire electronic device 3 using various interfaces and lines.
  • the memory 31 may be configured to store the computer-readable instructions 33 and/or modules/units; the processor 32 implements the various functions of the electronic device 3 by running or executing the computer-readable instructions and/or modules/units stored in the memory 31 and by invoking data stored in the memory 31.
  • the memory 31 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 3 (such as audio data and phone books).
  • the memory 31 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
  • when the integrated module/unit of the electronic device 3 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile readable storage medium. Based on this understanding, this application may implement all or part of the processes in the methods of the above embodiments through computer-readable instructions instructing related hardware; the computer-readable instructions can be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of the foregoing method embodiments.
  • the computer-readable instruction code may be in a source code form, an object code form, an executable file, or some intermediate form.
  • the non-volatile readable medium may include: any entity or recording medium capable of carrying the computer-readable instruction code, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit.
  • the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A method for recognizing conference speech as text, an electronic device, and a storage medium. The method includes: converting the conference speech to be recognized into text by speech recognition technology, as initial speech recognition text (S11); matching the initial speech recognition text with a preset text database to obtain matched speech recognition text (S12); generating a speech recognition text draft having an editable state from the matched speech recognition text (S13); and, when an editing operation is detected on the speech recognition text draft, generating a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text (S14). By performing a first match with the preset text databases after preliminary recognition of the speech to be recognized, followed by a second, manual confirmation, the method, electronic device, and storage medium effectively ensure the correctness of the text output, reduce the proofreading workload for conference content, and improve efficiency.

Description

Method for recognizing conference speech as text, electronic device, and storage medium (将会议语音识别为文本的方法、电子设备及存储介质)
This application claims priority to Chinese patent application No. 201810581922.4, filed with the Chinese Patent Office on June 7, 2018 and entitled "Method for recognizing conference speech as text, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech recognition technology, and in particular to a method for recognizing conference speech as text, an electronic device, and a storage medium.
Background
Automatic Speech Recognition (ASR) is a core technology in fields such as machine translation, robot control, and next-generation human-computer interfaces; it enables a computer to take dictation of continuous speech spoken by different people, converting sound into text.
At present, with the continuous development of speech recognition technology, applications based on speech recognition are becoming ever more widespread, and the technology has penetrated home life, office work, entertainment, and other areas. A user inputs speech through an external or built-in microphone on a terminal such as a personal computer, notebook computer, tablet computer, dedicated learning terminal, or smartphone, and speech-to-text conversion is completed by a speech recognition device.
There are many existing speech recognition systems, for example the widely used, world-famous Nuance, Google's speech recognition service, and iFLYTEK's speech recognition service in China. The biggest problem in speech recognition, however, is accuracy: even Nuance, which has the highest recognition accuracy among existing systems, cannot avoid the following problems: the frequent appearance of irrelevant words such as mood words makes text analysis harder, some professional keywords are recognized inaccurately, and taboo or sensitive words cannot be identified, all of which impair the readability and analyzability of conference text.
Summary
In view of the above, it is necessary to provide a method for recognizing conference speech as text, an electronic device, and a storage medium, in which the dual process of matching against preset text databases and manual confirmation effectively ensures the correctness of the text output, improves unreasonable wording in traditional speech-to-text conversion, effectively reduces the proofreading workload for conference content, and improves efficiency.
A first aspect of this application provides a method for recognizing conference speech as text, the method including:
converting the conference speech to be recognized into text by speech recognition technology, as initial speech recognition text;
matching the initial speech recognition text with a preset text database to obtain matched speech recognition text;
generating a speech recognition text draft having an editable state from the matched speech recognition text;
when an editing operation is detected on the speech recognition text draft, generating a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text.
A second aspect of this application provides an electronic device including a processor and a memory, the processor being configured to implement the method for recognizing conference speech as text when executing computer-readable instructions stored in the memory.
A third aspect of this application provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the method for recognizing conference speech as text.
In the method, electronic device, and storage medium for recognizing conference speech as text described in this application, the conference speech to be recognized is converted into text by speech recognition technology as initial speech recognition text; the initial speech recognition text is matched with a preset text database to obtain matched speech recognition text; a speech recognition text draft having an editable state is generated from the matched speech recognition text; and, when an editing operation is detected on the speech recognition text draft, a speech recognition text having an uneditable state is generated from the edited speech recognition text as the final speech recognition text. After preliminary recognition of the speech to be recognized, this application performs a first match with the preset text databases and then a second, manual confirmation. These two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
Brief Description of the Drawings
FIG. 1 is a flowchart of the method for recognizing conference speech as text provided in Embodiment 1 of this application.
FIG. 2 is a functional module diagram of the apparatus for recognizing conference speech as text provided in Embodiment 2 of this application.
FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of this application.
The following detailed description will further illustrate this application with reference to the above drawings.
Detailed Description
To make the above objects, features, and advantages of this application clearer and easier to understand, this application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of this application; the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.
The method for recognizing conference speech as text in the embodiments of this application is applied in one or more electronic devices. The method may also be applied in a hardware environment consisting of an electronic device and a server connected to the electronic device through a network. Networks include, but are not limited to: wide area networks, metropolitan area networks, and local area networks. The method for recognizing conference speech as text in the embodiments of this application may be executed by a server, by an electronic device, or jointly by both.
For an electronic device that needs to perform the method for recognizing conference speech as text, the function of recognizing conference speech as text provided by the method of this application may be integrated directly on the electronic device, or a client for implementing the method of this application may be installed on it. Alternatively, the method provided in this application may run on a device such as a server in the form of a Software Development Kit (SDK), providing an interface for the conference-speech-to-text function in the form of the SDK; an electronic device or other device can then recognize conference speech as text through the provided interface.
Embodiment 1
FIG. 1 is a flowchart of the method for recognizing conference speech as text provided in Embodiment 1 of this application. According to different requirements, the order of execution in the flowchart may be changed and some steps may be omitted.
S11: Convert the conference speech to be recognized into text by speech recognition technology, as initial speech recognition text.
In this embodiment, the specific process of converting the conference speech to be recognized into text by speech recognition technology includes:
1) extracting audio features of the conference speech to be recognized and converting them into acoustic feature vectors of a preset length;
2) decoding the feature vectors into a word sequence according to a decoding algorithm;
3) obtaining the sub-words corresponding to the word sequence through HMM phoneme models, the sub-words being initials and finals;
4) splicing multiple sub-words into characters according to a preset pronunciation dictionary;
5) decoding with the language model's grammar rules to obtain the optimal sequence, yielding the text.
The grammar rule is the Viterbi algorithm. For example, if the conference speech to be recognized is "你好" ("hello"), it is converted after feature extraction into a 39-dimensional acoustic feature vector; the corresponding sub-words /n/ /i/ /h/ /ao/ are obtained through multiple HMM phoneme models; the sub-words are spliced into candidate characters according to a preset pronunciation dictionary, such as 你/尼 and 好/号; Viterbi decoding then yields the optimal sequence "你好", which is output as text.
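As an illustration of the feature extraction in step 1), the following is a minimal sketch of producing a 39-dimensional acoustic feature vector per frame (13 MFCCs plus first- and second-order deltas, a common convention the patent does not spell out) using the librosa library; the file name and sampling rate are assumptions:

```python
import librosa
import numpy as np

# Load conference audio (hypothetical file); 16 kHz is typical for ASR.
y, sr = librosa.load("meeting.wav", sr=16000)

# 13 MFCCs per frame, plus first and second derivatives -> 39 dims per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)
print(features.shape)
```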
S12: Match the initial speech recognition text with a preset text database to obtain matched speech recognition text.
In this embodiment, at least three text databases may be set in advance, for example a first text database, a second text database, and a third text database. The first text database may be dedicated to storing multiple mood words, such as "嗯" ("um"), "啊" ("ah"), and "是吧" ("right?"); mood words are irrelevant to the meeting content and easily impair the readability of the text after the speech is converted. The second text database may be dedicated to storing multiple professional terms and their corresponding pinyin, such as "特征向量" ("feature vector"), "特征矩阵" ("feature matrix"), and "张量分析" ("tensor analysis"); professional terms are relatively complex, so recognition errors tend to recur in batches during speech recognition. The third text database may be dedicated to storing multiple taboo or sensitive words, such as politically sensitive names and terms related to cults and superstition, pornography, gambling, drugs, firearms and ammunition, and abusive or otherwise illegal content; the appearance of taboo or sensitive words can easily cause adverse effects. A fourth text database and so on may also be set in advance according to the actual situation, dedicated to storing expressions such as personal names or place names. This application does not specifically limit the number of preset text databases or their contents.
The matching of the initial speech recognition text with the preset text databases specifically includes:
1) matching the initial speech recognition text with the preset first text database to obtain a first matching result;
2) matching the first matching result with the preset second text database to obtain a second matching result;
3) matching the second matching result with the preset third text database.
Three independently running threads may be set in advance: a first thread, a second thread, and a third thread. The first thread executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain the first matching result, the second thread executes the readable instructions for matching the first matching result with the preset second text database to obtain the second matching result, and the third thread executes the readable instructions for matching the second matching result with the preset third text database. When the first thread finishes, the second thread starts immediately; when the second thread finishes, the third thread starts immediately. Setting up three independently running threads to execute the different readable instructions helps increase matching speed and saves matching time.
In other embodiments, only one thread may be provided, which sequentially executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain the first matching result, for matching the first matching result with the preset second text database to obtain the second matching result, and for matching the second matching result with the preset third text database.
Specifically, matching the initial speech recognition text with the preset first text database includes: determining whether a first word matching a word in the preset first text database exists in the initial speech recognition text; when it is determined that such a first word exists, processing the matched first word in the initial speech recognition text.
Preferably, processing the matched first word in the initial speech recognition text may further include: determining, according to the pre-trained deep-learning-network-based mood word model, whether the matched first word is a mood word to be deleted; when the matched first word is determined to be a mood word to be deleted, removing the matched first word from the initial speech recognition text; when the matched first word is determined not to be a mood word to be deleted, retaining the matched first word in the initial speech recognition text.
For example, suppose the initial speech recognition text is "这个挺好用的" ("this works well") and the mood word "这个" ("this") is stored in the preset first text database; matching the initial speech recognition text with the preset first text database determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model then judges whether the matched first word "这个" is a mood word to be deleted; the model determines that "这个" in "这个挺好用的" is not a mood word to be deleted, so the matched first word is retained in the initial speech recognition text, and the first matching result is "这个挺好用的".
As another example, suppose the initial speech recognition text is "这个，我们要开会了" ("well, we are about to have a meeting") and the mood word "这个" is stored in the preset first text database; matching determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model determines that "这个" in "这个，我们要开会了" is a mood word to be deleted, so the matched first word is removed from the initial speech recognition text, and the first matching result is "我们要开会了" ("we are about to have a meeting").
Preferably, the training method for the deep-learning-network-based mood word model may include:
1) obtaining a large number of texts containing words from the first text database;
2) dividing the texts into positive sample texts and negative sample texts, the positive sample texts being texts in which the mood word needs to be retained and the negative sample texts being texts in which the mood word needs to be deleted;
For example, if the word in the first text database is "这个", a large number of texts containing "这个" can be obtained, such as "这个项目目前正在进行中" ("this project is currently in progress"), "这个人是谁" ("who is this person"), "这个，嗯，还在询问中" ("well, um, it is still being looked into"), and "这个可以这样的" ("well, it can be done this way"). In "这个项目目前正在进行中" and "这个人是谁", "这个" is a word that needs to be retained; in "这个，嗯，还在询问中" and "这个可以这样的", "这个" is a mood word that needs to be deleted.
3) labeling the positive sample texts with a first identifier and the negative sample texts with a second identifier;
The first identifier indicates that the word in the sample needs to be retained and may be, for example, "1". The second identifier indicates that the word in the sample needs to be deleted and may be, for example, "0".
4) inputting the positive sample texts into the deep learning network for training, and determining whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; when the similarity between the output text and the input positive sample is greater than the preset similarity threshold, the training of the deep-learning-network-based mood word model ends.
The similarity between the output text and the input positive sample can be calculated by a template matching method; template matching is prior art and is not described in detail in this application.
Specifically, matching the first matching result with the preset second text database includes:
1) converting the words in the first matching result into a first pinyin;
2) determining whether a second pinyin identical to the first pinyin exists in the preset second text database;
3) when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
For example, suppose the first matching result is "这是一个原始巨震"; the words in the first matching result are converted into the first pinyin "zhe shi yi ge yuan shi ju zhen". The preset second text database stores the professional term "矩阵" ("matrix") and its corresponding second pinyin "ju zhen"; when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, the word "矩阵" corresponding to the second pinyin "ju zhen" is extracted as the word corresponding to the first pinyin, and the second matching result obtained is "这是一个原始矩阵" ("this is an original matrix").
Specifically, matching the second matching result with the preset third text database includes: determining whether a third word matching a word in the preset third text database exists in the second matching result; when it is determined that such a third word exists, removing the matched third word from the second matching result.
S13: Generate a speech recognition text draft having an editable state from the matched speech recognition text.
After the initial speech recognition text has been automatically modified according to the matching results, the matched speech recognition text is obtained, and a speech recognition text draft having an editable state may be generated first. An editable state means that the user can perform editing operations on the generated speech recognition text draft. The editing operations may include a confirmation operation and a modification operation.
The confirmation operation means that the user confirms that the speech recognition text draft is correct, that is, the matched speech recognition text requires no modification. The modification operation means that the user confirms that the speech recognition text draft contains errors and that individual words or a small number of words need to be adjusted, that is, the matched speech recognition text still needs to be modified manually.
S14: When an editing operation is detected on the speech recognition text draft, generate a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text.
Whether an editing operation has been received on the speech recognition text draft is detected as follows: detect whether a touch operation has been received on a preset button on the speech recognition text draft; when a touch operation on a preset button is detected, it is considered that an editing operation has been received on the speech recognition text draft; when no touch operation on a preset button is detected, it is considered that no editing operation has been received on the speech recognition text draft.
The preset button may be a confirm button or a modify button. The button may be a virtual icon or a physical key. The editing operation corresponding to the confirm button is the confirmation operation, and the editing operation corresponding to the modify button is the modification operation.
Generating the speech recognition text having an uneditable state from the edited speech recognition text includes:
when the received editing operation is a confirmation operation, directly generating the speech recognition text having an uneditable state;
when the received editing operation is a modification operation, receiving the user's manual modification and saving the modified new content; when a confirmation operation is received again, generating the speech recognition text having an uneditable state.
Preferably, after the received user editing operation is a modification operation, the method may further include: storing, in association, the original word at the modified position and the new word as modified by the user; in subsequent speech recognition, converting the conference speech to be recognized into text according to the modified new word.
Recording each user modification and storing the original word at the modified position in association with the user's new word makes it possible to use the user's corrected words directly in subsequent speech recognition, thereby lowering the recognition error rate, improving recognition accuracy, and in particular saving the user the trouble of repeated corrections.
Preferably, when presetting the text databases, the method may further include: storing in advance multiple forms corresponding to each word, which may include, but are not limited to, simplified and traditional forms, forms with spaces inserted, and visually similar characters. Matching the initial speech recognition text with the preset text databases then further includes: matching the initial speech recognition text with the multiple forms corresponding to each word of the preset text databases according to the environment in which the meeting takes place, to obtain speech recognition text that fits the meeting's environment.
The environment of the meeting may include the participants of the meeting and the place where the meeting is held.
Storing multiple forms for each word and matching the initial speech recognition text with every form of the words in the preset text databases according to the meeting environment prevents mood words or taboo and sensitive words containing spaces from escaping recognition. It also suits different settings: for example, when the participants of the meeting are Taiwanese, who customarily use traditional characters, the recognized initial speech recognition text may contain traditional and/or simplified characters, so it is necessary to match both the simplified and traditional forms of the words to obtain speech recognition text in the traditional characters that Taiwanese users are accustomed to. As another example, when the meeting is held in the mainland, the initial speech recognition text is matched with every form of the words in the preset text databases according to mainland habits, to obtain speech recognition text in the simplified characters that mainland users are accustomed to.
In the method for recognizing conference speech as text described in this application, the conference speech to be recognized is converted into text by speech recognition technology as initial speech recognition text; the initial speech recognition text is matched with preset text databases to obtain matched speech recognition text; a speech recognition text draft having an editable state is generated from the matched speech recognition text; and, when an editing operation is detected on the speech recognition text draft, a speech recognition text having an uneditable state is generated from the edited speech recognition text as the final speech recognition text. After the conference speech has been converted into text mode by ASR technology, the text content is searched using the contents of the stored word databases, and corresponding replacement, deletion, and other operations are performed. Because there is a very small probability that replacement and deletion operations will change originally correct text into incorrect text, the system re-presents the modified text to the user in a modifiable mode and marks the modified places for the user's confirmation; the user can re-modify places where the system operated incorrectly. That is, after preliminary recognition of the speech to be recognized, a first match with the preset text databases is performed, followed by a second, manual confirmation. These two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto; improvements that a person of ordinary skill in the art can make without departing from the creative concept of this application all fall within the protection scope of this application.
The functional modules and hardware structure of the electronic device implementing the above method for recognizing conference speech as text are described below with reference to FIGS. 2 and 3.
Embodiment 2
FIG. 2 is a functional module diagram of a preferred embodiment of the apparatus for recognizing conference speech as text in this application.
In some embodiments, the apparatus 20 for recognizing conference speech as text runs in an electronic device. The apparatus 20 may include multiple functional modules composed of instruction code segments. The instruction code of the various instruction segments in the apparatus 20 may be stored in a memory and executed by at least one processor to perform the function of recognizing conference speech as text (see FIG. 1 and its related description).
In this embodiment, the apparatus 20 for recognizing conference speech as text of the electronic device may be divided into multiple functional modules according to the functions it performs. The functional modules may include: a recognition module 201, a matching module 202, a generation module 203, a detection module 204, an association module 205, and a setting module 206. A module referred to in this application is a series of computer-readable instruction segments that can be executed by at least one processor and can perform a fixed function, and that are stored in a memory. In some embodiments, the functions of the modules are detailed in subsequent embodiments.
The recognition module 201 is configured to convert the conference speech to be recognized into text by speech recognition technology, as initial speech recognition text.
In this embodiment, the specific process by which the recognition module 201 converts the conference speech to be recognized into text by speech recognition technology includes:
1) extracting audio features of the conference speech to be recognized and converting them into acoustic feature vectors of a preset length;
2) decoding the feature vectors into a word sequence according to a decoding algorithm;
3) obtaining the sub-words corresponding to the word sequence through HMM phoneme models, the sub-words being initials and finals;
4) splicing multiple sub-words into characters according to a preset pronunciation dictionary;
5) decoding with the language model's grammar rules to obtain the optimal sequence, yielding the text.
The grammar rule is the Viterbi algorithm. For example, if the conference speech to be recognized is "你好" ("hello"), it is converted after feature extraction into a 39-dimensional acoustic feature vector; the corresponding sub-words /n/ /i/ /h/ /ao/ are obtained through multiple HMM phoneme models; the sub-words are spliced into candidate characters according to a preset pronunciation dictionary, such as 你/尼 and 好/号; Viterbi decoding then yields the optimal sequence "你好", which is output as text.
The matching module 202 is configured to match the initial speech recognition text with a preset text database to obtain matched speech recognition text.
In this embodiment, at least three text databases may be set in advance, for example a first text database, a second text database, and a third text database. The first text database may be dedicated to storing multiple mood words, such as "嗯" ("um"), "啊" ("ah"), and "是吧" ("right?"); mood words are irrelevant to the meeting content and easily impair the readability of the text after the speech is converted. The second text database may be dedicated to storing multiple professional terms, such as "特征向量" ("feature vector"), "特征矩阵" ("feature matrix"), and "张量分析" ("tensor analysis"); professional terms are relatively complex, so recognition errors tend to recur in batches during speech recognition. The third text database may be dedicated to storing multiple taboo or sensitive words, such as politically sensitive names and terms related to cults and superstition, pornography, gambling, drugs, firearms and ammunition, and abusive or otherwise illegal content; the appearance of taboo or sensitive words can easily cause adverse effects. A fourth text database and so on may also be set in advance according to the actual situation, dedicated to storing expressions such as personal names or place names. This application does not specifically limit the number of preset text databases or their contents.
The matching module 202 matching the initial speech recognition text with the preset text databases specifically includes:
1) matching the initial speech recognition text with the preset first text database to obtain a first matching result;
2) matching the first matching result with the preset second text database to obtain a second matching result;
3) matching the second matching result with the preset third text database to obtain a third matching result.
Three independently running threads may be set in advance: a first thread, a second thread, and a third thread. The first thread executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain the first matching result, the second thread executes the readable instructions for matching the first matching result with the preset second text database to obtain the second matching result, and the third thread executes the readable instructions for matching the second matching result with the preset third text database. When the first thread finishes, the second thread starts immediately; when the second thread finishes, the third thread starts immediately. Setting up three independently running threads to execute the different readable instructions helps increase matching speed and saves matching time.
In other embodiments, only one thread may be provided, which sequentially executes the readable instructions for matching the initial speech recognition text with the preset first text database to obtain the first matching result, for matching the first matching result with the preset second text database to obtain the second matching result, and for matching the second matching result with the preset third text database.
The matching module 202 may further include: a first matching sub-module 2020, a second matching sub-module 2022, and a third matching sub-module 2024.
Specifically, the first matching sub-module 2020 matching the initial speech recognition text with the preset first text database includes: determining whether a first word matching a word in the preset first text database exists in the initial speech recognition text; when it is determined that such a first word exists, processing the matched first word in the initial speech recognition text.
Preferably, the first matching sub-module 2020 is further configured to: determine, according to the pre-trained deep-learning-network-based mood word model, whether the matched first word is a mood word to be deleted; when the matched first word is determined to be a mood word to be deleted, remove the matched first word from the initial speech recognition text; when the matched first word is determined not to be a mood word to be deleted, retain the matched first word in the initial speech recognition text.
For example, suppose the initial speech recognition text is "这个挺好用的" ("this works well") and the mood word "这个" ("this") is stored in the preset first text database; matching the initial speech recognition text with the preset first text database determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model then judges whether the matched first word "这个" is a mood word to be deleted; the model determines that "这个" in "这个挺好用的" is not a mood word to be deleted, so the matched first word is retained in the initial speech recognition text, and the first matching result is "这个挺好用的".
As another example, suppose the initial speech recognition text is "这个，我们要开会了" ("well, we are about to have a meeting") and the mood word "这个" is stored in the preset first text database; matching determines that the matched word is "这个". The pre-trained deep-learning-network-based mood word model determines that "这个" in "这个，我们要开会了" is a mood word to be deleted, so the matched first word is removed from the initial speech recognition text, and the first matching result is "我们要开会了" ("we are about to have a meeting").
Preferably, the process by which the first matching sub-module 2020 trains the deep-learning-network-based mood word model may include:
1) obtaining a large number of texts containing words from the first text database;
2) dividing the texts into positive sample texts and negative sample texts, the positive sample texts being texts in which the mood word needs to be retained and the negative sample texts being texts in which the mood word needs to be deleted;
For example, if the word in the first text database is "这个", a large number of texts containing "这个" can be obtained, such as "这个项目目前正在进行中" ("this project is currently in progress"), "这个人是谁" ("who is this person"), "这个，嗯，还在询问中" ("well, um, it is still being looked into"), and "这个可以这样的" ("well, it can be done this way"). In "这个项目目前正在进行中" and "这个人是谁", "这个" is a word that needs to be retained; in "这个，嗯，还在询问中" and "这个可以这样的", "这个" is a mood word that needs to be deleted.
3) labeling the positive sample texts with a first identifier and the negative sample texts with a second identifier;
The first identifier indicates that the word in the sample needs to be retained and may be, for example, "1". The second identifier indicates that the word in the sample needs to be deleted and may be, for example, "0".
4) inputting the positive sample texts into the deep learning network for training, and determining whether the similarity between the output text and the input positive sample is greater than a preset similarity threshold; when the similarity between the output text and the input positive sample is greater than the preset similarity threshold, the training of the deep-learning-network-based mood word model ends.
The similarity between the output text and the input positive sample can be calculated by a template matching method; template matching is prior art and is not described in detail in this application.
Specifically, the second matching sub-module 2022 matching the first matching result with the preset second text database includes: converting the words in the first matching result into a first pinyin; determining whether a second pinyin identical to the first pinyin exists in the preset second text database; when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
For example, suppose the first matching result is "这是一个原始巨震"; the words in the first matching result are converted into the first pinyin "zhe shi yi ge yuan shi ju zhen". The preset second text database stores the professional term "矩阵" ("matrix") and its corresponding second pinyin "ju zhen"; when it is determined that a second pinyin identical to the first pinyin exists in the preset second text database, the word "矩阵" corresponding to the second pinyin "ju zhen" is extracted as the word corresponding to the first pinyin, and the second matching result obtained is "这是一个原始矩阵" ("this is an original matrix").
Specifically, the third matching sub-module 2024 matching the second matching result with the preset third text database includes: determining whether a third word matching a word in the preset third text database exists in the second matching result; when it is determined that such a third word exists, removing the matched third word from the second matching result.
The generation module 203 is configured to generate a speech recognition text draft having an editable state from the matched speech recognition text.
After the initial speech recognition text has been automatically modified according to the matching results, the matched speech recognition text is obtained, and a speech recognition text draft having an editable state may be generated first. An editable state means that the user can perform editing operations on the generated speech recognition text draft. The editing operations may include a confirmation operation and a modification operation.
The confirmation operation means that the user confirms that the speech recognition text draft is correct, that is, the matched speech recognition text requires no modification. The modification operation means that the user confirms that the speech recognition text draft contains errors and that individual words or a small number of words need to be adjusted, that is, the matched speech recognition text still needs to be modified manually.
The detection module 204 is configured to detect whether an editing operation has been received on the speech recognition text draft.
The detection module 204 detecting whether an editing operation has been received on the speech recognition text draft includes: detecting whether a touch operation has been received on a preset button on the speech recognition text draft; when a touch operation on a preset button is detected, it is considered that an editing operation has been received on the speech recognition text draft; when no touch operation on a preset button is detected, it is considered that no editing operation has been received on the speech recognition text draft.
The preset button may be a confirm button or a modify button. The button may be a virtual icon or a physical key. The editing operation corresponding to the confirm button is the confirmation operation, and the editing operation corresponding to the modify button is the modification operation.
The generation module 203 is further configured to, when the detection module detects that an editing operation has been received on the speech recognition text draft, generate a speech recognition text having an uneditable state from the edited speech recognition text, as the final speech recognition text.
The generation module 203 generating the speech recognition text having an uneditable state from the edited speech recognition text includes:
when the received editing operation is a confirmation operation, directly generating the speech recognition text having an uneditable state;
when the received editing operation is a modification operation, receiving the user's manual modification and saving the modified new content; when a confirmation operation is received again, generating the speech recognition text having an uneditable state.
Preferably, after the received user editing operation is a modification operation, the apparatus 20 for recognizing conference speech as text may further include: an association module 205, configured to store, in association, the original word at the modified position and the new word as modified by the user; the recognition module 201 is further configured to, in subsequent speech recognition, convert the conference speech to be recognized into text according to the modified new word.
Recording each user modification and storing the original word at the modified position in association with the user's new word makes it possible to use the user's corrected words directly in subsequent speech recognition, thereby lowering the recognition error rate, improving recognition accuracy, and in particular saving the user the trouble of repeated corrections.
Preferably, when presetting the text databases, the apparatus 20 for recognizing conference speech as text may further include: a setting module 206, configured to store in advance multiple forms corresponding to each word, which may include, but are not limited to, simplified and traditional forms, forms with spaces inserted, and visually similar characters.
Matching the initial speech recognition text with the preset text databases then further includes: matching the initial speech recognition text with the multiple forms corresponding to each word of the preset text databases according to the environment in which the meeting takes place, to obtain speech recognition text that fits the meeting's environment.
The environment of the meeting may include the participants of the meeting and the place where the meeting is held.
Storing multiple forms for each word and matching the initial speech recognition text with every form of the words in the preset text databases according to the meeting environment prevents mood words or taboo and sensitive words containing spaces from escaping recognition. It also suits different settings: for example, when the participants of the meeting are Taiwanese, who customarily use traditional characters, the recognized initial speech recognition text may contain traditional and/or simplified characters, so it is necessary to match both the simplified and traditional forms of the words to obtain speech recognition text in the traditional characters that Taiwanese users are accustomed to. As another example, when the meeting is held in the mainland, the initial speech recognition text is matched with every form of the words in the preset text databases according to mainland habits, to obtain speech recognition text in the simplified characters that mainland users are accustomed to.
In the apparatus for recognizing conference speech as text described in this application, the conference speech to be recognized is converted into text by speech recognition technology as initial speech recognition text; the initial speech recognition text is matched with preset text databases to obtain matched speech recognition text; a speech recognition text draft having an editable state is generated from the matched speech recognition text; and, when an editing operation is detected on the speech recognition text draft, a speech recognition text having an uneditable state is generated from the edited speech recognition text as the final speech recognition text. After the conference speech has been converted into text mode by ASR technology, the text content is searched using the contents of the stored word databases, and corresponding replacement, deletion, and other operations are performed. Because there is a very small probability that replacement and deletion operations will change originally correct text into incorrect text, the system re-presents the modified text to the user in a modifiable mode and marks the modified places for the user's confirmation; the user can re-modify places where the system operated incorrectly. That is, after preliminary recognition of the speech to be recognized, a first match with the preset text databases is performed, followed by a second, manual confirmation. These two passes effectively ensure the correctness of the text output, improve unreasonable wording in traditional speech-to-text conversion, effectively reduce the proofreading workload for conference content, and improve efficiency.
The integrated units implemented in the form of software functional modules described above may be stored in a non-volatile readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to execute parts of the methods described in the embodiments of this application.
Embodiment 3
FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of this application.
The electronic device 3 includes: a memory 31, at least one processor 32, computer-readable instructions 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.
When the at least one processor 32 executes the computer-readable instructions 33, the steps in the above embodiments of the method for recognizing conference speech as text are implemented.
Exemplarily, the computer-readable instructions 33 may be divided into one or more modules/units, which are stored in the memory 31 and executed by the at least one processor 32 to complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer-readable instructions 33 in the electronic device 3.
The electronic device 3 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on the electronic device 3; it may include more or fewer components than shown, combine certain components, or have different components; for example, the electronic device 3 may further include input/output devices, network access devices, buses, and the like.
The at least one processor 32 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor or any conventional processor; the processor 32 is the control center of the electronic device 3, connecting the various parts of the entire electronic device 3 using various interfaces and lines.
The memory 31 may be configured to store the computer-readable instructions 33 and/or modules/units; the processor 32 implements the various functions of the electronic device 3 by running or executing the computer-readable instructions and/or modules/units stored in the memory 31 and by invoking data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 3 (such as audio data and phone books). In addition, the memory 31 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules/units integrated in the electronic device 3 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on this understanding, this application may implement all or part of the processes in the methods of the above embodiments through computer-readable instructions instructing related hardware; the computer-readable instructions may be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of the above method embodiments. The computer-readable instruction code may be in source code form, object code form, an executable file, some intermediate form, or the like. The non-volatile readable medium may include: any entity or recording medium capable of carrying the computer-readable instruction code, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the contents contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
In the several embodiments provided in this application, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the electronic device embodiments described above are merely illustrative; for instance, the division of the units is only a logical functional division, and there may be other ways of division in actual implementation.
In addition, the functional units in the embodiments of this application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit. The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that this application is not limited to the details of the above exemplary embodiments, and that this application can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, from every point of view, the embodiments should be regarded as exemplary and non-limiting; the scope of this application is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced in this application. No reference sign in the claims should be regarded as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units, and the singular does not exclude the plural. Multiple units or apparatuses stated in the system claims may also be implemented by a single unit or apparatus through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. A method for recognizing conference speech as text, characterized in that the method comprises:
    converting the conference speech to be recognized into text by speech recognition technology, as an initial speech recognition text;
    matching the initial speech recognition text with a preset text database to obtain a matched speech recognition text;
    generating a speech recognition text draft having an editable state according to the matched speech recognition text;
    when it is detected that an editing operation has been received on the speech recognition text draft, generating a speech recognition text having an uneditable state according to the speech recognition text after the editing operation, as the final speech recognition text.
  2. The method according to claim 1, characterized in that the matching of the initial speech recognition text with the preset text database comprises:
    matching the initial speech recognition text with a preset first text database to obtain a first matching result;
    matching the first matching result with a preset second text database to obtain a second matching result;
    matching the second matching result with a preset third text database;
    wherein the preset first text database stores a plurality of modal words, the preset second text database stores a plurality of professional words and their corresponding pinyin, and the preset third text database stores a plurality of taboo and sensitive words.
  3. The method according to claim 2, characterized in that the matching of the initial speech recognition text with the preset first text database comprises:
    determining whether the initial speech recognition text contains a first word matching a word in the preset first text database;
    when it is determined that the initial speech recognition text contains a first word matching a word in the preset first text database, determining, according to a pre-trained modal-word model based on a deep learning network, whether the matched first word is a modal word to be deleted;
    when it is determined that the matched first word is a modal word to be deleted, removing the matched first word from the initial speech recognition text;
    when it is determined that the matched first word is not a modal word to be deleted, retaining the matched first word in the initial speech recognition text.
  4. The method according to claim 2, characterized in that the matching of the first matching result with the preset second text database comprises:
    converting a word in the first matching result into a first pinyin;
    determining whether the preset second text database contains a second pinyin identical to the first pinyin;
    when it is determined that the preset second text database contains a second pinyin identical to the first pinyin, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
  5. The method according to claim 2, characterized in that the matching of the second matching result with the preset third text database comprises:
    determining whether the second matching result contains a third word matching a word in the preset third text database;
    when it is determined that the second matching result contains a third word matching a word in the preset third text database, removing the matched third word from the second matching result.
  6. The method according to claim 1, characterized in that the generating of a speech recognition text having an uneditable state according to the speech recognition text after the editing operation comprises:
    when the received editing operation is a confirmation operation, directly generating a speech recognition text having an uneditable state;
    when the received editing operation is a modification operation, receiving the user's manual modification and saving the new modified content, and, when a confirmation operation is received again, generating a speech recognition text having an uneditable state.
  7. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
    storing the original word at a modified position in association with the new word modified by the user;
    in subsequent speech recognition, converting the conference speech to be recognized into text according to the modified new word.
  8. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
    pre-storing multiple forms corresponding to each word, the multiple forms comprising: simplified and traditional Chinese forms, spaced forms, and visually similar characters;
    matching the initial speech recognition text with the multiple forms corresponding to each word in the preset text database.
  9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory being configured to store at least one instruction and the processor being configured to execute the at least one instruction to implement the following steps:
    converting the conference speech to be recognized into text by speech recognition technology, as an initial speech recognition text;
    matching the initial speech recognition text with a preset text database to obtain a matched speech recognition text;
    generating a speech recognition text draft having an editable state according to the matched speech recognition text;
    when it is detected that an editing operation has been received on the speech recognition text draft, generating a speech recognition text having an uneditable state according to the speech recognition text after the editing operation, as the final speech recognition text.
  10. The electronic device according to claim 9, characterized in that the matching of the initial speech recognition text with the preset text database comprises:
    matching the initial speech recognition text with a preset first text database to obtain a first matching result;
    matching the first matching result with a preset second text database to obtain a second matching result;
    matching the second matching result with a preset third text database;
    wherein the preset first text database stores a plurality of modal words, the preset second text database stores a plurality of professional words and their corresponding pinyin, and the preset third text database stores a plurality of taboo and sensitive words.
  11. The electronic device according to claim 10, characterized in that the matching of the initial speech recognition text with the preset first text database comprises:
    determining whether the initial speech recognition text contains a first word matching a word in the preset first text database;
    when it is determined that the initial speech recognition text contains a first word matching a word in the preset first text database, determining, according to a pre-trained modal-word model based on a deep learning network, whether the matched first word is a modal word to be deleted;
    when it is determined that the matched first word is a modal word to be deleted, removing the matched first word from the initial speech recognition text;
    when it is determined that the matched first word is not a modal word to be deleted, retaining the matched first word in the initial speech recognition text.
  12. The electronic device according to claim 10, characterized in that the matching of the first matching result with the preset second text database comprises:
    converting a word in the first matching result into a first pinyin;
    determining whether the preset second text database contains a second pinyin identical to the first pinyin;
    when it is determined that the preset second text database contains a second pinyin identical to the first pinyin, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
  13. The electronic device according to claim 10, characterized in that the matching of the second matching result with the preset third text database comprises:
    determining whether the second matching result contains a third word matching a word in the preset third text database;
    when it is determined that the second matching result contains a third word matching a word in the preset third text database, removing the matched third word from the second matching result.
  14. The electronic device according to claim 9, characterized in that the generating of a speech recognition text having an uneditable state according to the speech recognition text after the editing operation comprises:
    when the received editing operation is a confirmation operation, directly generating a speech recognition text having an uneditable state;
    when the received editing operation is a modification operation, receiving the user's manual modification and saving the new modified content, and, when a confirmation operation is received again, generating a speech recognition text having an uneditable state.
  15. The electronic device according to any one of claims 9 to 14, characterized in that the processor is further configured to execute the at least one instruction to implement the following steps:
    storing the original word at a modified position in association with the new word modified by the user;
    in subsequent speech recognition, converting the conference speech to be recognized into text according to the modified new word.
  16. A non-volatile readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by a processor, the following steps are implemented:
    converting the conference speech to be recognized into text by speech recognition technology, as an initial speech recognition text;
    matching the initial speech recognition text with a preset text database to obtain a matched speech recognition text;
    generating a speech recognition text draft having an editable state according to the matched speech recognition text;
    when it is detected that an editing operation has been received on the speech recognition text draft, generating a speech recognition text having an uneditable state according to the speech recognition text after the editing operation, as the final speech recognition text.
  17. The storage medium according to claim 16, characterized in that the matching of the initial speech recognition text with the preset text database comprises:
    matching the initial speech recognition text with a preset first text database to obtain a first matching result;
    matching the first matching result with a preset second text database to obtain a second matching result;
    matching the second matching result with a preset third text database;
    wherein the preset first text database stores a plurality of modal words, the preset second text database stores a plurality of professional words and their corresponding pinyin, and the preset third text database stores a plurality of taboo and sensitive words.
  18. The storage medium according to claim 17, characterized in that the matching of the initial speech recognition text with the preset first text database comprises:
    determining whether the initial speech recognition text contains a first word matching a word in the preset first text database;
    when it is determined that the initial speech recognition text contains a first word matching a word in the preset first text database, determining, according to a pre-trained modal-word model based on a deep learning network, whether the matched first word is a modal word to be deleted;
    when it is determined that the matched first word is a modal word to be deleted, removing the matched first word from the initial speech recognition text;
    when it is determined that the matched first word is not a modal word to be deleted, retaining the matched first word in the initial speech recognition text.
  19. The storage medium according to claim 17, characterized in that the matching of the first matching result with the preset second text database comprises:
    converting a word in the first matching result into a first pinyin;
    determining whether the preset second text database contains a second pinyin identical to the first pinyin;
    when it is determined that the preset second text database contains a second pinyin identical to the first pinyin, extracting the word corresponding to the second pinyin as the word corresponding to the first pinyin.
  20. The storage medium according to claim 17, characterized in that the matching of the second matching result with the preset third text database comprises:
    determining whether the second matching result contains a third word matching a word in the preset third text database;
    when it is determined that the second matching result contains a third word matching a word in the preset third text database, removing the matched third word from the second matching result.
PCT/CN2018/108113 2018-06-07 2018-09-27 Method for recognizing conference speech as text, electronic device and storage medium WO2019232991A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810581922.4A CN108847241B (zh) 2018-06-07 2018-06-07 Method for recognizing conference speech as text, electronic device and storage medium
CN201810581922.4 2018-06-07

Publications (1)

Publication Number Publication Date
WO2019232991A1 (zh)

Family

ID=64211364

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108113 WO2019232991A1 (zh) 2018-06-07 2018-09-27 Method for recognizing conference speech as text, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN108847241B (zh)
WO (1) WO2019232991A1 (zh)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109708256B * 2018-12-06 2020-07-03 珠海格力电器股份有限公司 Voice determination method and apparatus, storage medium, and air conditioner
CN111368506B * 2018-12-24 2023-04-28 阿里巴巴集团控股有限公司 Text processing method and apparatus
CN110335509A * 2019-07-09 2019-10-15 南阳理工学院 Primary school teaching demonstration apparatus
CN110459224B * 2019-07-31 2022-02-25 北京百度网讯科技有限公司 Speech recognition result processing method and apparatus, computer device, and storage medium
CN110619879A * 2019-08-29 2019-12-27 深圳市梦网科技发展有限公司 Speech recognition method and apparatus
CN110969026A * 2019-11-27 2020-04-07 北京欧珀通信有限公司 Translation output method and apparatus, electronic device, and storage medium
US11303464B2 * 2019-12-05 2022-04-12 Microsoft Technology Licensing, Llc Associating content items with images captured of meeting content
CN111177353B * 2019-12-27 2023-06-09 赣州得辉达科技有限公司 Text record generation method and apparatus, computer device, and storage medium
CN111651960B * 2020-06-01 2023-05-30 杭州尚尚签网络科技有限公司 Joint optical character training and recognition method for migrating contracts from simplified to traditional Chinese
CN111710328B * 2020-06-16 2024-01-12 北京爱医声科技有限公司 Training sample selection method, apparatus, and medium for a speech recognition model
CN114155860A * 2020-08-18 2022-03-08 深圳市万普拉斯科技有限公司 Summary recording method and apparatus, computer device, and storage medium
CN112037792B * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN112468665A * 2020-11-05 2021-03-09 中国建设银行股份有限公司 Meeting minutes generation method, apparatus, device, and storage medium
CN112951210A * 2021-02-02 2021-06-11 虫洞创新平台(深圳)有限公司 Speech recognition method and apparatus, device, and computer-readable storage medium
CN113360623A * 2021-06-25 2021-09-07 达闼机器人有限公司 Text matching method, electronic device, and readable storage medium
CN113470644B * 2021-06-29 2023-09-26 读书郎教育科技有限公司 Intelligent speech learning method and apparatus based on speech recognition
CN114822527A * 2021-10-11 2022-07-29 北京中电慧声科技有限公司 Speech-to-text error correction method and apparatus, electronic device, and storage medium
CN114120977A * 2021-11-23 2022-03-01 四川虹美智能科技有限公司 Self-learning method and apparatus for new words in speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9031839B2 (en) * 2010-12-01 2015-05-12 Cisco Technology, Inc. Conference transcription based on conference data
CN104679729B * 2015-02-13 2018-06-26 广州市讯飞樽鸿信息技术有限公司 Method and system for processing the validity of recorded voice messages
CN105206272A * 2015-09-06 2015-12-30 上海智臻智能网络科技股份有限公司 Voice transmission control method and system
CN105206274A * 2015-10-30 2015-12-30 北京奇艺世纪科技有限公司 Post-processing method and apparatus for speech recognition, and speech recognition system
CN107590121B * 2016-07-08 2020-09-11 科大讯飞股份有限公司 Text normalization method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060149558A1 (en) * 2001-07-17 2006-07-06 Jonathan Kahn Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
CN102956231A * 2011-08-23 2013-03-06 上海交通大学 Apparatus and method for recording key speech information based on semi-automatic correction
CN103902629A * 2012-12-28 2014-07-02 联想(北京)有限公司 Electronic device and method for providing operation help by voice
CN105976818A * 2016-04-26 2016-09-28 Tcl集团股份有限公司 Processing method and apparatus for instruction recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666618B (zh) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio review method and apparatus, device, and readable storage medium

Also Published As

Publication number Publication date
CN108847241A (zh) 2018-11-20
CN108847241B (zh) 2022-09-13

Similar Documents

Publication Publication Date Title
WO2019232991A1 (zh) Method for recognizing conference speech as text, electronic device and storage medium
WO2018205389A1 (zh) Speech recognition method and system, electronic apparatus, and medium
WO2019085779A1 (zh) Machine processing and text error correction method and apparatus, computing device, and storage medium
US9805718B2 (en) Clarifying natural language input using targeted questions
EP3405912A1 (en) Analyzing textual data
WO2020186712A1 (zh) Speech recognition method and apparatus, and terminal
WO2021179701A1 (zh) Multilingual speech recognition method and apparatus, and electronic device
JP2020030408A (ja) Method, apparatus, device, and medium for recognizing important phrases in audio
CN111310441A (zh) BERT-based method, apparatus, terminal, and medium for correcting text after speech recognition
WO2014048172A1 (en) Method and system for correcting text
US11908477B2 (en) Automatic extraction of conversation highlights
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
TW201606750A (zh) Speech recognition using foreign-word grammar
CN110852075B (zh) Speech transcription method and apparatus with automatic punctuation, and readable storage medium
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
KR20210060897A (ko) Speech processing method and apparatus
US20220068267A1 (en) Method and apparatus for recognizing speech, electronic device and storage medium
CN112100339A (zh) User intent recognition method and apparatus for intelligent voice robots, and electronic device
CN112036186A (zh) Corpus annotation method and apparatus, computer storage medium, and electronic device
US10403275B1 (en) Speech control for complex commands
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
US20220245340A1 (en) Electronic device for processing user's inquiry, and operation method of the electronic device
WO2023137903A1 (zh) Reply sentence determination method and apparatus based on rough semantics, and electronic device
CN110929749B (zh) Text recognition method and apparatus, medium, and electronic device
CN112395414B (zh) Text classification method, training method for a classification model, apparatus, medium, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18921313

Country of ref document: EP

Kind code of ref document: A1