CN113948065A - Method and system for screening error blocking words based on n-gram model

Method and system for screening error blocking words based on n-gram model

Info

Publication number
CN113948065A
Authority
CN
China
Prior art keywords
words
error
text data
gram model
interception
Prior art date
Legal status
Granted
Application number
CN202111020788.9A
Other languages
Chinese (zh)
Other versions
CN113948065B (en)
Inventor
冉小龙
唐会军
刘拴林
梁堃
陈建
Current Assignee
Beijing Nextdata Times Technology Co., Ltd.
Original Assignee
Beijing Nextdata Times Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Nextdata Times Technology Co., Ltd.
Priority to CN202111020788.9A
Publication of CN113948065A
Application granted
Publication of CN113948065B
Legal status: Active

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00: Handling natural language data
                    • G06F 40/20: Natural language analysis
                        • G06F 40/205: Parsing
                            • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
                            • G06F 40/216: Parsing using statistical methods
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063: Training
                    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
                    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for screening erroneously intercepted words based on an n-gram model, and relates to the technical field of network security. The method comprises the following steps: acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words; processing the text data through an n-gram model, and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label; and determining the sentences containing erroneously intercepted words according to the backoff information. The method is suitable for intercepting forbidden and sensitive words, in particular in audio-translation text data: wrongly intercepted sentences and wrongly intercepted words can be found quickly, and the forbidden-word bank can subsequently be refined and optimized according to the erroneously intercepted words that are found, so that both the interception accuracy of the corresponding interception words and the overall interception accuracy are improved.

Description

Method and system for screening error blocking words based on n-gram model
Technical Field
The invention relates to the technical field of network security, and in particular to a method and a system for screening erroneously intercepted words based on an n-gram model.
Background
Content on the Internet keeps growing and often contains illegal or non-compliant information, so it must be audited and filtered to maintain a safe Internet environment and to meet business requirements.
Currently, auditing is usually performed by setting up a forbidden-word bank and user-defined black/white word banks to intercept forbidden and sensitive words. However, such interception matches words in isolation and can hardly exploit the semantics of the context, so the interception accuracy is low; for speech-to-text data in particular, homophones, words with similar pronunciations, and dialects reduce the interception accuracy even further.
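As a concrete illustration of the baseline criticized here, the following sketch shows pure word-bank interception; the word-bank contents are invented for illustration and nothing in the sketch comes from the patent. The match is a context-blind substring test, which is exactly why a mistranscribed homophone of a banned word triggers a wrong interception.

    # Minimal sketch of baseline word-bank interception (illustrative only).
    FORBIDDEN_WORDS = {"bannedA", "bannedB"}  # assumed word-bank entries

    def intercept(text):
        """Return the forbidden words found in the text, ignoring all context."""
        return [w for w in FORBIDDEN_WORDS if w in text]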
Disclosure of Invention
In view of the defects of the prior art, the present invention provides a method and a system for screening erroneously intercepted words based on an n-gram model; by screening out the erroneously intercepted words, the interception accuracy of the corresponding interception words and the overall interception accuracy can be improved.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
A method for screening erroneously intercepted words based on an n-gram model comprises the following steps:
acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words;
processing the text data through an n-gram model, and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
and determining the sentences containing erroneously intercepted words according to the backoff information.
Another technical solution of the present invention for solving the above technical problem is as follows:
A system for screening erroneously intercepted words based on an n-gram model comprises:
an acquisition unit for acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words;
a processing unit for processing the text data through an n-gram model and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
and a screening unit for determining the sentences containing erroneously intercepted words according to the backoff information.
The invention has the following beneficial effects: the method and the system for screening erroneously intercepted words are suitable for intercepting forbidden and sensitive words, in particular in audio-translation text data. Backoff information is determined with an n-gram model, and the sentences containing erroneously intercepted words are determined from that information, so wrongly intercepted sentences and words can be found quickly; the forbidden-word bank can then be refined and optimized according to the erroneously intercepted words that are found, improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic flow chart provided by an embodiment of the method for screening erroneously intercepted words according to the present invention;
FIG. 2 is a diagram illustrating a perplexity (ppl) scoring result provided by an embodiment of the method for screening erroneously intercepted words according to the present invention;
FIG. 3 is a schematic structural framework diagram provided by an embodiment of the system for screening erroneously intercepted words according to the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the accompanying drawings; the examples given are intended only to illustrate the invention and are not to be construed as limiting its scope.
As shown in FIG. 1, a schematic flow chart is provided by an embodiment of the method for screening erroneously intercepted words according to the present invention. The method is implemented on the basis of an n-gram model and comprises:
s1, acquiring audio translation text data intercepted based on the interception word under the specific label;
it should be noted that the specific tag type may be set according to actual service requirements, for example, the tags may be simply divided into 3 categories, which are an a-field sensitive tag, a B-field sensitive tag, and a normal tag, and the interception word of the tag of each category may be set according to actual requirements, for example, the interception word of the a-field sensitive tag may be: a1, A2 and A3, wherein A1, A2 and A3 are words to be intercepted in the A field respectively.
For audio-translation text data, interception errors may occur. For example, a harmless word meaning "reading/studying" may be nearly homophonous with "gambling"; assuming "gambling" is an interception word under a certain label, if the harmless word occurs in the audio but is mistranscribed as "gambling", the translated text data is wrongly intercepted, which hurts the interception accuracy.
Specifically, a person skilled in the art may translate the audio into text data through an acoustic model; the specific acoustic model may be selected according to implementation requirements and is not described here again.
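As a sketch only (the patent does not prescribe a data format), the records gathered in step S1 might tie each interception hit to its label, the interception word that fired, and the ASR transcript; all field names below are assumptions:

    # Hypothetical record layout for step S1; every name here is illustrative.
    from dataclasses import dataclass

    @dataclass
    class InterceptedRecord:
        label: str       # the specific label, e.g. the A-field sensitive label
        word: str        # the interception word that triggered the hit, e.g. "A1"
        transcript: str  # the audio-translation (ASR) text that was intercepted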
S2, processing the text data through an n-gram model, and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
it should be noted that the n-gram model is a probability model for predicting that the current word is only related to the first n-1 words. The basic idea is to perform a sliding window operation of size n on the content in the text according to bytes, and form a byte fragment sequence with length n.
Each byte segment is called as a gram, the occurrence frequency of all the grams is counted, and filtering is performed according to a preset threshold value to form a key gram list, namely a vector feature space of the text, wherein each gram in the list is a feature vector dimension.
The model is based on the assumption that the occurrence of the nth word is only related to the first n-1 words and not to any other words, and that the probability of a complete sentence is the product of the probabilities of occurrence of the words. These probabilities can be obtained by counting the number of times n words occur simultaneously directly from the corpus.
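The following minimal sketch (illustrative only; it works on tokens rather than bytes for readability, and omits the sentence-boundary padding a real model would add) shows the sliding-window counting and the product-of-conditionals sentence probability just described:

    # Illustrative n-gram counting and sentence scoring from maximum-likelihood
    # estimates; a production model would add smoothing and backoff.
    from collections import Counter

    def ngrams(tokens, n):
        """Slide a window of size n over a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def train_counts(corpus, n):
        """Count n-grams and their (n-1)-gram histories over a tokenized corpus."""
        full, hist = Counter(), Counter()
        for sent in corpus:
            full.update(ngrams(sent, n))
            hist.update(ngrams(sent, n - 1))
        return full, hist

    def sentence_prob(tokens, full, hist, n):
        """P(sentence) ~= product of P(w_i | preceding n-1 words)."""
        p = 1.0
        for gram in ngrams(tokens, n):
            h = gram[:-1]
            if hist[h] == 0:      # unseen history: a real model would back off
                return 0.0
            p *= full[gram] / hist[h]
        return p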
For example, suppose that for a label concerning laws and regulations, text can be intercepted on the basis of the label's specified interception words. Through step S1 we obtain an intercepted example sentence transcribed by ASR, in this case a sentence containing the word "criminals"; the sentence is then preprocessed (word-segmented) and scored for perplexity (ppl) with a 4-gram language model, with the result shown in FIG. 2.
In FIG. 2, each row shows how the probability of one word is computed. Taking p(i | ...) as an example, the probability computed for the word "i" is 0.0452354; since this is a 4-gram model, it depends only on the preceding 3 words.
The bracketed column [xgram] after the equals sign indicates which n-gram order was actually used when the word's probability was computed. If it shows 1gram, there is no corresponding sentence or phrase in the model's corpus, and the word's score is pieced together from bare unigram probability. When the training data of the n-gram language model contains plenty of the specific label's interception words in context, a fall-back to 1gram therefore signals that the interception of that word is unreliable, reducing the interception accuracy under that label; aiming at this phenomenon, the present invention provides the following data screening scheme to optimize the accuracy of the label.
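The backoff column can be read programmatically. The sketch below uses the KenLM toolkit as one possible n-gram implementation (the patent names no toolkit); the model file name is an assumption, and the input is expected to be preprocessed exactly as the model's training data was. KenLM's full_scores() yields, per word, the tuple (log10 probability, n-gram order used, is-OOV); the order value is the "[xgram]" column discussed above.

    # Sketch of extracting backoff information with KenLM (assumed toolkit).
    import kenlm

    model = kenlm.Model("label_legal.arpa")  # hypothetical 4-gram model file

    def backoff_words(sentence):
        """Return the words whose probability fell back to a 1-gram score."""
        words = sentence.split() + ["</s>"]  # full_scores also scores </s>
        hits = []
        for word, (logprob, order, oov) in zip(words, model.full_scores(sentence)):
            if order == 1:  # no matching phrase in the label corpus: backoff info
                hits.append(word)
        return hits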
And S3, determining the sentences containing erroneously intercepted words according to the backoff information.
For example, the text data can be screened with the backoff information to obtain the sentences that contain erroneously intercepted words.
The method and the system for screening erroneously intercepted words are suitable for intercepting forbidden and sensitive words, in particular in audio-translation text data. Backoff information is determined with the n-gram model, and the sentences containing erroneously intercepted words are determined from that information, so wrongly intercepted sentences and words can be found quickly; the forbidden-word bank can then be refined and optimized according to the erroneously intercepted words that are found, improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Optionally, in some possible embodiments, processing the text data through the n-gram model and screening out, as backoff information, the data that is not found in the corpus of the specific label specifically comprises:
preprocessing the text data;
performing perplexity (ppl) scoring on the preprocessed text data through the n-gram model;
according to the result of the ppl scoring, taking the data whose score fell back to a 1-gram as the backoff information;
wherein the preprocessing is the same as the processing applied to the training data when the n-gram model was trained.
It should be understood that if the order actually used is 1gram, as shown in FIG. 2, the word "criminals" has no corresponding sentence or phrase in the corpus of the legal label, and its occurrence there is a pure matter of unigram probability. Therefore, using the interception word corresponding to each sentence, the sentences whose interception word fell back to 1gram can be screened out, thereby optimizing the interception accuracy under the legal label.
It should be noted that, for the n-gram model to score text data accurately, the input is usually preprocessed before being fed to the model; for the example sentence above, the sentence must first be split into words (segmented). When processing data, the input text data therefore has to go through exactly the same preprocessing as the training data did.
Preprocessing the text data in this way improves the processing efficiency and the accuracy of the n-gram model.
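As a sketch, assuming (the patent does not say) that the training corpus was segmented with the jieba tokenizer and lowercased, the scoring-time input would be pushed through the identical pipeline:

    # Sketch of the preprocessing-consistency requirement; the jieba tokenizer
    # and lowercasing are assumptions, not part of the patent.
    import jieba

    def preprocess(text):
        """Apply the same segmentation used for the LM training data."""
        return " ".join(jieba.cut(text.strip().lower()))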
Optionally, in some possible embodiments, determining the sentences containing erroneously intercepted words according to the backoff information specifically comprises:
using the interception word corresponding to each sentence in the text data, screening out the sentences whose interception word fell back to a 1-gram.
Optionally, in some possible embodiments, the method further comprises:
labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data.
By labeling these sentences and using them in acoustic training, subsequent models can translate such sentences more accurately.
Optionally, in some possible embodiments, labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data specifically comprises:
correcting the screened sentences containing erroneously intercepted words so that their text matches the actual content of the audio;
and training an acoustic model with the labeled sentences.
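One plausible shape for this step, with an invented JSONL manifest format (the patent specifies neither a file format nor a training framework), is:

    # Sketch of assembling corrected acoustic-training data; the manifest
    # format and field names are assumptions made for illustration.
    import json

    def write_training_manifest(corrected, path="acoustic_train.jsonl"):
        """corrected: iterable of (audio_path, corrected_transcript) pairs,
        where each transcript has been fixed to match what the audio says."""
        with open(path, "w", encoding="utf-8") as f:
            for audio_path, text in corrected:
                f.write(json.dumps({"audio": audio_path, "text": text},
                                   ensure_ascii=False) + "\n")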
It is to be understood that some or all of the embodiments described above may be combined in certain embodiments.
As shown in FIG. 3, a schematic structural framework diagram is provided by an embodiment of the system for screening erroneously intercepted words according to the present invention. The system is implemented on the basis of an n-gram model and comprises:
an obtaining unit 10 for acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words;
a processing unit 20 for processing the text data through an n-gram model and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
and a screening unit 30 for determining the sentences containing erroneously intercepted words according to the backoff information.
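A minimal sketch of how the three units of FIG. 3 could map onto code, reusing the earlier hypothetical helpers; the class layout is an assumption, not the patent's design:

    # Illustrative mapping of FIG. 3's units onto a class; all names assumed.
    class ErrorInterceptionScreeningSystem:
        def __init__(self, score_backoffs, preprocess):
            self.score_backoffs = score_backoffs  # e.g. backoff_words() above
            self.preprocess = preprocess          # e.g. preprocess() above

        def acquire(self, source):
            # Obtaining unit: intercepted ASR records under a specific label.
            return list(source)

        def process(self, records):
            # Processing unit: attach backoff information to every record.
            return [(r, self.score_backoffs(self.preprocess(r.transcript)))
                    for r in records]

        def screen(self, scored):
            # Screening unit: keep records whose interception word backed off.
            return [r for r, backs in scored if r.word in backs]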
The method and the system for screening erroneously intercepted words are suitable for intercepting forbidden and sensitive words, in particular in audio-translation text data. Backoff information is determined with the n-gram model, and the sentences containing erroneously intercepted words are determined from that information, so wrongly intercepted sentences and words can be found quickly; the forbidden-word bank can then be refined and optimized according to the erroneously intercepted words that are found, improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Optionally, in some possible embodiments, the processing unit 20 is specifically configured to preprocess the text data;
perform perplexity (ppl) scoring on the preprocessed text data through the n-gram model;
and, according to the result of the ppl scoring, take the data whose score fell back to a 1-gram as the backoff information;
wherein the preprocessing is the same as the processing applied to the training data when the n-gram model was trained.
Optionally, in some possible embodiments, the screening unit 30 is specifically configured to screen out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word fell back to a 1-gram.
Optionally, in some possible embodiments, the system further comprises:
a training unit for labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data.
Optionally, in some possible embodiments, the training unit is specifically configured to correct the screened sentences containing erroneously intercepted words so that their text matches the actual content of the audio;
and to train an acoustic model with the labeled sentences.
It is to be understood that some or all of the embodiments described above may be combined in certain embodiments.
It should be noted that the above embodiments are product embodiments corresponding to the method embodiments above; for their description, reference may be made to the corresponding descriptions in the method embodiments, which are not repeated here.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the method embodiments described above are merely illustrative: the division into steps is only a logical functional division, and in actual implementation there may be other divisions; for example, multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.
The above method, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for screening erroneously intercepted words based on an n-gram model, characterized by comprising the following steps:
acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words;
processing the text data through an n-gram model, and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
and determining the sentences containing erroneously intercepted words according to the backoff information.
2. The method for screening erroneously intercepted words based on an n-gram model according to claim 1, characterized in that processing the text data through the n-gram model and screening out, as backoff information, the data that is not found in the corpus of the specific label specifically comprises:
preprocessing the text data;
performing perplexity (ppl) scoring on the preprocessed text data through the n-gram model;
according to the result of the ppl scoring, taking the data whose score fell back to a 1-gram as the backoff information;
wherein the preprocessing is the same as the processing applied to the training data when the n-gram model was trained.
3. The method for screening erroneously intercepted words based on an n-gram model according to claim 2, characterized in that determining the sentences containing erroneously intercepted words according to the backoff information specifically comprises:
using the interception word corresponding to each sentence in the text data, screening out the sentences whose interception word fell back to a 1-gram.
4. The method for screening erroneously intercepted words based on an n-gram model according to any one of claims 1 to 3, characterized by further comprising:
labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data.
5. The method for screening erroneously intercepted words based on an n-gram model according to claim 4, characterized in that labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data specifically comprises:
correcting the screened sentences containing erroneously intercepted words so that their text matches the actual content of the audio;
and training an acoustic model with the labeled sentences.
6. A system for screening erroneously intercepted words based on an n-gram model, characterized by comprising:
an acquisition unit for acquiring audio-translation text data intercepted, under a specific label, on the basis of interception words;
a processing unit for processing the text data through an n-gram model and screening out from the text data, as backoff information, the data that is not found in the corpus of the specific label;
and a screening unit for determining the sentences containing erroneously intercepted words according to the backoff information.
7. The system for screening erroneously intercepted words based on an n-gram model according to claim 6, characterized in that the processing unit is specifically configured to preprocess the text data;
perform perplexity (ppl) scoring on the preprocessed text data through the n-gram model;
and, according to the result of the ppl scoring, take the data whose score fell back to a 1-gram as the backoff information;
wherein the preprocessing is the same as the processing applied to the training data when the n-gram model was trained.
8. The system for screening erroneously intercepted words based on an n-gram model according to claim 7, characterized in that the screening unit is specifically configured to screen out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word fell back to a 1-gram.
9. The system for screening erroneously intercepted words based on an n-gram model according to any one of claims 6 to 8, characterized by further comprising:
a training unit for labeling the screened sentences containing erroneously intercepted words and adding them to the acoustic training data.
10. The system for screening erroneously intercepted words based on an n-gram model according to claim 9, characterized in that the training unit is specifically configured to correct the screened sentences containing erroneously intercepted words so that their text matches the actual content of the audio;
and to train an acoustic model with the labeled sentences.
CN202111020788.9A (priority date 2021-09-01, filing date 2021-09-01): Method and system for screening error blocking words based on n-gram model. Granted as CN113948065B; status: Active.

Priority Applications (1)

Application CN202111020788.9A (granted as CN113948065B), priority date 2021-09-01, filing date 2021-09-01: Method and system for screening error blocking words based on n-gram model

Applications Claiming Priority (1)

Application CN202111020788.9A (granted as CN113948065B), priority date 2021-09-01, filing date 2021-09-01: Method and system for screening error blocking words based on n-gram model

Publications (2)

Publication Number Publication Date
CN113948065A (application publication): 2022-01-18
CN113948065B (granted patent): 2022-07-08

Family

Family ID: 79327642

Family Applications (1)

Application CN202111020788.9A (Active), priority date 2021-09-01, filing date 2021-09-01: Method and system for screening error blocking words based on n-gram model

Country Status (1)

Country Link
CN (1) CN113948065B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition
CN107204184A (en) * 2017-05-10 2017-09-26 平安科技(深圳)有限公司 Audio recognition method and system
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction
CN110600011A (en) * 2018-06-12 2019-12-20 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN109151218A (en) * 2018-08-21 2019-01-04 平安科技(深圳)有限公司 Call voice quality detecting method, device, computer equipment and storage medium
CN110134952A (en) * 2019-04-29 2019-08-16 华南师范大学 A kind of Error Text rejection method for identifying, device and storage medium
CN110442870A (en) * 2019-08-02 2019-11-12 深圳市珍爱捷云信息技术有限公司 Text error correction method, device, computer equipment and storage medium
CN112447172A (en) * 2019-08-12 2021-03-05 云号(北京)科技有限公司 Method and device for improving quality of voice recognition text
CN111312209A (en) * 2020-02-21 2020-06-19 北京声智科技有限公司 Text-to-speech conversion processing method and device and electronic equipment
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field
CN111326144A (en) * 2020-02-28 2020-06-23 网易(杭州)网络有限公司 Voice data processing method, device, medium and computing equipment
CN112489655A (en) * 2020-11-18 2021-03-12 元梦人文智能国际有限公司 Method, system and storage medium for correcting error of speech recognition text in specific field
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WU FAN: "Research on the Problem of Chinese Word Segmentation in Information Retrieval", Journal of Intelligence (《情报杂志》) *
ZHANG JUNQI: "Research on Domain-Oriented Text Error Correction after Speech Transcription", China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *

Also Published As

Publication number Publication date
CN113948065B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN107247707B (en) Enterprise association relation information extraction method and device based on completion strategy
US8538743B2 (en) Disambiguating text that is to be converted to speech using configurable lexeme based rules
US7574349B2 (en) Statistical language-model based system for detection of missing attachments
US8650187B2 (en) Systems and methods for linked event detection
CN112287684B (en) Short text auditing method and device for fusion variant word recognition
US11386269B2 (en) Fault-tolerant information extraction
US20130158983A1 (en) System and Method for Identifying Phrases in Text
EP1627325B1 (en) Automatic segmentation of texts comprising chunks without separators
US8326809B2 (en) Systems and methods for defining and processing text segmentation rules
US8880391B2 (en) Natural language processing apparatus, natural language processing method, natural language processing program, and computer-readable recording medium storing natural language processing program
US10120843B2 (en) Generation of parsable data for deep parsing
WO2022256144A1 (en) Application-specific optical character recognition customization
CN111062208A (en) File auditing method, device, equipment and storage medium
CN113948065B (en) Method and system for screening error blocking words based on n-gram model
CN112699671A (en) Language marking method and device, computer equipment and storage medium
WO2008131509A1 (en) Systems and methods for improving translation systems
US12008305B2 (en) Learning device, extraction device, and learning method for tagging description portions in a document
Olinsky et al. Non-standard word and homograph resolution for Asian language text analysis.
Yasin et al. Transformer-Based Neural Machine Translation for Post-OCR Error Correction in Cursive Text
JP5795302B2 (en) Morphological analyzer, method, and program
CN111950289A (en) Data processing method and device based on automobile maintenance record
WO2007041328A1 (en) Detecting segmentation errors in an annotated corpus
CN110232189B (en) Semantic analysis method, device, equipment and storage medium
Glocker et al. Hierarchical Multi-task Learning with Articulatory Attributes for Cross-Lingual Phoneme Recognition
Coats Noisy Data

Legal Events

Code and description:
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant