CN112151019A - Text processing method and device and computing equipment - Google Patents


Info

Publication number
CN112151019A
Authority
CN
China
Prior art keywords
spoken
text
target
word
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910561669.0A
Other languages
Chinese (zh)
Inventor
周鑫
张雅婷
孙常龙
张琼
司罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910561669.0A priority Critical patent/CN112151019A/en
Publication of CN112151019A publication Critical patent/CN112151019A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/18: Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a text processing method and apparatus, and a computing device. The method includes: acquiring a target spoken text; recognizing the spoken words in the target spoken text using a spoken-language recognition model; and eliminating the spoken words from the target spoken text to obtain a canonical text. The technical solution provided by the embodiments of the present application achieves the purpose of eliminating spoken-language errors from spoken text.

Description

Text processing method and device and computing equipment
Technical Field
The embodiment of the application relates to the technical field of computer application, in particular to a text processing method and device and computing equipment.
Background
Spoken language is used on informal occasions (e.g., daily conversation, informal speeches, informal writing) and is characterized as informal and flexible. Written language is used on formal occasions (e.g., formal speeches, formal document writing) and is characterized as normative and concise. Because spoken language lacks the conciseness and standardization of written language, it is not conducive to propagation and communication on some occasions.
However, spoken text is sometimes unavoidable. For example, when speech is converted into text using speech recognition technology, the recognizer faithfully converts the input speech into the corresponding text; if the input speech is spoken-style speech, the recognition result is likewise a spoken text.
However, compared with written text, spoken text inevitably contains some spoken-language errors, so how to eliminate the spoken-language errors in a spoken text in order to standardize it is an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a text processing method and device and computing equipment.
In a first aspect, an embodiment of the present application provides a text processing method, including:
acquiring a target spoken language text;
recognizing the spoken words in the target spoken text by using a spoken recognition model;
the spoken word is eliminated from the target spoken text to obtain a canonical text.
In a second aspect, an embodiment of the present application provides a text processing apparatus, including:
the text acquisition module is used for acquiring a target spoken language text;
the spoken language error recognition module is used for recognizing spoken words in the target spoken language text by using a spoken language recognition model;
a spoken language error elimination module for eliminating the spoken words from the target spoken language text to obtain a canonical text.
In a third aspect, a computing device is provided in an embodiment of the present application, comprising a processing component and a storage component;
the storage component stores one or more computer instructions; the one or more computer instructions to be invoked for execution by the processing component;
the processing component is to:
acquiring a target spoken language text;
recognizing the spoken words in the target spoken text by using a spoken recognition model;
the spoken word is eliminated from the target spoken text to obtain a canonical text.
In the embodiments of the present application, a spoken-language recognition model can be trained in advance on spoken training texts with annotated spoken training words. For a target spoken text, the model can then be used to recognize the spoken words in it, and those spoken words can be eliminated from the target spoken text to obtain a canonical text, i.e., a written-language text, achieving the purpose of eliminating spoken-language errors from spoken text.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram illustrating one embodiment of a text processing method provided herein;
FIG. 2 is a flow diagram illustrating yet another embodiment of a method of text processing provided herein;
FIG. 3 is a flow diagram illustrating yet another embodiment of a method of text processing provided herein;
FIG. 4 is a flow diagram illustrating yet another embodiment of a method of text processing provided herein;
FIG. 5 is a flow diagram illustrating yet another embodiment of a method of text processing provided herein;
FIG. 6 is a schematic structural diagram of an embodiment of a text processing apparatus provided in the present application;
FIG. 7 is a block diagram illustrating one embodiment of a computing device provided herein.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In some of the flows described in the specification, claims, and drawings of this application, a number of operations appear in a particular order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Operation numbers such as 101 and 102 are merely used to distinguish different operations and do not by themselves represent any execution order. Additionally, the flows may include more or fewer operations, and those operations may be executed sequentially or in parallel. It should be noted that the descriptions "first", "second", etc. herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequential order, nor do they require that "first" and "second" be of different types.
The technical scheme of the embodiment of the application can be suitable for any requirement scene of converting the spoken language text into the written language text.
Language is a tool people use to communicate, and it generally has two different forms of expression: spoken language and written language. Spoken language is used for oral communication on informal occasions, while written language is used on formal occasions, particularly in document-writing scenarios. In practical applications, however, spoken text is inevitably received. For example, when speech is converted into text using speech recognition technology, the recognizer faithfully converts the input speech into the corresponding text; if the input speech is spoken-style speech, the recognition result is likewise a spoken text. Because a spoken text is in spoken language, it is not suitable for further propagation, and it is necessary to eliminate its spoken-language errors and convert it into a canonical text in written language.
For example, in a court trial scenario, the court clerk needs to record the trial in written language. To reduce the clerk's workload, speech recognition devices have entered the courtroom: trial speech can be automatically converted into trial transcript text by the device. However, spoken words inevitably appear in that transcript, while the clerk's trial record must be in written language, so there is a need to convert the trial transcript text into written-language text.
Currently, no effective and accurate way is available to eliminate spoken-language errors in spoken text.
To realize the conversion of spoken text, the inventors arrived at the technical solution of the present application through a series of studies. In the embodiments of the present application, for a target spoken text, a spoken-language recognition model is used to recognize the spoken words in it; those spoken words can then be eliminated from the target spoken text, converting it into a canonical text in written language. Because spoken-language errors in spoken text are usually caused by certain spoken words, the technical solution of the embodiments can recognize and eliminate the spoken words in a spoken text, achieving the purpose of eliminating spoken-language errors and enabling the spoken text to be converted into written-language text.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of an embodiment of a text processing method provided in an embodiment of the present application, where the method may include the following steps:
101: and acquiring a target spoken language text.
The target spoken text can be converted by a speech recognition technology; thus, optionally, the obtaining the target spoken text may be:
and acquiring a target spoken language text from the voice recognition result.
The target spoken text can be any one of the to-be-processed spoken texts in the speech recognition result. The speech recognition result can be divided into multiple texts, for example according to the different speaker roles in the result, or by sequentially taking at least one sentence, delimited by the recognizer's punctuation, as a to-be-processed spoken text.
102: and recognizing the spoken words in the target spoken text by using a spoken recognition model.
The spoken language identification model can be obtained based on the training of a spoken language training text of a marked spoken language training word. The spoken language identification model may be implemented using a neural network model. The specific training mode of the spoken language identification model will be described in detail below.
A spoken text cannot be used directly as written text because it contains spoken-language errors, and these errors are mainly caused by the spoken words appearing in the text. Therefore, in this embodiment, the spoken words in the spoken text are first recognized using the spoken-language recognition model.
One or more spoken words may be included in the target spoken text, and the spoken types of different spoken words may be different. The spoken language type may include a mood word, a spoken statement word, a pause word, a repeated word, a corrected word, and the like.
103: removing the spoken words from the target spoken text to obtain a canonical text;
all the spoken words in the target spoken language text are eliminated, so that the standard text in the form of written language can be obtained, and the text conversion is realized.
In this embodiment, the spoken words in the target spoken language text can be recognized by using the spoken language recognition model obtained by pre-training, so that the spoken words are eliminated from the target spoken language text to normalize the target spoken language text and eliminate spoken language errors, and the spoken language text after the spoken language errors are eliminated can be used as a written language text.
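As an illustrative sketch of steps 101 to 103 (the function names and the span-based model interface here are hypothetical, not part of the application), the flow can be outlined as:

```python
import re

def normalize_spoken_text(target_text, recognize_spoken_words):
    """Steps 101-103: acquire text, recognize spoken words, eliminate them.

    `recognize_spoken_words` stands in for the trained spoken-language
    recognition model; it returns (start, end) character spans
    (a hypothetical interface, for illustration only).
    """
    spans = recognize_spoken_words(target_text)        # step 102
    # Delete spans back to front so earlier offsets stay valid (step 103)
    for start, end in sorted(spans, reverse=True):
        target_text = target_text[:start] + target_text[end:]
    return target_text                                  # canonical text

# Toy stand-in "model" that flags the filler word "um"
def toy_model(text):
    return [(m.start(), m.end()) for m in re.finditer(r"\bum,? ?", text)]

print(normalize_spoken_text("I um, I went home", toy_model))
```

In the actual scheme the spans would come from the trained recognition model rather than a regular expression; the deletion logic is the same.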
As noted above, common spoken-word types may include mood words, spoken statement words, pause words, repeated words, corrected words, and the like, which are described in turn as follows:
the term "tone word" refers to a word appearing at the end of a sentence (a statement sentence, a quiz sentence, a question sentence, etc.) and expressing the emotion and emotion of a speaker, and common tone words include: hiccup, o, Yi (Chinese character) Zi. For example, the spoken text "the house and mouth are also taken outO"middle" is the word of qi.
A spoken statement word is one that can be deleted while the sentence remains fluent and its meaning is fully preserved. Common spoken statement words include "this", "then", and "it should be said". For example, in "it should be said that, according to the statement at that time, ...", "it should be said" is a spoken statement word.
A pause word is inserted when a speaker needs to pause and think mid-sentence; in multi-person conversation it also reserves the speaker's right to continue. Common pause words include "this" and "then". For example, in "for almost half a year, this, the couple in love have entered the stage of discussing marriage", "this" is a pause word.
Repeated words are words said twice, usually in the pattern "ABCABC" or "ABC {pause word} ABC"; that is, "ABC" is repeated, possibly with a pause word between the two occurrences. For example, in "on September ten, ten, of that year I discussed the matter with him", the first "ten" is a repeated word.
A corrected word appears in the pattern "ABCXYZ" or "ABC {pause word} XYZ", where the expression "XYZ" corrects "ABC"; "ABC" is the corrected word. A pause word may appear between "ABC" and "XYZ", and "XYZ" may partially coincide with "ABC". For example, in "hopefully both parties, the man and the woman, can see that the contradiction between them is not irreconcilable", "both parties" is the corrected word, corrected by "the man and the woman".
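The "ABCABC" and "ABC {pause word} ABC" patterns for repeated words can be sketched as a simple token-level scan (a hypothetical illustration; the pause-word list and tokenization are assumed, and the actual application recognizes such words with the trained model rather than by rule):

```python
# Assumed pause-word list, for illustration only
PAUSE_WORDS = {"uh", "this"}

def find_repeated_words(tokens):
    """Return indices of repeated words: a token repeated immediately
    (ABCABC) or with a pause word in between (ABC {pause word} ABC).
    The first occurrence is the one to eliminate."""
    repeats = []
    for i, tok in enumerate(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == tok:
            repeats.append(i)                       # ABC ABC
        elif (i + 2 < len(tokens) and tokens[i + 1] in PAUSE_WORDS
              and tokens[i + 2] == tok):
            repeats.append(i)                       # ABC {pause word} ABC
    return repeats

print(find_repeated_words(["on", "ten", "uh", "ten", "we", "we", "met"]))
```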
The spoken words may be eliminated from the target spoken text according to their spoken types.
The spoken language identification model may also identify spoken types of the spoken words.
As an optional way, the spoken language identification model may be obtained by pre-training in the following way:
obtaining a spoken language training text;
determining at least one spoken training word in the spoken training text and a spoken type of the at least one spoken training word;
and training a spoken language recognition model by using the spoken language training text and the spoken language type of the at least one spoken language training word.
Optionally, the spoken language training text may be subjected to word segmentation to obtain a word sequence; then, setting a label representing the spoken language type of each word as a training label for each word according to the spoken language type of each word; if a word does not belong to the spoken training word, a label indicating that the word does not belong to the spoken training word can be set as a training label;
therefore, word sequences obtained by the spoken language training texts can be used as model input, and training labels corresponding to different words are used as model output to train the spoken language recognition model. Alternatively, each word may be converted into a word vector input model.
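A minimal sketch of this word-level labeling step (the label names and the `spoken_words` dictionary are assumed for illustration; in practice the annotations come from the labeled training corpus):

```python
def label_word_sequence(words, spoken_words):
    """Pair each segmented word with a training label: its spoken type if it
    is a spoken training word, otherwise "O" (not a spoken word).

    `spoken_words` maps a word to its spoken type, e.g. {"well": "mood"}.
    """
    return [(w, spoken_words.get(w, "O")) for w in words]

pairs = label_word_sequence(
    ["well", "I", "went", "home"],
    {"well": "mood"},
)
print(pairs)  # [('well', 'mood'), ('I', 'O'), ('went', 'O'), ('home', 'O')]
```

The word sequence then serves as model input and the label sequence as model output during training, with each word first converted into a word vector.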
Thus, identifying spoken words in the target spoken text using the spoken recognition model may include:
performing word segmentation on the target spoken language text;
identifying target labels of all words in the target spoken language text by using the spoken language identification model;
based on the target tags of the words, the spoken words and the spoken types thereof are determined.
Alternatively, the spoken language identification model may be obtained by pre-training as follows:
obtaining a spoken language training text;
determining at least one spoken training word in the spoken training text and a spoken type of the at least one spoken training word;
aiming at each single character in the spoken language training text, setting a training label of each single character according to the spoken language type of the spoken language training word formed by the single character and the starting character, the middle character, the ending character, the single character or the composition character which does not belong to any spoken language type in the spoken language training word formed by the single character;
and training a spoken language recognition model by using the spoken language training text and the training labels of the individual characters of the spoken language training text.
That is, the spoken training text may be labeled in units of words.
A single character is the start character, middle character, end character, or single character of the spoken training word it forms, as follows: for a three-character spoken training word, the first character is the start character, the second the middle character, and the third the end character. Of course, if the spoken training word has only two characters, the two characters are respectively a start character and a middle character; if the spoken training word has only one character, it is a single character.
If a certain character in the spoken training text does not form any spoken training word, the word is a component character which does not belong to any spoken type.
For convenience of labeling, the letters B, I, E, and S may be used to denote the start character, middle character, end character, and single character respectively. If there are 6 spoken types, each denoted by a distinct identifier, then 24 labels can be formed by combining the 6 spoken types with B, I, E, and S, and a component character not belonging to any spoken type is denoted by the label O, giving 25 labels in total.
Therefore, the training label of each single character can be set based on the spoken type of the spoken training word formed by each single character and the starting character, the middle character, the ending character, the single character or the component characters which do not belong to any spoken type in the spoken training word formed by the single character.
For example, suppose the spoken types include spoken statement words, that this type is denoted "colloquial", and that a spoken training word of this type in the training text is the two-character word "this side"; then the training label of the character "this" can be set to colloquial_B, and the training label of the character "side" to colloquial_I.
For the component characters which do not belong to any spoken language type in the spoken language training text, the training labels can be directly set as O.
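The character-level labeling just described can be sketched as follows (span-based input and the tag spelling `colloquial_B` follow the example in the text; the two-character convention of start + middle character is applied as described above; a hypothetical illustration, not the claimed implementation):

```python
def bies_labels(text, spans):
    """Assign a B/I/E/S training label to every character.

    `spans` is a list of (start, end, spoken_type) character spans
    (end exclusive); all other characters get the label "O".
    """
    labels = ["O"] * len(text)
    for start, end, typ in spans:
        length = end - start
        if length == 1:
            labels[start] = f"{typ}_S"            # single character
        else:
            labels[start] = f"{typ}_B"            # start character
            for i in range(start + 1, end - 1):
                labels[i] = f"{typ}_I"            # middle characters
            # Per the text, a two-character word is start + middle;
            # words of three or more characters get an end character.
            labels[end - 1] = f"{typ}_I" if length == 2 else f"{typ}_E"
    return labels

print(bies_labels("abcde", [(1, 3, "colloquial")]))
```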
In addition, in some embodiments, for each single word in the spoken training text, setting a training label of each single word according to the spoken type of the spoken training word formed by the single word and the start word, the middle word, the end word, the single word, or a component word not belonging to any spoken type in the spoken training word formed by the single word may include:
aiming at each single character in the spoken language training text, setting a training label of each single character according to the text position of the single character in the spoken language training text, the spoken language type of the spoken training word formed by the single character and the starting character, the middle character, the ending character, the single character or the composition character which does not belong to any spoken language type in the spoken language training word formed by the single character.
Namely, position information is added into the training label of each single character to indicate the text position in the spoken language training text where the single character is located, so that the accuracy of the spoken language recognition model is improved.
In some embodiments, the training of the spoken language recognition model using the spoken language training text and the training labels of the individual words of the spoken language training text may include:
converting each single character of the spoken language training text into a character vector;
and taking the word vector of each single word of the spoken language training text as model input, and taking the training label of each single word as model output to train the spoken language recognition model.
The word vectors of all the single words of the spoken training text form a vector sequence to be input into the spoken language recognition model.
Optionally, there are multiple ways to convert a character into a character vector. For example, the vectors may be obtained with a skip-gram model, and other Word2Vec (word-to-vector) models may certainly also be used; this is the same as the prior art and is not described again here.
The spoken-language recognition model can be implemented with a neural network model. As an alternative, it can adopt a BiLSTM (Bidirectional Long Short-Term Memory) + CRF (Conditional Random Field) neural network architecture; that is, the spoken-language recognition model consists of an input layer, a bidirectional LSTM layer, a CRF layer, and an output layer, where the bidirectional LSTM layer and the CRF layer are intermediate layers.
The character vectors of the individual characters serve as the input of the input layer; the bidirectional LSTM layer then models the input-layer output in both the forward and backward directions, and the forward and backward neuron outputs representing each character are concatenated and fed to the CRF layer; the CRF layer models the relationships between labels and is connected to the output layer, whose output is the label of each individual character. Training on the spoken training texts according to this principle yields the spoken-language recognition model.
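The CRF layer's role, choosing the label sequence that is globally best given per-character emission scores (from the BiLSTM) and label-to-label transition scores, can be illustrated with a minimal Viterbi decoder over toy, untrained scores (hypothetical numbers, for illustration only):

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: list of {label: score} per character (BiLSTM stand-in);
    transitions: {(prev_label, cur_label): score} (CRF stand-in).
    """
    best = {l: (emissions[0][l], [l]) for l in labels}
    for em in emissions[1:]:
        nxt = {}
        for cur in labels:
            # Pick the best previous label for the current one
            prev, (score, path) = max(
                ((p, best[p]) for p in labels),
                key=lambda kv: kv[1][0] + transitions.get((kv[0], cur), 0.0),
            )
            nxt[cur] = (score + transitions.get((prev, cur), 0.0) + em[cur],
                        path + [cur])
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]

labels = ["O", "B", "I"]
transitions = {("B", "I"): 1.0, ("O", "I"): -5.0}   # toy: I must follow B
emissions = [{"O": 1.0, "B": 0.5, "I": 0.0},
             {"O": 0.0, "B": 0.0, "I": 1.0},
             {"O": 1.0, "B": 0.0, "I": 0.0}]
print(viterbi(emissions, transitions, labels))
```

The transition scores let the decoder prefer well-formed tag sequences (e.g. an I tag following a B tag), which is exactly the label-relationship modeling attributed to the CRF layer above.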
In some embodiments, said identifying spoken words in said target spoken text using said spoken language identification model comprises:
identifying target labels of the individual characters in the target spoken language text by using the spoken language identification model;
and determining the spoken words in the target spoken language text based on the target labels of the single words.
Because each target label encodes both the spoken type of a character and the character's position (start, middle, end, or single character) within the spoken word it forms, the spoken words in the target spoken text can be obtained from the target labels of the different individual characters.
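Recovering spoken words and their types from the per-character target labels can be sketched as follows (tag spelling as in the earlier example; a hypothetical illustration, not the claimed implementation):

```python
def decode_spoken_words(text, labels):
    """Group per-character labels like "mood_B", "mood_I", "mood_E",
    "mood_S", and "O" back into (spoken_word, spoken_type) pairs."""
    words, start, typ = [], None, None
    for i, lab in enumerate(labels + ["O"]):   # sentinel flushes a final word
        boundary = lab == "O" or lab.endswith("_B") or lab.endswith("_S")
        if boundary and start is not None:
            words.append((text[start:i], typ))
            start, typ = None, None
        if lab.endswith("_S"):
            words.append((text[i], lab[:-2]))  # single-character word
        elif lab.endswith("_B"):
            start, typ = i, lab[:-2]           # open a multi-character word
    return words

print(decode_spoken_words("abcde", ["O", "mood_B", "mood_I", "mood_E", "O"]))
```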
In some embodiments, said identifying spoken words of said target spoken text using said spoken recognition model comprises:
converting each single character of the target spoken language text into a word vector;
inputting the word vector of each single word of the target spoken language text into the spoken language identification model to obtain a target label of each single word in the target spoken language text;
and determining at least one spoken word in the target spoken text based on the target labels of different single words.
The spoken words do not affect the correct meaning expressed by the spoken text; that is, with respect to expressing the correct meaning, they are non-essential morphemes of the spoken text.
Thus, as an alternative, the eliminating the spoken word from the target spoken text comprises:
deleting the spoken word from the target spoken text.
As another alternative, the eliminating the spoken word from the target spoken text may include:
respectively scoring a target spoken language text comprising the spoken words and a text obtained by deleting the spoken words from the target spoken language text by using a first language model; wherein the first language model is obtained based on a first standard written text training;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
Scoring with the first language model judges the fluency of the target spoken text and of the text with the spoken word deleted. Based on the scoring results, if the text with the spoken word deleted scores better than the target spoken text, it can be determined that the spoken word is deleted from the target spoken text; if the target spoken text scores better, the spoken word is retained in the target spoken text.
The first language model may be an N-gram language model or a neural network language model such as an LSTM-based model, and it may be trained with tools such as SRILM, KenLM, or RNNLM, or with other unsupervised learning models.
The first language model may be obtained by training as follows:
segmenting the first standard written text;
and training the first language model by using each word obtained by segmenting the first standard written text.
The first standard written text refers to text expressed in written language.
In some embodiments, scoring the target spoken text including the spoken word and the text resulting from deleting the spoken word from it, respectively, using the first language model may include:
performing word segmentation on the target spoken language text;
based on each word obtained by word segmentation of the target spoken language text, scoring by using the first language model;
segmenting a text obtained by deleting the spoken words from the target spoken text;
and based on each word obtained by text word segmentation obtained by deleting the spoken word, scoring by using the first language model.
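The scoring comparison can be illustrated with a minimal add-one-smoothed bigram model standing in for the first language model (the toy corpus, tokenization, and decision rule are assumptions for illustration, not the trained model of the application):

```python
import math
from collections import Counter

class BigramLM:
    """Add-one-smoothed bigram model; stands in for the first language model."""
    def __init__(self, corpus):
        self.uni, self.bi = Counter(), Counter()
        for sent in corpus:
            toks = ["<s>"] + sent
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))

    def score(self, words):
        """Log-probability of the word sequence; higher means more fluent."""
        toks = ["<s>"] + words
        v = len(self.uni)  # observed vocabulary size, for add-one smoothing
        return sum(
            math.log((self.bi[(a, b)] + 1) / (self.uni[a] + v))
            for a, b in zip(toks, toks[1:])
        )

# Toy "first standard written text" corpus
lm = BigramLM([["the", "court", "opened"], ["the", "court", "adjourned"]])
with_filler = ["the", "uh", "court", "opened"]
without = ["the", "court", "opened"]
# Delete the spoken word only if the deleted version scores at least as well.
delete = lm.score(without) >= lm.score(with_filler)
print(delete)
```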
In some embodiments, the target spoken text may have a contextual relationship; for example, when the target spoken text is a to-be-processed spoken text in the speech recognition result, it has a contextual relationship with the surrounding spoken texts.
Accordingly, the method may further comprise:
acquiring at least one preceding spoken text of the target spoken text;
taking the target spoken text and the at least one preceding spoken text as the text to be processed;
the eliminating the spoken word from the target spoken text may include:
respectively scoring the text to be processed comprising the spoken words and the text obtained by deleting the spoken words from the text to be processed by utilizing a first language model;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
Scoring the text to be processed including the spoken word and the text obtained by deleting the spoken word from the text to be processed by using the first language model respectively may include:
performing word segmentation on the text to be processed;
based on each word obtained by word segmentation of the text to be processed, scoring by using the first language model;
segmenting the text obtained by deleting the spoken words from the text to be processed;
and based on each word obtained by text word segmentation obtained by deleting the spoken word, scoring by using the first language model.
In addition, in practical applications, spoken words of the mood-word type sometimes carry content supported by the context; for example, a mood word may itself be the answer to a question asked in the preceding sentence, in which case it cannot be deleted directly. Spoken words of non-mood types, by contrast, can be deleted directly.
Thus, the identifying spoken words in the target spoken text using the spoken language identification model comprises:
recognizing the spoken words in the target spoken language text and the spoken types of the spoken words by using a spoken language recognition model; wherein the spoken language type comprises a mood word, a spoken language statement word, a pause word, a repeated word or a corrected word;
if the spoken word is of a non-mood-word type, deleting the spoken word from the target spoken text;
if the spoken words are of the mood-word type, respectively scoring, by using a first language model, a target spoken text comprising the spoken words and a text in which the spoken words are deleted from the target spoken text; wherein the first language model is obtained based on first standard written text training;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
For the specific implementation manner of deleting spoken words of the mood-word type and the training of the first language model, reference may be made to the description above; details are not repeated here.
Alternatively, since a spoken word of the mood-word type may carry on the content of the preceding spoken text of the target spoken text, if the spoken word is of the mood-word type, scoring the target spoken text including the spoken word and the text in which the spoken word is deleted from the target spoken text, respectively, by using the first language model may include:
if the spoken word is of the mood-word type, acquiring at least one upper spoken language text of the target spoken text;
taking the target spoken language text and the at least one upper spoken language text as texts to be processed;
and respectively scoring the text to be processed comprising the spoken words and the text obtained by deleting the spoken words from the text to be processed by utilizing a first language model.
In addition, the spoken text may contain erroneous words. For example, when the spoken text is descriptive text in a specific field, such as text obtained by speech recognition in a court trial scene, specific technical terms may appear, and many entity nouns such as names of people, places and organizations may be involved. These technical terms and entity nouns can be regarded as professional words. Because professional words are often uncommon, the speech recognition result may get them wrong, which affects the accuracy of the text expression even after spoken errors are eliminated. Therefore, in order to also ensure the accuracy of text expression, the text processing method shown in fig. 2 may include the following steps:
201: and acquiring a target spoken language text.
202: and recognizing the spoken words in the target spoken text by using a spoken recognition model.
203: eliminating the spoken word from the target spoken text.
The spoken language recognition model is obtained by training based on a spoken language training text labeled with spoken training words.
The operations of step 201 to step 203 are the same as the operations of step 101 to step 103 shown in fig. 1, and are not described herein again.
204: and searching for error professional words matched with the target professional words in the target spoken language text based on the professional word list.
205: and eliminating the wrong professional words in the target spoken language text by utilizing the target professional words. So that a normative text can be obtained that eliminates spoken words as well as erroneous professional words.
It should be noted that the removal of the spoken word and the removal of the error professional word from the target spoken text may be performed simultaneously or sequentially, and the sequential execution order is not limited to the order of steps in the present embodiment.
In addition, a sentence-breaking error may occur in the target spoken text, for example, a punctuation mark is used incorrectly or a punctuation mark is not used, and particularly when the target spoken text is obtained by speech recognition, the sentence-breaking error may affect fluency of text expression, and therefore, in order to ensure fluency of text expression, the text processing method shown in fig. 3 may include the following steps:
301: and acquiring a target spoken language text.
302: and recognizing the spoken words in the target spoken text by using a spoken recognition model.
303: eliminating the spoken word from the target spoken text.
The spoken language recognition model is obtained by training based on a spoken language training text labeled with spoken training words.
The operations of step 301 to step 303 are the same as the operations of step 101 to step 103 shown in fig. 1, and are not described again here.
304: and eliminating sentence-breaking errors from the target spoken text. Thereby, a standard text for eliminating errors of spoken words and sentence breaks can be obtained.
It should be noted that the elimination of the spoken word and the elimination of the sentence break error from the target spoken text may be performed simultaneously or sequentially, and the sequential execution order is not limited to the order of steps in the present embodiment.
In addition, in order to eliminate the spoken language error and ensure the accuracy and fluency of the text expression, the text processing method shown in fig. 4 may include the following steps:
401: and acquiring a target spoken language text.
402: and recognizing the spoken words in the target spoken text by using a spoken recognition model.
403: eliminating the spoken word from the target spoken text.
The spoken language recognition model is obtained by training based on a spoken language training text labeled with spoken training words.
The operations of step 401 to step 403 are the same as the operations of step 101 to step 103 shown in fig. 1, and are not described again here.
404: and searching for error professional words matched with the target professional words in the target spoken language text based on the professional word list.
405: and eliminating the wrong professional words in the target spoken language text by utilizing the target professional words.
406: and eliminating sentence-breaking errors from the target spoken text. Thereby, a standard text for eliminating spoken words, eliminating wrong professional words and eliminating sentence errors can be obtained.
It should be noted that the elimination of the spoken word, the elimination of the wrong specialized word, and the elimination of the sentence break error from the target spoken text may be performed simultaneously or sequentially, and the sequential execution order is not limited to the order of steps in the present embodiment.
The professional word list may include technical terms or entity nouns.
The professional words in the professional vocabulary can be pre-configured.
In addition, entity nouns can also be obtained by performing offline mining on written texts in the same field as the target spoken text; for example, in a court trial scene, the written texts may be indictments, party information, evidence information and the like.
The written texts can be mined by using regular expressions; alternatively, the written texts can be segmented into words, the occurrence frequency of different words can be counted, and words with a higher frequency can be taken as entity nouns.
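As a non-limiting sketch of regular-expression mining (the pattern, the role markers and the English stand-in text are hypothetical; real inputs would be Chinese indictments and similar documents):

```python
import re

# Hypothetical indictment-style text; role markers introduce entity nouns.
text = "Defendant Zhang San and plaintiff Li Si appeared in court."

# Non-capturing group matches the role marker; the capture group grabs the
# two-token name that follows it.
pattern = re.compile(r"(?:Defendant|plaintiff)\s+(\w+\s\w+)")

entity_nouns = pattern.findall(text)  # mined candidate entity nouns
assert entity_nouns == ["Zhang San", "Li Si"]
```

The mined candidates would then be merged into the professional word list alongside the preconfigured technical terms.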
As an optional manner, the removing the wrong specialized word in the target spoken language text by using the target specialized word may include:
and replacing the erroneous professional words in the target spoken text with the target professional words.
As another alternative, in order to further improve the text expression accuracy, the eliminating the wrong specialized words in the target spoken text by using the target specialized words may include:
respectively scoring a target spoken language text comprising the wrong professional words and a text obtained by replacing the wrong professional words in the target spoken language text with the target professional words by using a second language model; wherein the second language model is obtained based on a second standard written text training;
and determining whether to replace the error professional word with the target professional word based on the scoring result.
If the scoring result of the target spoken text is better than that of the text obtained by replacing the erroneous professional words with the target professional words, the erroneous professional words in the target spoken text are not replaced; if the scoring result of the replaced text is better than that of the target spoken text, the erroneous professional words are replaced with the target professional words.
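The replace-only-if-the-score-improves rule can be sketched as follows; the scorer here is a toy stand-in for the second language model, and the tokens are hypothetical:

```python
def maybe_replace(tokens, idx, target, score_fn):
    """Swap in the target professional word only when the language-model
    score improves (score_fn is a hypothetical higher-is-better scorer)."""
    candidate = tokens[:idx] + [target] + tokens[idx + 1:]
    return candidate if score_fn(candidate) > score_fn(tokens) else tokens

# Toy scorer that simply prefers in-vocabulary words; a real system would
# plug in the trained second language model here.
vocab = {"the", "defendant", "confessed"}
score = lambda toks: sum(t in vocab for t in toks)

assert maybe_replace(["the", "defandant", "confessed"], 1, "defendant", score) \
       == ["the", "defendant", "confessed"]
```

Replacements that do not improve the score leave the text unchanged, which matches the decision rule described above.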
Wherein the second standard written text may be written text containing professional words and expressed in written language, etc.
Further, the first standard written text may be the same as the second standard written text, and the first language model may be the same as the second language model. The second language model can be implemented with an N-gram language model (for example, one built with toolkits such as SRILM or KenLM), a neural network language model such as an LSTM-based model or RNNLM, or other unsupervised learning models.
In some embodiments, the scoring, respectively by using the second language model, the target spoken language text including the wrong specialized word and the text obtained by replacing the wrong specialized word in the target spoken language text with the target specialized word includes:
acquiring at least one upper spoken language text of the target spoken language text;
taking the target spoken language text and the at least one upper spoken language text as texts to be processed;
respectively scoring, by using a second language model, the text to be processed comprising the erroneous professional words and the text obtained by replacing the erroneous professional words in the text to be processed with the target professional words.
In some embodiments, the searching for the wrong specialized word in the target spoken text that matches the target specialized word based on the specialized word list may include:
searching in the target spoken text by using each professional word in the professional word list to find an erroneous professional word satisfying a similarity condition with any professional word; the professional word satisfying the similarity condition with the erroneous professional word is taken as the target professional word.
The similarity condition may be, for example, that the similarity between word vectors is greater than a similarity threshold.
To improve the search accuracy, in some embodiments, the searching in the target spoken text by using each professional word in the professional word list, and the finding of the wrong professional word satisfying a similar condition to any professional word may include:
converting the target spoken language text into a pinyin sequence;
and searching in the target spoken text by using the pinyin sequence of each professional word in the professional word list, to find the erroneous professional word corresponding to a candidate pinyin sequence that satisfies the similarity condition with the pinyin sequence of any professional word.
Wherein, the number of characters corresponding to the candidate pinyin sequence is the same as the number of characters corresponding to the professional word.
The candidate pinyin sequence satisfying the similarity condition with the pinyin sequence of any professional word may be a candidate pinyin sequence having a pinyin similarity with the pinyin sequence of any professional word greater than a pinyin similarity threshold.
The pinyin similarity can be obtained by calculating the edit distance, or based on the Jaro distance (an improved edit-distance measure) or other improved edit-distance calculations.
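A minimal sketch of the edit-distance-based pinyin similarity follows; the pinyin strings are hypothetical examples, and comparing the result against a pinyin similarity threshold is left to the caller:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance with a rolling one-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def pinyin_similarity(p1, p2):
    """1.0 for identical pinyin sequences, decreasing as edits accumulate."""
    if not p1 and not p2:
        return 1.0
    return 1.0 - edit_distance(p1, p2) / max(len(p1), len(p2))

# Hypothetical pinyin strings: a correct term vs. a mis-recognized variant.
assert edit_distance("beigaoren", "beigaoreng") == 1
```

A Jaro or Jaro-Winkler variant could be substituted for `edit_distance` without changing the surrounding logic.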
In some embodiments, a heuristic search method, such as the forward maximum matching algorithm, may be used to search the target spoken text with each professional word in the professional word list, so as to reduce the search workload.
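The forward maximum matching scan can be sketched as follows; the vocabulary and the unsegmented text are toy English stand-ins for the Chinese character sequence:

```python
def forward_max_match(text, vocab):
    """Greedy longest-match scan: at each position try the longest
    vocabulary entry first, then shorter ones; returns (start, word) hits."""
    max_len = max(map(len, vocab))
    hits, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if cand in vocab:
                hits.append((i, cand))
                i += length
                break
        else:
            i += 1  # no match at this position; advance one character
    return hits

vocab = {"plaintiff", "defendant", "evidence"}
hits = forward_max_match("theplaintiffsubmittedevidence", vocab)
assert hits == [(3, "plaintiff"), (21, "evidence")]
```

In the embodiment, the same scan can run over pinyin sequences, with exact membership replaced by the similarity condition described above.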
As an alternative, the eliminating of sentence break errors from the target spoken text may comprise:
for each punctuation mark to be processed in the target spoken text, respectively scoring the target spoken text comprising the punctuation mark and different texts obtained by respectively replacing the punctuation mark to be processed in the target spoken text by using different candidate punctuation marks by using a third language model; wherein the third language model is obtained by utilizing third standard written text training;
and selecting the optimal punctuation mark to replace the punctuation mark to be processed based on the scoring result.
The candidate punctuation marks corresponding to each punctuation mark to be processed in the target spoken language text can be any punctuation marks different from the punctuation marks to be processed.
Therefore, the optimal punctuation mark can be determined based on the scoring result. The optimal punctuation mark may be any candidate punctuation mark, or may be the punctuation mark to be processed itself; in the latter case, no punctuation mark replacement is performed.
Wherein the third standard written text may be a text expressed in written language in which the punctuation mark is correctly set.
Of course, the first standard written text or the second standard written text may be the same as the third standard written text, so that the first language model or the second language model may be the same as the third language model. Such a general model can score punctuation mark replacement, spoken word deletion and professional word replacement alike.
In some embodiments, the scoring, for each punctuation mark to be processed in the target spoken text, the target spoken text including the punctuation mark and the different texts obtained by respectively replacing the punctuation mark to be processed in the target spoken text with different candidate punctuation marks using a third language model includes:
acquiring at least one upper spoken language text of the target spoken language text;
taking the target spoken language text and the at least one upper spoken language text as texts to be processed;
and aiming at each punctuation mark to be processed in the target spoken language text, respectively scoring the text to be processed comprising the punctuation mark and different texts obtained by respectively replacing the punctuation mark to be processed in the text to be processed by different candidate punctuation marks by using a third language model.
As another alternative, the eliminating of sentence break errors from the target spoken text may include:
determining an insertion position of a punctuation mark required to be inserted in a target spoken language text;
for each inserting position, respectively scoring different texts obtained by respectively inserting different candidate punctuations into the target spoken language text by using a fourth language model; wherein the fourth language model is obtained by training with a fourth standard written text;
based on the scoring result, selecting an optimal candidate punctuation mark to be added to the punctuation insertion location.
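The step of trying each candidate punctuation mark at an insertion position and keeping the best-scoring text can be sketched as follows; the toy scorer is a stand-in for the fourth language model, and the tokens are hypothetical:

```python
def insert_best_punct(tokens, pos, candidates, score_fn):
    """Insert each candidate punctuation mark at pos and keep the text
    that the (hypothetical) language model scores highest."""
    best = max(candidates,
               key=lambda p: score_fn(tokens[:pos] + [p] + tokens[pos:]))
    return tokens[:pos] + [best] + tokens[pos:]

# Toy scorer: rewards a period placed right before a capitalized word.
def toy_score(toks):
    return sum(1 for a, b in zip(toks, toks[1:])
               if a == "." and b[:1].isupper())

tokens = ["court", "adjourned", "The", "trial", "resumed"]
out = insert_best_punct(tokens, 2, [",", "."], toy_score)
assert out == ["court", "adjourned", ".", "The", "trial", "resumed"]
```

The replacement variant described earlier works the same way, except that the candidate mark overwrites the punctuation mark to be processed instead of being inserted.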
The fourth language model may be the same as the third language model, and the fourth standard written text is likewise text expressed in written language with punctuation marks correctly set.
Optionally, the punctuation insertion positions at which punctuation marks need to be inserted in the target spoken text may be determined as described below.
in some embodiments, the scoring, for each insertion position, the different texts obtained by respectively inserting the different candidate punctuations into the target spoken text by using the fourth language model may include:
acquiring at least one upper spoken language text of the target spoken language text;
taking the target spoken language text and the at least one upper spoken language text as texts to be processed;
and for each insertion position, respectively scoring different texts obtained by respectively inserting different candidate punctuations into the text to be processed by utilizing a fourth language model.
The punctuation insertion positions at which punctuation marks need to be inserted in the target spoken text can be determined by performing sentence-break recognition on the target spoken text with a pre-trained sentence-break recognition model. The target spoken text is segmented into words, and the relevance between adjacent words is calculated by the sentence-break recognition model; whether a punctuation mark needs to be inserted between two words is then determined based on that relevance, and if so, a punctuation insertion position is determined between the two words.
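The relevance-based determination of insertion positions can be sketched as follows; the precomputed relevance scores and the threshold are hypothetical stand-ins for the sentence-break recognition model's output:

```python
def insertion_points(words, relevance, threshold=0.3):
    """Mark a punctuation insertion point between adjacent words whose
    relevance (hypothetical precomputed score) falls below the threshold."""
    return [i + 1 for i, (a, b) in enumerate(zip(words, words[1:]))
            if relevance(a, b) < threshold]

# A weak link between "court" and "the" suggests a sentence boundary there.
rel = {("court", "the"): 0.1}
words = "the court the trial".split()
assert insertion_points(words, lambda a, b: rel.get((a, b), 0.9)) == [2]
```

Each marked position would then be scored with the candidate punctuation marks as described above.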
As can be seen from the above description, the normalization of the target spoken text may include eliminating spoken words, eliminating incorrect professional words using professional words, and eliminating sentence break errors.
Thus, in some embodiments, said eliminating said spoken word from said target spoken text to obtain a canonical text may comprise:
the method comprises the steps of eliminating the spoken words from the target spoken language text, eliminating the wrong specialized words in the target spoken language text by using the target specialized words, and eliminating sentence break errors from the target spoken language text to obtain a standard text.
In some embodiments, the removing the spoken word from the target spoken language text, the removing the wrong specialized word from the target spoken language text with the target specialized word, and the removing the sentence break error from the target spoken language text may include:
taking each spoken word, each error professional word, each punctuation mark and each punctuation insertion position in the target spoken text as check points respectively;
for each check point, respectively scoring a target spoken language text comprising the check point and a candidate text for eliminating the check point by using a fifth language model;
determining whether to eliminate the check point from the target spoken text based on a scoring result.
For the labeling manner and the elimination manner of each type of check point, namely when the check point is a spoken word, an erroneous professional word, a punctuation mark or a punctuation insertion position, reference may be made to the description above; details are not repeated here.
In some embodiments, the scoring, for each checkpoint, a target spoken language text including the checkpoint and a text resulting from elimination of the checkpoint from the target spoken language text using a fifth language model may include:
acquiring at least one upper spoken language text of the target spoken language text;
taking the target spoken language text and the at least one upper spoken language text as texts to be processed;
and for each check point, respectively scoring the to-be-processed text comprising the check point and the text obtained by eliminating the check point from the to-be-processed text by utilizing a fifth language model.
In addition, a check point with a sentence-break error may correspond to multiple candidate elimination manners, such as the candidate punctuation marks inserted at a punctuation insertion position or the candidate punctuation marks corresponding to a punctuation mark to be processed. When many candidates would have to be scored with the fifth language model, in some embodiments, to improve efficiency, the scoring, for each check point, of the target spoken text including the check point and the candidate texts for eliminating the check point by using the fifth language model may include:
aiming at each check point, sequencing and screening a plurality of candidate elimination modes corresponding to each check point to obtain a predetermined number of candidate elimination modes;
respectively scoring the target spoken language texts including the check points and texts obtained by respectively eliminating the check points from the target spoken language texts according to the preset number of candidate elimination modes by using a fifth language model;
and based on the scoring result, eliminating the check point from the target spoken language text in an optimal candidate elimination mode.
The candidate elimination manners corresponding to each check point may be sorted and screened by adopting a beam search strategy, so as to obtain the predetermined number of candidate elimination manners.
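The sorting-and-screening step can be sketched as a one-step beam pruning; the candidate manners and their rough scores are hypothetical, and a full beam search would repeat this pruning as it advances check point by check point:

```python
import heapq

def beam_filter(modes, rough_score, beam_width=3):
    """Keep only the top-`beam_width` candidate elimination manners by a
    cheap preliminary score before the expensive language-model pass."""
    return heapq.nlargest(beam_width, modes, key=rough_score)

modes = ["keep", "delete", "replace_comma", "replace_period", "insert_comma"]
rough = {"keep": 0.9, "delete": 0.2, "replace_comma": 0.7,
         "replace_period": 0.8, "insert_comma": 0.1}

survivors = beam_filter(modes, rough.get)
assert survivors == ["keep", "replace_period", "replace_comma"]
```

Only the surviving candidates are then scored with the fifth language model, which keeps the number of expensive evaluations bounded by the beam width.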
By eliminating spoken errors, sentence-break errors and/or professional word errors from the target spoken text, the target spoken text is normalized to obtain the standard text.
In certain embodiments, the method may further comprise:
and outputting the specification text.
In addition, the user can view the output specification text and modify places where the conversion is inaccurate.
Thus, in certain embodiments, the method may further comprise:
and receiving an updating operation aiming at the standard text, and updating the standard text.
Wherein outputting the canonical text may be sending the canonical text to a display device to output the canonical text in the display device.
An update operation on the standard text may likewise be detected by the display device and fed back to the server.
In certain embodiments, the method may further comprise:
and converting the standard text into target voice data.
That is, the canonical Text is converted into target Speech data by a TTS (Text To Speech) technique.
Since the target spoken text can be obtained by performing speech recognition on spoken speech data, converting the resulting standard text into target speech data achieves the purpose of correcting spoken speech into written-language speech.
Optionally, the method may further include:
and playing the target voice data.
In some scenes, such as a human-machine dialogue scene or a telephone customer service scene, collected speech input by a user can be recognized and converted into text; the text is then converted into a standard text in written-language form by the technical solution of the present application, and the standard text is converted into speech data to be played. In this way, written-language speech is output, which facilitates information transmission and communication.
In a practical application, the technical solution of the embodiment of the application can be applied to a court trial scene, and the target spoken language text is any one to-be-processed spoken language text in the court trial voice recognition result.
As can be seen from the schematic diagram shown in fig. 5, the target spoken text 501 may be obtained from the court trial speech recognition result, which is obtained by recognizing the court trial speech using speech recognition technology.
In addition, in order to improve text processing accuracy, at least one upper spoken language text of the target spoken text may be acquired from the court trial speech recognition result besides the target spoken text itself. The court trial speech recognition result can be divided into a plurality of spoken texts, which are processed sequentially in chronological order.
In addition, the target spoken text can be subjected to preprocessing operations, such as word segmentation, part-of-speech tagging, and the like, so as to facilitate subsequent processing.
Identifying spoken words 502 in the target spoken text using a spoken recognition model for the target spoken text;
the spoken language identification model can be obtained 503 by pre-training a spoken language training text labeled with spoken training words; the specific training mode of the spoken language identification model can be referred to above, and is not described herein again.
In addition, for the target spoken text, an error professional word 504 matching the target professional word in the target spoken text may be found based on the professional word list.
The professional word list stores a large number of preconfigured professional words related to court trials, which may cover entity nouns related to the trial as well as technical terms involved in court trials, such as legal terms.
For entity nouns, the professional word list can be obtained by mining the indictment, party information, evidence information and the like related to the court trial.
The specific process of finding the wrong specialized word matching the target specialized word may be as described above.
In addition, a sentence break error 505 in the target spoken text can also be identified for the target spoken text.
Therefore, after the spoken words, the erroneous professional words and the sentence-break errors in the target spoken text are determined, the target spoken text can be corrected to eliminate them, which comprises: removing the spoken words from the target spoken text 506; eliminating the erroneous professional words by using the target professional words 507; and eliminating the sentence-break error 508 from the target spoken text.
The specific manner of eliminating spoken errors, professional word errors and sentence-break errors is described in detail above: each check point can be scored by using a language model, and whether to eliminate the check point, and in which manner, is determined based on the scoring result.
The specification text 509 is obtained by normalizing the target spoken text.
The standard text can be sent to a display device used by the court clerk so that the standard text is output on the display device; the court trial speech recognition result can also be output on the display device for the clerk's comparison, and the clerk can perform an update operation on the standard text to manually adjust the converted result.
Fig. 6 is a schematic structural diagram of an embodiment of a text processing apparatus according to an embodiment of the present application, where the apparatus may include:
a text obtaining module 601, configured to obtain a target spoken language text;
a spoken language error recognition module 602, configured to recognize spoken words in the target spoken language text by using a spoken language recognition model;
a spoken language error elimination module 603 configured to eliminate the spoken words from the target spoken language text to obtain a canonical text.
The spoken language identification model can be obtained based on the training of a spoken language training text of a marked spoken language training word.
In some embodiments, the apparatus may further comprise:
the model pre-training module is used for: acquiring a spoken training text; determining at least one spoken training word in the spoken training text and the spoken type of the at least one spoken training word; for each single character in the spoken training text, setting a training label of the single character according to the spoken type of the spoken training word that the single character forms and according to whether the single character is a starting character, a middle character, an ending character or a single character of that spoken training word, or a component character not belonging to any spoken type; and training the spoken language recognition model by using the spoken training text and the training labels of the single characters of the spoken training text.
In some embodiments, the setting of the training labels by the model pre-training module includes: for each single character in the spoken training text, setting the training label of the single character according to the text position of the single character in the spoken training text, according to the spoken type of the spoken training word that the single character forms, and according to whether the single character is a starting character, a middle character, an ending character or a single character of that spoken training word, or a component character not belonging to any spoken type.
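Recovering spoken words from the per-character labels at inference time can be sketched as follows; the B-/I-/E-/S- plus O tag scheme mirrors the starting/middle/ending/single-character labels described above, and the characters and the "mood" type tag are illustrative:

```python
def decode_spans(chars, labels):
    """Recover (spoken_word, spoken_type) spans from per-character tags.
    Assumed scheme: B-x begin, I-x inside, E-x end, S-x single-character
    spoken word, O for characters not belonging to any spoken type."""
    spans, buf, cur = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("S-"):
            spans.append((ch, lab[2:]))
            buf, cur = [], None
        elif lab.startswith("B-"):
            buf, cur = [ch], lab[2:]
        elif lab.startswith("I-") and cur == lab[2:]:
            buf.append(ch)
        elif lab.startswith("E-") and cur == lab[2:]:
            spans.append(("".join(buf + [ch]), cur))
            buf, cur = [], None
        else:  # "O" or an inconsistent tag resets the buffer
            buf, cur = [], None
    return spans

chars = list("umthatisfine")          # toy characters standing in for Hanzi
labels = ["B-mood", "E-mood"] + ["O"] * 10
assert decode_spans(chars, labels) == [("um", "mood")]
```

The decoded spans and their spoken types then feed the type-dependent elimination logic (direct deletion versus language-model scoring) described earlier.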
In some embodiments, the model pre-training module utilizes the spoken training text and training labels of individual words of the spoken training text, and training the spoken recognition model comprises: converting each single character of the spoken language training text into a character vector; and taking the word vector of each single word of the spoken language training text as model input, and taking the training label of each single word as model output to train the spoken language recognition model.
In some embodiments, the spoken language recognition model comprises an input layer, a bidirectional long short-term memory (BiLSTM) layer, a conditional random field (CRF) layer, and an output layer.
In some embodiments, the spoken language error recognition module is specifically configured to recognize a target tag of each single word in the target spoken language text by using the spoken language recognition model; and determining the spoken words in the target spoken language text based on the target labels of the single words.
In some embodiments, the spoken language error recognition module is specifically configured to convert each single word of the target spoken language text into a word vector; inputting the word vector of each single word of the target spoken language text into the spoken language identification model to obtain a target label of each single word in the target spoken language text; and determining the spoken words in the target spoken text based on the target labels of different single words.
In some embodiments, the spoken language error removal module is specifically configured to score a target spoken language text including the spoken word and a text obtained by deleting the spoken word from the target spoken language text by using a first language model respectively; wherein the first language model is obtained based on standard written text training; determining whether to delete the spoken word from the target spoken text based on a scoring result.
In some embodiments, the text acquisition module is further configured to acquire at least one preceding spoken text of the target spoken text, and to take the target spoken text together with the at least one preceding spoken text as the text to be processed;
the spoken language error elimination module is specifically configured to use the first language model to respectively score the text to be processed that includes the spoken word and the text obtained by deleting the spoken word from the text to be processed; and to determine, based on the scoring results, whether to delete the spoken word from the target spoken text.
In some embodiments, the spoken language error recognition module is specifically configured to recognize, using the spoken language recognition model, the spoken words in the target spoken text and the spoken language type of each spoken word, wherein the spoken language types comprise modal particles, colloquial filler words, pause words, repeated words, and self-correction words;
the spoken language error elimination module is specifically configured to delete a spoken word from the target spoken text directly if it is not of the modal particle type; if the spoken word is of the modal particle type, to use a first language model, trained on first standard written text, to respectively score the target spoken text that includes the spoken word and the text obtained by deleting the spoken word from the target spoken text; and to determine, based on the scoring results, whether to delete the spoken word from the target spoken text.
In some embodiments, the apparatus may further comprise:
the professional word error identification module is configured to search the target spoken text, based on a professional word list, for erroneous professional words that match target professional words;
and the professional word error elimination module is configured to eliminate the erroneous professional words in the target spoken text using the target professional words.
In some embodiments, the professional word error elimination module is specifically configured to replace the erroneous professional words in the target spoken text with the target professional words.
In some embodiments, the professional word error elimination module is specifically configured to use a second language model, trained on second standard written text, to respectively score the target spoken text that includes the erroneous professional word and the text obtained by replacing the erroneous professional word in the target spoken text with the target professional word; and to determine, based on the scoring results, whether to replace the erroneous professional word with the target professional word.
In some embodiments, the professional word error identification module is specifically configured to search the target spoken text using each professional word in the professional word list, looking for erroneous professional words that satisfy a similarity condition with any professional word; any professional word that satisfies the similarity condition with an erroneous professional word is taken as the corresponding target professional word.
In some embodiments, the professional word error identification module is specifically configured to convert the target spoken text into a pinyin sequence;
and to search the target spoken text using the pinyin sequence of each professional word in the professional word list, looking for erroneous professional words whose candidate pinyin sequences satisfy a similarity condition with the pinyin sequence of any professional word.
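A minimal sketch of the pinyin-based lookup. The tiny pinyin table and the use of exact pinyin equality as the "similarity condition" are illustrative assumptions; a real implementation would use a full pinyin converter (for example, a library such as pypinyin) and tolerate near matches:

```python
# Assumed miniature character-to-pinyin table, for illustration only.
PINYIN = {"机": "ji", "器": "qi", "气": "qi", "学": "xue", "习": "xi"}

def to_pinyin(chars):
    return [PINYIN.get(c, "") for c in chars]

def find_error_terms(text, term_list):
    """Slide a window over `text` for each professional word; report windows
    whose pinyin sequence matches the word's pinyin but whose characters
    differ (i.e., likely homophone errors)."""
    hits = []
    for term in term_list:
        n, target = len(term), to_pinyin(term)
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if window != term and to_pinyin(window) == target:
                hits.append((window, term))  # (erroneous word, target word)
    return hits

# "机气学习" is a homophone error for the professional word "机器学习".
print(find_error_terms("机气学习", ["机器学习"]))  # [('机气学习', '机器学习')]
```

Each hit pairs an erroneous professional word with its target professional word, feeding directly into the replacement-and-scoring step described above.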
In some embodiments, the apparatus may further comprise:
and the sentence break error elimination module is used for eliminating sentence break errors from the target spoken language text.
In some embodiments, the sentence break error elimination module is specifically configured to, for each punctuation mark to be processed in the target spoken text, use a third language model, trained on third standard written text, to respectively score the target spoken text that includes the punctuation mark to be processed and the different texts obtained by replacing that punctuation mark with different candidate punctuation marks; and to select, based on the scoring results, the best punctuation mark to replace the punctuation mark to be processed.
In some embodiments, the sentence break error elimination module is specifically configured to determine the punctuation insertion positions in the target spoken text where punctuation marks need to be inserted; for each insertion position, to use a fourth language model, trained on fourth standard written text, to respectively score the different texts obtained by inserting different candidate punctuation marks; and to select, based on the scoring results, the best candidate punctuation mark to add at the punctuation insertion position.
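The insertion variant of sentence-break repair can be sketched as an argmax over candidate marks at each insertion position. The `score` function below is an assumed stand-in for the fourth language model, not the patent's actual model:

```python
def score(text):
    # Toy stand-in for a language model trained on standard written text:
    # here it simply prefers text where "however" follows a period.
    # An illustrative assumption only.
    return 1.0 if ". however" in text else 0.0

def best_insertion(text, position, candidates=("", ".", ",")):
    """Try each candidate mark (including inserting nothing) at `position`
    and keep the variant the language model scores highest."""
    variants = [text[:position] + mark + text[position:] for mark in candidates]
    return max(variants, key=score)

print(best_insertion("it rained however we went out", 9))
# 'it rained. however we went out'
```

Including the empty string among the candidates lets the same loop decide that no punctuation should be inserted at a given position.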
In some embodiments, the spoken language error elimination module is specifically configured to eliminate the spoken words from the target spoken text, eliminate the erroneous professional words in the target spoken text using the target professional words, and eliminate sentence break errors from the target spoken text.
In some embodiments, the spoken language error elimination module is specifically configured to take each spoken word, each erroneous professional word, each punctuation mark, and each punctuation insertion position in the target spoken text as a check point; for each check point, to use a fifth language model to respectively score the target spoken text that includes the check point and the text obtained by eliminating the check point from the target spoken text; and to determine, based on the scoring results, whether to eliminate the check point from the target spoken text.
In some embodiments, the scoring performed by the spoken language error elimination module for each check point comprises: acquiring at least one preceding spoken text of the target spoken text; taking the target spoken text and the at least one preceding spoken text as the text to be processed; and, for each check point, using the fifth language model to respectively score the text to be processed that includes the check point and the text obtained by eliminating the check point from the text to be processed.
In some embodiments, the text obtaining module is specifically configured to obtain a target spoken text to be processed from a speech recognition result.
In some embodiments, the apparatus may further comprise:
the text output module is configured to output the canonical text;
and the text updating module is configured to receive an update operation for the canonical text and update the canonical text.
In some embodiments, the apparatus may further comprise:
and the text conversion module is configured to convert the canonical text into target speech data.
The text processing apparatus shown in fig. 6 can execute the text processing method of the embodiment shown in fig. 1; its implementation principles and technical effects are not repeated here. The specific manner in which each module and unit of the text processing apparatus performs its operations has been described in detail in the method embodiments above and is likewise not repeated.
In one possible design, the text processing apparatus of the embodiment shown in fig. 6 may be implemented as a computing device, which may include a storage component 701 and a processing component 702 as shown in fig. 7;
the storage component 701 stores one or more computer instructions, which are invoked and executed by the processing component 702.
The processing component 702 is configured to:
acquiring a target spoken language text;
recognizing the spoken words in the target spoken text by using a spoken language recognition model;
eliminating the spoken word from the target spoken text to obtain a canonical text.
The spoken language recognition model can be obtained by training on spoken training text in which the spoken training words have been annotated.
The processing component 702 may include one or more processors to execute computer instructions to perform all or some of the steps of the methods described above. Alternatively, the processing component may be implemented as one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components configured to perform the above-described methods.
The storage component 701 is configured to store various types of data to support operation of the computing device. The storage component may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
Of course, the computing device may also include other components as needed, such as input/output interfaces and communication components.
The input/output interface provides an interface between the processing components and peripheral interface modules, which may be output devices, input devices, etc.
The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may be a physical device or an elastic computing host provided by a cloud computing platform, and the computing device may be a cloud server, and the processing component, the storage component, and the like may be basic service resources rented or purchased from the cloud computing platform.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a computer, the text processing method of the embodiment shown in fig. 1 may be implemented.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (27)

1. A method of text processing, comprising:
acquiring a target spoken language text;
recognizing the spoken words in the target spoken text by using a spoken language recognition model;
and eliminating the spoken word from the target spoken text to obtain a canonical text.
2. The method of claim 1, wherein the spoken language recognition model is obtained by training on spoken training text annotated with spoken training words.
3. The method of claim 2, wherein the spoken language identification model is obtained by training as follows:
obtaining a spoken language training text;
determining at least one spoken training word in the spoken training text and a spoken type of the at least one spoken training word;
for each individual character in the spoken training text, setting a training label for the character according to the spoken language type of the spoken training word that the character forms part of, and according to whether the character is the beginning character, a middle character, the ending character, or a single character of that spoken training word, or a character that does not belong to a spoken training word of any spoken language type;
and training a spoken language recognition model by using the spoken language training text and the training labels of the individual characters of the spoken language training text.
4. The method of claim 3, wherein, for each individual character in the spoken training text, setting the training label of the character according to the spoken language type of the spoken training word that the character forms part of and the character's role within that word comprises:
for each individual character in the spoken training text, setting the training label of the character according to the text position of the character in the spoken training text, the spoken language type of the spoken training word that the character forms part of, and whether the character is the beginning character, a middle character, the ending character, or a single character of that word, or a character that does not belong to a spoken training word of any spoken language type.
5. The method of claim 3, wherein training the spoken language recognition model using the spoken training text and the training labels of each individual character of the spoken training text comprises:
converting each individual character of the spoken training text into a character vector;
and training the spoken language recognition model with the character vector of each individual character of the spoken training text as the model input and the training label of each individual character as the model output.
6. The method of claim 5, wherein the spoken language recognition model comprises an input layer, a bidirectional long short-term memory (BiLSTM) layer, a conditional random field (CRF) layer, and an output layer.
7. The method of claim 3 or 4, wherein said identifying spoken words in said target spoken text using said spoken recognition model comprises:
identifying target labels of the individual characters in the target spoken language text by using the spoken language identification model;
and determining the spoken words in the target spoken language text based on the target labels of the single words.
8. The method of claim 5, wherein said identifying spoken words of said target spoken text using said spoken language identification model comprises:
converting each individual character of the target spoken text into a character vector;
inputting the character vector of each individual character of the target spoken text into the spoken language recognition model to obtain a target label of each individual character in the target spoken text;
and determining the spoken words in the target spoken text based on the target labels of the individual characters.
9. The method of claim 1, wherein the eliminating the spoken word from the target spoken text comprises:
respectively scoring a target spoken language text comprising the spoken words and a text obtained by deleting the spoken words from the target spoken language text by using a first language model; wherein the first language model is obtained based on standard written text training;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
10. The method of claim 1, further comprising:
acquiring at least one preceding spoken text of the target spoken text;
taking the target spoken text and the at least one preceding spoken text as the text to be processed;
the eliminating the spoken word from the target spoken text comprises:
respectively scoring the text to be processed comprising the spoken words and the text obtained by deleting the spoken words from the text to be processed by utilizing a first language model;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
11. The method of claim 1, wherein said identifying spoken words in the target spoken text using a spoken language recognition model comprises:
recognizing, using the spoken language recognition model, the spoken words in the target spoken text and the spoken language type of each spoken word; wherein the spoken language types comprise modal particles, colloquial filler words, pause words, repeated words, and self-correction words;
the eliminating the spoken word from the target spoken text comprises:
if the spoken word is not of the modal particle type, deleting the spoken word from the target spoken text;
if the spoken word is of the modal particle type, respectively scoring, using a first language model, the target spoken text that includes the spoken word and the text obtained by deleting the spoken word from the target spoken text; wherein the first language model is trained on first standard written text;
determining whether to delete the spoken word from the target spoken text based on a scoring result.
12. The method of claim 1, further comprising:
searching for error professional words matched with the target professional words in the target spoken language text based on a professional word list;
and eliminating the wrong professional words in the target spoken language text by utilizing the target professional words.
13. The method of claim 12, wherein eliminating the erroneous professional words in the target spoken text using the target professional words comprises:
replacing the erroneous professional words in the target spoken text with the target professional words.
14. The method of claim 13, wherein replacing the erroneous professional words in the target spoken text with the target professional words comprises:
respectively scoring a target spoken language text comprising the wrong professional words and a text obtained by replacing the wrong professional words in the target spoken language text with the target professional words by using a second language model; wherein the second language model is obtained based on a second standard written text training;
and determining whether to replace the error professional word with the target professional word based on the scoring result.
15. The method of claim 12, wherein searching the target spoken text for erroneous professional words that match target professional words based on the professional word list comprises:
searching the target spoken text using each professional word in the professional word list for erroneous professional words that satisfy a similarity condition with any professional word; wherein any professional word that satisfies the similarity condition with an erroneous professional word is taken as the target professional word.
16. The method of claim 15, wherein searching the target spoken text using each professional word in the professional word list comprises:
converting the target spoken text into a pinyin sequence;
and searching the target spoken text using the pinyin sequence of each professional word in the professional word list for erroneous professional words whose candidate pinyin sequences satisfy a similarity condition with the pinyin sequence of any professional word.
17. The method according to claim 1 or 12, further comprising:
and eliminating sentence-breaking errors from the target spoken text.
18. The method of claim 17, wherein said eliminating sentence errors from said target spoken text comprises:
for each punctuation mark to be processed in the target spoken text, respectively scoring, using a third language model, the target spoken text that includes the punctuation mark to be processed and the different texts obtained by replacing the punctuation mark to be processed with different candidate punctuation marks; wherein the third language model is trained on third standard written text;
and selecting, based on the scoring results, the best punctuation mark to replace the punctuation mark to be processed.
19. The method of claim 17, wherein said eliminating sentence errors from said target spoken text comprises:
determining punctuation insertion positions in the target spoken text where punctuation marks need to be inserted;
for each insertion position, respectively scoring, using a fourth language model, the different texts obtained by inserting different candidate punctuation marks; wherein the fourth language model is trained on fourth standard written text;
and selecting, based on the scoring results, the best candidate punctuation mark to add at the punctuation insertion position.
20. The method of claim 1, further comprising:
searching for error professional words matched with the target professional words in the target spoken language text based on a professional word list;
the eliminating the spoken word from the target spoken text comprises:
the method further includes removing the spoken words from the target spoken text, removing the incorrect specialized words from the target spoken text using the target specialized words, and removing sentence break errors from the target spoken text.
21. The method of claim 20, wherein eliminating the spoken words from the target spoken text, eliminating the erroneous professional words in the target spoken text using the target professional words, and eliminating sentence break errors from the target spoken text comprises:
taking each spoken word, each error professional word, each punctuation mark and each punctuation insertion position in the target spoken text as check points respectively;
for each check point, respectively scoring a target spoken language text comprising the check point and a text obtained by eliminating the check point from the target spoken language text by using a fifth language model;
determining whether to eliminate the check point from the target spoken text based on a scoring result.
22. The method of claim 20, wherein the scoring, for each checkpoint, a target spoken language text including the checkpoint and a text resulting from elimination of the checkpoint from the target spoken language text using a fifth language model, respectively, comprises:
acquiring at least one preceding spoken text of the target spoken text;
taking the target spoken text and the at least one preceding spoken text as the text to be processed;
and for each check point, respectively scoring the to-be-processed text comprising the check point and the text obtained by eliminating the check point from the to-be-processed text by utilizing a fifth language model.
23. The method of claim 1, wherein obtaining the target spoken text comprises:
acquiring the target spoken text to be processed from a speech recognition result.
24. The method of claim 1, further comprising:
outputting the canonical text;
and receiving an update operation for the canonical text, and updating the canonical text.
25. The method of claim 1, further comprising:
and converting the canonical text into target speech data.
26. A text processing apparatus, comprising:
the text acquisition module is used for acquiring a target spoken language text;
the spoken language error recognition module is used for recognizing spoken words in the target spoken language text by using a spoken language recognition model;
a spoken language error elimination module for eliminating the spoken words from the target spoken language text to obtain a canonical text.
27. A computing device comprising a processing component and a storage component;
the storage component stores one or more computer instructions, which are invoked and executed by the processing component;
the processing component is to:
acquiring a target spoken language text;
recognizing the spoken words in the target spoken text by using a spoken language recognition model;
and eliminating the spoken word from the target spoken text to obtain a canonical text.
CN201910561669.0A 2019-06-26 2019-06-26 Text processing method and device and computing equipment Pending CN112151019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910561669.0A CN112151019A (en) 2019-06-26 2019-06-26 Text processing method and device and computing equipment


Publications (1)

Publication Number Publication Date
CN112151019A true CN112151019A (en) 2020-12-29

Family

ID=73869949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910561669.0A Pending CN112151019A (en) 2019-06-26 2019-06-26 Text processing method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN112151019A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080300855A1 (en) * 2007-05-31 2008-12-04 Alibaig Mohammad Munwar Method for realtime spoken natural language translation and apparatus therefor
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN105786793A (en) * 2015-12-23 2016-07-20 百度在线网络技术(北京)有限公司 Method and device for analyzing semanteme of spoken language text information
CN106354716A (en) * 2015-07-17 2017-01-25 华为技术有限公司 Method and device for converting text
CN108304375A (en) * 2017-11-13 2018-07-20 广州腾讯科技有限公司 A kind of information identifying method and its equipment, storage medium, terminal
CN108845979A (en) * 2018-05-25 2018-11-20 科大讯飞股份有限公司 A kind of speech transcription method, apparatus, equipment and readable storage medium storing program for executing
CN109108989A (en) * 2018-07-20 2019-01-01 吴怡 A kind of legal services special purpose robot of semantics recognition
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314108A (en) * 2021-06-16 2021-08-27 深圳前海微众银行股份有限公司 Voice data processing method, device, equipment, storage medium and program product
CN113314108B (en) * 2021-06-16 2024-02-13 深圳前海微众银行股份有限公司 Method, apparatus, device, storage medium and program product for processing voice data
CN113408274A (en) * 2021-07-13 2021-09-17 北京百度网讯科技有限公司 Method for training language model and label setting method
CN113408274B (en) * 2021-07-13 2022-06-24 北京百度网讯科技有限公司 Method for training language model and label setting method
CN114169294A (en) * 2021-11-30 2022-03-11 中国电子科技集团公司第十五研究所 Office document automatic generation method and system based on countermeasure network

Similar Documents

Publication Publication Date Title
CN108536654B (en) Method and device for displaying identification text
US10176804B2 (en) Analyzing textual data
CN109036464B (en) Pronunciation error detection method, apparatus, device and storage medium
CN110674629B (en) Punctuation mark labeling model, training method, training equipment and storage medium thereof
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN106534548B (en) Voice error correction method and device
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN107679032A (en) Voice changes error correction method and device
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
CN112151019A (en) Text processing method and device and computing equipment
CN111881297A (en) Method and device for correcting voice recognition text
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN111883137A (en) Text processing method and device based on voice recognition
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN111401012B (en) Text error correction method, electronic device and computer readable storage medium
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
CN111160026B (en) Model training method and device, and text processing method and device
US11990131B2 (en) Method for processing a video file comprising audio content and visual content comprising text content
CN113076720A (en) Long text segmentation method and device, storage medium and electronic device
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN109344388A (en) A kind of comment spam recognition methods, device and computer readable storage medium
CN111090720B (en) Hot word adding method and device
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN114297372A (en) Personalized note generation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination