CN113536743A - Text processing method and related device - Google Patents

Info

Publication number
CN113536743A
Authority
CN
China
Prior art keywords
text
modification
pair
content
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110200840.2A
Other languages
Chinese (zh)
Inventor
方俊
林炳怀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110200840.2A
Publication of CN113536743A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application disclose a text processing method and a related device, which relate at least to natural language processing and machine learning in artificial intelligence and to data parallel computing in cloud computing technology. Each content modification corresponds to one modification pair. To determine the language expression error type and the error reason corresponding to a content modification, the first text and the second text are introduced alongside the information provided by the modification pair, which supplies the complete context of the content before and after the modification. Combining the modification pair with the first text and the second text thus provides a relatively complete information basis for determining the language expression error type and the error reason of the modification pair, so that the language expression error type is identified accurately and a specific error reason is given.

Description

Text processing method and related device
This application is a divisional application of Chinese patent application No. 202011231200.X, filed on November 6, 2020 and entitled "a text processing method and related device".
Technical Field
The present application relates to the field of data processing, and in particular, to a text processing method and a related apparatus.
Background
Language is a means of expression by which humans communicate. It has its own grammar and vocabulary and is a system of sound and meaning in which the vocabulary is organized according to a grammar. Generally, each nation has its own language, such as Chinese, English, or German.
A user can express themselves in written text using a language. However, whether the user writes in a native language or a newly learned one, language expression errors may occur, such as improper grammar or non-standard expressions. Errors in text provided by a user can be identified by language error recognition techniques, which may be used, for example, in the education industry to assist teachers in correcting students' English compositions.
In the related art, language rules are mainly learned from a large corpus so that errors in a text can be identified and modified. As a result, the user only learns that the text contains an error and can hardly learn the reason for it.
Disclosure of Invention
In order to solve the technical problem, the application provides a text processing method and a related device, which realize accurate identification of text expression error types and specific error causes.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a text processing method, where the method includes:
acquiring a first text to be identified;
performing text processing on the first text to obtain a second text;
determining at least one modification pair according to the first text and the second text, wherein one modification pair corresponds to one content modification in the text processing, and the modification pair comprises the content corresponding to the content modification in the first text and the content corresponding to the content modification in the second text;
and determining the language expression error type and the error reason corresponding to the content modification according to the first text, the second text and the modification pair.
On the other hand, an embodiment of the present application provides a text processing apparatus, which includes an obtaining unit and a determining unit:
the obtaining unit is used for acquiring a first text to be identified;
the determining unit is used for performing text processing on the first text to obtain a second text;
the determining unit is further configured to determine at least one modification pair according to the first text and the second text, where one modification pair corresponds to one content modification in the text processing, and the modification pair includes a content in the first text corresponding to the content modification and a content in the second text corresponding to the content modification;
the determining unit is further configured to determine a type of linguistic expression error and a cause of the error corresponding to the content modification according to the first text, the second text, and the modification pair.
In another aspect, an embodiment of the present application provides an apparatus for text processing, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above aspect.
According to the above technical solution, text processing is performed on the first text to be recognized, and at least one part of the content of the first text is modified so that the first text becomes the second text. Each content modification corresponds to one modification pair, and the modification pair comprises the content corresponding to the content modification in the first text and the content corresponding to the content modification in the second text. To determine the language expression error type and the error reason corresponding to a content modification, the first text and the second text are introduced alongside the information provided by the modification pair, which supplies the complete context of the content before and after the modification. Combining the modification pair with the first text and the second text thus provides a relatively complete information basis for determining the language expression error type and the error reason of the modification pair, so that the language expression error type is identified accurately and a specific error reason is given.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a text processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another text processing method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of an alignment algorithm provided in an embodiment of the present application;
Fig. 5 is a schematic illustration showing a feedback error type provided by an embodiment of the present application;
Fig. 6 is a schematic view of an application scenario of another text processing method according to an embodiment of the present application;
Fig. 7 is a schematic view of an application scenario of another text processing method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In view of the fact that the error cause cannot be given in a language rule learning mode based on the corpus in the related art, the embodiment of the application provides a text processing method and a related device, so that the identification of the error cause of the text is realized, and the identification precision of the language expression error type is improved.
The text processing method provided by the embodiment of the application is realized based on Artificial Intelligence (AI), which is a theory, method, technology and application system for simulating, extending and expanding human Intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, and the like.
In the embodiment of the present application, the artificial intelligence software technologies mainly involved include natural language processing and machine learning/deep learning. For example, the method may involve Text preprocessing and Semantic understanding in Natural Language Processing (NLP), and Deep Learning in Machine Learning (ML), including various types of Artificial Neural Networks (ANN).
The text processing method provided by the application can be applied to text processing equipment with data processing capacity, such as terminal equipment and servers. The terminal device may be specifically a smart phone, a desktop computer, a notebook computer, a tablet computer, a smart sound box, a smart watch, and the like, but is not limited thereto; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The text processing device may be capable of Natural Language Processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies theories and methods that enable effective communication between humans and computers in natural language. It is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, and is therefore closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. In the embodiment of the present application, the text processing device may process the text by using a text preprocessing technique, a semantic understanding technique, or the like in natural language processing.
The text processing device may have machine learning capabilities. Machine learning is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize its existing knowledge structure so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks.
The artificial intelligence model adopted in the text processing method provided by the embodiment of the application mainly relates to the application of a neural network, and the text is modified and analyzed through the neural network.
In addition, the text processing device provided by the embodiment of the application also has cloud computing capability. In the narrow sense, cloud computing refers to a delivery and use mode of IT infrastructure in which required resources are obtained over a network in an on-demand, easily scalable manner. In the broad sense, cloud computing refers to a delivery and use mode of services in which required services are obtained over a network in an on-demand, easily scalable manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computer and network technologies such as Grid Computing, Distributed Computing, Parallel Computing, Utility Computing, Network Storage Technologies, Virtualization, and Load Balancing.
With the development of diversification of internet, real-time data stream and connecting equipment and the promotion of demands of search service, social network, mobile commerce, open collaboration and the like, cloud computing is rapidly developed. Different from the prior parallel distributed computing, the generation of cloud computing can promote the revolutionary change of the whole internet mode and the enterprise management mode in concept.
In the embodiment of the application, the text processing device may process the text to be recognized by using a cloud computing technology, so as to determine the text error type and the specific error cause thereof according to the processed information.
In order to facilitate understanding of the technical solution of the present application, the following describes a text processing method provided in the embodiment of the present application with a terminal device as a text processing device in combination with an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a text processing method provided in an embodiment of the present application. In the application scenario shown in fig. 1, a terminal device 101 is included, and is configured to identify and modify a text error, and parse a language expression error type and an error reason.
In practical applications, the user may input the first text S to be recognized in a text error modification interface provided by the terminal device 101. For example, in the scenario shown in fig. 1, the user enters a first text S composed of A B C, where A, B, and C each represent a word of the text. For example, the first text S may be: believe in you, where A stands for believe, B stands for in, and C stands for you.
Then, the first text S is subjected to text processing, and at least one content of the first text S is modified to modify the first text S into a second text T. The first text S may be a text of any language, such as english, french, russian, etc. Any content modification corresponds to a modification pair, and the modification pair comprises the content corresponding to the content modification in the first text S and the content corresponding to the content modification in the second text T.
In the scenario shown in fig. 1, if the first text S is English, at least one erroneous content in the first text S is modified according to English expression rules to obtain the second text T, which is: believe in yourself. The second text T is composed of A B D, where A represents believe, B represents in, and D represents yourself. C in the first text S is thus modified to D in the second text T, so a modification pair E can be determined, which is composed of (C, D), i.e., (you, yourself).
It will be appreciated that the first text S has the complete context information of the text before modification and the second text T has the complete context information of the text after modification, which can be used as a basis for analyzing the content modification. Therefore, the language expression error type and the error reason corresponding to the content modification can be determined by combining the first text S and the second text T on the basis of the modification pair.
In the scenario shown in fig. 1, the language expression error type and the error reason for modifying C in the first text S to D in the second text T are determined according to the first text S, the second text T, and the modification pair E. Here, the language expression error type of modifying C (you) to D (yourself) is a grammar error, and the error reason is incorrect pronoun use.
Based on the above, on the basis of the modification pair, by combining the first text and the second text, a relatively complete information basis is provided for determining the language expression error type and the error reason of the modification pair, and accurate identification of the text expression error type and specific error cause are realized.
A text processing method provided in the embodiment of the present application is described below with reference to the drawings and using a terminal device as a text processing device.
Referring to fig. 2, fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application. As shown in fig. 2, the text processing method includes the steps of:
S201: Obtain a first text to be recognized.
In practical applications, a user can input the first text S to be recognized in a text processing interface provided by the terminal device. The terminal device receives the first text S through a pre-deployed sequence-to-sequence grammar error correction model (denoted Seq-decoder) and performs the following steps.
The first text S is a text composed of a plurality of words and having a specific meaning. Its forms include, but are not limited to, sentences, paragraphs, and articles. Furthermore, the first text S may be in any language, such as Chinese, English, or Japanese, without limitation.
S202: Perform text processing on the first text to obtain a second text.
As shown in fig. 3, the first text S (302) is input (301). After receiving the first text S to be recognized, the sequence-to-sequence grammar error correction model (303) modifies the places in the first text S where language expression errors exist to obtain the second text T (304). The second text T is the output of the sequence-to-sequence grammar error correction model and serves as an input of the sequence-to-sequence alignment model (denoted Seq-align) (305) pre-deployed in the terminal device.
A Sequence-to-Sequence (Seq2Seq) model is a deep learning model that converts one sequence into another as required. The sequence-to-sequence grammar error correction model is a sequence-to-sequence model for correcting language expression errors. It is obtained by deep learning and modifies the first text S without changing its meaning, so that the first text S becomes a second text T that is more standard and reasonable in grammar or expression habits.
In practical applications, a first text S in any of various languages can be used as the input of the sequence-to-sequence grammar error correction model, and the places in the first text S where language expression errors exist are modified according to the language of the first text S and its expression norms to obtain the second text T. The language of the first text S may be English, Russian, French, etc.
It should be noted that the first text can be modified using the sequence-to-sequence grammar error correction model, and the same function can also be implemented with other types of models, which is not limited herein.
S203: Determine at least one modification pair from the first text and the second text.
As shown in fig. 3, after the first text S is modified to obtain the second text T, the first text S and the second text T are used as inputs of the sequence-to-sequence alignment model, and the contents of the first text S and the second text T are compared to determine at least one modification pair E (306).
A modification pair E corresponds to a content modification in the text processing, the modification pair comprising the content of the first text S corresponding to the content modification and the content of the second text T corresponding to the content modification. It should be noted that the content modification is determined based on a linguistic expression error existing in the first text S, and includes, but is not limited to, a word, a phrase composed of a plurality of words, or a sentence.
For example, the first text S is: This is less dependent sweep in store, and the corresponding second text T is: This is the least dependent sweep in the store. Comparing the first text S and the second text T shows that the comparative degree in the first text S is used incorrectly and the superlative should be used instead, i.e., "less" is modified to "the least". A modification pair E = (less, the least) can therefore be determined.
The sequence-to-sequence alignment model aligns the first text S before modification with the second text T after modification to obtain the modification pair E. In general, the change modes of the content modification corresponding to a modification pair E include Replace (Re), Insert (In), Delete (De), and Equal (Eq). Accordingly, since "less" in the first text S in the above example is replaced by "the least", the modification pair may be denoted as E = (less, the least, Re). In addition, since "the" is inserted into "in store" of the first text S, the modification pair E = (in store, in the store, In) also exists.
In a possible implementation manner, the first text S and the second text T may be compared in content at a first comparison granularity to obtain a corresponding first comparison sequence. Content comparison means matching the parts of the first text S and the second text T whose expressed contents have the same or similar meaning. The first comparison granularity is the smallest unit used for content comparison, such as a sentence, a phrase, or a single word. The first comparison sequence comprises first content pairs, and each first content pair identifies a pair of corresponding text character strings in the first text and the second text together with the change mode of that pair.
For the above example, if the phrase is used as the first comparison granularity and the content comparison is performed on the first text S and the second text T, a plurality of pairs of text character strings can be obtained, for example: E = (in store, in the store, In).
In a possible implementation manner, the first text and the second text may be subjected to content comparison by an alignment algorithm at a first comparison granularity, so as to obtain a corresponding first comparison sequence.
The alignment algorithm may be the Levenshtein algorithm, which measures the degree of difference between two sequences and provides a series of changes that convert one sequence into the other. In practical applications, other algorithms may also be used, which is not limited herein.
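As a minimal sketch of how such an alignment could be computed, the Python function below runs a word-level Levenshtein dynamic program and backtraces it into (source, target, change mode) triples. The function name `align` and the tuple layout are assumptions made for illustration and are not taken from the application; the sentence pair is the "believe in you" example from the fig. 1 scenario.

```python
def align(src, tgt):
    """Word-level Levenshtein alignment (illustrative sketch).

    Returns a list of (src_word, tgt_word, mode) triples, where mode is
    "Eq", "Re", "De" or "In" and an empty string marks a missing side.
    """
    n, m = len(src), len(tgt)
    # dist[i][j] = edit distance between src[:i] and tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or replace
                dist[i - 1][j] + 1,         # delete from src
                dist[i][j - 1] + 1,         # insert into src
            )

    # Backtrace from the bottom-right corner to recover the changes.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append((src[i - 1], tgt[j - 1], "Eq" if same else "Re"))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((src[i - 1], "", "De"))
            i -= 1
        else:
            ops.append(("", tgt[j - 1], "In"))
            j -= 1
    ops.reverse()
    return ops


first_text = "believe in you".split()
second_text = "believe in yourself".split()
# Keep only the triples that actually change something.
pairs = [op for op in align(first_text, second_text) if op[2] != "Eq"]
# pairs == [("you", "yourself", "Re")]
```

Working at word level rather than character level is a design choice of this sketch, so that each triple lines up with a candidate modification pair.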
It should be noted that a first content pair includes at least one pair of text strings, and all pairs of text strings in a first content pair have the same change mode. In some cases, two pairs of text strings contain different errors. If two such pairs are adjacent and have the same change mode, they are determined as a single first content pair when the first content pair is determined by content comparison. Multiple errors in the same text are then coupled together, which reduces the accuracy of the language expression error type and error reason subsequently determined for the content modification.
For example, the first text S is: nowadays, more and more the middle-shaped peer ear depletion for insomenia, and the second text T is: nowadays, more and more middle-shaped peer ear depletion from insomnia. Comparing the first text S and the second text T shows that "the" is deleted, "for" is replaced with "from", and "insomenia" is replaced with "insomnia". When the first content pairs are determined by content comparison, "for" and "insomenia" are adjacent in the first text S and both are modified by replacement, so the first comparison sequence may include E = (the, De) and E = (for insomenia, from insomnia, Re). Although the two pairs of text strings in E = (for insomenia, from insomnia, Re) have the same change mode, "for" is replaced with "from" because of a collocation error with "buffer", whereas "insomenia" is replaced with "insomnia" because of a spelling error. It is therefore inappropriate to analyze the language expression error type and error reason of the two pairs of text strings as a whole.
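The coupling described above can be reproduced with a small sketch that groups adjacent word-level changes sharing the same change mode into one content pair. The helper name `group_runs`, the tuple layout, and the `"..."` token standing in for the unchanged words between the two edits are all assumptions made for illustration.

```python
from itertools import groupby


def group_runs(ops):
    """Merge adjacent (src, tgt, mode) triples that share the same mode."""
    pairs = []
    for mode, run in groupby(ops, key=lambda op: op[2]):
        run = list(run)
        src = " ".join(s for s, _, _ in run if s)
        tgt = " ".join(t for _, t, _ in run if t)
        pairs.append((src, tgt, mode))
    return pairs


# Hand-written word-level changes for the tail of the example sentence;
# "..." is a placeholder for the unchanged words in between.
word_ops = [
    ("more", "more", "Eq"),
    ("the", "", "De"),
    ("...", "...", "Eq"),
    ("for", "from", "Re"),
    ("insomenia", "insomnia", "Re"),
]
first_sequence = group_runs(word_ops)
# The two adjacent replacements end up in a single content pair:
# ("for insomenia", "from insomnia", "Re")
```

Because grouping looks only at the change mode, the collocation error and the spelling error are fused into one pair, which is the situation the subsequent splitting step is meant to undo.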
In view of this, the first content pairs in the first comparison sequence can be split at a second comparison granularity to obtain a second comparison sequence. The second comparison sequence comprises second content pairs, and each second content pair identifies a pair of corresponding text character strings in the first text and the second text together with the change mode of that pair. The second comparison granularity is the smallest unit used for splitting a first content pair and is smaller than the first comparison granularity. For example, if the first comparison granularity is a sentence, the second comparison granularity can be a phrase or a single word; if the first comparison granularity is a phrase, the second comparison granularity may be a single word. The change modes include any of replacement, insertion, deletion, and the like. At least one modification pair can then be determined from the second comparison sequence.
For the content pair (for insomenia, from insomnia, Re) in the above example, which is modified by replacement, splitting with a single word as the second comparison granularity yields a second comparison sequence such as: E = (for, from, Re), E = (insomenia, insomnia, Re).
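A minimal sketch of this splitting step follows. It assumes, purely for illustration, that a replaced content pair with the same number of words on both sides can be split position by position; the function name `split_pair` is hypothetical, and pairs that do not meet the assumption are returned unchanged.

```python
def split_pair(src, tgt, mode):
    """Split a replaced content pair into word-level content pairs.

    Assumption of this sketch: a positional split is only attempted when
    both sides contain the same number of words.
    """
    src_words, tgt_words = src.split(), tgt.split()
    if mode == "Re" and len(src_words) == len(tgt_words):
        return [
            (s, t, "Eq" if s == t else "Re")
            for s, t in zip(src_words, tgt_words)
        ]
    return [(src, tgt, mode)]


second_sequence = split_pair("for insomenia", "from insomnia", "Re")
# second_sequence == [("for", "from", "Re"), ("insomenia", "insomnia", "Re")]
```

After the split, each word-level pair can be analyzed separately for its own error type and reason.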
On the basis of the grammar error correction capability of the sequence-to-sequence model, the alignment algorithm is used to determine the content pairs corresponding to the model's modifications, and these pairs are further split. This reduces the coupling of multiple errors in the first comparison sequence, lays a foundation for subsequently feeding back accurate language expression error types and error reasons, and improves the accuracy with which the language expression error type and error reason corresponding to a content modification are identified.
It can be understood that the second content pair is determined based on the change mode at the second comparison granularity, which is not the same thing as the language expression error that the present application aims to find in the first text. Specifically, a change mode is a way of modifying text; its focus is on the modifying action itself, such as replacing, deleting, or inserting. Language expression refers to the expression rules of the language to which the text belongs, including grammatical structure, the meaning expressed by the text, and the like. The two are therefore not equivalent. If the language expression error type and the error reason corresponding to the content modification were determined only from the change mode of the second content pair, the modification pair could not be accurately analyzed from the perspective of language expression, which would reduce the accuracy of the language expression error type and error reason determined for the content modification.
Therefore, the present application provides a possible implementation: according to the change modes of the second content pairs in the second alignment sequence, the second content pairs whose change mode identifies a modification are determined as target content pairs, and adjacent target content pairs in the second alignment sequence that conform to a preset rule are then merged to obtain a third alignment sequence.
The change mode of a second content pair in the second alignment sequence is any one of substitution (Re), insertion (In), deletion (De), or equality (Eq). The third alignment sequence comprises the modification pairs obtained by merging adjacent target content pairs, and the change mode of a merged modification pair is determined by the way it was merged. The preset rule is the rule that the change modes of the pairs to be merged must satisfy.
For example, the first text S is: In no case you should give up, and the second text T is: In no case should you give up. First, at the first alignment granularity (that is, with the sentence as the granularity), a first content pair is determined: (you should, should you, Re). Then, at the second alignment granularity (that is, with a single word as the granularity), the second content pairs (you, should, Re) and (should, you, Re) are determined. Since the two second content pairs conform to the rule of two words exchanging positions, they can be merged to obtain the modification pair E = (you should, should you).
The preset rule associates the change modes of the second content pairs with the language expression error types of content modifications, and adjacent content pairs are integrated. This adds change modes of more dimensions, improves the accuracy of determining the language expression error type and the error reason corresponding to a content modification, optimizes the alignment algorithm, and achieves a denoising effect.
For the modification pair obtained by the above merging, the change mode of a modification pair in the third alignment sequence includes: swap (Switch, Sw). Thus, the modification pair in the above example can be denoted as E = (you should, should you, Sw).
Therefore, merging the content pairs in the second alignment sequence based on the preset rule optimizes the alignment algorithm and further improves the accuracy with which the language expression error type and the error reason are subsequently determined for each modification.
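As a minimal sketch of one such merging rule, the following Python function merges two adjacent substitution pairs whose words cross into a single swap pair. The (source, target, mode) triple layout and the name `merge_crossed_replacements` are assumptions made for illustration; the application does not specify how the preset rule is implemented.

```python
def merge_crossed_replacements(pairs):
    """Merge adjacent (x, y, Re), (y, x, Re) pairs into one (x y, y x, Sw) pair.

    `pairs` is a list of (source, target, mode) triples; this layout is an
    illustrative assumption, not a format defined by the application.
    """
    merged = []
    i = 0
    while i < len(pairs):
        if i + 1 < len(pairs):
            s1, t1, m1 = pairs[i]
            s2, t2, m2 = pairs[i + 1]
            # Two substitutions whose words trade places form a swap.
            if m1 == "Re" and m2 == "Re" and s1 == t2 and t1 == s2:
                merged.append((f"{s1} {s2}", f"{t1} {t2}", "Sw"))
                i += 2
                continue
        merged.append(pairs[i])
        i += 1
    return merged


second_sequence = [("you", "should", "Re"), ("should", "you", "Re")]
print(merge_crossed_replacements(second_sequence))
# [('you should', 'should you', 'Sw')]
```

Pairs that do not match the rule pass through unchanged, so the function can be applied to a whole second alignment sequence.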
In practical application, the sequence-to-sequence alignment model can output the modification pairs whose change mode is not equality (Eq). These serve as the input of an error detail processing model (denoted as Err-identity) (307) deployed in advance in the terminal device, which analyzes the language expression error type and the error reason of each modification pair.
S204: Determine the language expression error type and the error reason corresponding to the content modification according to the first text, the second text, and the modification pair.
As shown in fig. 3, the error detail processing model takes the first text S, the second text T, and the modification pair E as input, analyzes and feeds back the language expression error type and the error reason corresponding to the content modification (308), and outputs them (309). The error detail processing model is used for judging the error type of each modification pair output by the sequence-to-sequence alignment model, analyzing the judgment result, and returning an analysis result, which comprises the language expression error type and the error reason.
It can be understood that language consists of sentences with specific meanings composed of different words, and both the meaning of a single word and its position in the sentence play an important role. Therefore, before the language expression error type and the error reason corresponding to the content modification are determined, part-of-speech tagging can be performed on the segmented words in the first text and the second text, and the language expression error type and the error reason corresponding to the content modification can then be determined in combination with the tagged parts of speech.
The part of speech refers to category attribution of a single word in a certain part of speech system. The part of speech of a word is determined by both a certain part of speech system and the grammatical properties of the word itself. In the present embodiment, the part-of-speech includes, but is not limited to, the meaning of the word, the category to which the word belongs (verb, noun, adjective, etc.), and the grammatical structure of the word (subject, predicate, object, etc.).
For example, for the first text S: I like fish, the part-of-speech tag corresponding to "fish" may include: the meaning of "fish" is the animal fish, the category of "fish" in the first text S is noun, and the grammatical role of "fish" in the first text S is object.
Therefore, when the error detail processing model determines the language expression error type and the error reason corresponding to the content modification, part-of-speech information is introduced in addition to the first text and the second text. This further completes the information basis for the determination, so that the accuracy of the language expression error type and the error reason corresponding to the content modification is improved without constructing a large number of rules.
Based on the above, in a possible implementation, the language expression error type includes optimized expression or at least one type of grammar error. In other words, the output of the error detail processing model comprises a modification level (Grade), a modification type (Type), and an error reason (Reason). The modification level is either optimized expression or grammar error: optimized expression means that the first text S has no grammar error but can be further improved in terms of language expression, whereas a grammar error means that the first text S does not comply with the specification of the language to which it belongs. The modification type refers to the different types of grammar errors, and the error reason is an analysis, for the specific case, of why the modification was made.
In practical application, whether the language expression error type corresponding to the content modification is optimized expression or different types of grammar errors can be distinguished by using the first text S, the second text T, the modification pair E and the part-of-speech tagging based on a preset rule.
For example, the first text S is: Your dog runs faster than Jim's. Performing text processing on the first text S by using the sequence-to-sequence grammar error correction model yields the second text T: Your dog runs faster than Jim's dog. Based on the preset rule, the first text S, the second text T, the modification pair E = (, dog, In), and the part-of-speech tags can be used to determine that the second text T better conforms to the conventions of written English; therefore, the language expression error type of the modification pair E = (, dog, In) is determined to be optimized expression.
When the error detail processing model determines the language expression error type and the error reason corresponding to the content modification, the data processing flow is as follows:
1. For an input modification pair E and its first text S and second text T, determine what kind of modification the modification pair E is, that is, determine the modification type (Type) of the modification pair. Possible methods include grammar analysis and the like, where grammar analysis means analyzing the grammatical structure of the text according to the dependency relationships among its words.
2. Given the first text S and the second text T, determine whether the modification level (Grade) of the modification pair E is optimized expression or grammar error.
3. Generate the corresponding error reason (Reason) according to the first text S, the second text T, the modification level (Grade), and the modification type (Type).
At this point, after the first text S is input, the modified second text T and the modification level, modification type, and error reason (Grade, Type, Reason) corresponding to each modification pair E can be obtained, as shown in fig. 3.
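The three steps above can be pictured with a small Python sketch. Everything in it is a hypothetical stand-in: the pronoun lexicon replaces real part-of-speech tags, the rule table and the reason strings are invented for illustration, and the names `Feedback`, `modification_type`, and `analyse` do not come from the application. The toy rules look only at the modification pair, whereas the described model also uses S, T, and grammar analysis.

```python
from dataclasses import dataclass

# Tiny hypothetical pronoun lexicon standing in for real part-of-speech tags.
PRONOUNS = {"i", "me", "he", "him", "she", "her", "we", "us", "they", "them"}

# Hypothetical rule table: modification type -> (modification level, error reason).
RULES = {
    "word order": ("grammar error", "The word order may be wrong."),
    "pronoun use": ("grammar error", "The pronoun may be used incorrectly."),
}


@dataclass
class Feedback:
    grade: str      # modification level (Grade)
    mod_type: str   # modification type (Type)
    reason: str     # error reason (Reason)


def modification_type(pair):
    """Step 1: decide the modification type from a (source, target, mode) triple."""
    source, target, mode = pair
    if mode == "Sw":
        return "word order"
    if mode == "Re" and source.lower() in PRONOUNS and target.lower() in PRONOUNS:
        return "pronoun use"
    return "unclassified"


def analyse(source_text, target_text, pair):
    """Steps 2 and 3: look up the level and the reason for the detected type.

    source_text and target_text mirror the (S, T, E) interface of the model,
    but these toy rules do not inspect them.
    """
    mod_type = modification_type(pair)
    grade, reason = RULES.get(mod_type, ("undetermined", "No rule matched."))
    return Feedback(grade, mod_type, reason)


print(analyse("", "", ("he", "him", "Re")))
```

A table-driven layout like this keeps the type decision separate from the level and reason lookup, which matches the order of the three steps.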
In processing the language expression of a text, the alignment algorithm is used and the content pairs are split and merged so that each content modification is independent. Combined with the part-of-speech information of the content pairs involved in each content modification, each content modification is classified as a grammar error or an optimized expression, and the error reason corresponding to the content modification is analyzed and fed back.
After the language expression error type and the error reason corresponding to the content modification are determined, the content modification can be displayed on the basis of the first text S and the second text T, together with the corresponding determination result. The determination result comprises the language expression error type and the error reason corresponding to the content modification. In this way, the user can directly view the text processing result, which improves the user experience.
In the text processing method provided in the foregoing embodiment, text processing is performed on a first text to be recognized, and at least one content modification is made to the first text so as to modify it into a second text. Each content modification corresponds to one modification pair, which comprises the content in the first text corresponding to the content modification and the content in the second text corresponding to the content modification. To determine the language expression error type and the error reason corresponding to a content modification, the first text and the second text are introduced in addition to the information provided by the modification pair, which completes the context of the content modification before and after the modification. Combining the modification pair with the first text and the second text thus provides a relatively complete information basis for determining the language expression error type and the error reason of the modification pair, achieving accurate identification of the language expression error type and a specific error reason.
In order to better understand the text processing method in the embodiment of the present application, the following specifically describes the process of determining the modification pair for the first text S and the second text T by using the sequence-to-sequence alignment model in combination with fig. 4.
As shown in fig. 4, the sequence-to-sequence grammar error correction model obtains a first text S to be recognized, whose content is: A C B D Eed F, and performs text processing on the first text S according to its language to obtain a second text T: A B C D E G. The first text S and the second text T are used as the input of the sequence-to-sequence alignment model (401).
In the process of determining the modification pairs E from the first text S and the second text T in the sequence-to-sequence alignment model, a Levenshtein module (402) first uses the Levenshtein algorithm, with the sentence as the first alignment granularity, to compare the content of the first text S and the second text T and obtain the corresponding first alignment sequence. The first alignment sequence includes 6 first content pairs: (A, A, Eq), (, B, In), (C, C, Eq), (B, , De), (D, D, Eq), and (Eed F, E G, Re).
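The shape of this first alignment sequence can be reproduced with a short Python sketch. Note the substitution: it uses the standard library's `difflib.SequenceMatcher` as a stand-in aligner rather than a Levenshtein implementation, so its output can differ from a true Levenshtein alignment on other inputs. The function name `first_alignment` and the triple layout are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags onto the change-mode labels used in the text.
TAG_TO_MODE = {"equal": "Eq", "insert": "In", "delete": "De", "replace": "Re"}


def first_alignment(source, target):
    """Align two sentences token by token and return (source, target, mode) triples.

    difflib is used here as a stand-in for the Levenshtein module (402).
    """
    src_tokens, tgt_tokens = source.split(), target.split()
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        pairs.append((" ".join(src_tokens[i1:i2]),
                      " ".join(tgt_tokens[j1:j2]),
                      TAG_TO_MODE[tag]))
    return pairs


for pair in first_alignment("A C B D Eed F", "A B C D E G"):
    print(pair)
# ('A', 'A', 'Eq')
# ('', 'B', 'In')
# ('C', 'C', 'Eq')
# ('B', '', 'De')
# ('D', 'D', 'Eq')
# ('Eed F', 'E G', 'Re')
```

The last pair still couples two word-level changes, which is what the later splitting step is meant to separate.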
Then, a splitting module (403) splits the 6 first content pairs with a single word as the second alignment granularity to obtain a second alignment sequence. The second alignment sequence includes 7 second content pairs: (A, A, Eq), (, B, In), (C, C, Eq), (B, , De), (D, D, Eq), (Eed, E, Re), and (F, G, Re).
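A minimal Python sketch of this splitting step follows. It assumes that a substitution pair is split by pairing words position by position, with leftover words becoming insertions or deletions; the application does not state how pairs of unequal length are handled, so that part, like the names `split_pair` and `second_alignment`, is an assumption.

```python
from itertools import zip_longest


def split_pair(pair):
    """Split one (source, target, mode) triple into word-level triples.

    Only substitution (Re) pairs are split; words are paired by position,
    which is an illustrative assumption.
    """
    source, target, mode = pair
    if mode != "Re":
        return [pair]
    result = []
    for s, t in zip_longest(source.split(), target.split(), fillvalue=""):
        if s and t:
            word_mode = "Eq" if s == t else "Re"
        elif t:
            word_mode = "In"   # extra word on the target side
        else:
            word_mode = "De"   # extra word on the source side
        result.append((s, t, word_mode))
    return result


def second_alignment(first_sequence):
    """Apply split_pair to every pair of the first alignment sequence."""
    return [word_pair for pair in first_sequence for word_pair in split_pair(pair)]


print(split_pair(("Eed F", "E G", "Re")))
# [('Eed', 'E', 'Re'), ('F', 'G', 'Re')]
```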
Then, a merging module (404) takes each second content pair in the second alignment sequence in turn as the target content pair and judges whether the target content pair and its adjacent content pairs conform to the preset rule. For the second alignment sequence comprising 7 second content pairs, when the target content pair is (, B, In), it conforms, together with the adjacent content pair (C, C, Eq) and the following content pair (B, , De), to the rule that the change mode is a swap, so these 3 content pairs are merged and recorded as (CB, BC, Sw). On this basis, a third alignment sequence is obtained, which includes 5 modification pairs: (A, A, Eq), (CB, BC, Sw), (D, D, Eq), (Eed, E, Re), and (F, G, Re).
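The insertion, equality, deletion pattern in this example can be sketched in Python as follows. The rule encoded here (an inserted word that equals a deleted word on the other side of one unchanged pair) is inferred from the example; the function name `merge_swaps` and the space-joined form "C B" instead of "CB" are illustrative choices, not taken from the application.

```python
def merge_swaps(pairs):
    """Merge In/Eq/De (or De/Eq/In) runs that move one word into a single Sw pair.

    `pairs` is a list of (source, target, mode) triples.
    """
    merged = []
    i = 0
    while i < len(pairs):
        window = pairs[i:i + 3]
        if len(window) == 3 and window[1][2] == "Eq":
            (s1, t1, m1), (s2, t2, _), (s3, t3, m3) = window
            # Word inserted before the unchanged pair and deleted after it.
            if m1 == "In" and m3 == "De" and t1 == s3:
                merged.append((f"{s2} {s3}", f"{t1} {t2}", "Sw"))
                i += 3
                continue
            # Word deleted before the unchanged pair and inserted after it.
            if m1 == "De" and m3 == "In" and s1 == t3:
                merged.append((f"{s1} {s2}", f"{t2} {t3}", "Sw"))
                i += 3
                continue
        merged.append(pairs[i])
        i += 1
    return merged


second = [("A", "A", "Eq"), ("", "B", "In"), ("C", "C", "Eq"), ("B", "", "De"),
          ("D", "D", "Eq"), ("Eed", "E", "Re"), ("F", "G", "Re")]
third = merge_swaps(second)
print([pair for pair in third if pair[2] != "Eq"])
# [('C B', 'B C', 'Sw'), ('Eed', 'E', 'Re'), ('F', 'G', 'Re')]
```

Filtering out the Eq pairs at the end leaves only the pairs that represent actual modifications.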
Thus, an output module (405) outputs the modification pairs of the third alignment sequence whose change mode is not equality (Eq); that is, the output of the sequence-to-sequence alignment model comprises 3 modification pairs: (CB, BC, Sw), (Eed, E, Re), and (F, G, Re).
The 3 output modification pairs are taken as the input of the error detail processing model, and the language expression error type and the error reason corresponding to each modification pair are analyzed one by one.
On the basis of the output of the sequence-to-sequence grammar error correction model, the sequence-to-sequence alignment model provided in the above embodiment uses an alignment algorithm to compare the content of the first text (before modification) with that of the second text, and further splits and merges the content pairs obtained by the comparison. This lays a foundation for the error detail processing model to subsequently identify the language expression error type and the error reason corresponding to each content modification, and improves the accuracy of determining the text error type.
The text processing method provided by the above embodiment is described below with reference to specific examples.
In the scenario shown in fig. 5, the user inputs the first text S to be recognized. In the sequence-to-sequence grammar error correction model stage, the first text S is taken as input and modified according to English expression rules, and the second text T is output: His sister is five years younger than him.
Then, using the sequence-to-sequence alignment model, the first text S and the second text T are taken as the input of the model and compared according to the process shown in fig. 4, and a third alignment sequence including 2 modification pairs is output: (younger five years, five years younger, Sw) and (he, him, Re).
Then, the error detail processing model determines the error type of the 2 modification pairs. Specifically, the first text S, the second text T, the 2 modification pairs, and the part-of-speech tags are used as input, and the language expression error type and the error reason corresponding to each of the 2 modification pairs are determined. The language expression error type corresponding to the modification pair (younger five years, five years younger, Sw) is a word order error, and the error reason is: the word order may be wrong; please check whether the sentence has a word order problem caused by inversion, a question form, or a difference in expression habits. The language expression error type corresponding to the modification pair (he, him, Re) is a pronoun use error, and the error reason is: the pronoun may be used incorrectly; please choose the appropriate pronoun according to the context. Here it is suggested to change he to him. The content modifications at these 2 positions and the corresponding determination results can then be fed back and displayed to the user, as shown in fig. 5.
The text processing method provided by the above embodiment modifies the first text by using the sequence-to-sequence grammar error correction model and, compared with a conventional rule-based grammar error correction model, corrects sentences containing language expression errors with high precision and high recall. In addition, by means of the optimized alignment algorithm, grammar analysis, and part-of-speech tagging based on the first text and the second text, the language expression error type corresponding to each content modification is determined and an analysis of the corresponding error reason is generated, so that the user moves from knowing that an error exists to knowing why it is an error.
It should be noted that the text processing method provided by the embodiments of the present application can be widely applied to different scenarios such as the education industry and enterprise office work. In the education industry, for example, the text processing method can reduce the teacher's burden of correcting homework, and can also help students locate language expression errors in their homework independently, without the teacher's intervention, so that the homework is improved and its quality is raised. For a scenario in which a teacher corrects students' English exercises in batches, the teacher can use the text processing system provided in the embodiments of the present application to automatically correct the students' writing exercises and give the language expression error type and the error reason corresponding to each content modification. The students thus learn where the language expression errors in their English writing exercises are, together with the corresponding correct expressions and error reasons, and the teacher's burden of correcting exercises in batches is reduced. In a scenario in which students study a foreign language day to day, the students can also use the text processing system to check and revise their own foreign language homework. In a bilingual professional writing scenario, the system can likewise help the writer improve their language level.
To better explain the text processing method provided by the embodiments of the present application, the method is introduced below using, as an example, a scenario that helps students improve their English writing ability, in which an English composition scoring model is combined with the text processing method described in the present application to form a scoring feedback system. The text processing method involves the sequence-to-sequence grammar error correction model, the sequence-to-sequence alignment model, and the error detail processing model.
As shown in fig. 6, the student can enter the written English composition into the input box of the English composition scoring feedback system, that is, the English text on the left side of the box shown in fig. 6. The ways in which the student can enter the English composition include keyboard input, voice input, image recognition input, and the like; the way actually used may be determined according to the specific scenario and is not limited here. Generally, after a few seconds, the scoring feedback system displays the score and the error correction details of the English composition in the display area to the right of the input box. In the scenario shown in fig. 6, the score includes a content score, a structure score, a sentence score, and a vocabulary score. From the scores displayed here, the student can get a rough idea of the overall level of the English composition.
In application, the output obtained by applying the text processing method to the English composition can be used both as the error correction details and as an input of the English composition scoring model, which scores the English composition. The score is thus influenced by the text processing result. For example, if the text processing result indicates that the error type of a certain sentence in the English composition is a word error and the error reason is incorrect use of a word's part of speech, a certain amount is deducted from the vocabulary score and the sentence score of the English composition according to a preset rule.
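How a text processing result might feed into the score can be sketched in Python as below. The penalty table, its values, and the function name `apply_penalties` are entirely hypothetical; the application only states that a deduction is made according to a preset rule and gives no concrete amounts.

```python
# Hypothetical preset rule: error type -> deduction per score dimension.
PENALTIES = {
    "word error": {"vocabulary": 1.0, "sentence": 0.5},
}


def apply_penalties(scores, error_types):
    """Return a copy of `scores` with the deductions for each detected error type applied."""
    adjusted = dict(scores)
    for error_type in error_types:
        for dimension, deduction in PENALTIES.get(error_type, {}).items():
            # Scores are clamped at zero so repeated errors cannot go negative.
            adjusted[dimension] = max(0.0, adjusted.get(dimension, 0.0) - deduction)
    return adjusted


base_scores = {"content": 8.0, "structure": 8.0, "sentence": 8.0, "vocabulary": 8.0}
print(apply_penalties(base_scores, ["word error"]))
# {'content': 8.0, 'structure': 8.0, 'sentence': 7.5, 'vocabulary': 7.0}
```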
In addition, the error correction details show the places in the English composition where errors exist, modify those places, and give the corresponding language expression error types and error reasons. As shown in fig. 7, the first sentence of the English composition entered in the left window is "Do you know what kind of animal I like most?" According to the error correction display area on the right, the sentence contains a language expression error: the article "the" is missing before "most" and should be inserted, so the modified sentence is "Do you know what kind of animal I like the most?" The corresponding language expression error type is given: missing article or determiner "the"; and the corresponding error reason is given: an article is missing, and it is suggested to improve the sentence with a suitable definite article. Here it is suggested to insert "the".
This embodiment provides an application scenario of the text processing method, in which a user writing and revising English can independently find shortcomings in language expression without the help of professionals, thereby improving the user's level of language expression.
Aiming at the text processing method provided by the embodiment, the embodiment of the application also provides a text processing device.
Referring to fig. 8, fig. 8 is a schematic diagram of a text processing apparatus according to an embodiment of the present application. As shown in fig. 8, the text processing apparatus 800 includes an acquisition unit 801 and a determination unit 802:
the acquiring unit 801 is configured to acquire a first text to be recognized;
the determining unit 802 is configured to perform text processing on the first text to obtain a second text;
the determining unit 802 is further configured to determine at least one modification pair according to the first text and the second text, where one modification pair corresponds to one content modification in the text processing, and the modification pair includes content in the first text corresponding to the content modification and content in the second text corresponding to the content modification;
the determining unit 802 is further configured to determine a type of linguistic expression error and a cause of the error corresponding to the content modification according to the first text, the second text, and the modification pair.
In a possible implementation manner, the determining unit 802 is configured to:
performing part-of-speech tagging on the segmented words in the first text and the second text;
and determining the language expression error type and the error reason corresponding to the content modification according to the first text, the second text, the modification pair and the marked part of speech.
In one possible implementation, the linguistic expression error type includes an optimized expression or at least one type of grammatical error.
In a possible implementation manner, the determining unit 802 is configured to:
comparing the contents of the first text and the second text at a first alignment granularity to obtain a corresponding first alignment sequence; the first alignment sequence comprises a first content pair, and the first content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and the change mode corresponding to the pair of text character strings;
splitting the first content pair in the first alignment sequence at a second alignment granularity to obtain a second alignment sequence; the second alignment sequence comprises a second content pair, the second content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and the change mode corresponding to the pair of text character strings, and the second alignment granularity is smaller than the first alignment granularity;
determining the at least one modification pair based on the second alignment sequence.
In a possible implementation manner, the determining unit 802 is configured to:
determining, according to the change modes of the second content pairs in the second alignment sequence, the second content pairs whose change mode identifies a modification as target content pairs;
merging adjacent target content pairs in the second alignment sequence that conform to a preset rule to obtain a third alignment sequence; the third alignment sequence comprises the modification pair obtained by merging the adjacent target content pairs, and the change mode of the merged modification pair is determined by the way it was merged.
In a possible implementation manner, the change mode of a second content pair in the second alignment sequence includes any one of substitution, insertion, deletion, or equality; the change mode of a modification pair in the third alignment sequence includes swap.
In a possible implementation manner, the determining unit 802 is configured to compare the contents of the first text and the second text at a first alignment granularity through an alignment algorithm to obtain a corresponding first alignment sequence.
In one possible implementation, the apparatus further comprises a presentation unit;
the display unit is used for displaying the content modification on the basis of the first text and the second text and displaying a corresponding judgment result; the determination result includes the type of the linguistic expression error and the error reason.
The text processing apparatus provided in the foregoing embodiment performs text processing on a first text to be recognized and makes at least one content modification to the first text so as to modify it into a second text. Each content modification corresponds to one modification pair, which comprises the content in the first text corresponding to the content modification and the content in the second text corresponding to the content modification. To determine the language expression error type and the error reason corresponding to a content modification, the first text and the second text are introduced in addition to the information provided by the modification pair, which completes the context of the content modification before and after the modification. Combining the modification pair with the first text and the second text thus provides a relatively complete information basis for determining the language expression error type and the error reason of the modification pair, achieving accurate identification of the language expression error type and a specific error reason.
The embodiment of the present application further provides an apparatus for text processing, and the apparatus for text processing provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
Referring to fig. 9, fig. 9 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) for storing applications 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transient or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1422 may be configured to communicate with the storage medium 1430 and to execute, on the server 1400, the series of instruction operations in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The CPU 1422 is configured to perform the following steps:
acquiring a first text to be identified;
performing text processing on the first text to obtain a second text;
determining at least one modification pair according to the first text and the second text, wherein one modification pair corresponds to one content modification in the text processing, and the modification pair comprises the content corresponding to the content modification in the first text and the content corresponding to the content modification in the second text;
and determining the language expression error type and the error reason corresponding to the content modification according to the first text, the second text and the modification pair.
Optionally, the CPU 1422 may further execute the method steps of any specific implementation manner of the text processing method in the embodiment of the present application.
For the text processing method described above, the embodiment of the present application further provides a terminal device for text processing, so that the text processing method described above is implemented and applied in practice.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown, and details of the specific technology are not disclosed. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA for short), and the like, taking the terminal device as the mobile phone as an example:
fig. 10 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 10, the mobile phone includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the handset structure shown in fig. 10 does not constitute a limitation on the handset, which may include more or fewer components than those shown, may combine some components, or may arrange the components differently.
The following describes each component of the mobile phone in detail with reference to fig. 10:
The RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and passes it to the processor 1580 for processing; in addition, it transmits uplink data to the base station. In general, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations of a user on or near it (e.g., operations performed by the user on or near the touch panel 1531 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 1580, and it can also receive and execute commands sent by the processor 1580. In addition, the touch panel 1531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1530 may include other input devices 1532 in addition to the touch panel 1531. In particular, the other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The Display unit 1540 may include a Display panel 1541, and optionally, the Display panel 1541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541, and when the touch panel 1531 detects a touch operation on or near the touch panel 1531, the touch operation is transmitted to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event. Although in fig. 10, the touch panel 1531 and the display panel 1541 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the mobile phone.
The handset can also include at least one sensor 1550, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1541 according to the brightness of ambient light and a proximity sensor that turns off the display panel 1541 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1560, a speaker 1561, and a microphone 1562 may provide an audio interface between a user and the mobile phone. The audio circuit 1560 may transmit the electrical signal converted from received audio data to the speaker 1561, which converts the electrical signal into a sound signal and outputs it; conversely, the microphone 1562 converts collected sound signals into electrical signals, which the audio circuit 1560 receives and converts into audio data. The audio data is then output to the processor 1580 for processing, after which it is sent through the RF circuit 1510 to, for example, another mobile phone, or output to the memory 1520 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through a WiFi module 1570, and provides wireless broadband internet access for the user. Although fig. 10 shows WiFi module 1570, it is understood that it does not belong to the essential constitution of the handset and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1580 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor need not be integrated into the processor 1580.
The mobile phone also includes a power supply 1590 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 1580 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In an embodiment of the present application, the handset includes a memory 1520 that can store program code and transmit the program code to the processor.
The processor 1580 included in the mobile phone can execute the text processing method provided in the foregoing embodiments according to the instructions in the program code.
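As a rough illustration of the kind of program code the processor might execute, the two-granularity comparison recited in claim 4 below can be sketched in Python. This is only a minimal sketch under stated assumptions, not the implementation of the present application: the function names `coarse_pairs`, `split_pairs`, and `modification_pairs` are hypothetical, the first comparison granularity is assumed to be a multi-word span, the second is assumed to be a single word, and Python's standard `difflib.SequenceMatcher` stands in for whatever alignment algorithm an actual embodiment uses.

```python
from difflib import SequenceMatcher


def coarse_pairs(first_text, second_text):
    """First comparison (assumed granularity: word spans).

    Returns a list of (change_mode, words_in_first, words_in_second).
    """
    a, b = first_text.split(), second_text.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


def split_pairs(pairs):
    """Second comparison (assumed granularity: single words).

    Splits every span into one-word content pairs. Words are paired
    position by position; leftovers become deletions or insertions.
    """
    out = []
    for tag, src, dst in pairs:
        word_tag = "equal" if tag == "equal" else "replace"
        out.extend((word_tag, [s], [d]) for s, d in zip(src, dst))
        n = min(len(src), len(dst))
        out.extend(("delete", [s], []) for s in src[n:])
        out.extend(("insert", [], [d]) for d in dst[n:])
    return out


def modification_pairs(first_text, second_text):
    """Keep only the content pairs whose change mode is not 'equal'."""
    fine = split_pairs(coarse_pairs(first_text, second_text))
    return [pair for pair in fine if pair[0] != "equal"]


if __name__ == "__main__":
    print(modification_pairs("He go to school yesterday",
                             "He went to school yesterday"))
```

Each returned tuple pairs the content in the first text with the corresponding content in the second text, which is the information a later step would need in order to judge the error type and error reason.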
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the text processing method provided by the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text processing method provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium may be at least one of the following media that can store program code: read-only memory (ROM), RAM, a magnetic disk, or an optical disk.
It should be noted that the embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be referred to. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method of text processing, the method comprising:
acquiring a first text to be identified;
performing at least one content modification on the first text to obtain a second text;
determining at least one modification pair according to the first text and the second text, wherein one modification pair corresponds to one content modification in the text processing, and the modification pair comprises the content corresponding to the content modification in the first text and the content corresponding to the content modification in the second text;
and determining the language expression error type and the error reason corresponding to the content modification in the first text according to the first text, the second text and the modification pair, wherein the first text has contextual information before the content modification.
2. The method of claim 1, wherein the determining the type of linguistic expression error and the cause of the error corresponding to the content modification according to the first text, the second text, and the modification pair comprises:
performing part-of-speech tagging on the participles in the first text and the second text;
and determining the language expression error type and the error reason corresponding to the content modification according to the first text, the second text, the modification pair and the marked part of speech.
3. The method of claim 2, wherein the linguistic expression error type comprises an optimized expression or at least one type of grammatical error.
4. The method of claim 1, wherein determining at least one modification pair from between the first text and the second text comprises:
comparing the contents of the first text and the second text by a first comparison granularity to obtain a corresponding first comparison sequence; the first comparison sequence comprises a first content pair, and the first content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and a change mode corresponding to the pair of text character strings;
splitting a first content pair in the first comparison sequence by a second comparison granularity to obtain a second comparison sequence; the second comparison sequence comprises a second content pair, the second content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and a change mode corresponding to the pair of text character strings, and the second comparison granularity is smaller than the first comparison granularity;
determining the at least one modification pair according to the second comparison sequence.
5. The method of claim 4, wherein the determining the at least one modification pair according to the second comparison sequence comprises:
determining, according to the change mode of the second content pairs in the second comparison sequence, target content pairs whose change mode is identified as modified;
merging adjacent target content pairs which accord with a preset rule in the second comparison sequence to obtain a third comparison sequence; the third comparison sequence comprises the modification pair obtained by merging the adjacent target content pairs, and the change mode of the merged modification pair is determined according to the merging manner.
6. The method of claim 5, wherein the change mode of a second content pair in the second comparison sequence comprises any one of replacement, insertion, or deletion; and the change mode of a modification pair in the third comparison sequence comprises exchange.
7. The method of claim 4, wherein the comparing the contents of the first text and the second text with a first comparison granularity to obtain a corresponding first comparison sequence comprises:
and comparing the contents of the first text and the second text by an alignment algorithm at a first comparison granularity to obtain a corresponding first comparison sequence.
8. The method according to any one of claims 1-7, further comprising:
displaying the content modification on the basis of the first text and the second text, and displaying a corresponding judgment result; the determination result includes the type of the linguistic expression error and the error reason.
9. A text processing apparatus characterized by comprising an acquisition unit and a determination unit:
the acquisition unit is used for acquiring a first text to be identified;
the determining unit is used for modifying at least one content of the first text to obtain a second text;
the determining unit is further configured to determine at least one modification pair according to the first text and the second text, where one modification pair corresponds to one content modification in the text processing, and the modification pair includes a content in the first text corresponding to the content modification and a content in the second text corresponding to the content modification;
the determining unit is further configured to determine, according to the first text, the second text, and the modification pair, a type of linguistic expression error and a cause of the error corresponding to the content modification in the first text, where the first text has context information before the content modification.
10. The apparatus of claim 9, wherein the determining unit is configured to:
performing part-of-speech tagging on the participles in the first text and the second text;
and determining the language expression error type and the error reason corresponding to the content modification according to the first text, the second text, the modification pair and the marked part of speech.
11. The apparatus of claim 10, wherein the linguistic expression error type comprises an optimized expression or at least one type of grammatical error.
12. The apparatus of claim 9, wherein the determining unit is configured to:
comparing the contents of the first text and the second text by a first comparison granularity to obtain a corresponding first comparison sequence; the first comparison sequence comprises a first content pair, and the first content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and a change mode corresponding to the pair of text character strings;
splitting a first content pair in the first comparison sequence by a second comparison granularity to obtain a second comparison sequence; the second comparison sequence comprises a second content pair, the second content pair is used for identifying a pair of corresponding text character strings between the first text and the second text and a change mode corresponding to the pair of text character strings, and the second comparison granularity is smaller than the first comparison granularity;
determining the at least one modification pair according to the second comparison sequence.
13. The apparatus of claim 12, wherein the determining unit is configured to:
determining, according to the change mode of the second content pairs in the second comparison sequence, target content pairs whose change mode is identified as modified;
merging adjacent target content pairs which accord with a preset rule in the second comparison sequence to obtain a third comparison sequence; the third comparison sequence comprises the modification pair obtained by merging the adjacent target content pairs, and the change mode of the merged modification pair is determined according to the merging manner.
14. An apparatus for text processing, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-8 according to instructions in the program code.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any one of claims 1-8.
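The merging step recited in claims 5 and 6 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions rather than the claimed implementation: the claims do not spell out the preset rule, so the rule used here (two adjacent replacement pairs that swap the same two words are merged into one pair whose change mode is "exchange") is an assumption, and the name `merge_exchanges` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (change mode, words in the first text, words in the second text)
Pair = Tuple[str, List[str], List[str]]


def merge_exchanges(pairs: List[Pair]) -> List[Pair]:
    """Merge adjacent replacement pairs that swap two words.

    Assumed preset rule: pair k replaces X with Y and pair k+1
    replaces Y with X, so together they amount to an exchange.
    """
    merged: List[Pair] = []
    i = 0
    while i < len(pairs):
        cur = pairs[i]
        nxt = pairs[i + 1] if i + 1 < len(pairs) else None
        if (nxt is not None
                and cur[0] == "replace" and nxt[0] == "replace"
                and cur[1] == nxt[2] and cur[2] == nxt[1]):
            # Combine both pairs into one modification pair.
            merged.append(("exchange", cur[1] + nxt[1], cur[2] + nxt[2]))
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged


if __name__ == "__main__":
    second_sequence = [
        ("equal", ["I"], ["I"]),
        ("replace", ["always"], ["am"]),
        ("replace", ["am"], ["always"]),
        ("equal", ["late"], ["late"]),
    ]
    for pair in merge_exchanges(second_sequence):
        print(pair)
```

Reporting one exchange instead of two unrelated replacements lets a later step describe the change as a word-order problem, which two separate replacement pairs would not convey.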
CN202110200840.2A 2020-11-06 2020-11-06 Text processing method and related device Pending CN113536743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110200840.2A CN113536743A (en) 2020-11-06 2020-11-06 Text processing method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011231200.XA CN112036135B (en) 2020-11-06 2020-11-06 Text processing method and related device
CN202110200840.2A CN113536743A (en) 2020-11-06 2020-11-06 Text processing method and related device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202011231200.XA Division CN112036135B (en) 2020-11-06 2020-11-06 Text processing method and related device

Publications (1)

Publication Number Publication Date
CN113536743A true CN113536743A (en) 2021-10-22

Family

ID=73572791

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110200840.2A Pending CN113536743A (en) 2020-11-06 2020-11-06 Text processing method and related device
CN202011231200.XA Active CN112036135B (en) 2020-11-06 2020-11-06 Text processing method and related device

Country Status (1)

Country Link
CN (2) CN113536743A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719424B (en) * 2023-08-09 2024-03-22 腾讯科技(深圳)有限公司 Determination method and related device for type identification model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312545A1 (en) * 2009-06-05 2010-12-09 Google Inc. Detecting Writing Systems and Languages
CN106776549A (en) * 2016-12-06 2017-05-31 桂林电子科技大学 A kind of rule-based english composition syntax error correcting method
US20170220536A1 (en) * 2016-02-01 2017-08-03 Microsoft Technology Licensing, Llc Contextual menu with additional information to help user choice
CN108595410A (en) * 2018-03-19 2018-09-28 小船出海教育科技(北京)有限公司 The automatic of hand-written composition corrects method and device
WO2019105432A1 (en) * 2017-11-29 2019-06-06 腾讯科技(深圳)有限公司 Text recommendation method and apparatus, and electronic device
CN110718226A (en) * 2019-09-19 2020-01-21 厦门快商通科技股份有限公司 Speech recognition result processing method and device, electronic equipment and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5384702A (en) * 1993-09-19 1995-01-24 Tou Julius T Method for self-correction of grammar in machine translation
JP5079019B2 (en) * 2008-01-08 2012-11-21 三菱電機株式会社 Information filtering system, information filtering method, and information filtering program
CN101520779A (en) * 2009-04-17 2009-09-02 哈尔滨工业大学 Automatic diagnosis and evaluation method for machine translation
US8560300B2 (en) * 2009-09-09 2013-10-15 International Business Machines Corporation Error correction using fact repositories
JP6605995B2 (en) * 2016-03-16 2019-11-13 株式会社東芝 Speech recognition error correction apparatus, method and program
CN108519974A (en) * 2018-03-31 2018-09-11 华南理工大学 English composition automatic detection of syntax error and analysis method
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text
CN110309504B (en) * 2019-05-23 2023-10-31 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium based on word segmentation
CN111090989B (en) * 2019-07-17 2023-09-22 广东小天才科技有限公司 Prompting method based on character recognition and electronic equipment
CN110427330B (en) * 2019-08-13 2023-09-26 腾讯科技(深圳)有限公司 Code analysis method and related device
CN111859920A (en) * 2020-06-19 2020-10-30 北京国音红杉树教育科技有限公司 Method and system for identifying word spelling errors and electronic equipment
CN111737980B (en) * 2020-06-22 2023-05-16 桂林电子科技大学 Correction method for use errors of English text words

Also Published As

Publication number Publication date
CN112036135B (en) 2021-03-02
CN112036135A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
US11416681B2 (en) Method and apparatus for determining a reply statement to a statement based on a sum of a probability of the reply statement being output in response to the statement and a second probability in which the statement is output in response to the statement and further based on a terminator
EP2947581B1 (en) Interactive searching method and apparatus
CN110334360B (en) Machine translation method and device, electronic device and storage medium
CN110334347A (en) Information processing method, relevant device and storage medium based on natural language recognition
CN110795528A (en) Data query method and device, electronic equipment and storage medium
CN111177371B (en) Classification method and related device
CN110795538B (en) Text scoring method and related equipment based on artificial intelligence
CN110717026B (en) Text information identification method, man-machine conversation method and related devices
CN111597804B (en) Method and related device for training entity recognition model
CN111368525A (en) Information searching method, device, equipment and storage medium
CN114328852A (en) Text processing method, related device and equipment
CN112214605A (en) Text classification method and related device
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN112749252A (en) Text matching method based on artificial intelligence and related device
CN113822072A (en) Keyword extraction method and device and electronic equipment
CN112232066A (en) Teaching outline generation method and device, storage medium and electronic equipment
CN112036135B (en) Text processing method and related device
CN113822038A (en) Abstract generation method and related device
CN112328783A (en) Abstract determining method and related device
CN116955610A (en) Text data processing method and device and storage medium
CN113505596B (en) Topic switching marking method and device and computer equipment
CN112307198B (en) Method and related device for determining abstract of single text
CN113821609A (en) Answer text acquisition method and device, computer equipment and storage medium
CN113703883A (en) Interaction method and related device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40053616; Country of ref document: HK)
SE01 Entry into force of request for substantive examination