CN112307748A - Method and device for processing text - Google Patents

Publication number: CN112307748A
Authority: CN (China)
Application number: CN202010134249.7A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: not disclosed (不公告发明人)
Current Assignee: Beijing ByteDance Network Technology Co Ltd
Original Assignee: Beijing ByteDance Network Technology Co Ltd
Prior art keywords: word, user, text, written, difference

Abstract

Embodiments of the present disclosure disclose a method and a device for processing text. One embodiment of the method comprises: acquiring a text corresponding to content written by a user as a user text; determining speech features of the speech corresponding to the user text, and performing speech recognition using the speech features to obtain a recognition text; selecting, from the user text, words that differ from the corresponding words in the recognition text as difference words to obtain a difference word set; and determining a processing result for the content written by the user according to the difference word set, wherein the processing result indicates suspected wrongly written words appearing in the content written by the user. The embodiment enables convenient detection of wrongly written words in content written by a user.

Description

Method and device for processing text
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for processing texts.
Background
With the rapid development and wide application of computer technology, users in daily life can communicate with others directly through the typing, voice, and video functions of their electronic devices, and they have fewer and fewer occasions to write characters by hand. As a direct result, users forget how to write many characters correctly. Therefore, when a user completes some content by hand (such as a handwritten report or a handwritten manuscript), wrongly written characters are likely to appear.
In addition, many students, especially primary and secondary school students, may not be skilled in writing Chinese characters themselves, and therefore, the users may also have wrongly written characters during writing.
In the cases above, where wrongly written characters easily occur, the user, or the user's friends or parents, usually has to check the content carefully to find them, and wrongly written characters are still easily missed. If the content written by the user is long, this checking can also be time-consuming.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for processing texts.
In a first aspect, an embodiment of the present disclosure provides a method for processing text, the method including: acquiring a text corresponding to content written by a user as a user text; determining speech features of the speech corresponding to the user text, and performing speech recognition using the speech features to obtain a recognition text; selecting, from the user text, words that differ from the corresponding words in the recognition text as difference words to obtain a difference word set; and determining a processing result for the content written by the user according to the difference word set, wherein the processing result indicates suspected wrongly written words appearing in the content written by the user.
In some embodiments, determining the processing result for the content written by the user according to the difference word set comprises: for a difference word in the difference word set, extracting from the user text the words in which the difference word is located to form a word set corresponding to the difference word; and determining whether the difference word is a suspected wrongly written word according to the word set corresponding to the difference word.
In some embodiments, determining whether the difference word is a suspected wrongly written word according to the word set corresponding to the difference word includes: determining whether a preset lexicon includes the words in the word set corresponding to the difference word; and in response to determining that the lexicon includes none of the words in the word set corresponding to the difference word, determining the difference word to be a suspected wrongly written word.
In some embodiments, the above method further comprises: selecting, from the user text, characters and words that belong to a preset library of frequently miswritten characters and words as candidate words to obtain a candidate word set; for a candidate word in the candidate word set, determining whether the candidate word is a suspected wrongly written word; and updating the processing result in response to determining that the candidate word is a suspected wrongly written word.
In some embodiments, determining whether the candidate word is a suspected wrongly written word includes: and determining whether the candidate word is a suspected wrongly-written word or not according to the grammar rule corresponding to the candidate word.
In some embodiments, the above method further comprises: receiving user feedback information aiming at a processing result; and updating the word stock according to the user feedback information.
In some embodiments, the above method further comprises: updating, according to the user feedback information, the set of wrongly written words constructed for the user.
In a second aspect, an embodiment of the present disclosure provides an apparatus for processing text, the apparatus including: an acquisition unit configured to acquire a text corresponding to content written by a user as a user text; a recognition unit configured to determine speech features of the speech corresponding to the user text, and to perform speech recognition using the speech features to obtain a recognition text; a selecting unit configured to select, from the user text, words that differ from the corresponding words in the recognition text as difference words to obtain a difference word set; and a processing unit configured to determine a processing result for the content written by the user according to the difference word set, wherein the processing result indicates suspected wrongly written words appearing in the content written by the user.
In some embodiments, the processing unit is further configured to, for a difference word in the difference word set, extract a word in which the difference word is located from the user text to form a word set corresponding to the difference word; and determining whether the difference word is a suspected wrongly-written word or not according to the word set corresponding to the difference word.
In some embodiments, the processing unit is further configured to determine whether a preset lexicon includes the words in the word set corresponding to the difference word; and in response to determining that the lexicon includes none of the words in the word set corresponding to the difference word, determine the difference word to be a suspected wrongly written word.
In some embodiments, the selecting unit is further configured to select, from the user text, characters and words that belong to a preset library of frequently miswritten characters and words as candidate words to obtain a candidate word set; the processing unit is further configured to determine, for a candidate word in the candidate word set, whether the candidate word is a suspected wrongly written word, and to update the processing result in response to determining that the candidate word is a suspected wrongly written word.
In some embodiments, the processing unit is further configured to determine whether the candidate word is a suspected wrongly written word according to a grammar rule corresponding to the candidate word.
In some embodiments, the above apparatus further comprises: a receiving unit configured to receive user feedback information for a processing result; and the updating unit is configured to update the word stock according to the user feedback information.
In some embodiments, the updating unit is further configured to update, according to the user feedback information, the set of wrongly written words constructed for the user.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the method and device for processing text provided by embodiments of the present disclosure, a user text corresponding to content written by a user is obtained, a recognition text is obtained through speech recognition from the speech features of the user text, and a difference word set is then obtained by comparing the corresponding words in the user text and the recognition text. Because the text produced by speech recognition represents the more common and natural textual rendering of the given speech features, it can serve as a reference for the user text. A difference word obtained by the comparison indicates that the word is not the common or natural rendering of those speech features, so suspected wrongly written words in the content written by the user can be determined from the difference word set, which facilitates convenient detection of wrongly written words in content written by a user.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for processing text, according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a method for processing text in accordance with the present disclosure;
FIG. 4 is a schematic diagram of one application scenario of a method for processing text in accordance with an embodiment of the present disclosure;
FIG. 5 is a flow diagram of yet another embodiment of a method for processing text in accordance with the present disclosure;
FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for processing text according to the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 to which embodiments of a method for processing text or an apparatus for processing text of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have client applications installed thereon, such as browser applications, search applications, image processing applications, and speech processing applications.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, such as a server providing back-end support for client applications installed on the terminal devices 101, 102, 103. The server 105 may analyze and process a text submitted by the client (e.g., a text corresponding to the content written by the user), and feed back a processing result (e.g., a suspected wrongly written word in the content written by the user) to the terminal devices 101, 102, and 103.
It should be noted that the text submitted by the client (e.g., text corresponding to the content written by the user) may also be directly stored locally in the server 105, and the server 105 may directly extract and process the locally stored data, in which case, the terminal devices 101, 102, and 103 and the network 104 may not be present.
It should be noted that the method for processing text provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for processing text is generally disposed in the server 105.
It should be noted that the terminal devices 101, 102, and 103 may also directly process texts (such as texts corresponding to the content written by the user), in this case, the method for processing texts may also be executed by the terminal devices 101, 102, and 103, and accordingly, the device for processing texts may also be disposed in the terminal devices 101, 102, and 103. At this point, the exemplary system architecture 100 may not have the server 105 and the network 104.
The server 105 may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein. The numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for processing text in accordance with the present disclosure is shown. The method for processing text comprises the following steps:
step 201, acquiring a text corresponding to the content written by the user as a user text.
In this embodiment, an execution subject (e.g., the server 105 shown in fig. 1) of the method for processing text may obtain text corresponding to the content written by the user from a local or other storage device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), a database, a third-party data platform, and the like.
The content written by the user can be text written by the user for various purposes. For example, the content written by the user includes, but is not limited to, homework, reports, articles, lyrics, letters, and the like. It should be appreciated that the content written by the user may differ for different types of users. For example, for a student, the written content includes, but is not limited to, classwork, homework, answers written on test papers, and the like. For adults, written content includes, but is not limited to, work reports, mail, and the like.
In this embodiment, the user text may be composed of words in the content written by the user in the writing order of the user. According to different application requirements and application scenes, the text corresponding to the content written by the user can be flexibly set and acquired.
Alternatively, an image for presenting the content written by the user may be acquired first, and then the text corresponding to the content written by the user may be obtained according to the acquired image. The mode of acquiring the images can be flexibly set. The image may be obtained, for example, by photographing the content written by the user. In this case, the text corresponding to the content written by the user can be obtained by extracting the characters in the image based on various existing image processing techniques (such as optical character recognition).
It should be understood that the execution body may directly obtain the image presenting the content written by the user. Alternatively, another device may first obtain the image presenting the content written by the user and then send the obtained image to the execution body.
Alternatively, the text corresponding to the content written by the user may be obtained according to a scanning result obtained by scanning the content written by the user with a scanning device (e.g., a scanning pen).
It should be understood that the execution body may be communicatively coupled to the scanning device. At this time, the execution body may directly receive the scan result. Of course, the scanning device may also be communicatively coupled to other storage devices. At this time, other storage devices may receive the scan result and then forward the scan result to the execution subject.
It should be noted that, the user in the present disclosure may refer to various types of users. For example, it may refer to children (e.g., primary and middle school students), the elderly, and so forth.
Step 202, determining the voice characteristics of the voice corresponding to the user text, and performing voice recognition by using the voice characteristics to obtain a recognition text.
In this embodiment, the speech features may be used to characterize the pronunciation of each word in the user text. The speech features may take any of various existing forms. For example, the speech feature may be pinyin, in which case the speech feature of the speech corresponding to the user text indicates the pinyin of each word in the user text. As another example, the speech features may be phonemes, in which case the speech feature of the speech corresponding to the user text indicates the phonemes of each word in the user text.
In this embodiment, the speech feature of the speech corresponding to the user text may be determined based on various existing TTS (Text To Speech) technologies. The obtained speech features can then be processed using various existing speech recognition technologies, and the text recognized from the speech features is taken as the recognition text.
It should be understood that, in general, the resulting recognition text has a correspondence with the user text described above. In particular, the user text may be regarded as a sequence formed by its individual words in order, and likewise the recognition text may be regarded as a sequence formed by its individual words in order, where the order may refer to the reading order. Since the user text and the recognition text correspond to the same speech features, the words in the sequence corresponding to the user text and the words in the sequence corresponding to the recognition text are in one-to-one correspondence, and two corresponding words have the same speech feature.
As an example, suppose the user text is "his bundle", the speech feature corresponding to the user text is "ta de bao fu", and the recognition text derived from that speech feature is "his hold". In this case, "he" in the user text corresponds to "he" in the recognition text, "of" in the user text corresponds to "of" in the recognition text, "package" in the user text corresponds to "hug" in the recognition text, and "beam" in the user text corresponds to "negative" in the recognition text. The first two pairs are identical, while the last two pairs differ even though each pair shares the same pronunciation ("bao" and "fu", respectively).
It should be noted that the user text described above is only an example, and the length of the user text may be arbitrary. I.e. the number of words comprised by the user text may be arbitrary. The present disclosure is not limited thereto.
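The round trip described in step 202 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the lookup tables `PINYIN` and `MOST_LIKELY` and the functions `speech_features` and `recognize` are hypothetical stand-ins for a real TTS front end and a real speech recognizer, and the English glosses from the example above stand in for the individual characters.

```python
# Hypothetical pronunciation table: gloss of each character -> pinyin syllable.
PINYIN = {
    "he": "ta", "of": "de",
    "package": "bao", "beam": "fu",
    "hug": "bao", "negative": "fu",
}

# Hypothetical "recognizer": pinyin sequence -> most common character sequence.
MOST_LIKELY = {
    ("ta", "de", "bao", "fu"): ["he", "of", "hug", "negative"],
}


def speech_features(user_text):
    """Map each character of the user text to its speech feature (pinyin)."""
    return [PINYIN[ch] for ch in user_text]


def recognize(features):
    """Return the most common text for the features, or None if unknown."""
    return MOST_LIKELY.get(tuple(features))


user_text = ["he", "of", "package", "beam"]
features = speech_features(user_text)
recognized = recognize(features)
print(features)
print(recognized)
```

Because the recognition text is produced from one speech feature per character, it has the same length as the user text, which is what makes the position-by-position comparison in the next step possible.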
Step 203, selecting characters different from the corresponding characters in the recognition text from the user text as difference characters to obtain a difference character set.
In this embodiment, as explained in step 202, for each word in the user text, the corresponding word in the recognition text is the word at the same position in order, and it has the same speech feature. Generally, words in the user text correspond one-to-one with words in the recognition text.
Therefore, the words in the user text can be compared with the words in the recognition text one by one to obtain a difference word set. In other words, for a word of the user text, it may be determined whether the word in the recognition text corresponding to the word is the same as the word. If the two are the same, it indicates that there is no difference between the two words. If the two are not the same, the word may be determined to be a difference word.
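The one-by-one comparison can be sketched as follows. This is an illustrative sketch only: the function name `difference_words` and the choice to return positions along with the differing items are assumptions, and the inputs reuse the English glosses of the earlier example in place of the original characters.

```python
def difference_words(user_text, recognized_text):
    """Return (position, user item, recognized item) for every mismatch.

    Both inputs are sequences of equal length whose items correspond
    position by position (strings work too, compared per character).
    """
    if len(user_text) != len(recognized_text):
        raise ValueError("texts must correspond one-to-one")
    return [
        (i, u, r)
        for i, (u, r) in enumerate(zip(user_text, recognized_text))
        if u != r
    ]


diffs = difference_words(
    ["he", "of", "package", "beam"],
    ["he", "of", "hug", "negative"],
)
print(diffs)
```

Keeping the position of each difference word is a design choice made here so that later steps can look at the surrounding context in the user text.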
And step 204, determining a processing result of the content written by the user according to the difference word set.
In this embodiment, the processing result may be used to indicate a suspected wrongly written word appearing in the content written by the user. The suspected wrongly written words can be used to represent that the words have a certain probability of being wrongly written words.
Alternatively, it may be directly determined that each word in the difference word set is a suspected wrongly written word.
In this embodiment, after the processing result is obtained, the obtained processing result may be presented to the target user, so as to prompt the target user of a wrongly written word that may appear in the content written by the user. The target user may be a related user of the user corresponding to the user text. Therefore, the written content can be conveniently checked and corrected by the user according to the prompt information.
For example, when the user corresponding to the user text is a child, the target user may be a parent or a teacher of the child. The parent or teacher can then check the content written by the child (such as homework) in a targeted manner according to the prompt information, which is considerably more efficient than checking the child's writing for wrongly written words entirely by hand.
According to the method provided by the embodiment of the present disclosure, the speech features of the user text corresponding to the content written by the user are obtained using processing technologies such as text-to-speech conversion, and the recognition text corresponding to the speech features is then obtained using speech recognition technology. The user text and the recognition text are compared to obtain the difference word set, and the suspected wrongly written words in the content written by the user are determined according to the difference word set. The user or a related user can therefore check the suspected wrongly written words in a targeted manner, which effectively alleviates the long time consumption and low efficiency of checking the content written by the user purely by hand.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method for processing text is shown. The flow 300 of the method for processing text comprises the steps of:
step 301, acquiring a text corresponding to the content written by the user as a user text.
Step 302, determining the voice characteristics of the voice corresponding to the user text, and performing voice recognition by using the voice characteristics to obtain a recognition text.
Step 303, selecting a word different from the word corresponding to the recognition text from the user text as a difference word to obtain a difference word set.
The specific implementation process of steps 301, 302, and 303 may refer to the related description of steps 201, 202, and 203 in the corresponding embodiment of fig. 2, and will not be described herein again.
Step 304, for a difference word in the difference word set, extracting from the user text the words in which the difference word is located to form a word set corresponding to the difference word, and determining whether the difference word is a suspected wrongly written word according to the word set corresponding to the difference word.
In this embodiment, for each difference word, a word in which the difference word is located may refer to a word formed by the difference word together with a preset number of characters before and/or after it in the designated reading order. The preset number can be set according to the actual application scenario. A word may be formed from the difference word and the characters before it, from the difference word and the characters after it, or from the difference word and the characters both before and after it. Therefore, a difference word may be located in multiple words, that is, the word set may include multiple words. The specific number of words included in the word set can be set according to actual application requirements.
Optionally, the word set corresponding to the difference word may be composed of words in the user text, where the difference word is located, and the length of each word is not less than 2 and not more than 4.
The following examples are given as illustrations: the user text is "speech is the most natural way for human interaction", where the difference word is "yes". In this case, the term "d" in the length of not less than 2 and not more than 4 includes: then get, exchange, get, exchange naturally.
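The extraction of the word set can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name `words_containing` and its parameters are hypothetical, the difference word is taken to be the single character at position `index`, and the length bounds of 2 and 4 follow the optional implementation described above.

```python
def words_containing(text, index, min_len=2, max_len=4):
    """Return every substring of `text` that covers position `index`
    and whose length is between `min_len` and `max_len` inclusive."""
    words = []
    for length in range(min_len, max_len + 1):
        # A window of this length covers `index` when its start lies in
        # [index - length + 1, index], clipped to the bounds of the text.
        first = max(0, index - length + 1)
        last = min(index, len(text) - length)
        for start in range(first, last + 1):
            words.append(text[start:start + length])
    return words


# Placeholder text; position 2 ("c") plays the role of the difference word.
print(words_containing("abcde", 2))
```

The result is ordered by length and then by start position; a real system might restrict the set further, since the text above leaves the number of words in the set to the application.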
In this embodiment, it may be determined whether the difference word is a suspected wrongly-written word according to the word set corresponding to the difference word by using various existing language processing techniques.
Optionally, each word in the word set corresponding to the difference word may be treated as a text and processed with any of various existing word segmentation methods to obtain a word segmentation result. The word segmentation result can be used to characterize whether the word is divided into a target number of pieces, where the target number equals the number of characters the word contains. In response to determining that every word in the word set corresponding to the difference word is divided into its target number of pieces, the difference word is determined to be a suspected wrongly written word.
If a word is divided into as many pieces as it has characters, that is, into single characters, the word does not conform to the rules of natural language. Therefore, if none of the words in which a difference word is located conforms to the rules of natural language, the difference word is likely to have been wrongly written by the user.
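The segmentation-based check can be sketched as follows. The source only refers to "various existing word segmentation methods"; the forward maximum matching segmenter used here is swapped in purely for illustration, and the names `segment`, `splits_into_single_characters`, and `suspected_by_segmentation`, as well as the toy lexicon, are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: repeatedly take the longest lexicon entry
    starting at the current position, falling back to one character."""
    pieces = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                pieces.append(piece)
                i += length
                break
    return pieces


def splits_into_single_characters(word, lexicon):
    """True if the word is divided into as many pieces as it has characters."""
    return len(segment(word, lexicon)) == len(word)


def suspected_by_segmentation(word_set, lexicon):
    """Flag the difference word only if every word in its word set
    falls apart into single characters."""
    return bool(word_set) and all(
        splits_into_single_characters(w, lexicon) for w in word_set
    )


lexicon = {"ab", "abc"}  # toy lexicon of placeholder "words"
print(segment("abcd", lexicon))
print(suspected_by_segmentation(["xy", "yz"], lexicon))
print(suspected_by_segmentation(["xy", "ab"], lexicon))
```

An empty word set is treated as "not suspected" here; that is a choice made for this sketch, not something the text specifies.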
Optionally, whether the difference word is a suspected wrongly written word may be determined according to the word set corresponding to the difference word by the following steps: determining whether a preset lexicon includes the words in the word set corresponding to the difference word; and in response to determining that the lexicon includes none of the words in the word set corresponding to the difference word, determining the difference word to be a suspected wrongly written word.
The preset lexicon can be set in advance by a technician. For example, the lexicon may consist of all words in an existing dictionary. As another example, the lexicon may be an existing public lexicon.
If a word exists in the lexicon, the word conforms to the rules of natural language. If a word does not exist in the lexicon, the word does not conform to the rules of natural language, that is, the word may be wrong. On this basis, if the lexicon includes none of the words in the word set corresponding to the difference word, none of the words in which the difference word is located conforms to the rules of natural language, and the difference word is therefore likely to be a wrongly written word.
Since the lexicon can usually cover almost all common words, using it to judge the words in which the difference word is located in the user text can improve the accuracy of the determined suspected wrongly written words and avoid excessive misjudgment.
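The lexicon-based decision can be sketched as a simple membership test. This is an illustrative sketch, assuming the lexicon is held as a Python set; the function name `suspected_by_lexicon` and the sample entries are hypothetical placeholders rather than real dictionary data.

```python
def suspected_by_lexicon(word_set, lexicon):
    """Flag the difference word only if none of the words it appears in
    is found in the lexicon."""
    return bool(word_set) and not any(word in lexicon for word in word_set)


lexicon = {"bcd", "cd"}  # placeholder entries

# One containing word is in the lexicon, so the character is not flagged.
print(suspected_by_lexicon(["bc", "cd", "abc"], lexicon))

# No containing word is in the lexicon, so the character is flagged.
print(suspected_by_lexicon(["xy", "xyz"], lexicon))
```

Requiring that no containing word appears in the lexicon keeps the check conservative: a single legitimate word around the difference word is enough to leave it unflagged.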
In some optional implementations of the embodiment, after obtaining the processing result of the content written by the user, the processing result may be presented to the user. Furthermore, user feedback information aiming at the processing result can be received, and then the preset word bank can be updated according to the user feedback information. The user feedback information may be used to indicate whether each suspected wrongly-written word indicated by the processing result is a true wrongly-written word.
In this way, the preset word bank is updated using the user feedback information, and repeated iterative updating makes the word bank richer and more comprehensive, so that misjudgments can be reduced in subsequent processing of content written by the user and the accuracy of the processing result is improved.
In some optional implementations of this embodiment, after the user feedback information is received, a wrongly-written word set constructed for the user may further be updated. The wrongly-written word set constructed for the user can be used to record wrongly-written words that have appeared in the content written by the user.
In this way, a personalized wrongly-written word set can be constructed for each user and updated through continuous iteration. When content written by the user is checked later, it can additionally be checked against the wrongly-written word set corresponding to that user, which further reduces the number of undetected wrongly-written words and improves the accuracy of the processing result for the user text.
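The text does not specify how the feedback is applied, so the following Python sketch makes its own assumptions: each feedback item names the flagged character, the word that contains it, and whether the user confirms the error. A rejected flag adds the containing word to the word bank, and a confirmed flag is recorded in the per-user wrongly-written word set. The `Feedback` type and `apply_feedback` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Feedback:
    suspect: str   # the flagged character
    word: str      # the word in the user text that contains it
    is_typo: bool  # True if the user confirms it is really wrong


def apply_feedback(items: Iterable[Feedback],
                   word_bank: Set[str],
                   user_typos: Set[str]) -> None:
    """Update the shared word bank and the per-user typo set in place."""
    for item in items:
        if item.is_typo:
            # Confirmed error: remember it for this user's later checks.
            user_typos.add(item.suspect)
        else:
            # False alarm: treat the containing word as valid from now on.
            word_bank.add(item.word)


word_bank = {"game"}
user_typos: Set[str] = set()
apply_feedback(
    [Feedback("x", "xy", True), Feedback("z", "zw", False)],
    word_bank,
    user_typos,
)
```

After the call in this example, `user_typos` contains `"x"` and `word_bank` additionally contains `"zw"`.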
With continued reference to fig. 4, fig. 4 is a schematic diagram 400 of an application scenario of the method for processing text according to the present embodiment. In the application scenario of fig. 4, the user writes a job 401 whose content is "a game with a student playing a key". The job 401 is photographed by the camera 4021 of the mobile phone 402 to obtain a job image, and the mobile phone 402 may then transmit the job image to the server 403.
Server 403 may first recognize the job image based on OCR (Optical Character Recognition) techniques to obtain user text 404. Server 403 may then obtain pinyin 405 for user text 404 based on TTS (text-to-speech) techniques. Specifically, the pinyin 405 is "he tong xue wan le ti jian zi de you xi". The server 403 may then recognize the pinyin 405 using speech recognition techniques, resulting in recognized text 406. Specifically, the recognized text 406 is "play a game of kicking a shuttlecock with a student".
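The round trip from text to pronunciation and back can be illustrated with a toy Python sketch. Everything here is a stand-in: the two lookup tables replace a real TTS front end and a real speech recognizer, and the characters 键 ("key") and 毽 ("shuttlecock") are an assumption about what the translated example refers to, chosen because both are pronounced "jian".

```python
from typing import Dict, List, Tuple

# Toy pronunciation table (stand-in for a TTS front end).
TOY_PINYIN: Dict[str, str] = {"踢": "ti", "键": "jian", "毽": "jian", "子": "zi"}

# Toy decoder table (stand-in for a speech recognizer):
# maps a syllable sequence to its most plausible written form.
TOY_DECODER: Dict[Tuple[str, ...], str] = {("ti", "jian", "zi"): "踢毽子"}


def to_pinyin(text: str) -> List[str]:
    """Convert each character to its syllable using the toy table."""
    return [TOY_PINYIN[ch] for ch in text]


def recognize(syllables: List[str]) -> str:
    """Decode a syllable sequence back to text using the toy table."""
    return TOY_DECODER[tuple(syllables)]


def round_trip(user_text: str) -> str:
    """Text -> pronunciation -> recognized text."""
    return recognize(to_pinyin(user_text))


recognized = round_trip("踢键子")
```

Because the homophones share one syllable, the misspelled input and the correct spelling decode to the same recognized text, which is what makes the later character-by-character comparison informative.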
The server 403 may compare the characters in the user text 404 and the recognition text 406 one by one according to the reading order, and may determine that the "key" in the user text 404 is different from the corresponding "shuttlecock" in the recognition text 406. Thus, a "key" in the user text 404 may be determined as the difference word 407.
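A minimal Python sketch of this character-by-character comparison follows. It assumes the user text and the recognized text have the same length, which is plausible when recognition is driven by one syllable per character but is not stated in the source; the function name and the placeholder strings are hypothetical.

```python
from typing import List, Tuple


def difference_characters(user_text: str, recognized_text: str) -> List[Tuple[int, str]]:
    """Return (position, character) for each user-text character that
    differs from the character at the same position in the recognized text."""
    # zip stops at the shorter string, so unequal lengths are silently truncated.
    return [
        (i, u)
        for i, (u, r) in enumerate(zip(user_text, recognized_text))
        if u != r
    ]


# Placeholder strings: "X" stands for the wrongly written character,
# "Y" for the character the recognizer produced at that position.
diffs = difference_characters("abcXef", "abcYef")
```

Keeping the position alongside the character lets later steps extract the surrounding words without searching the text again.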
Server 403 may extract the words in user text 404 in which difference word 407 is located to form word set 408. The length of each extracted word in which difference word 407 is located is not less than 2 and not greater than 4. Specifically, word set 408 includes the following 9 words: kick key, play kick key, key tour, kick key, play kick key.
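The extraction of surrounding words can be sketched as a sliding window in Python. The sketch assumes that "words" here means every contiguous substring of length 2 to 4 that contains the difference character; under that assumption, a character in the interior of a sentence yields 2 + 3 + 4 = 9 substrings, which agrees with the count of 9 given for word set 408. The function name and the placeholder text are hypothetical.

```python
from typing import List


def words_containing(text: str, index: int,
                     min_len: int = 2, max_len: int = 4) -> List[str]:
    """Return all substrings of length min_len..max_len that cover text[index]."""
    words = []
    for length in range(min_len, max_len + 1):
        # Earliest and latest window starts that still cover `index`
        # and stay inside the text.
        first_start = max(0, index - length + 1)
        last_start = min(index, len(text) - length)
        for start in range(first_start, last_start + 1):
            words.append(text[start:start + length])
    return words


# Placeholder text with the difference character "X" at index 4.
window_words = words_containing("abcdXefgh", 4)
```

Near the start or end of the text fewer windows fit, so the result can contain fewer than 9 substrings.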
Server 403 may then look up each word in word set 408 in lexicon 409, resulting in search result 410. The search result 410 indicates that no word in the word set 408 is found in the lexicon 409. Based on this, the correction result 411 of the job 401 written by the user can be obtained. The correction result 411 indicates that the "key" in the job 401 written by the user may be a wrongly-written word. The server 403 may also send the correction result 411 to the terminal device 402, so that the user can check, according to the correction result, whether the "key" in the job 401 is indeed a wrongly-written word and correct it accordingly.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the process 300 of the method for processing text in this embodiment highlights the following step: after the user text is compared with the recognition text to obtain the difference word set, whether each difference word is a suspected wrongly-written word is determined according to whether the words in the user text in which the difference word is located exist in the word bank. This helps improve the accuracy of the determined suspected wrongly-written words.
With further reference to FIG. 5, a flow 500 of yet another embodiment of a method for processing text is shown. The process 500 of the method for processing text comprises the following steps:
Step 501, acquiring a text corresponding to the content written by the user as a user text.
Step 502, determining a voice feature of a voice corresponding to the user text, and performing voice recognition by using the voice feature to obtain a recognition text.
Step 503, selecting a word different from the word corresponding to the recognition text from the user text as a difference word, and obtaining a difference word set.
Step 504, for a difference word in the difference word set, extracting the words in which the difference word is located from the user text to form a word set corresponding to the difference word; determining whether a preset word bank includes any word in the word set corresponding to the difference word; and, in response to determining that the word bank includes none of the words in the word set corresponding to the difference word, determining the difference word as a suspected wrongly-written word.
The specific implementation process of the above steps 501, 502, 503 and 504 can refer to the related description of the steps 301, 302, 303 and 304 in the corresponding embodiment of fig. 3, and is not repeated here.
Step 505, selecting characters and words belonging to a preset frequently-wrong word library from the user text as candidate words to obtain a candidate word set.
In this embodiment, the thesaurus of frequently-wrong words may be preset by a technician. For example, the thesaurus of frequently-wrong words may be derived from statistical analysis of the processing results for a large amount of user text. As another example, the thesaurus of frequently-wrong words may be provided by a third party. It should be understood that the thesaurus of frequently-wrong words includes single words as well as words.
For each word in the user text, if the word exists in the frequently-wrong word library, this indicates that the word is prone to being wrongly written.
Step 506, for the candidate word in the candidate word set, determining whether the candidate word is a suspected wrongly-written word, and updating the processing result in response to determining that the candidate word is a suspected wrongly-written word.
In this embodiment, since a candidate word is prone to being wrongly written, each determined candidate word can be examined further, and when a candidate word is determined to be a suspected wrongly-written word, it can be added to the processing result as a supplement.
It should be understood that if the candidate word determined as the suspected wrongly written word is also the difference word determined as the suspected wrongly written word, the candidate word is recorded only once.
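Steps 505 and 506 can be sketched in Python as a scan of the user text followed by a merge that records each suspect only once. The representation of a suspect as a `(position, text)` pair and both function names are assumptions made for this sketch; the check that decides whether a candidate is actually suspected is left out here.

```python
from typing import List, Set, Tuple

Suspect = Tuple[int, str]  # (start position in the user text, matched text)


def select_candidates(user_text: str, frequent_error_library: Set[str]) -> List[Suspect]:
    """Find every occurrence of every library entry in the user text."""
    hits: List[Suspect] = []
    for entry in frequent_error_library:
        start = user_text.find(entry)
        while start != -1:
            hits.append((start, entry))
            start = user_text.find(entry, start + 1)
    return sorted(hits)


def merge_results(difference_suspects: Set[Suspect],
                  candidate_suspects: Set[Suspect]) -> Set[Suspect]:
    """Combine both sources; a suspect found by both is kept only once."""
    return difference_suspects | candidate_suspects


candidates = select_candidates("a of b of", {"of"})
merged = merge_results({(2, "of")}, set(candidates))
```

Set union gives the "recorded only once" behaviour for free, provided both sources describe a suspect with the same position and text.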
In this embodiment, different methods for determining whether the candidate word is a suspected wrongly-written word may be selected according to different application scenarios.
Optionally, when the frequently-wrong word library is constructed, the wrong usage patterns of each word in the library can be recorded. In this case, whether a candidate word is a suspected wrongly-written word can be determined by judging whether the way the candidate word is used matches one of its recorded wrong usage patterns. If it does, the candidate word can be determined to be a suspected wrongly-written word. For example, whether the candidate word is used correctly can be determined from the sentence in the user text in which the candidate word is located.
It should be understood that, when a multi-character word is said to be wrongly written, this may mean that one or more characters in the word are wrongly written, or that every character in the word is wrongly written. The specific meaning can be set flexibly according to actual application requirements.
Optionally, for a candidate word in the candidate word set, determining whether the candidate word is a suspected wrongly-written word may include: determining whether the candidate word is a suspected wrongly-written word according to the grammar rule corresponding to the candidate word.
Here, a grammar rule may refer to a rule that a word must satisfy when it is used correctly. It should be understood that different words may have different grammar rules.
For example, "的", "得", and "地" (rendered literally as "of", "get", and "ground") are words that users are likely to misuse when writing. Therefore, these three words can be recorded in the frequently-wrong word library. In this case, the grammar rule corresponding to "的" may be that, in reading order, the part of speech of the word preceding "的" is an adjective or a pronoun and the part of speech of the word following "的" is a noun. The grammar rule corresponding to "得" may be that, in reading order, the part of speech of the word preceding "得" is a verb and the part of speech of the word following "得" is an adverb. The grammar rule corresponding to "地" may be that, in reading order, the part of speech of the word preceding "地" is an adverb and the part of speech of the word following "地" is a verb.
Therefore, in this case, if the content written by the user includes any of these three words, each occurrence in the user text is determined as a candidate word. Whether the candidate word is used correctly can then be determined according to the part of speech of the word preceding it and the part of speech of the word following it in reading order. If the candidate word is used incorrectly, it can be determined to be a suspected wrongly-written word.
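The part-of-speech rules above lend themselves to a small table-driven check, sketched here in Python. The sketch assumes the text has already been segmented and tagged by some external tool, that the three particles rendered above as "of", "get", and "ground" are 的, 得, and 地, and that the tag names (`adj`, `pron`, `noun`, `verb`, `adv`) are free to choose; none of this is fixed by the source.

```python
from typing import Dict, List, Set, Tuple

# particle -> (allowed tags of the preceding word, allowed tags of the following word)
RULES: Dict[str, Tuple[Set[str], Set[str]]] = {
    "的": ({"adj", "pron"}, {"noun"}),  # rendered "of"
    "得": ({"verb"}, {"adv"}),          # rendered "get"
    "地": ({"adv"}, {"verb"}),          # rendered "ground"
}


def misused_particles(tagged: List[Tuple[str, str]]) -> List[Tuple[int, str]]:
    """Return (index, token) for each particle whose neighbours break its rule."""
    suspects = []
    for i, (token, _tag) in enumerate(tagged):
        if token not in RULES:
            continue
        allowed_prev, allowed_next = RULES[token]
        # A particle at the very start or end has no neighbour and fails the rule.
        prev_tag = tagged[i - 1][1] if i > 0 else None
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if prev_tag not in allowed_prev or next_tag not in allowed_next:
            suspects.append((i, token))
    return suspects


# English glosses stand in for the neighbouring words; only the tags matter.
wrong_use = misused_particles([("run", "verb"), ("地", "particle"), ("fast", "adv")])
right_use = misused_particles([("run", "verb"), ("得", "particle"), ("fast", "adv")])
```

In the first call 地 follows a verb and precedes an adverb, which matches the rule for 得 rather than 地, so it is flagged; the second call satisfies the rule for 得 and returns nothing.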
As another example, quantifiers (measure words) are also prone to misuse. Thus, individual quantifiers (e.g., individual, piece, grain, species, match, bar, etc.) may be recorded in the frequently-wrong word library. In this case, the grammar rule of each quantifier may be that, in reading order, the quantifier and the word following it form a fixed collocation.
For example, for each quantifier, the fixed collocations corresponding to the quantifier are collected to obtain a fixed collocation set corresponding to the quantifier. When a quantifier is detected in the content written by the user, it is judged, in reading order, whether the collocation formed by the quantifier and the following word belongs to the fixed collocation set corresponding to the quantifier. If not, the quantifier can be determined as a suspected wrongly-written word.
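The fixed collocation check can be sketched in Python as a lookup of each quantifier's following word. The collocation table below is invented purely for illustration, with English glosses standing in for the actual quantifiers and nouns, and the text is assumed to be segmented into tokens already.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical fixed collocation sets: quantifier -> words allowed to follow it.
FIXED_COLLOCATIONS: Dict[str, Set[str]] = {
    "piece": {"paper", "cake"},
    "grain": {"rice", "sand"},
}


def misused_quantifiers(tokens: List[str]) -> List[Tuple[int, str]]:
    """Return (index, quantifier) where the following word is not an allowed collocate."""
    suspects = []
    # The last token has no following word, so it is not checked.
    for i, token in enumerate(tokens[:-1]):
        allowed = FIXED_COLLOCATIONS.get(token)
        if allowed is not None and tokens[i + 1] not in allowed:
            suspects.append((i, token))
    return suspects


bad = misused_quantifiers(["one", "grain", "paper"])
good = misused_quantifiers(["one", "piece", "paper"])
```

A check of this kind is only as good as its collocation sets: a valid pairing missing from the table would be flagged, which is one reason the results are described as suspected rather than confirmed errors.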
As can be seen from fig. 5, compared with the embodiments corresponding to fig. 2 and fig. 3, the flow 500 of the method for processing text in the present embodiment highlights that, in addition to determining suspected wrongly-written words in the content written by the user by means of speech processing technology, the words in that content that belong to the frequently-wrong word library are identified and checked for correct use. The scheme described in this embodiment therefore checks the words in the content written by the user from multiple aspects, which strengthens the check, yields a more comprehensive processing result, and reduces the number of wrongly-written words that are missed.
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for processing text, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 6, the apparatus 600 for processing text provided by the present embodiment includes an obtaining unit 601, a recognition unit 602, a selecting unit 603, and a processing unit 604. Wherein, the obtaining unit 601 is configured to obtain a text corresponding to the content written by the user as the user text; the recognition unit 602 is configured to determine a speech feature of speech corresponding to the user text, and perform speech recognition using the speech feature to obtain a recognized text; the selecting unit 603 is configured to select a word different from the corresponding word in the recognition text from the user text as a difference word, resulting in a difference word set; the processing unit 604 is configured to determine a processing result of the content written by the user according to the difference word set, wherein the processing result is used for indicating suspected wrongly written words appearing in the content written by the user.
In the present embodiment, for the specific processing of the obtaining unit 601, the recognition unit 602, the selecting unit 603, and the processing unit 604 of the apparatus 600 for processing text, and the technical effects thereof, reference may be made to the related descriptions of step 201, step 202, step 203, and step 204 in the embodiment corresponding to fig. 2, which are not repeated here.
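One way to picture the four units is as interchangeable callables wired together in a fixed order. The Python sketch below is only an illustration of that composition; the class name, field names, and the trivial stand-in implementations are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextProcessingApparatus:
    obtain: Callable[[bytes], str]                  # obtaining unit 601
    recognize: Callable[[str], str]                 # recognition unit 602
    select: Callable[[str, str], List[str]]         # selecting unit 603
    process: Callable[[str, List[str]], List[str]]  # processing unit 604

    def run(self, written_content: bytes) -> List[str]:
        """Pass the written content through the four units in order."""
        user_text = self.obtain(written_content)
        recognized_text = self.recognize(user_text)
        difference_words = self.select(user_text, recognized_text)
        return self.process(user_text, difference_words)


# Trivial stand-ins: "recognition" rewrites X as Y, and every difference is reported.
apparatus = TextProcessingApparatus(
    obtain=lambda raw: raw.decode("utf-8"),
    recognize=lambda text: text.replace("X", "Y"),
    select=lambda user, rec: [u for u, r in zip(user, rec) if u != r],
    process=lambda user, diffs: diffs,
)
result = apparatus.run(b"aXb")
```

Keeping each unit behind a plain callable makes it easy to swap, for example, a word bank check into the processing unit without touching the other three.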
In some optional implementations of this embodiment, the processing unit 604 is further configured to, for a difference word in the difference word set, extract a word in which the difference word is located from the user text to form a word set corresponding to the difference word; and determining whether the difference word is a suspected wrongly-written word or not according to the word set corresponding to the difference word.
In some optional implementations of the present embodiment, the processing unit 604 is further configured to determine whether a preset lexicon includes words in a set of words corresponding to the difference word; in response to determining that the thesaurus does not include a word in the set of words to which the difference word corresponds, determining the difference word as a suspected wrongly-written word.
In some optional implementations of the present embodiment, the selecting unit 603 is further configured to select, from the user text, characters and words belonging to a preset frequently-wrong word library as candidate words, so as to obtain a candidate word set; the processing unit 604 is further configured to determine, for a candidate word in the candidate word set, whether the candidate word is a suspected wrongly-written word; and updating the processing result in response to determining that the candidate word is a suspected wrongly written word.
In some optional implementations of this embodiment, the processing unit 604 is further configured to determine whether the candidate word is a suspected wrongly-written word according to a grammar rule corresponding to the candidate word.
In some optional implementations of the present embodiment, the apparatus 600 for processing text further includes: the receiving unit (not shown in the figure) is configured to receive user feedback information for the processing result; the updating unit (not shown in the figure) is configured to update the thesaurus according to the user feedback information.
In some optional implementations of the embodiment, the updating unit is further configured to update the set of wrongly-typed words constructed for the user according to the user feedback information.
In the apparatus provided by the above embodiment of the present disclosure, the obtaining unit obtains a text corresponding to the content written by the user as a user text; the recognition unit determines the voice features of the voice corresponding to the user text and performs voice recognition using the voice features to obtain a recognition text; the selecting unit selects, from the user text, words that differ from the corresponding words in the recognition text as difference words to obtain a difference word set; and the processing unit determines a processing result of the content written by the user according to the difference word set, wherein the processing result is used for indicating suspected wrongly-written words appearing in the content written by the user. In this way, suspected wrongly-written words in the content written by the user can be detected, and the user or a person associated with the user can check them in a targeted manner, which effectively addresses the long time consumption and low efficiency of checking for wrongly-written words by manual inspection alone.
Referring now to FIG. 7, a block diagram of an electronic device (e.g., the server of FIG. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device/server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a text corresponding to the content written by the user as a user text; determining voice characteristics of voice corresponding to the user text, and performing voice recognition by using the voice characteristics to obtain a recognition text; selecting characters different from the corresponding characters in the recognition text from the user text as difference characters to obtain a difference character set; and determining a processing result of the content written by the user according to the difference word set, wherein the processing result is used for indicating suspected wrongly written words appearing in the content written by the user.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a recognition unit, a selection unit, and a processing unit. The names of the units do not form a limitation to the unit itself in some cases, and for example, the acquiring unit may also be described as a "unit that acquires text corresponding to content written by the user as the user text".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, a technical solution may be formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method for processing text, comprising:
acquiring a text corresponding to the content written by the user as a user text;
determining voice characteristics of voice corresponding to the user text, and performing voice recognition by using the voice characteristics to obtain a recognition text;
selecting characters different from the corresponding characters in the recognition text from the user text as difference characters to obtain a difference character set;
and determining a processing result of the content written by the user according to the difference word set, wherein the processing result is used for indicating suspected wrongly written words appearing in the content written by the user.
2. The method of claim 1, wherein said determining a processing result of said user-written content from said set of difference words comprises:
for the difference characters in the difference character set, extracting the words where the difference characters are from the user text to form a word set corresponding to the difference characters; and determining whether the difference word is a suspected wrongly-written word or not according to the word set corresponding to the difference word.
3. The method of claim 2, wherein the determining whether the difference word is a suspected wrongly-written word according to the word set corresponding to the difference word comprises:
determining whether a preset word bank comprises words in a word set corresponding to the difference words;
and in response to determining that the word bank does not include a word in the word set corresponding to the difference word, determining the difference word as a suspected wrongly-written word.
4. The method of claim 1, wherein the method further comprises:
selecting characters and words belonging to a preset frequently-wrong character word library from the user text as candidate words to obtain a candidate character word set;
for the candidate words in the candidate word set, determining whether the candidate words are suspected wrongly written words; and updating the processing result in response to determining that the candidate word is a suspected wrongly written word.
5. The method of claim 4, wherein the determining whether the candidate word is a suspected wrongly written word comprises:
and determining whether the candidate word is a suspected wrongly-written word or not according to the grammar rule corresponding to the candidate word.
6. The method of claim 3, wherein the method further comprises:
receiving user feedback information aiming at the processing result;
and updating the word bank according to the user feedback information.
7. The method according to one of claims 1-6, wherein the method further comprises:
and updating the wrongly-typed character set constructed aiming at the user according to the user feedback information.
8. An apparatus for processing text, comprising:
an acquisition unit configured to acquire a text corresponding to content written by a user as a user text;
a recognition unit configured to determine a voice feature of a voice corresponding to the user text, and perform voice recognition by using the voice feature to obtain a recognition text;
a selecting unit configured to select a word different from a corresponding word in the recognition text from the user text as a difference word, resulting in a difference word set;
and a processing unit configured to determine a processing result of the content written by the user according to the difference word set, wherein the processing result is used for indicating suspected wrongly written words appearing in the content written by the user.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010134249.7A CN112307748A (en) 2020-03-02 2020-03-02 Method and device for processing text

Publications (1)

Publication Number Publication Date
CN112307748A true CN112307748A (en) 2021-02-02

Family

ID=74336350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010134249.7A Pending CN112307748A (en) 2020-03-02 2020-03-02 Method and device for processing text

Country Status (1)

Country Link
CN (1) CN112307748A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541837A (en) * 2010-12-22 2012-07-04 张家港市赫图阿拉信息技术有限公司 Method for correcting inputted Chinese characters
CN108121455A (en) * 2016-11-29 2018-06-05 渡鸦科技(北京)有限责任公司 Identify method and device for correcting
CN109065031A (en) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 Voice annotation method, device and equipment
CN110534100A (en) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 A kind of Chinese speech proofreading method and device based on speech recognition
CN110610180A (en) * 2019-09-16 2019-12-24 腾讯科技(深圳)有限公司 Method, device and equipment for generating recognition set of wrongly-recognized words and storage medium
CN110705217A (en) * 2019-09-09 2020-01-17 上海凯京信达科技集团有限公司 Wrongly-written character detection method and device, computer storage medium and electronic equipment
CN110807319A (en) * 2019-10-31 2020-02-18 北京奇艺世纪科技有限公司 Text content detection method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420983A (en) * 2021-06-23 2021-09-21 科大讯飞股份有限公司 Writing evaluation method, device, equipment and storage medium
CN113420983B (en) * 2021-06-23 2024-04-12 科大讯飞股份有限公司 Writing evaluation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination