CN105404903B - Information processing method and device and electronic equipment - Google Patents

Publication number
CN105404903B
Authority
CN
China
Prior art keywords
word
information
similar
words
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410468559.7A
Other languages
Chinese (zh)
Other versions
CN105404903A (en)
Inventor
贾沛
孙林
薛苏葵
李众庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201410468559.7A priority Critical patent/CN105404903B/en
Publication of CN105404903A publication Critical patent/CN105404903A/en
Application granted granted Critical
Publication of CN105404903B publication Critical patent/CN105404903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

Embodiments of the invention disclose an information processing method, an information processing device, and electronic equipment. Keywords are extracted from electronic text content associated with the information that is to be recognized as electronic text. The extracted keywords, their synonyms, and the words associated with them form the core of a training database that is used to correct the recognized data. Because the acquired electronic text content has already been manually corrected, it is treated as 100% accurate, and because it is associated with the information to be recognized, correcting the words obtained through optical character recognition against it improves the accuracy of the optical character recognition.

Description

Information processing method and device and electronic equipment
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to an information processing method and apparatus, and an electronic device.
Background
Optical Character Recognition (OCR) is the process of scanning text material to obtain image files and then analyzing those image files to acquire the text and layout information. It is an important area of research and application in the field of automatic recognition technology.
At present, optical character recognition is highly accurate on printed text but has a low recognition rate on handwritten text, so improving the accuracy of OCR on handwritten text is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to provide an information processing method, an information processing device and electronic equipment, which are used for improving the accuracy of recognition of handwritten texts through OCR recognition.
In order to achieve the purpose, the invention provides the following technical scheme:
An information processing method, applied to an electronic device. In advance, electronic text content associated with the information that is to be recognized as electronic text is acquired; keywords are extracted from the acquired electronic text content; synonyms of the extracted keywords and words associated with the extracted keywords are obtained; and the extracted keywords, the synonyms of the keywords, and the words associated with the extracted keywords constitute a lexicon (referred to interchangeably below as the word stock, word bank, word library, or thesaurus). The method comprises the following steps:
identifying first information in the information to be identified to obtain a first word;
searching whether a second word similar to the first word exists in the word stock;
and when a second word similar to the first word exists in the word stock, replacing the first word with the second word.
In the above method, preferably, identifying the first information in the information to be identified to obtain the first word includes:
performing optical character recognition on first information in the information to be recognized to obtain a first word;
said searching for the presence of a second word in said thesaurus that is similar to said first word comprises:
and searching whether a second word with a font similar to that of the first word exists in the word stock.
In the above method, preferably, the information to be recognized is speech; said searching for the presence of a second word in said thesaurus that is similar to said first word comprises:
and searching whether a second word with the pronunciation similar to that of the first word exists in the word bank.
The method preferably further includes, after replacing the first word with the second word:
recording the number of times the first word is replaced by the second word.
In the method, preferably, when the number of times that the first word is replaced by the second word is greater than a first preset threshold, the first information in the information to be recognized is recognized as the second word in the process of recognizing the information to be recognized.
The above method, preferably, further comprises:
when at least two second words similar to the first words exist in the word stock, displaying the at least two second words;
selecting a second word according to a selection instruction triggered by a user;
the replacing the first word with the second word comprises:
and replacing the first word with a second word selected according to a selection instruction triggered by the user.
The above method, preferably, further comprises:
when the word library does not have a word similar to the first word, judging whether the first word is a word selected according to a selection instruction triggered by a user;
when the first word is a word selected according to a selection instruction triggered by a user, recording the number of times that the first word is triggered and selected by the user;
and when the number of times of triggering and selecting the first word by the user is greater than a second preset threshold value, adding the first word into the word stock.
An information processing apparatus, applied to an electronic device that has access to a thesaurus, the thesaurus comprising: keywords extracted from acquired electronic text content, synonyms of the extracted keywords, and words associated with the extracted keywords; wherein the acquired electronic text content is electronic text content associated with the information that is to be recognized as electronic text; the apparatus comprises:
the identification module is used for identifying first information in the information to be identified to obtain a first word;
the searching module is used for searching whether a second word similar to the first word exists in the word bank;
and the replacing module is used for replacing the first word with a second word when the second word similar to the first word exists in the word stock.
The above apparatus, preferably, the identification module includes:
the first recognition unit is used for carrying out optical character recognition on first information in the information to be recognized to obtain a first word;
the searching module comprises:
and the first searching unit is used for searching whether a second word with a character pattern similar to that of the first word exists in the word stock.
In the above apparatus, preferably, the information to be recognized is speech;
the identification module comprises:
the second recognition unit is used for carrying out voice recognition on first information in the information to be recognized to obtain a first word;
the searching module comprises:
and the second searching unit is used for searching whether a second word with the pronunciation similar to that of the first word exists in the word bank.
The above apparatus, preferably, further comprises:
the first recording module is used for recording the times of replacing the first word by the second word after the replacing module replaces the first word by the second word.
In the above apparatus, preferably, the recognition module is further configured to, when the number of times that the first word is replaced by the second word is greater than a first preset threshold, recognize the first information in the information to be recognized as the second word in the process of recognizing the information to be recognized.
The above apparatus, preferably, further comprises:
the display module is used for displaying at least two second words similar to the first word when the at least two second words exist in the word stock;
the selection module is used for selecting a second word according to a selection instruction triggered by a user;
the replacement module is specifically configured to replace the first word with the second word selected according to the selection instruction triggered by the user.
The above apparatus, preferably, further comprises:
the judging module is used for judging whether the first word is a word selected according to a selection instruction triggered by a user when the word similar to the first word does not exist in the word bank;
the second recording module is used for recording the times of triggering and selecting the first word by the user when the first word is the word selected according to the selection instruction triggered by the user;
and the adding module is used for adding the first word into the word stock when the number of times of triggering and selecting the first word by the user is greater than a second preset threshold value.
An electronic device comprising the information processing apparatus as described above.
According to the scheme, the information processing method is applied to the electronic equipment, and the electronic text content associated with the information to be identified, which is to be identified as the electronic text, is obtained; extracting keywords in the acquired electronic text content; obtaining synonyms of the extracted keywords and words associated with the extracted keywords; the extracted keywords, synonyms of the keywords, and words associated with the extracted keywords constitute a lexicon; the method comprises the following steps: identifying first information in the information to be identified to obtain a first word; searching whether a second word similar to the first word exists in the word stock; and when a second word similar to the first word exists in the word stock, replacing the first word with the second word.
In the embodiment of the application, keywords are extracted from the electronic text content associated with the information that is to be recognized as electronic text, and the extracted keywords, their synonyms, and the words associated with them form the core of a training database that is used to correct the recognized data. Because the acquired electronic text content has already been manually corrected, it is treated as 100% accurate, and because it is associated with the information to be recognized, correcting the words obtained through optical character recognition with the information processing method provided by the embodiment of the application improves the accuracy of the optical character recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
Fig. 1 is a flowchart of an implementation of an information processing method according to an embodiment of the present application;
fig. 2 is a flowchart of another implementation of an information processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an identification module according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a lookup module according to an embodiment of the present application;
fig. 6 is another schematic structural diagram of an identification module according to an embodiment of the present application;
fig. 7 is another schematic structural diagram of a lookup module according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The information processing method and device provided by the embodiment of the application are applied to electronic equipment.
In the embodiment of the application, electronic text content associated with information to be identified as the electronic text is acquired in advance; extracting keywords in the acquired electronic text content; obtaining synonyms of the extracted keywords and words associated with the extracted keywords; the extracted keywords, synonyms of the keywords, and words associated with the extracted keywords constitute a thesaurus.
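The lexicon construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frequency-based `extract_keywords` helper, the stopword list, and the `synonyms` and `associated` lookup tables are all assumptions introduced here, since the text does not say how keywords, synonyms, or associated words are obtained.

```python
from collections import Counter

# Hypothetical stopword list; the source does not specify any keyword extractor.
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "is"})

def extract_keywords(text, top_n=3):
    """Assumed keyword extractor: the most frequent non-stopword tokens."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def build_lexicon(text, synonyms, associated, top_n=3):
    """Lexicon = extracted keywords + their synonyms + their associated words."""
    lexicon = set()
    for keyword in extract_keywords(text, top_n):
        lexicon.add(keyword)
        lexicon.update(synonyms.get(keyword, ()))    # synonyms of the keyword
        lexicon.update(associated.get(keyword, ()))  # words associated with it
    return lexicon

slide_text = "Traffic congestion in the city. Traffic peaks and traffic lines."
lexicon = build_lexicon(
    slide_text,
    synonyms={"traffic": ["transport"]},
    associated={"traffic": ["bus", "subway"]},
    top_n=1,
)
```

In a real system the two lookup tables would come from a synonym dictionary and from statistics over a corpus; here they are passed in so that the sketch stays self-contained.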
The information to be recognized can be characters to be subjected to optical character recognition, such as handwritten characters or print characters; the information to be recognized may also be speech information.
In different application scenarios, the information to be recognized and the electronic text content associated with it may differ. For example, a PPT electronic document is usually presented at a meeting or in a school class, and the people present usually also write by hand or talk with one another during the meeting or class. The information to be recognized can therefore be handwritten characters or voice segments, and the electronic text content associated with the information to be recognized can be the electronic text content of the PPT electronic document.
Of course, in addition to PPT documents, commonly used electronic documents include: WORD documents, PDF documents, etc., and thus the electronic text content associated with the information to be identified may also be the electronic text content in the WORD documents, PDF documents.
In other words, the information to be recognized is information generated based on the electronic text content.
Wherein the words associated with the extracted keyword may include lower-level words of the extracted keyword or words having a high degree of correlation with the extracted keyword. If the extracted keyword is "traffic", then the words associated with "traffic" may include: buses, subways, taxis, motor vehicles, roads, lines, congestion, peaks, slowness, and the like.
Which words count as having a high degree of correlation with an extracted keyword can be determined statistically. For example, the probability that a given word appears when the extracted keyword appears may be counted. Specifically, when the probability that a given word appears together with the extracted keyword is greater than a third preset threshold, that word may be determined to be a word with a high degree of correlation with the extracted keyword.
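The statistical test just described might look like the sketch below. It assumes that "appears when the keyword appears" means "appears in the same document", and it uses whitespace tokenization; the corpus, the function name `associated_words`, and the threshold value are illustrative assumptions only.

```python
def associated_words(documents, keyword, third_threshold):
    """Return words w with P(w appears | keyword appears) > third_threshold.

    A 'document' is the assumed co-occurrence unit; the source leaves it open.
    """
    token_sets = [set(doc.lower().split()) for doc in documents]
    with_keyword = [tokens for tokens in token_sets if keyword in tokens]
    if not with_keyword:
        return set()
    counts = {}
    for tokens in with_keyword:
        for word in tokens - {keyword}:
            counts[word] = counts.get(word, 0) + 1
    total = len(with_keyword)
    return {word for word, n in counts.items() if n / total > third_threshold}

corpus = [
    "traffic bus congestion",
    "traffic bus subway",
    "weather rain",
    "traffic congestion bus",
]
related = associated_words(corpus, "traffic", third_threshold=0.5)
```

With this toy corpus, "bus" co-occurs with "traffic" in 3 of 3 documents and "congestion" in 2 of 3, so both pass a threshold of 0.5, while "subway" (1 of 3) does not.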
Referring to fig. 1, fig. 1 is a flowchart of an implementation of an information processing method according to an embodiment of the present application, which may include:
step S11: identifying first information in the information to be identified to obtain a first word;
step S12: searching whether a second word similar to the first word exists in the word stock;
in the embodiment of the present application, the word stock is a word stock constructed according to electronic text content associated with information to be recognized as an electronic text.
Step S13: and when a second word similar to the first word exists in the word stock, replacing the first word with the second word.
And after a second word similar to the first word is found in the constructed word stock, replacing the first word with the second word, namely correcting the first word obtained by recognition into the second word.
If the second word is the same word as the first word, no substitution may be made.
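Steps S11 to S13 can be sketched as a single correction function applied to the recognizer's output. The similarity test is deliberately left pluggable; the `one_char_apart` helper used as a default is a stand-in assumption, not the glyph or pronunciation comparison that the description details later.

```python
def one_char_apart(a, b):
    """Stand-in similarity test (assumption): same length, at most one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def correct_word(first_word, lexicon, is_similar=one_char_apart):
    """S12/S13: look for a similar second word in the lexicon and substitute it."""
    if first_word in lexicon:
        return first_word            # second word equals first word: no substitution
    for second_word in sorted(lexicon):
        if is_similar(first_word, second_word):
            return second_word       # S13: replace the first word with the second word
    return first_word                # nothing similar found: keep the recognized word

lexicon = {"traffic", "subway"}
corrected = correct_word("trafflc", lexicon)   # "trafflc" stands for an S11 recognition result
```

Iterating over `sorted(lexicon)` only makes the sketch deterministic; the source does not prescribe a search order.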
According to the information processing method provided by the embodiment of the application, keywords are extracted from the electronic text content associated with the information that is to be recognized as electronic text, and the extracted keywords, their synonyms, and the words associated with them form the core of a training database that is used to correct the recognized data. Because the acquired electronic text content has already been manually corrected, it is treated as 100% accurate, and because it is associated with the information to be recognized, correcting the recognized characters with the information processing method provided by the embodiment of the application improves the recognition accuracy.
In the foregoing embodiment, preferably, the recognizing the first information in the information to be recognized to obtain the first word may include:
and carrying out optical character recognition on first information in the information to be recognized to obtain a first word.
In this embodiment of the application, the information to be recognized may be handwritten characters or printed characters, both of which may be recognized through Optical Character Recognition (OCR). Handwritten characters can also be recognized by a handwriting recognition engine that detects the stroke trajectory during writing.
Correspondingly, the searching whether a second word similar to the first word exists in the word stock may include:
and searching whether a second word with a font similar to that of the first word exists in the word stock.
When a second word whose glyph is similar to that of the first word exists in the word stock, this indicates that a second word similar to the first word exists in the word stock.
In the embodiment of the present application, words having a font similar to that of the first word may be predetermined. Whether the font of a certain word (hereinafter referred to as a third word) is similar to the font of the first word or not can be judged by judging whether the structures of the characters at the corresponding positions in the first word and the third word are similar or not. Specifically, when the structures of the characters at the corresponding positions in the first word and the third word are similar, the font style of the first word is determined to be similar to that of the third word. Whether the glyphs of the two characters at the corresponding positions are similar can be judged by the following methods:
the four-corner features of the two characters can be compared: when the four-corner features of the two characters are the same, the glyphs of the two characters are determined to be similar; otherwise, the glyphs of the two characters are determined to be different.
Alternatively,
the first two strokes and the last two strokes used when writing the two characters can be compared: if the first two strokes and the last two strokes of the two characters are the same, the glyphs of the two characters are determined to be similar; otherwise, the glyphs of the two characters are determined to be different.
All third words whose glyphs are similar to that of the first word form the glyph-similar word group corresponding to the first word. In the embodiment of the present application, when searching whether a second word with a glyph similar to that of the first word exists in the word bank, if a second word in the constructed word bank belongs to the glyph-similar word group corresponding to the first word, that second word is determined to be a word with a glyph similar to that of the first word; that is, a second word with a glyph similar to that of the first word exists in the word bank.
If a plurality of second words with the font similar to the first word exist in the word stock, the second word with the font highest in similarity with the first word in the word stock can be determined as the word with the font similar to the font of the first word.
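The four-corner comparison and the glyph-similar word group can be sketched as below. The `FOUR_CORNER` table is a tiny hand-written stand-in: its codes are illustrative values that have not been checked against a four-corner dictionary, and treating identical characters as similar is an assumption of this sketch.

```python
# Illustrative four-corner codes (assumed values, not verified dictionary entries).
FOUR_CORNER = {"未": "5090", "末": "5090", "大": "4003"}

def chars_similar(a, b, codes=FOUR_CORNER):
    """Two characters have similar glyphs when their four-corner features are the same."""
    if a == b:
        return True                      # assumption: identical characters count as similar
    code_a = codes.get(a)
    return code_a is not None and code_a == codes.get(b)

def glyphs_similar(first_word, third_word, codes=FOUR_CORNER):
    """Compare the characters at corresponding positions of the two words."""
    if len(first_word) != len(third_word):
        return False
    return all(chars_similar(a, b, codes) for a, b in zip(first_word, third_word))

def similar_glyph_group(first_word, vocabulary, codes=FOUR_CORNER):
    """All third words whose glyphs are similar to those of the first word."""
    return {w for w in vocabulary if w != first_word and glyphs_similar(first_word, w, codes)}

def find_second_words(first_word, lexicon, vocabulary, codes=FOUR_CORNER):
    """Second words are lexicon entries that belong to the first word's group."""
    return lexicon & similar_glyph_group(first_word, vocabulary, codes)

vocabulary = {"周未", "周末", "周大"}
matches = find_second_words("周未", lexicon={"周末", "交通"}, vocabulary=vocabulary)
```

The stroke-based alternative would swap the code table for a table of first-two and last-two strokes per character and keep the rest unchanged.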
In the above embodiment, preferably, the information to be recognized may also be speech. The identifying the first information in the information to be identified to obtain the first word may include:
and performing voice recognition on first information in the information to be recognized to obtain a first word.
Correspondingly, the searching whether a second word similar to the first word exists in the word stock may include:
and searching whether a second word with the pronunciation similar to that of the first word exists in the word bank.
And when a second word with the pronunciation similar to that of the first word exists in the word stock, indicating that the second word similar to the first word exists in the word stock.
In the embodiment of the present application, words with pronunciation similar to that of the first word may be predetermined. Whether the pronunciation of a word (hereinafter referred to as a third word) is similar to the pronunciation of the first word or not can be determined by determining whether the pronunciations of the characters at the corresponding positions in the first word and the third word are similar to each other. Specifically, when the pronunciations of the characters at the corresponding positions in the first word and the third word are similar, the pronunciations of the first word and the third word are determined to be similar. Whether the pronunciation of the two characters at the corresponding positions is similar can be judged by the following methods:
the pinyin of the two characters can be directly compared to determine whether the pinyin of the two characters is the same, and if the pinyin of the two characters is the same, the pronunciation of the two characters is determined to be similar.
Alternatively,
whether the voice templates corresponding to the two characters are similar or not can be compared, and when the similarity of the voice templates corresponding to the two characters is larger than a fourth preset threshold value, the pronunciation of the two characters is similar.
All third words whose pronunciation is similar to that of the first word form the pronunciation-similar word group corresponding to the first word. In the embodiment of the application, when searching whether a second word with a pronunciation similar to that of the first word exists in the word stock, if a second word in the constructed word stock belongs to the pronunciation-similar word group corresponding to the first word, that second word is determined to be a word with a pronunciation similar to that of the first word; that is, a second word with a pronunciation similar to that of the first word exists in the word stock.
If a plurality of second words with pronunciation similar to that of the first word exist in the word stock, the second word whose pronunciation has the highest similarity to that of the first word can be determined as the word with the pronunciation similar to that of the first word.
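The pinyin comparison can be sketched in the same shape as the glyph comparison. The `PINYIN` table is a small hand-written stand-in for a real pronunciation dictionary, and ignoring tones is an assumption of this sketch; the source only says that the pinyin of the two characters is compared.

```python
# Hand-written toneless pinyin table (assumption: tones are ignored).
PINYIN = {"交": "jiao", "胶": "jiao", "通": "tong", "痛": "tong", "停": "ting"}

def pronunciations_similar(first_word, third_word, pinyin=PINYIN):
    """Words sound similar when characters at corresponding positions share pinyin."""
    if len(first_word) != len(third_word):
        return False
    for a, b in zip(first_word, third_word):
        reading = pinyin.get(a)
        if reading is None or reading != pinyin.get(b):
            return False                 # unknown reading or different pinyin
    return True

def find_similar_sounding(first_word, lexicon, pinyin=PINYIN):
    """Lexicon entries whose pronunciation is similar to that of the first word."""
    return sorted(w for w in lexicon
                  if w != first_word and pronunciations_similar(first_word, w, pinyin))

candidates = find_similar_sounding("胶通", {"交通", "公交"})
```

The voice-template alternative would replace the equality test with a similarity score compared against the fourth preset threshold.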
In the foregoing embodiment, preferably, after replacing the first word with the second word, the method may further include:
recording the number of times the first word is replaced by the second word.
In the embodiment of the application, the times of replacing the first word are accumulated every time the first word is replaced by the second word.
Further, when the number of times of replacement of the first word by the second word is greater than a first preset threshold, in the process of identifying the information to be identified, identifying the first information in the information to be identified as the second word.
That is, when the number of times that the first word is replaced by the second word is greater than the first preset threshold, the first information is recognized directly as the second word, instead of first being recognized as the first word and then having the first word replaced by the second word, which improves recognition efficiency and accuracy.
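A minimal sketch of this counting rule follows. The class name `ReplacementTracker` and its methods are hypothetical, and the raw recognizer output is used as a stand-in key for "the first information", which is an assumption made to keep the example small.

```python
from collections import Counter

class ReplacementTracker:
    """Counts first-word -> second-word replacements and promotes frequent ones."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold
        self.counts = Counter()          # (first_word, second_word) -> times replaced
        self.direct = {}                 # first_word -> second_word recognized directly

    def record(self, first_word, second_word):
        """Accumulate one replacement; promote once the count exceeds the threshold."""
        key = (first_word, second_word)
        self.counts[key] += 1
        if self.counts[key] > self.first_threshold:
            self.direct[first_word] = second_word

    def recognize(self, raw_word):
        """Return the promoted second word directly, if there is one."""
        return self.direct.get(raw_word, raw_word)

tracker = ReplacementTracker(first_threshold=2)
tracker.record("trafflc", "traffic")
tracker.record("trafflc", "traffic")
before = tracker.recognize("trafflc")    # count is 2, not yet greater than the threshold
tracker.record("trafflc", "traffic")
after = tracker.recognize("trafflc")     # count is 3, greater than the threshold
```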
In the foregoing embodiment, preferably, another implementation flowchart of the information processing method provided in this embodiment is shown in fig. 2, and may include:
step S21: identifying first information in the information to be identified to obtain a first word;
step S22: searching whether a second word similar to the first word exists in the word stock; if exactly one second word similar to the first word exists in the word stock, executing step S25; if at least two such second words exist, executing step S23;
Step S23: displaying the at least two second words;
step S24: selecting a second word according to a selection instruction triggered by a user;
step S25: replacing the first word with the second word;
when at least two second words similar to the first word exist in the word stock, the first word is replaced by the second word selected according to a selection instruction triggered by the user.
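The branching in steps S22 to S25 can be sketched as follows. The user's selection instruction is modeled as a `choose` callback, and `differs_by_one` is a stand-in similarity test; both are assumptions for illustration rather than anything the source specifies.

```python
def differs_by_one(a, b):
    """Stand-in similarity test (assumption): same length, exactly one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def correct_with_choice(first_word, lexicon, choose, is_similar=differs_by_one):
    """S22: search; S23/S24: display and let the user pick; S25: replace."""
    if first_word in lexicon:
        return first_word                      # already a lexicon word: nothing to replace
    candidates = sorted(w for w in lexicon if is_similar(first_word, w))
    if not candidates:
        return first_word                      # no similar second word exists
    if len(candidates) == 1:
        return candidates[0]                   # single second word: go straight to S25
    return choose(candidates)                  # S23 + S24: user selects among the candidates

def pick_second(options):
    """Simulates a user who selects the second displayed option."""
    return options[1]

result = correct_with_choice("cxt", {"cat", "cot", "dog"}, choose=pick_second)
```

In an interactive application `choose` would display the candidates and wait for the user's selection.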
In the foregoing embodiment, preferably, when at least two second words similar to the first word exist in the thesaurus, after selecting a second word according to a selection instruction triggered by a user, the method may further include:
counting the number of times that each second word is triggered and selected by the user;
determining a second word which is triggered and selected by the user for the maximum times;
and when the number of times that the determined second word is triggered and selected is larger than a fifth preset threshold value, in the process of identifying the information to be identified, identifying the first information in the information to be identified as the determined second word.
That is to say, when the number of times that the second word is triggered and selected by the user is the largest and the number of times that the second word is triggered and selected by the user is greater than a fifth preset threshold, in the process of identifying the information to be identified, the first information in the information to be identified is identified as the second word that is triggered and selected by the user for the largest number of times and that is greater than the fifth preset threshold.
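The selection-count rule can be written as a small helper. The function name and the use of a `Counter` are assumptions; ties are resolved by whichever word `max` encounters first, which the source does not address.

```python
from collections import Counter

def preferred_second_word(selection_counts, fifth_threshold):
    """Return the most-selected second word if its count exceeds the threshold, else None."""
    if not selection_counts:
        return None
    word, count = max(selection_counts.items(), key=lambda item: item[1])
    return word if count > fifth_threshold else None

selections = Counter()
for chosen in ["cat", "cot", "cat", "cat", "cat"]:
    selections[chosen] += 1              # count each user-triggered selection

promoted = preferred_second_word(selections, fifth_threshold=3)
```

Once a word is returned, the recognizer can output it directly for the corresponding first information on later passes.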
In the above embodiment, when there is no second word similar to the first word in the word stock, the first word is considered to be correctly recognized, and the first word may not be processed.
In the above embodiment, it is preferable that the method further includes:
when the word library does not have a word similar to the first word, judging whether the first word is a word selected according to a selection instruction triggered by a user;
when the first word is a word selected according to a selection instruction triggered by a user, recording the number of times that the first word is triggered and selected by the user;
and when the number of times of triggering and selecting the first word by the user is greater than a second preset threshold value, adding the first word into the word stock.
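The three steps above amount to letting the lexicon grow from repeated user selections. The sketch below assumes the caller invokes `observe` only after the lookup found no similar word; the class and method names are hypothetical.

```python
from collections import Counter

class LexiconLearner:
    """Adds a word to the lexicon once the user has selected it often enough."""

    def __init__(self, lexicon, second_threshold):
        self.lexicon = lexicon
        self.second_threshold = second_threshold
        self.selected = Counter()        # first_word -> times selected by the user

    def observe(self, first_word, selected_by_user):
        """Call when no similar word exists in the lexicon; returns True if the word was added."""
        if first_word in self.lexicon or not selected_by_user:
            return False
        self.selected[first_word] += 1
        if self.selected[first_word] > self.second_threshold:
            self.lexicon.add(first_word)
            return True
        return False

learner = LexiconLearner(lexicon={"traffic"}, second_threshold=1)
first_time = learner.observe("kanban", selected_by_user=True)    # count 1, not above threshold
second_time = learner.observe("kanban", selected_by_user=True)   # count 2, above threshold
```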
Corresponding to the method embodiment, an embodiment of the present application further provides an information processing apparatus, and a schematic structural diagram of the information processing apparatus provided in the embodiment of the present application is shown in fig. 3, and the information processing apparatus may include:
an identification module 31, a search module 32 and a replacement module 33; wherein:
the identification module 31 is configured to identify first information in the information to be identified to obtain a first word;
the searching module 32 is configured to search the word bank for whether a second word similar to the first word exists;
in the embodiment of the present application, the word stock is a word stock constructed according to electronic text content associated with information to be recognized as an electronic text.
The replacing module 33 is configured to replace the first word with a second word similar to the first word when the second word exists in the thesaurus.
And after a second word similar to the first word is found in the constructed word stock, replacing the first word with the second word, namely correcting the first word obtained by recognition into the second word.
If the second word is the same word as the first word, no substitution may be made.
According to the information processing device provided by the embodiment of the application, keywords are extracted from the electronic text content associated with the information that is to be recognized as electronic text, and the extracted keywords, their synonyms, and the words associated with them form the core of a training database that is used to correct the recognized data. Because the acquired electronic text content has already been manually corrected, it is treated as 100% accurate, and because it is associated with the information to be recognized, correcting the recognized characters with the information processing device provided by the embodiment of the application improves the recognition accuracy.
In the above embodiment, a schematic structural diagram of the identification module 31 is shown in fig. 4, and may include:
a first recognition unit 41, configured to perform optical character recognition on first information in the information to be recognized to obtain a first word;
in this embodiment of the application, the information to be recognized may be handwritten characters or printed characters, either of which may be recognized through Optical Character Recognition (OCR). Handwritten characters may also be recognized by a handwriting recognition engine that detects the stroke trajectory during writing.
Accordingly, a schematic structural diagram of the search module 32 is shown in fig. 5, and may include:
a first searching unit 51, configured to search the thesaurus for whether a second word with a font similar to that of the first word exists.
When a second word whose font is similar to that of the first word exists in the word stock, this indicates that a second word similar to the first word exists in the word stock.
If a plurality of second words with fonts similar to that of the first word exist in the word stock, the second word whose font has the highest similarity to that of the first word may be determined as the second word similar to the first word.
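One way to picture the "highest font similarity" selection is a per-character comparison backed by a table of visually confusable glyphs. The sketch below is an assumption-laden illustration: the `CONFUSABLE` table, the 0.8 partial score, and the 0.9 threshold are hypothetical, and a real system would more likely compare glyph images or stroke features.

```python
# Hypothetical pairs of characters that OCR commonly confuses.
CONFUSABLE = {
    frozenset("il"), frozenset("l1"), frozenset("lI"),
    frozenset("0O"), frozenset("5S"),
}


def glyph_similarity(a: str, b: str) -> float:
    """Average per-position score: 1.0 if equal, 0.8 if confusable, else 0.0."""
    if not a or len(a) != len(b):
        return 0.0
    total = 0.0
    for x, y in zip(a, b):
        if x == y:
            total += 1.0
        elif frozenset((x, y)) in CONFUSABLE:
            total += 0.8
    return total / len(a)


def best_glyph_match(first_word, lexicon, threshold=0.9):
    """Return the lexicon word with the highest glyph similarity, if above threshold."""
    scored = [(glyph_similarity(first_word, w), w) for w in lexicon]
    if not scored:
        return None
    score, word = max(scored)
    return word if score >= threshold else None


print(best_glyph_match("1ine", ["traffic", "trip", "line"]))
```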
In the foregoing embodiment, preferably, the information to be recognized is speech.
Another schematic structural diagram of the identification module 31 is shown in fig. 6, and may include:
the second recognition unit 61 is configured to perform voice recognition on first information in the information to be recognized to obtain a first word;
correspondingly, another structural diagram of the search module 32 is shown in fig. 7, and may include:
a second searching unit 71, configured to search, in the thesaurus, whether a second word with a pronunciation similar to that of the first word exists.
When a second word whose pronunciation is similar to that of the first word exists in the word stock, this indicates that a second word similar to the first word exists in the word stock.
If a plurality of second words with pronunciations similar to that of the first word exist in the word stock, the second word whose pronunciation has the highest similarity to that of the first word may be determined as the second word similar to the first word.
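For the speech case, the same selection can be sketched over phoneme sequences rather than glyphs. The assumptions here are that each word-stock entry carries a phonetic transcription and that the recognizer exposes the phonemes of the first word; the transcriptions, the 0.7 threshold, and the function name are illustrative only.

```python
from difflib import SequenceMatcher


def best_pronunciation_match(first_phonemes, pronunciations, threshold=0.7):
    """Pick the word whose phoneme sequence is most similar to the recognized one.

    first_phonemes: space-separated phonemes of the recognized first word.
    pronunciations: mapping from word-stock word to space-separated phonemes.
    """
    target = first_phonemes.split()
    best_word, best_score = None, 0.0
    for word, phonemes in pronunciations.items():
        score = SequenceMatcher(None, target, phonemes.split()).ratio()
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else None


# Hypothetical transcriptions for three word-stock entries.
pron = {"trip": "T R IH P", "strip": "S T R IH P", "line": "L AY N"}
print(best_pronunciation_match("T R IH B", pron))
```

Comparing phoneme tokens instead of spelling keeps the measure independent of orthography, which matters when similar-sounding words are written very differently.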
On the basis of the embodiment shown in fig. 3, another schematic structural diagram of the information processing apparatus provided in the embodiment of the present application is shown in fig. 8, and may further include:
a first recording module 81, configured to record, after the replacing module replaces the first word with the second word, the number of times that the first word is replaced by the second word.
In the embodiment of the application, each time the first word is replaced by the second word, the count of such replacements is incremented.
Further, the identification module 31 is further configured to identify, when the number of times that the first word is replaced by the second word is greater than a first preset threshold, the first information in the information to be identified as the second word in the process of identifying the information to be identified.
That is, when the number of times that the first word is replaced by the second word is greater than the first preset threshold, the first information is recognized directly as the second word, instead of first being recognized as the first word and then having the first word replaced by the second word. This improves recognition efficiency and accuracy.
It should be noted that the first recording module 81 can also be applied to the embodiments shown in fig. 3 to 4.
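The replacement counting and the threshold-based shortcut can be sketched as below. The class name, method names, and default threshold of 5 are assumptions for illustration; the text only requires that a count be kept per replacement and compared with a first preset threshold.

```python
from collections import Counter


class ReplacementRecorder:
    """Counts how often each first word has been replaced by each second word."""

    def __init__(self, first_threshold=5):
        self.first_threshold = first_threshold
        self.counts = Counter()

    def record(self, first_word, second_word):
        """Call after the replacement module has replaced first_word."""
        self.counts[(first_word, second_word)] += 1

    def shortcut(self, first_word):
        """Return the second word to recognize directly, or None if no count
        for first_word exceeds the threshold yet."""
        best, best_count = None, self.first_threshold
        for (fw, sw), n in self.counts.items():
            if fw == first_word and n > best_count:
                best, best_count = sw, n
        return best


recorder = ReplacementRecorder(first_threshold=2)
for _ in range(3):
    recorder.record("trafflc", "traffic")
print(recorder.shortcut("trafflc"))
```

The recognizer would consult `shortcut` before running the search step, falling back to the normal search-and-replace path when it returns `None`.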
In the foregoing embodiment, preferably, on the basis of the embodiment shown in fig. 3, a schematic diagram of another structure of the information processing apparatus provided in the embodiment of the present application is shown in fig. 9, and may further include:
a display module 91 and a selection module 92, wherein:
the display module 91 is configured to display at least two second words similar to the first word when the at least two second words exist in the thesaurus;
the selection module 92 is configured to select a second word according to a selection instruction triggered by a user;
the replacing module 33 is further configured to replace the first word with the second word selected according to the selection instruction triggered by the user.
It should be noted that the display module 91 and the selection module 92 may also be applied to the embodiment shown in any one of fig. 4 to 8.
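The interaction between the display, selection, and replacement modules can be sketched with a callback standing in for the user interface. The `choose` callback and the function name are hypothetical; in a real device the callback would display the candidates and return the index the user picked.

```python
from typing import Callable, Sequence


def replace_with_user_choice(
    first_word: str,
    candidates: Sequence[str],
    choose: Callable[[Sequence[str]], int],
) -> str:
    """Return the word that should replace first_word.

    With two or more candidates, the candidates are handed to `choose`
    (display + user selection); with one, it is used directly; with none,
    the first word is kept.
    """
    if len(candidates) >= 2:
        return candidates[choose(candidates)]
    if len(candidates) == 1:
        return candidates[0]
    return first_word


# Simulated user who always picks the second displayed candidate.
print(replace_with_user_choice("rode", ["road", "route"], lambda shown: 1))
```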
In the foregoing embodiment, preferably, on the basis of the embodiment shown in fig. 9, a schematic diagram of another structure of the information processing apparatus provided in the embodiment of the present application is shown in fig. 10, and may further include:
a counting module 101, configured to count, when at least two second words similar to the first word exist in the word bank, the number of times that each second word is triggered and selected by the user after the selecting module 92 selects the second word according to a selection instruction triggered by the user;
the determining module 102 is configured to determine a second word that is selected by the user with the largest number of times;
the identifying module 31 may be further configured to identify, when the number of times that the determined second word is triggered to be selected is greater than a fifth preset threshold, in the process of identifying the information to be identified, the first information in the information to be identified as the second word determined by the determining module 102.
That is to say, when the second word that the user has selected most often has been selected more times than the fifth preset threshold, the first information in the information to be identified is identified directly as that second word during identification.
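The counting and determining modules can be pictured as a per-first-word tally of user selections. This is a sketch under assumptions: the class and method names are invented, and the default threshold value is arbitrary.

```python
from collections import Counter, defaultdict


class SelectionCounter:
    """Tallies which second word the user selects for each first word."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.counts = defaultdict(Counter)

    def record(self, first_word, chosen_second_word):
        """Call each time the user selects a second word for first_word."""
        self.counts[first_word][chosen_second_word] += 1

    def preferred(self, first_word):
        """Return the most-selected second word if its count exceeds the
        threshold; otherwise None."""
        counter = self.counts.get(first_word)
        if not counter:
            return None
        word, n = counter.most_common(1)[0]
        return word if n > self.threshold else None


sc = SelectionCounter(threshold=2)
for choice in ["road", "road", "route", "road"]:
    sc.record("rode", choice)
print(sc.preferred("rode"))
```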
In the foregoing embodiment, preferably, on the basis of the embodiment shown in fig. 3, a schematic diagram of another structure of the information processing apparatus provided in the embodiment of the present application is shown in fig. 11, and may further include:
the judging module 111 is configured to, when a word similar to the first word does not exist in the word bank, judge whether the first word is a word selected according to a selection instruction triggered by a user;
the second recording module 112 is configured to record, when the first word is a word selected according to a selection instruction triggered by a user, the number of times that the first word is triggered and selected by the user;
and the adding module 113 is configured to add the first word into the word stock when the number of times that the first word is triggered and selected by the user is greater than a second preset threshold.
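The judging, second recording, and adding modules together let the word stock grow from user behavior. A minimal sketch, assuming a set-based word stock and hypothetical names and threshold:

```python
from collections import Counter


class LexiconGrower:
    """Adds a word to the word stock once the user has selected it often enough."""

    def __init__(self, lexicon, second_threshold=3):
        self.lexicon = lexicon              # mutable set of words
        self.second_threshold = second_threshold
        self.counts = Counter()

    def observe(self, first_word, has_similar, selected_by_user):
        """Process one recognized word; return True if it was just added.

        has_similar: whether the search step found a similar word.
        selected_by_user: whether the user explicitly selected first_word.
        """
        if has_similar or not selected_by_user:
            return False
        self.counts[first_word] += 1
        if self.counts[first_word] > self.second_threshold:
            self.lexicon.add(first_word)
            return True
        return False


stock = {"traffic", "subway"}
grower = LexiconGrower(stock, second_threshold=1)
grower.observe("tram", has_similar=False, selected_by_user=True)
grower.observe("tram", has_similar=False, selected_by_user=True)
print("tram" in stock)
```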
An embodiment of the present application further provides an electronic device, which has the information processing apparatus described in any of the apparatus embodiments above.
The electronic device can be in various forms, such as a mobile phone, a palm computer, a tablet computer, a PC and the like.
A specific implementation of the embodiments of the present application is illustrated below.
Suppose a PPT electronic document is used in a meeting, the text in the PPT electronic document is: "Beijing governs traffic congestion 'urban disease': in recent years, Beijing has actively managed urban traffic congestion: the bus priority development strategy is adhered to, rail transit develops from 114 kilometers of 4 lines to 465 kilometers of 17 lines in ten years, the bus trip proportion is improved from 28% to 46%, and the rail transit is located at the first place of each major city; the traffic demand side management is implemented, and the over-fast growth of motor vehicles is restrained; promote scientific and technological innovation, improve traffic operating efficiency. "
Keywords are extracted from the electronic text. The extraction result is as follows:
governing, traffic, congestion, city, public transit, trip, management, operation, route, strategy, development, propulsion.
The keywords are then expanded with synonyms and words of high relevance. In this example, the expanded set is as follows:
administering, treating, organizing, grooming, transporting, public transportation, subway, taxi, motor vehicle, road, street, road, line, congestion, peak, slow, city, urban, public transportation, trip, operation, line, planning, strategy, tactical, development, promotion, propulsion.
The expanded words form the word stock.
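The construction of the word stock from keywords and their expansions can be sketched as a simple set union. The grouping of related words under each keyword below is an assumption made for illustration; the text lists the expanded words but does not say which keyword each one came from, nor how synonyms are looked up.

```python
def build_lexicon(keywords, related):
    """Union of the keywords and every synonym or associated word listed for them."""
    lexicon = set()
    for keyword in keywords:
        lexicon.add(keyword)
        lexicon.update(related.get(keyword, ()))
    return lexicon


# Hypothetical expansion table using words from the example above.
related = {
    "traffic": ["transporting", "subway", "taxi", "motor vehicle"],
    "route": ["line", "road", "street"],
    "city": ["urban"],
}

lexicon = build_lexicon(["traffic", "route", "city", "strategy"], related)
print(sorted(lexicon))
```

Using a set removes duplicates automatically, so a word that is associated with several keywords is stored once.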
Generally, the meeting minutes need to be compiled after the meeting. To improve efficiency, handwritten characters can be recognized through OCR and the speakers' voices can be recognized through speech recognition; during recognition, the recognized words can be corrected against the word stock to improve recognition accuracy. For example,
for the OCR recognition results:
if the word obtained by OCR is 'treatment', and the word 'treatment' also exists in the word stock, then 'treatment' is correct;
if the word obtained by OCR is 'metallurgical theory', and the word stock does not contain 'metallurgical theory' but does contain 'treatment', whose font is similar to that of 'metallurgical theory', then 'metallurgical theory' is replaced by 'treatment';
similarly, if the word obtained by OCR is 'subway', and the word 'subway' also exists in the word stock, then 'subway' is correct;
and if the word obtained by OCR is 'altar rank', and the word stock does not contain 'altar rank' but does contain 'subway', whose font is similar to that of 'altar rank', then 'altar rank' is replaced by 'subway'.
For the speech recognition result:
if the word obtained by speech recognition is 'treatment', and the word 'treatment' also exists in the word stock, then 'treatment' is correct;
if the word obtained by speech recognition is 'intelligence', and the word stock does not contain 'intelligence' but does contain 'treatment', whose pronunciation is similar to that of 'intelligence', then 'intelligence' is replaced by 'treatment';
similarly, if the word obtained by speech recognition is 'trip', and the word 'trip' also exists in the word stock, then 'trip' is correct;
if the word obtained by speech recognition is 'rudiment', and the word stock does not contain 'rudiment' but does contain 'trip', whose pronunciation is similar to that of 'rudiment', then 'rudiment' is replaced by 'trip';
further, if 'metallurgical theory' has been replaced by 'treatment' more than 5 times after OCR, then in subsequent OCR the first information that would otherwise be recognized as 'metallurgical theory' can be recognized directly as 'treatment'; similarly, if 'rudiment' has been replaced by 'trip' more than 5 times after speech recognition, then in subsequent speech recognition the voice segment that would otherwise be recognized as 'rudiment' can be recognized directly as 'trip', which improves recognition accuracy.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. An information processing method is applied to electronic equipment and is characterized in that electronic text content associated with information to be identified which is to be identified as electronic text is acquired; extracting keywords in the acquired electronic text content; obtaining synonyms of the extracted keywords and words associated with the extracted keywords; the extracted keywords, synonyms of the keywords, and words associated with the extracted keywords constitute a lexicon; the electronic text content is a text which is corrected manually; the words associated with the extracted keywords include lower-order words of the extracted keywords or words having a high degree of correlation with the extracted keywords;
the method comprises the following steps:
identifying first information in the information to be identified to obtain a first word;
searching whether a second word similar to the first word exists in the word stock;
and when a second word similar to the first word exists in the word stock, replacing the first word with the second word.
2. The method according to claim 1, wherein the first word obtained by identifying the first information in the information to be identified comprises:
performing optical character recognition on first information in the information to be recognized to obtain a first word;
said searching for the presence of a second word in said thesaurus that is similar to said first word comprises:
and searching whether a second word with a font similar to that of the first word exists in the word stock.
3. The method according to claim 1, wherein the information to be recognized is speech; said searching for the presence of a second word in said thesaurus that is similar to said first word comprises:
and searching whether a second word with the pronunciation similar to that of the first word exists in the word bank.
4. The method of any of claims 1-3, wherein after replacing the first word with the second word, further comprising:
recording the number of times the first word is replaced by the second word.
5. The method according to claim 4, wherein when the number of times of replacement of the first word by the second word is greater than a first preset threshold, in the process of identifying the information to be identified, the first information in the information to be identified is identified as the second word.
6. The method of any one of claims 1-3, further comprising:
when at least two second words similar to the first words exist in the word stock, displaying the at least two second words;
selecting a second word according to a selection instruction triggered by a user;
the replacing the first word with the second word comprises:
and replacing the first word with a second word selected according to a selection instruction triggered by the user.
7. The method of any one of claims 1-3, further comprising:
when the word library does not have a word similar to the first word, judging whether the first word is a word selected according to a selection instruction triggered by a user;
when the first word is a word selected according to a selection instruction triggered by a user, recording the number of times that the first word is triggered and selected by the user;
and when the number of times of triggering and selecting the first word by the user is greater than a second preset threshold value, adding the first word into the word stock.
8. An information processing apparatus applied to an electronic device, wherein the electronic device has access to a thesaurus, and the thesaurus includes: extracting keywords from the obtained electronic text content, synonyms of the extracted keywords, and words associated with the extracted keywords; wherein, the obtained electronic text content is as follows: electronic text content associated with information to be identified as a text; the electronic text content is a text which is corrected manually; the words associated with the extracted keywords include lower-order words of the extracted keywords or words having a high degree of correlation with the extracted keywords; the device comprises:
the identification module is used for identifying first information in the information to be identified to obtain a first word;
the searching module is used for searching whether a second word similar to the first word exists in the word bank;
and the replacing module is used for replacing the first word with a second word when the second word similar to the first word exists in the word stock.
9. The apparatus of claim 8, wherein the identification module comprises:
the first recognition unit is used for carrying out optical character recognition on first information in the information to be recognized to obtain a first word;
the searching module comprises:
and the first searching unit is used for searching whether a second word with a character pattern similar to that of the first word exists in the word stock.
10. The apparatus of claim 8, wherein the information to be recognized is speech;
the identification module comprises:
the second recognition unit is used for carrying out voice recognition on first information in the information to be recognized to obtain a first word;
the searching module comprises:
and the second searching unit is used for searching whether a second word with the pronunciation similar to that of the first word exists in the word bank.
11. The apparatus of any one of claims 8-10, further comprising:
the first recording module is used for recording the times of replacing the first word by the second word after the replacing module replaces the first word by the second word.
12. The apparatus according to claim 11, wherein the recognition module is further configured to, when the number of times that the first word is replaced by the second word is greater than a first preset threshold, recognize, as the second word, the first information in the information to be recognized in the process of recognizing the information to be recognized.
13. The apparatus of any one of claims 8-10, further comprising:
the display module is used for displaying at least two second words similar to the first word when the at least two second words exist in the word stock;
the selection module is used for selecting a second word according to a selection instruction triggered by a user;
the replacing module is specifically configured to replace the first word with the second word selected according to the selection instruction triggered by the user.
14. The apparatus of any one of claims 8-10, further comprising:
the judging module is used for judging whether the first word is a word selected according to a selection instruction triggered by a user when the word similar to the first word does not exist in the word bank;
the second recording module is used for recording the times of triggering and selecting the first word by the user when the first word is the word selected according to the selection instruction triggered by the user;
and the adding module is used for adding the first word into the word stock when the number of times of triggering and selecting the first word by the user is greater than a second preset threshold value.
15. An electronic device characterized by comprising the information processing apparatus according to any one of claims 8 to 14.
CN201410468559.7A 2014-09-15 2014-09-15 Information processing method and device and electronic equipment Active CN105404903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410468559.7A CN105404903B (en) 2014-09-15 2014-09-15 Information processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410468559.7A CN105404903B (en) 2014-09-15 2014-09-15 Information processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN105404903A CN105404903A (en) 2016-03-16
CN105404903B true CN105404903B (en) 2020-06-23

Family

ID=55470378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410468559.7A Active CN105404903B (en) 2014-09-15 2014-09-15 Information processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN105404903B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909996B (en) * 2017-11-02 2020-11-10 威盛电子股份有限公司 Voice recognition method and electronic device
CN109062903B (en) * 2018-08-22 2019-12-10 北京百度网讯科技有限公司 Method and apparatus for correcting wrongly written words
CN109816047B (en) * 2019-02-19 2022-05-24 北京达佳互联信息技术有限公司 Method, device and equipment for providing label and readable storage medium
CN110136688B (en) * 2019-04-15 2023-09-29 平安科技(深圳)有限公司 Text-to-speech method based on speech synthesis and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916941A (en) * 2005-08-18 2007-02-21 北大方正集团有限公司 Post-processing approach of character recognition
CN1941077A (en) * 2005-09-27 2007-04-04 株式会社东芝 Apparatus and method speech recognition of character string in speech input
CN101183281A (en) * 2007-12-26 2008-05-21 腾讯科技(深圳)有限公司 Method for inputting word related to candidate word in input method and system
CN102110229A (en) * 2009-12-29 2011-06-29 欧姆龙株式会社 Word recognition method, and information processing device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7813929B2 (en) * 2007-03-30 2010-10-12 Nuance Communications, Inc. Automatic editing using probabilistic word substitution models
US9460708B2 (en) * 2008-09-19 2016-10-04 Microsoft Technology Licensing, Llc Automated data cleanup by substitution of words of the same pronunciation and different spelling in speech recognition

Also Published As

Publication number Publication date
CN105404903A (en) 2016-03-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant