CN117009460A - Auxiliary information quick collection method for dictionary pen


Info

Publication number
CN117009460A
CN117009460A CN202310884628.1A CN202310884628A CN117009460A CN 117009460 A CN117009460 A CN 117009460A CN 202310884628 A CN202310884628 A CN 202310884628A CN 117009460 A CN117009460 A CN 117009460A
Authority
CN
China
Prior art keywords
vocabulary
database
text recognition
recognition result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310884628.1A
Other languages
Chinese (zh)
Inventor
王烈峰
詹晓沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Readboy Education Technology Co Ltd
Original Assignee
Readboy Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Readboy Education Technology Co Ltd
Priority to CN202310884628.1A
Publication of CN117009460A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of dictionary pens, in particular to a method for quickly collecting auxiliary information of dictionary pens.

Description

Auxiliary information quick collection method for dictionary pen
Technical Field
The invention relates to the technical field of dictionary pens, in particular to a method for quickly collecting auxiliary information of dictionary pens.
Background
A dictionary pen is a portable electronic device used mainly to assist language learning and information lookup. It gives users a convenient, fast tool for both, and helps them expand their vocabulary and improve their ability to understand and use language.
Chinese patent publication No. CN105335356A discloses a paper translation method with semantic recognition and a translation pen device. The method comprises the following steps: (1) performing basic coding on English characters, establishing a character coding library, a rule library and a font library, and combining and arranging them to form a coding preparation library; (2) scanning and recognizing the paper English text to be translated by OCR; (3) coding the recognized character string using the coding preparation library; (4) performing semantic processing on the coded character string to complete the coded semantic description; (5) obtaining accurately recognized English words through cognitive reasoning over the OCR-recognized words; (6) connecting the accurately recognized English words to an electronic dictionary to realize automatic translation. That method combines coding, semantic processing and reasoning with traditional OCR, and reduces the misrecognition rate of traditional OCR text recognition.
However, the prior art has the following problem:
in practice, the same English abbreviation has different definitions in different technical fields. The prior art does not take this into account, so the dictionary pen recognizes the meaning of English abbreviations inaccurately.
Disclosure of Invention
In order to solve the above problem, the present invention provides a method for quickly collecting auxiliary information of a dictionary pen, comprising:
step S1, setting a plurality of auxiliary databases, wherein each auxiliary database stores the abbreviated vocabulary of a different technical field and the paraphrase vocabulary associated with the abbreviated vocabulary;
step S2, obtaining a text recognition result of the dictionary pen, and determining the technical field to which the text recognition result belongs based on the text recognition result, wherein,
each word in the text recognition result is matched with the special vocabulary in a plurality of domain databases, the matching coincidence degree between the text recognition result and each domain database is calculated based on the matching result, the domain database matched with the text recognition result is obtained based on the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs;
step S3, identifying whether feature vocabulary appears in the text recognition result, determining the auxiliary database to be called based on the technical field to which the text recognition result belongs, and determining the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database;
step S4, outputting the paraphrase vocabulary corresponding to the text recognition result, comprising,
identifying the paraphrase vocabulary of the non-feature vocabulary and outputting it;
and outputting the paraphrase vocabulary of the feature vocabulary determined in step S3.
Further, in step S2, each domain database is constructed in advance, and the construction process of each domain database comprises,
step S21, crawling text data from a public document database of a single technical field;
step S22, performing word segmentation on each piece of text data to obtain a plurality of words, and constructing a sample vocabulary database;
step S23, repeating step S21 and step S22 to obtain sample vocabulary databases for a plurality of technical fields, and determining the common vocabulary in each sample vocabulary database, wherein,
the occurrence probability of each word in the sample vocabulary database is calculated, and the word is determined to be common vocabulary if it meets a preset vocabulary comparison condition;
the preset vocabulary comparison condition is that the word appears in every sample vocabulary database and its occurrence probability is higher than a preset vocabulary probability threshold;
and step S24, removing the common vocabulary from the sample vocabulary database to obtain the domain database.
Further, in step S2, the matching coincidence degree between the text recognition result and each domain database is determined, wherein,
the matching coincidence degree between the text recognition result and a domain database is calculated according to formula (1),
in formula (1), N represents the number of words in the text recognition result, and Ne represents the number of words in the text recognition result that match special vocabulary in the domain database.
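The image of formula (1) did not survive in this text. Based only on the definitions of N and Ne given here, one plausible reading is the share of recognized words that match the domain database. The symbol E below is a placeholder chosen for illustration and is not taken from the source:

```latex
E = \frac{N_e}{N}
```

Under this assumed reading, a text recognition result with N = 8 words, of which Ne = 2 match special vocabulary in a domain database, would give E = 0.25 for that database.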
Further, in step S2, each word in the text recognition result is matched with the special vocabulary in the plurality of domain databases, wherein,
if a single word is identical to a special word in a domain database, the word is determined to match that special word.
Further, in step S2, the domain database matched with the text recognition result is obtained based on the ranking result, wherein,
the matching coincidence degrees between the text recognition result and the domain databases are sorted in descending order, and the domain database with the largest matching coincidence degree is selected as the domain database matched with the text recognition result.
Further, in step S3, whether feature vocabulary appears in the text recognition result is identified, wherein,
each word in the text recognition result is compared with the complete English vocabulary in the standard dictionary database, and the word is determined to be feature vocabulary if no identical complete English word exists in the standard dictionary database.
Further, the standard dictionary database stores a plurality of complete English words and the paraphrase vocabulary associated with each of them.
Further, in step S3, the auxiliary database to be called is determined based on the technical field to which the text recognition result belongs, wherein,
the auxiliary database that stores the abbreviated vocabulary of that technical field and the associated paraphrase vocabulary is called.
Further, in step S3, the paraphrase vocabulary of the feature vocabulary is determined based on the content of the auxiliary database, wherein,
the feature vocabulary is compared with the abbreviated vocabulary in the called auxiliary database, and if an abbreviated word identical to the feature word exists in the auxiliary database, the paraphrase vocabulary associated with that abbreviated word is determined as the paraphrase vocabulary of the feature word.
Further, in step S4, identifying the paraphrase vocabulary of the non-feature vocabulary comprises comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and, if a complete English word identical to the non-feature word exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English word as the paraphrase vocabulary of the non-feature word.
Compared with the prior art, the invention sets auxiliary databases that store the abbreviated vocabulary of different technical fields and the associated paraphrase vocabulary. Each word in the text recognition result is matched with the special vocabulary in a plurality of domain databases, the matching coincidence degree is calculated from the matching result, the domain database matched with the text recognition result is obtained from the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs. The method then identifies whether feature vocabulary appears in the text recognition result, determines the auxiliary database to be called based on that technical field, and determines the paraphrase vocabulary corresponding to the text recognition result from the content of the auxiliary database. As a result, the dictionary pen recognizes the meaning of English abbreviations in different technical fields more accurately.
In particular, the invention sets a plurality of auxiliary databases, each storing abbreviated vocabulary and the paraphrase vocabulary associated with it. This provides database support, so that once the technical field to which the text recognition result belongs has been determined, the paraphrase vocabulary of an abbreviated English word can be output accurately, and the dictionary pen recognizes the meaning of English abbreviations in different technical fields more accurately.
In particular, the invention obtains the domain database matched with the text recognition result based on the ranking of the matching coincidence degrees, and determines the technical field corresponding to that domain database as the technical field to which the text recognition result belongs. In practice, the matching coincidence degree characterizes how well the text recognition result matches a domain database: the higher it is, the better the match. The domain database with the largest matching coincidence degree in the ranking is therefore taken as the domain database matched with the text recognition result. This determines the technical field of the text recognition result reliably, ensures that the feature vocabulary in the text recognition result is matched against the abbreviated vocabulary in the auxiliary database of that technical field, and improves the accuracy with which the dictionary pen recognizes the meaning of feature vocabulary in different technical fields.
In particular, the invention calls the auxiliary database that stores the abbreviated vocabulary of the technical field to which the text recognition result belongs, together with the associated paraphrase vocabulary, and determines the paraphrase vocabulary of the feature vocabulary from the content of that auxiliary database. In practice, feature vocabulary, that is, English abbreviated vocabulary, differs from the complete English vocabulary in the standard dictionary database, so its paraphrase vocabulary cannot be identified by matching it with the special vocabulary in the domain database; it is instead matched with the abbreviated vocabulary in the auxiliary database. Because an English abbreviation has different definitions in different technical fields, the technical field to which the text recognition result belongs is determined first, and the feature vocabulary is then matched with the abbreviated vocabulary of that technical field in the auxiliary database. The paraphrase vocabulary of the feature vocabulary is thus identified accurately, which improves the accuracy with which the dictionary pen recognizes feature vocabulary in different technical fields.
Drawings
FIG. 1 is a schematic diagram of the steps of the method for quickly collecting auxiliary information of a dictionary pen according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the steps of the construction process of a domain database according to an embodiment of the invention;
FIG. 3 is a flowchart for determining whether feature vocabulary appears in the text recognition result according to an embodiment of the invention.
Detailed Description
In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Referring to FIG. 1, which is a schematic diagram of the steps of the method for quickly collecting auxiliary information of a dictionary pen according to an embodiment of the invention, the method comprises:
step S1, setting a plurality of auxiliary databases, wherein each auxiliary database stores the abbreviated vocabulary of a different technical field and the paraphrase vocabulary associated with the abbreviated vocabulary;
step S2, obtaining a text recognition result of the dictionary pen, and determining the technical field to which the text recognition result belongs based on the text recognition result, wherein,
each word in the text recognition result is matched with the special vocabulary in a plurality of domain databases, the matching coincidence degree between the text recognition result and each domain database is calculated based on the matching result, the domain database matched with the text recognition result is obtained based on the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs;
step S3, identifying whether feature vocabulary appears in the text recognition result, determining the auxiliary database to be called based on the technical field to which the text recognition result belongs, and determining the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database;
step S4, outputting the paraphrase vocabulary corresponding to the text recognition result, comprising,
identifying the paraphrase vocabulary of the non-feature vocabulary and outputting it;
and outputting the paraphrase vocabulary of the feature vocabulary determined in step S3.
Specifically, the way the auxiliary databases are constructed is not limited. In this embodiment, a person skilled in the art may collate in advance the English abbreviations of different technical fields and their corresponding definitions, establish an association between each abbreviation and its definition, and store them to construct the auxiliary databases.
Specifically, in the invention the text recognition result includes at least the recognized English characters, but not the paraphrase corresponding to those characters. In dictionary pen recognition, the characters in the image are recognized first, and the corresponding meaning is then output based on those characters; this is not described further here.
Specifically, the invention sets a plurality of auxiliary databases, each storing abbreviated vocabulary and the paraphrase vocabulary associated with it. This provides database support, so that once the technical field to which the text recognition result belongs has been determined, the paraphrase vocabulary of an abbreviated English word can be output accurately, and the dictionary pen recognizes the meaning of English abbreviations in different technical fields more accurately.
Specifically, referring to FIG. 2, in step S2, each domain database is constructed in advance, and the construction process of each domain database comprises,
step S21, crawling text data from a public document database of a single technical field;
step S22, performing word segmentation on each piece of text data to obtain a plurality of words, and constructing a sample vocabulary database;
step S23, repeating step S21 and step S22 to obtain sample vocabulary databases for a plurality of technical fields, and determining the common vocabulary in each sample vocabulary database, wherein,
the occurrence probability of each word in the sample vocabulary database is calculated, and the word is determined to be common vocabulary if it meets a preset vocabulary comparison condition;
the preset vocabulary comparison condition is that the word appears in every sample vocabulary database and its occurrence probability is higher than a preset vocabulary probability threshold;
and step S24, removing the common vocabulary from the sample vocabulary database to obtain the domain database.
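Steps S23 and S24 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_domain_databases`, the in-memory dict layout, and the reading that a common word must exceed the threshold in every sample vocabulary database are all assumptions. The crawling and segmentation of steps S21 and S22 are taken as already done, and at least one field is assumed.

```python
from collections import Counter


def build_domain_databases(samples, threshold):
    """Hypothetical helper for steps S23-S24.

    samples: dict mapping a field name to its list of segmented words.
    Returns a dict mapping each field name to its set of special words.
    """
    # Occurrence probability of each word inside its own sample database.
    probs = {}
    for field, words in samples.items():
        counts = Counter(words)
        total = len(words)
        probs[field] = {w: c / total for w, c in counts.items()}

    # Step S23: a word is common if it appears in every sample database
    # and (assumption) its probability exceeds the threshold in each one.
    shared = set.intersection(*(set(p) for p in probs.values()))
    common = {w for w in shared if all(p[w] > threshold for p in probs.values())}

    # Step S24: remove the common words to leave the special vocabulary.
    return {field: set(p) - common for field, p in probs.items()}
```

With two toy fields where "the" is frequent in both, "the" is removed as common vocabulary while a rarer shared word is kept in both domain databases.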
Specifically, in this embodiment, the preset vocabulary probability threshold is obtained in advance by statistics: a number of special words belonging to a single technical field are selected, the occurrence probability of each of these special words in each sample vocabulary database is counted, and the average occurrence probability Δpl is calculated. The preset vocabulary probability threshold is set to p0 = Δpl × α, where α is an influence factor and 1.3 < α < 1.9.
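The threshold rule above amounts to a mean scaled by α. A small sketch, assuming the occurrence probabilities have already been counted; the function name and the default α = 1.5 are illustrative choices within the stated range, not values from the source:

```python
def vocabulary_probability_threshold(probabilities, alpha=1.5):
    """Return p0 = mean(probabilities) * alpha (hypothetical helper).

    probabilities: occurrence probabilities of the selected special words.
    alpha: influence factor; the text requires 1.3 < alpha < 1.9.
    """
    if not 1.3 < alpha < 1.9:
        raise ValueError("alpha must satisfy 1.3 < alpha < 1.9")
    if not probabilities:
        raise ValueError("at least one occurrence probability is required")
    mean = sum(probabilities) / len(probabilities)
    return mean * alpha
```

Because α is greater than 1, the threshold sits above the average probability of the selected special words.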
In this embodiment, word segmentation comprises deleting the punctuation, spaces, numbers and symbols in the text data and then segmenting the sentences into words. The word segmentation tool is not limited; a person skilled in the art can select a suitable tool as required. This is prior art and is not described further.
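Since the tool is left open, the following is only one possible stand-in for English text: it strips everything that is not an ASCII letter and splits on the remaining whitespace. The function name `segment` is hypothetical, and a real deployment would likely use a dedicated segmentation library.

```python
import re


def segment(text):
    """Drop punctuation, digits and symbols, then split into words.

    Assumption: the input is English, so whitespace marks word boundaries.
    Case is preserved so that abbreviations such as "CPU" stay intact.
    """
    cleaned = re.sub(r"[^A-Za-z]+", " ", text)
    return cleaned.split()
```

Note that this simple rule also splits hyphenated words and contractions into separate parts.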
Specifically, in this embodiment, a person skilled in the art can re-screen the special vocabulary in the domain databases so that each domain database is more descriptive of its technical field.
Specifically, in step S2, the matching coincidence degree between the text recognition result and each domain database is determined, wherein,
the matching coincidence degree between the text recognition result and a domain database is calculated according to formula (1),
in formula (1), N represents the number of words in the text recognition result, and Ne represents the number of words in the text recognition result that match special vocabulary in the domain database.
Specifically, in step S2, each word in the text recognition result is matched with the special vocabulary in the plurality of domain databases, wherein,
if a single word is identical to a special word in a domain database, the word is determined to match that special word.
Specifically, in step S2, the domain database matched with the text recognition result is obtained based on the ranking result, wherein,
the matching coincidence degrees between the text recognition result and the domain databases are sorted in descending order, and the domain database with the largest matching coincidence degree is selected as the domain database matched with the text recognition result.
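The matching and ranking described above might be sketched like this. The formula for the matching coincidence degree is not reproduced in this text, so the ratio Ne / N used here is an assumption based on the definitions of N and Ne; the function names and the dict-of-sets layout are also hypothetical.

```python
def matching_coincidence(words, special_words):
    """Share of recognized words found in one domain database (assumed Ne / N)."""
    if not words:
        return 0.0
    matched = sum(1 for w in words if w in special_words)  # exact match only
    return matched / len(words)


def select_domain(words, domain_databases):
    """Rank fields by matching coincidence degree, highest first."""
    scores = {
        field: matching_coincidence(words, special_words)
        for field, special_words in domain_databases.items()
    }
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranking[0][0], ranking
```

The source does not say how ties are resolved; in this sketch the field listed first in `domain_databases` wins because Python's sort is stable.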
Specifically, the method obtains the domain database matched with the text recognition result based on the ranking of the matching coincidence degrees, and determines the technical field corresponding to that domain database as the technical field to which the text recognition result belongs. In practice, the matching coincidence degree characterizes how well the text recognition result matches a domain database: the higher it is, the better the match. The domain database with the largest matching coincidence degree in the ranking is therefore taken as the domain database matched with the text recognition result. This determines the technical field of the text recognition result reliably, ensures that the feature vocabulary in the text recognition result is matched against the abbreviated vocabulary in the auxiliary database of that technical field, and improves the accuracy with which the dictionary pen recognizes the meaning of feature vocabulary in different technical fields.
Specifically, referring to FIG. 3, which is a flowchart for determining whether feature vocabulary appears in the text recognition result according to an embodiment of the invention, in step S3, whether feature vocabulary appears in the text recognition result is identified, wherein,
each word in the text recognition result is compared with the complete English vocabulary in the standard dictionary database, and the word is determined to be feature vocabulary if no identical complete English word exists in the standard dictionary database.
Specifically, the standard dictionary database stores a plurality of complete English words and the paraphrase vocabulary associated with each of them.
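The feature vocabulary test reduces to a membership check against the standard dictionary. A minimal sketch, assuming the standard dictionary database is available as a Python dict from word to paraphrase; the function name is hypothetical:

```python
def find_feature_words(words, standard_dictionary):
    """Return the words with no identical entry in the standard dictionary.

    Comparison is exact, following the "same as" wording in the text;
    whether case should be normalized first is not stated in the source.
    """
    return [w for w in words if w not in standard_dictionary]
```

Any word returned here would then be looked up in the auxiliary database of the determined technical field rather than in the standard dictionary.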
Specifically, the way the standard dictionary database is set up is not limited. It can be built with a database management system such as MySQL or SQLite to store the complete English vocabulary and the associated paraphrase vocabulary; any implementation that provides this function is sufficient, and it is not described further.
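As one illustration of the SQLite option, the sketch below uses Python's built-in `sqlite3` module. The table name `standard_dictionary`, its two columns, and the helper names are assumptions made for this example; the source does not specify a schema.

```python
import sqlite3


def create_standard_dictionary(entries, path=":memory:"):
    """Create a word -> paraphrase table and load (word, paraphrase) pairs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS standard_dictionary ("
        "word TEXT PRIMARY KEY, paraphrase TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO standard_dictionary (word, paraphrase) VALUES (?, ?)",
        entries,
    )
    conn.commit()
    return conn


def lookup(conn, word):
    """Return the paraphrase of a word, or None if the word is absent."""
    row = conn.execute(
        "SELECT paraphrase FROM standard_dictionary WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None
```

A `None` result from `lookup` corresponds to the case where the word is treated as feature vocabulary. The same two-column layout could hold an auxiliary database, with abbreviations in place of complete words.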
Specifically, the invention does not limit how the association is established between the English vocabulary and the corresponding paraphrase vocabulary in the standard dictionary database, or between the abbreviated vocabulary and the corresponding paraphrase vocabulary in the auxiliary databases. In the computer field there are many ways of associating data; this embodiment may use any data structure that achieves the above function, which is not described further here.
Specifically, in step S3, the auxiliary database to be called is determined based on the technical field to which the text recognition result belongs, wherein,
the auxiliary database that stores the abbreviated vocabulary of that technical field and the associated paraphrase vocabulary is called.
Specifically, in step S3, the paraphrase vocabulary of the feature vocabulary is determined based on the content of the auxiliary database, wherein,
the feature vocabulary is compared with the abbreviated vocabulary in the called auxiliary database, and if an abbreviated word identical to the feature word exists in the auxiliary database, the paraphrase vocabulary associated with that abbreviated word is determined as the paraphrase vocabulary of the feature word.
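The field-dependent lookup can be illustrated with a well-known ambiguous abbreviation. The nested dict layout, the field names, and the function name below are assumptions for the example only:

```python
def paraphrase_feature_word(feature_word, field, auxiliary_databases):
    """Look up an abbreviation in the auxiliary database of the given field.

    Returns None when the field or the abbreviation is unknown; the source
    does not say what the dictionary pen should output in that case.
    """
    auxiliary = auxiliary_databases.get(field, {})
    return auxiliary.get(feature_word)


# Illustrative auxiliary databases: one abbreviation, two fields.
AUXILIARY_DATABASES = {
    "communications": {"ATM": "asynchronous transfer mode"},
    "finance": {"ATM": "automated teller machine"},
}
```

The same abbreviation resolves differently depending on which field step S2 selected, which is the behavior the method is designed to provide.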
Specifically, the invention calls the auxiliary database that stores the abbreviated vocabulary of the technical field to which the text recognition result belongs, together with the associated paraphrase vocabulary, and determines the paraphrase vocabulary of the feature vocabulary from the content of that auxiliary database. In practice, feature vocabulary, that is, English abbreviated vocabulary, differs from the complete English vocabulary in the standard dictionary database, so its paraphrase vocabulary cannot be identified by matching it with the special vocabulary in the domain database; it is instead matched with the abbreviated vocabulary in the auxiliary database. Because an English abbreviation has different definitions in different technical fields, the technical field to which the text recognition result belongs is determined first, and the feature vocabulary is then matched with the abbreviated vocabulary of that technical field in the auxiliary database. The paraphrase vocabulary of the feature vocabulary is thus identified accurately, which improves the accuracy with which the dictionary pen recognizes feature vocabulary in different technical fields.
Specifically, in step S4, identifying the paraphrase vocabulary of the non-feature vocabulary comprises comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and, if a complete English word identical to the non-feature word exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English word as the paraphrase vocabulary of the non-feature word.
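Putting steps S2 to S4 together, an end-to-end pass over one text recognition result could look like the sketch below. It is a simplified illustration under several assumptions: all databases are plain Python dicts and sets, the matching coincidence degree is taken as the share of matching words, `domain_databases` is non-empty, and the function name `collect_paraphrases` is invented for this example.

```python
def collect_paraphrases(words, standard_dictionary, domain_databases, auxiliary_databases):
    """Return (field, [(word, paraphrase or None), ...]) for a recognition result."""

    def coincidence(special_words):
        # Assumed reading of the matching coincidence degree: matches / total.
        return sum(w in special_words for w in words) / len(words) if words else 0.0

    # Step S2: the field whose domain database matches best.
    field = max(domain_databases, key=lambda f: coincidence(domain_databases[f]))
    auxiliary = auxiliary_databases.get(field, {})

    results = []
    for word in words:
        if word in standard_dictionary:
            # Step S4: non-feature word, defined by the standard dictionary.
            results.append((word, standard_dictionary[word]))
        else:
            # Step S3: feature word, defined by the field's auxiliary database.
            results.append((word, auxiliary.get(word)))
    return field, results
```

A feature word missing from the selected auxiliary database is reported with `None`; how the device should handle that case is not covered by the source.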
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims (10)

1. A method for quickly collecting auxiliary information of a dictionary pen, characterized by comprising the following steps:
step S1, setting a plurality of auxiliary databases and a plurality of domain databases, wherein each auxiliary database stores the abbreviated vocabulary of a different technical field and the paraphrase vocabulary associated with the abbreviated vocabulary, and each domain database stores the special vocabulary of a different technical field;
step S2, obtaining a text recognition result of the dictionary pen, and determining the technical field to which the text recognition result belongs based on the text recognition result, wherein,
each word in the text recognition result is matched with the special vocabulary in the plurality of domain databases, the matching coincidence degree between the text recognition result and each domain database is calculated based on the matching result, the domain database matched with the text recognition result is obtained based on the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs;
step S3, identifying whether feature vocabulary appears in the text recognition result, determining the auxiliary database to be called based on the technical field to which the text recognition result belongs, and determining the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database;
step S4, outputting the paraphrase vocabulary corresponding to the text recognition result, comprising,
identifying the paraphrase vocabulary of the non-feature vocabulary and outputting it;
and outputting the paraphrase vocabulary of the feature vocabulary determined in step S3.
2. The method for quickly collecting auxiliary information for a dictionary pen according to claim 1, wherein in step S2 each domain database is constructed in advance, and the construction process of each domain database comprises:
step S21, crawling text data from a public document database of a single technical field;
step S22, performing word segmentation on each piece of text data to obtain a plurality of vocabularies, and constructing a sample vocabulary database;
step S23, repeating step S21 and step S22 to obtain sample vocabulary databases for a plurality of technical fields, and determining the common vocabulary in each sample vocabulary database, wherein
the occurrence probability of a vocabulary in a sample vocabulary database is calculated, and the vocabulary is determined to be a common vocabulary if the preset vocabulary comparison condition is met;
the preset vocabulary comparison condition is that the vocabulary appears in every sample vocabulary database and its occurrence probability is higher than a preset vocabulary probability threshold; and
step S24, removing the common vocabulary from the sample vocabulary database to obtain the domain database.
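Steps S23 and S24 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: crawling (S21) and word segmentation (S22) are taken as already done, so the input is a hypothetical mapping from field name to a list of segmented words; the occurrence probability is assumed to be the relative frequency within a sample vocabulary database; and the threshold is assumed to apply in every sample database, which is one reading of the claim. The function name `build_domain_databases` is invented for this sketch.

```python
from collections import Counter


def build_domain_databases(sample_words_by_field, probability_threshold):
    """Remove common vocabulary from each field's sample vocabulary.

    sample_words_by_field maps a field name to its list of segmented words.
    Returns a mapping from field name to the set of remaining (special) words.
    """
    if not sample_words_by_field:
        return {}

    # Relative frequency of each word within its own sample vocabulary database.
    probabilities = {}
    for field, words in sample_words_by_field.items():
        counts = Counter(words)
        total = len(words)
        probabilities[field] = {w: c / total for w, c in counts.items()}

    # Assumed reading of the comparison condition: the word appears in every
    # sample database and exceeds the threshold in each of them.
    shared = set.intersection(*(set(p) for p in probabilities.values()))
    common = {
        w for w in shared
        if all(p[w] > probability_threshold for p in probabilities.values())
    }

    # Step S24: what remains after removing common words is the domain database.
    return {field: set(p) - common for field, p in probabilities.items()}


samples = {
    "medicine": ["the", "dose", "the", "enzyme"],
    "computing": ["the", "kernel", "the", "thread"],
}
print(build_domain_databases(samples, 0.3))
```

In this toy input, "the" appears in both sample databases with relative frequency 0.5, so it is treated as common and removed from both.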
3. The method for quickly collecting auxiliary information for a dictionary pen according to claim 1, wherein in step S2 the matching coincidence degree between the text recognition result and each domain database is determined, wherein
the matching coincidence degree between the text recognition result and the domain database is calculated according to formula (1),
in formula (1), N represents the number of vocabularies in the text recognition result, and Ne represents the number of vocabularies in the text recognition result that match the special vocabulary in the domain database.
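Formula (1) itself is not reproduced in this text, so the following Python sketch rests on an assumption: that the matching coincidence degree is the ratio Ne / N of the two quantities defined above. The function name `matching_coincidence_degree` is hypothetical, and the published formula may include a scaling factor or other terms.

```python
def matching_coincidence_degree(recognized_words, domain_vocabulary):
    """Assumed form of formula (1): Ne / N.

    N  = number of words in the text recognition result.
    Ne = number of those words that exactly match a special word
         in the domain database.
    """
    n = len(recognized_words)
    if n == 0:
        # No words recognized: report zero overlap rather than dividing by zero.
        return 0.0
    n_e = sum(1 for word in recognized_words if word in domain_vocabulary)
    return n_e / n


words = ["the", "enzyme", "dose", "works"]
print(matching_coincidence_degree(words, {"enzyme", "dose", "receptor"}))
```

Passing the domain vocabulary as a set keeps each membership test constant-time, which matters when a scan result is compared against several domain databases.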
4. The method for quickly collecting auxiliary information for a dictionary pen according to claim 3, wherein in step S2 each vocabulary in the text recognition result is matched with the special vocabulary in the plurality of domain databases, wherein
if a single vocabulary is identical to a special vocabulary in the domain database, the vocabulary is judged to match that special vocabulary.
5. The method according to claim 4, wherein in step S2 the domain database matched with the text recognition result is obtained based on the sorting result, wherein
the matching coincidence degrees between the text recognition result and each domain database are arranged in descending order, and the domain database corresponding to the maximum matching coincidence degree is selected as the domain database matched with the text recognition result.
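The selection in claim 5 can be sketched in Python as follows, assuming the matching coincidence degrees have already been computed and are supplied as a hypothetical mapping from field name to degree. The function name `select_domain` and the example values are illustrative. The claim does not say how ties are resolved; this sketch keeps the first field encountered, because Python's sort is stable.

```python
def select_domain(degrees):
    """Return the field whose domain database has the largest degree.

    degrees maps a field name to its matching coincidence degree.
    Returns None when no domain database was compared.
    """
    # Arrange the degrees in descending order, as the claim describes.
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    # The first entry corresponds to the maximum matching coincidence degree.
    return ranked[0][0] if ranked else None


example_degrees = {"computing": 0.10, "medicine": 0.45, "chemistry": 0.30}
print(select_domain(example_degrees))
```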
6. The method for quickly collecting auxiliary information for a dictionary pen according to claim 1, wherein in step S3 whether feature vocabulary appears in the text recognition result is recognized, wherein
each vocabulary in the text recognition result is compared with the complete English vocabulary in the standard dictionary database, and the vocabulary is judged to be a feature vocabulary if no identical complete English vocabulary exists in the standard dictionary database.
7. The method of claim 6, wherein the standard dictionary database stores a plurality of complete English vocabularies and the paraphrase vocabulary associated with each complete English vocabulary.
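The feature vocabulary test of claims 6 and 7 can be sketched in Python as a membership check against the standard dictionary database. The dictionary contents and the function name `split_feature_words` are hypothetical. The sketch uses exact string comparison, as the claim describes; case normalization and punctuation handling are left out and would be a separate design decision.

```python
# Hypothetical standard dictionary: complete English word -> definition.
STANDARD_DICTIONARY = {
    "the": "definite article",
    "runs": "operates",
    "fast": "at high speed",
}


def split_feature_words(recognized_words, standard_dictionary):
    """Separate words into feature words and non-feature words.

    A word with no identical entry in the standard dictionary is treated as
    a feature word (a candidate abbreviation); all others are non-feature
    words whose definitions can be read from the dictionary directly.
    """
    feature_words = []
    non_feature_words = []
    for word in recognized_words:
        if word in standard_dictionary:
            non_feature_words.append(word)
        else:
            feature_words.append(word)
    return feature_words, non_feature_words


features, others = split_feature_words(
    ["the", "CPU", "runs", "fast"], STANDARD_DICTIONARY
)
print(features, others)
```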
8. The method for quickly collecting auxiliary information for a dictionary pen according to claim 1, wherein in step S3 the auxiliary database to be called is determined based on the technical field to which the text recognition result belongs, wherein
the auxiliary database that stores the abbreviated vocabulary of that technical field and the paraphrase vocabulary associated with the abbreviated vocabulary is called.
9. The method according to claim 1, wherein in step S3 the paraphrase vocabulary of the feature vocabulary is determined based on the content of the auxiliary database, wherein
the feature vocabulary is compared with the plurality of abbreviated vocabularies in the called auxiliary database, and if an abbreviated vocabulary identical to the feature vocabulary exists in the auxiliary database, the paraphrase vocabulary associated with that abbreviated vocabulary is determined as the paraphrase vocabulary of the feature vocabulary.
10. The method according to claim 1, wherein in step S4 identifying the paraphrase vocabulary of the non-feature vocabulary includes comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and if a complete English vocabulary identical to the non-feature vocabulary exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English vocabulary as the paraphrase vocabulary of the non-feature vocabulary.
CN202310884628.1A 2023-07-19 2023-07-19 Auxiliary information quick collection method for dictionary pen Pending CN117009460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310884628.1A CN117009460A (en) 2023-07-19 2023-07-19 Auxiliary information quick collection method for dictionary pen

Publications (1)

Publication Number Publication Date
CN117009460A true CN117009460A (en) 2023-11-07

Family

ID=88575560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310884628.1A Pending CN117009460A (en) 2023-07-19 2023-07-19 Auxiliary information quick collection method for dictionary pen

Country Status (1)

Country Link
CN (1) CN117009460A (en)

Similar Documents

Publication Publication Date Title
US6687697B2 (en) System and method for improved string matching under noisy channel conditions
RU2417435C2 (en) Method and system for validating unambiguously recognised words in ocr system
KR100292098B1 (en) Character recognition device and method
US10133965B2 (en) Method for text recognition and computer program product
JPH0736882A (en) Dictionary retrieving device
US8340425B2 (en) Optical character recognition with two-pass zoning
KR20100007722A (en) Method of character recongnition and translation based on camera image
WO2007086059A2 (en) Determining near duplicate 'noisy' data objects
CN111859921A (en) Text error correction method and device, computer equipment and storage medium
CN111460793A (en) Error correction method, device, equipment and storage medium
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
US8411958B2 (en) Apparatus and method for handwriting recognition
US5909509A (en) Statistical-based recognition of similar characters
CN109074355B (en) Method and medium for ideographic character analysis
KR101176963B1 (en) System for character recognition and post-processing in document image captured
JP2011008784A (en) System and method for automatically recommending japanese word by using roman alphabet conversion
JP2001175661A (en) Device and method for full-text retrieval
JP2000089786A (en) Method for correcting speech recognition result and apparatus therefor
CN117009460A (en) Auxiliary information quick collection method for dictionary pen
Lund Ensemble Methods for Historical Machine-Printed Document Recognition
JPH11232296A (en) Document filing system and document filing method
CN114970554A (en) Document checking method based on natural language processing
JP3975825B2 (en) Character recognition error correction method, apparatus and program
Lu et al. Word searching in document images using word portion matching
JP4802502B2 (en) Word recognition device and word recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination