CN117009460A

CN117009460A - Auxiliary information quick collection method for dictionary pen

Info

Publication number: CN117009460A
Application number: CN202310884628.1A
Authority: CN
Inventors: 王烈峰; 詹晓沛
Original assignee: Readboy Education Technology Co Ltd
Current assignee: Readboy Education Technology Co Ltd
Priority date: 2023-07-19
Filing date: 2023-07-19
Publication date: 2023-11-07

Abstract

The invention relates to the technical field of dictionary pens, in particular to a method for quickly collecting auxiliary information of dictionary pens.

Description

Auxiliary information quick collection method for dictionary pen

Technical Field

Background

A dictionary pen is a portable electronic device used mainly to assist language learning and information lookup. It gives the user a convenient and fast tool for both, helping the user expand vocabulary and improve language comprehension and application.

Chinese patent publication No. CN105335356A discloses a paper translation method and a translation pen device for semantic recognition. The paper translation method for semantic recognition comprises the following steps: (1) performing basic coding on English characters, establishing a character coding library, a rule library and a font library, and combining and arranging them to form a coding preparation library; (2) scanning and recognizing the paper English text to be translated by OCR; (3) coding the recognized character string by using the coding preparation library; (4) performing semantic processing on the coded character string to complete the coded semantic description; (5) obtaining accurately recognized English words through cognitive reasoning over the OCR-recognized words; (6) connecting the accurately recognized English words to an electronic dictionary to realize automatic translation. Compared with the prior art, that method combines coding, semantic processing and reasoning with conventional OCR, and reduces the false recognition rate of conventional OCR text recognition.

However, the prior art has the following problems:

In practice, the same English abbreviation has different definitions in different technical fields. The prior art does not take this into account, so the dictionary pen recognizes the meaning of English abbreviations inaccurately.

Disclosure of Invention

In order to solve the above problems, the present invention provides a method for quickly collecting auxiliary information of a dictionary pen, comprising:

step S1, setting a plurality of auxiliary databases, wherein each auxiliary database stores the abbreviated vocabulary of a different technical field and the paraphrase vocabulary associated with that abbreviated vocabulary;

step S2, obtaining a text recognition result of the dictionary pen and judging, based on the text recognition result, the technical field to which it belongs, wherein

each word in the text recognition result is matched with the domain-specific vocabulary in a plurality of domain databases, the matching coincidence degree between the text recognition result and each domain database is calculated based on the matching result, the domain database matched with the text recognition result is obtained based on the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs;

step S3, recognizing whether feature vocabulary appears in the text recognition result, determining the auxiliary database to be called based on the technical field to which the text recognition result belongs, and determining the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database;

step S4, outputting the paraphrase vocabulary corresponding to the text recognition result, comprising

identifying the paraphrase vocabulary of the non-feature vocabulary and outputting it;

and outputting the paraphrase vocabulary of the feature vocabulary determined in step S3.
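Steps S2 to S4 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `explain_text`, the in-memory dictionaries and every sample entry are hypothetical, and the matching coincidence degree is assumed to be the fraction of recognized words found in a domain database.

```python
def explain_text(words, domain_dbs, aux_dbs, standard_dict):
    """Return (field, {word: definition}) for a list of recognized words.

    domain_dbs:    field name -> set of domain-specific words (assumed layout)
    aux_dbs:       field name -> {abbreviation: definition}   (assumed layout)
    standard_dict: {complete English word: definition}        (assumed layout)
    """
    if not words:
        return None, {}

    # S2: score each field by the share of recognized words it contains
    # (assumed form of the matching coincidence degree).
    def coincidence(field):
        vocab = domain_dbs[field]
        return sum(1 for w in words if w in vocab) / len(words)

    field = max(domain_dbs, key=coincidence)

    definitions = {}
    for w in words:
        if w in standard_dict:
            # S4: non-feature word, defined by the standard dictionary.
            definitions[w] = standard_dict[w]
        else:
            # S3: feature word, defined by the auxiliary database of the
            # detected field; None if the abbreviation is not listed there.
            definitions[w] = aux_dbs.get(field, {}).get(w)
    return field, definitions


# Hypothetical sample data.
domain_dbs = {
    "computing": {"kernel", "thread"},
    "materials": {"polymer", "resin"},
}
aux_dbs = {
    "computing": {"PC": "personal computer"},
    "materials": {"PC": "polycarbonate"},
}
standard_dict = {
    "the": "definite article",
    "kernel": "central part",
    "thread": "thin strand",
}

field, definitions = explain_text(
    ["the", "PC", "kernel", "thread"], domain_dbs, aux_dbs, standard_dict
)
print(field, definitions["PC"])
```

With this sample data, "PC" resolves through the computing auxiliary database because two of the four recognized words appear in the computing domain database and none in the materials one.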

Further, in step S2, each domain database is constructed in advance, and the construction process of each domain database comprises:

step S21, crawling text data from a public document database of a single technical field;

step S22, performing word segmentation on each piece of text data to obtain a plurality of words, and constructing a sample vocabulary database;

step S23, repeating step S21 and step S22 to obtain sample vocabulary databases for a plurality of technical fields, and determining the common vocabulary in each sample vocabulary database, wherein

the occurrence probability of a word in the sample vocabulary database is calculated, and the word is determined to be common vocabulary when the preset vocabulary comparison condition is met;

the preset vocabulary comparison condition is that the word appears in every sample vocabulary database and its occurrence probability is higher than a preset vocabulary probability threshold;

step S24, removing the common vocabulary from the sample vocabulary database to obtain the domain database.
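Steps S23 and S24 can be illustrated with the short sketch below, which starts from text that has already been crawled and segmented. Several points are assumptions rather than statements from the source: the name `build_domain_dbs` is hypothetical, the occurrence probability is taken to be a word's relative frequency within one field's sample, and a word is treated as common only if it exceeds the threshold in every sample.

```python
from collections import Counter


def build_domain_dbs(samples, threshold):
    """samples: field name -> list of tokens (the sample vocabulary database).

    Returns field name -> set of domain-specific words (the domain database).
    """
    # Occurrence probability of each word within each field's sample
    # (assumed to be relative frequency).
    probs = {}
    for field, tokens in samples.items():
        counts = Counter(tokens)
        total = len(tokens)
        probs[field] = {w: c / total for w, c in counts.items()}

    # S23: a word is common if it appears in every sample and, by
    # assumption, its probability exceeds the threshold in each of them.
    in_all = set.intersection(*(set(p) for p in probs.values()))
    common = {
        w for w in in_all
        if all(p[w] > threshold for p in probs.values())
    }

    # S24: remove the common words from each sample vocabulary.
    return {field: set(p) - common for field, p in probs.items()}


# Hypothetical, tiny samples; real samples would come from crawled documents.
samples = {
    "medicine": "the patient has the cardiac condition".split(),
    "computing": "the kernel starts the thread".split(),
}
dbs = build_domain_dbs(samples, threshold=0.2)
print(sorted(dbs["computing"]))
```

Here "the" occurs in both samples with relative frequencies above 0.2, so it is removed from both domain databases, while words such as "kernel" and "cardiac" remain.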

Further, in step S2, the matching coincidence degree between the text recognition result and each domain database is determined, wherein

the matching coincidence degree between the text recognition result and a domain database is calculated according to formula (1),

in formula (1), N represents the number of words in the text recognition result, and Ne represents the number of words in the text recognition result that match the domain-specific vocabulary in the domain database.
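Formula (1) itself does not appear in this text. Given the two quantities defined for it, one natural reading is the fraction of recognized words that match the domain database. The symbol E below is an assumption introduced here for illustration and is not taken from the source.

```latex
E = \frac{N_e}{N}
```

Under this reading, a recognition result of 6 words of which 2 match a domain database would give E = 2/6 ≈ 0.33 for that database.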

Further, in step S2, each word in the text recognition result is matched with the domain-specific vocabulary in the plurality of domain databases, wherein

if a single word is identical to a domain-specific word in a domain database, the word is judged to match that domain-specific word.

Further, in step S2, the domain database matched with the text recognition result is obtained based on the ranking result, wherein

the matching coincidence degrees between the text recognition result and the domain databases are arranged in descending order, and the domain database corresponding to the maximum matching coincidence degree is selected as the domain database matched with the text recognition result.
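The ranking and selection can be sketched as below. The function name `rank_domains` and the sample data are hypothetical, and the score is assumed to be the matched-word count divided by the total word count. The source does not say how ties are resolved; this sketch simply keeps the insertion order of tied fields.

```python
def rank_domains(words, domain_dbs):
    """Return (field, score) pairs sorted by descending matching coincidence degree."""
    n = len(words)
    scores = []
    for field, vocab in domain_dbs.items():
        # Exact-match count: words identical to a domain-specific word.
        n_e = sum(1 for w in words if w in vocab)
        scores.append((field, n_e / n if n else 0.0))
    # sorted() is stable, so fields with equal scores keep their original order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data.
words = ["the", "kernel", "thread", "PC"]
domain_dbs = {
    "materials": {"polymer", "resin"},
    "computing": {"kernel", "thread"},
}
ranking = rank_domains(words, domain_dbs)
best_field = ranking[0][0]
print(ranking, best_field)
```

The first entry of the sorted list corresponds to the maximum matching coincidence degree, so its field is taken as the technical field of the text recognition result.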

Further, in step S3, whether feature vocabulary appears in the text recognition result is recognized, wherein

each word in the text recognition result is compared with the complete English vocabulary in a standard dictionary database, and the word is judged to be feature vocabulary if no identical complete English word exists in the standard dictionary database.

Further, the standard dictionary database stores a plurality of complete English vocabulary and paraphrase vocabulary associated with the English vocabulary.
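The feature-vocabulary test reduces to a membership check against the standard dictionary. The sketch below assumes the dictionary is held as an in-memory mapping and that matching is exact and case-sensitive; the source does not specify case handling, and the name `split_feature_words` is hypothetical.

```python
def split_feature_words(words, standard_dict):
    """Partition recognized words into (feature words, non-feature words)."""
    features, non_features = [], []
    for w in words:
        if w in standard_dict:
            # An identical complete English word exists: non-feature word.
            non_features.append(w)
        else:
            # No identical entry: treated as feature vocabulary.
            features.append(w)
    return features, non_features


# Hypothetical sample data.
standard_dict = {"the": "definite article", "kernel": "central part"}
features, non_features = split_feature_words(["the", "PC", "kernel"], standard_dict)
print(features, non_features)
```

Note that under this rule any word absent from the standard dictionary is classified as feature vocabulary, whether or not it is actually an abbreviation.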

Further, in step S3, the auxiliary database to be called is determined based on the technical field to which the text recognition result belongs, wherein

the auxiliary database that stores the abbreviated vocabulary of that technical field and the paraphrase vocabulary associated with the abbreviated vocabulary is called.

Further, in step S3, the paraphrase vocabulary of the feature vocabulary is determined based on the content of the auxiliary database, wherein

the feature vocabulary is compared with the abbreviated vocabulary in the called auxiliary database, and if an abbreviated word identical to the feature vocabulary exists in the auxiliary database, the paraphrase vocabulary associated with that abbreviated word is determined as the paraphrase vocabulary of the feature vocabulary.
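The field-dependent lookup can be illustrated as follows. The per-field dictionaries, the function name `define_feature_word` and the sample entries are assumptions made for this sketch; the source leaves the storage format of the auxiliary databases open.

```python
def define_feature_word(word, field, aux_dbs):
    """Look up an abbreviation in the auxiliary database of one technical field.

    Returns None when the field has no auxiliary database or the
    abbreviation is not listed in it.
    """
    aux = aux_dbs.get(field, {})
    return aux.get(word)


# Hypothetical entries: the same abbreviation defined differently per field.
aux_dbs = {
    "computing": {"PC": "personal computer"},
    "materials": {"PC": "polycarbonate"},
}
print(define_feature_word("PC", "computing", aux_dbs))
print(define_feature_word("PC", "materials", aux_dbs))
```

The same abbreviation yields a different definition depending on which field was selected in step S2, which is the behaviour the method relies on.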

Further, in step S4, identifying the paraphrase vocabulary of the non-feature vocabulary comprises comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and if a complete English word identical to the non-feature vocabulary exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English word as the paraphrase vocabulary of the non-feature vocabulary.

Compared with the prior art, the invention sets up auxiliary databases that store the abbreviated vocabulary of different technical fields together with the associated paraphrase vocabulary. Each word in the text recognition result is matched with the domain-specific vocabulary in the plurality of domain databases, the matching coincidence degree is calculated based on the matching result, the domain database matched with the text recognition result is obtained based on the ranking of the matching coincidence degrees, and the technical field corresponding to that domain database is determined as the technical field to which the text recognition result belongs. Whether feature vocabulary appears in the text recognition result is then recognized, the auxiliary database to be called is determined based on that technical field, and the paraphrase vocabulary corresponding to the text recognition result is determined based on the content of the auxiliary database. As a result, the dictionary pen recognizes the meaning of English abbreviated vocabulary in different technical fields more accurately.

In particular, the invention provides a plurality of auxiliary databases, each storing abbreviated vocabulary and the paraphrase vocabulary associated with it. This provides the database support needed to output the paraphrase vocabulary of an English abbreviation accurately once the technical field to which the text recognition result belongs has been determined, so that the dictionary pen recognizes the meaning of English abbreviated vocabulary in different technical fields more accurately.

In particular, the invention obtains the domain database matched with the text recognition result based on the ranking of the matching coincidence degrees, and determines the technical field corresponding to that domain database as the technical field to which the text recognition result belongs. In practice, the matching coincidence degree characterizes how well the text recognition result matches a domain database: the higher the matching coincidence degree, the better the match. The domain database with the largest matching coincidence degree in the ranking is therefore determined to be the domain database matched with the text recognition result. This allows the technical field to which the text recognition result belongs to be determined reliably, ensures that the feature vocabulary in the text recognition result is matched against the abbreviated vocabulary in the auxiliary database of that technical field, and improves the accuracy with which the dictionary pen recognizes the meaning of feature vocabulary in different technical fields.

In particular, the invention calls the auxiliary database that stores the abbreviated vocabulary of the technical field to which the text recognition result belongs and the associated paraphrase vocabulary, and determines the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database. In practice, feature vocabulary, that is, English abbreviated vocabulary, differs from the complete English vocabulary in the standard dictionary database, so its paraphrase vocabulary cannot be recognized by matching it with the domain-specific vocabulary in the domain database; it is matched with the abbreviated vocabulary in the auxiliary database instead. Because an English abbreviation has different definitions in different technical fields, the technical field to which the text recognition result belongs is judged first, and the feature vocabulary is then matched with the abbreviated vocabulary, and its associated paraphrase vocabulary, in the auxiliary database of that technical field. The paraphrase vocabulary of the feature vocabulary is thereby recognized accurately, which improves the accuracy with which the dictionary pen recognizes feature vocabulary in different technical fields.

Drawings

FIG. 1 is a diagram showing steps of a method for quickly collecting auxiliary information of a dictionary pen according to an embodiment of the invention;

FIG. 2 is a schematic diagram of steps in a construction process of a domain database according to an embodiment of the invention;

FIG. 3 is a flowchart for determining whether feature vocabulary appears in the text recognition result according to an embodiment of the invention.

Detailed Description

In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.

It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.

Referring to FIG. 1, which is a schematic diagram of the steps of the method for quickly collecting auxiliary information of a dictionary pen according to an embodiment of the invention, the method comprises steps S1 to S4 set out above.

Specifically, the way the auxiliary databases are constructed is not limited. In this embodiment, a person skilled in the art may compile in advance the English abbreviated vocabulary of different technical fields and the corresponding definitions, establish an association between each abbreviation and its definition within a technical field, and store them to form the auxiliary database of that field.

Specifically, in the invention, the text recognition result includes at least the recognized English characters but not the definitions corresponding to them. During dictionary pen recognition, the characters in the image are recognized first, and the corresponding meaning is then output based on those characters; this is not described further here.

Specifically, the invention provides a plurality of auxiliary databases, each storing abbreviated vocabulary and the paraphrase vocabulary associated with it. This provides the database support needed to output the paraphrase vocabulary of an English abbreviation accurately once the technical field to which the text recognition result belongs has been determined, so that the dictionary pen recognizes the meaning of English abbreviated vocabulary in different technical fields more accurately.

Specifically, referring to FIG. 2, each domain database used in step S2 is constructed in advance through steps S21 to S24 set out above.

Specifically, in this embodiment, the preset vocabulary probability threshold is obtained in advance by statistics: a plurality of domain-specific words belonging to a single technical field are selected, the occurrence probability of each of them in each sample vocabulary database is counted, and the average occurrence probability Δpl is calculated. The preset vocabulary probability threshold is set to p0 = Δpl × α, where α is an influence factor and 1.3 < α < 1.9.
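The threshold calculation can be sketched as below. The source does not say exactly how the average is formed, so this sketch assumes it is the mean over every (word, sample vocabulary database) pair, with a word that is absent from a database counted as probability 0. The function name and the default α of 1.6 are arbitrary choices within the stated range.

```python
def vocab_probability_threshold(special_words, probs_by_field, alpha=1.6):
    """Compute p0 = mean occurrence probability * alpha.

    special_words:  domain-specific words chosen from one technical field
    probs_by_field: field name -> {word: occurrence probability}
    """
    if not 1.3 < alpha < 1.9:
        raise ValueError("alpha must lie in the open interval (1.3, 1.9)")
    # One value per (database, word) pair; missing words count as 0.0
    # (an assumption of this sketch).
    values = [
        probs.get(word, 0.0)
        for probs in probs_by_field.values()
        for word in special_words
    ]
    mean = sum(values) / len(values)  # average occurrence probability
    return mean * alpha               # preset vocabulary probability threshold


# Hypothetical occurrence probabilities.
probs_by_field = {
    "computing": {"kernel": 0.02, "thread": 0.04},
    "medicine": {"patient": 0.05},
}
p0 = vocab_probability_threshold(["kernel", "thread"], probs_by_field)
print(p0)
```

Because α is greater than 1, the threshold sits above the typical frequency of domain-specific words, so only words that are noticeably more frequent across all fields are treated as common vocabulary.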

In this embodiment, word segmentation comprises deleting the punctuation, spaces, numbers and symbols in the text data and then segmenting the sentences into words. The word segmentation tool is not limited; a person skilled in the art may select a suitable tool as required. This is prior art and is not described further.
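For English text, a very simple stand-in for a word segmentation tool is shown below. It is only a sketch: the function name `segment` is hypothetical, and instead of literally deleting spaces (which would fuse adjacent English words) it treats every run of non-letter characters as a separator.

```python
import re


def segment(text):
    """Split text into words after discarding punctuation, digits and symbols."""
    # Replace each run of characters that are not ASCII letters
    # (punctuation, digits, symbols, whitespace) with a single space.
    cleaned = re.sub(r"[^A-Za-z]+", " ", text)
    # Split on whitespace; leading and trailing spaces are ignored.
    return cleaned.split()


tokens = segment("The CPU runs at 3.2 GHz, roughly.")
print(tokens)
```

A dedicated tokenizer would handle cases this sketch does not, such as contractions and hyphenated terms, which the regular expression above splits into separate pieces.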

Specifically, in this embodiment, a person skilled in the art may further screen the domain-specific vocabulary in the domain databases so that each domain database is more representative of its field.

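Specifically, in step S2, the matching coincidence degree between the text recognition result and each domain database is calculated according to formula (1), as set out above.

Specifically, in step S2, each word in the text recognition result is matched with the domain-specific vocabulary in the plurality of domain databases, a word being judged to match when it is identical to a domain-specific word in a domain database.

Specifically, in step S2, the domain database matched with the text recognition result is obtained based on the ranking result: the matching coincidence degrees are arranged in descending order, and the domain database corresponding to the maximum matching coincidence degree is selected.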

Specifically, the invention obtains the domain database matched with the text recognition result based on the ranking of the matching coincidence degrees, and determines the technical field corresponding to that domain database as the technical field to which the text recognition result belongs. In practice, the matching coincidence degree characterizes how well the text recognition result matches a domain database: the higher the matching coincidence degree, the better the match. The domain database with the largest matching coincidence degree in the ranking is therefore determined to be the domain database matched with the text recognition result. This allows the technical field to which the text recognition result belongs to be determined reliably, ensures that the feature vocabulary in the text recognition result is matched against the abbreviated vocabulary in the auxiliary database of that technical field, and improves the accuracy with which the dictionary pen recognizes the meaning of feature vocabulary in different technical fields.

Specifically, referring to FIG. 3, which is a flowchart for determining whether feature vocabulary appears in the text recognition result according to an embodiment of the invention, in step S3 each word in the text recognition result is compared with the complete English vocabulary in the standard dictionary database, and a word for which no identical complete English word exists is judged to be feature vocabulary.

Specifically, the standard dictionary database stores a plurality of complete English vocabulary and paraphrase vocabulary associated with the English vocabulary.

Specifically, the way the standard dictionary database is set up is not limited. It may be built with a database management system such as MySQL or SQLite to store the complete English vocabulary and the associated paraphrase vocabulary; any approach that fulfils the function of the standard dictionary database may be used, and this is not described further.
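As one possible realization with SQLite, the sketch below uses Python's standard `sqlite3` module and an in-memory database. The table name `standard_dict`, its two columns, the helper `lookup` and the sample rows are all assumptions made for illustration; the source does not prescribe a schema.

```python
import sqlite3

# In-memory database for illustration; a device would use a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE standard_dict ("
    "word TEXT PRIMARY KEY, definition TEXT NOT NULL)"
)
# Hypothetical sample entries.
conn.executemany(
    "INSERT INTO standard_dict (word, definition) VALUES (?, ?)",
    [
        ("kernel", "the central part of something"),
        ("thread", "a thin strand of fibre"),
    ],
)
conn.commit()


def lookup(conn, word):
    """Return the definition of a complete English word, or None if absent."""
    row = conn.execute(
        "SELECT definition FROM standard_dict WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None


print(lookup(conn, "kernel"))
print(lookup(conn, "PC"))
```

A lookup that returns None corresponds to the case in which no identical complete English word exists, that is, the word is treated as feature vocabulary.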

Specifically, the invention does not limit the way in which associations are established between the English vocabulary and the corresponding paraphrase vocabulary in the standard dictionary database, or between the abbreviated vocabulary and the corresponding paraphrase vocabulary in the auxiliary databases. In the computer field there are many ways to associate data; this embodiment may use any data construction that achieves the above function, and this is not described further here.

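Specifically, in step S3, the auxiliary database to be called is determined based on the technical field to which the text recognition result belongs: the auxiliary database that stores the abbreviated vocabulary of that technical field and the associated paraphrase vocabulary is called.

Specifically, in step S3, the paraphrase vocabulary of the feature vocabulary is determined based on the content of the auxiliary database: the feature vocabulary is compared with the abbreviated vocabulary in the called auxiliary database, and if an identical abbreviated word exists, the paraphrase vocabulary associated with it is determined as the paraphrase vocabulary of the feature vocabulary.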

Specifically, the invention calls the auxiliary database that stores the abbreviated vocabulary of the technical field to which the text recognition result belongs and the associated paraphrase vocabulary, and determines the paraphrase vocabulary of the feature vocabulary based on the content of that auxiliary database. In practice, feature vocabulary, that is, English abbreviated vocabulary, differs from the complete English vocabulary in the standard dictionary database, so its paraphrase vocabulary cannot be recognized by matching it with the domain-specific vocabulary in the domain database; it is matched with the abbreviated vocabulary in the auxiliary database instead. Because an English abbreviation has different definitions in different technical fields, the technical field to which the text recognition result belongs is judged first, and the feature vocabulary is then matched with the abbreviated vocabulary, and its associated paraphrase vocabulary, in the auxiliary database of that technical field. The paraphrase vocabulary of the feature vocabulary is thereby recognized accurately, which improves the accuracy with which the dictionary pen recognizes feature vocabulary in different technical fields.

Specifically, in step S4, identifying the paraphrase vocabulary of the non-feature vocabulary comprises comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and if a complete English word identical to the non-feature vocabulary exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English word as the paraphrase vocabulary of the non-feature vocabulary.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims

1. A method for quickly collecting auxiliary information of a dictionary pen is characterized by comprising the following steps:

step S1, setting a plurality of auxiliary databases and a plurality of domain databases, wherein each auxiliary database is used for storing abbreviated vocabularies in different technical domains and paraphrasing vocabularies associated with the abbreviated vocabularies, and each domain database is used for storing special vocabularies in different technical domains;

2. The method for quickly collecting auxiliary information of a dictionary pen according to claim 1, wherein in the step S2, each of the domain databases is constructed in advance, and the construction process of each domain database includes,

3. The method for quickly collecting auxiliary information of dictionary pens according to claim 1, wherein in step S2, the matching coincidence degree of the text recognition result and the databases of different fields is determined, wherein,

4. The method for quickly collecting auxiliary information of dictionary pens according to claim 3, wherein in step S2, each vocabulary in the text recognition result is matched with the exclusive vocabulary in the plurality of domain databases, wherein,

5. The method according to claim 4, wherein in the step S2, a domain database to which the text recognition result matches is obtained based on the sorting result, wherein,

6. The method for quickly collecting auxiliary information of a dictionary pen according to claim 1, wherein in the step S3, whether feature words appear in the text recognition result is recognized, wherein,

7. The method of claim 6, wherein the standard dictionary database stores a plurality of complete English words and the paraphrase vocabulary associated with those English words.

8. The method for quickly collecting auxiliary information of a dictionary pen according to claim 1, wherein in the step S3, an auxiliary database to be called is determined based on the technical field to which the text recognition result belongs, wherein,

9. The method according to claim 1, wherein in the step S3, the paraphrasing vocabulary of the feature vocabulary is determined based on the contents in the auxiliary database, wherein,

10. The method according to claim 1, wherein in step S4, identifying the paraphrase vocabulary of the non-feature vocabulary comprises comparing the non-feature vocabulary with the complete English vocabulary in the standard dictionary database, and if a complete English word identical to the non-feature vocabulary exists in the standard dictionary database, determining the paraphrase vocabulary associated with that complete English word as the paraphrase vocabulary of the non-feature vocabulary.