CN115858797A - Method and system for generating Chinese near-meaning words based on OCR technology - Google Patents

Method and system for generating Chinese near-meaning words based on OCR technology

Publication number
CN115858797A
Authority
CN
China
Prior art keywords
word
chinese
words
common
request
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210125902.2A
Other languages
Chinese (zh)
Inventor
贾敬伍
蒋宁
马超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202210125902.2A
Publication of CN115858797A
Legal status: Pending

Landscapes

  • Character Discrimination (AREA)

Abstract

The method first obtains a Chinese request word for which near-meaning words are to be found, then combines the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, and finally selects the near-meaning words of the Chinese request word from the candidate words and outputs them. The method and system for generating Chinese near-meaning words based on OCR technology provided by the application can greatly reduce the cost of acquiring near-meaning words, can generate near-meaning words and expand a corpus when corpus data is lacking, and can effectively recognize erroneous words produced by a handwriting input method.

Description

Method and system for generating Chinese near-meaning words based on OCR technology
Technical Field
The application relates to the technical field of data processing, in particular to a method and a system for generating Chinese near-meaning words based on an OCR technology.
Background
Some natural language processing tasks, such as intention recognition, entity recognition, and text correction, require recalling near-meaning words for a target word. At present, this is done either by recalling from a near-meaning word dictionary or by recalling with word vectors trained on a large amount of data.
However, building an entity recognition model requires a large amount of training data, and no mature technique exists at present: recall is limited by the scope of the corpus, so there may be no returned result, or the returned result may be unrelated to the request word. In addition, entity words may contain wrongly written characters caused by input-method problems or misspelling.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to one aspect of the application, a method for generating Chinese near-meaning words based on OCR technology is provided, comprising the following steps:
acquiring a Chinese request word for which near-meaning words are to be found;
combining the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, wherein the shape-similar character dictionary is created in advance based on a plurality of common characters in a common Chinese dictionary, and comprises each common character and its corresponding shape-similar characters;
and selecting the near-meaning words of the Chinese request word from the plurality of candidate words and outputting them.
Optionally, before combining the Chinese request word with the shape-similar character dictionary to construct the plurality of candidate words, the method further comprises:
reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common characters;
converting each common character into a common-character picture and storing the picture;
vectorizing each common-character picture by means of OCR technology to obtain a graph vector corresponding to the picture, and establishing a mapping relation between each common character and its graph vector;
and constructing and storing the shape-similar character dictionary based on the mapping relation between each common character and its graph vector.
Optionally, constructing and storing the shape-similar character dictionary based on the mapping relation between each common character and its graph vector comprises:
selecting the common characters one by one as a target character, calculating a first vector cosine similarity between the target character and each other common character in the common Chinese dictionary based on the mapping relation between common characters and graph vectors, and comparing the first vector cosine similarity with a first preset threshold to obtain a first comparison result;
selecting the shape-similar characters of the target character according to the first comparison result, and recording the first vector cosine similarity between the target character and each shape-similar character;
and constructing and storing the shape-similar character dictionary based on each target character, its shape-similar characters, and the first vector cosine similarity between them.
Optionally, combining the Chinese request word with the shape-similar character dictionary to construct the plurality of candidate words comprises:
segmenting the Chinese request word character by character to obtain the request characters contained in it;
for each request character, looking up and recalling from the shape-similar character dictionary at least one target shape-similar character of the request character, together with the first vector cosine similarity between the request character and each target shape-similar character;
and permuting and combining the target shape-similar characters according to the length of the Chinese request word to construct at least one candidate word corresponding to the Chinese request word.
Optionally, selecting the near-meaning words of the Chinese request word from the plurality of candidate words and outputting them comprises:
for each candidate word, calculating the product of the first vector cosine similarities of the corresponding request characters, and storing the product as a second vector cosine similarity between the Chinese request word and the candidate word;
sorting all candidate words in descending order of the second vector cosine similarity, and comparing each second vector cosine similarity with a second preset threshold to obtain a second comparison result;
and selecting, according to the second comparison result, the candidate words composed of shape-similar characters, and outputting them as the near-meaning words of the Chinese request word.
According to another aspect of the present application, a system for generating Chinese near-meaning words based on OCR technology is provided, comprising:
a request word acquisition module configured to acquire a Chinese request word for which near-meaning words are to be found;
a candidate word construction module configured to combine the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, wherein the shape-similar character dictionary is created in advance based on a plurality of common characters in a common Chinese dictionary, and comprises each common character and its corresponding shape-similar characters;
a near-meaning word output module configured to select the near-meaning words of the Chinese request word from the plurality of candidate words and output them.
Optionally, the system is further configured to perform the following before the candidate word construction module constructs the candidate words:
reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common characters;
converting each common character into a common-character picture and storing the picture;
vectorizing each common-character picture by means of OCR technology to obtain a graph vector corresponding to the picture, and establishing a mapping relation between each common character and its graph vector;
and constructing and storing the shape-similar character dictionary based on the mapping relation between each common character and its graph vector.
Optionally, the system is further configured to perform the following before the candidate word construction module constructs the candidate words:
selecting the common characters one by one as a target character, calculating a first vector cosine similarity between the target character and each other common character in the common Chinese dictionary based on the mapping relation between common characters and graph vectors, and comparing the first vector cosine similarity with a first preset threshold to obtain a first comparison result;
selecting the shape-similar characters of the target character according to the first comparison result, and recording the first vector cosine similarity between the target character and each shape-similar character;
and constructing and storing the shape-similar character dictionary based on each target character, its shape-similar characters, and the first vector cosine similarity between them.
Optionally, the candidate word construction module is further configured to perform:
segmenting the Chinese request word character by character to obtain the request characters contained in it;
for each request character, looking up and recalling from the shape-similar character dictionary at least one target shape-similar character of the request character, together with the first vector cosine similarity between the request character and each target shape-similar character;
and permuting and combining the target shape-similar characters according to the length of the Chinese request word to construct at least one candidate word corresponding to the Chinese request word.
Optionally, the near-meaning word output module is further configured to perform:
for each candidate word, calculating the product of the first vector cosine similarities of the corresponding request characters, and storing the product as a second vector cosine similarity between the Chinese request word and the candidate word;
sorting all candidate words in descending order of the second vector cosine similarity, and comparing each second vector cosine similarity with a second preset threshold to obtain a second comparison result;
and selecting, according to the second comparison result, the candidate words composed of shape-similar characters, and outputting them as the near-meaning words of the Chinese request word.
According to another aspect of the present application, a computing device is also provided, comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese near-meaning words based on OCR technology as described in any one of the above.
According to another aspect of the present application, a computer-readable storage medium is also provided, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method for generating Chinese near-meaning words based on OCR technology as described in any one of the above.
According to another aspect of the present application, a computer program product is also provided, comprising computer-readable code which, when executed by a computer device, causes the computer device to perform any of the above methods for generating Chinese near-meaning words based on OCR technology.
The method first obtains a Chinese request word for which near-meaning words are to be found, then combines the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, and finally selects the near-meaning words of the Chinese request word from the candidate words and outputs them. The method and system for generating Chinese near-meaning words based on OCR technology can greatly reduce the cost of acquiring near-meaning words, can generate near-meaning words and expand a corpus when corpus data is lacking, and can effectively recognize erroneous words produced by a handwriting input method.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, as illustrated in the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart illustrating a method for generating Chinese synonyms based on OCR technology according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a design scheme for generating Chinese synonyms based on OCR technology according to an embodiment of the present application;
FIG. 3 is a software architecture diagram for generating Chinese synonyms based on OCR technology according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a system for generating Chinese synonyms based on OCR technology according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computing device architecture according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application.
Detailed Description
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
At present, there are two common methods for obtaining Chinese near-meaning words. The first builds a near-meaning word dictionary from an existing knowledge base and recalls from it: the near-meaning words of all words are collected and compiled into a dictionary, and the near-meaning words of a request word are then looked up in that dictionary. However, this involves a heavy workload and a large amount of manual editing, and it is limited by the scope of the collected data, so a lookup may return no result.
The second pre-trains a word2vector model on a large amount of text data and recalls near-meaning words using word vectors: a large Chinese text corpus is collected and segmented with a word segmenter, word vectors are trained with the word2vector model, and the vector cosine similarity between the request word and each other word is calculated as a confidence. The results are sorted in descending order of confidence, and words whose confidence exceeds a preset threshold are selected as candidate words and output as near-meaning words. However, this method is limited by the scope of the corpus, so there may be no returned result, or the returned result may be unrelated to the request word.
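The word-vector recall described above can be illustrated with a minimal sketch. The vectors below are hand-made toy values standing in for trained word2vector embeddings, and the function names and the threshold of 0.8 are assumptions for illustration only, not part of the application.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_by_word_vectors(request, vectors, threshold):
    """Return (word, confidence) pairs above the threshold, highest confidence first."""
    if request not in vectors:
        return []  # request word outside the corpus: nothing can be recalled
    scored = [(w, cosine(vectors[request], v))
              for w, v in vectors.items() if w != request]
    kept = [(w, s) for w, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed values, not trained embeddings).
toy_vectors = {
    "高兴": [0.9, 0.1, 0.0],
    "开心": [0.8, 0.2, 0.1],
    "桌子": [0.0, 0.1, 0.9],
}

in_corpus = recall_by_word_vectors("高兴", toy_vectors, threshold=0.8)
out_of_corpus = recall_by_word_vectors("愉快", toy_vectors, threshold=0.8)
print(in_corpus)
print(out_of_corpus)
```

The second call shows the limitation noted above: a request word that never appeared in the training corpus has no vector, so nothing is returned.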
FIG. 1 is a flowchart of a method for generating Chinese near-meaning words based on OCR technology according to an embodiment of the present application. As shown in FIG. 1, the method may include at least the following steps S101 to S103.
Step S101: acquiring a Chinese request word for which near-meaning words are to be found;
Step S102: combining the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, wherein the shape-similar character dictionary is created in advance based on a plurality of common characters in a common Chinese dictionary, and comprises each common character and its corresponding shape-similar characters;
Step S103: selecting the near-meaning words of the Chinese request word from the plurality of candidate words and outputting them.
The embodiment of the application provides a method for generating Chinese near-meaning words based on OCR technology. With this method, near-meaning words can be generated automatically, saving a large amount of labor cost; the method can also generate shape-similar words, expand the training corpus, and improve the generalization capability of a model.
The method for generating Chinese near-meaning words based on OCR technology mentioned in the above embodiment is described in detail below.
First, as described in step S101, a Chinese request word for which near-meaning words are to be found is acquired.
The Chinese request word mainly comes from natural language processing tasks, including intention recognition, entity recognition, or text correction. The input request word may be a single character or a word, which is not limited in the embodiment of the present application.
After the Chinese request word is acquired, step S102 is executed to combine the Chinese request word with the shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word.
The shape-similar character dictionary in this embodiment is created in advance based on a plurality of common characters in a common Chinese dictionary, and comprises each common character and its corresponding shape-similar characters. The common Chinese dictionary is a Chinese dictionary composed of frequently used Chinese characters and words, and may be obtained from network data.
In an optional embodiment of the present application, the shape-similar character dictionary is constructed through the following steps S1 to S4.
S1, reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common characters;
S2, converting each common character into a common-character picture and storing the picture;
S3, vectorizing each common-character picture by means of OCR technology to obtain a graph vector corresponding to the picture, and establishing a mapping relation between each common character and its graph vector. OCR (Optical Character Recognition) refers to the process of scanning text material, analyzing the resulting image file, and acquiring the text and layout information;
S4, constructing and storing the shape-similar character dictionary based on the mapping relation between each common character and its graph vector.
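Steps S1 to S3 can be sketched as follows. The application does not specify how the pictures are rendered or which OCR model produces the graph vectors, so this sketch makes two loud assumptions: hard-coded, crude 5x5 bitmaps stand in for the rendered character pictures, and the flattened pixel grid stands in for the OCR-derived graph vector. All names are hypothetical.

```python
# Assumption: crude 5x5 bitmaps standing in for rendered character pictures (step S2).
TOY_GLYPHS = {
    "日": ["#####", "#...#", "#####", "#...#", "#####"],
    "目": ["#####", "#...#", "#####", "#####", "#####"],
    "一": [".....", ".....", "#####", ".....", "....."],
}

def render(char):
    """Stand-in for 'convert a common character into a picture'."""
    return TOY_GLYPHS[char]

def vectorize(bitmap):
    """Stand-in for the OCR-based graph vector: flatten pixels to 0.0 / 1.0."""
    return [1.0 if pixel == "#" else 0.0 for row in bitmap for pixel in row]

def build_glyph_vectors(common_chars):
    """Steps S1 to S3: map each common character to its graph vector."""
    return {char: vectorize(render(char)) for char in common_chars}

glyph_vectors = build_glyph_vectors(["日", "目", "一"])
print(len(glyph_vectors["日"]))
```

In a real system the two stand-in functions would presumably be replaced by a font renderer and a feature extractor taken from an OCR model; only the resulting character-to-vector mapping matters for step S4.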
In an optional embodiment of the present application, constructing the shape-similar character dictionary from the mapping relation between common characters and graph vectors in step S4 may specifically include: selecting the common characters one by one as a target character, calculating a first vector cosine similarity between the target character and each other common character in the common Chinese dictionary based on the mapping relation between common characters and graph vectors, and comparing the first vector cosine similarity with a first preset threshold to obtain a first comparison result; then selecting the shape-similar characters of the target character according to the first comparison result, and recording the first vector cosine similarity between the target character and each shape-similar character; and then constructing and storing the shape-similar character dictionary based on each target character, its shape-similar characters, and the first vector cosine similarity between them.
Comparing the first vector cosine similarity with the first preset threshold to obtain the first comparison result means that, if the first vector cosine similarity between the target character and another common character exceeds the first preset threshold, that common character is determined to be a shape-similar character of the target character. The first preset threshold can be set according to actual requirements.
After all target characters and their shape-similar characters have been found, the shape-similar character dictionary is established, and the first vector cosine similarity between each target character and its shape-similar characters is stored in the dictionary as well.
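The dictionary construction of step S4 can be sketched as a pairwise comparison over the graph vectors. The vectors and the threshold of 0.8 below are toy assumptions, and the nested-dict layout is one possible storage format, not one prescribed by the application.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_shape_dict(glyph_vectors, first_threshold):
    """Map each target character to {shape-similar character: first cosine similarity}."""
    shape_dict = {}
    for target, target_vec in glyph_vectors.items():
        similar = {}
        for other, other_vec in glyph_vectors.items():
            if other == target:
                continue  # a character is not its own shape-similar character
            sim = cosine(target_vec, other_vec)
            if sim > first_threshold:  # first comparison result
                similar[other] = sim
        shape_dict[target] = similar
    return shape_dict

# Toy graph vectors (assumed values).
toy_glyph_vectors = {"日": [1, 1, 1, 0], "目": [1, 1, 1, 1], "一": [0, 0, 0, 1]}
shape_dict = build_shape_dict(toy_glyph_vectors, first_threshold=0.8)
print(shape_dict)
```

The loop compares every pair of characters, so the cost grows quadratically with the size of the common Chinese dictionary; because the dictionary is built once in advance and then stored, this cost is paid offline.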
After the shape-similar character dictionary is constructed, the Chinese request word is segmented character by character to obtain the request characters it contains; then, for each request character, at least one target shape-similar character is looked up and recalled from the shape-similar character dictionary, together with the first vector cosine similarity between the request character and each target shape-similar character; finally, the target shape-similar characters are permuted and combined according to the length of the Chinese request word to construct at least one candidate word corresponding to the Chinese request word.
That is, the Chinese request word is split into individual characters; for each character, all corresponding shape-similar characters are looked up in the shape-similar character dictionary; the shape-similar characters are then combined following the order of the characters in the Chinese request word to form a plurality of candidate words, and the first vector cosine similarity between each character of the Chinese request word and its shape-similar character is recorded.
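Candidate construction amounts to a Cartesian product over the per-character recall lists. The sketch below assumes a nested-dict dictionary with made-up similarity values; it also assumes that a request word containing a character with no dictionary entry yields no candidates, a case the application does not address.

```python
from itertools import product

def build_candidates(request_word, shape_dict):
    """Return [(candidate word, [first similarity per position]), ...]."""
    per_position = []
    for char in request_word:  # character-level segmentation
        options = list(shape_dict.get(char, {}).items())
        if not options:
            return []  # assumption: no shape-similar character means no candidate
        per_position.append(options)
    candidates = []
    for combo in product(*per_position):  # one choice per character position
        word = "".join(ch for ch, _ in combo)
        sims = [sim for _, sim in combo]
        candidates.append((word, sims))
    return candidates

# Toy dictionary (assumed entries and similarity values).
toy_shape_dict = {"未": {"末": 0.95, "本": 0.80}, "来": {"米": 0.85}}
candidates = build_candidates("未来", toy_shape_dict)
no_candidates = build_candidates("未知", toy_shape_dict)
print(candidates)
```

Every candidate has the same length as the request word, and the number of candidates is the product of the recall-list sizes, so long request words with many shape-similar characters per position can produce large candidate sets.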
Finally, step S103 is executed to select the near-meaning words of the Chinese request word from the plurality of candidate words and output them.
In an optional embodiment of the present application, for each candidate word, the product of the first vector cosine similarities of the corresponding request characters is calculated, and the product is stored as a second vector cosine similarity between the Chinese request word and the candidate word; all candidate words are sorted in descending order of the second vector cosine similarity and compared with a second preset threshold to obtain a second comparison result; and the candidate words composed of shape-similar characters are selected according to the second comparison result and output as the near-meaning words of the Chinese request word.
Specifically, for each candidate word, the first vector cosine similarities corresponding to the characters in the candidate word are multiplied together in sequence, and the product is taken as the second vector cosine similarity of the candidate word; all second vector cosine similarities are compared with the second preset threshold, and the candidate words whose second vector cosine similarity is higher than the second preset threshold are selected as near-meaning words of the Chinese request word, stored, and output. The Chinese request word may have more than one near-meaning word.
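The scoring and selection in step S103 can be sketched as follows. The candidate list and the second threshold of 0.7 are toy assumptions; the function name is hypothetical.

```python
import math

def select_near_meaning_words(candidates, second_threshold):
    """Score each candidate by the product of its per-character similarities,
    keep those above the threshold, and sort from high to low."""
    scored = [(word, math.prod(sims)) for word, sims in candidates]
    kept = [(word, score) for word, score in scored if score > second_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy candidates: (candidate word, first similarities per character position).
toy_candidates = [("本米", [0.80, 0.85]), ("末米", [0.95, 0.85])]

strict = select_near_meaning_words(toy_candidates, second_threshold=0.7)
loose = select_near_meaning_words(toy_candidates, second_threshold=0.5)
print(strict)
print(loose)
```

With the stricter threshold only the candidate scoring about 0.8075 survives; with the looser one both are returned, higher score first. `math.prod` requires Python 3.8 or later.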
Overall, the design flow is shown in FIG. 2 and includes:
(1) Reading common characters: reading the common Chinese dictionary;
(2) Character-to-image conversion: converting each character into a picture and storing it;
(3) Vectorized representation: reading the character pictures, representing them as graph vectors by means of OCR technology, and obtaining and storing the mapping relation between characters and graph vectors;
(4) Shape-similar character matching: loading the characters, the graph vectors, and their mapping relation; selecting single characters one by one as the target character; calculating the vector cosine similarity (the confidence) between the target character and each other character; selecting the characters whose confidence exceeds a threshold as shape-similar characters and recording the confidence, thereby constructing and storing the shape-similar character dictionary;
(5) Character splitting and dictionary loading: splitting the request word at the character level to obtain the request characters, and loading the shape-similar character dictionary;
(6) Shape-similar character recall: recalling, for each request character, its shape-similar characters and their confidences;
(7) Candidate word generation: permuting and combining the shape-similar characters of all request characters according to the length of the original request word to construct candidate words, and taking the product (or average) of the confidences of the corresponding request characters as the confidence of each candidate word;
(8) Confidence ranking: sorting all candidate words in descending order of confidence, and taking the candidate words whose confidence exceeds a threshold as shape-similar words;
(9) Returning the result: returning the shape-similar words of the request word, namely the near-meaning words.
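Item (7) allows either the product or the average of the per-character confidences. The sketch below contrasts the two on a toy four-character candidate; the function name, the `mode` parameter, and the values are assumptions for illustration.

```python
import math
from statistics import mean

def candidate_confidence(char_confidences, mode="product"):
    """Aggregate per-character confidences into one candidate-word confidence."""
    if mode == "product":
        return math.prod(char_confidences)
    if mode == "mean":
        return mean(char_confidences)
    raise ValueError(f"unknown mode: {mode!r}")

# Four characters, each replaced by a fairly close shape-similar character.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
by_product = candidate_confidence(toy_confidences, mode="product")
by_mean = candidate_confidence(toy_confidences, mode="mean")
print(by_product, by_mean)
```

The product shrinks as the request word gets longer (about 0.656 here), while the average stays near 0.9, so a threshold tuned for one aggregation will generally not suit the other.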
Taken together, as shown in FIG. 3, the above steps fall into five stages: character vectorization, shape-similar character dictionary generation, candidate word recall, candidate word ranking, and result return, specifically as follows:
(1) Character vectorization: converting common Chinese characters into pictures and vectorizing them by means of OCR technology;
(2) Shape-similar character dictionary generation: for each common Chinese character, calculating the vector cosine similarity (confidence) between that character and each other character one by one, and screening the shape-similar characters against a threshold;
(3) Candidate word recall: recalling shape-similar characters for the request word and generating shape-similar words, namely the candidate words (with confidences);
(4) Candidate word ranking: sorting the candidate words in descending order of confidence, and taking the candidate words whose confidence exceeds a threshold as shape-similar words;
(5) Returning the result: outputting the result, namely the near-meaning words.
In the embodiment of the application, building an entity recognition model requires a large amount of training data, and some entity words contain wrongly written characters or words because of input-method problems or misspelling. The method can therefore be used to add shape-similar characters or shape-similar words to the training data, improving the generalization capability of the entity recognition model.
Based on the same inventive concept, an embodiment of the present application further provides a system for generating Chinese near-meaning words based on OCR technology. As shown in FIG. 4, the system may include:
a request word acquisition module 410 configured to acquire a Chinese request word for which near-meaning words are to be found;
a candidate word construction module 420 configured to combine the Chinese request word with a shape-similar character dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, wherein the shape-similar character dictionary is created in advance based on a plurality of common characters in a common Chinese dictionary, and comprises each common character and its corresponding shape-similar characters;
a near-meaning word output module 430 configured to select the near-meaning words of the Chinese request word from the plurality of candidate words and output them.
In an optional embodiment of the present application, the system may be further configured to perform the following before the candidate word construction module 420 constructs the candidate words:
reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common characters;
converting each common character into a common-character picture and storing the picture;
vectorizing each common-character picture by means of OCR technology to obtain a graph vector corresponding to the picture, and establishing a mapping relation between each common character and its graph vector;
and constructing and storing the shape-similar character dictionary based on the mapping relation between each common character and its graph vector.
In an optional embodiment of the present application, the system may be further configured to perform the following before the candidate word construction module 420 constructs the candidate words:
selecting the common characters one by one as a target character, calculating a first vector cosine similarity between the target character and each other common character in the common Chinese dictionary based on the mapping relation between common characters and graph vectors, and comparing the first vector cosine similarity with a first preset threshold to obtain a first comparison result;
selecting the shape-similar characters of the target character according to the first comparison result, and recording the first vector cosine similarity between the target character and each shape-similar character;
and constructing and storing the shape-similar character dictionary based on each target character, its shape-similar characters, and the first vector cosine similarity between them.
In an optional embodiment of the present application, the candidate word construction module 420 may be further configured to perform:
segmenting the Chinese request word character by character to obtain the request characters contained in it;
for each request character, looking up and recalling from the shape-similar character dictionary at least one target shape-similar character of the request character, together with the first vector cosine similarity between the request character and each target shape-similar character;
and permuting and combining the target shape-similar characters according to the length of the Chinese request word to construct at least one candidate word corresponding to the Chinese request word.
In an optional embodiment of the present application, the synonym output module 430 may be further configured to:
for each candidate word, calculating the product of the first vector cosine similarities of the corresponding request words, and storing the product as a second vector cosine similarity between the Chinese request word and the candidate word;
sorting all candidate words in descending order of the second vector cosine similarity, and comparing the second vector cosine similarity of each candidate word with a second preset threshold to obtain a second comparison result;
and selecting shape-similar words of the target words according to the second comparison result, and outputting them as near-meaning words of the Chinese request word.
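The scoring and selection can be sketched as below. The candidate list and its similarity values are invented, `rank_candidates` is a hypothetical name, and keeping candidates whose score is greater than or equal to the second threshold is an assumed reading of the comparison step.

```python
import math
from typing import List, Tuple


def rank_candidates(
    candidates: List[Tuple[str, List[float]]], threshold: float
) -> List[Tuple[str, float]]:
    """Score each candidate by the product of its per-position first
    similarities, sort in descending order, and keep those that reach
    the threshold (assumed rule: >=)."""
    scored = [(text, math.prod(sims)) for text, sims in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in scored if score >= threshold]


# Invented candidates for the request word "未来".
TOY_CANDIDATES = [
    ("朱耒", [0.85, 0.90]),  # product 0.765
    ("末耒", [0.97, 0.90]),  # product 0.873
    ("朱木", [0.85, 0.50]),  # product 0.425
]
RANKED = rank_candidates(TOY_CANDIDATES, threshold=0.8)
```

Because every factor is at most 1, the product shrinks as the request word gets longer, so a single second threshold may need to be tuned per length.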
An embodiment of the present application further provides a computing device, which includes a memory, a processor, and a computer program stored in the memory and executable by the processor, where the processor, when executing the computer program, implements the method for generating Chinese near-meaning words based on OCR technology as described in any one of the above.
An embodiment of the present application further provides a computer-readable storage medium, preferably a non-volatile readable storage medium, in which a computer program is stored, where the computer program, when executed by a processor, implements the method for generating Chinese near-meaning words based on OCR technology as described in any one of the above.
In the method, a Chinese request word for which near-meaning words are to be found is first obtained; the Chinese request word is then combined with a shape-similar word dictionary to construct a plurality of candidate words semantically similar to the Chinese request word; and near-meaning words of the Chinese request word are then selected from the candidate words and output. With the method and system for generating Chinese near-meaning words based on OCR technology provided by the present application, the cost of acquiring near-meaning words can be greatly reduced, near-meaning words and expanded corpora can be obtained generatively when corpus data is lacking, and erroneous words produced by a handwriting input method can be effectively recognized.
A computing device is also provided in embodiments of the present application, and with reference to fig. 5, the computing device comprises a memory 520, a processor 510, and a computer program stored in the memory 520 and executable by the processor 510, the computer program being stored in a space 530 for program code in the memory 520, the computer program, when executed by the processor 510, implementing the method steps 531 for performing any one of the methods according to embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 6, the computer readable storage medium comprises a storage unit for program code provided with a program 531' for performing the method steps according to an embodiment of the application, the program being executed by a processor.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed by a computer, the procedures or functions described in the embodiments of the application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating Chinese near-meaning words based on an OCR technology comprises the following steps:
acquiring a Chinese request word needing to find a similar meaning word;
combining the Chinese request word with a shape-similar word dictionary to construct a plurality of candidate words semantically similar to the Chinese request word; wherein the shape-similar word dictionary is created in advance based on a plurality of common words in a common Chinese dictionary, and comprises the common words and their corresponding shape-similar words;
and selecting and outputting the similar meaning words of the Chinese request words from the candidate words.
2. The method of claim 1, wherein before said combining the Chinese request word with the shape-similar word dictionary to construct a plurality of candidate words semantically similar to the Chinese request word, the method further comprises:
reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common words;
converting each common character into a common character picture and storing the common character picture;
performing vectorized representation of each common word image in combination with OCR technology to obtain an image vector corresponding to the common word image, and establishing a mapping relation between each common word and the corresponding image vector;
and constructing and storing a shape-similar word dictionary based on the mapping relation between each common word and the corresponding image vector.
3. The method of claim 2, wherein constructing and storing a shape-similar word dictionary based on the mapping relation between each common word and the corresponding image vector comprises:
selecting each common word in turn as a target word, calculating a first vector cosine similarity between the target word and each of the other common words in the common Chinese dictionary based on the mapping relation between each common word and the corresponding image vector, and comparing the first vector cosine similarity with a first preset threshold to obtain a first comparison result;
selecting shape-similar words of the target word according to the first comparison result, and recording the first vector cosine similarity between the target word and each shape-similar word;
and constructing and storing the shape-similar word dictionary based on the target words, the shape-similar words corresponding to the target words, and the first vector cosine similarity between each target word and its shape-similar words.
4. The method of claim 3, wherein the combining the Chinese request word with the shape-similar word dictionary to construct a plurality of candidate words semantically similar to the Chinese request word comprises:
performing word segmentation on the Chinese request word to obtain the request words contained in the Chinese request word;
for each request word, searching the shape-similar word dictionary to recall at least one target shape-similar word of the request word, together with the first vector cosine similarity between the request word and each target shape-similar word;
and permuting and combining the target shape-similar words according to the length of the Chinese request word to construct at least one candidate word corresponding to the Chinese request word.
5. The method of claim 4, wherein the selecting and outputting the synonym of the Chinese request word from the plurality of candidate words comprises:
for each candidate word, calculating the product of the first vector cosine similarities of the corresponding request words, and storing the product as a second vector cosine similarity between the Chinese request word and the candidate word;
sorting all the candidate words in descending order of the second vector cosine similarity, and comparing the second vector cosine similarity of each candidate word with a second preset threshold to obtain a second comparison result;
and selecting shape-similar words of the target words according to the second comparison result, and outputting them as near-meaning words of the Chinese request word.
6. A system for generating Chinese near-meaning words based on OCR technology comprises:
the request word acquisition module is configured to acquire a Chinese request word needing to find a similar meaning word;
a candidate word construction module configured to combine the Chinese request word with a shape-similar word dictionary to construct a plurality of candidate words semantically similar to the Chinese request word; wherein the shape-similar word dictionary is created in advance based on a plurality of common words in a common Chinese dictionary, and comprises the common words and their corresponding shape-similar words;
a synonym output module configured to select and output a synonym of the Chinese request word from the plurality of candidate words.
7. The system of claim 6, wherein the candidate word construction module is further configured to perform the following before constructing the plurality of candidate words:
reading a preset common Chinese dictionary, wherein the common Chinese dictionary comprises a plurality of common words;
converting each common character into a common character picture and storing the common character picture;
performing vectorized representation of each common word image in combination with OCR technology to obtain an image vector corresponding to the common word image, and establishing a mapping relation between each common word and the corresponding image vector;
and constructing and storing a shape-similar word dictionary based on the mapping relation between each common word and the corresponding image vector.
8. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese near-meaning words based on OCR technology as claimed in any one of claims 1-5.
9. A computer-readable storage medium, preferably a non-transitory readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method for generating Chinese near-meaning words based on OCR technology as claimed in any one of claims 1-5.
10. A computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method for generating Chinese near-meaning words based on OCR technology as claimed in any one of claims 1-5.
CN202210125902.2A 2022-02-10 2022-02-10 Method and system for generating Chinese near-meaning words based on OCR technology Pending CN115858797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210125902.2A CN115858797A (en) 2022-02-10 2022-02-10 Method and system for generating Chinese near-meaning words based on OCR technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210125902.2A CN115858797A (en) 2022-02-10 2022-02-10 Method and system for generating Chinese near-meaning words based on OCR technology

Publications (1)

Publication Number Publication Date
CN115858797A true CN115858797A (en) 2023-03-28

Family

ID=85659954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210125902.2A Pending CN115858797A (en) 2022-02-10 2022-02-10 Method and system for generating Chinese near-meaning words based on OCR technology

Country Status (1)

Country Link
CN (1) CN115858797A (en)

Similar Documents

Publication Publication Date Title
JP3689455B2 (en) Information processing method and apparatus
US8577882B2 (en) Method and system for searching multilingual documents
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
KR100903961B1 (en) Indexing And Searching Method For High-Demensional Data Using Signature File And The System Thereof
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN106933824A (en) The method and apparatus that the collection of document similar to destination document is determined in multiple documents
US20230065965A1 (en) Text processing method and apparatus
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN112836019A (en) Public health and public health named entity identification and entity linking method and device, electronic equipment and storage medium
CN113569018A (en) Question and answer pair mining method and device
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
US20220358158A1 (en) Methods for searching images and for indexing images, and electronic device
CN112434134B (en) Search model training method, device, terminal equipment and storage medium
CN115858797A (en) Method and system for generating Chinese near-meaning words based on OCR technology
CN111310442B (en) Method for mining shape-word error correction corpus, error correction method, device and storage medium
CN110941730A (en) Retrieval method and device based on human face feature data migration
CN112650870A (en) Method for training picture ordering model, and method and device for picture ordering
CN117094394B (en) Astronomical multi-mode knowledge graph construction method and system based on paper PDF
WO2021115115A1 (en) Zero-shot dynamic embeddings for photo search
CN112347284B (en) Combined trademark image retrieval method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination