CN112115933A - Character recognition method, device and storage medium

Info

Publication number: CN112115933A
Application number: CN202010864604.6A
Authority: CN
Inventors: 刘滨; 旷黎明; 林大
Original assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Current assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2020-12-22

Abstract

The invention discloses a character recognition method comprising the following steps: acquiring target characters and establishing at least one target character library; constructing a target data structure for each target character library based on that library and a preset data processing structure; acquiring characters to be processed; and identifying the characters to be processed according to the calling relationship between a micro-service and the target data structure to obtain an identification result. The purpose is to ensure that big data collected in the industrial Internet of Things field is standardized and legal, to save memory space, and to sustain a high QPS (queries per second).

Description

Character recognition method, device and storage medium

Technical Field

The invention relates to the technical field of character processing in the industrial Internet, and in particular to a character recognition method, a character recognition device and a storage medium.

Background

The industrial Internet is the result of the convergence of global industrial systems with advanced computing, analytics, sensing technologies and Internet connectivity. Through an open, global, industrial-grade network platform, equipment, production lines, factories, suppliers, products and customers can be tightly connected and integrated, the various resources of the industrial economy can be shared efficiently, and the manufacturing industry is helped to extend its industrial chain. These resources may, however, contain illegal characters. Such characters may, for example, appear in test data and need to be recognized, so that problems in the test data are avoided or problems in the test process are identified in time.

At present, a commonly used approach is to package the character recognition logic in a traditional form, such as a JAR package. As a result, every service that needs illegal character filtering must load its own copy of the illegal character library. For example, if 10 services integrate the JAR package and the illegal character library occupies 1 GB, 9 GB of memory is wasted. The prior-art illegal character filtering method therefore occupies excessive memory and reduces filtering efficiency.

Disclosure of Invention

The invention aims to provide a character recognition method, a character recognition device and a storage medium that ensure that big data collected in the industrial Internet of Things field is standardized and legal, save memory space, and achieve an efficient QPS. The invention avoids the situation in existing industrial scenarios in which every micro-service in a micro-service architecture must load a massive word library, thereby saving a large amount of memory space and improving service availability.

In order to achieve the above object, there is provided a character recognition method including:

acquiring target characters and establishing at least one target character library;

constructing a target data structure of each target character library based on each target character library and a preset data processing structure;

acquiring characters to be processed;

and identifying the character to be processed according to the calling relationship between the micro service and the target data structure, and acquiring an identification result.

In one implementation, the step of obtaining the target character and building at least one target character library includes:

obtaining an illegal character, wherein the illegal character is a preset character;

determining the illegal character as a target character;

forming the target characters into a target character library;

and loading the data corresponding to the target character library into a memory for data processing.

In one implementation, the step of constructing a target data structure of each target character library based on each target character library and a preset data processing structure includes:

determining a preset data processing structure as a B_Tree data structure;

and constructing each target character library into a tree data structure according to the B_Tree data structure.

In one implementation manner, the step of identifying the character to be processed according to the call relationship between the microservice and the target data structure and obtaining the identification result includes:

calling the tree-shaped data structure based on the data processing memory;

filtering the character to be processed to obtain a character filtering result;

judging whether the character filtering result contains the same character as the character to be processed;

if so, confirming that the character to be processed contains an illegal character.

In one implementation manner, the step of filtering the character to be processed to obtain a character filtering result includes:

when the characters to be processed are a plurality of characters, acquiring a first character in the characters to be processed;

filtering based on the first character, and obtaining a character filtering result;

if the character filtering result contains the character which is the same as the first character, acquiring a second character in the characters to be processed;

based on the position of the first character in the tree data structure, filtering the second character, and acquiring a filtering result;

wherein the first character is in an order prior to the second character.

In one implementation, the step of constructing each target character library into a tree data structure includes:

and constructing each target character library into a tree data structure by adopting hashmap.

In one implementation, the method further comprises:

acquiring a row position of a first character in the tree data structure;

judging whether the character is the last character at the row position of the tree data structure;

if not, setting the position of the character in the tree structure as a first flag bit;

otherwise, setting the position of the character as a second flag bit in the tree structure.

In one implementation, the method further comprises:

if the flag bit is the second flag bit, ending the search of the row;

otherwise, continuing the search of the row based on the second character.

In addition, the invention also discloses a character recognition device, which comprises a processor and a memory connected with the processor through a communication bus; wherein,

the memory is used for storing a character recognition program;

the processor is configured to execute the character recognition program to implement any of the character recognition steps.

The invention further discloses a storage device, being a computer storage device, on which one or more programs are stored, the one or more programs being executable by one or more processors to cause the one or more processors to perform any of the character recognition steps.

The character recognition method provided by the embodiment of the invention has the following beneficial effects:

(1) By preloading the word library and adopting the B_Tree algorithm, the invention solves the problem of responding quickly when searching a massive word library after information is entered in an industrial scenario.

(2) Illegal character filtering is built as an independent service, and services that need the filtering call it through a Feign interface. This avoids every micro-service in an industrial micro-service architecture having to load a massive word library, which saves a large amount of memory space and improves service availability.

(3) A custom annotation makes it convenient to integrate the filtering into services that need illegal word filtering.

(4) The invention ensures that big data collected in the industrial Internet of Things field is standardized and legal, saves memory space, and achieves a high QPS.
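The effects above mention a custom annotation without showing one. The following is a minimal Java sketch of how such an annotation might look, assuming a hypothetical annotation named CheckIllegalWords and a reflection-based checker; none of these names come from the patent. In the described architecture the check itself would presumably be delegated to the independent filtering service, which is represented here only by a caller-supplied predicate.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.function.Predicate;

/** Hypothetical marker annotation: fields carrying it are checked for illegal words. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface CheckIllegalWords {
}

/** Illustrative only: reflection-based checker for annotated String fields. */
final class AnnotatedFieldChecker {

    /** Returns true if any annotated String field of the object fails the check. */
    static boolean hasIllegalField(Object request, Predicate<String> isIllegal) {
        for (Field field : request.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(CheckIllegalWords.class)) continue;
            if (field.getType() != String.class) continue;
            field.setAccessible(true);
            try {
                String value = (String) field.get(request);
                if (value != null && isIllegal.test(value)) return true;
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + field.getName(), e);
            }
        }
        return false;
    }
}

/** Hypothetical request type; only the annotated field is checked. */
final class DeviceReport {
    @CheckIllegalWords
    String remark;
    String deviceId;

    DeviceReport(String deviceId, String remark) {
        this.deviceId = deviceId;
        this.remark = remark;
    }
}
```

A service would only need to mark the relevant fields; the unannotated deviceId field above is never inspected.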

Drawings

Fig. 1 is a flow chart of a character recognition method according to an embodiment of the present invention.

Fig. 2 shows a specific embodiment of the character recognition method according to the embodiment of the present invention.

Fig. 3 shows another embodiment of the character recognition method according to the present invention.

Fig. 4 is a diagram illustrating yet another embodiment of the character recognition method according to the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Please refer to fig. 1-4. It should be noted that the drawings provided in the present embodiment are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

The invention provides a character recognition method as shown in fig. 1, which comprises the following steps:

S101, acquiring target characters and establishing at least one target character library.

It should be noted that, in an industrial scenario, the library of illegal characters is massive. The illegal characters can therefore be collected from past experience; specifically, they can be classified and grouped according to actual requirements, and the classified characters are used as target characters to obtain the target character libraries.

In an implementation manner of the present invention, the step of obtaining the target character and building at least one target character library includes:

S1011, obtaining an illegal character, wherein the illegal character is a preset character.

It will be appreciated that the user may have collected characters in advance as illegal characters, for example, in an illegal character database.

S1012, determining the illegal character as a target character.

It should be noted that the target characters are the basis for character recognition or filtering. Because the target characters are illegal characters, any character obtained later that is the same as a target character is regarded as illegal; the preset illegal characters are therefore used as the target characters.

S1013, forming the target characters into a target character library.

The target characters are then combined into a target character library as a whole. For example, one target character library may correspond to the illegal data of a test process or to the test tool of a product, so that testing in the industrial Internet of Things can proceed smoothly without replacing the illegal characters midway. When the illegal characters are updated, the new ones can be added to the target character library manually to form an updated target character library.

S1014, loading the data corresponding to the target character library into memory for data processing.

In the embodiment of the invention, the target character library is loaded into memory to improve its running efficiency and speed in service, which improves read efficiency and shortens response time. An input item therefore does not have to be looked up in the database each time to determine whether it is an illegal character, which improves the speed and efficiency of searching for illegal characters.
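Steps S1011 to S1014 can be sketched in Java as follows. This is a minimal illustration under assumptions: the class TargetLibraryRegistry and its methods are hypothetical names, and the preset illegal words are passed in as a list rather than read from a real illegal character database.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory registry of target character libraries. */
final class TargetLibraryRegistry {
    // library name -> illegal words held in memory
    private final Map<String, Set<String>> libraries = new LinkedHashMap<>();

    /** S1011 to S1013: take preset illegal words and group them into a named library. */
    void addLibrary(String name, List<String> presetIllegalWords) {
        Set<String> words = libraries.computeIfAbsent(name, k -> new LinkedHashSet<>());
        for (String w : presetIllegalWords) {
            if (w != null && !w.trim().isEmpty()) {
                words.add(w.trim()); // blank entries and duplicates are dropped
            }
        }
    }

    /** S1014: the library is served from memory, with no database lookup per request. */
    Set<String> wordsOf(String name) {
        return Collections.unmodifiableSet(
                libraries.getOrDefault(name, Collections.emptySet()));
    }

    int libraryCount() {
        return libraries.size();
    }
}
```

Calling addLibrary again with the same name adds the new words to the existing library, which corresponds to the manual update of a target character library described above.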

S102, constructing a target data structure of each target character library based on each target character library and a preset data processing structure.

In an implementation manner of the present invention, the step of constructing the target data structure of each target character library based on each target character library and a preset data processing structure includes: determining the preset data processing structure as a B_Tree data structure; and constructing each target character library into a tree data structure according to the B_Tree data structure.

In the embodiment of the invention, the data is processed through a preset data processing structure, so that the subsequent data identification or filtering process is facilitated.

It should be noted that a data structure refers to a collection of data elements that have one or more specific relationships with each other. Typically, a carefully selected data structure can lead to greater operational or storage efficiency.

The data-element relationships that the preset data processing structure establishes between the characters in the target character library are therefore used to make the characters easy to find.
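As a rough illustration of step S102, the sketch below builds one tree per target character library. The names LibraryTreeBuilder, Node, build and buildAll are assumptions made for illustration, and the tree shown is a simple character-keyed prefix tree chosen to match the HashMap example given later in this description; the patent does not specify the structure at this level of detail.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: builds one character-keyed tree per target character library. */
final class LibraryTreeBuilder {

    /** One tree node; children are keyed by the next character of a word. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true when a library word ends at this node
    }

    /** Builds the tree for a single library. */
    static Node build(List<String> libraryWords) {
        Node root = new Node();
        for (String word : libraryWords) {
            if (word == null || word.isEmpty()) continue;
            Node current = root;
            // Each char is one level; characters outside the BMP would span two levels.
            for (char c : word.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
            current.endOfWord = true;
        }
        return root;
    }

    /** Builds one tree for each named library. */
    static Map<String, Node> buildAll(Map<String, List<String>> libraries) {
        Map<String, Node> trees = new HashMap<>();
        for (Map.Entry<String, List<String>> e : libraries.entrySet()) {
            trees.put(e.getKey(), build(e.getValue()));
        }
        return trees;
    }

    /** Counts the nodes below the given node; shared prefixes are stored once. */
    static int countNodes(Node node) {
        int n = 0;
        for (Node child : node.children.values()) {
            n += 1 + countNodes(child);
        }
        return n;
    }
}
```

Because words with a common prefix share nodes, a lookup only ever follows the branch selected by the characters already matched.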

S103, acquiring characters to be processed.

It is understood that the characters to be processed are characters generated during the operation of the industrial Internet system; after the processing result for these characters is produced, it can be confirmed whether the process has generated an illegal character.

S104, identifying the characters to be processed according to the calling relationship between the micro-service and the target data structure, and acquiring an identification result.

In one implementation, this step includes: calling the tree data structure based on the data processing memory; filtering the characters to be processed to obtain a character filtering result; judging whether the character filtering result contains a character that is the same as a character to be processed; and if so, confirming that the characters to be processed contain an illegal character.

In the embodiment of the invention, in an industrial scenario, the illegal character filtering function is built as an independent service under a micro-service architecture, which saves memory space. The target character library is called through the micro-service, so it is invoked only when needed and memory is not continuously occupied.

This solves the prior-art problem that every service needing illegal character filtering must load its own illegal character library (for example, when 10 services integrate the JAR package and the library occupies 1 GB, 9 GB of memory is wasted), which results in low operating efficiency.

By applying the embodiment of the invention, the illegal character filtering service can be deployed as a cluster, which copes with high QPS and improves system availability. The illegal character library in memory uses the B_Tree data structure. If the library were instead held as one whole character string in Java, checking whether an input item appears in it would be very inefficient. Constructing the library as a tree data structure greatly narrows the matching range of the search when judging whether a word is illegal.

In an implementation manner shown in fig. 3, the step of filtering the character to be processed to obtain a character filtering result includes:

when the characters to be processed are a plurality of characters, acquiring a first character of the characters to be processed; filtering based on the first character to obtain a character filtering result; if the character filtering result contains a character that is the same as the first character, acquiring a second character of the characters to be processed; and filtering the second character based on the position of the first character in the tree data structure to obtain a filtering result; wherein the first character precedes the second character in order.

As shown in Fig. 3, take as an example judging whether the two-character word "teacher" (老师, first character 老 "old", second character 师 "teacher") is an illegal word. The tree to be searched is identified from the first character as the tree of Fig. 3. The character "old" is then retrieved and recognized, and is located in the first row. Because the characters of a word are linked in the tree, the second character needs to be checked only against the characters that follow "old"; if it is not the next character there, the word does not match. Only the first character is therefore retrieved at first, after which the second character can be determined directly. This reduces the amount of data searched and improves retrieval efficiency.

It should be noted that, in one implementation, when the characters to be processed number 3, 4, 5 or more, the first character need not be the first character in the sequence and may be any other character. For example, with 3 characters, the second character in the sequence may be defined as the first character, and the third character as the second character. In another implementation, the first character and the second character mentioned in the embodiment of the invention may be extended in sequence according to the number of characters: with 3 characters, a first, second and third character may be set in sequence; with 4 characters, a first, second, third and fourth character may be set in sequence. Both cases can be implemented according to the embodiment shown in Fig. 3 and are not described in detail here.
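The character-by-character filtering described above can be sketched as follows. The class name SequentialCharFilter and its methods are hypothetical, and ASCII words are used in place of the characters of Fig. 3; the point illustrated is only that each character is looked up from the position reached by the previous one.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: character-by-character filtering against a tree of illegal words. */
final class SequentialCharFilter {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    void addIllegalWord(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.next.computeIfAbsent(c, k -> new Node());
        }
        if (current != root) current.endOfWord = true; // ignore empty words
    }

    /** Returns true if the characters to be processed exactly form a library word. */
    boolean isIllegalWord(String input) {
        Node position = root;
        for (char c : input.toCharArray()) {
            // The lookup for this character starts from the position of the previous one.
            position = position.next.get(c);
            if (position == null) return false; // no library word continues this way
        }
        return position != root && position.endOfWord;
    }

    /** Returns true if any substring of the text is a library word. */
    boolean containsIllegalWord(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node position = root;
            for (int i = start; i < text.length(); i++) {
                position = position.next.get(text.charAt(i));
                if (position == null) break;          // try the next start offset
                if (position.endOfWord) return true;  // a full library word was matched
            }
        }
        return false;
    }
}
```

When the first character is absent from the root, the inner loop stops immediately, so most inputs are rejected after a single map lookup per start position.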

In one implementation, each target character library is constructed into a tree data structure by using a HashMap. The row position of the first character in the tree data structure is acquired, and it is judged whether the character is the last character at that row position. If not, the flag bit at the character's position is set to the first flag bit; otherwise, it is set to the second flag bit.

In the embodiment of the present invention, a HashMap is used to construct the tree structure, for example:

{one={old={isEnd=0, teacher={isEnd=1}}, isEnd=0}}

It is judged whether a character is the last character of a word. If it is, the sensitive word ends at that character and the flag bit isEnd is set to 1; otherwise, isEnd is set to 0.

To retrieve "teacher", hashMap = hashMap.get("old") is executed first; if an entry is found in the hashMap, an illegal word beginning with "old" exists.

In one implementation, the method further comprises:

if the flag bit is the second flag bit, ending the search of the row;

otherwise, continuing the search of the row based on the second character.

Continuing the example, hashMap = hashMap.get("teacher") is executed next; since isEnd is 1, the search is finished.
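The nested HashMap with the isEnd flag can be sketched as below. This is an assumption-laden illustration rather than the patented implementation: the class NestedMapWordTree and its method names are hypothetical, single-character strings are used as keys, and isEnd is stored as an Integer (0 for the first flag bit, 1 for the second).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: nested HashMap tree with an "isEnd" flag per character. */
final class NestedMapWordTree {
    static final String IS_END = "isEnd"; // 1 = a word ends here, 0 = it continues

    final Map<String, Object> root = new HashMap<>();

    /** Adds one illegal word, creating one nested map per character. */
    @SuppressWarnings("unchecked")
    void add(String word) {
        Map<String, Object> current = root;
        for (int i = 0; i < word.length(); i++) {
            String key = String.valueOf(word.charAt(i));
            Map<String, Object> child = (Map<String, Object>) current.get(key);
            if (child == null) {
                child = new HashMap<>();
                child.put(IS_END, 0);
                current.put(key, child);
            }
            if (i == word.length() - 1) {
                child.put(IS_END, 1); // last character of the word
            }
            current = child;
        }
    }

    /** Walks the nested maps with successive get calls and stops when isEnd is 1. */
    @SuppressWarnings("unchecked")
    boolean startsWithIllegalWord(String input) {
        Map<String, Object> current = root;
        for (int i = 0; i < input.length(); i++) {
            Object next = current.get(String.valueOf(input.charAt(i)));
            if (!(next instanceof Map)) return false; // no word continues with this character
            current = (Map<String, Object>) next;
            if (Integer.valueOf(1).equals(current.get(IS_END))) return true; // search finished
        }
        return false;
    }
}
```

Single-character keys can never collide with the five-character "isEnd" key, which is why both kinds of entry can share one map in this sketch.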

To sum up, as shown in Fig. 4, the illegal word server first performs step 1: the illegal word library is loaded into memory and then encapsulated into a B_Tree data structure. This is a key step in the embodiment of the invention and forms the basis for character recognition. On the user side, in step 2, the user submits information to the micro-service. In step 3, the micro-service calls the word library to verify the submitted information, that is, it submits the characters to be processed. In step 4, the illegal word server (that is, the end that integrates the illegal word library service) searches the B_Tree for illegal words, and in step 5 it returns the verification result to the micro-service. The micro-service then performs step 6: it judges whether the user request is to be submitted or rejected and returns the submission result to the user side, which ends the whole process.
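The flow of Fig. 4 can be outlined in Java as follows. This is only a sketch under stated assumptions: a plain Java interface (IllegalWordClient, a hypothetical name) stands in for the remote call between the micro-service and the illegal word server, no real Feign API is used, and the server side does a simple substring scan where the patent uses the tree lookup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: the Fig. 4 flow with plain interfaces in place of remote calls. */
final class SubmissionFlowSketch {

    /** Stand-in for the remote interface of the independent illegal word service. */
    interface IllegalWordClient {
        boolean containsIllegalWord(String text);
    }

    /** Illegal word server side: the library is loaded once and kept in memory (step 1). */
    static final class InMemoryIllegalWordService implements IllegalWordClient {
        private final Set<String> library;

        InMemoryIllegalWordService(String... words) {
            this.library = new HashSet<>(Arrays.asList(words));
        }

        @Override
        public boolean containsIllegalWord(String text) {
            // A plain scan keeps the sketch short; a tree lookup would go here (step 4).
            for (String word : library) {
                if (!word.isEmpty() && text.contains(word)) return true;
            }
            return false; // the verification result is returned to the caller (step 5)
        }
    }

    /** Business micro-service: holds no word library, only a client to the shared service. */
    static final class BusinessService {
        private final IllegalWordClient client;

        BusinessService(IllegalWordClient client) {
            this.client = client;
        }

        /** Steps 2, 3 and 6: receive user input, verify it remotely, then accept or reject. */
        String submit(String userInput) {
            boolean illegal = client.containsIllegalWord(userInput);
            return illegal ? "REJECTED" : "SUBMITTED";
        }
    }
}
```

Any number of BusinessService instances can share one IllegalWordClient, so the word library exists in memory only on the illegal word server side.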

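In addition, an embodiment of the invention provides a character recognition device comprising a processor and a memory connected to the processor through a communication bus. The memory is used for storing a character recognition program, and the processor is configured to execute the character recognition program to implement any of the character recognition steps described above.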

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

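1. A method of character recognition, the method comprising:

acquiring target characters and establishing at least one target character library;

constructing a target data structure of each target character library based on each target character library and a preset data processing structure;

acquiring characters to be processed;

and identifying the character to be processed according to the calling relationship between the micro service and the target data structure, and acquiring an identification result.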

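2. The method of claim 1, wherein the step of obtaining the target characters and building at least one target character library comprises:

obtaining an illegal character, wherein the illegal character is a preset character;

determining the illegal character as a target character;

forming the target characters into a target character library;

and loading the data corresponding to the target character library into a memory for data processing.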

3. The character recognition method of claim 2, wherein the step of constructing the target data structure of each target character library based on each target character library and the preset data processing structure comprises:

determining a preset data processing structure as a B_Tree data structure;

and constructing each target character library into a tree data structure according to the B_Tree data structure.

4. The character recognition method according to claim 3, wherein the step of recognizing the character to be processed according to the calling relationship between the micro service and the target data structure and obtaining the recognition result comprises:

calling the tree data structure based on the data processing memory;

filtering the character to be processed to obtain a character filtering result;

judging whether the character filtering result contains the same character as the character to be processed;

if so, confirming that the character to be processed contains an illegal character.

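5. The character recognition method according to claim 4, wherein the step of filtering the character to be processed to obtain a character filtering result comprises:

when the characters to be processed are a plurality of characters, acquiring a first character in the characters to be processed;

filtering based on the first character, and obtaining a character filtering result;

if the character filtering result contains the character which is the same as the first character, acquiring a second character in the characters to be processed;

based on the position of the first character in the tree data structure, filtering the second character, and acquiring a filtering result;

wherein the first character is in an order prior to the second character.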

6. The character recognition method of claim 5, wherein the step of constructing each target character library into a tree-like data structure comprises:

constructing each target character library into a tree data structure by adopting a HashMap.

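7. The character recognition method of claim 6, further comprising:

acquiring a row position of a first character in the tree data structure;

judging whether the character is the last character of the row position of the tree data structure;

if not, setting the position of the character in the tree structure as a first flag bit;

otherwise, setting the position of the character as a second flag bit in the tree structure.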

8. The character recognition method of claim 6, further comprising:

if the flag bit is the second flag bit, ending the search of the row;

otherwise, continuing the search of the row based on the second character.

9. A character recognition apparatus, comprising a processor, and a memory connected to the processor via a communication bus; wherein,

the memory is used for storing a character recognition program;

the processor for executing the character recognition program to implement the character recognition steps of any one of claims 1 to 8.

10. A storage device, being a computer storage device, having one or more programs stored thereon, the one or more programs being executable by one or more processors to cause the one or more processors to perform the character recognition steps of any of claims 1 to 8.