CN111914513A - RDP window title character recognition method and device - Google Patents

Info

Publication number
CN111914513A
Authority
CN
China
Prior art keywords
character
rdp
characteristic information
conversion table
font
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910379750.7A
Other languages
Chinese (zh)
Inventor
周春楠
赵之阳
郭波
赵贵阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yiyang Safety Technology Co ltd
Original Assignee
Yiyang Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yiyang Safety Technology Co ltd
Priority to CN201910379750.7A
Publication of CN111914513A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/08 Protocols specially adapted for terminal emulation, e.g. Telnet

Abstract

A method for recognizing the text of RDP window titles, the method comprising: defining a text conversion table, wherein the text conversion table comprises glyph feature information and the corresponding character encoding information; intercepting and parsing RDP protocol data containing RDP window titles, and extracting the glyph feature information data in the RDP window titles; inputting the glyph feature information data in the RDP window title into the text conversion table, and obtaining the corresponding character encoding information data by table lookup; and storing the character encoding information data of the RDP window title in a database. The invention also discloses a device for recognizing the text of RDP window titles. The method and the device can improve the accuracy of RDP window title text recognition and can also shorten the time it takes.

Description

RDP window title character recognition method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for identifying title characters of an RDP window.
Background
The Remote Desktop Protocol (RDP) is a multi-channel protocol that allows a client to connect to a server providing terminal services. RDP attempts to provide only those services that are necessary to achieve efficient operation and small size. RDP supports virtual channels that carry data traffic between the client and the server, and RDP-based remote desktops allow users to use all applications, files, and network resources on a remote computer without running programs locally. With the rapid development of the Internet, RDP has become widely used, and the associated network security risks have grown. The content transmitted over RDP therefore needs to be audited so that security events such as illegal access and abnormal operations are discovered in time and network security is safeguarded.
One of the important tasks of RDP auditing is to audit the window titles transmitted over RDP. The common way to transmit text between computers is as follows: the sending end transmits character codes such as Unicode or ASCII, and the receiving end displays the characters after rendering them with a software or hardware font library; if the font is missing at the receiving end, the content cannot be displayed correctly. RDP transmits window title text differently: it directly transmits the glyph code (a kind of dot-matrix code). To output a Chinese character on a display or printer, the character is designed as a dot-matrix image of its graphic symbol, from which the corresponding dot-matrix code is obtained. Transmitting the glyph avoids the problem that the remote receiving end cannot display the text correctly because it lacks the font library. On the other hand, because the text is transmitted as glyph codes that express only its shape, the receiving end can display the glyphs correctly, but an auditor cannot automatically recognize the meaning of the text and therefore cannot audit it.
The prior art generally uses OCR technology to recognize the text of window titles transmitted over RDP, but this has the following problems:
First, OCR recognition is too slow: it takes a fraction of a second to recognize 10 characters.
Second, OCR requires the characters to be recognized at a relatively high resolution, generally more than 16 × 16 pixels, whereas most window title characters transmitted over RDP are only 10 × 10 or 12 × 12 pixels, so both the recognition rate and the accuracy of OCR are low.
There is therefore a need for a fast and reliable method for recognizing the text of window titles transmitted over RDP.
Disclosure of Invention
The invention discloses a method for identifying RDP window title characters, which comprises the following steps:
defining a text conversion table, wherein the text conversion table comprises glyph feature information and the corresponding character encoding information;
intercepting and parsing RDP protocol data containing RDP window titles, and extracting the glyph feature information data in the RDP window titles;
inputting the glyph feature information data in the RDP window title into the text conversion table, and obtaining the corresponding character encoding information data by table lookup;
and storing the character coding information data of the RDP window title into a database.
Specifically, the method for defining the text conversion table includes:
representing the glyph feature information by binary number;
converting each item of glyph feature information into a unique unified query code, wherein the unified query code is a binary number with a constant number of bits C; the conversion rule for the unified query code is specifically: C is set to the maximum number of bits among the binary numbers representing glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code is equal to that binary number; otherwise, the unified query code is obtained by converting that binary number into a binary number of C bits through a bit-padding algorithm;
placing the glyph feature information and the unified query code in one-to-one correspondence with the corresponding character encoding information to generate the text conversion table;
and storing the text conversion table in a binary tree data structure, wherein the unified query codes of the text conversion table are stored as nodes of the binary tree and the character encoding information in the text conversion table is stored as leaf nodes of the binary tree.
Specifically, the method for inputting the font characteristic information data in the RDP window header into the text conversion table and obtaining the corresponding character encoding information data by looking up the table includes:
representing the font characteristic information data in the RDP window header by binary number;
converting the binary number representing the glyph feature information data in the RDP window title into unified query code data by using the bit-padding algorithm according to the conversion rule for the unified query code;
inputting the unified query code data into the text conversion table, wherein the text conversion table is stored in a binary tree data structure; and sequentially matching each digit of the unified query code data with nodes starting from the root of the binary tree one by one to form a path, searching leaf nodes, and obtaining the character coding information data of the RDP window title.
Specifically, the method for defining a text conversion table further includes:
representing the glyph feature information by binary number;
generating a character conversion record by using the character pattern characteristic information and the corresponding character coding information;
forming a plurality of text conversion records into one text conversion table, wherein the binary numbers representing glyph feature information in all records of one table have the same number of bits; records whose binary numbers have different numbers of bits form separate text conversion tables; the plurality of text conversion tables form a text conversion library;
and storing each text conversion table in a binary tree data structure, wherein the glyph feature information of the text conversion table is stored as nodes of the binary tree and the character encoding information in the text conversion table is stored as leaf nodes of the binary tree.
Specifically, the method for inputting the font characteristic information data in the RDP window header into the text conversion table and obtaining the character encoding information data of the corresponding RDP window header by looking up the table further includes:
representing the font characteristic information data in the RDP window header by binary number;
matching the corresponding character conversion table in the character conversion library according to the digit of the binary number representing the character pattern characteristic information in the RDP window header;
inputting the binary number representing the font characteristic information in the RDP window header into the matched text conversion table, wherein the text conversion table is stored in a data structure of a binary tree; and sequentially matching each digit of the binary number representing the font characteristic information in the RDP window title with the nodes starting from the root of the binary tree one by one to form a path, searching leaf nodes, and obtaining the character coding information data of the RDP window title.
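The library variant above can be sketched as a small Python example. This is only an illustration under stated assumptions: the names `library`, `add_record`, and `recognize` are hypothetical, and each per-length table is a plain dictionary for brevity, whereas the text stores each table as a binary tree. The two glyphs are the 10 × 10 and 12 × 12 dot matrices of "中" given as bit strings in the description.

```python
from collections import defaultdict

# Hypothetical conversion library: number of bits -> {glyph bits: character}.
# The text stores each table as a binary tree; a dict is used here for brevity.
library = defaultdict(dict)

def add_record(glyph_bits, char):
    """File a conversion record in the table that matches its bit length."""
    library[len(glyph_bits)][glyph_bits] = char

def recognize(glyph_bits):
    """Pick the table by bit length, then look the glyph up in it."""
    table = library.get(len(glyph_bits))
    return None if table is None else table.get(glyph_bits)

# "中" as 10 x 10 and 12 x 12 dot matrices, written row by row.
ZHONG_100 = "".join(
    ["0000110000"] * 2 + ["0111111110"] + ["0100110010"] * 3
    + ["0111111110"] + ["0000110000"] * 3
)
ZHONG_144 = "".join(
    ["000001100000"] * 3 + ["011111111110"] + ["010001100010"] * 3
    + ["011111111110"] + ["000001100000"] * 4
)

add_record(ZHONG_100, "\u4e2d")
add_record(ZHONG_144, "\u4e2d")

print(sorted(library))        # bit lengths that have a table
print(recognize(ZHONG_100))   # looked up in the 100-bit table
print(recognize("0" * 64))    # no 64-bit table exists, so None
```

Selecting the table by bit length first means no padding step is needed, at the cost of keeping one table per resolution.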
The invention also discloses a device for identifying the title characters of the RDP window, which comprises the following components:
the conversion table definition unit is used for defining a character conversion table, wherein the character conversion table comprises character pattern characteristic information and character coding information;
the data extraction unit is used for intercepting and analyzing RDP protocol data containing the RDP window title and extracting font characteristic information data of the RDP window title;
and the character recognition unit is used for inputting the font characteristic information data of the RDP window title acquired from the data extraction unit into the character conversion table acquired from the conversion table definition unit, searching the character conversion table to acquire corresponding character coding information data and storing the corresponding character coding information data into a database.
Specifically, the conversion table defining unit includes:
the font characteristic information digitization module is used for representing the font characteristic information by binary numbers;
the unified query code generation module is used for converting each item of glyph feature information acquired from the glyph feature information digitization module into a unique unified query code, wherein the unified query code is a binary number with a constant number of bits C; the conversion rule for the unified query code is specifically: C is set to the maximum number of bits among the binary numbers representing glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code is equal to that binary number; otherwise, the unified query code is obtained by converting that binary number into a binary number of C bits through a bit-padding algorithm;
the text conversion table generation module is used for placing the glyph feature information acquired from the glyph feature information digitization module and the unified query code acquired from the unified query code generation module in one-to-one correspondence with the corresponding character encoding information to generate the text conversion table;
and the character conversion table storage module is used for storing the character conversion table acquired from the character conversion table generation module in a binary tree data structure, the unified query code of the character conversion table is stored as a node of the binary tree, and the character coding information in the character conversion table is stored as a leaf node of the binary tree.
Specifically, the character recognition unit includes:
a font characteristic information data binarization module for representing font characteristic information data in the RDP window header acquired from the data extraction unit by binary number;
the unified query code data conversion module is used for converting the binary number representing the glyph feature information data in the RDP window title, which is acquired from the glyph feature information data binarization module, into unified query code data by using the bit-padding algorithm according to the conversion rule for the unified query code;
the query module I is used for inputting the unified query code data obtained from the unified query code data conversion module into the character conversion table obtained from the character conversion table storage module, and the character conversion table is stored in a data structure of a binary tree; sequentially matching each digit of the unified query code data with nodes starting from the root of the binary tree one by one to form a path, searching leaf nodes, and obtaining character coding information data of the RDP window title; and saved to the database.
Specifically, the conversion table defining unit further includes:
the font characteristic information digitization module is used for representing the font characteristic information by binary numbers;
the text conversion library generation module is used for placing the glyph feature information acquired from the glyph feature information digitization module in correspondence with the corresponding character encoding information to generate text conversion records; forming a plurality of text conversion records into one text conversion table, wherein the binary numbers representing glyph feature information in all records of one table have the same number of bits; records whose binary numbers have different numbers of bits form separate text conversion tables; the plurality of text conversion tables form a text conversion library;
and the text conversion library storage module is used for storing each text conversion table acquired from the text conversion library generation module in a binary tree data structure, wherein the glyph feature information of the text conversion table is stored as nodes of the binary tree and the character encoding information in the text conversion table is stored as leaf nodes of the binary tree.
Specifically, the character recognition unit further includes:
a font characteristic information data binarization module for representing font characteristic information data in the RDP window header acquired from the data extraction unit by binary number;
a matching module for matching the corresponding character conversion table in the character conversion library obtained from the character conversion library storage module according to the digit number of the binary number representing the character form characteristic information in the RDP window title obtained from the character form characteristic information data binarization module; sending the characteristic attribute information of the character conversion table to a second query module;
the query module II is used for inputting the binary number representing the font feature information in the RDP window title, which is acquired from the font feature information data binarization module, into the matched text conversion table acquired from the text conversion table storage module, the text conversion table is stored in a binary tree data structure, each digit number of the binary number representing the font feature information in the RDP window title is sequentially matched with a node starting from the root of the binary tree one by one to form a path, leaf nodes are searched, and the character coding information data of the RDP window title is acquired; and saved to the database.
Compared with the prior art, the invention has the following beneficial effects: received glyphs can be quickly and correctly recognized as characters with the corresponding meaning and then made available for auditing. The corresponding software program is small and practical.
First, recognition is fast. Characters are recognized by querying the text conversion table, and the time complexity of recognizing a single character is O(1), independent of the number of characters contained in the table. The speed of OCR varies with the OCR algorithm, but because OCR must match against the characters contained in its library and the feature length of each character, it reaches O(N^2) at best.
Second, accuracy is high. Characters are recognized by a one-to-one lookup in the text conversion table, so garbled or wrong characters do not occur.
Third, memory usage is small. The space complexity of the text conversion table is O(N), where N is the number of binary digits representing the glyph feature information; it is independent of the number of characters contained in the table.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for RDP window title text recognition according to an embodiment of the present application;
Fig. 2 is a schematic diagram of glyph feature information;
Fig. 3 is a schematic flowchart of another method for RDP window title text recognition according to a second embodiment of the present application;
Fig. 4 is a schematic diagram of alternative glyph feature information;
Fig. 5 is a schematic diagram of a text conversion table stored in a binary tree structure;
Fig. 6 is a schematic flowchart of a method for RDP window title text recognition according to a third embodiment of the present application;
Fig. 7 is a schematic structural diagram of a device for RDP window title text recognition according to a fourth embodiment of the present application;
Fig. 8 is a schematic structural diagram of a device for RDP window title text recognition according to a fifth embodiment of the present application;
Fig. 9 is a schematic structural diagram of another device for RDP window title text recognition according to a sixth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for recognizing title text of an RDP window according to an embodiment of the present application, where the method includes:
step S101: defining a character conversion table, wherein the character conversion table comprises character pattern characteristic information and corresponding character coding information.
The character encoding information refers to the binary character encoding used in computer character systems, that is, binary numbers that correspond to the characters in a character set. Common character encodings include ASCII, EBCDIC, GB2312, Unicode, UTF-8, and the like. The character encoding information is an interchange code for input and output with the system platform, and it makes text convenient to store in a computer and to transmit over a communication network. The same character may have different codes in different character encoding formats; for example, the Chinese character "中" has the Unicode code "\u4e2d" and the UTF-8 encoding "0xE4 0xB8 0xAD".
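As a quick check of the example above, the following Python sketch prints the code point and the UTF-8 bytes of "中" using only built-in string methods. The variable names are illustrative.

```python
# Illustrative only: one character, two encodings of it.
ch = "\u4e2d"                    # the Chinese character from the example

code_point = ord(ch)             # Unicode code point as an integer
utf8_bytes = ch.encode("utf-8")  # UTF-8 byte sequence for the same character

print(f"U+{code_point:04X}")     # code point in hexadecimal
print(utf8_bytes.hex(" "))       # UTF-8 bytes in hexadecimal
```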
Glyph feature information arises as follows: to output a Chinese character on a display or printer, the character is designed as a dot-matrix image of its graphic symbol, with 1 representing a pixel that belongs to the character and 0 representing a blank pixel, so that the glyph can be represented by a binary number. For example, the Chinese character "中" ("middle") can be designed in a 10 × 10 pixel dot matrix as the glyph shown in Fig. 2, and its glyph feature information can then be represented as "0000110000000011000001111111100100110010010011001001001100100111111110000011000000001100000000110000".
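To make the dot-matrix representation concrete, here is a minimal Python sketch that flattens a 10 × 10 glyph into a bit string. The row layout is transcribed from the bit string given in the text; the names `GLYPH_ROWS` and `glyph_to_bits` are hypothetical, and reading the matrix row by row from the top is an assumption about the ordering.

```python
# 10 x 10 dot matrix for "中": '#' is an inked pixel, '.' is a blank pixel.
# Transcribed from the 100-bit string in the text, assuming row-major order.
GLYPH_ROWS = [
    "....##....",
    "....##....",
    ".########.",
    ".#..##..#.",
    ".#..##..#.",
    ".#..##..#.",
    ".########.",
    "....##....",
    "....##....",
    "....##....",
]

def glyph_to_bits(rows):
    """Flatten a dot matrix row by row into a string of '0' and '1'."""
    return "".join("1" if px == "#" else "0" for row in rows for px in row)

glyph_bits = glyph_to_bits(GLYPH_ROWS)
print(len(glyph_bits))   # number of bits in the glyph feature information
print(glyph_bits)
```

Printing `GLYPH_ROWS` line by line shows the shape of the character, which is a convenient way to eyeball a glyph before adding it to a table.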
The method for defining the text conversion table comprises the following steps:
step one, selecting a character encoding to represent the character encoding information;
step two, collecting glyph feature information of the characters commonly used in RDP window titles;
and step three, placing the glyph feature information in one-to-one correspondence with the corresponding character encoding information to generate the text conversion table.
For example, referring to Table 1 below, Unicode is selected to represent the character encoding information, and the binary number "0000110000000011000001111111100100110010010011001001001100100111111110000011000000001100000000110000" representing the glyph feature information of "中" in Fig. 2 is mapped to the Unicode code "\u4e2d" of "中", generating one record in the text conversion table.
Table 1 Schematic of the text conversion table (the table is rendered as an image in the source publication)
It should be noted that RDP window title text has the following characteristics: 1) the text size is relatively fixed; 2) the resolution is low, typically 10 × 10 or 12 × 12 pixels; 3) the number of characters commonly used in titles is limited. The size of the text conversion table is therefore controllable. Furthermore, the text conversion table can be extended at any time: if glyph feature information that is not recorded in the existing table is encountered, so that no character encoding information can be found for it, this can be remedied by adding a record to the table.
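The table definition and its extension can be sketched in Python as follows. This is an assumption-laden illustration rather than the patented implementation: the table is a plain dictionary, the names `conversion_table`, `recognize`, and `add_record` are hypothetical, and the second glyph (all pixels set, mapped to "■") is a made-up placeholder used only to show how a missing record is added.

```python
# Glyph feature information of "中" (10 x 10), written row by row.
ZHONG_10X10 = "".join(
    ["0000110000"] * 2 + ["0111111110"] + ["0100110010"] * 3
    + ["0111111110"] + ["0000110000"] * 3
)

# Hypothetical conversion table: glyph bits -> character (Unicode).
conversion_table = {ZHONG_10X10: "\u4e2d"}

def recognize(glyphs, table, unknown="?"):
    """Look up each glyph of a title; unknown glyphs yield a marker."""
    return "".join(table.get(g, unknown) for g in glyphs)

def add_record(table, glyph_bits, char):
    """Extend the table with a glyph that was not recorded before."""
    table[glyph_bits] = char

SOLID_BLOCK = "1" * 100          # made-up glyph, for illustration only
title = [ZHONG_10X10, SOLID_BLOCK]

before = recognize(title, conversion_table)
add_record(conversion_table, SOLID_BLOCK, "\u25a0")
after = recognize(title, conversion_table)
print(before, after)
```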
Step S102: intercepting and parsing RDP protocol data containing the RDP window title, and extracting the glyph feature information data in the RDP window title.
This step comprises:
step one, intercepting RDP protocol packets from the network;
step two, parsing the RDP communication protocol, and storing the data of the same RDP instruction in one message;
step three, parsing the RDP message and separating out the RDP window title;
and step four, decoding the RDP window title and extracting the glyph feature information data it contains.
Step S103: inputting the glyph feature information data in the RDP window title into the text conversion table, and obtaining by table lookup the character encoding information data corresponding to each glyph of the window title.
Step S104: storing the character encoding information data of the RDP window title in a database.
Compared with the prior art, the RDP window title text recognition method provided by this embodiment recognizes the title text by querying the text conversion table, which improves the speed of text recognition and ensures its accuracy.
Further, once the title text of the RDP window has been recognized, the character encoding information data of the RDP window title is stored in the database as a record of the audit log, the record indicating that a window was opened. Combined with other contents of the RDP audit log, for example the user account, the RDP connection time, the closing time, the time at which the operation occurred, and the client address, the record can show that a user opened an application and operated in it at a specific time and in a specific scenario. The auditor can design business rules for auditing according to the business scenario and requirements. For example, the auditor can predefine a keyword to be audited, such as "registry editor", and define an alarm rule, such as "send an SMS alert"; when the character encoding information data of an RDP window title is stored in the database, it is compared with the keyword "registry editor", and if it matches, an SMS alert is triggered.
The invention therefore improves the speed of text recognition and ensures its accuracy, which in turn improves the efficiency and accuracy of RDP auditing.
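The store-then-compare rule described above might look like the following Python sketch. Everything specific here is an assumption made for illustration: the `audit_log` schema, the `store_title` helper, the sample user names, the case-insensitive substring match, and the use of `print` in place of a real SMS gateway. It uses the standard `sqlite3` module with an in-memory database.

```python
import sqlite3

# Hypothetical audit keywords defined in advance by the auditor.
ALERT_KEYWORDS = ["registry editor"]

def store_title(conn, user, title):
    """Save a recognized window title, then return the keywords it matches."""
    conn.execute(
        "INSERT INTO audit_log (user, title) VALUES (?, ?)", (user, title)
    )
    conn.commit()
    return [kw for kw in ALERT_KEYWORDS if kw in title.lower()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (user TEXT, title TEXT)")

hits = store_title(conn, "alice", "Registry Editor")
if hits:
    # Stand-in for the alarm rule; a real system would call an SMS service.
    print(f"ALERT for alice: matched {hits}")
```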
Referring to fig. 3, fig. 3 is a schematic flow chart of another RDP window title text recognition method according to a second embodiment of the present application, where the method includes:
Step S201: representing the glyph feature information by binary numbers, comprising:
step one, collecting glyph feature information of the characters commonly used in RDP window titles;
step two, representing the glyph feature information by binary numbers. Because the default RDP window title text resolution differs between server-side operating systems but is relatively fixed, commonly 10 × 10 or 12 × 12 pixels, the glyph feature information can be represented by binary numbers of 100 bits or 144 bits, respectively. For example, the Chinese character "中" can be designed in a 10 × 10 pixel dot matrix as the glyph of Fig. 2, and its glyph feature information can then be represented by the 100-bit binary number "0000110000000011000001111111100100110010010011001001001100100111111110000011000000001100000000110000". The Chinese character "中" can also be designed in a 12 × 12 pixel dot matrix as the glyph of Fig. 4, and its glyph feature information can then be represented by the 144-bit binary number "000001100000000001100000000001100000011111111110010001100010010001100010010001100010011111111110000001100000000001100000000001100000000001100000".
Step S202: converting each item of glyph feature information into a unique unified query code, wherein the unified query code is a binary number with a constant number of bits C; the conversion rule for the unified query code is specifically: C is set to the maximum number of bits among the binary numbers representing glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code is equal to that binary number; otherwise, the unified query code is obtained by converting that binary number into a binary number of C bits through a bit-padding algorithm.
The purpose of this step is to give all binary numbers representing glyph feature information the same number of bits, so that characters of different resolutions can be queried with a single text conversion table. Bit-padding algorithms include head padding, tail padding, row padding, and the like.
Still taking Fig. 2 and Fig. 4 as examples, let C be 144. If the head-padding algorithm is adopted, the unified query code of the Chinese character "中" in Fig. 2 is extended from 100 bits to 144 bits and becomes "000000000000000000000000000000000000000000000000110000000011000001111111100100110010010011001001001100100111111110000011000000001100000000110000". The 144-bit unified query code of the Chinese character "中" in Fig. 4 is "000001100000000001100000000001100000011111111110010001100010010001100010010001100010011111111110000001100000000001100000000001100000000001100000".
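Head padding as used in this example can be sketched in a few lines of Python. The function name `to_unified_query_code` is hypothetical, and padding with zero bits is an assumption drawn from the worked example; tail padding and row padding are not shown.

```python
C = 144  # number of bits of the unified query code in this example

def to_unified_query_code(glyph_bits, width=C):
    """Head padding: prepend '0' bits until the code is `width` bits long."""
    if len(glyph_bits) > width:
        raise ValueError("glyph has more bits than the unified query code")
    return glyph_bits.rjust(width, "0")

# 10 x 10 glyph of "中" (100 bits), written row by row.
ZHONG_100 = "".join(
    ["0000110000"] * 2 + ["0111111110"] + ["0100110010"] * 3
    + ["0111111110"] + ["0000110000"] * 3
)

code = to_unified_query_code(ZHONG_100)
print(len(ZHONG_100), "->", len(code))
print(code)
```

A glyph that already has C bits passes through unchanged, which matches the first branch of the conversion rule.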
Step S203: associating the glyph feature information and the unified query codes one-to-one with the corresponding character encoding information to generate a character conversion table.
The character conversion table is generated as follows:
step one, selecting a character encoding to represent the character encoding information;
step two, associating each piece of glyph feature information and its unified query code one-to-one with the corresponding character encoding information to generate the character conversion table.
For example, referring to table 2 below, Unicode is selected to represent the character encoding information. The binary numbers representing the glyph feature information of "中" in fig. 2 and fig. 4, together with their unified query codes, are associated with "\u4e2d", the Unicode code of "中", which generates two records in the character conversion table.
[Table 2 is reproduced as an image in the original publication and is not available here.]
Table 2: schematic of another character conversion table
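The two records described for the example character can be sketched as plain dictionaries. The field names `glyph_bits`, `query_code` and `encoding`, the helper `make_record`, and the use of head padding with zeros are assumptions chosen for illustration; the pixel rows are the example bit strings of fig. 2 and fig. 4 cut into rows.

```python
C = 144  # length of the unified query code in the example of the text

# Pixel rows of the 10 x 10 and 12 x 12 example glyphs (row-major order assumed).
ROWS_10 = (["0000110000"] * 2 + ["0111111110"] + ["0100110010"] * 3
           + ["0111111110"] + ["0000110000"] * 3)
ROWS_12 = (["000001100000"] * 3 + ["011111111110"] + ["010001100010"] * 3
           + ["011111111110"] + ["000001100000"] * 4)


def make_record(rows, char):
    """Build one conversion-table record for a glyph bitmap of `char`."""
    glyph_bits = "".join(rows)
    return {
        "glyph_bits": glyph_bits,
        "query_code": glyph_bits.rjust(C, "0"),  # head padding with zeros
        "encoding": f"\\u{ord(char):04x}",       # textual Unicode escape
    }


table = [make_record(ROWS_10, "中"), make_record(ROWS_12, "中")]
for record in table:
    print(len(record["glyph_bits"]), len(record["query_code"]), record["encoding"])
```

Both records carry the same character encoding but different unified query codes, which is what allows one table to serve both resolutions.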
Step S204: storing the character conversion table in a binary tree data structure, wherein the unified query codes of the character conversion table are stored in the nodes of the binary tree and the character encoding information of the character conversion table is stored in the leaf nodes of the binary tree.
Please refer to fig. 5, which is a schematic diagram of a character conversion table stored in a binary tree structure. If Unicode is selected to represent the character encoding information, the leaves of the binary tree hold Unicode codes, and the nodes along the path from the root to a leaf spell out the corresponding unified query code. It should be noted that the depth of the binary tree equals the number of bits of the unified query code; fig. 5 shows fewer levels than the actual structure for convenience of illustration.
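One way to read this storage scheme is as a binary trie in which each bit of the unified query code selects the left or right child and the node reached by the last bit carries the character encoding. The sketch below follows that reading; the class `Node` and the functions `insert` and `depth` are hypothetical names, and 8-bit toy codes stand in for the 144-bit codes of the text.

```python
class Node:
    """One node of the binary tree; a leaf additionally carries an encoding."""

    def __init__(self):
        self.children = {}    # maps "0" / "1" to the child node
        self.encoding = None  # set only on the node that ends a query code


def insert(root, query_code, encoding):
    """Create the path spelled by query_code and store encoding at its end."""
    node = root
    for bit in query_code:
        node = node.children.setdefault(bit, Node())
    if node.encoding not in (None, encoding):
        raise ValueError("query code already maps to another character")
    node.encoding = encoding


def depth(node):
    """Number of edges on the longest path from node down to a leaf."""
    return 1 + max(map(depth, node.children.values())) if node.children else 0


root = Node()
insert(root, "00001111", "\\u4e2d")  # toy 8-bit codes, not real glyph data
insert(root, "00110011", "\\u6587")
print(depth(root))  # 8
```

Codes that share a prefix share the corresponding nodes, so the tree grows with the number of distinct prefixes rather than with the total number of bits stored.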
Step S205: intercepting and parsing RDP protocol data containing the RDP window title, and extracting the glyph feature information data of the RDP window title.
For the method of intercepting and parsing RDP protocol data containing the RDP window title and extracting the glyph feature information data of the RDP window title, please refer to step S102 of the embodiments of the present application.
Step S206: representing the glyph feature information data of the RDP window title by binary numbers.
Step S207: converting the binary numbers representing the glyph feature information data of the RDP window title into unified query code data with the bit-padding algorithm, according to the conversion rule for the unified query code.
It should be noted that this step must use the same bit-padding algorithm as step S202.
Step S208: inputting the unified query code data into the character conversion table, which is stored in a binary tree data structure; matching each bit of the unified query code data in sequence against the nodes of the binary tree, starting from the root, to form a path; and finding the leaf node at the end of the path to obtain the character encoding information data of the RDP window title.
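Steps S206 to S208 can be sketched as a bit-by-bit walk from the root of the tree. The example below is a simplified illustration: the tree is built from nested dictionaries, the names `build_tree`, `lookup` and `recognize_title` are hypothetical, head padding with zeros is assumed, unknown glyphs are rendered as "?" by choice of this sketch, and toy 8-bit codes replace real glyph data.

```python
def build_tree(table):
    """Store {query_code: character} in a nested-dict binary tree."""
    root = {}
    for query_code, char in table.items():
        node = root
        for bit in query_code:
            node = node.setdefault(bit, {})
        node["leaf"] = char
    return root


def lookup(root, query_code):
    """Follow one branch per bit; return the leaf value or None."""
    node = root
    for bit in query_code:
        node = node.get(bit)
        if node is None:  # no stored code continues with this bit
            return None
    return node.get("leaf")


def recognize_title(root, glyph_bit_strings, c=8):
    """Pad each glyph to c bits (head padding) and look it up."""
    chars = []
    for bits in glyph_bit_strings:
        char = lookup(root, bits.rjust(c, "0"))
        chars.append(char if char is not None else "?")
    return "".join(chars)


tree = build_tree({"00001111": "A", "10100101": "B"})
title = recognize_title(tree, ["1111", "10100101", "11111111"])
print(title)  # AB?
```

A lookup touches at most C nodes regardless of how many records the table holds, and it can stop early as soon as a bit has no matching child.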
Step S209: storing the character encoding information data of the RDP window title in a database.
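The application does not name a database or a schema for this step. Purely as an illustration, the sketch below stores a recognized title in an in-memory SQLite database; the table name `rdp_window_titles`, its columns and the function `save_title` are all assumptions.

```python
import sqlite3


def save_title(conn, session_id, title):
    """Insert one recognized window title (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rdp_window_titles ("
        "id INTEGER PRIMARY KEY, session_id TEXT NOT NULL, title TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO rdp_window_titles (session_id, title) VALUES (?, ?)",
        (session_id, title),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway database for the example
save_title(conn, "session-1", "中")
rows = conn.execute("SELECT session_id, title FROM rdp_window_titles").fetchall()
print(rows)
```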
Compared with the prior art, this embodiment stores the character conversion table in a binary tree data structure, which further increases the speed of querying the character conversion table.
Please refer to fig. 6, which is a schematic flow chart of a method for recognizing RDP window title characters according to a third embodiment of the present application. The method includes:
Step S301: representing the glyph feature information by binary numbers.
For the method of representing the glyph feature information by binary numbers, please refer to step S201 of the embodiments of the present application.
Step S302: associating the glyph feature information with the corresponding character encoding information to generate character conversion records.
This step comprises the following sub-steps:
step one, selecting a character encoding to represent the character encoding information;
step two, associating each piece of glyph feature information one-to-one with the corresponding character encoding information to generate a character conversion record.
For example, referring to table 1, Unicode is selected to represent the character encoding information, and the binary number "0000110000000011000001111111100100110010010011001001001100100111111110000011000000001100000000110000" representing the glyph feature information of "中" in fig. 2 is associated with "\u4e2d", the Unicode code of "中", which generates one character conversion record.
Step S303: forming a character conversion table from a plurality of character conversion records whose binary numbers representing the glyph feature information all have the same number of bits; character conversion records whose binary numbers have different numbers of bits form different character conversion tables; and the plurality of character conversion tables form a character conversion library.
Because the default resolution of RDP window title text differs between server-side operating systems, the binary numbers representing the glyph feature information of the RDP window title text have different numbers of bits. For example, glyph feature information with a resolution of 10 × 10 pixels can be represented by a 100-bit binary number, and glyph feature information with a resolution of 12 × 12 pixels can be represented by a 144-bit binary number. The character conversion records are divided into tables according to the number of bits of their binary numbers: records with the same number of bits go into the same character conversion table, and the resulting character conversion tables form the character conversion library. In this example, the character conversion library consists of two character conversion tables, one for 100-bit binary numbers and one for 144-bit binary numbers.
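The partitioning of records into tables by bit count can be sketched with a dictionary keyed by the number of bits. The function name `build_library` and the record layout (a pair of bit string and encoding) are assumptions; the third record is a made-up fully lit bitmap used only to give the 100-bit table a second entry.

```python
from collections import defaultdict


def build_library(records):
    """Group (glyph_bits, encoding) records into one table per bit count."""
    library = defaultdict(dict)
    for glyph_bits, encoding in records:
        library[len(glyph_bits)][glyph_bits] = encoding
    return dict(library)


# Example glyphs of the text, written as pixel rows (row-major order assumed).
bits_10 = "".join(["0000110000"] * 2 + ["0111111110"] + ["0100110010"] * 3
                  + ["0111111110"] + ["0000110000"] * 3)
bits_12 = "".join(["000001100000"] * 3 + ["011111111110"] + ["010001100010"] * 3
                  + ["011111111110"] + ["000001100000"] * 4)

records = [
    (bits_10, "\\u4e2d"),
    (bits_12, "\\u4e2d"),
    ("1" * 100, "\\u2588"),  # made-up record: fully lit 10 x 10 bitmap
]
library = build_library(records)
print(sorted(library))    # [100, 144]
print(len(library[100]))  # 2
```

Because no padding is involved, the bit strings in each table are exactly the glyph feature information as extracted, at the cost of maintaining one table per resolution.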
Step S304: storing each character conversion table in a binary tree data structure, wherein the binary numbers representing the glyph feature information of the character conversion table are stored in the nodes of the binary tree and the character encoding information of the character conversion table is stored in the leaf nodes of the binary tree.
Please refer to fig. 5, which is a schematic diagram of a character conversion table stored in a binary tree structure. If Unicode is selected to represent the character encoding information, the leaves of the binary tree hold Unicode codes, and the nodes along the path from the root to a leaf spell out the corresponding binary number representing the glyph feature information. It should be noted that the depth of the binary tree equals the number of bits of that binary number; fig. 5 shows fewer levels than the actual structure for convenience of illustration.
Step S305: intercepting and parsing RDP protocol data containing the RDP window title, and extracting the glyph feature information data of the RDP window title.
For the method of intercepting and parsing RDP protocol data containing the RDP window title and extracting the glyph feature information data of the RDP window title, please refer to step S102 of the embodiments of the present application.
Step S306: representing the glyph feature information data of the RDP window title by binary numbers.
Step S307: selecting the corresponding character conversion table in the character conversion library according to the number of bits of the binary number representing the glyph feature information of the RDP window title.
Step S308: inputting the binary number representing the glyph feature information of the RDP window title into the selected character conversion table, which is stored in a binary tree data structure; matching each bit of that binary number in sequence against the nodes of the binary tree, starting from the root, to form a path; and finding the leaf node at the end of the path to obtain the character encoding information data of the RDP window title.
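Steps S307 and S308 amount to choosing a tree by the length of the incoming bit string and then walking that tree. The sketch below illustrates this under simplifying assumptions: `build_tree` and `recognize` are hypothetical names, trees are nested dictionaries, and 4-bit and 6-bit toy codes stand in for the 100-bit and 144-bit glyph data.

```python
def build_tree(table):
    """Store {glyph_bits: character} in a nested-dict binary tree."""
    root = {}
    for glyph_bits, char in table.items():
        node = root
        for bit in glyph_bits:
            node = node.setdefault(bit, {})
        node["leaf"] = char
    return root


def recognize(trees, glyph_bits):
    """S307: select the tree by bit count. S308: walk it bit by bit."""
    node = trees.get(len(glyph_bits))
    for bit in glyph_bits:
        if node is None:  # no table for this length, or path not stored
            return None
        node = node.get(bit)
    return None if node is None else node.get("leaf")


library = {
    4: {"0110": "A"},                 # toy table for 4-bit glyphs
    6: {"011110": "A", "100001": "B"},  # toy table for 6-bit glyphs
}
trees = {bit_count: build_tree(table) for bit_count, table in library.items()}

print(recognize(trees, "0110"))    # A
print(recognize(trees, "100001"))  # B
print(recognize(trees, "01"))      # None
```

Selecting the table first means a glyph is only ever compared against glyphs of its own resolution, so no padding step is needed at query time.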
Step S309: storing the character encoding information data of the RDP window title in a database.
Compared with the prior art, this embodiment stores each character conversion table in a binary tree data structure, which further increases the speed of querying the character conversion table.
The fourth embodiment of the invention discloses a device for recognizing RDP window title characters, whose structure is shown schematically in fig. 7 and which comprises:
a conversion table definition unit M1, configured to define a character conversion table containing glyph feature information and character encoding information;
a data extraction unit M2, configured to intercept and parse RDP protocol data containing an RDP window title and to extract the glyph feature information data of the RDP window title;
a character recognition unit M3, configured to input the glyph feature information data of the RDP window title obtained from the data extraction unit M2 into the character conversion table obtained from the conversion table definition unit M1, to obtain the corresponding character encoding information data by looking up the character conversion table, and to store that data in the database DB.
The fifth embodiment of the present invention further discloses a device for recognizing RDP window title characters, whose structure is shown schematically in fig. 8.
The conversion table definition unit M1 further includes:
a glyph feature information digitization module M11, configured to represent the glyph feature information by binary numbers;
a unified query code generation module M12, configured to convert each piece of glyph feature information obtained from the glyph feature information digitization module M11 into a unique unified query code, wherein every unified query code is a binary number with a constant number of bits C; the conversion rule is as follows: C is set to the largest number of bits among the binary numbers representing the glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code equals that binary number; otherwise, the unified query code is obtained by converting that binary number into a C-bit binary number through a bit-padding algorithm;
a character conversion table generation module M13, configured to associate the glyph feature information obtained from the glyph feature information digitization module M11 and the unified query codes obtained from the unified query code generation module M12 one-to-one with the corresponding character encoding information to generate a character conversion table;
a character conversion table storage module M14, configured to store the character conversion table obtained from the character conversion table generation module M13 in a binary tree data structure, wherein the unified query codes of the character conversion table are stored as nodes of the binary tree and the character encoding information of the character conversion table is stored as leaf nodes of the binary tree.
The character recognition unit M3 further includes:
a glyph feature information data binarization module M31, configured to represent the glyph feature information data of the RDP window title obtained from the data extraction unit M2 by binary numbers;
a unified query code data conversion module M32, configured to convert the binary numbers representing the glyph feature information data of the RDP window title, obtained from the glyph feature information data binarization module M31, into unified query code data with the bit-padding algorithm, according to the conversion rule for the unified query code;
a first query module M33, configured to input the unified query code data obtained from the unified query code data conversion module M32 into the character conversion table obtained from the character conversion table storage module M14, which is stored in a binary tree data structure; to match each bit of the unified query code data in sequence against the nodes of the binary tree, starting from the root, to form a path; to find the leaf node at the end of the path and obtain the character encoding information data of the RDP window title; and to store that data in the database DB.
The sixth embodiment of the present invention further discloses another device for recognizing RDP window title characters, whose structure is shown schematically in fig. 9.
The conversion table definition unit M1 further includes:
a glyph feature information digitization module M11, configured to represent the glyph feature information by binary numbers;
a character conversion library generation module M15, configured to associate the glyph feature information obtained from the glyph feature information digitization module M11 with the corresponding character encoding information to generate character conversion records; to form a character conversion table from a plurality of character conversion records whose binary numbers representing the glyph feature information all have the same number of bits, character conversion records whose binary numbers have different numbers of bits forming different character conversion tables; and to form a character conversion library from the plurality of character conversion tables;
a character conversion library storage module M16, configured to store each character conversion table obtained from the character conversion library generation module M15 in its own binary tree data structure, wherein the binary numbers representing the glyph feature information of the character conversion table are stored as nodes of the binary tree and the character encoding information of the character conversion table is stored as leaf nodes of the binary tree.
The character recognition unit M3 further includes:
a glyph feature information data binarization module M31, configured to represent the glyph feature information data of the RDP window title obtained from the data extraction unit M2 by binary numbers;
a matching module M34, configured to select the corresponding character conversion table in the character conversion library storage module M16 according to the number of bits of the binary number representing the glyph feature information of the RDP window title, obtained from the glyph feature information data binarization module M31, and to send the characteristic attribute information of that character conversion table to a second query module M35;
a second query module M35, configured to input the binary number representing the glyph feature information of the RDP window title, obtained from the glyph feature information data binarization module M31, into the selected character conversion table obtained from the character conversion library storage module M16, which is stored in a binary tree data structure; to match each bit of that binary number in sequence against the nodes of the binary tree, starting from the root, to form a path; to find the leaf node at the end of the path and obtain the character encoding information data of the RDP window title; and to store that data in the database DB.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, units and modules described above may be understood by referring to the corresponding method steps, and are not described again here.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for recognizing RDP window title characters, the method comprising:
defining a character conversion table, wherein the character conversion table comprises glyph feature information and corresponding character encoding information;
intercepting and parsing RDP protocol data containing an RDP window title, and extracting glyph feature information data of the RDP window title;
inputting the glyph feature information data of the RDP window title into the character conversion table, and obtaining the corresponding character encoding information data by table lookup;
and storing the character encoding information data of the RDP window title in a database.
2. The method of claim 1, wherein defining the character conversion table comprises:
representing the glyph feature information by binary numbers;
converting each piece of glyph feature information into a unique unified query code, wherein every unified query code is a binary number with a constant number of bits C; the conversion rule is as follows: C is set to the largest number of bits among the binary numbers representing the glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code equals that binary number; otherwise, the unified query code is obtained by converting that binary number into a C-bit binary number through a bit-padding algorithm;
associating the glyph feature information and the unified query codes one-to-one with the corresponding character encoding information to generate the character conversion table;
and storing the character conversion table in a binary tree data structure, wherein the unified query codes of the character conversion table are stored in the nodes of the binary tree and the character encoding information of the character conversion table is stored in the leaf nodes of the binary tree.
3. The method of claim 2, wherein inputting the glyph feature information data of the RDP window title into the character conversion table and obtaining the corresponding character encoding information data by table lookup comprises:
representing the glyph feature information data of the RDP window title by binary numbers;
converting the binary numbers representing the glyph feature information data of the RDP window title into unified query code data with the bit-padding algorithm, according to the conversion rule for the unified query code;
inputting the unified query code data into the character conversion table, which is stored in a binary tree data structure; matching each bit of the unified query code data in sequence against the nodes of the binary tree, starting from the root, to form a path; and finding the leaf node at the end of the path to obtain the character encoding information data of the RDP window title.
4. The method of claim 1, wherein defining the character conversion table further comprises:
representing the glyph feature information by binary numbers;
associating the glyph feature information with the corresponding character encoding information to generate character conversion records;
forming a character conversion table from a plurality of character conversion records whose binary numbers representing the glyph feature information all have the same number of bits, wherein character conversion records whose binary numbers have different numbers of bits form different character conversion tables, and the plurality of character conversion tables form a character conversion library;
and storing each character conversion table in a binary tree data structure, wherein the binary numbers representing the glyph feature information of the character conversion table are stored in the nodes of the binary tree and the character encoding information of the character conversion table is stored in the leaf nodes of the binary tree.
5. The method of claim 4, wherein inputting the glyph feature information data of the RDP window title into the character conversion table and obtaining the corresponding character encoding information data of the RDP window title by table lookup further comprises:
representing the glyph feature information data of the RDP window title by binary numbers;
selecting the corresponding character conversion table in the character conversion library according to the number of bits of the binary number representing the glyph feature information of the RDP window title;
inputting the binary number representing the glyph feature information of the RDP window title into the selected character conversion table, which is stored in a binary tree data structure; matching each bit of that binary number in sequence against the nodes of the binary tree, starting from the root, to form a path; and finding the leaf node at the end of the path to obtain the character encoding information data of the RDP window title.
6. An apparatus for recognizing RDP window title characters, comprising:
a conversion table definition unit, configured to define a character conversion table, wherein the character conversion table comprises glyph feature information and character encoding information;
a data extraction unit, configured to intercept and parse RDP protocol data containing an RDP window title and to extract glyph feature information data of the RDP window title;
and a character recognition unit, configured to input the glyph feature information data of the RDP window title obtained from the data extraction unit into the character conversion table obtained from the conversion table definition unit, to obtain the corresponding character encoding information data by looking up the character conversion table, and to store that data in a database.
7. The apparatus according to claim 6, wherein the conversion table definition unit comprises:
a glyph feature information digitization module, configured to represent the glyph feature information by binary numbers;
a unified query code generation module, configured to convert each piece of glyph feature information obtained from the glyph feature information digitization module into a unique unified query code, wherein every unified query code is a binary number with a constant number of bits C; the conversion rule is as follows: C is set to the largest number of bits among the binary numbers representing the glyph feature information; if the binary number representing the glyph feature information to be converted has exactly C bits, the unified query code equals that binary number; otherwise, the unified query code is obtained by converting that binary number into a C-bit binary number through a bit-padding algorithm;
a character conversion table generation module, configured to associate the glyph feature information obtained from the glyph feature information digitization module and the unified query codes obtained from the unified query code generation module with the corresponding character encoding information to generate the character conversion table;
and a character conversion table storage module, configured to store the character conversion table obtained from the character conversion table generation module in a binary tree data structure, wherein the unified query codes of the character conversion table are stored as nodes of the binary tree and the character encoding information of the character conversion table is stored as leaf nodes of the binary tree.
8. The apparatus of claim 6, wherein the character recognition unit comprises:
a glyph feature information data binarization module, configured to represent the glyph feature information data of the RDP window title obtained from the data extraction unit by binary numbers;
a unified query code data conversion module, configured to convert the binary numbers representing the glyph feature information data of the RDP window title, obtained from the glyph feature information data binarization module, into unified query code data with the bit-padding algorithm, according to the conversion rule for the unified query code;
a first query module, configured to input the unified query code data obtained from the unified query code data conversion module into the character conversion table obtained from the character conversion table storage module, which is stored in a binary tree data structure; to match each bit of the unified query code data in sequence against the nodes of the binary tree, starting from the root, to form a path; to find the leaf node at the end of the path and obtain the character encoding information data of the RDP window title; and to store that data in the database.
9. The apparatus according to claim 6, wherein the conversion table definition unit further comprises:
a glyph feature information digitization module, configured to represent the glyph feature information by binary numbers;
a character conversion library generation module, configured to associate the glyph feature information obtained from the glyph feature information digitization module with the corresponding character encoding information to generate character conversion records; to form a character conversion table from a plurality of character conversion records whose binary numbers representing the glyph feature information all have the same number of bits, character conversion records whose binary numbers have different numbers of bits forming different character conversion tables; and to form a character conversion library from the plurality of character conversion tables;
and a character conversion library storage module, configured to store each character conversion table obtained from the character conversion library generation module in a binary tree data structure, wherein the binary numbers representing the glyph feature information of the character conversion table are stored as nodes of the binary tree and the character encoding information of the character conversion table is stored as leaf nodes of the binary tree.
10. The apparatus of claim 6, wherein the character recognition unit further comprises:
a glyph feature information data binarization module, configured to represent the glyph feature information data of the RDP window title obtained from the data extraction unit by binary numbers;
a matching module, configured to select the corresponding character conversion table in the character conversion library obtained from the character conversion library storage module according to the number of bits of the binary number representing the glyph feature information of the RDP window title, obtained from the glyph feature information data binarization module, and to send the characteristic attribute information of that character conversion table to a second query module;
a second query module, configured to input the binary number representing the glyph feature information of the RDP window title, obtained from the glyph feature information data binarization module, into the selected character conversion table obtained from the character conversion library storage module, which is stored in a binary tree data structure; to match each bit of that binary number in sequence against the nodes of the binary tree, starting from the root, to form a path; to find the leaf node at the end of the path and obtain the character encoding information data of the RDP window title; and to store that data in the database.
CN201910379750.7A 2019-05-08 2019-05-08 RDP window title character recognition method and device Pending CN111914513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910379750.7A CN111914513A (en) 2019-05-08 2019-05-08 RDP window title character recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910379750.7A CN111914513A (en) 2019-05-08 2019-05-08 RDP window title character recognition method and device

Publications (1)

Publication Number Publication Date
CN111914513A true CN111914513A (en) 2020-11-10

Family

ID=73242021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910379750.7A Pending CN111914513A (en) 2019-05-08 2019-05-08 RDP window title character recognition method and device

Country Status (1)

Country Link
CN (1) CN111914513A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116524A1 (en) * 2020-12-04 2022-06-09 北京搜狗科技发展有限公司 Picture recognition method and apparatus, electronic device, and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080154978A1 (en) * 2006-12-22 2008-06-26 Lemar Eric M Systems and methods of directory entry encodings
CN101246550A (en) * 2008-03-11 2008-08-20 深圳华为通信技术有限公司 Image character recognition method and device
CN101963954A (en) * 2009-07-24 2011-02-02 康佳集团股份有限公司 Method and device for displaying words
CN102662926A (en) * 2012-03-29 2012-09-12 常州华文文字技术有限公司 Storage and access methods for word stock
CN106599940A (en) * 2016-11-25 2017-04-26 东软集团股份有限公司 Picture character identification method and apparatus thereof
CN107403108A (en) * 2017-08-07 2017-11-28 上海上讯信息技术股份有限公司 A kind of method and system of data processing
CN109409370A (en) * 2017-08-18 2019-03-01 深圳市傲冠软件股份有限公司 A kind of remote desktop character identifying method and device

Similar Documents

Publication Publication Date Title
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US8542235B2 (en) System and method for displaying complex scripts with a cloud computing architecture
US8712977B2 (en) Computer product, information retrieval method, and information retrieval apparatus
KR970003322B1 (en) Method for interchange code conversion of multi-byte character string characters
US8583743B1 (en) System and method for message gateway consolidation
CN104199812B (en) Data system and method supporting multiple languages
Wang et al. A coverless plain text steganography based on character features
CN110096635B (en) Query visual display method and device for Chinese and western medicine information
JP5788047B2 (en) Encoder for encoding text into matrix code symbols and decoder for decoding matrix code symbols
CN111046135A (en) Unstructured text processing method and device, computer equipment and storage medium
CN111666575B (en) Text carrier-free information hiding method based on word element coding
CN115116082B (en) One-key gear system based on OCR (optical character recognition) algorithm
CN111914513A (en) RDP window title character recognition method and device
CN110516125B (en) Method, device and equipment for identifying abnormal character string and readable storage medium
US8463759B2 (en) Method and system for compressing data
JP4821287B2 (en) Structured document encoding method, encoding apparatus, encoding program, decoding apparatus, and encoded structured document data structure
CN107832341B (en) AGNSS user duplicate removal statistical method
CN114065269B (en) Method for generating and analyzing bindless heterogeneous token and storage medium
CN116303888A (en) Rarely used word processing method and device, storage medium and electronic equipment
CN113806782A (en) Ciphertext judgment method, system and equipment based on transfer matrix
CN110335586B (en) Information conversion method and system
CN112966282B (en) Text carrier-free steganography method and device for component histogram
US20220327278A1 (en) Hypercube encoding of text for natural language processing
CN113434657B (en) E-commerce customer service response method and corresponding device, equipment and medium thereof
CN110706309B (en) Method and device for generating fishbone map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination