CN115424284A - Text similarity recognition method, device, equipment and storage medium - Google Patents

Text similarity recognition method, device, equipment and storage medium

Publication number
CN115424284A
Authority
CN
China
Prior art keywords
image
log
text
data
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210858888.7A
Other languages
Chinese (zh)
Inventor
李波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210858888.7A
Publication of CN115424284A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/1475 Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478 Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18162 Extraction of features or characteristics of the image related to a structural representation of the pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references

Abstract

The invention relates to the technical field of image processing, and discloses a text similarity identification method, a text similarity identification device, text similarity identification equipment and a storage medium. The method comprises the following steps: determining text representation data corresponding to the obtained real-time log data; converting the real-time log data into an encoding character string according to the text representation data and a preset encoding specification; determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification; and acquiring historical log images corresponding to the historical log data, respectively preprocessing the target log image and the historical log image, and calculating the similarity between the target log image and the historical log image according to the acquired image characteristics and the identification character characteristic data. According to the scheme, the text data is subjected to imaging processing to obtain the image data in the picture format, and the similarity between the images is calculated to judge whether the starting process is normal or not, so that the technical problem that the similarity identification efficiency of the log data is low in the prior art is solved.

Description

Text similarity recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a text similarity recognition method, apparatus, device, and storage medium.
Background
During a version update of an application system, the program code package is replaced and the application system is restarted; the startup log of the application system is then obtained as soon as possible, and whether the application system started normally is judged from that log. Conventional log judgment requires a list of keywords that indicate a normal startup, a list of keywords that indicate abnormal errors, and the like. A customized blacklist and whitelist library is then set up according to the characteristics of the application system. For example, a rule recorded in the blacklist library may state that a certain keyword must not appear during the startup of a certain system, or that a certain process must start before another process; otherwise the startup is regarded as failed.
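The conventional rule-based check described above can be sketched as follows. This is a minimal illustration in Python, not taken from the patent: the function name, the rule structures and the sample keywords are all hypothetical, and real rule libraries are typically more elaborate.

```python
from typing import List, Tuple


def check_startup_log(
    lines: List[str],
    required_keywords: List[str],
    blacklist_keywords: List[str],
    ordering_rules: List[Tuple[str, str]],
) -> bool:
    """Return True if the log passes every keyword and ordering rule.

    All rule formats here are assumptions made for illustration.
    """
    text = "\n".join(lines)
    # Every "normal startup" keyword must be present.
    if any(kw not in text for kw in required_keywords):
        return False
    # No blacklisted keyword may appear anywhere in the log.
    if any(kw in text for kw in blacklist_keywords):
        return False
    # For each (first, second) rule, `first` must occur before `second`.
    for first, second in ordering_rules:
        i, j = text.find(first), text.find(second)
        if i == -1 or j == -1 or i > j:
            return False
    return True
```

Every new system or environment needs its own keyword lists and ordering rules in such a scheme, which is the maintenance burden the following paragraph criticizes.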
Both the general keyword recognition rules and the additional blacklist and whitelist rules tied to a specific resource environment require ongoing manual maintenance. This greatly reduces the recognition efficiency of the program, increases the false alarm rate and the missed alarm rate, and cannot cope with sudden garbled characters or unexpected foreign-language output from the system. The industry also has methods that segment the log text, extract and recognize its semantics, and then judge the startup. Because these methods involve many choices (language, common words, word segmentation granularity, truncation step size, vector conversion, comparison algorithm, sample library, and so on) and are tightly coupled to industry attributes, their accuracy and generality are difficult to keep improving. Therefore, how to improve the efficiency of similarity recognition for log text data has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The main purpose of the invention is to convert the current startup log, the previous log and the historical logs from text format into images, calculate the similarity among the resulting pictures, and judge from the similarity whether the startup process is normal, thereby solving the technical problem of low similarity recognition efficiency for log text data in the prior art.
The invention provides a text similarity recognition method in a first aspect, which comprises the following steps: acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data; converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification; determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification; acquiring historical log images corresponding to historical log text data, and respectively preprocessing the target log images and the historical log images to obtain image characteristics and identification character characteristic data of each image; and calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
Optionally, in a first implementation manner of the first aspect of the present invention, the determining multi-level text representation data corresponding to the real-time log text data includes: acquiring real-time log text data to be processed, and extracting features of the real-time log text data based on a preset text encoder to obtain sentence level features and word level features of the real-time log text data; labeling each word in the real-time log text data according to the word level characteristics; and extracting level information corresponding to the real-time log text data based on a preset regular expression, and determining multi-level text representation data corresponding to the real-time log text data according to the level information.
Optionally, in a second implementation manner of the first aspect of the present invention, the converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification includes: converting the text representation data based on a preset Unicode character set to obtain initial code points; determining the number of bytes of the initial code points; and converting the real-time log text data into an encoding character string according to the number of bytes and the preset encoding specification.
Optionally, in a third implementation manner of the first aspect of the present invention, the determining a picture specification according to the encoded character string and a preset rule, and generating a target log image based on the picture specification includes: converting the encoded string into a plurality of RGB color values; determining the picture specification according to a preset rule and the RGB color value; determining a character sequence in the real-time log text data, and arranging the RGB color values according to the character sequence and the picture specification to obtain image parameters; and generating a target log image corresponding to the real-time log text data according to the image parameters.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the obtaining a history log image corresponding to history log text data, and respectively preprocessing the target log image and the history log image to obtain image features and identification character feature data of each image includes: acquiring images to be compared, wherein the images to be compared are historical log images and target log images, and the historical log images are corresponding historical log images converted from historical log text data; respectively carrying out rotation correction detection on the images to be compared to obtain angle-corrected images to be compared; performing feature extraction on the image to be compared after the angle correction to obtain a feature extraction image corresponding to the image to be compared after the angle correction; and carrying out target detection on the image to be compared after the angle correction according to the characteristic extraction diagram to obtain identification position data corresponding to the image to be compared, and carrying out character characteristic extraction on the image to be compared after the angle correction to obtain identification character characteristic data corresponding to the image to be compared.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the calculating an image similarity between the target log image and the historical log image according to the image feature and the identification character feature data includes: according to the identification position data, the image to be compared is cut to obtain a target identification image corresponding to the image to be compared; extracting the features of the target identification image to obtain a feature vector of the target identification image; and calculating the similarity between the historical log image and the target log image according to the feature vector and the identification character feature data to obtain a similarity comparison result between the historical log image and the target log image.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the calculating the similarity between the historical log image and the target log image according to the feature vector and the identification character feature data to obtain a similarity comparison result between the historical log image and the target log image includes: calculating the feature distance between the target identification images according to the feature vectors; judging whether the feature distance is greater than a preset threshold; and if so, determining the similarity between the images to be compared according to the identification character feature data, and obtaining the similarity comparison result according to the value of the similarity.
The second aspect of the present invention provides a text similarity recognition apparatus, including: the determining module is used for acquiring real-time log text data to be processed and determining multi-level text representation data corresponding to the real-time log text data; the conversion module is used for converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification; the generating module is used for determining the picture specification according to the coding character string and a preset rule and generating a target log image based on the picture specification; the preprocessing module is used for acquiring historical log images corresponding to historical log text data, and respectively preprocessing the target log images and the historical log images to obtain image characteristics and identification character characteristic data of each image; and the calculating module is used for calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
Optionally, in a first implementation manner of the second aspect of the present invention, the determining module is specifically configured to: acquiring real-time log text data to be processed, and extracting features of the real-time log text data based on a preset text encoder to obtain sentence level features and word level features of the real-time log text data; labeling each word in the real-time log text data according to the word level characteristics; and extracting level information corresponding to the real-time log text data based on a preset regular expression, and determining multi-level text representation data corresponding to the real-time log text data according to the level information.
Optionally, in a second implementation manner of the second aspect of the present invention, the conversion module includes: the first conversion unit is used for carrying out conversion processing on the text representation data based on a preset Unicode character set to obtain initial code points; a determining unit, configured to determine the number of bytes of the initial code point; and the second conversion unit is used for converting the real-time log text data into an encoding character string according to the byte number and a preset encoding specification.
Optionally, in a third implementation manner of the second aspect of the present invention, the generating module is specifically configured to: converting the encoded string into a plurality of RGB color values; determining the picture specification according to a preset rule and the RGB color value; determining a character sequence in the real-time log text data, and arranging the RGB color values according to the character sequence and the picture specification to obtain image parameters; and generating a target log image corresponding to the real-time log text data according to the image parameters.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the preprocessing module is specifically configured to: acquiring images to be compared, wherein the images to be compared are historical log images and target log images, and the historical log images are corresponding historical log images converted from historical log text data; respectively carrying out rotation correction detection on the images to be compared to obtain angle-corrected images to be compared; performing feature extraction on the image to be compared after the angle correction to obtain a feature extraction image corresponding to the image to be compared after the angle correction; and carrying out target detection on the image to be compared after the angle correction according to the characteristic extraction diagram to obtain identification position data corresponding to the image to be compared, and carrying out character characteristic extraction on the image to be compared after the angle correction to obtain identification character characteristic data corresponding to the image to be compared.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the calculation module is specifically configured to: according to the identification position data, the image to be compared is cut to obtain a target identification image corresponding to the image to be compared; extracting the features of the target identification image to obtain a feature vector of the target identification image; and calculating the similarity between the historical log image and the target log image according to the feature vector and the identification character feature data to obtain a similarity comparison result between the historical log image and the target log image.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the calculation module is further specifically configured to: calculate the feature distance between the target identification images according to the feature vectors; judge whether the feature distance is greater than a preset threshold; and if so, determine the similarity between the images to be compared according to the identification character feature data, and obtain the similarity comparison result according to the value of the similarity.
A third aspect of the present invention provides text similarity recognition equipment, including: a memory and at least one processor, the memory having instructions stored therein, and the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the text similarity recognition equipment to perform the steps of the text similarity recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the text similarity recognition method described above.
In the technical scheme provided by the invention, the text representation data corresponding to the acquired real-time log data is determined; converting the real-time log data into an encoding character string according to the text representation data and a preset encoding specification; determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification; and acquiring historical log images corresponding to the historical log data, respectively preprocessing the target log image and the historical log image, and calculating the similarity between the target log image and the historical log image according to the acquired image characteristics and the identification character characteristic data. According to the scheme, the text data is subjected to imaging processing to obtain the image data in the picture format, and the similarity between the images is calculated to judge whether the starting process is normal or not, so that the technical problem that the similarity identification efficiency of the log data is low in the prior art is solved.
Drawings
FIG. 1 is a diagram of a text similarity recognition method according to a first embodiment of the present invention;
FIG. 2 is a diagram of a text similarity recognition method according to a second embodiment of the present invention;
FIG. 3 is a diagram of a text similarity recognition method according to a third embodiment of the present invention;
FIG. 4 is a diagram of a fourth embodiment of a text similarity recognition method according to the present invention;
FIG. 5 is a diagram of a fifth embodiment of a text similarity recognition method according to the present invention;
FIG. 6 is a diagram of a text similarity recognition apparatus according to a first embodiment of the present invention;
FIG. 7 is a diagram of a text similarity recognition apparatus according to a second embodiment of the present invention;
FIG. 8 is a schematic diagram of an embodiment of a text similarity recognition apparatus provided in the present invention.
Detailed Description
The embodiment of the invention provides a text similarity recognition method, a text similarity recognition device, text similarity recognition equipment and a storage medium, wherein in the technical scheme of the invention, text characterization data corresponding to acquired real-time log data are determined; converting the real-time log data into an encoding character string according to the text representation data and a preset encoding specification; determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification; and acquiring historical log images corresponding to the historical log data, respectively preprocessing the target log image and the historical log image, and calculating the similarity between the target log image and the historical log image according to the acquired image characteristics and the identification character characteristic data. According to the scheme, the text data is subjected to imaging processing to obtain the image data in the picture format, and the similarity between the images is calculated to judge whether the starting process is normal or not, so that the technical problem that the similarity identification efficiency of the log data is low in the prior art is solved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1, a first embodiment of the text similarity recognition method in the embodiments of the present invention includes:
101. Acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data;
In this embodiment, real-time log text data to be processed is acquired, and the multi-level text representation data corresponding to it is determined. Specifically, the server obtains the real-time log text data to be processed, which is the text content of an attachment that is to be uploaded through a web page.
Before acquiring the real-time log text data to be processed, the server receives an analysis signal, which instructs the server to parse the uploaded attachment: the text content of the attachment is parsed through the File API of JavaScript, and the parsing result is used as the real-time log text data to be processed. After receiving the analysis signal, the server also receives an encryption signal, which instructs the server to encrypt the real-time log text data parsed at the front end into a picture and then transmit the picture to the back end.
In this embodiment, a comprehensive semantic representation of the text plays a vital role in the task of converting text into images. The text semantics are represented at multiple levels, namely the sentence level, the aspect level and the word level, and the sentence-level features, aspect-level features and word-level features of the text sentences are extracted accordingly.
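As a rough illustration of a multi-level representation, the sketch below builds a small record for one log line in Python. It is only a stand-in under stated assumptions: the patent obtains sentence-level and word-level features from a preset text encoder, whereas this sketch uses the raw line and a whitespace split; the log layout, the `LEVEL_PATTERN` regular expression and the function names are hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical log layout: "<date> <time> <LEVEL> <message>".
LEVEL_PATTERN = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def extract_level(line: str) -> Optional[str]:
    """Extract the log level with a preset regular expression, if present."""
    match = LEVEL_PATTERN.search(line)
    return match.group(1) if match else None


def represent(line: str) -> Dict[str, object]:
    """Build a toy multi-level record for one log line.

    A real system would replace "sentence" and "words" with encoder
    features; plain strings are used here only to show the structure.
    """
    return {
        "sentence": line.strip(),      # sentence-level stand-in
        "words": line.split(),         # word-level stand-in
        "level": extract_level(line),  # level information from the regex
    }
```

Calling `represent("2022-07-20 10:00:01 ERROR Failed to bind port 8080")` yields a record whose `level` field is `"ERROR"` and whose `words` field holds the eight whitespace-separated tokens.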
102. Converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
In this embodiment, the real-time log text data is converted into an encoding character string according to the text representation data and the preset encoding specification.
Specifically, the server converts the real-time log text data to be processed into an initial encoding character string according to the preset encoding specification, in three steps: the server converts the real-time log text data into initial code points according to a preset Unicode character set, the initial code points being written in hexadecimal; the server determines the number of bytes of each initial code point; and the server converts the initial code points into the initial encoding character string according to the number of bytes and the preset encoding specification, the initial encoding character string being binary.
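The three conversion steps above can be sketched in Python as follows. The sketch assumes UTF-8 as the preset encoding specification, which the patent does not state; the function names are hypothetical.

```python
from typing import List


def to_code_points(text: str) -> List[str]:
    """Step 1: map each character to its Unicode code point in hexadecimal."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf8_byte_count(ch: str) -> int:
    """Step 2: number of bytes the character occupies (UTF-8 assumed)."""
    return len(ch.encode("utf-8"))


def to_binary_string(text: str) -> str:
    """Step 3: encode the text and render every byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))
```

For the letter `A`, the code point is `U+0041`, the byte count is 1 and the binary string is `01000001`; a CJK character such as `日` (`U+65E5`) occupies three bytes in UTF-8.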
In this embodiment, Unicode is an industry standard in the field of computer science that includes a character set, encoding schemes, and the like. Code points in its Basic Multilingual Plane are integers between 0 and 65535. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform and unique code to every character in every language, so as to meet the requirements of cross-language and cross-platform text conversion and processing. Because a computer can only process numbers, the real-time log text data to be processed must first be converted into numbers. Each character in the real-time log text data corresponds to one encoded character.
Unicode code points currently range from 0x0000 to 0x10FFFF and are arranged in 17 groups, each called a plane. Each plane has 65536 code points, for 1114112 in total, but only a few planes are currently in use. UTF-8, UTF-16 and UTF-32 are all encoding schemes that convert code points into byte data that a program can store and transmit.
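The plane arithmetic and the difference between the three encoding schemes can be checked with a short Python sketch. The helper names are hypothetical; the big-endian codec variants are used so that no byte order mark is counted.

```python
from typing import Dict

PLANE_SIZE = 0x10000  # 65536 code points per plane
PLANE_COUNT = 17      # planes 0 through 16


def plane_of(ch: str) -> int:
    """Return the index of the plane that contains the character."""
    return ord(ch) >> 16


def encoded_lengths(ch: str) -> Dict[str, int]:
    """Byte length of one character under UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }
```

The product `PLANE_SIZE * PLANE_COUNT` equals 1114112, which is `0x10FFFF + 1`. The same character can take a different number of bytes under each scheme: `A` needs 1, 2 and 4 bytes respectively, while a character outside plane 0 needs 4 bytes in all three.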
103. Determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification;
In this embodiment, the picture specification is determined according to the encoding character string and the preset rule, and the target log image is generated based on the picture specification. Specifically, the server converts the initial encoding character string into a plurality of RGB color values according to a first preset rule: the server converts the initial encoding character string into an RGB numeric string through JavaScript, the RGB numeric string being decimal; the server takes every three consecutive values in the RGB numeric string as the pixel value of one RGB color value according to the first preset rule; and the server determines the plurality of RGB color values from the pixel values.
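The grouping of three consecutive values into one pixel can be sketched as below. The patent performs this step in JavaScript; Python is used here only for illustration. Treating each encoded byte as one decimal value in the range 0 to 255 and padding the final pixel with zeros are assumptions, as are the names `Pixel` and `bytes_to_pixels`.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def bytes_to_pixels(data: bytes, pad: int = 0) -> List[Pixel]:
    """Group consecutive byte values into (R, G, B) triples.

    The padding value for an incomplete final triple is an assumption;
    the patent does not say how a remainder is handled.
    """
    values = list(data)
    # Extend the list so that its length is a multiple of three.
    values += [pad] * (-len(values) % 3)
    return [
        (values[i], values[i + 1], values[i + 2])
        for i in range(0, len(values), 3)
    ]
```

For example, the four bytes of `b"INFO"` (73, 78, 70, 79) become the two pixels `(73, 78, 70)` and `(79, 0, 0)`.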
Further, a picture specification is determined according to a second preset rule and the plurality of RGB color values, and the target log image is generated based on the picture specification. The picture specification indicates the number of rows and columns of RGB color values, and the second preset rule specifies the number of RGB color values contained in one row. Specifically, the server determines the picture specification according to the second preset rule; the server determines the order of the characters in the target text to obtain a first sequence; the server arranges the RGB color values in order according to the first sequence and the picture specification to obtain picture parameters, which include the number of rows and the number of columns of RGB color values; and the server generates the target log image according to the picture parameters.
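A minimal Python sketch of the layout step follows. It assumes the second preset rule is simply a fixed number of pixels per row, fills the last row with black pixels, and writes the result as a plain-text PPM (P3) image so that no imaging library is needed; the fill color, the output format and the function names are all assumptions rather than details from the patent.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]


def layout_pixels(
    pixels: List[Pixel], per_row: int, fill: Pixel = (0, 0, 0)
) -> Dict[str, object]:
    """Arrange pixels row by row, keeping their original order."""
    if per_row <= 0:
        raise ValueError("per_row must be positive")
    rows = [pixels[i:i + per_row] for i in range(0, len(pixels), per_row)]
    # Fill the last row so that every row has the same number of columns.
    if rows and len(rows[-1]) < per_row:
        rows[-1] = rows[-1] + [fill] * (per_row - len(rows[-1]))
    return {"rows": len(rows), "cols": per_row, "grid": rows}


def to_ppm(params: Dict[str, object]) -> str:
    """Serialize the picture parameters as a plain-text PPM (P3) image."""
    header = f"P3\n{params['cols']} {params['rows']}\n255\n"
    body = "\n".join(
        " ".join(str(channel) for pixel in row for channel in pixel)
        for row in params["grid"]
    )
    return header + body + "\n"
```

Because the pixels keep the character order of the log, two logs that differ at one position produce images that differ at the corresponding pixels, which is what makes an image comparison meaningful.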
104. Acquiring historical log images corresponding to historical log text data, and respectively preprocessing a target log image and the historical log images to obtain image characteristics and identification character characteristic data of each image;
In this embodiment, a historical log image corresponding to the historical log text data is obtained, and the target log image and the historical log image are each preprocessed to obtain the image features and identification character feature data of each image. The preprocessing includes performing target detection and character feature extraction on the pictures.
Specifically, target detection means identifying a target identifier and its position in the picture. The target identifier may be preset according to the service scenario and refers to the object on which similarity comparison is to be performed. For example, in a shop-front sign similarity audit, target detection identifies the shop-front sign and its position in the picture. A shop-front sign is the plaque and related fittings installed at the entrance of an enterprise, institution or individual business, and is a form of exterior shop decoration. As another example, in a person similarity audit, target detection identifies the persons and their positions in the picture. Common target detection frameworks include the region proposal + CNN (convolutional neural network) feature extraction and classification framework and the end-to-end target detection framework.
Character feature extraction means identifying the character information in the picture. For example, in a shop-front sign similarity audit, character feature extraction identifies the text on the shop-front sign in the picture. A commonly used method is optical character recognition (OCR). The identification character feature data is the character information extracted from the pictures to be compared by character feature extraction. The identification position data is the position of the target identifier recognized in the picture to be compared by target detection; for example, it may be the coordinate information of the target identifier in the picture to be compared.
105. And calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
In this embodiment, the image similarity between the target log image and the history log image is calculated according to the image feature and the identification character feature data. Specifically, the similarity indicates whether the pictures to be compared show the same scene: the result may be that the pictures represent the same scene, or that they represent different scenes.
In this embodiment, the server may calculate a feature distance between the target identification pictures from their feature vectors, and compare it with a preset first distance threshold and a preset second distance threshold to decide whether the similarity can be obtained. When the feature distance is smaller than the first distance threshold, the similarity is obtained directly. When the feature distance is greater than the second distance threshold, the similarity of the pictures to be compared is obtained from the identification character feature data. When the feature distance is greater than the first distance threshold and smaller than the second distance threshold, the pictures to be compared are pushed to the manual terminal, where a staff member compares them.
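The two-threshold decision described above can be sketched as follows. This is only an illustrative sketch: the names `Decision` and `triage` are hypothetical, and sending a distance that exactly equals a threshold to manual review is an assumption, since the text does not say how boundary values are handled.

```python
from enum import Enum


class Decision(Enum):
    DIRECT_RESULT = "direct"        # similarity obtained directly from the feature distance
    USE_TEXT_FEATURES = "text"      # similarity obtained from identification character feature data
    MANUAL_REVIEW = "manual"        # pictures pushed to the manual terminal


def triage(distance: float, first_threshold: float, second_threshold: float) -> Decision:
    """Route a picture pair according to its feature distance (hypothetical helper)."""
    if first_threshold > second_threshold:
        raise ValueError("first threshold must not exceed second threshold")
    if distance < first_threshold:
        return Decision.DIRECT_RESULT
    if distance > second_threshold:
        return Decision.USE_TEXT_FEATURES
    # Assumption: values between the thresholds (boundaries included) go to a human.
    return Decision.MANUAL_REVIEW
```

With thresholds 0.3 and 0.8, for instance, a distance of 0.5 would be routed to manual review under these assumptions.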
In the embodiment of the invention, the text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoding character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoding character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. In this scheme, the text data is converted into image data in a picture format, and the similarity between images is calculated to judge whether the startup process is normal, which solves the technical problem in the prior art that similarity recognition of log data is inefficient.
Referring to fig. 2, a second embodiment of the text similarity recognition method according to the embodiment of the present invention includes:
201. acquiring real-time log text data to be processed, and performing feature extraction on the real-time log text data based on a preset text encoder to obtain sentence-level features and word-level features of the real-time log text data;
in this embodiment, to-be-processed real-time log text data is obtained, and feature extraction is performed on the real-time log text data based on a preset text encoder to obtain sentence-level features and word-level features of the real-time log text data. Specifically, a comprehensive text semantic representation plays a vital role in the task of converting text into images. The text semantics are represented at multiple levels, namely sentence level (sentence-level), aspect level (aspect-level) and word level (word-level), and the sentence-level, aspect-level and word-level features of a text sentence are extracted accordingly.
In a specific application scenario, the extracted original sentence-level features may be used directly in the subsequent text-to-image processing flow. Optionally, a conditioning augmentation (CA) method may further be used to enhance the extracted sentence-level features, so that the sentence-level representation is more accurate, and the enhanced sentence-level features are then used in the subsequent text-to-image processing flow.
202. Labeling each word in the real-time log text data according to the word level characteristics;
in this embodiment, each word in the real-time log text data is labeled according to the word-level features. For the aspect level features, the aspect level information of the text sentence can be extracted according to the syntax structure of the text sentence, and the aspect level features corresponding to the aspect level information are further extracted, so that the aspect level feature extraction of the text sentence is realized.
Specifically, tools such as NLTK (natural language toolkit) and the like can be used for performing part-of-speech tagging on each word in a text sentence, and then regular expressions are used for extracting the aspect information contained therein.
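As a rough sketch of the tagging-plus-regular-expression idea above: the input is assumed to be a list of (word, tag) pairs with Penn Treebank tags, the shape that NLTK's `pos_tag` returns, so the sketch itself does not need NLTK installed. The pattern "optional adjectives followed by one or more nouns" and the helper names are illustrative assumptions, not the pattern used in the embodiment.

```python
import re

# Assumed aspect pattern: zero or more adjectives followed by one or more nouns.
ASPECT_PATTERN = re.compile(r"J*N+")


def tag_class(tag: str) -> str:
    """Collapse a Penn Treebank tag to one letter so a regex can scan the sequence."""
    if tag.startswith("JJ"):
        return "J"  # adjective
    if tag.startswith("NN"):
        return "N"  # noun
    return "O"      # anything else


def extract_aspects(tagged_words):
    """Return phrases whose tag sequence matches ASPECT_PATTERN (hypothetical helper)."""
    classes = "".join(tag_class(tag) for _, tag in tagged_words)
    return [
        " ".join(word for word, _ in tagged_words[m.start():m.end()])
        for m in ASPECT_PATTERN.finditer(classes)
    ]


# Hand-tagged sample; in practice the tags would come from a part-of-speech tagger.
sample = [("service", "NN"), ("failed", "VBD"), ("with", "IN"),
          ("fatal", "JJ"), ("disk", "NN"), ("error", "NN")]
aspects = extract_aspects(sample)
```

Because each word maps to exactly one letter, a match span in the letter string is also the span of words to extract.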
203. Extracting level information corresponding to the real-time log text data based on a preset regular expression, and determining multi-level text representation data corresponding to the real-time log text data according to the level information;
in this embodiment, the level information corresponding to the real-time log text data is extracted based on the preset regular expression, and the multi-level text representation data corresponding to the real-time log text data is determined according to the level information. Specifically, a comprehensive text semantic representation plays a crucial role in the task of converting text into images. The text semantics are represented at multiple levels, namely sentence level (sentence-level), aspect level (aspect-level) and word level (word-level), and the sentence-level, aspect-level and word-level features of a text sentence are extracted accordingly.
204. Converting the text representation data based on a preset unicode character set to obtain initial code points;
in this embodiment, the text representation data is converted based on a preset Unicode character set to obtain initial code points. Specifically, the server converts the target text into initial code points according to the preset Unicode character set, the initial code points being expressed in hexadecimal; the server determines the number of bytes of each initial code point; and the server converts the initial code points into an initial encoding character string according to the number of bytes and a preset encoding specification, the initial encoding character string being expressed in binary.
205. Determining the byte number of the initial code point;
in this embodiment, the number of bytes of the initial code point is determined. Specifically, Unicode is an industry standard in the field of computer science that covers character sets, encoding schemes, and the like. Code points in its Basic Multilingual Plane are integers in the range 0-65535. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform and unique binary code to every character of every language, so as to meet the requirements of cross-language and cross-platform text conversion and processing. Because computers can only process numbers, the target text must first be converted into numbers before it can be processed. Each character in the target text corresponds to one encoded character.
206. Converting the real-time log text data into an encoding character string according to the number of bytes and a preset encoding specification;
in this embodiment, the real-time log text data is converted into an encoding character string according to the number of bytes and the preset encoding specification. The Unicode code space currently runs from 0x0000 to 0x10FFFF and is divided into 17 groups, each called a plane; each plane has 65536 code points, for 1114112 code points in total. However, only a few planes are currently in use. UTF-8, UTF-16, and UTF-32 are all encoding schemes that convert these code point numbers into program data.
Among them, UTF-8 is characterized by using encodings of different lengths for different ranges of characters. For characters in the range 0x00-0x7F, the UTF-8 encoding is identical to the ASCII encoding. The maximum length of a UTF-8 encoding is 4 bytes. The 4-byte template has 21 x's, that is, it can hold 21 binary digits, and the maximum Unicode code point, 0x10FFFF, also needs only 21 bits. Taking UTF-8 as an example, the initial code point of the Chinese character 汉 is 0x6C49. Since 0x6C49 lies in the range 0x0800-0xFFFF, the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx is used, so the number of bytes is 3. Converting the initial code point 0x6C49 to binary gives the target code point 0110 1100 0100 1001; replacing the x's in the 3-byte template with this bit stream in order from back to front yields 11100110 10110001 10001001.
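The template-filling procedure in the worked example can be reproduced with a short sketch. The function names `utf8_byte_count` and `utf8_bits` are hypothetical; the sketch assumes a valid code point and does not reject the surrogate range 0xD800-0xDFFF.

```python
def utf8_byte_count(code_point: int) -> int:
    """Number of UTF-8 bytes needed for a code point, chosen by range."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("code point is outside the Unicode range")


def utf8_bits(code_point: int) -> str:
    """Fill the UTF-8 byte template and return the bytes as space-separated bit groups."""
    n = utf8_byte_count(code_point)
    if n == 1:
        return format(code_point, "08b")  # identical to ASCII
    lead_payload = 7 - n                  # number of x's in the leading byte
    total_payload = lead_payload + 6 * (n - 1)
    bits = format(code_point, "0{}b".format(total_payload))
    groups = ["1" * n + "0" + bits[:lead_payload]]   # leading byte: 110x, 1110x or 11110x
    for i in range(lead_payload, total_payload, 6):
        groups.append("10" + bits[i:i + 6])          # continuation bytes: 10xxxxxx
    return " ".join(groups)


encoded = utf8_bits(0x6C49)
```

The result can be cross-checked against Python's built-in encoder by comparing `bytes(int(g, 2) for g in encoded.split())` with `chr(0x6C49).encode("utf-8")`.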
207. Determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification;
208. acquiring historical log images corresponding to historical log text data, and respectively preprocessing a target log image and the historical log images to obtain image characteristics and identification character characteristic data of each image;
209. and calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
Steps 207-209 in this embodiment are similar to steps 103-105 in the first embodiment, and are not repeated here.
In the embodiment of the invention, the text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoding character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoding character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. In this scheme, the text data is converted into image data in a picture format, and the similarity between images is calculated to judge whether the startup process is normal, which solves the technical problem in the prior art that similarity recognition of log data is inefficient.
Referring to fig. 3, a third embodiment of the text similarity recognition method according to the embodiment of the present invention includes:
301. acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data;
302. converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
303. converting the encoded string into a plurality of RGB color values;
in this embodiment, the encoded character string is converted into a plurality of RGB color values. Specifically, the initial encoding string is converted into a plurality of RGB color values according to a first preset rule: the server converts the initial encoding character string into an RGB numeric string through JavaScript, the RGB numeric string being decimal; the server takes every three consecutive values in the RGB numeric string as the pixel values of one RGB color value according to the first preset rule; and the server determines a plurality of RGB color values from the pixel values. Each pixel value ranges from 0 to 255.
For example, if the initial encoding string is 11111111 11111010 11111010, the corresponding RGB numeric string is 255 250 250; these three values are taken as the pixel values of one RGB color value, so the corresponding RGB color value is R255G250B250 and the corresponding hexadecimal color value is #FFFAFA. As another example, if the initial encoding string is 11111000 11111000 11111111, the corresponding RGB numeric string is 248 248 255; these three values are taken as the pixel values of one RGB color value, so the corresponding RGB color value is R248G248B255 and the corresponding hexadecimal color value is #F8F8FF. Further examples are not described herein.
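The grouping rule in these examples can be sketched as follows. The embodiment performs this step in JavaScript; the Python below is only an illustration with hypothetical names (`bits_to_rgb`, `to_hex`), and raising an error when the number of bytes is not a multiple of three is an assumption, since the text does not say how leftover bytes are treated.

```python
def bits_to_rgb(encoded: str):
    """Turn space-separated 8-bit groups into a list of (R, G, B) tuples."""
    values = [int(group, 2) for group in encoded.split()]  # decimal values, 0-255 each
    if len(values) % 3 != 0:
        # Assumption: incomplete triples are rejected rather than padded.
        raise ValueError("number of bytes is not a multiple of three")
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]


def to_hex(rgb) -> str:
    """Format an (R, G, B) tuple as a hexadecimal color value such as #FFFAFA."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


colors = bits_to_rgb("11111111 11111010 11111010 11111000 11111000 11111111")
hex_colors = [to_hex(c) for c in colors]
```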
304. Determining the picture specification according to a preset rule and RGB color values;
in this embodiment, the picture specification is determined according to the preset rule and the RGB color values. Specifically, the server determines the picture specification according to a second preset rule and the plurality of RGB color values, and a target picture is generated based on the picture specification. The picture specification indicates the number of lines and columns of the RGB color values, and the second preset rule specifies the number of RGB color values included in the same line.
305. Determining a character sequence in the real-time log text data, and arranging a plurality of RGB color values according to the character sequence and the picture specification to obtain an image parameter;
in this embodiment, a text sequence in the real-time log text data is determined, and the plurality of RGB color values are arranged according to the text sequence and the picture specification to obtain the image parameter.
Specifically, the server determines a character sequence in a target text to obtain a first sequence; the server arranges the RGB color values in sequence according to a first sequence and a picture specification to obtain picture parameters, wherein the picture parameters comprise the line number and the column number of the RGB color values; and the server generates a target picture according to the picture parameters. The picture parameters comprise the number of RGB color values in each line and column of the picture.
306. Generating a target log image corresponding to the real-time log text data according to the image parameters;
in this embodiment, a target log image corresponding to the real-time log text data is generated according to the image parameter. Specifically, according to the original sequence (first sequence) of the characters in the target text, the characters are replaced by the RGB color values corresponding to the characters, and the target picture is generated according to the picture parameters after replacement.
The server may set a conversion rule according to a preset rule, that is, set a parameter threshold of the picture, and adjust the size of the generated target picture, for example, how many RGB color values are included in the same row may be specified, so that a regular target picture is generated, which is not described herein again.
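One way to picture the arrangement step is the sketch below, which lays out RGB color values in their original order with a fixed number of values per line. The function name `arrange_pixels`, the returned dictionary layout, and padding an incomplete last line with a fill color are assumptions made for illustration; the text only states that the rule fixes how many RGB color values a line contains.

```python
def arrange_pixels(pixels, values_per_row, fill=(0, 0, 0)):
    """Arrange RGB tuples into rows of equal length, preserving their order."""
    if values_per_row <= 0:
        raise ValueError("values_per_row must be positive")
    rows = [
        list(pixels[i:i + values_per_row])
        for i in range(0, len(pixels), values_per_row)
    ]
    if rows and len(rows[-1]) < values_per_row:
        # Assumption: pad the final row so the picture is rectangular.
        rows[-1].extend([fill] * (values_per_row - len(rows[-1])))
    return {"rows": len(rows), "columns": values_per_row, "grid": rows}


params = arrange_pixels(
    [(255, 250, 250), (248, 248, 255), (1, 2, 3), (4, 5, 6), (7, 8, 9)],
    values_per_row=2,
)
```

The resulting grid could then be written out with any image library; that step is omitted here.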
307. Acquiring historical log images corresponding to historical log text data, and respectively preprocessing a target log image and the historical log images to obtain image characteristics and identification character characteristic data of each image;
308. and calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
Steps 301 to 302 and 307 to 308 in this embodiment are similar to steps 101 to 102 and 104 to 105 in the first embodiment, and are not repeated here.
In the embodiment of the invention, the text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoding character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoding character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. In this scheme, the text data is converted into image data in a picture format, and the similarity between images is calculated to judge whether the startup process is normal, which solves the technical problem in the prior art that similarity recognition of log data is inefficient.
Referring to fig. 4, a fourth embodiment of the text similarity recognition method according to the embodiment of the present invention includes:
401. acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data;
402. converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
403. determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification;
404. acquiring images to be compared, wherein the images to be compared are a historical log image and the target log image, and the historical log image is the image obtained by converting historical log text data;
in this embodiment, images to be compared are obtained, where the images to be compared are a history log image and the target log image, and the history log image is the image obtained by converting history log text data. Specifically, the history log text data is acquired; it is the text content of an attachment that needs to be uploaded in a webpage.
Before obtaining the history log text data, the server receives an analysis signal, where the analysis signal is used to instruct the server to analyze the uploaded attachment, analyze the text content of the attachment through a FILE API of JavaScript, and use the analysis result as the history log text data. After receiving the analysis signal, the server also receives an encryption signal, wherein the encryption signal is used for instructing the server to encrypt the historical log text data analyzed from the front end into pictures and then transmit the pictures to the background.
405. Respectively carrying out rotation correction detection on the images to be compared to obtain angle-corrected images to be compared;
in this embodiment, rotation correction detection is performed on images to be compared respectively to obtain angle-corrected images to be compared. Specifically, the rotation correction detection means detecting whether the picture to be compared is at a preset normal angle, and when the picture to be compared is not at the preset normal angle, performing angle correction on the picture to be compared. For example, the rotation correction detection may detect that the to-be-compared picture is rotated by 90 degrees, 180 degrees, or 270 degrees, and when the to-be-compared picture is detected to be rotated by 90 degrees, 180 degrees, or 270 degrees, the to-be-compared picture needs to be angle-corrected to the preset normal angle. The feature extraction graph is a feature graph obtained by extracting features of the angle-corrected picture to be compared and is used for representing the picture features of the angle-corrected picture to be compared.
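The correction half of this step can be sketched on a picture stored as a list of pixel rows. How the rotation angle is detected is not covered here; `detected_angle` is assumed to be supplied by some detector and to mean the clockwise rotation the picture currently has relative to the preset normal angle. Both function names are hypothetical.

```python
def rotate_clockwise_90(grid):
    """Rotate a list-of-rows picture by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def correct_rotation(grid, detected_angle):
    """Undo a detected clockwise rotation of 0, 90, 180 or 270 degrees."""
    if detected_angle not in (0, 90, 180, 270):
        raise ValueError("only multiples of 90 degrees are handled in this sketch")
    # Rotating further clockwise until a full turn is reached restores the original.
    turns = ((360 - detected_angle) % 360) // 90
    result = [list(row) for row in grid]
    for _ in range(turns):
        result = rotate_clockwise_90(result)
    return result


upright = [[1, 2, 3], [4, 5, 6]]
rotated = rotate_clockwise_90(upright)     # simulate a picture turned by 90 degrees
restored = correct_rotation(rotated, 90)
```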
406. Performing feature extraction on the image to be compared after the angle correction to obtain a feature extraction image corresponding to the image to be compared after the angle correction;
in this embodiment, feature extraction is performed on the angle-corrected image to be compared, so as to obtain a feature extraction map corresponding to the angle-corrected image to be compared. Specifically, the server may use a MobileNetV2 network as the feature extractor to obtain the feature extraction map corresponding to the angle-corrected image to be compared. After the feature extraction map is obtained, the server performs target detection through a preset target detection network to obtain the identification position data corresponding to the image to be compared.
407. Carrying out target detection on the image to be compared after angle correction according to the feature extraction diagram to obtain identification position data corresponding to the image to be compared, and carrying out character feature extraction on the image to be compared after angle correction to obtain identification character feature data corresponding to the image to be compared;
in this embodiment, target detection is performed on the image to be compared after the angle correction according to the feature extraction diagram to obtain identification position data corresponding to the image to be compared, and character feature extraction is performed on the image to be compared after the angle correction to obtain identification character feature data corresponding to the image to be compared. The preset target detection network may specifically be a network composed of a plurality of convolution layers and an average pooling layer, each convolution layer may perform feature extraction on the feature extraction map, output feature maps of different sizes of receptive fields, and predict target positions and categories on the feature maps of different sizes of receptive fields, so as to obtain identification position data corresponding to the to-be-compared picture. When character feature extraction is performed on the angle-corrected picture to be compared, the server inputs the angle-corrected picture to be compared into a ResNet network for convolution, extracts a feature map of the angle-corrected picture to be compared, and inputs the feature map into a preset text line detection module and a text line recognition bidirectional LSTM (Long Short-Term Memory) network to obtain identification character feature data corresponding to the picture to be compared.
408. And calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
Steps 401 to 403 and 408 in this embodiment are similar to steps 101 to 103 and 105 in the first embodiment, and are not described again here.
In the embodiment of the invention, the text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoding character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoding character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. In this scheme, the text data is converted into image data in a picture format, and the similarity between images is calculated to judge whether the startup process is normal, which solves the technical problem in the prior art that similarity recognition of log data is inefficient.
Referring to fig. 5, a fifth embodiment of the text similarity recognition method according to the embodiment of the present invention includes:
501. acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data;
502. converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
503. determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification;
504. acquiring historical log images corresponding to historical log text data, and respectively preprocessing a target log image and the historical log images to obtain image characteristics and identification character characteristic data of each image;
505. according to the identification position data, cutting the image to be compared to obtain a target identification image corresponding to the image to be compared;
in this embodiment, the image to be compared is cut according to the identifier position data, so as to obtain a target identifier image corresponding to the image to be compared. Specifically, after the identification position data is obtained, the server marks a target identification picture in the picture to be compared according to the identification position data, cuts the picture to be compared, and cuts out an accurate target identification picture from the picture to be compared. For example, when the doorhead similarity is checked, after the doorhead position information is obtained, the server cuts the picture to be compared according to the doorhead position information, and cuts the doorhead picture from the picture to be compared. By the method, the accurate target identification picture can be cut out from the picture to be compared, so that the similarity comparison can be realized according to the accurate target identification picture, the interference of other picture features irrelevant to the target identification picture in the picture to be compared on the similarity comparison is reduced, and the accurate similarity comparison result is favorably obtained.
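The cutting step can be illustrated with a minimal sketch on a picture stored as a list of pixel rows. The box format (left, top, right, bottom) with exclusive right and bottom edges is an assumption; the text only says that the identification position data may be coordinate information of the target identifier.

```python
def crop(grid, box):
    """Cut the region given by box = (left, top, right, bottom) out of a picture."""
    left, top, right, bottom = box
    height = len(grid)
    width = len(grid[0]) if grid else 0
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box lies outside the picture")
    # Keep only the rows and columns inside the box.
    return [row[left:right] for row in grid[top:bottom]]


picture = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
target = crop(picture, (1, 0, 3, 2))
```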
506. Extracting the features of the target identification image to obtain a feature vector of the target identification image;
in this embodiment, feature extraction is performed on the target identification image to obtain a feature vector of the target identification image. The method comprises the steps of extracting features of a target identification picture to obtain a feature vector, and classifying the target identification picture according to the feature vector. In the process of classifying the target identification picture, the trained classification model firstly performs feature extraction on the target identification picture for multiple times through a multilayer network to obtain a feature vector of the target identification picture, and then classifies the target identification picture according to the feature vector. The feature vector of the target identification picture refers to a vector for representing picture features of the target identification picture.
507. Calculating the characteristic distance between the target identification images according to the characteristic vectors;
in this embodiment, the feature distance between the target identification images is calculated according to the feature vector. Specifically, the server calculates the feature distance between the target identification pictures according to the feature vector, and judges whether a similarity comparison result can be obtained by comparing the feature distance with a preset first distance threshold and a preset second distance threshold, when the feature distance is smaller than the preset first distance threshold, the similarity comparison result can be directly obtained, and when the feature distance is larger than the preset second distance threshold, the server further obtains the similarity comparison result of the pictures to be compared according to the target identification character feature information.
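The text does not name the distance metric. As one common choice, and purely as an assumption, the sketch below uses the Euclidean distance between two feature vectors; the function name `feature_distance` is hypothetical.

```python
import math


def feature_distance(vec_a, vec_b) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    if len(vec_a) != len(vec_b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


distance = feature_distance([0.0, 0.0], [3.0, 4.0])
```

The resulting value is what would then be compared with the preset first and second distance thresholds.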
508. Judging whether the characteristic distance is larger than a preset threshold value or not;
in this embodiment, it is determined whether the characteristic distance is greater than a preset threshold. The server calculates the characteristic distance between the target identification pictures according to the characteristic vector, and judges whether the similarity can be obtained or not by comparing the characteristic distance with a preset first distance threshold and a preset second distance threshold.
509. If yes, determining the similarity between the images to be compared according to the identification character characteristic data, and obtaining a similarity comparison result according to the value of the similarity.
In this embodiment, if the feature distance is greater than the preset threshold, the similarity between the images to be compared is determined according to the identification character feature data, and a similarity comparison result is obtained according to the value of the similarity.
Steps 501 to 504 in this embodiment are similar to steps 101 to 104 in the first embodiment, and are not described here again.
In the embodiment of the invention, the text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoding character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoding character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. In this scheme, the text data is converted into image data in a picture format, and the similarity between images is calculated to judge whether the startup process is normal, which solves the technical problem in the prior art that similarity recognition of log data is inefficient.
The text similarity recognition method in the embodiment of the present invention has been described above; the text similarity recognition apparatus in the embodiment of the present invention is described below with reference to fig. 6. A first embodiment of the text similarity recognition apparatus in the embodiment of the present invention includes:
the determining module 601 is configured to acquire real-time log text data to be processed, and determine multi-level text representation data corresponding to the real-time log text data;
a conversion module 602, configured to convert the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
a generating module 603, configured to determine a picture specification according to the encoding character string and a preset rule, and generate a target log image based on the picture specification;
the preprocessing module 604 is configured to obtain a history log image corresponding to history log text data, and respectively preprocess the target log image and the history log image to obtain image features and identification character feature data of each image;
a calculating module 605, configured to calculate an image similarity between the target log image and the historical log image according to the image feature and the identification character feature data.
In this embodiment of the invention, text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoded character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoded character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. By rendering the text data as image data in a picture format and calculating the similarity between the images to judge whether the startup process is normal, this scheme addresses the technical problem of inefficient similarity recognition for log data in the prior art.
Referring to fig. 7, a text similarity recognition apparatus according to a second embodiment of the present invention specifically includes:
the determining module 601 is configured to obtain real-time log text data to be processed, and determine multi-level text representation data corresponding to the real-time log text data;
a conversion module 602, configured to convert the real-time log text data into an encoded character string according to the text representation data and a preset encoding specification;
a generating module 603, configured to determine a picture specification according to the encoding character string and a preset rule, and generate a target log image based on the picture specification;
the preprocessing module 604 is configured to obtain a history log image corresponding to history log text data, and respectively preprocess the target log image and the history log image to obtain image features and identification character feature data of each image;
a calculating module 605, configured to calculate an image similarity between the target log image and the historical log image according to the image feature and the identification character feature data.
In this embodiment, the determining module 601 is specifically configured to:
acquiring real-time log text data to be processed, and extracting features of the real-time log text data based on a preset text encoder to obtain sentence level features and word level features of the real-time log text data;
labeling each word in the real-time log text data according to the word level characteristics;
and extracting level information corresponding to the real-time log text data based on a preset regular expression, and determining multi-level text representation data corresponding to the real-time log text data according to the level information.
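The regular-expression step above can be sketched as follows. This is a minimal illustration only: the log line layout, the pattern `LOG_PATTERN`, and the function `extract_representation` are hypothetical, and it assumes that "level information" refers to the log severity level, which the source does not confirm. Whitespace splitting stands in for the preset text encoder, which the source does not specify.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <message>".
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<message>.*)$"
)


def extract_representation(line: str) -> dict:
    """Return a small multi-level representation of one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Lines that do not fit the assumed layout keep only the raw text.
        return {"level": None, "sentence": line, "words": line.split()}
    message = match.group("message")
    return {
        "level": match.group("level"),          # level information (regex)
        "timestamp": match.group("timestamp"),
        "sentence": message,                    # sentence-level stand-in
        "words": message.split(),               # word-level stand-in
    }
```

A real implementation would replace the whitespace split with features from whichever text encoder is configured; the dictionary only shows how sentence-level, word-level, and level information might be held together.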
In this embodiment, the conversion module 602 includes:
a first conversion unit 6021, configured to perform conversion processing on the text characterization data based on a preset unicode character set to obtain an initial code point;
a determining unit 6022, configured to determine the number of bytes of the initial code point;
a second conversion unit 6023, configured to convert the real-time log text data into an encoded character string according to the number of bytes and a preset encoding specification.
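The three units above can be illustrated with a short sketch. It assumes that the preset encoding specification is UTF-8 and that the encoded character string is the hexadecimal form of the encoded bytes; neither choice is stated in the source, and the function names are hypothetical.

```python
def utf8_byte_count(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def code_points_with_sizes(text: str) -> list:
    """Map each character to (code point, UTF-8 byte count)."""
    return [(ord(ch), utf8_byte_count(ord(ch))) for ch in text]


def encode_log_text(text: str) -> str:
    """Encode text as an uppercase hex string of its UTF-8 bytes (assumed format)."""
    return text.encode("utf-8").hex().upper()
```

For example, the character `A` (code point U+0041) occupies one byte, while `日` (U+65E5) occupies three, so the two contribute two and six hexadecimal digits to the encoded string respectively.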
In this embodiment, the generating module 603 is specifically configured to:
converting the encoded string into a plurality of RGB color values;
determining the picture specification according to a preset rule and the RGB color value;
determining a character sequence in the real-time log text data, and arranging the RGB color values according to the character sequence and the picture specification to obtain image parameters;
and generating a target log image corresponding to the real-time log text data according to the image parameters.
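A minimal sketch of the image generation steps above follows. It assumes the encoded character string is hexadecimal, that every three bytes form one RGB pixel, and that the "preset rule" picks a near-square picture; these are illustrative assumptions rather than details given in the source, and all function names are hypothetical.

```python
import math


def hex_to_rgb_values(encoded: str) -> list:
    """Split a hex string into (R, G, B) tuples, zero-padding the last pixel."""
    data = bytes.fromhex(encoded)
    data += b"\x00" * ((-len(data)) % 3)  # pad to a multiple of three bytes
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]


def picture_specification(pixel_count: int) -> tuple:
    """Assumed rule: the smallest near-square (width, height) that fits all pixels."""
    width = max(1, math.ceil(math.sqrt(pixel_count)))
    height = max(1, math.ceil(pixel_count / width))
    return width, height


def build_image_rows(encoded: str) -> list:
    """Arrange pixels row by row, in character order, into a width x height grid."""
    pixels = hex_to_rgb_values(encoded)
    width, height = picture_specification(len(pixels))
    pixels += [(0, 0, 0)] * (width * height - len(pixels))  # pad with black
    return [pixels[r * width:(r + 1) * width] for r in range(height)]
```

The returned rows play the role of the image parameters; an imaging library could write them out as a picture file. Filling pixels in row-major order preserves the character sequence of the original log text.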
In this embodiment, the preprocessing module 604 is specifically configured to:
acquiring images to be compared, wherein the images to be compared are historical log images and target log images, and the historical log images are corresponding historical log images converted from historical log text data;
respectively carrying out rotation correction detection on the images to be compared to obtain angle-corrected images to be compared;
performing feature extraction on the image to be compared after the angle correction to obtain a feature extraction image corresponding to the image to be compared after the angle correction;
and carrying out target detection on the image to be compared after the angle correction according to the feature extraction diagram to obtain identification position data corresponding to the image to be compared, and carrying out character feature extraction on the image to be compared after the angle correction to obtain identification character feature data corresponding to the image to be compared.
In this embodiment, the calculating module 605 is specifically configured to:
according to the identification position data, cutting the image to be compared to obtain a target identification image corresponding to the image to be compared;
extracting the features of the target identification image to obtain a feature vector of the target identification image;
and calculating the similarity between the historical log image and the target log image according to the feature vector and the identification character feature data to obtain a similarity comparison result of the historical log image and the target log image.
In this embodiment, the calculating module 605 is further specifically configured to:
calculating the characteristic distance between the target identification images according to the characteristic vector;
judging whether the characteristic distance is larger than a preset threshold value or not;
if yes, determining the similarity between the images to be compared according to the identification character feature data, and obtaining a similarity comparison result according to the value of the similarity.
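The threshold logic above might look like the following sketch. Euclidean distance and `difflib.SequenceMatcher` are stand-ins chosen for illustration, since the source names neither a distance metric nor a character comparison method; the function names are hypothetical. The source also does not say what happens when the distance does not exceed the threshold, so that branch returns `None` for the similarity.

```python
import math
from difflib import SequenceMatcher


def feature_distance(vec_a, vec_b) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    if len(vec_a) != len(vec_b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))


def compare_images(vec_a, vec_b, chars_a: str, chars_b: str,
                   distance_threshold: float) -> tuple:
    """Return (distance, similarity); similarity is None below the threshold."""
    distance = feature_distance(vec_a, vec_b)
    if distance > distance_threshold:
        # Fall back to the identification character data, as the step describes.
        similarity = SequenceMatcher(None, chars_a, chars_b).ratio()
        return distance, similarity
    # Not specified in the source; left undecided in this sketch.
    return distance, None
```

A caller would then map the similarity value to a comparison result, for instance by checking it against a second, application-specific cutoff.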
In this embodiment of the invention, text representation data corresponding to the acquired real-time log data is determined; the real-time log data is converted into an encoded character string according to the text representation data and a preset encoding specification; a picture specification is determined according to the encoded character string and a preset rule, and a target log image is generated based on the picture specification; historical log images corresponding to the historical log data are acquired, the target log image and the historical log images are each preprocessed, and the similarity between the target log image and the historical log images is calculated from the resulting image features and identification character feature data. By rendering the text data as image data in a picture format and calculating the similarity between the images to judge whether the startup process is normal, this scheme addresses the technical problem of inefficient similarity recognition for log data in the prior art.
Figs. 6 and 7 above describe the text similarity recognition apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities. The text similarity recognition device in the embodiment of the present invention is described below in detail from the perspective of hardware processing.
Fig. 8 is a schematic structural diagram of a text similarity recognition device according to an embodiment of the present invention. The text similarity recognition device 800 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 833 or data 832. The memory 820 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the text similarity recognition device 800. Further, the processor 810 may be configured to communicate with the storage medium 830 and to execute, on the text similarity recognition device 800, the series of instruction operations in the storage medium 830, so as to implement the steps of the text similarity recognition method provided by the above method embodiments.
The text similarity recognition device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input-output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will appreciate that the device structure shown in Fig. 8 does not limit the text similarity recognition device provided herein, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and may also be a volatile computer-readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to execute the steps of the text similarity identification method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text similarity recognition method is characterized by comprising the following steps:
acquiring real-time log text data to be processed, and determining multi-level text representation data corresponding to the real-time log text data;
converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
determining a picture specification according to the coding character string and a preset rule, and generating a target log image based on the picture specification;
acquiring historical log images corresponding to historical log text data, and respectively preprocessing the target log images and the historical log images to obtain image characteristics and identification character characteristic data of each image;
and calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
2. The method for recognizing text similarity according to claim 1, wherein the determining the multi-level text representation data corresponding to the real-time log text data comprises:
acquiring real-time log text data to be processed, and extracting features of the real-time log text data based on a preset text encoder to obtain sentence level features and word level features of the real-time log text data;
labeling each word in the real-time log text data according to the word level characteristics;
and extracting level information corresponding to the real-time log text data based on a preset regular expression, and determining multi-level text representation data corresponding to the real-time log text data according to the level information.
3. The text similarity recognition method according to claim 1, wherein the converting the real-time log text data into an encoding string according to the text representation data and a preset encoding specification comprises:
converting the text representation data based on a preset Unicode character set to obtain initial code points;
determining the number of bytes of the initial code point;
and converting the real-time log text data into an encoding character string according to the byte number and a preset encoding specification.
4. The method for recognizing text similarity according to claim 1, wherein the determining a picture specification according to the code string and a preset rule, and generating a target log image based on the picture specification comprises:
converting the encoded string into a plurality of RGB color values;
determining the picture specification according to a preset rule and the RGB color value;
determining a character sequence in the real-time log text data, and arranging the RGB color values according to the character sequence and the picture specification to obtain image parameters;
and generating a target log image corresponding to the real-time log text data according to the image parameters.
5. The text similarity recognition method according to claim 1, wherein the obtaining of the historical log images corresponding to the historical log text data, and the preprocessing of the target log image and the historical log image to obtain image features and identification character feature data of each image, respectively, includes:
acquiring images to be compared, wherein the images to be compared are historical log images and target log images, and the historical log images are corresponding historical log images converted from historical log text data;
respectively carrying out rotation correction detection on the images to be compared to obtain angle-corrected images to be compared;
performing feature extraction on the image to be compared after the angle correction to obtain a feature extraction image corresponding to the image to be compared after the angle correction;
and carrying out target detection on the image to be compared after the angle correction according to the characteristic extraction diagram to obtain identification position data corresponding to the image to be compared, and carrying out character characteristic extraction on the image to be compared after the angle correction to obtain identification character characteristic data corresponding to the image to be compared.
6. The text similarity recognition method according to claim 5, wherein the calculating of the image similarity between the target log image and the historical log image according to the image feature and the identification character feature data comprises:
according to the identification position data, the image to be compared is cut to obtain a target identification image corresponding to the image to be compared;
extracting the features of the target identification image to obtain a feature vector of the target identification image;
and calculating the similarity between the historical log image and the target log image according to the feature vector and the identification character feature data to obtain a similarity comparison result between the historical log image and the target log image.
7. The text similarity recognition method according to claim 6, wherein the calculating a similarity between the history log image and the target log image according to the feature vector and the identification character feature data to obtain a comparison result of the similarity between the history log image and the target log image comprises:
calculating the characteristic distance between the target identification images according to the characteristic vector;
judging whether the characteristic distance is larger than a preset threshold value or not;
if yes, determining the similarity between the images to be compared according to the identification character feature data, and obtaining a similarity comparison result according to the value of the similarity.
8. A text similarity recognition apparatus, characterized in that the text similarity recognition apparatus comprises:
the determining module is used for acquiring real-time log text data to be processed and determining multi-level text representation data corresponding to the real-time log text data;
the conversion module is used for converting the real-time log text data into an encoding character string according to the text representation data and a preset encoding specification;
the generating module is used for determining the picture specification according to the coding character string and a preset rule and generating a target log image based on the picture specification;
the preprocessing module is used for acquiring historical log images corresponding to historical log text data, and respectively preprocessing the target log images and the historical log images to obtain image characteristics and identification character characteristic data of each image;
and the calculating module is used for calculating the image similarity between the target log image and the historical log image according to the image characteristics and the identification character characteristic data.
9. A text similarity recognition apparatus characterized by comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the text similarity recognition device to perform the steps of the text similarity recognition method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text similarity recognition method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210858888.7A CN115424284A (en) 2022-07-21 2022-07-21 Text similarity recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210858888.7A CN115424284A (en) 2022-07-21 2022-07-21 Text similarity recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115424284A true CN115424284A (en) 2022-12-02

Family

ID=84197155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210858888.7A Pending CN115424284A (en) 2022-07-21 2022-07-21 Text similarity recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115424284A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797405A (en) * 2023-06-29 2023-09-22 华腾建信科技有限公司 Engineering data processing method and system based on data intercommunication of participating parties
CN116797405B (en) * 2023-06-29 2023-12-19 华腾建信科技有限公司 Engineering data processing method and system based on data intercommunication of participating parties
CN116882383A (en) * 2023-07-26 2023-10-13 中信联合云科技有限责任公司 Digital intelligent proofreading system based on text analysis

Similar Documents

Publication Publication Date Title
CN115424284A (en) Text similarity recognition method, device, equipment and storage medium
CN110795919B (en) Form extraction method, device, equipment and medium in PDF document
JP2005242579A (en) Document processor, document processing method and document processing program
CN113836928B (en) Text entity generation method, device, equipment and storage medium
CN112151014A (en) Method, device and equipment for evaluating voice recognition result and storage medium
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN114821613A (en) Extraction method and system of table information in PDF
CN113626561A (en) Component model identification method, device, medium and equipment
CN113705167A (en) Character checking method, device, equipment and storage medium
CN116484052B (en) Educational resource sharing system based on big data
CN111144107B (en) Messy code identification method based on slicing algorithm
CN110727743A (en) Data identification method and device, computer equipment and storage medium
JP4885112B2 (en) Document processing apparatus, document processing method, and document processing program
CN112084105A (en) Log file monitoring and early warning method, device, equipment and storage medium
US8990238B2 (en) System and method for keyword spotting using multiple character encoding schemes
KR102599980B1 (en) Data processing method for decoding text data and data processing apparatus thereof
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
CN113743052A (en) Multi-mode-fused resume layout analysis method and device
CN114693955A (en) Method and device for comparing image similarity and electronic equipment
CN111402012B (en) E-commerce defective product identification method based on transfer learning
CN117235727B (en) WebShell identification method and system based on large language model
CN113722496B (en) Triple extraction method and device, readable storage medium and electronic equipment
CN113449510B (en) Text recognition method, device, equipment and storage medium
CN109408795B (en) Text recognition method, text recognition equipment, computer readable storage medium and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination