WO2021068683A1 - Regular expression generation method, apparatus, server and computer-readable storage medium - Google Patents

Regular expression generation method, apparatus, server and computer-readable storage medium

Info

Publication number
WO2021068683A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
regular expression
information
text information
key information
Prior art date
Application number
PCT/CN2020/112341
Other languages
English (en)
French (fr)
Inventor
唐志辉 (Tang Zhihui)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021068683A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Definitions

  • This application relates to the field of machine learning technology, and in particular to a method, device, server, and computer-readable storage medium for generating regular expressions.
  • A regular expression is a tool for describing text rules, that is, code that records the rules a piece of text must follow.
  • A regular expression is a logical formula that operates on strings. It consists of ordinary characters (such as the letters a through z) and special characters (called "metacharacters"): predefined specific characters and combinations of them form a "rule string", which expresses filtering logic to be applied to strings.
  • A regular expression is a text pattern that describes one or more strings to match when searching text. It is typically used to retrieve and replace text that conforms to a given pattern (rule).
  • A regular expression generator can generate the corresponding regular expression code from the string the user wants to match.
  • Existing regular expression generators generally provide common regular expressions for users to choose from. For example, if the user selects the "mobile phone number" button, the tool generates the regular expression corresponding to "mobile phone number".
  • However, these generators offer only a limited set of regular expressions to choose from; they cannot cover specific scenarios or meet user requirements, which degrades the user experience.
  • This application proposes a regular expression generation method, which includes the following steps: receiving text information input by a user; filtering the text information to extract key information; performing text classification on the extracted key information according to a predetermined category system; and automatically recognizing, through machine learning, the regular expression corresponding to the classified text information.
  • a regular expression generating device which includes:
  • Receiving module: used to receive text information input by the user;
  • Extraction module: used to filter the text information to extract key information;
  • Classification module: used to perform text classification on the extracted key information according to a predetermined category system;
  • Recognition module: used to automatically recognize, through machine learning, the regular expression corresponding to the classified text information.
  • a server includes a memory and a processor.
  • the memory stores a regular expression generation program that can run on the processor.
  • When the regular expression generation program is executed by the processor, the following steps are implemented: receiving text information input by a user; filtering the text information to extract key information; performing text classification on the extracted key information according to a predetermined category system; and automatically recognizing, through machine learning, the regular expression corresponding to the classified text information.
  • The present application also provides a computer-readable storage medium that stores a regular expression generation program executable by at least one processor, causing the at least one processor to perform the following steps: receiving text information input by a user; filtering the text information to extract key information; performing text classification on the extracted key information according to a predetermined category system; and automatically recognizing, through machine learning, the regular expression corresponding to the classified text information.
  • The regular expression generation method, device, server, and computer-readable storage medium proposed in this application can automatically generate the corresponding regular expression code from the text information input by the user. Rather than offering only a small set of common regular expressions to choose from, they automatically classify and recognize the text information according to the user's needs and generate the corresponding regular expression, thereby also meeting the needs of various specific scenarios.
  • The regular expression generation method is thus more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions on their own and to build and maintain various text-parsing tools themselves.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server of this application.
  • FIG. 2 is a schematic diagram of modules of the first embodiment of the regular expression generating device of the present application.
  • FIG. 3 is a schematic diagram of modules of a second embodiment of the regular expression generation device of the present application.
  • FIG. 4 is a schematic flowchart of a first embodiment of the regular expression generation method of the present application.
  • FIG. 5 is a schematic flowchart of a second embodiment of the regular expression generation method of the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server 2 of the present application.
  • the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus. It should be pointed out that FIG. 1 only shows the server 2 with components 11-13, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • the server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server 2 may be an independent server or a server cluster composed of multiple servers.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and so on.
  • The memory 11 may be an internal storage unit of the server 2, for example, a hard disk or memory of the server 2.
  • The memory 11 may also be an external storage device of the server 2, for example, a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the server 2.
  • the memory 11 may also include both the internal storage unit of the server 2 and its external storage device.
  • the memory 11 is generally used to store the operating system and various application software installed in the server 2, for example, the code of the regular expression generation program 20.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 12 is generally used to control the overall operation of the server 2.
  • the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the code of the regular expression generation program 20.
  • the network interface 13 may include a wireless network interface or a wired network interface, and the network interface 13 is usually used to establish a communication connection between the server 2 and other electronic devices.
  • this application proposes a regular expression generating device 200.
  • FIG. 2 is a schematic diagram of the modules of the first embodiment of the regular expression generating apparatus 200 of the present application.
  • The regular expression generating device 200 includes a series of computer program instructions stored in the memory 11; when these instructions are executed by the processor 12, the regular expression generation operations of the various embodiments of the present application can be implemented.
  • The regular expression generating apparatus 200 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer program instructions. For example, in FIG. 2, the regular expression generating device 200 can be divided into a receiving module 201, an extraction module 202, a classification module 203, and an identification module 204, wherein:
  • the receiving module 201 is used to receive text information input by the user.
  • the corresponding regular expression code can be automatically generated by the regular expression generator.
  • the user needs to input required text information into the regular expression generator tool, and after the tool receives the text information input by the user, it performs subsequent processing on it.
  • the extraction module 202 is used to filter the input text information to extract key information.
  • extracting key information mainly includes two stages: text segmentation and stop word removal.
  • In Chinese text information, such as a Chinese sentence, words run together without delimiters, while the smallest useful unit of granularity for data analysis is the word; word segmentation is therefore needed to prepare for the next step.
  • Unlike English, which has natural space delimiters, Chinese word segmentation requires a purpose-built, relatively complex segmentation algorithm.
  • Traditional algorithms mainly include forward/reverse/bidirectional maximum matching based on string matching, disambiguation through syntactic and semantic analysis based on understanding, and mutual information/conditional random field (CRF) methods based on statistics.
  • Stop words are high-frequency pronouns, conjunctions, prepositions, and similar words in the text information that are meaningless for text classification (they contribute nothing to the text features). In some texts, other words, such as adjectives, can also be removed. Usually a stop-word list is maintained, and during key-information extraction any word that appears on the list is deleted.
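The two extraction stages above (segmentation, then stop-word removal) can be sketched in Python. This is a minimal illustration only: the dictionary-based forward-maximum-matching segmenter, the toy dictionary, and the stop-word list are hypothetical stand-ins for the patent's unspecified implementation.

```python
# Minimal sketch: forward-maximum-matching segmentation plus stop-word
# removal. The tiny dictionary and stop-word list are illustrative only;
# a real system would use a segmentation library and a curated list.
DICT = {"手机", "号码", "是", "的", "我"}          # toy word dictionary
STOP_WORDS = {"是", "的", "我"}                    # toy stop-word list
MAX_WORD_LEN = 4

def segment(text: str) -> list[str]:
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in DICT:   # single chars are the fallback
                words.append(cand)
                i += length
                break
    return words

def extract_key_info(text: str) -> list[str]:
    """Segment, then drop any token found in the stop-word list."""
    return [w for w in segment(text) if w not in STOP_WORDS]

print(extract_key_info("我的手机号码是13512345678"))
```

Note the design choice mirrored from the text: stop words are removed after segmentation, since the stop-word list is defined over words, not characters.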
  • the classification module 203 is used for text classification of the extracted key information.
  • Text classification refers to the process by which a computer uses an algorithm to automatically classify input text according to a predetermined category system. For example: is the input text information a Chinese character or a number? If a number, how many digits does it have? An 18-digit number is generally an ID number.
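As an illustration of such a predetermined category system, the sketch below maps whole-string shapes to category names using Python's `re` module. The category names and patterns are invented examples, not the patent's actual rule set.

```python
import re

# Illustrative predetermined category system: each category is paired with
# a candidate pattern describing its shape. Patterns here are examples only.
CATEGORY_PATTERNS = {
    "mobile phone number": r"1[3-9]\d{9}",
    "18-digit ID number":  r"\d{17}[\dXx]",
    "Email address":       r"[\w.-]+@[\w-]+(\.[\w-]+)+",
    "positive integer":    r"[1-9]\d*",
}

def classify(text: str) -> str:
    """Return the first category whose pattern matches the whole input."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.fullmatch(pattern, text):
            return category
    return "unknown"

print(classify("13512345678"))
```

Once a category is identified this way, the regular expression associated with that category can be returned, which is the matching step the later sections describe.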
  • the recognition module 204 is configured to automatically recognize a regular expression corresponding to the classified text information through machine learning.
  • the regular expression generator can automatically identify corresponding regular expressions for the text information after the text classification processing through a large number of sample training and machine learning.
  • A training set is used to estimate the model; a validation set is used to determine the network structure or the parameters that control the model's complexity; and a test set is used to evaluate the performance of the final model. The sample data set may also be divided into only two parts, namely a training set and a test set.
  • In the k-fold cross-validation method, the sample data set D is first divided into k mutually exclusive subsets of similar size. Each subset preserves the consistency of the data distribution as far as possible, i.e., it is obtained from D through stratified sampling.
  • Previously used text information and its corresponding regular expressions can serve as the sample data set. The sample data set can be divided into 10 parts; 9 parts are selected in turn as the training set and the remaining part as the test set.
  • The nine parts of training data are used to train the machine learning model of the regular expression generator, and the one part of test data is then used to verify the result.
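The 10-fold split described above can be sketched as follows. The (text, regex) sample pairs are invented placeholders; this simple version partitions by index rather than by stratified sampling.

```python
import random

# Sketch of the 10-fold split described above: (text, regex) sample pairs
# are shuffled and partitioned into k folds; each fold serves once as the
# test set while the remaining folds form the training set.
def k_fold_splits(samples, k=10, seed=0):
    data = samples[:]
    random.Random(seed).shuffle(data)        # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

samples = [(f"text{i}", f"regex{i}") for i in range(20)]  # placeholder pairs
for train, test in k_fold_splits(samples):
    print(len(train), len(test))
```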
  • the machine learning model that has undergone the above training and testing can automatically output corresponding regular expressions directly according to the input text information after the text classification processing.
  • In this way, users (for example, non-developers) can obtain the regular expressions they need directly.
  • The regular expression generating device provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than offering only a small set of commonly used regular expressions to choose from, it automatically classifies and recognizes the text information according to the user's needs and generates the corresponding regular expression, thereby also meeting the needs of various specific scenarios.
  • The regular expression generating device is thus more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions on their own and to build and maintain various text-parsing tools themselves.
  • The classification module 203 specifically includes a feature extraction sub-module 300, a text classification sub-module 302, and a post-processing sub-module 304, wherein:
  • the feature extraction sub-module 300 is used for feature extraction and text representation of the extracted key information.
  • the core of text classification is how to extract key features that can reflect text characteristics from text information, and capture the mapping between features and categories, so feature extraction is very important.
  • The purpose of text representation is to convert the preprocessed text into a form the computer can process; it is the step that most determines the quality of text classification.
  • Commonly used text representation models are the bag-of-words (BOW) model and the vector space model (VSM).
  • The bag-of-words model is the basis of the vector space model. The vector space model reduces dimensionality through feature item selection and increases density through feature weight calculation.
  • the feature extraction of the text representation method of the vector space model corresponds to the selection of feature items and the calculation of feature weights.
  • the basic idea of feature selection is to independently rank the original feature items (terms) according to a certain evaluation index, select some of the feature items with the highest scores, and filter out the remaining feature items.
  • Commonly used evaluation metrics include document frequency, mutual information, information gain, the χ² statistic, and so on.
  • Feature weighting mainly uses the classic term frequency-inverse document frequency (TF-IDF) method and its extensions. The main idea is that a word's importance is proportional to its frequency within a category and inversely proportional to the number of categories in which it occurs.
  • The more text features the converted feature code carries, the better it helps the classification algorithm predict the correct category.
  • One-hot or TF-IDF methods are used to convert each feature into a fixed-length feature code, which serves as the input to the classification algorithm; this is text representation.
  • For example, a long string of digits in Chinese text may represent content such as a mobile phone number, a license plate number, or a user ID, and can be converted into normalized feature codes, such as a Boolean feature HAS_DIGITAL indicating whether a long digit string is present, or a length-normalized feature such as DIGITAL_LEN_10.
  • One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states: each state has its own register bit, and at any time only one bit is valid.
  • One-hot encoding belongs to the bag-of-words model in feature extraction. Its first advantage is that it addresses the difficulty classifiers have with discrete data; in addition, it expands the feature space to a certain extent.
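A minimal sketch of one-hot encoding as just described: N states, N-bit codes, exactly one bit set per code. The vocabulary is an illustrative placeholder.

```python
# Minimal one-hot sketch: N categories -> N-bit codes with exactly one
# bit set, mirroring the "N-bit status register" description above.
def one_hot(vocabulary):
    index = {word: i for i, word in enumerate(sorted(vocabulary))}
    n = len(index)
    return {word: [1 if i == pos else 0 for i in range(n)]
            for word, pos in index.items()}

codes = one_hot({"手机", "号码", "邮箱"})   # toy vocabulary
for word, bits in codes.items():
    print(word, bits)
```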
  • TF-IDF is the most commonly used text representation in information retrieval (IR).
  • The idea of the algorithm is simple: count the term frequency (TF) of each word, and then attach a weight parameter to it, namely the inverse document frequency (IDF).
  • Term frequency: TF = (number of occurrences of the word in the document) / (total number of words in the document).
  • Inverse document frequency: IDF = log(total number of documents in the corpus / (number of documents containing the word + 1)).
  • TF is easy to understand: it simply counts word frequency.
  • IDF measures how common a word is. To calculate IDF, a corpus must be prepared in advance to simulate the language's usage environment.
  • TF-IDF = TF × IDF.
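The TF-IDF formulas above translate directly into a short computation. The toy corpus of segmented documents is invented for illustration.

```python
import math

# TF-IDF sketch matching the formulas above. Documents are token lists,
# as produced by the earlier segmentation step; the corpus is a toy stand-in.
def tf_idf(word, doc, corpus):
    tf = doc.count(word) / len(doc)                   # word freq / total words
    containing = sum(1 for d in corpus if word in d)  # docs containing the word
    idf = math.log(len(corpus) / (containing + 1))    # +1 as in the formula
    return tf * idf

corpus = [["手机", "号码"], ["邮箱", "地址"], ["手机", "邮箱"]]
doc = corpus[0]
print(tf_idf("号码", doc, corpus))
```

A word appearing in only one document of the corpus scores higher than one appearing everywhere, which is exactly the "importance inversely proportional to commonness" idea stated above.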
  • the text classification sub-module 302 is used to classify data after text representation using a text classification algorithm.
  • The extracted features are fed into the learning model of the text classification algorithm, and the classification result is obtained from the model's prediction on the test data set.
  • classification algorithms include: decision tree, Rocchio algorithm, naive Bayes, neural network, support vector machine, linear least square fitting, nearest neighbor algorithm (kNN), genetic algorithm, maximum entropy, etc.
  • a decision tree algorithm may be used, and a fixed-length feature code obtained after the feature extraction and text representation is used as an input, so as to classify the text information.
  • the classification decision tree model is a tree structure that describes the classification of instances. It is a predictive model and represents a mapping relationship between object attributes and object values.
  • The decision tree is composed of nodes and directed edges. Nodes are of two types: internal nodes, which represent a feature or attribute, and leaf nodes, which represent a class. Classification starts from the root node: a feature of the instance is tested, and the instance is assigned to one of the child nodes according to the result, each child node corresponding to one value of that feature. This proceeds recursively downward until a leaf node is reached, and the instance is finally assigned to the class of that leaf node. In layman's terms, it is an if-then process. Common decision tree algorithms are ID3, C4.5, and CART.
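The if-then character of a decision tree can be illustrated with a hand-written tree over the digit features mentioned earlier (HAS_DIGITAL, digit length). The tree itself is a hypothetical illustration, not a trained ID3/C4.5/CART model.

```python
# Hand-written decision "tree" as nested if-then tests: each test is an
# internal node on a feature, each return is a leaf assigning a class.
def classify_by_tree(text: str) -> str:
    has_digital = any(c.isdigit() for c in text)     # root node test
    if not has_digital:
        return "Chinese characters"                  # leaf
    digits = sum(c.isdigit() for c in text)          # next internal node
    if digits == 18:
        return "18-digit ID number"                  # leaf
    if digits == 11:
        return "mobile phone number"                 # leaf
    return "other number"                            # leaf

print(classify_by_tree("13512345678"))
```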
  • The naive Bayes algorithm can also be used to classify the data after text representation.
  • Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. Its formula is P("belongs to a certain class" | "has certain characteristics") = P("has certain characteristics" | "belongs to a certain class") × P("belongs to a certain class") / P("has certain characteristics").
  • For example, given a word segmentation result such as mayun/nx @/n pingan.com/nx .cn/nx, the probability P("Email address" | these features) can be computed for each candidate category, and the most likely (highest-probability) classification result of the text information is obtained.
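A toy naive Bayes classifier over token features can make the formula concrete. Laplace smoothing is added here as an assumption (the text does not specify a smoothing scheme), and the tiny training samples are invented.

```python
from collections import Counter, defaultdict
import math

# Toy naive Bayes over token features, illustrating
# P(class | features) ∝ P(class) · Π P(feature | class).
# Training samples are invented placeholders.
TRAIN = [
    (["mayun", "@", "pingan.com", ".cn"], "Email address"),
    (["bob", "@", "example.com"],         "Email address"),
    (["135", "1234", "5678"],             "mobile phone number"),
]

def train(samples):
    class_counts = Counter(label for _, label in samples)
    token_counts = defaultdict(Counter)
    for tokens, label in samples:
        token_counts[label].update(tokens)
    return class_counts, token_counts

def predict(tokens, class_counts, token_counts):
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, cc in class_counts.items():
        lp = math.log(cc / total)                    # log prior P(class)
        denom = sum(token_counts[label].values()) + len(token_counts[label]) + 1
        for t in tokens:                             # Laplace-smoothed likelihoods
            lp += math.log((token_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best                                      # highest-probability class

cc, tc = train(TRAIN)
print(predict(["alice", "@", "pingan.com"], cc, tc))
```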
  • The categories may include Chinese characters, Email addresses, URLs, domestic phone numbers, Tencent QQ numbers, Chinese postal codes, 18-digit ID numbers, mobile phone numbers, fixed-line phone numbers, IP addresses, dates in year-month-day format, positive integers, negative integers, integers, non-negative integers, non-positive integers, positive floating-point numbers, negative floating-point numbers, and so on.
  • the regular expression corresponding to the category can be matched subsequently.
  • the post-processing sub-module 304 is used to perform post-processing on the classified text information according to preset rules.
  • keyword rules are the most commonly used post-processing method, which is characterized by the ability to directly introduce domain knowledge into the classification system. Keyword rules can not only realize that one or more keywords correspond to a category, but also realize one-to-many and many-to-many rule mapping when the upper-level algorithm gives a probability output. In addition, the strength and priority of different keyword rules can be set according to the actual situation, so that the prediction result can be adjusted more flexibly.
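Keyword post-processing with rule strength and priority might be sketched as follows; the rules, strengths, and priorities are illustrative assumptions, not the patent's actual rule set.

```python
# Sketch of keyword post-processing: rules carrying a strength and a
# priority adjust the classifier's probability output before the final
# category is chosen. All rule contents here are invented.
RULES = [
    # (keyword, category, strength boost, priority)
    ("身份证", "18-digit ID number",  0.3, 1),
    ("手机",   "mobile phone number", 0.3, 1),
    ("邮箱",   "Email address",       0.5, 2),
]

def post_process(text, probs):
    """Apply keyword rules (higher priority first) to a {category: prob} dict."""
    adjusted = dict(probs)
    for keyword, category, boost, _priority in sorted(RULES, key=lambda r: -r[3]):
        if keyword in text:
            adjusted[category] = adjusted.get(category, 0.0) + boost
    return max(adjusted, key=adjusted.get)

probs = {"mobile phone number": 0.4, "Email address": 0.35}
print(post_process("请解析我的邮箱", probs))
```

Because the rules only adjust scores rather than overriding them, domain knowledge is injected while the upper-level algorithm's probability output still matters, as the paragraph above describes.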
  • this application also proposes a regular expression generation method.
  • FIG. 4 is a schematic flowchart of the first embodiment of the regular expression generation method of the present application.
  • the execution order of the steps in the flowchart shown in FIG. 4 can be changed, and some steps can be omitted.
  • the method includes the following steps:
  • Step S400: Receive text information input by the user.
  • the corresponding regular expression code can be automatically generated by the regular expression generator.
  • the user needs to input required text information into the regular expression generator tool, and after the tool receives the text information input by the user, it performs subsequent processing on it.
  • Step S402: Filter the input text information to extract key information.
  • extracting key information mainly includes two stages: text segmentation and stop word removal.
  • In Chinese text information, such as a Chinese sentence, words run together without delimiters, while the smallest useful unit of granularity for data analysis is the word; word segmentation is therefore needed to prepare for the next step.
  • Unlike English, which has natural space delimiters, Chinese word segmentation requires a purpose-built, relatively complex segmentation algorithm.
  • Traditional algorithms mainly include forward/reverse/two-way maximum matching based on string matching, syntactic and semantic analysis disambiguation based on understanding, and mutual information/CRF method based on statistics.
  • the WordEmbedding+Bi-LSTM+CRF method has gradually become the mainstream.
  • Stop words are high-frequency pronouns, conjunctions, prepositions, and similar words in the text information that are meaningless for text classification (they contribute nothing to the text features). In some texts, other words, such as adjectives, can also be removed. Usually a stop-word list is maintained, and during key-information extraction any word that appears on the list is deleted.
  • Step S404: Perform text classification on the extracted key information.
  • Text classification refers to the process by which a computer uses an algorithm to automatically classify input text according to a predetermined category system. For example: is the input text information a Chinese character or a number? If a number, how many digits does it have? An 18-digit number is generally an ID number.
  • Step S406: The regular expression corresponding to the classified text information is automatically recognized through machine learning.
  • the regular expression generator can automatically identify corresponding regular expressions for the text information after the text classification processing through a large number of sample training and machine learning.
  • A training set is used to estimate the model; a validation set is used to determine the network structure or the parameters that control the model's complexity; and a test set is used to evaluate the performance of the final model. The sample data set may also be divided into only two parts, namely a training set and a test set.
  • In the k-fold cross-validation method, the sample data set D is first divided into k mutually exclusive subsets of similar size. Each subset preserves the consistency of the data distribution as far as possible, i.e., it is obtained from D through stratified sampling.
  • Previously used text information and its corresponding regular expressions can serve as the sample data set. The sample data set can be divided into 10 parts; 9 parts are selected in turn as the training set and the remaining part as the test set.
  • The nine parts of training data are used to train the machine learning model of the regular expression generator, and the one part of test data is then used to verify the result.
  • the machine learning model that has undergone the above training and testing can automatically output corresponding regular expressions directly according to the input text information after the text classification processing.
  • In this way, users (for example, non-developers) can obtain the regular expressions they need directly.
  • The regular expression generation method provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than offering only a small set of commonly used regular expressions to choose from, it automatically classifies and recognizes the text information according to the user's needs and generates the corresponding regular expression, thereby also meeting the needs of various specific scenarios.
  • The regular expression generation method is thus more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions on their own and to build and maintain various text-parsing tools themselves.
  • Step S404 specifically includes:
  • Step S500: Perform feature extraction and text representation on the extracted key information.
  • the core of text classification is how to extract key features that can reflect text characteristics from text information, and capture the mapping between features and categories, so feature extraction is very important.
  • The purpose of text representation is to convert the preprocessed text into a form the computer can process; it is the step that most determines the quality of text classification.
  • the bag-of-words model and/or vector space model are commonly used.
  • the bag-of-words model is the basis of the vector space model. Therefore, the vector space model reduces the dimensionality through feature item selection and increases the density through feature weight calculation.
  • the feature extraction of the text representation method of the vector space model corresponds to the selection of feature items and the calculation of feature weights.
  • the basic idea of feature selection is to independently rank the original feature items (terms) according to a certain evaluation index, select some of the feature items with the highest scores, and filter out the remaining feature items.
  • Commonly used evaluation metrics include document frequency, mutual information, information gain, the χ² statistic, and so on.
  • Feature weighting mainly uses the classic TF-IDF method and its extensions. The main idea is that a word's importance is proportional to its frequency within a category and inversely proportional to the number of categories in which it occurs.
  • The more text features the converted feature code carries, the better it helps the classification algorithm predict the correct category.
  • One-hot or TF-IDF methods are used to convert each feature into a fixed-length feature code, which serves as the input to the classification algorithm; this is text representation.
  • For example, a long string of digits in Chinese text may represent content such as a mobile phone number, a license plate number, or a user ID, and can be converted into normalized feature codes, such as a Boolean feature HAS_DIGITAL indicating whether a long digit string is present, or a length-normalized feature such as DIGITAL_LEN_10.
  • One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states: each state has its own register bit, and at any time only one bit is valid.
  • One-hot encoding belongs to the bag-of-words model in feature extraction. Its first advantage is that it addresses the difficulty classifiers have with discrete data; in addition, it expands the feature space to a certain extent.
  • TF-IDF is the most commonly used text representation in information retrieval.
  • The idea of the algorithm is simple: count the term frequency (TF) of each word, and then attach a weight parameter to it, namely the inverse document frequency (IDF).
  • Term frequency: TF = (number of occurrences of the word in the document) / (total number of words in the document).
  • Inverse document frequency: IDF = log(total number of documents in the corpus / (number of documents containing the word + 1)).
  • TF is easy to understand: it simply counts word frequency.
  • IDF measures how common a word is. To calculate IDF, a corpus must be prepared in advance to simulate the language's usage environment.
  • TF-IDF = TF × IDF.
  • Step S502: Use a text classification algorithm to classify the data after text representation.
  • The extracted features are fed into the learning model of the text classification algorithm, and the classification result is obtained from the model's prediction on the test data set.
  • classification algorithms include: decision tree, Rocchio algorithm, naive Bayes, neural network, support vector machine, linear least square fitting, nearest neighbor algorithm (kNN), genetic algorithm, maximum entropy, etc.
  • a decision tree algorithm may be used, and a fixed-length feature code obtained after the feature extraction and text representation is used as an input, so as to classify the text information.
  • the classification decision tree model is a tree structure that describes the classification of instances. It is a predictive model and represents a mapping relationship between object attributes and object values.
  • The decision tree is composed of nodes and directed edges. Nodes are of two types: internal nodes, which represent a feature or attribute, and leaf nodes, which represent a class. Classification starts from the root node: a feature of the instance is tested, and the instance is assigned to one of the child nodes according to the result, each child node corresponding to one value of that feature. This proceeds recursively downward until a leaf node is reached, and the instance is finally assigned to the class of that leaf node. In layman's terms, it is an if-then process. Common decision tree algorithms are ID3, C4.5, and CART.
  • In addition, the naive Bayes algorithm can also be used to classify the text-represented data.
  • The naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence of features. Its formula is: P("belongs to a class" | "has a feature") = P("has a feature" | "belongs to a class") × P("belongs to a class") / P("has a feature").
  • For example, for the input text mayun@pingan.com.cn, the word segmentation result is mayun/nx @/n pingan.com/nx .cn/nx, and the naive Bayes algorithm can compute the probability P("Email address" | "mayun", "@", "pingan.com", ".cn") that this text is an email address.
  • From the computed probabilities that the text information belongs to each category, the most likely (highest-probability) classification result can be obtained.
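Under the conditional-independence assumption, the posterior for each category can be compared as the prior times a product of per-token likelihoods. A minimal sketch follows; the priors and token likelihoods are made-up numbers for illustration, not values learned from any real corpus:

```python
# P(class | tokens) is proportional to P(class) * product of P(token | class),
# by the naive Bayes formula above. All probabilities here are invented.
priors = {"Email address": 0.3, "URL": 0.3, "Chinese characters": 0.4}
likelihood = {
    "Email address": {"mayun": 0.2, "@": 0.9, "pingan.com": 0.5, ".cn": 0.6},
    "URL": {"mayun": 0.1, "@": 0.05, "pingan.com": 0.7, ".cn": 0.7},
    "Chinese characters": {"mayun": 0.01, "@": 0.01, "pingan.com": 0.01, ".cn": 0.01},
}

def posterior(cls, tokens):
    # Unnormalized posterior; the shared evidence term P(tokens) cancels
    # when only the argmax over classes is needed.
    score = priors[cls]
    for t in tokens:
        score *= likelihood[cls].get(t, 1e-6)
    return score

tokens = ["mayun", "@", "pingan.com", ".cn"]
best = max(priors, key=lambda c: posterior(c, tokens))
assert best == "Email address"
```

The strong `@` likelihood under "Email address" dominates the product, which is why the highest-probability category for this token sequence is the email class.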
  • In this embodiment, the categories may include Chinese characters, email addresses, URLs, domestic phone numbers, Tencent QQ numbers, Chinese postal codes, 18-digit ID card numbers, mobile phone numbers, fixed-line phone numbers, IP addresses, dates in (year-month-day) format, positive integers, negative integers, integers, non-negative integers, non-positive integers, positive floating-point numbers, negative floating-point numbers, etc.
  • According to the classification result of this step, the regular expression corresponding to the category can subsequently be matched.
  • Step S504: post-process the classified text information according to preset rules.
  • Specifically, keyword rules are the most commonly used post-processing method; their characteristic is the ability to introduce domain knowledge directly into the classification system. Keyword rules can not only map one or more keywords to a category, but also, when the upper-layer algorithm gives a probability output, realize one-to-many and many-to-many rule mappings. Moreover, the strength and priority of different keyword rules can be set according to the actual situation, so that the prediction result can be adjusted more flexibly.
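One way such post-processing could look in code: keyword rules nudge the classifier's probability output toward a category when a keyword fires. The rule format, strengths, and priorities below are an assumption made for the sketch; the patent does not specify a concrete rule encoding:

```python
# Post-processing sketch: keyword rules adjust the classifier's probabilistic
# output. Rule tuples, strength values, and priorities are invented.
rules = [
    # (keyword, target category, strength, priority)
    ("@", "Email address", 0.5, 1),
    ("http", "URL", 0.5, 1),
]

def post_process(text, scores):
    # Apply rules in priority order, boosting the target category's score
    # by the rule's strength whenever the keyword appears in the text.
    adjusted = dict(scores)
    for keyword, category, strength, _priority in sorted(rules, key=lambda r: r[3]):
        if keyword in text:
            adjusted[category] = adjusted.get(category, 0.0) + strength
    return max(adjusted, key=adjusted.get)

scores = {"Email address": 0.4, "URL": 0.45}
assert post_process("mayun@pingan.com.cn", scores) == "Email address"
```

Here the "@" rule overrides the classifier's slight preference for URL, which is the kind of domain-knowledge correction the keyword-rule mechanism is meant to provide.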
  • The regular expression generation method provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than merely providing a small number of common regular expressions for the user to choose from, it automatically classifies and identifies text information according to user needs and generates the corresponding regular expression, and can also meet the needs of various specific scenarios.
  • The method is more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions and to maintain and build various text-parsing tools by themselves.
  • The computer-readable storage medium may be non-volatile or volatile; it stores a regular expression generation program, which may be executed by at least one processor so that the at least one processor performs the steps of the above regular expression generation method.
  • The technical solution of this application, in essence or the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A regular expression generation method, apparatus, server, and computer-readable storage medium. The method includes: receiving text information input by a user (S400); filtering the text information to extract key information (S402); classifying the extracted key information according to a predetermined category system (S404); and automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification (S406). The method can automatically generate the corresponding regular expression code from the text information input by the user, meets the needs of various specific scenarios, and is convenient, fast, and efficient to use.

Description

Regular expression generation method, apparatus, server, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on October 11, 2019, with application number 201910967226.1 and the title "Regular expression generation method, server, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of machine learning, and in particular to a regular expression generation method, apparatus, server, and computer-readable storage medium.
Background
When writing programs or web pages that process strings, there is often a need to find strings that conform to certain complex rules. A regular expression is the tool for describing such rules, that is, code that records text rules. A regular expression is a logical formula that operates on strings, composed of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"): predefined specific characters and combinations of them form a "rule string", and this "rule string" expresses a filtering logic over strings. A regular expression is a text pattern describing one or more strings to match when searching text, and is usually used to retrieve or replace text that conforms to a certain pattern (rule).
A regular expression generator can generate regular expression code corresponding to the string the user wants to match. Existing regular expression generators generally provide common regular expressions for the user to choose from. For example, if the user selects a "mobile phone number" button, the tool generates the regular expression corresponding to "mobile phone number". However, the inventor realized that these generators offer only a small selection of regular expressions and cannot cover specific scenarios, so they fail to meet users' requirements and degrade the user experience.
Technical Solution
This application proposes a regular expression generation method, which includes the steps of:
receiving text information input by a user;
filtering the text information to extract key information;
classifying the extracted key information according to a predetermined category system; and
automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
A regular expression generation apparatus, the apparatus including:
a receiving module, configured to receive text information input by a user;
an extraction module, configured to filter the text information to extract key information;
a classification module, configured to classify the extracted key information according to a predetermined category system; and
an identification module, configured to automatically identify, through machine learning, the regular expression corresponding to the text information after the text classification.
A server, including a memory and a processor, where the memory stores a regular expression generation program runnable on the processor, and the regular expression generation program, when executed by the processor, implements the following steps:
receiving text information input by a user;
filtering the text information to extract key information;
classifying the extracted key information according to a predetermined category system; and
automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
This application also provides a computer-readable storage medium storing a regular expression generation program, which may be executed by at least one processor so that the at least one processor performs the following steps:
receiving text information input by a user;
filtering the text information to extract key information;
classifying the extracted key information according to a predetermined category system; and
automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
Beneficial Effects
Compared with the prior art, the regular expression generation method, apparatus, server, and computer-readable storage medium proposed in this application can automatically generate the corresponding regular expression code from the text information input by the user. Rather than merely providing a small number of common regular expressions for the user to choose from, they automatically classify and identify text information according to user needs and generate the corresponding regular expression, and can also meet the needs of various specific scenarios. The regular expression generation method is more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions and to maintain and build various text-parsing tools by themselves.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the server of this application;
FIG. 2 is a module diagram of the first embodiment of the regular expression generation apparatus of this application;
FIG. 3 is a module diagram of the second embodiment of the regular expression generation apparatus of this application;
FIG. 4 is a flowchart of the first embodiment of the regular expression generation method of this application;
FIG. 5 is a flowchart of the second embodiment of the regular expression generation method of this application;
The realization of the objectives, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Invention
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second", etc. in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance, or as implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can realize them; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be deemed not to exist and falls outside the scope of protection claimed by this application.
Referring to FIG. 1, which is a schematic diagram of an optional hardware architecture of the server 2 of this application.
In this embodiment, the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to each other through a system bus. It should be noted that FIG. 1 only shows the server 2 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the server 2, such as the hard disk or internal memory of the server 2. In other embodiments, the memory 11 may also be an external storage device of the server 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the server 2. Of course, the memory 11 may also include both the internal storage unit of the server 2 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the server 2, such as the code of the regular expression generation program 20. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 2. In this embodiment, the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the code of the regular expression generation program 20.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 2 and other electronic devices.
So far, the hardware structure and functions of the devices relevant to this application have been described in detail. Below, various embodiments of this application are proposed based on the above introduction.
First, this application proposes a regular expression generation apparatus 200.
Referring to FIG. 2, which is a module diagram of the first embodiment of the regular expression generation apparatus 200 of this application.
In this embodiment, the regular expression generation apparatus 200 includes a series of computer program instructions stored in the memory 11; when these computer program instructions are executed by the processor 12, the regular expression generation operations of the embodiments of this application can be implemented. In some embodiments, based on the specific operations implemented by the parts of the computer program instructions, the regular expression generation apparatus 200 may be divided into one or more modules. For example, in FIG. 2, the apparatus 200 may be divided into a receiving module 201, an extraction module 202, a classification module 203, and an identification module 204. Among them:
The receiving module 201 is configured to receive text information input by a user.
Specifically, for text information that the user needs to search or replace, the corresponding regular expression code can be generated automatically by a regular expression generator. First, the user inputs the desired text information into the regular expression generator tool; after receiving the text information input by the user, the tool performs subsequent processing on it.
The extraction module 202 is configured to filter the input text information to extract key information.
Specifically, for the received user-input text information, in order to convert it accurately into the corresponding regular expression and find the text information the user really wants, some meaningless characters in the input text must first be filtered out, keeping only the key information.
For Chinese text, extracting key information mainly includes two stages: word segmentation and stop-word removal. In Chinese text, for example a Chinese sentence, words run together without separators, while the best minimum unit of granularity for data analysis is the word, so segmentation is needed to prepare for the next step. Unlike English, which has natural space separators, Chinese word segmentation requires complex segmentation algorithms. Traditional algorithms mainly include forward/backward/bidirectional maximum matching based on string matching, syntactic and semantic disambiguation based on understanding, and statistical mutual-information/conditional random field (CRF) methods. In addition, with the application of deep learning, the WordEmbedding + Bi-LSTM + CRF approach has gradually become mainstream. Stop words are high-frequency words in text, such as pronouns, conjunctions, and prepositions, that are meaningless for text classification (they contribute nothing to the text features). In some texts, certain words, such as adjectives, can also be removed in a targeted way. Usually a stop-word table is maintained, and words appearing in the table are deleted during key-information extraction.
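The two-stage extraction (segmentation, then stop-word filtering) can be sketched as follows. The segmentation stage is simulated here with a pre-tokenized input, since real Chinese segmentation would come from one of the algorithms named above or a segmentation library; the stop-word table is a tiny invented sample:

```python
STOPWORDS = {"的", "了", "和", "是", "在"}  # tiny sample stop-word table

def extract_key_info(tokens):
    # Stage 2 of key-information extraction: drop words found in the stop-word table.
    return [t for t in tokens if t not in STOPWORDS]

# Stage 1 (segmentation) is assumed to have already produced these tokens.
tokens = ["我", "的", "手机号", "是", "13800138000"]
assert extract_key_info(tokens) == ["我", "手机号", "13800138000"]
```

In a production system the stop-word table would be much larger and maintained separately, as the text describes; the filtering step itself stays this simple.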
The classification module 203 is configured to classify the extracted key information.
Specifically, text classification refers to the process by which a computer automatically categorizes input text according to a predetermined category system through an algorithm. For example: whether the input text consists of Chinese characters or digits; if digits, how many digits there are; and if it is an 18-digit number, it is generally an ID card number.
The identification module 204 is configured to automatically identify, through machine learning, the regular expression corresponding to the classified text information.
Specifically, through training on a large number of samples and machine learning, the regular expression generator can automatically identify the corresponding regular expression for the text information after the text classification.
In fields such as machine learning and pattern recognition, samples generally need to be divided into three independent parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters controlling model complexity, and the test set is used to evaluate the performance of the finally selected optimal model. Alternatively, the sample data set can be divided into only two parts, a training set and a test set, using K-fold cross-validation: first, the sample data set D is divided into k mutually exclusive subsets of similar size. Each subset keeps the data distribution as consistent as possible, i.e., it is obtained from D by stratified sampling. Then, each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set. In this way, k training/test set pairs are obtained, allowing k rounds of training and testing, and the mean of the k test results is finally returned.
In this embodiment, previously used text information and the corresponding regular expressions can be taken as the sample data set, which is evenly divided into 10 parts; 9 parts are selected in turn as the training set and the remaining 1 part as the test set. The 9 training parts are used to train the machine learning model of the regular expression generator, and the 1 test part is used to verify the test results.
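The 10-fold rotation (divide into 10 parts, train on 9, test on 1, rotate) can be sketched with standard-library Python. This is a simplified split that omits the stratified sampling mentioned in the text:

```python
def k_fold_splits(samples, k=10):
    # Divide the sample set into k folds of similar size; each round uses
    # the union of k-1 folds for training and the remaining fold for testing.
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test_fold = folds[i]
        train_folds = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train_folds, test_fold

samples = list(range(20))
splits = list(k_fold_splits(samples, k=10))
assert len(splits) == 10                       # 10 train/test pairs
train_part, test_part = splits[0]
assert len(test_part) == 2 and len(train_part) == 18
assert sorted(train_part + test_part) == samples  # folds are exhaustive and disjoint
```

The final reported metric would be the mean over the 10 test results, as the preceding paragraph describes.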
After the above training and testing, the machine learning model can subsequently output the corresponding regular expression directly from the input text information after the text classification. Users (for example, non-developers) can put the regular expressions generated by the regular expression generator into Excel and parse text information according to the rules configured in Excel, thereby maintaining and building various text-parsing tools themselves.
The regular expression generation apparatus provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than merely providing a small number of common regular expressions for the user to choose from, it automatically classifies and identifies text information according to user needs and generates the corresponding regular expression, and can also meet the needs of various specific scenarios. The apparatus is more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions and to maintain and build various text-parsing tools by themselves.
Referring to FIG. 3, which is a module diagram of the second embodiment of the regular expression generation apparatus 200 of this application. In this embodiment, the classification module 203 specifically includes a feature extraction sub-module 300, a text classification sub-module 302, and a post-processing sub-module 304. Among them:
The feature extraction sub-module 300 is configured to perform feature extraction and text representation on the extracted key information.
Specifically, the core of text classification is how to extract from the text information the key features that reflect the characteristics of the text and to capture the mapping from features to categories, so feature extraction is very important. The purpose of text representation is to convert the preprocessed text into a form a computer can understand; it is the part that most determines the quality of text classification. Traditional approaches commonly use the bag-of-words (BOW) model and/or the vector space model (VSM); the bag-of-words model is the basis of the vector space model, and the vector space model reduces dimensionality through feature-term selection and increases density through feature-weight calculation. Feature extraction in the vector space model's text representation corresponds to two parts: feature-term selection and feature-weight calculation. The basic idea of feature selection is to independently score and rank the original feature terms (word terms) according to some evaluation metric, select the feature terms with the highest scores, and filter out the rest. Common metrics include document frequency, mutual information, information gain, and the χ² statistic. Feature weighting mainly uses the classic term frequency-inverse document frequency (TF-IDF) method and its extensions; the main idea is that a word's importance is proportional to its frequency within a category and inversely proportional to the number of categories in which it appears.
In this embodiment, key features that reflect the characteristics of the text are first extracted from the key information, and the corresponding text is converted into feature codes of a predetermined format; the more text features the converted feature codes carry, the more they help the classification algorithm predict the corresponding category. After the feature values are extracted, methods such as one-hot encoding or TF-IDF are used to convert each feature code into a fixed-length feature code as the input to the classification algorithm, i.e., to perform text representation. Typically, a long string of digits in Chinese text represents content such as a mobile phone number, license plate number, or user ID; alternatively it can be converted into normalized feature codes, for example the Boolean feature HAS_DIGITAL indicating whether a long digit string appears, or a length-normalized feature such as DIGIAL_LEN_10.
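The digit-normalization idea (the Boolean HAS_DIGITAL feature and length-bucketed DIGIAL_LEN_* codes named above, spelled as in the source) could be sketched like this. The 5-digit threshold for what counts as a "long" digit run is an assumption of this sketch:

```python
import re

def normalize_digits(text):
    # Convert digit runs in the text into normalized feature codes:
    # HAS_DIGITAL - whether a long digit string (assumed: 5+ digits) appears;
    # DIGIAL_LEN_<n> - a length-normalized code per digit run (name as in source).
    features = {"HAS_DIGITAL": bool(re.search(r"\d{5,}", text))}
    for run in re.findall(r"\d+", text):
        features[f"DIGIAL_LEN_{len(run)}"] = True
    return features

f = normalize_digits("联系电话13800138000")
assert f["HAS_DIGITAL"] is True
assert f.get("DIGIAL_LEN_11") is True  # an 11-digit run suggests a mobile number
```

Encoding the length rather than the literal digits is what lets a downstream classifier generalize from one phone number to all 11-digit strings.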
One-hot encoding, also called one-of-N encoding, uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one bit is active. In feature extraction, one-hot encoding belongs to the bag-of-words model; its advantages are, first, that it solves the problem that classifiers do not handle discrete data well, and second, that to some extent it also expands the feature set.
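An N-state one-hot encoding, where exactly one position is active at a time, can be written in a few lines (the state names are illustrative):

```python
def one_hot(states, state):
    # N-bit register with exactly one active bit for the given state.
    vec = [0] * len(states)
    vec[states.index(state)] = 1
    return vec

states = ["Chinese characters", "Email address", "mobile phone number"]
assert one_hot(states, "Email address") == [0, 1, 0]
assert sum(one_hot(states, "mobile phone number")) == 1  # only one bit active
```

Because every vector has the same length N regardless of the input, this is one way to obtain the fixed-length feature codes the classification algorithm expects.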
TF-IDF is the most commonly used text representation in information retrieval (IR). The idea of the algorithm is simple: count the term frequency (TF) of each word, and then attach a weight parameter to it, namely the inverse document frequency (IDF). Here, term frequency (TF) = occurrences of a word / total number of words, and inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)). TF is simply word-frequency counting, while IDF measures how common a word is. To compute IDF, a corpus must be prepared in advance to simulate the usage environment of the language; the more common a word is, the larger the denominator in the formula, and the smaller the inverse document frequency, approaching 0. The TF-IDF formula is: TF-IDF = term frequency (TF) × inverse document frequency (IDF). From the formula it is easy to see that the TF-IDF value is proportional to the word's frequency in the document and inversely proportional to its frequency in the whole corpus, so it serves the goal of keyword extraction well. The advantages of this method are that it is simple and fast, and its results accord well with reality.
The text classification sub-module 302 is configured to classify the text-represented data using a text classification algorithm.
Specifically, after the text is represented as a generalized feature data structure, the features are fed into the text classification algorithm's learning model, and the classification result is obtained from predictions on the test data set. Common classification algorithms include: decision trees, the Rocchio algorithm, naive Bayes, neural networks, support vector machines, linear least-squares fitting, the nearest-neighbor algorithm (kNN), genetic algorithms, maximum entropy, etc. In this embodiment, a decision tree algorithm may be used, taking the fixed-length feature codes obtained after the feature extraction and text representation as input, so as to classify the text information.
A classification decision tree model is a tree structure that describes the classification of instances; it is a predictive model representing a mapping between object attributes and object values. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes; an internal node represents a feature or attribute, and a leaf node represents a class. During classification, starting from the root node, a certain feature of the instance is tested, and the instance is assigned to one of the node's children according to the test result; each child corresponds to one value of that feature. This recursion moves downward until a leaf node is reached, and the instance is finally assigned to the class of that leaf node. In layman's terms, it is an if-then process. Common decision trees include ID3, C4.5, and CART.
In addition, the naive Bayes algorithm can also be used to classify the text-represented data.
The naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence of features; its formula is the probability P("belongs to a class" | "has a feature") = P("has a feature" | "belongs to a class") × P("belongs to a class") / P("has a feature"). For example, for the input text mayun@pingan.com.cn, the word segmentation result is mayun/nx @/n pingan.com/nx .cn/nx, and the naive Bayes algorithm can be used to compute the probability P("Email address" | "mayun", "@", "pingan.com", ".cn") that this text is an email address. From the computed probabilities that the text information belongs to the various categories, the most likely (highest-probability) classification result can be obtained. In this embodiment, the categories may include Chinese characters, email addresses, URLs, domestic phone numbers, Tencent QQ numbers, Chinese postal codes, 18-digit ID card numbers, mobile phone numbers, fixed-line phone numbers, IP addresses, dates in (year-month-day) format, positive integers, negative integers, integers, non-negative integers, non-positive integers, positive floating-point numbers, negative floating-point numbers, etc. According to the classification result of this step, the regular expression corresponding to the category can subsequently be matched.
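Once a category is predicted, matching it to its regular expression can be a table lookup. The patterns below are common community patterns for a few of the listed categories, shown as plausible examples rather than the patterns the patent's generator actually emits:

```python
import re

# Hypothetical category-to-regex table for a few of the categories listed above.
CATEGORY_REGEX = {
    "18-digit ID number": r"^\d{17}[\dXx]$",
    "mobile phone number": r"^1[3-9]\d{9}$",
    "Email address": r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$",
    "Chinese postal code": r"^[1-9]\d{5}$",
}

def regex_for(category):
    # Look up the regular expression corresponding to the predicted category.
    return CATEGORY_REGEX[category]

assert re.match(regex_for("mobile phone number"), "13800138000")
assert re.match(regex_for("Email address"), "mayun@pingan.com.cn")
assert not re.match(regex_for("mobile phone number"), "12345")
```

Keeping the category system and the regex table separate is what lets non-developers extend the tool: adding a scenario-specific category only requires adding one table entry.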
The post-processing sub-module 304 is configured to post-process the classified text information according to preset rules.
Specifically, keyword rules are the most commonly used post-processing method; their characteristic is the ability to introduce domain knowledge directly into the classification system. Keyword rules can not only map one or more keywords to a category, but also, when the upper-layer algorithm gives a probability output, realize one-to-many and many-to-many rule mappings. Moreover, the strength and priority of different keyword rules can be set according to the actual situation, so that the prediction result can be adjusted more flexibly.
In addition, this application also proposes a regular expression generation method.
Referring to FIG. 4, which is a flowchart of the first embodiment of the regular expression generation method of this application. In this embodiment, depending on different needs, the execution order of the steps in the flowchart shown in FIG. 4 may be changed, and some steps may be omitted.
The method includes the following steps:
Step S400: receive text information input by a user.
Specifically, for text information that the user needs to search or replace, the corresponding regular expression code can be generated automatically by a regular expression generator. First, the user inputs the desired text information into the regular expression generator tool; after receiving the text information input by the user, the tool performs subsequent processing on it.
Step S402: filter the input text information to extract key information.
Specifically, for the received user-input text information, in order to convert it accurately into the corresponding regular expression and find the text information the user really wants, some meaningless characters in the input text must first be filtered out, keeping only the key information.
For Chinese text, extracting key information mainly includes two stages: word segmentation and stop-word removal. In Chinese text, for example a Chinese sentence, words run together without separators, while the best minimum unit of granularity for data analysis is the word, so segmentation is needed to prepare for the next step. Unlike English, which has natural space separators, Chinese word segmentation requires complex segmentation algorithms. Traditional algorithms mainly include forward/backward/bidirectional maximum matching based on string matching, syntactic and semantic disambiguation based on understanding, and statistical mutual-information/CRF methods. In addition, with the application of deep learning, the WordEmbedding + Bi-LSTM + CRF approach has gradually become mainstream. Stop words are high-frequency words in text, such as pronouns, conjunctions, and prepositions, that are meaningless for text classification (they contribute nothing to the text features). In some texts, certain words, such as adjectives, can also be removed in a targeted way. Usually a stop-word table is maintained, and words appearing in the table are deleted during key-information extraction.
Step S404: classify the extracted key information.
Specifically, text classification refers to the process by which a computer automatically categorizes input text according to a predetermined category system through an algorithm. For example: whether the input text consists of Chinese characters or digits; if digits, how many digits there are; and if it is an 18-digit number, it is generally an ID card number.
Step S406: automatically identify, through machine learning, the regular expression corresponding to the classified text information.
Specifically, through training on a large number of samples and machine learning, the regular expression generator can automatically identify the corresponding regular expression for the text information after the text classification.
In fields such as machine learning and pattern recognition, samples generally need to be divided into three independent parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters controlling model complexity, and the test set is used to evaluate the performance of the finally selected optimal model. Alternatively, the sample data set can be divided into only two parts, a training set and a test set, using K-fold cross-validation: first, the sample data set D is divided into k mutually exclusive subsets of similar size. Each subset keeps the data distribution as consistent as possible, i.e., it is obtained from D by stratified sampling. Then, each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set. In this way, k training/test set pairs are obtained, allowing k rounds of training and testing, and the mean of the k test results is finally returned.
In this embodiment, previously used text information and the corresponding regular expressions can be taken as the sample data set, which is evenly divided into 10 parts; 9 parts are selected in turn as the training set and the remaining 1 part as the test set. The 9 training parts are used to train the machine learning model of the regular expression generator, and the 1 test part is used to verify the test results.
After the above training and testing, the machine learning model can subsequently output the corresponding regular expression directly from the input text information after the text classification. Users (for example, non-developers) can put the regular expressions generated by the regular expression generator into Excel and parse text information according to the rules configured in Excel, thereby maintaining and building various text-parsing tools themselves.
The regular expression generation method provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than merely providing a small number of common regular expressions for the user to choose from, it automatically classifies and identifies text information according to user needs and generates the corresponding regular expression, and can also meet the needs of various specific scenarios. The method is more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions and to maintain and build various text-parsing tools by themselves.
As shown in FIG. 5, which is a flowchart of the second embodiment of the regular expression generation method of this application. In this embodiment, step S404 specifically includes:
Step S500: perform feature extraction and text representation on the extracted key information.
Specifically, the core of text classification is how to extract from the text information the key features that reflect the characteristics of the text and to capture the mapping from features to categories, so feature extraction is very important. The purpose of text representation is to convert the preprocessed text into a form a computer can understand; it is the part that most determines the quality of text classification. Traditional approaches commonly use the bag-of-words model and/or the vector space model; the bag-of-words model is the basis of the vector space model, and the vector space model reduces dimensionality through feature-term selection and increases density through feature-weight calculation. Feature extraction in the vector space model's text representation corresponds to two parts: feature-term selection and feature-weight calculation. The basic idea of feature selection is to independently score and rank the original feature terms (word terms) according to some evaluation metric, select the feature terms with the highest scores, and filter out the rest. Common metrics include document frequency, mutual information, information gain, and the χ² statistic. Feature weighting mainly uses the classic TF-IDF method and its extensions; the main idea is that a word's importance is proportional to its frequency within a category and inversely proportional to the number of categories in which it appears.
In this embodiment, key features that reflect the characteristics of the text are first extracted from the key information, and the corresponding text is converted into feature codes of a certain format; the more text features the converted feature codes carry, the more they help the classification algorithm predict the corresponding category. After the feature values are extracted, methods such as one-hot encoding or TF-IDF are used to convert each feature code into a fixed-length feature code as the input to the classification algorithm, i.e., to perform text representation. Typically, a long string of digits in Chinese text represents content such as a mobile phone number, license plate number, or user ID; alternatively it can be converted into normalized feature codes, for example the Boolean feature HAS_DIGITAL indicating whether a long digit string appears, or a length-normalized feature such as DIGIAL_LEN_10.
One-hot encoding, also called one-of-N encoding, uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one bit is active. In feature extraction, one-hot encoding belongs to the bag-of-words model; its advantages are, first, that it solves the problem that classifiers do not handle discrete data well, and second, that to some extent it also expands the feature set.
TF-IDF is the most commonly used text representation in information retrieval. The idea of the algorithm is simple: count the term frequency (TF) of each word, and then attach a weight parameter to it, namely the inverse document frequency (IDF). Here, term frequency (TF) = occurrences of a word / total number of words, and inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)). TF is simply word-frequency counting, while IDF measures how common a word is. To compute IDF, a corpus must be prepared in advance to simulate the usage environment of the language; the more common a word is, the larger the denominator in the formula, and the smaller the inverse document frequency, approaching 0. The TF-IDF formula is: TF-IDF = term frequency (TF) × inverse document frequency (IDF). From the formula it is easy to see that the TF-IDF value is proportional to the word's frequency in the document and inversely proportional to its frequency in the whole corpus, so it serves the goal of keyword extraction well. The advantages of this method are that it is simple and fast, and its results accord well with reality.
Step S502: classify the text-represented data using a text classification algorithm.
Specifically, after the text is represented as a generalized feature data structure, the features are fed into the text classification algorithm's learning model, and the classification result is obtained from predictions on the test data set. Common classification algorithms include: decision trees, the Rocchio algorithm, naive Bayes, neural networks, support vector machines, linear least-squares fitting, the nearest-neighbor algorithm (kNN), genetic algorithms, maximum entropy, etc. In this embodiment, a decision tree algorithm may be used, taking the fixed-length feature codes obtained after the feature extraction and text representation as input, so as to classify the text information.
A classification decision tree model is a tree structure that describes the classification of instances; it is a predictive model representing a mapping between object attributes and object values. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes; an internal node represents a feature or attribute, and a leaf node represents a class. During classification, starting from the root node, a certain feature of the instance is tested, and the instance is assigned to one of the node's children according to the test result; each child corresponds to one value of that feature. This recursion moves downward until a leaf node is reached, and the instance is finally assigned to the class of that leaf node. In layman's terms, it is an if-then process. Common decision trees include ID3, C4.5, and CART.
In addition, the naive Bayes algorithm can also be used to classify the text-represented data.
The naive Bayes algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence of features; its formula is the probability P("belongs to a class" | "has a feature") = P("has a feature" | "belongs to a class") × P("belongs to a class") / P("has a feature"). For example, for the input text mayun@pingan.com.cn, the word segmentation result is mayun/nx @/n pingan.com/nx .cn/nx, and the naive Bayes algorithm can be used to compute the probability P("Email address" | "mayun", "@", "pingan.com", ".cn") that this text is an email address. From the computed probabilities that the text information belongs to the various categories, the most likely (highest-probability) classification result can be obtained.
In this embodiment, the categories may include Chinese characters, email addresses, URLs, domestic phone numbers, Tencent QQ numbers, Chinese postal codes, 18-digit ID card numbers, mobile phone numbers, fixed-line phone numbers, IP addresses, dates in (year-month-day) format, positive integers, negative integers, integers, non-negative integers, non-positive integers, positive floating-point numbers, negative floating-point numbers, etc. According to the classification result of this step, the regular expression corresponding to the category can subsequently be matched.
Step S504: post-process the classified text information according to preset rules.
Specifically, keyword rules are the most commonly used post-processing method; their characteristic is the ability to introduce domain knowledge directly into the classification system. Keyword rules can not only map one or more keywords to a category, but also, when the upper-layer algorithm gives a probability output, realize one-to-many and many-to-many rule mappings. Moreover, the strength and priority of different keyword rules can be set according to the actual situation, so that the prediction result can be adjusted more flexibly.
The regular expression generation method provided in this embodiment can automatically generate the corresponding regular expression code from the text information input by the user. Rather than merely providing a small number of common regular expressions for the user to choose from, it automatically classifies and identifies text information according to user needs and generates the corresponding regular expression, and can also meet the needs of various specific scenarios. The method is more intelligent, convenient, fast, and efficient, and allows non-developers to generate regular expressions and to maintain and build various text-parsing tools by themselves.
This application also provides another embodiment, namely a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium stores a regular expression generation program, which may be executed by at least one processor so that the at least one processor performs the steps of the regular expression generation method described above.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A regular expression generation method, wherein the method includes the steps of:
    receiving text information input by a user;
    filtering the text information to extract key information;
    classifying the extracted key information according to a predetermined category system; and
    automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
  2. The regular expression generation method according to claim 1, wherein the step of classifying the extracted key information according to a predetermined category system includes:
    performing feature extraction and text representation on the extracted key information;
    classifying the text-represented data using a text classification algorithm;
    post-processing the classified text information according to preset rules.
  3. The regular expression generation method according to claim 1 or 2, wherein extracting the key information includes text word segmentation and stop-word removal.
  4. The regular expression generation method according to claim 1 or 2, wherein the step of automatically identifying the corresponding regular expression through machine learning includes:
    taking previously used text information and the corresponding regular expressions as a sample data set, and training and testing the machine learning model of a regular expression generator;
    using the trained and tested machine learning model to automatically output the corresponding regular expression from the input text information after the text classification.
  5. The regular expression generation method according to claim 2, wherein the step of performing feature extraction and text representation on the extracted key information includes:
    extracting from the key information key features that reflect the characteristics of the text, converting the corresponding text into feature codes of a predetermined format, and then using the one-hot or term frequency-inverse document frequency (TF-IDF) method to convert each feature code into a fixed-length feature code.
  6. The regular expression generation method according to claim 5, wherein, in the step of classifying the text-represented data using a text classification algorithm, a classification decision tree algorithm is used, taking the fixed-length feature codes obtained after the feature extraction and text representation as input, so as to classify the text information.
  7. The regular expression generation method according to claim 2, wherein, in the step of classifying the text-represented data using a text classification algorithm, the naive Bayes algorithm is used to classify the text-represented data, the probabilities that the text information belongs to the various categories are computed, and the classification result with the highest probability is obtained.
  8. The regular expression generation method according to claim 2, wherein the preset rules in the post-processing step are keyword rules, used to establish rule mappings between keywords in the text information and the classification categories.
  9. A regular expression generation apparatus, wherein the regular expression generation apparatus includes:
    a receiving module, configured to receive text information input by a user;
    an extraction module, configured to filter the text information to extract key information;
    a classification module, configured to classify the extracted key information according to a predetermined category system; and
    an identification module, configured to automatically identify, through machine learning, the regular expression corresponding to the text information after the text classification.
  10. A server, wherein the server includes a memory and a processor, the memory stores a regular expression generation program runnable on the processor, and the regular expression generation program, when executed by the processor, implements the following steps:
    receiving text information input by a user;
    filtering the text information to extract key information;
    classifying the extracted key information according to a predetermined category system; and
    automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
  11. The server according to claim 10, wherein the step of classifying the extracted key information according to a predetermined category system includes:
    performing feature extraction and text representation on the extracted key information;
    classifying the text-represented data using a text classification algorithm;
    post-processing the classified text information according to preset rules.
  12. The server according to claim 10 or 11, wherein extracting the key information includes text word segmentation and stop-word removal.
  13. The server according to claim 10 or 11, wherein the step of automatically identifying the corresponding regular expression through machine learning includes:
    taking previously used text information and the corresponding regular expressions as a sample data set, and training and testing the machine learning model of a regular expression generator;
    using the trained and tested machine learning model to automatically output the corresponding regular expression from the input text information after the text classification.
  14. The server according to claim 11, wherein the step of performing feature extraction and text representation on the extracted key information includes:
    extracting from the key information key features that reflect the characteristics of the text, converting the corresponding text into feature codes of a predetermined format, and then using the one-hot or term frequency-inverse document frequency (TF-IDF) method to convert each feature code into a fixed-length feature code.
  15. The server according to claim 14, wherein, in the step of classifying the text-represented data using a text classification algorithm, a classification decision tree algorithm is used, taking the fixed-length feature codes obtained after the feature extraction and text representation as input, so as to classify the text information.
  16. The server according to claim 11, wherein, in the step of classifying the text-represented data using a text classification algorithm, the naive Bayes algorithm is used to classify the text-represented data, the probabilities that the text information belongs to the various categories are computed, and the classification result with the highest probability is obtained.
  17. A computer-readable storage medium, wherein the computer-readable storage medium stores a regular expression generation program, which may be executed by at least one processor so that the at least one processor performs the following steps: receiving text information input by a user;
    filtering the text information to extract key information;
    classifying the extracted key information according to a predetermined category system; and
    automatically identifying, through machine learning, the regular expression corresponding to the text information after the text classification.
  18. The computer-readable storage medium according to claim 17, wherein the step of classifying the extracted key information according to a predetermined category system includes:
    performing feature extraction and text representation on the extracted key information;
    classifying the text-represented data using a text classification algorithm;
    post-processing the classified text information according to preset rules.
  19. The computer-readable storage medium according to claim 17 or 18, wherein extracting the key information includes text word segmentation and stop-word removal.
  20. The computer-readable storage medium according to claim 17 or 18, wherein the step of automatically identifying the corresponding regular expression through machine learning includes:
    taking previously used text information and the corresponding regular expressions as a sample data set, and training and testing the machine learning model of a regular expression generator;
    using the trained and tested machine learning model to automatically output the corresponding regular expression from the input text information after the text classification.
PCT/CN2020/112341 2019-10-11 2020-08-30 正则表达式生成方法、装置、服务器及计算机可读存储介质 WO2021068683A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910967226.1 2019-10-11
CN201910967226.1A CN110909160A (zh) 2019-10-11 2019-10-11 正则表达式生成方法、服务器及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2021068683A1 true WO2021068683A1 (zh) 2021-04-15

Family

ID=69815507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112341 WO2021068683A1 (zh) 2019-10-11 2020-08-30 正则表达式生成方法、装置、服务器及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN110909160A (zh)
WO (1) WO2021068683A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139883A (zh) * 2021-11-10 2022-03-04 云南电网有限责任公司信息中心 一种电力企业物资域评价指标的计算方法
CN114299930A (zh) * 2021-12-21 2022-04-08 广州虎牙科技有限公司 端到端语音识别模型处理方法、语音识别方法及相关装置
WO2023165122A1 (zh) * 2022-03-04 2023-09-07 康键信息技术(深圳)有限公司 问诊模板的匹配方法、装置、设备及存储介质
CN117114910A (zh) * 2023-09-22 2023-11-24 浙江河马管家网络科技有限公司 一种基于机器学习的票务自动入账系统及方法

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909160A (zh) * 2019-10-11 2020-03-24 平安科技(深圳)有限公司 正则表达式生成方法、服务器及计算机可读存储介质
CN111881795B (zh) * 2020-07-20 2022-06-21 上海东普信息科技有限公司 运单号识别方法及装置
CN112131378B (zh) * 2020-08-20 2024-09-03 彭涛 用于识别民生问题类别的方法、装置及电子设备
CN111814192B (zh) * 2020-08-28 2021-04-27 支付宝(杭州)信息技术有限公司 训练样本生成方法及装置、敏感信息检测方法及装置
CN112329469B (zh) * 2020-11-05 2023-12-19 新华智云科技有限公司 一种行政地域实体识别方法及系统
CN112966264A (zh) * 2021-02-28 2021-06-15 新华三信息安全技术有限公司 Xss攻击检测方法、装置、设备及机器可读存储介质
CN113515957B (zh) * 2021-04-21 2023-09-19 南通大学 一种基于bart模型的正则表达式描述生成方法
CN113779935A (zh) * 2021-09-10 2021-12-10 北京金堤科技有限公司 文本信息的获取方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217121A1 (en) * 2015-01-22 2016-07-28 Alibaba Group Holding Limited Generating regular expression
CN109783819A (zh) * 2019-01-18 2019-05-21 广东小天才科技有限公司 一种正则表达式的生成方法及系统
CN109800339A (zh) * 2018-12-13 2019-05-24 平安普惠企业管理有限公司 正则表达式生成方法、装置、计算机设备及存储介质
CN110147445A (zh) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 基于文本分类的意图识别方法、装置、设备及存储介质
CN110909160A (zh) * 2019-10-11 2020-03-24 平安科技(深圳)有限公司 正则表达式生成方法、服务器及计算机可读存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224818B (zh) * 2015-11-10 2018-09-25 北京科技大学 一种作业程序自动评分方法及系统
CN108717461B (zh) * 2018-05-25 2021-03-26 平安科技(深圳)有限公司 海量数据结构化方法、装置、计算机设备及存储介质
CN109388700A (zh) * 2018-10-26 2019-02-26 广东小天才科技有限公司 一种意图识别方法及系统
CN109271492A (zh) * 2018-11-16 2019-01-25 广东小天才科技有限公司 一种语料正则表达式的自动生成方法及系统
CN109740159B (zh) * 2018-12-29 2022-04-26 北京泰迪熊移动科技有限公司 用于命名实体识别的处理方法及装置
CN109918676B (zh) * 2019-03-18 2023-06-27 广东小天才科技有限公司 一种检测意图正则表达式的方法及装置、终端设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217121A1 (en) * 2015-01-22 2016-07-28 Alibaba Group Holding Limited Generating regular expression
CN109800339A (zh) * 2018-12-13 2019-05-24 平安普惠企业管理有限公司 正则表达式生成方法、装置、计算机设备及存储介质
CN109783819A (zh) * 2019-01-18 2019-05-21 广东小天才科技有限公司 一种正则表达式的生成方法及系统
CN110147445A (zh) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 基于文本分类的意图识别方法、装置、设备及存储介质
CN110909160A (zh) * 2019-10-11 2020-03-24 平安科技(深圳)有限公司 正则表达式生成方法、服务器及计算机可读存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139883A (zh) * 2021-11-10 2022-03-04 云南电网有限责任公司信息中心 一种电力企业物资域评价指标的计算方法
CN114139883B (zh) * 2021-11-10 2024-03-29 云南电网有限责任公司信息中心 一种电力企业物资域评价指标的计算方法
CN114299930A (zh) * 2021-12-21 2022-04-08 广州虎牙科技有限公司 端到端语音识别模型处理方法、语音识别方法及相关装置
WO2023165122A1 (zh) * 2022-03-04 2023-09-07 康键信息技术(深圳)有限公司 问诊模板的匹配方法、装置、设备及存储介质
CN117114910A (zh) * 2023-09-22 2023-11-24 浙江河马管家网络科技有限公司 一种基于机器学习的票务自动入账系统及方法

Also Published As

Publication number Publication date
CN110909160A (zh) 2020-03-24

Similar Documents

Publication Publication Date Title
WO2021068683A1 (zh) 正则表达式生成方法、装置、服务器及计算机可读存储介质
CN107609121B (zh) 基于LDA和word2vec算法的新闻文本分类方法
RU2628431C1 (ru) Подбор параметров текстового классификатора на основе семантических признаков
RU2628436C1 (ru) Классификация текстов на естественном языке на основе семантических признаков
US9355171B2 (en) Clustering of near-duplicate documents
CN104199965B (zh) 一种语义信息检索方法
Viegas et al. Cluhtm-semantic hierarchical topic modeling based on cluwords
KR20180011254A (ko) 웹페이지 트레이닝 방법 및 기기, 그리고 검색 의도 식별 방법 및 기기
CN108647322B (zh) 基于词网识别大量Web文本信息相似度的方法
CN113962293B (zh) 一种基于LightGBM分类与表示学习的姓名消歧方法和系统
WO2023065642A1 (zh) 语料筛选方法、意图识别模型优化方法、设备及存储介质
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Lisena et al. TOMODAPI: A topic modeling API to train, use and compare topic models
CN115422371A (zh) 一种基于软件测试知识图谱的检索方法
CN114462392A (zh) 一种基于主题关联度与关键词联想的短文本特征扩展方法
CN115422372A (zh) 一种基于软件测试的知识图谱构建方法和系统
WO2015084757A1 (en) Systems and methods for processing data stored in a database
CN115248839A (zh) 一种基于知识体系的长文本检索方法以及装置
TW202022635A (zh) 自適應性調整關連搜尋詞的系統及其方法
WO2023246849A1 (zh) 回馈数据图谱生成方法及冰箱
CN116049376B (zh) 一种信创知识检索回复的方法、装置和系统
CN117235199A (zh) 一种基于文档树的信息智能匹配检索的方法
CN117057349A (zh) 新闻文本关键词抽取方法、装置、计算机设备和存储介质
CN116955534A (zh) 投诉工单智能处理方法、装置、设备及存储介质
WO2021227951A1 (zh) 前端页面元素的命名

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874233

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20874233

Country of ref document: EP

Kind code of ref document: A1