CN111639491B - Time data extraction method and device and electronic equipment - Google Patents


Info

Publication number
CN111639491B
Authority
CN
China
Prior art keywords
description
time
text
processed
time data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010418390.XA
Other languages
Chinese (zh)
Other versions
CN111639491A (en
Inventor
李盼盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fusionskye Beijing Software Co ltd
Original Assignee
Fusionskye Beijing Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fusionskye Beijing Software Co ltd filed Critical Fusionskye Beijing Software Co ltd
Priority to CN202010418390.XA priority Critical patent/CN111639491B/en
Publication of CN111639491A publication Critical patent/CN111639491A/en
Application granted granted Critical
Publication of CN111639491B publication Critical patent/CN111639491B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a time data extraction method, a time data extraction device and electronic equipment, and relates to the technical field of data processing, wherein the method comprises the steps of obtaining a text to be processed; decomposing the text to be processed to obtain a plurality of description tags; matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; and determining time data in the text to be processed according to the target description tag. According to the time data extraction method provided by the invention, after the text to be processed is decomposed to obtain the corresponding plurality of description tags, the plurality of description tags are matched with the preset time rule template, so that the time data contained in the text to be processed can be rapidly extracted, and the technical problem of low efficiency in the time data extraction method in the prior art is effectively solved.

Description

Time data extraction method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting time data, and an electronic device.
Background
At present, large systems record events, operations, and operation-and-maintenance and development information in logs. Once logs have been recorded, key information usually has to be extracted from them periodically for analysis, for example to judge from the time an event occurred whether the event is abnormal, or to search for the operations performed within a certain period. Time data is therefore key information in a log, and it provides strong data support for determining the nature of an event and for extracting other information.
In the prior art, time data in log information is usually located with regular expressions. However, because the logs of each system are defined by its own developers and follow no common specification, a corresponding regular expression has to be defined manually for each writing format of the time data and each log information format, which is time-consuming and error-prone.
In summary, the time data extraction method in the prior art has a technical problem of low efficiency.
Disclosure of Invention
The invention aims to provide a time data extraction method, a time data extraction device and electronic equipment, so as to solve the technical problem of low efficiency of the time data extraction method in the prior art.
In a first aspect, an embodiment of the present invention provides a method for extracting time data, including: acquiring a text to be processed; decomposing the text to be processed to obtain a plurality of description tags, wherein the description tags are used for representing at least one of the following: the character types of the continuous similar characters in the text to be processed, the character quantity of the continuous similar characters and the character codes of the continuous similar characters; matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; and determining time data in the text to be processed according to the target description tag.
In an alternative embodiment, decomposing the text to be processed to obtain a plurality of description tags, including: character-by-character recognition is carried out on the text to be processed, and the character type of each character in the text to be processed is determined; and sequentially determining a plurality of description tags of the text to be processed based on the character type of each character.
In an alternative embodiment, the method further comprises: acquiring a time sample library, wherein the time sample library comprises sample time data in various time formats; decomposing each sample time data to obtain a time tag sequence template corresponding to each sample time data; and determining the preset time rule template based on the time tag sequence template corresponding to each sample time data.
In an alternative embodiment, the preset time rule templates are stored in the form of a tree structure.
In an optional embodiment, matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result, includes: matching the description tags, in order, with the preset time rule template layer by layer; for each layer, if the matching succeeds, returning the successfully matched time tag and performing the matching of the next layer; if the matching fails, taking the description tag that follows the current starting description tag as a new starting description tag and matching it with the preset time rule template layer by layer; and if the plurality of description tags contains consecutive description tags that match a target time tag sequence template of the preset time rule template, taking those consecutive description tags as the target description tag, wherein the target time tag sequence template is any time tag sequence template in the preset time rule template.
In an alternative embodiment, before matching the plurality of description tags with a preset time rule template, the method further includes: screening the description tags according to a preset screening rule to obtain a plurality of screened description tags, wherein the preset screening rule comprises at least: a continuous description tag length rule and a character number rule.
In a second aspect, an embodiment of the present invention provides a time data extraction apparatus, including: the first acquisition module is used for acquiring a text to be processed; the first decomposition module is used for decomposing the text to be processed to obtain a plurality of description tags, wherein the description tags are used for representing at least one of the following: the character types of the continuous similar characters in the text to be processed, the character quantity of the continuous similar characters and the character codes of the continuous similar characters; the first determining module is used for matching the plurality of description tags with a preset time rule template and determining a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; and the second determining module is used for determining time data in the text to be processed according to the target description tag.
In an alternative embodiment, the first decomposition module includes: the recognition unit is used for recognizing the text to be processed character by character and determining the character type of each character in the text to be processed; and the determining unit is used for sequentially determining a plurality of description tags of the text to be processed based on the character type of each character.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, and a processor, where the memory stores a computer program executable on the processor, and where the processor implements the steps of the method in any of the foregoing embodiments when the processor executes the computer program.
In a fourth aspect, embodiments of the present invention provide a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any of the preceding embodiments.
The time data extraction method provided by the invention comprises the following steps: acquiring a text to be processed; decomposing the text to be processed to obtain a plurality of description tags, wherein the description tags are used for representing at least one of the following: character types of continuous similar characters in the text to be processed, character numbers of the continuous similar characters and character codes of the continuous similar characters; matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; and determining time data in the text to be processed according to the target description tag.
Compared with the prior art, the time data extraction method provided by the invention has the advantages that after the text to be processed is decomposed to obtain a plurality of corresponding description labels, the description labels are matched with the preset time rule template, so that the time data contained in the text to be processed can be extracted rapidly, and the technical problem of low efficiency of the time data extraction method in the prior art is effectively solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a time data extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a writing format of sample time data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of character-by-character recognition of sample time data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of determining the time tag sequence template corresponding to sample time data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a tree structure of a linked list according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of layer-by-layer matching of a plurality of description tags with a preset time rule template according to an embodiment of the present invention;
FIG. 7 is a functional block diagram of a time data extraction device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
A system log can record information about the events, operations, and operation-and-maintenance and development activities of different service scenarios, so a great deal of useful information can be extracted from it. Time data is key data among this information and provides strong data support for determining the nature of an event and for extracting other information. However, because system logs are defined by each system's own developers and follow no unified specification, extracting time data from a system log generally requires manually defining a regular expression for each writing format of the time data and each log information format. This approach cannot obtain useful information quickly and conveniently, is inflexible, is time-consuming and error-prone, and cannot directly store the time data in the standard time format required by the user. In view of the above, the embodiments of the present invention provide a time data extraction method for alleviating the above-mentioned technical problems.
Example 1
Fig. 1 shows a flowchart of a time data extraction method according to an embodiment of the present invention. As shown in fig. 1, the method specifically includes the following steps:
Step S12, obtaining a text to be processed.
Specifically, to extract time data, the text to be processed is obtained first. The embodiment of the invention does not limit how the text to be processed is obtained; it can be imported from an external storage device or read directly from a local storage path.
Step S14, decomposing the text to be processed to obtain a plurality of description tags.
After the text to be processed is obtained, the text to be processed is firstly decomposed, so that a plurality of description tags of the text to be processed are obtained, wherein the description tags are used for representing at least one of the following: character types of continuous similar characters in the text to be processed, character numbers of the continuous similar characters and character codes of the continuous similar characters.
For example, if the text to be processed is "today is May 5 2020", the plurality of description tags should represent, in order: 5 consecutive letters, 1 space, 2 consecutive letters, 1 space, 3 consecutive letters, 1 space, 1 digit, 1 space, and 4 consecutive digits. In the embodiment of the present invention, a description tag for letters or digits may be represented by its character type and the number of consecutive characters, and a description tag for a symbol may be represented by its character type and its character code, where the character code may be the original value of the symbol or its ASCII (American Standard Code for Information Interchange) code value. For example, 5 consecutive letters may be represented as Letter_5; 4 consecutive digits may be represented as Digit_4; and 1 space, using its ASCII code value, may be represented as Symbols_ox20.
Step S16, matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result.
Step S18, determining time data in the text to be processed according to the target description tag.
After the plurality of description tags corresponding to the text to be processed are obtained, a preset time rule template is used to identify the time data contained in the text. The preset time rule template comprises a plurality of time tag sequence templates, and each time tag sequence template corresponds to one time writing format. Matching the plurality of description tags against the time tag sequence templates in the preset time rule template determines the target description tag, that is, the description tag corresponding to the time data in the text to be processed. Extracting the text corresponding to the target description tag quickly yields the time data in the text to be processed, and the user can then store the time data in any required standard time format.
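Steps S12 to S18 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`char_type`, `decompose`, `build_tree`, `extract_time`) are hypothetical, every symbol character is given its own tag, the tag spelling (for example `Symbols_ox20`) follows the text, and when several templates could match, the longest match at the earliest starting tag is returned.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple

END = "END"  # leaf marker, as in the text


def char_type(ch: str) -> str:
    """Classify one character as Letter, Digit, or Symbols (everything else)."""
    if ch.isascii() and ch.isalpha():
        return "Letter"
    if ch.isascii() and ch.isdigit():
        return "Digit"
    return "Symbols"


def decompose(text: str) -> List[Tuple[str, int, int]]:
    """Return (tag, start, end) triples so tags can be mapped back to the text."""
    out: List[Tuple[str, int, int]] = []
    pos = 0
    for kind, run in groupby(text, key=char_type):
        n = len(list(run))
        if kind == "Symbols":
            # Assumption: one tag per symbol character, named by its code value.
            for k in range(pos, pos + n):
                out.append((f"Symbols_ox{ord(text[k]):02X}", k, k + 1))
        elif kind == "Digit" and n <= 2:
            out.append(("Digit_1_or_2", pos, pos + n))
        else:
            out.append((f"{kind}_{n}", pos, pos + n))
        pos += n
    return out


def build_tree(samples: List[str]) -> Dict:
    """Merge the tag sequence of every sample into one nested-dict tree."""
    root: Dict = {}
    for sample in samples:
        node = root
        for tag, _, _ in decompose(sample):
            node = node.setdefault(tag, {})
        node[END] = sample  # keep the original sample on the END node
    return root


def extract_time(text: str, tree: Dict) -> Optional[str]:
    """Slide the starting tag forward until some template matches down to END."""
    tags = decompose(text)
    for start in range(len(tags)):
        node, last = tree, None
        for i in range(start, len(tags)):
            if tags[i][0] not in node:
                break
            node = node[tags[i][0]]
            if END in node:
                last = i
        if last is not None:
            return text[tags[start][1]:tags[last][2]]
    return None


tree = build_tree(["Jan 2 2006 3:04:05PM CST"])
print(extract_time("The server restarted at Mar 4 2020 8:00:00AM CST", tree))
print(extract_time("Equipment will be started on Friday", tree))
```

Keeping character offsets next to each tag is a design choice of this sketch; it lets the matched tags be turned back into the original substring without re-scanning the text.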
The execution process of the time data extraction method provided by the embodiment of the invention is briefly described above, and the process of decomposing the text to be processed and obtaining a plurality of description tags is described in detail below.
In an optional embodiment, step S14 of decomposing the text to be processed to obtain a plurality of description tags specifically includes the following steps:
Step S141, performing character-by-character recognition on the text to be processed, and determining the character type of each character in the text to be processed.
Specifically, decomposing the text to be processed starts with character-by-character recognition. For example, if the text to be processed is "today is May 5 2020", then, taking its first characters "today is" as an example, character-by-character recognition identifies the 1st to 5th characters as letters (Letter), the 6th character as a symbol (Symbols), and the 7th and 8th characters as letters (Letter).
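The character-by-character recognition of step S141 can be illustrated with a short sketch. The function name `char_type` and the choice to treat every character that is neither an ASCII letter nor an ASCII digit as a symbol are assumptions made for illustration.

```python
def char_type(ch: str) -> str:
    """Classify one character as Letter, Digit, or Symbols (everything else)."""
    if ch.isascii() and ch.isalpha():
        return "Letter"
    if ch.isascii() and ch.isdigit():
        return "Digit"
    return "Symbols"


# First characters of the example text from step S141.
types = [char_type(ch) for ch in "today is"]
print(types)
```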
Step S142, sequentially determining a plurality of description tags of the text to be processed based on the character type of each character.
After the character type of each character in the text to be processed is determined, each run of consecutive characters of the same type generates one description tag, so that the plurality of description tags of the text to be processed are determined. Continuing the example in step S141, the plurality of description tags corresponding to the text to be processed "today is May 5 2020" can be expressed in turn as: Letter_5, Symbols_ox20, Letter_2, Symbols_ox20, Letter_3, Symbols_ox20, Digit_1, Symbols_ox20, Digit_4.
In addition, because of the particularity of time formats, in which fields such as month, day, minute and second may be written with either one digit or two digits, Digit_1 and Digit_2 in the description tags can be merged into a single tag, Digit_1_or_2.
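A minimal sketch of step S142, including the Digit_1_or_2 merge, is given below. The names `char_type` and `decompose` are hypothetical, and the sketch assumes that each symbol character produces its own tag named by its hexadecimal code value, written with the `ox` spelling used in the text.

```python
from itertools import groupby
from typing import List


def char_type(ch: str) -> str:
    """Classify one character as Letter, Digit, or Symbols (everything else)."""
    if ch.isascii() and ch.isalpha():
        return "Letter"
    if ch.isascii() and ch.isdigit():
        return "Digit"
    return "Symbols"


def decompose(text: str) -> List[str]:
    """Turn runs of same-type characters into description tags."""
    tags: List[str] = []
    for kind, run in groupby(text, key=char_type):
        chars = list(run)
        if kind == "Symbols":
            # Assumption: one tag per symbol character, named by its code value.
            tags.extend(f"Symbols_ox{ord(ch):02X}" for ch in chars)
        elif kind == "Digit" and len(chars) <= 2:
            # One-digit and two-digit runs share a single tag.
            tags.append("Digit_1_or_2")
        else:
            tags.append(f"{kind}_{len(chars)}")
    return tags


print(decompose("today is May 5 2020"))
```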
The process of decomposing the text to be processed is described in detail above, and the process of obtaining the preset time rule template is described in detail below.
In an alternative embodiment, the method of the present invention further comprises the steps of:
Step S21, obtaining a time sample library.
Specifically, in order to accurately identify the time data in the text to be processed, the possible writing formats of time data need to be determined in advance, so a time sample library is obtained first. The time sample library includes sample time data in a plurality of time formats; some writing formats of the sample time data are shown in fig. 2. The time sample library may be provided in various forms, such as a database, a configuration file, or the command line, and the embodiment of the invention does not limit its specific form. If the time sample library exists in the form of a database, the corresponding database is loaded to obtain the time sample library; if it exists in the form of a configuration file, the configuration file is read to obtain the time sample library; if time data is extracted by running a program, sample time data can be passed as parameters when the program is executed, and those parameters serve as the time sample library.
Step S22, decomposing each sample time data to obtain a time tag sequence template corresponding to each sample time data.
After the time sample library is obtained, each sample time data in it is decomposed in the manner of steps S141 to S142. For example, if a sample time data is "Jan 2 2006 3:04:05PM CST", then, taking its first characters "Jan 2" as an example, character-by-character recognition, as shown in fig. 3, identifies the 1st, 2nd and 3rd characters as letters (Letter), the 4th character as a symbol (Symbols), and the 5th character as a digit (Digit).
After the character type of each character in the sample time data is determined, each run of consecutive characters of the same type generates one description tag, so that the plurality of description tags of the sample time data are determined. As shown in fig. 4, the plurality of description tags corresponding to the sample time data "Jan 2 2006 3:04:05PM CST" may be denoted, in order: Letter_3, Symbols_ox20, Digit_1_or_2, Symbols_ox20, Digit_4, Symbols_ox20, Digit_1_or_2, Symbols_ox3A, Digit_1_or_2, Symbols_ox3A, Digit_1_or_2, Symbols_ox20, Letter_2, Symbols_ox20, Letter_3, END, where END denotes the leaf node (end node). The chain description in fig. 4 is the time tag sequence template obtained by combining, in text order, the plurality of description tags corresponding to the sample time data in the above example.
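The configuration-file and command-line forms of the time sample library could be loaded roughly as sketched below. The file layout (one sample per line, `#` for comments), the function name `load_time_samples`, and its parameters are all assumptions; the database form is omitted.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def load_time_samples(config_path: Optional[str] = None,
                      cli_samples: Iterable[str] = ()) -> List[str]:
    """Collect sample time data from an optional file and from parameters."""
    samples: List[str] = []
    if config_path is not None:
        # Assumed layout: one sample per line, blank lines and '#' lines ignored.
        for line in Path(config_path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                samples.append(line)
    # Samples passed when the program is executed, e.g. taken from sys.argv[1:].
    samples.extend(s for s in cli_samples if s)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(samples))


library = load_time_samples(cli_samples=["Jan 2 2006 3:04:05PM CST",
                                         "2006-01-02 15:04:05",
                                         "2006-01-02 15:04:05"])
print(library)
```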
Similarly, a time tag sequence template corresponding to each sample time data in the time sample library can be obtained.
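Steps S21 and S22 amount to running the same decomposition over every sample and appending an END marker. The sketch below illustrates this; the sample formats "2006-01-02 15:04:05" and "2006/01/02" are hypothetical examples rather than entries from fig. 2, and the function names and the one-tag-per-symbol rule are assumptions.

```python
from itertools import groupby
from typing import Dict, List


def char_type(ch: str) -> str:
    """Classify one character as Letter, Digit, or Symbols (everything else)."""
    if ch.isascii() and ch.isalpha():
        return "Letter"
    if ch.isascii() and ch.isdigit():
        return "Digit"
    return "Symbols"


def decompose(text: str) -> List[str]:
    """Turn runs of same-type characters into description tags."""
    tags: List[str] = []
    for kind, run in groupby(text, key=char_type):
        chars = list(run)
        if kind == "Symbols":
            tags.extend(f"Symbols_ox{ord(ch):02X}" for ch in chars)
        elif kind == "Digit" and len(chars) <= 2:
            tags.append("Digit_1_or_2")
        else:
            tags.append(f"{kind}_{len(chars)}")
    return tags


def build_templates(samples: List[str]) -> Dict[str, List[str]]:
    """Map each sample to its chain-style time tag sequence template."""
    return {sample: decompose(sample) + ["END"] for sample in samples}


templates = build_templates(["2006-01-02 15:04:05", "2006/01/02"])
for sample, template in templates.items():
    print(sample, "->", ", ".join(template))
```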
Step S23, a preset time rule template is determined based on the time tag sequence template corresponding to each sample time data.
After the time tag sequence template corresponding to each sample time data is obtained, the preset time rule template may store the plurality of time tag sequence templates individually, that is, store a plurality of the chain structures (chain descriptions) of fig. 4. Alternatively, the preset time rule template may be stored in the form of a tree structure, that is, the plurality of chain structures (the plurality of time tag sequence templates) are merged to obtain a linked list structure tree, as shown in fig. 5. Taking the merging of two time tag sequence templates as an example, the two templates are compared layer by layer in order during merging. If the time tags of the same layer are the same, they are merged into one time tag and the comparison continues with the next layer. If the time tags of the same layer are different, the two time tags are each mounted on the structure of the layer above, and the time tags of the subsequent layers of each template are mounted under its own branch. A user may also choose to store the original sample time data on each END tag.
In addition, the generation of the linked list structure tree described above first processes each sample time data into an independent time tag sequence template and then merges the templates. Alternatively, a user may first process one sample time data to obtain a first time tag sequence template (the current linked list structure tree), and then take another sample time data at random and match it with the current linked list structure tree. If it matches, the next sample time data is taken from the time sample library and matched with the current linked list structure tree; if it does not match, a new time tag sequence template is added to the current linked list structure tree to obtain an updated linked list structure tree. Proceeding in this way yields the linked list structure tree corresponding to the time sample library; the process and its effect are equivalent to growing new branches on the original tree structure.
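The merging of chain templates into a tree behaves like building a prefix tree: identical tags at the same layer share a node, and the first differing layer opens a new branch. The sketch below uses nested dictionaries as the tree; this representation, the function name `merge_templates`, and the two short example templates are assumptions made for illustration.

```python
from typing import Dict, List

END = "END"  # leaf marker, as in the text


def merge_templates(templates: List[List[str]]) -> Dict:
    """Merge chain templates into one tree; shared prefixes share nodes."""
    root: Dict = {}
    for template in templates:
        node = root
        for tag in template:
            # Reuse the node if this layer already has the tag, else branch.
            node = node.setdefault(tag, {})
        node[END] = {}
    return root


# Two hypothetical templates that agree on the first three layers.
t1 = ["Letter_3", "Symbols_ox20", "Digit_1_or_2", "Symbols_ox20", "Digit_4"]
t2 = ["Letter_3", "Symbols_ox20", "Digit_1_or_2", "Symbols_ox2C", "Digit_4"]
tree = merge_templates([t1, t2])

branch_point = tree["Letter_3"]["Symbols_ox20"]["Digit_1_or_2"]
print(sorted(branch_point))
```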
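The incremental variant, in which each new sample is first matched against the current tree and only added when it does not match, could look like the following sketch. The nested-dictionary tree and the names `contains` and `add_if_new` are hypothetical.

```python
from typing import Dict, List

END = "END"  # leaf marker, as in the text


def contains(tree: Dict, template: List[str]) -> bool:
    """True if the template already exists in the tree down to an END node."""
    node = tree
    for tag in template:
        if tag not in node:
            return False
        node = node[tag]
    return END in node


def add_if_new(tree: Dict, template: List[str]) -> bool:
    """Add the template as a new branch unless the tree already matches it."""
    if contains(tree, template):
        return False
    node = tree
    for tag in template:
        node = node.setdefault(tag, {})
    node[END] = {}
    return True


tree: Dict = {}
a = ["Digit_4", "Symbols_ox2D", "Digit_1_or_2"]
b = ["Digit_4", "Symbols_ox2F", "Digit_1_or_2"]
results = [add_if_new(tree, a), add_if_new(tree, a), add_if_new(tree, b)]
print(results, sorted(tree["Digit_4"]))
```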
The preset time rule template provided by the embodiment of the invention supports the functions of adding, deleting and modifying, if the sample time data in the new writing format is found, the sample time data can be processed into a time tag sequence template after being added into a time sample library, and then the preset time rule template is updated.
The process of how to obtain the preset time rule template is described in detail above, and the process of how to determine the target description tag containing time data from the description tags of the text to be processed is specifically described below.
In an optional embodiment, step S16 of matching the plurality of description tags with a preset time rule template, and determining, based on a matching result, a target description tag containing time data from the plurality of description tags, includes the following:
The description tags are matched, in order, with the preset time rule template layer by layer. For each layer, if the matching succeeds, the successfully matched time tag is returned and the matching of the next layer is performed; if the matching fails, the description tag that follows the current starting description tag is taken as a new starting description tag and matched with the preset time rule template layer by layer. If the plurality of description tags contains consecutive description tags that match a target time tag sequence template of the preset time rule template, those consecutive description tags are taken as the target description tag, wherein the target time tag sequence template is any time tag sequence template in the preset time rule template.
Specifically, referring to fig. 6, suppose the description tags of the text to be processed are combined, in text order, into a description tag sequence. Layer-by-layer matching can then be understood as follows: for each layer, if the matching succeeds, the successfully matched time tag is returned and the matching of the next layer is performed; if the matching fails, the first description tag of the description tag sequence (the current starting description tag) is removed to obtain an updated description tag sequence, and the updated description tag sequence is matched with the preset time rule template layer by layer.
Fig. 6 provides two examples of text to be processed, sample A and sample B. Sample A, "Equipment will be started on Friday", is matched with the preset time rule template layer by layer. Because none of its description tags matches the layer-1 time tag Letter_3, each description tag of sample A (each block in fig. 6) is in turn compared only with the layer-1 time tag of the preset time rule template, and in the end no target description tag containing time data can be determined from the description tags of sample A.
Sample B, "The server restarted at Mar 4 2020 8:00:00AM CST", is matched with the preset time rule template layer by layer as follows. The 1st description tag of sample B (block 1, Letter_3) successfully matches the layer-1 time tag (Letter_3) of the preset time rule template, and block 2 (Symbols_ox20) successfully matches the layer-2 time tag (Symbols_ox20). When block 3 (Letter_6) is matched against the layer-3 time tag (Digit_1_or_2), the matching fails, so the description tag following the current starting description tag (block 1, Letter_3), namely block 2 (Symbols_ox20), is taken as the new starting description tag and matched with the preset time rule template layer by layer. Blocks 2 to 8 cannot be matched with the layer-1 time tag Letter_3, and the matching process continues. Blocks 9 to 15 are then successfully matched with the time tags of layers 1 to 7. When block 16 is matched against layer 8, it is compared with each of the two time tags of layer 8 and successfully matches the time tag of the right branch; blocks 17 to 23 are then successfully matched with the time tags of layers 9 to 15 of the right branch, and matching reaches the leaf node (end node) END. The target description tag is thereby obtained: Letter_3, Symbols_ox20, Digit_1_or_2, Symbols_ox20, Digit_4, Symbols_ox20, Digit_1_or_2, Symbols_ox3A, Digit_1_or_2, Symbols_ox3A, Digit_1_or_2, Symbols_ox20, Letter_2, Symbols_ox20, Letter_3, and the time data in sample B is determined from the target description tag to be: Mar 4 2020 8:00:00AM CST.
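The layer-by-layer matching with a sliding starting tag can be sketched as below. For brevity the tree holds a single short hypothetical template (a date such as "Mar 4 2020") instead of the full template of fig. 6, and the tag lists are written out by hand for shortened versions of samples A and B. The names `match_from` and `find_target_tags`, the nested-dictionary tree, and the preference for the longest match are assumptions.

```python
from typing import Dict, List, Optional, Tuple

END = "END"  # leaf marker, as in the text

# A tree holding one hypothetical template: Letter_3, space, 1-2 digits, space, 4 digits.
tree: Dict = {"Letter_3": {"Symbols_ox20": {"Digit_1_or_2": {
    "Symbols_ox20": {"Digit_4": {END: {}}}}}}}


def match_from(tags: List[str], start: int, tree: Dict) -> Optional[int]:
    """Walk the tree one layer per tag; return the end index of the longest match."""
    node, best = tree, None
    for i in range(start, len(tags)):
        if tags[i] not in node:
            break  # this layer failed
        node = node[tags[i]]
        if END in node:
            best = i + 1
    return best


def find_target_tags(tags: List[str], tree: Dict) -> Optional[Tuple[int, int]]:
    """On failure, drop the current starting tag and retry from the next one."""
    for start in range(len(tags)):
        end = match_from(tags, start, tree)
        if end is not None:
            return start, end
    return None


SP = "Symbols_ox20"
# Tags of "Equipment will be started on Friday" (sample A).
sample_a = ["Letter_9", SP, "Letter_4", SP, "Letter_2", SP,
            "Letter_7", SP, "Letter_2", SP, "Letter_6"]
# Tags of "The server restarted at Mar 4 2020" (sample B, shortened).
sample_b = ["Letter_3", SP, "Letter_6", SP, "Letter_9", SP, "Letter_2", SP,
            "Letter_3", SP, "Digit_1_or_2", SP, "Digit_4"]

print(find_target_tags(sample_a, tree))
print(find_target_tags(sample_b, tree))
```

In the shortened sample B, the first Letter_3 ("The") passes layers 1 and 2 and fails at layer 3, mirroring the failure at block 3 described above, before the match starting at "Mar" succeeds.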
In an alternative embodiment, before matching the plurality of description tags with the preset time rule template, the method of the present invention further comprises the steps of:
Step S15, screening the plurality of description tags according to a preset screening rule to obtain a plurality of screened description tags, wherein the preset screening rule at least comprises: a continuous description tag length rule and a character number rule.
Specifically, after the text to be processed has been decomposed into a plurality of description tags and before the description tags are matched with the preset time rule template, content that obviously does not belong to time data, and content whose continuous length does not satisfy the shortest time tag sequence template, can be removed; that is, the description tags are screened according to the preset screening rule. The continuous description tag length rule can be understood as screening out text to be processed for which too few description tags were determined, and the character number rule can be understood as screening out runs of characters whose continuous length is too long. Further screening rules may be added according to the actual requirements of the user and the actual condition of the text to be processed; the embodiment of the invention does not limit this. The benefit of adding the screening step is that it reduces the time spent on layer-by-layer matching.
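One possible reading of the two screening rules is sketched below. The function name `screen`, its parameters, and the thresholds are hypothetical, and the way an over-long run splits the tag list into separate candidate segments is an interpretation rather than something the text specifies.

```python
def screen(tags, min_template_len, max_run_chars):
    """Return candidate segments of (description_tag, covered_text) pairs.

    Character number rule: a tag covering more than max_run_chars characters
    is dropped and splits the tag list at that point.
    Continuous description tag length rule: a segment shorter than the
    shortest time tag sequence template (min_template_len) is dropped.
    """
    segments, current = [], []
    for tag, piece in tags:
        if len(piece) > max_run_chars:
            if len(current) >= min_template_len:
                segments.append(current)
            current = []
        else:
            current.append((tag, piece))
    if len(current) >= min_template_len:
        segments.append(current)
    return segments


# Description tags for "RESTARTED AT MAR 4 2020", written out by hand.
tags = [
    ("Letter_9", "RESTARTED"), ("Symbols_ox20", " "), ("Letter_2", "AT"),
    ("Symbols_ox20", " "), ("Letter_3", "MAR"), ("Symbols_ox20", " "),
    ("Digit_1_or_2", "4"), ("Symbols_ox20", " "), ("Digit_4", "2020"),
]
# Hypothetical thresholds: shortest template has 5 tags, no run longer than 4 characters.
candidates = screen(tags, min_template_len=5, max_run_chars=4)
print(len(candidates), len(candidates[0]))  # 1 8
```

Only the surviving segments would then be passed to layer-by-layer matching, which is where the time saving described above would come from.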
In summary, the time data extraction method provided by the embodiment of the invention determines a plurality of time tag sequence templates through deep mining and feature extraction of sample time data. When a new time writing format is encountered, the preset time rule template can be expanded by adding sample data, so that a plurality of time formats are supported at the same time. Therefore, when time data is extracted from a text to be processed, the adaptation complexity is reduced, the matching time is shortened, and the time data extraction efficiency is effectively improved.
In addition, the method can also be applied to extraction of other non-time data, such as identification card number extraction, IP address extraction and the like, and extraction of data in a corresponding format can be realized only by setting corresponding rule templates (a plurality of tag sequence templates).
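As an illustration of that extension, the sketch below applies the same description-tag idea to IPv4 addresses. Everything here is an assumption made for the example: the helper names, the tag naming copied from the time examples, and the use of a plain set lookup in place of the tree-structured rule template. Because tags describe only the shape of the text, a string such as `999.1.1.1` would also match; range checks would need a separate step.

```python
import re
from itertools import product

TOKEN = re.compile(r"(?P<Letter>[A-Za-z]+)|(?P<Digit>[0-9]+)|(?P<Symbols>.)", re.DOTALL)


def decompose(text):
    """Return (description_tag, covered_text) pairs for the text."""
    pairs = []
    for m in TOKEN.finditer(text):
        kind, piece = m.lastgroup, m.group()
        if kind == "Symbols":
            tag = f"Symbols_ox{ord(piece):02X}"
        elif kind == "Digit" and len(piece) <= 2:
            tag = "Digit_1_or_2"
        else:
            tag = f"{kind}_{len(piece)}"
        pairs.append((tag, piece))
    return pairs


def ipv4_templates():
    """All tag sequences for four dot-separated octets of 1 to 3 digits."""
    dot = "Symbols_ox2E"
    templates = []
    for octets in product(["Digit_1_or_2", "Digit_3"], repeat=4):
        seq = [octets[0]]
        for octet in octets[1:]:
            seq += [dot, octet]
        templates.append(tuple(seq))
    return templates


def find_all(text, templates):
    """Return every substring whose tag sequence equals one of the templates."""
    pairs = decompose(text)
    tags = [tag for tag, _ in pairs]
    wanted = set(templates)
    lengths = sorted({len(t) for t in templates}, reverse=True)
    hits, start = [], 0
    while start < len(tags):
        for n in lengths:
            if tuple(tags[start:start + n]) in wanted:
                hits.append("".join(piece for _, piece in pairs[start:start + n]))
                start += n
                break
        else:
            start += 1  # no template starts here; try the next description tag
    return hits


print(find_all("ping 10.0.0.12 from 192.168.1.1 ok", ipv4_templates()))
# ['10.0.0.12', '192.168.1.1']
```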
Example two
The embodiment of the invention also provides a time data extraction device which is mainly used for executing the time data extraction method provided by the first embodiment, and the time data extraction device provided by the embodiment of the invention is specifically described below.
Fig. 7 is a functional block diagram of a time data extraction device according to an embodiment of the present invention, as shown in fig. 7, where the device mainly includes: a first acquisition module 10, a first decomposition module 20, a first determination module 30, a second determination module 40, wherein:
A first acquisition module 10, configured to acquire text to be processed.
The first decomposing module 20 is configured to decompose the text to be processed to obtain a plurality of description tags, where the description tags are used to characterize at least one of the following: character types of continuous similar characters in the text to be processed, character numbers of the continuous similar characters and character codes of the continuous similar characters.
The first determining module 30 is configured to match the plurality of description tags with a preset time rule template, and determine a target description tag including time data from the plurality of description tags based on a matching result, where the preset time rule template includes a plurality of time tag sequence templates.
A second determining module 40, configured to determine time data in the text to be processed according to the target description tag.
The time data extraction device provided by the embodiment of the invention comprises: a first acquisition module 10, configured to acquire a text to be processed; a first decomposing module 20, configured to decompose the text to be processed to obtain a plurality of description tags, where the description tags are used to characterize at least one of the following: character types of continuous similar characters in the text to be processed, character numbers of the continuous similar characters and character codes of the continuous similar characters; a first determining module 30, configured to match the plurality of description tags with a preset time rule template, and determine a target description tag containing time data from the plurality of description tags based on a matching result, where the preset time rule template includes a plurality of time tag sequence templates; and a second determining module 40, configured to determine time data in the text to be processed according to the target description tag.
Compared with the prior art, the time data extraction device provided by the invention decomposes the text to be processed to obtain a plurality of corresponding description labels, and then matches the description labels with a preset time rule template, so that the time data contained in the text to be processed can be extracted rapidly, and the technical problem of low efficiency of the time data extraction method in the prior art is effectively solved.
Optionally, the first decomposition module 20 includes:
A recognition unit, configured to recognize the characters of the text to be processed and determine the character type of each character in the text to be processed.
A determining unit, configured to sequentially determine a plurality of description tags of the text to be processed based on the character type of each character.
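The two units can be sketched as two small Python functions, one that classifies a single character and one that turns runs of same-type characters into description tags. The tag names follow the examples in the text (Letter_3, Digit_1_or_2, Symbols_ox20); the rules that digit runs of length 1 or 2 share one tag and that every symbol character gets its own tag are inferred from those examples and should be read as assumptions.

```python
def char_type(ch: str) -> str:
    """Recognition step: classify one character (ASCII letters and digits only)."""
    if ch.isascii() and ch.isalpha():
        return "Letter"
    if ch.isascii() and ch.isdigit():
        return "Digit"
    return "Symbols"


def decompose(text: str) -> list[tuple[str, str]]:
    """Determining step: emit (description_tag, covered_text) pairs in order."""
    tags = []
    i = 0
    while i < len(text):
        kind = char_type(text[i])
        if kind == "Symbols":
            # Assumption: the tag carries the character code of the symbol in hex.
            tags.append((f"Symbols_ox{ord(text[i]):02X}", text[i]))
            i += 1
            continue
        j = i
        while j < len(text) and char_type(text[j]) == kind:
            j += 1  # extend the run of same-type characters
        count = j - i
        # Assumption: digit runs of length 1 or 2 are merged into Digit_1_or_2.
        size = "1_or_2" if kind == "Digit" and count <= 2 else str(count)
        tags.append((f"{kind}_{size}", text[i:j]))
        i = j
    return tags


for tag, piece in decompose("MAR 4 2020 8:00"):
    print(f"{tag:14} {piece!r}")
```

Keeping the covered text next to each tag makes it straightforward to recover the time data once the target description tags are known.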
Optionally, the apparatus further comprises:
A second acquisition module, configured to acquire a time sample library, wherein the time sample library contains sample time data in various time formats.
A second decomposition module, configured to decompose each sample time data to obtain a time tag sequence template corresponding to each sample time data.
A third determining module, configured to determine the preset time rule template based on the time tag sequence template corresponding to each sample time data.
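A minimal sketch of how these three modules could cooperate is shown below: each sample in a small, made-up time sample library is decomposed into a time tag sequence template, and the templates are merged into a tree so that shared prefixes occupy the same layers. The function names, the nested-dictionary representation, and the sample strings are assumptions for illustration.

```python
import re

TOKEN = re.compile(r"(?P<Letter>[A-Za-z]+)|(?P<Digit>[0-9]+)|(?P<Symbols>.)", re.DOTALL)
END = "END"  # leaf marker: a complete time tag sequence template ends here


def tag_sequence(sample):
    """Decompose one sample time string into its time tag sequence template."""
    tags = []
    for m in TOKEN.finditer(sample):
        kind, piece = m.lastgroup, m.group()
        if kind == "Symbols":
            tags.append(f"Symbols_ox{ord(piece):02X}")
        elif kind == "Digit" and len(piece) <= 2:
            tags.append("Digit_1_or_2")
        else:
            tags.append(f"{kind}_{len(piece)}")
    return tags


def build_rule_template(sample_library):
    """Merge all time tag sequence templates into one tree of nested dicts."""
    root = {}
    for sample in sample_library:
        node = root
        for tag in tag_sequence(sample):
            node = node.setdefault(tag, {})  # reuse the layer if the prefix is shared
        node[END] = {}
    return root


# Hypothetical time sample library with three formats.
library = ["MAR 4 2020 8:00:00 AM CST", "2020-03-04 08:00:00", "2020-03-04"]
tree = build_rule_template(library)
print(sorted(tree))  # ['Digit_4', 'Letter_3']
```

Supporting a new time format then amounts to appending one more sample string to `library` and rebuilding the tree.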
Optionally, the preset time rule template is stored in the form of a tree structure.
Optionally, the first determining module 30 is specifically configured to:
Sequentially match each description tag with the preset time rule template layer by layer.
For each layer, if the matching succeeds, return the successfully matched time tag and proceed to matching the next layer; if the matching fails, take the description tag following the current start description tag as a new start description tag and match it with the preset time rule template layer by layer.
If the plurality of description tags contain continuous description tags that match a target time tag sequence template of the preset time rule template, take the continuous description tags as the target description tags, wherein the target time tag sequence template is any time tag sequence template in the preset time rule template.
Optionally, the apparatus further comprises:
A screening module, configured to screen the plurality of description tags according to a preset screening rule to obtain a plurality of screened description tags, wherein the preset screening rule at least comprises: a continuous description tag length rule and a character number rule.
Example three
Referring to fig. 8, an embodiment of the present invention provides an electronic device including: a processor 60, a memory 61, a bus 62 and a communication interface 63, the processor 60, the communication interface 63 and the memory 61 being connected by the bus 62; the processor 60 is arranged to execute executable modules, such as computer programs, stored in the memory 61.
The memory 61 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is achieved via at least one communication interface 63 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, etc.
Bus 62 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 8, but this does not mean that there is only one bus or one type of bus.
The memory 61 is configured to store a program, and the processor 60 executes the program after receiving an execution instruction. The method executed by the apparatus disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 60 or implemented by the processor 60.
The processor 60 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 60. The processor 60 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. It may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 61, and the processor 60 reads the information in the memory 61 and, in combination with its hardware, performs the steps of the method described above.
An embodiment of the invention further provides a computer program product for the time data extraction method, the time data extraction device and the electronic device, which includes a computer readable storage medium storing non-volatile program code executable by a processor, where the program code includes instructions for executing the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be noted that, directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., are directions or positional relationships based on those shown in the drawings, or are directions or positional relationships conventionally put in use of the inventive product, are merely for convenience of describing the present invention and simplifying the description, and are not indicative or implying that the apparatus or element to be referred to must have a specific direction, be constructed and operated in a specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Furthermore, the terms "horizontal," "vertical," "overhang," and the like do not require that the component be absolutely horizontal or overhanging; it may be slightly inclined. For example, "horizontal" merely means that the direction is closer to horizontal than "vertical" is; it does not mean that the structure must be perfectly horizontal, and it may be slightly inclined.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "disposed," "mounted," "coupled," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected or indirectly connected through an intermediate medium; or a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (8)

1. A method for extracting time data, comprising:
acquiring a text to be processed;
decomposing the text to be processed to obtain a plurality of description tags, wherein the description tags are used for representing at least one of the following: the character types of the continuous similar characters in the text to be processed, the character quantity of the continuous similar characters and the character codes of the continuous similar characters;
matching the plurality of description tags with a preset time rule template, and determining a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; if a description tag is the same as a time tag of the preset time rule template, the matching result is determined to be successful;
determining time data in the text to be processed according to the target description tag;
wherein decomposing the text to be processed to obtain a plurality of description tags comprises:
performing character-by-character recognition on the text to be processed, and determining the character type of each character in the text to be processed;
generating one description tag from each run of continuous characters of the same type, so as to determine the plurality of description tags of the text to be processed.
2. The method according to claim 1, wherein the method further comprises:
acquiring a time sample library, wherein the time sample library comprises sample time data in various time formats;
decomposing each sample time data to obtain a time tag sequence template corresponding to each sample time data;
determining the preset time rule template based on the time tag sequence template corresponding to each sample time data.
3. The method of claim 2, wherein the preset time rule template is stored in a tree structure.
4. The method of claim 3, wherein matching the plurality of description tags with a preset time rule template and determining a target description tag containing time data among the plurality of description tags based on a matching result comprises:
sequentially matching each description tag with the preset time rule template layer by layer;
for each layer, if the matching succeeds, returning the successfully matched time tag and proceeding to matching the next layer; if the matching fails, taking the description tag following the current start description tag as a new start description tag and matching it with the preset time rule template layer by layer;
if the plurality of description tags contain continuous description tags that match a target time tag sequence template of the preset time rule template, taking the continuous description tags as the target description tags, wherein the target time tag sequence template is any time tag sequence template in the preset time rule template.
5. The method of claim 1, wherein prior to matching the plurality of descriptive tags with a preset time rule template, the method further comprises:
screening the plurality of description tags according to a preset screening rule to obtain a plurality of screened description tags, wherein the preset screening rule at least comprises: a continuous description tag length rule and a character number rule.
6. A time data extraction apparatus, comprising:
a first acquisition module, configured to acquire a text to be processed;
a first decomposition module, configured to decompose the text to be processed to obtain a plurality of description tags, wherein the description tags are used for representing at least one of the following: the character types of the continuous similar characters in the text to be processed, the character quantity of the continuous similar characters and the character codes of the continuous similar characters;
a first determining module, configured to match the plurality of description tags with a preset time rule template and determine a target description tag containing time data from the plurality of description tags based on a matching result, wherein the preset time rule template comprises a plurality of time tag sequence templates; if a description tag is the same as a time tag of the preset time rule template, the matching result is determined to be successful;
a second determining module, configured to determine time data in the text to be processed according to the target description tag;
wherein the first decomposition module comprises:
a recognition unit, configured to recognize the text to be processed character by character and determine the character type of each character in the text to be processed;
a determining unit, configured to generate one description tag from each run of continuous characters of the same type, so as to determine the plurality of description tags of the text to be processed.
7. An electronic device comprising a memory, a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of the preceding claims 1-5.
8. A computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any one of claims 1 to 5.
CN202010418390.XA 2020-05-18 2020-05-18 Time data extraction method and device and electronic equipment Active CN111639491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010418390.XA CN111639491B (en) 2020-05-18 2020-05-18 Time data extraction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010418390.XA CN111639491B (en) 2020-05-18 2020-05-18 Time data extraction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111639491A CN111639491A (en) 2020-09-08
CN111639491B true CN111639491B (en) 2024-05-03

Family

ID=72331994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010418390.XA Active CN111639491B (en) 2020-05-18 2020-05-18 Time data extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111639491B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009140113A (en) * 2007-12-05 2009-06-25 Fuji Xerox Co Ltd Dictionary editing device, dictionary editing method, and computer program
CN104951508A (en) * 2015-05-21 2015-09-30 腾讯科技(深圳)有限公司 Time information identification method and device
CN107705784A (en) * 2017-09-28 2018-02-16 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
CN108874928A (en) * 2018-05-31 2018-11-23 平安科技(深圳)有限公司 Resume data information analyzing and processing method, device, equipment and storage medium
CN109190119A (en) * 2018-08-22 2019-01-11 腾讯科技(深圳)有限公司 Time extracting method and device, storage medium and electronic device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201013433A (en) * 2008-09-19 2010-04-01 Esobi Inc Filtering method for the same or similar documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
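Question-answering-oriented numerical information extraction; Zhang Guiping et al.; Journal of Zhengzhou University (Natural Science Edition); 2018-05-24 (No. 04); full text *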

Also Published As

Publication number Publication date
CN111639491A (en) 2020-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant