CN113111233A - Regular expression-based method and device for extracting residential address of alarm receiving and processing text

Info

Publication number: CN113111233A
Application number: CN202010307808.XA
Authority: CN
Inventors: 彭涛; 张鹏; 杨欣雨
Original assignee: Beijing Mingyi Technology Co ltd
Current assignee: Beijing Mingyi Technology Co ltd
Priority date: 2020-02-13
Filing date: 2020-04-17
Publication date: 2021-07-13
Anticipated expiration: 2040-04-17
Also published as: CN113111233B

Abstract

Embodiments of the disclosure provide a regular expression-based method and device for extracting residence address information from an alarm receiving and processing text. One embodiment of the method comprises: acquiring an alarm receiving and processing text from which residence address information is to be extracted; matching the text against a residence identification extraction regular expression to obtain a residence identification position information sequence; matching the text against an address extraction regular expression to obtain an address position information sequence; performing a residence address information extraction operation for each piece of residence identification position information in the residence identification position information sequence; and determining the residence address information corresponding to each piece of residence identification position information in the sequence as the residence address information set corresponding to the text. The embodiment realizes automatic extraction of the residence address information in the alarm receiving and processing text.

Description

Regular expression-based method and device for extracting residential address of alarm receiving and processing text

Technical Field

The embodiment of the disclosure relates to the technical field of computers, in particular to a regular expression-based method and device for extracting a residential address of an alarm receiving and processing text.

Background

Currently, a 110 alarm receiving operator in a public security organization enters an alarm receiving text when receiving an alarm, and the alarm handling officer may enter an alarm handling text after the alarm has been handled. The alarm receiving and processing text comprises the alarm receiving text and the alarm handling text. In practice, the alarm receiving and processing text often includes a description of the residence of an involved person (e.g., a former place of residence, a current place of residence, etc.). A case analyst can compare identical or similar residence address information appearing in different alarm receiving and processing texts for further processing. For example, a series of cases or related cases may be discovered through identical or similar residence address information. Therefore, it is very important to extract the residence address information in the alarm receiving and processing text.

However, residence address information in alarm receiving and processing texts is at present mostly extracted manually, which has a high labor cost and depends on personal experience.

Disclosure of Invention

The embodiment of the disclosure provides a regular expression-based method and a regular expression-based device for extracting an alarm receiving and processing text residential address.

In a first aspect, an embodiment of the present disclosure provides a regular expression-based method for extracting residence address information from an alarm receiving and processing text, the method including: acquiring an alarm receiving and processing text from which residence address information is to be extracted; matching the text against a residence identification extraction regular expression to obtain a residence identification position information sequence; matching the text against an address extraction regular expression to obtain an address position information sequence; for each piece of residence identification position information in the residence identification position information sequence, performing the following residence address information extraction operation: determining the end position in the residence identification position information as a target end position; for each piece of address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to that address position information; and determining the text between the start position and the end position of target address position information in the to-be-extracted text as the residence address information corresponding to the residence identification position information, wherein the target address position information is the address position information whose edit distance is the smallest among those with a positive edit distance; and determining the residence address information corresponding to each piece of residence identification position information in the sequence as the residence address information set corresponding to the to-be-extracted text.
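The extraction operation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `Span`, `nearest_address` and `extract_residence_addresses` are hypothetical, and the sketch assumes 0-based positions with an exclusive end position (Python's slicing convention), so an address that begins exactly where the identifier ends has distance 0 and is skipped by the "positive" condition.

```python
from typing import List, Optional, Tuple

# Assumption: a span is (start position, end position) with the end exclusive.
Span = Tuple[int, int]


def nearest_address(ident: Span, addresses: List[Span]) -> Optional[Span]:
    """Return the address span with the smallest positive distance from the
    end of the residence identifier, or None if no address follows it."""
    target_end = ident[1]
    best: Optional[Span] = None
    best_dist: Optional[int] = None
    for addr in addresses:
        # "Edit distance" as defined in the text: address start minus target end.
        dist = addr[0] - target_end
        if dist > 0 and (best_dist is None or dist < best_dist):
            best, best_dist = addr, dist
    return best


def extract_residence_addresses(
    text: str, idents: List[Span], addresses: List[Span]
) -> List[str]:
    """Collect the address text paired with each residence identifier."""
    result = []
    for ident in idents:
        addr = nearest_address(ident, addresses)
        if addr is not None:
            result.append(text[addr[0]:addr[1]])
    return result
```

In this sketch an identifier with no following address contributes nothing to the result; the text does not say how that case is handled, so this is an assumption.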

In some embodiments, the residence identification extraction regular expression is pre-trained by a first training step as follows: acquiring a first training sample set and a first test sample set, wherein each first training sample and each first test sample comprises a historical alarm receiving and processing text and a corresponding labeled residence identification position information sequence, each piece of labeled residence identification position information comprises a start position and an end position, and the labeled residence identification position information indicates that a residence identification lies between the start position and the end position in the historical alarm receiving and processing text; generating a first positive sample set from the first training samples in the first training sample set whose labeled residence identification position information sequence is not empty; selecting first positive samples from the first positive sample set to form a first target number of first positive sample subsets; for each first positive sample subset in the first target number of first positive sample subsets, generating a candidate regular expression corresponding to the first positive sample subset based on the first positive samples in that subset; testing each generated candidate regular expression on the first test sample set to determine the accuracy corresponding to each generated candidate regular expression; and determining the candidate regular expression with the highest accuracy among the generated candidate regular expressions as the residence identification extraction regular expression.
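The final two steps (testing candidates and keeping the most accurate one) can be sketched as below. The passage does not define how accuracy is computed or how candidates are generated, so this sketch makes two labeled assumptions: candidates are supplied as ready-made pattern strings, and a candidate counts as correct on a test sample only when its match spans equal the labeled span sequence exactly. The function names are hypothetical.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int]            # (start, end), end exclusive
Sample = Tuple[str, List[Span]]   # (historical text, labeled span sequence)


def match_spans(pattern: str, text: str) -> List[Span]:
    """All non-overlapping match spans of the pattern, in text order."""
    return [m.span() for m in re.finditer(pattern, text)]


def accuracy(pattern: str, test_set: Sequence[Sample]) -> float:
    """Assumed metric: fraction of test samples reproduced exactly."""
    if not test_set:
        return 0.0
    correct = sum(
        1 for text, labeled in test_set if match_spans(pattern, text) == labeled
    )
    return correct / len(test_set)


def pick_best(candidates: Sequence[str], test_set: Sequence[Sample]) -> str:
    """Return the candidate with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda p: accuracy(p, test_set))
```

Other accuracy definitions (for example, per-span precision) would fit the same selection step; exact sequence equality is chosen here only because it is the simplest to state.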

In some embodiments, selecting first positive samples from the first positive sample set to form a first target number of first positive sample subsets comprises: performing a first positive sample subset generating operation a first target number of times to generate the first target number of first positive sample subsets, the first positive sample subset generating operation comprising: randomly selecting N first positive samples from the first positive sample set to form a first positive sample subset, wherein N is the integer obtained by rounding down the quotient of L divided by M, L is the number of first positive samples in the first positive sample set, and M is a positive integer greater than or equal to 2 and smaller than L.
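The subset generating operation can be illustrated with a short Python sketch. The function name `make_positive_subsets` and the `seed` parameter are hypothetical, and the sketch assumes sampling without replacement inside each subset, which the text does not state explicitly.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def make_positive_subsets(
    positives: Sequence[T], m: int, target_number: int, seed: Optional[int] = None
) -> List[List[T]]:
    """Draw `target_number` random subsets, each of size N = floor(L / M)."""
    l = len(positives)
    if not (2 <= m < l):
        raise ValueError("M must satisfy 2 <= M < L")
    n = l // m  # rounding down the quotient of L divided by M
    rng = random.Random(seed)
    # Assumption: no sample is repeated within a subset; subsets may overlap.
    return [rng.sample(list(positives), n) for _ in range(target_number)]
```

Because M is smaller than L, N is always at least 1, so no subset is empty.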

In some embodiments, the address extraction regular expression is pre-trained by a second training step as follows: acquiring a second training sample set and a second test sample set, wherein the second training sample and the second test sample both comprise a historical alarm receiving and processing text and a corresponding labeled address position information sequence, the labeled address position information comprises a starting position and an ending position, and the labeled address position information is used for representing that an address is arranged between the starting position and the ending position in the historical alarm receiving and processing text; generating a second positive sample set by using each second training sample marked with an address position information sequence which is not empty in the second training sample set; selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets; for each second positive sample subset in a second target number of second positive sample subsets, generating a candidate regular expression corresponding to the second positive sample subset based on each second positive sample in the second positive sample subset; testing each generated candidate regular expression based on a second test sample set to determine an accuracy corresponding to each generated candidate regular expression; and determining the candidate regular expression with the highest accuracy rate in the generated candidate regular expressions as an address extraction regular expression.

In some embodiments, selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets comprises: performing a second positive sample subset generating operation a second target number of times to generate the second target number of second positive sample subsets, the second positive sample subset generating operation comprising: randomly selecting N' second positive samples from the second positive sample set to form a second positive sample subset, wherein N' is the integer obtained by rounding down the quotient of L' divided by M', L' is the number of second positive samples in the second positive sample set, and M' is a positive integer greater than or equal to 2 and smaller than L'.

In some embodiments, the edit distance corresponding to the target address location information is less than a preset edit distance threshold.

In some embodiments, the preset edit distance threshold is pre-calculated by a third training step as follows: acquiring a third training sample set, wherein each third training sample comprises a historical alarm receiving and processing text and a corresponding labeled residence information sequence, each piece of labeled residence information comprises a residence identification start position, a residence identification end position, an address start position and an address end position, and the labeled residence information indicates that a residence identification lies between the residence identification start position and the residence identification end position in the historical alarm receiving and processing text and that the residence address information corresponding to that residence identification is the text between the address start position and the address end position in the historical alarm receiving and processing text; for each third training sample in the third training sample set, determining the maximum of the edit distances corresponding to the pieces of labeled residence information in the labeled residence information sequence of that sample as the maximum edit distance corresponding to the sample, wherein the edit distance corresponding to a piece of labeled residence information is the difference obtained by subtracting its residence identification end position from its address start position; and determining the maximum of the maximum edit distances corresponding to the third training samples in the third training sample set as the preset edit distance threshold.
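The threshold computation is a maximum over per-sample maxima, as the following sketch shows. The `Label` tuple layout and the function name are hypothetical, and skipping samples whose labeled sequence is empty is an assumption, since the text does not say how such samples are treated.

```python
from typing import List, Tuple

# Assumed layout: (identifier start, identifier end, address start, address end)
Label = Tuple[int, int, int, int]


def edit_distance_threshold(samples: List[List[Label]]) -> int:
    """Maximum, over all samples, of each sample's maximum edit distance."""
    per_sample_max = [
        # Edit distance of one label: address start minus identifier end.
        max(addr_start - ident_end for _, ident_end, addr_start, _ in labels)
        for labels in samples
        if labels  # assumption: samples without labels are ignored
    ]
    if not per_sample_max:
        raise ValueError("no labeled residence information in the sample set")
    return max(per_sample_max)
```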

In a second aspect, an embodiment of the present disclosure provides a regular expression-based device for extracting residence address information from an alarm receiving and processing text, the device including: an acquisition unit configured to acquire an alarm receiving and processing text from which residence address information is to be extracted; a first matching unit configured to match the text against the residence identification extraction regular expression to obtain a residence identification position information sequence; a second matching unit configured to match the text against the address extraction regular expression to obtain an address position information sequence; an extraction unit configured to perform the following residence address information extraction operation for each piece of residence identification position information in the residence identification position information sequence: determining the end position in the residence identification position information as a target end position; for each piece of address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to that address position information; and determining the text between the start position and the end position of target address position information in the to-be-extracted text as the residence address information corresponding to the residence identification position information, wherein the target address position information is the address position information whose edit distance is the smallest among those with a positive edit distance; and a determining unit configured to determine the residence address information corresponding to each piece of residence identification position information in the sequence as the residence address information set corresponding to the to-be-extracted text.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.

In a fourth aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method as described in any implementation manner of the first aspect.

In the prior art, residence address information is generally extracted from alarm receiving and processing texts manually, which may cause the following problems: (1) a large backlog of historical alarm receiving and processing texts has never had its residence address information extracted, and alarm receiving and handling personnel enter a large number of new texts every day, so the volume of text awaiting extraction is too large and the labor and time costs of manual extraction are too high; (2) alarm receiving and processing texts are mostly written in natural language with highly irregular phrasing, which makes manual extraction of residence address information difficult; (3) there are many types of residence address information (for example, the way residence addresses are recorded may differ between provinces, municipalities and autonomous regions), different types require different extraction approaches, and the work depends on manual experience, that is, the learning cost of manual extraction is high.

Embodiments of the disclosure provide a regular expression-based method and device for extracting residence address information from an alarm receiving and processing text. The method matches the to-be-extracted alarm receiving and processing text against a residence identification extraction regular expression and an address extraction regular expression respectively to obtain a residence identification position information sequence and an address position information sequence, performs the residence address information extraction operation for each piece of residence identification position information in the residence identification position information sequence, and finally determines the residence address information corresponding to each piece of residence identification position information in the sequence as the residence address information set corresponding to the to-be-extracted text. The residence identification extraction regular expression and the address extraction regular expression are thus used together to extract residence address information from the alarm receiving and processing text automatically, without manual operation, which reduces the cost of the extraction and increases its speed.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a regular expression based method for extracting alarm text residential address information according to the present disclosure;

FIG. 3 is a flow chart of one embodiment of a first training step according to the present disclosure;

FIG. 4 is a flow chart of one embodiment of a second training step according to the present disclosure;

FIG. 5 is a flow chart of one embodiment of a third training step according to the present disclosure;

FIG. 6 is a schematic structural diagram illustrating an embodiment of a regular expression-based alarm text residential address information extraction apparatus according to the present disclosure;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the regular expression based alarm text residential address information extraction method or regular expression based alarm text residential address information extraction apparatus of the present disclosure may be applied.

As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as an alarm receiving and handling record application, an alarm receiving and handling text residence address information extraction application, a web browser application, and the like, may be installed on the terminal device 101.

The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices having a display screen and supporting text input, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal device 101 is software, it can be installed in the electronic devices listed above, and it may be implemented as multiple pieces of software or software modules (e.g., to provide an alarm receiving and processing text residence address information extraction service) or as a single piece of software or software module. No particular limitation is imposed here.

The server 103 may be a server that provides various services, such as a background server that provides extraction of the address information of the residential area for the alarm receiving text sent by the terminal apparatus 101. The background server can analyze and process the received alarm receiving text, and feed back the processing result (such as the residential address information set) to the terminal device.

In some cases, the regular expression-based method for extracting residence address information from an alarm receiving and processing text provided by the embodiment of the disclosure may be performed jointly by the terminal device 101 and the server 103. For example, the step of "acquiring the alarm receiving and processing text from which residence address information is to be extracted" may be performed by the terminal device 101, and the remaining steps may be performed by the server 103. The present disclosure is not limited thereto. Accordingly, the regular expression-based device for extracting residence address information from an alarm receiving and processing text may also be provided partly in the terminal device 101 and partly in the server 103.

In some cases, the regular expression-based method for extracting the address information of the alarm receiving and processing text residence provided by the embodiment of the present disclosure may be executed by the server 103, and accordingly, the regular expression-based device for extracting the address information of the alarm receiving and processing text residence may also be disposed in the server 103, and in this case, the system architecture 100 may not include the terminal device 101.

In some cases, the regular expression-based method for extracting the address information of the alarm receiving and processing text residence provided by the embodiment of the present disclosure may be executed by the terminal device 101, and accordingly, the regular expression-based device for extracting the address information of the alarm receiving and processing text residence may also be disposed in the terminal device 101, and in this case, the system architecture 100 may not include the server 103.

The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (for example, for providing an alarm receiving and processing text residence address information extraction service), or as a single piece of software or software module. No particular limitation is imposed here.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a regular expression based alarm text residential address information extraction method according to the present disclosure is shown. The regular expression-based method for extracting the address information of the alarm receiving text residence place comprises the following steps:

Step 201, acquiring a to-be-extracted residence address information alarm receiving text.

In this embodiment, an execution subject of the regular expression-based method for extracting residence address information from an alarm receiving and processing text (for example, the server shown in fig. 1) may obtain a locally stored to-be-extracted alarm receiving and processing text, or the execution subject may remotely obtain the to-be-extracted alarm receiving and processing text from another electronic device connected to it through a network (for example, the terminal device shown in fig. 1).

Here, the to-be-extracted alarm receiving and processing text may be text data that an alarm receiving operator compiles from the content of an alarm call, or text data that an alarm handling officer compiles from the handling of the alarm. The to-be-extracted alarm receiving and processing text may also be an alarm text received from a terminal device, entered by a user in an alarm application installed on the terminal device or in a web page with an alarm function.

Step 202, matching the alarm receiving text of the to-be-extracted residence address information with a residence identification extraction regular expression to obtain a residence identification position information sequence.

In this embodiment, the regular expression is a logical formula for operating on a character string, that is, a "regular character string" is formed by using specific characters defined in advance and a combination of the specific characters, and the "regular character string" is used to express a filtering logic for the character string. Given one regular expression and another string, it can be determined whether the given string matches the regular expression's filtering logic, and by giving one regular expression, the particular portion of the string that is desired to be extracted can be retrieved from the string.
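As a brief illustration of the two uses just described (deciding whether a string matches, and retrieving the part to be extracted), the following Python sketch uses the standard `re` module. The pattern and the function names are hypothetical examples chosen for readability; they are not the expressions used by the disclosure.

```python
import re
from typing import Optional

# Hypothetical pattern: the literal marker "residence:" followed by a captured
# run of word characters and spaces.
PATTERN = re.compile(r"residence:\s*(\w[\w ]*)")


def matches(text: str) -> bool:
    """Whether the text satisfies the pattern's filtering logic anywhere."""
    return PATTERN.search(text) is not None


def extract(text: str) -> Optional[str]:
    """Retrieve the captured portion of the text, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None
```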

In this embodiment, the residence identification extraction regular expression may be a regular expression for extracting residence identification in text. Wherein the residence identification is text for indicating the start of the residence address information. For example, the residence identification may be "place of residence", "live address", and the like.

Here, the execution subject of the method (for example, the server shown in fig. 1) may match the to-be-extracted alarm receiving and processing text against the residence identification extraction regular expression and extract residence identification position information, where each piece of residence identification position information may include a start position and an end position that characterize where the extracted residence identification begins and ends in the to-be-extracted text. It is understood that the to-be-extracted text may contain no residence identification or at least one residence identification, and therefore the residence identification position information of the extracted residence identifications may be formed into a residence identification position information sequence in the order in which the residence identifications appear in the to-be-extracted text.

For example, matching the to-be-extracted alarm receiving and processing text "Zhang Yi, whose former place of residence was in a city of Province A and whose current place of residence is in City C, has a labor dispute with Li, whose current place of residence is in City C" against the residence identification extraction regular expression yields the residence identification position information sequence {"start position-4; end position-7", "start position-14; end position-16", "start position-24; end position-26"}. That is, the three "place of residence" expressions in the text are identified as residence identifications.
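Step 202 can be sketched with `re.finditer`, which yields matches in text order and exposes each match's start and end. The pattern below is a hypothetical English stand-in for a residence identification extraction regular expression, and the positions it produces are 0-based with an exclusive end, which is an assumption and may differ from the numbering convention in the example above.

```python
import re
from typing import Dict, List

# Hypothetical pattern for illustration only.
RESIDENCE_ID_PATTERN = re.compile(r"(?:former|current) (?:place of )?residence")


def residence_id_positions(text: str) -> List[Dict[str, int]]:
    """Residence identification position information sequence, in text order.

    An empty list means the text contains no residence identification.
    """
    return [
        {"start": m.start(), "end": m.end()}
        for m in RESIDENCE_ID_PATTERN.finditer(text)
    ]
```

The address position information sequence of the next step could be produced the same way with a second pattern.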

In some alternative implementations, the residence identification extraction regular expression may be a logical formula operating on strings for extracting the residence identification, formulated by a technician based on statistical analysis of the residence identification portion of a large number of historical alarm receipt texts including the residence identification.

In some optional implementations, the residential identification extraction regular expression may also be pre-trained by a first training step as shown in fig. 3. Referring to fig. 3, fig. 3 shows a flow 300 of one embodiment of a first training step according to the present disclosure. The flow 300 of the first training step may include the steps of:

Step 301, a first training sample set and a first testing sample set are obtained.

Here, the execution subject of the first training step may be the same as that of the regular expression-based method for extracting residence address information from an alarm receiving and processing text described above. In this case, after obtaining the residence identification extraction regular expression through training, the execution subject of the first training step may store it locally, and read the trained residence identification extraction regular expression when executing the extraction method.

Here, the execution subject of the first training step may also be different from that of the extraction method described above. In this case, after obtaining the residence identification extraction regular expression through training, the execution subject of the first training step may send it to the execution subject of the extraction method, which may then read the residence identification extraction regular expression received from the execution subject of the first training step when executing the extraction method.

Here, the execution subject of the first training step may first obtain a first training sample set and a first test sample set. Each first training sample and each first test sample includes a historical alarm receiving and processing text and a corresponding labeled residence identification position information sequence. Each piece of labeled residence identification position information may include a start position and an end position, and indicates that the text between that start position and that end position in the historical alarm receiving and processing text is a residence identification. It should be noted that, in practice, an alarm receiving text may include no residence identification or at least one residence identification. Thus, the labeled residence identification position information sequence included in a first training sample or a first test sample may be empty or may include at least one piece of labeled residence identification position information.

Here, the labeled residence identification position information sequences in the first training samples and the first test samples may be obtained by manually labeling the corresponding historical alarm receiving texts.

In practice, in order to improve the matching degree of the trained residence identification extraction regular expression to the residence identification, the historical alarm receiving and processing texts in the first training sample and the first test sample obtained here may not include the invalid alarm receiving and processing text. For example, some of the alarm receiving texts do not include any residential address information, and have no value in actually extracting the residential address information, and such alarm receiving texts can be regarded as invalid alarm receiving texts.

Step 302, generating a first positive sample set from the first training samples in the first training sample set whose labeled residence identification position information sequence is not empty.

If the labeled residence identification position information sequence of a first training sample in the first training sample set is not empty, the historical alarm receiving text of that first training sample includes at least one residence identification, and the first training sample is a first positive sample. Thus, a first positive sample set may be generated from the first training samples in the first training sample set whose labeled residence identification position information sequence is not empty.
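By way of a non-limiting sketch, the filtering in step 302 could be expressed as follows. The `Sample` type, the representation of position information as `(start, end)` tuples, and the function name `build_positive_set` are assumptions introduced only for illustration; the disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical representation: one (start position, end position) pair
# per labeled residence identification.
Span = Tuple[int, int]


@dataclass
class Sample:
    text: str                                        # historical alarm receiving text
    spans: List[Span] = field(default_factory=list)  # labeled position information sequence


def build_positive_set(training_set: List[Sample]) -> List[Sample]:
    """Keep only the samples whose labeled sequence is not empty."""
    return [sample for sample in training_set if sample.spans]


training_set = [
    Sample("text containing one identification", [(4, 7)]),
    Sample("text containing no identification", []),
    Sample("text containing two identifications", [(4, 7), (14, 16)]),
]
positive_set = build_positive_set(training_set)
```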

Step 303, select first positive samples from the first positive sample set to form a first target number of first positive sample subsets.

After obtaining the first positive sample set in step 302, the execution subject of the first training step may select first positive samples from the first positive sample set to form a first target number of first positive sample subsets. Here, the first target number may be preset, or may be determined by receiving user input through an interface provided by the execution subject.

In some optional implementations, step 303 may be performed as follows: a first positive sample subset generation operation is performed a first target number of times to generate a first target number of first positive sample subsets. The first positive sample subset generation operation includes: randomly selecting N first positive samples from the first positive sample set to form a first positive sample subset, where N is the integer obtained by rounding down the quotient of L divided by M, L is the number of first positive samples in the first positive sample set, and M is a positive integer greater than or equal to 2 and less than L. For example, if the first positive sample set includes 367 first positive samples, the first target number is 4, and M is 3, then L is 367 and N is 122, the integer obtained by rounding down the quotient of 367 divided by 3. The following operation is then performed 4 times: randomly selecting 122 first positive samples from the first positive sample set including 367 first positive samples to form a first positive sample subset. Finally, 4 first positive sample subsets are obtained, each including 122 first positive samples.
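The random selection described above may be sketched as follows. This is only an illustration: the function name `random_subsets` and the `seed` parameter are hypothetical, and drawing without replacement inside a single subset is an assumption, since the description only says that N samples are selected at random.

```python
import random
from typing import List, Optional, Sequence, TypeVar

S = TypeVar("S")


def random_subsets(positives: Sequence[S], target_number: int, m: int,
                   seed: Optional[int] = None) -> List[List[S]]:
    """Repeat `target_number` times: pick N = floor(L / M) samples at random."""
    l = len(positives)
    if not 2 <= m < l:
        raise ValueError("M must satisfy 2 <= M < L")
    n = l // m                      # N: quotient of L divided by M, rounded down
    rng = random.Random(seed)       # seed is optional, for reproducibility only
    # Assumption: no sample repeats within one subset; different subsets may overlap.
    return [rng.sample(list(positives), n) for _ in range(target_number)]


# The figures of the example above: L = 367, first target number = 4, M = 3.
subsets = random_subsets(list(range(367)), target_number=4, m=3, seed=0)
```

Because each subset is drawn independently from the full positive sample set, the same sample may appear in more than one subset.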

In some alternative implementations, step 303 may also be performed as follows:

The first positive sample set is divided into a first target number of first positive sample subsets, where the numbers of first positive samples in the first positive sample subsets are as close to one another as possible. Specifically, assume that the first positive sample set includes L first positive samples, the first target number is T, Q is the integer obtained by rounding down the quotient of L divided by T, and R is the remainder of L divided by T. When R is zero, the first positive sample set may be divided evenly into T first positive sample subsets, each including Q first positive samples. When R is greater than zero, the first positive sample set may be divided into T first positive sample subsets, where T-1 first positive sample subsets each include Q first positive samples and the remaining first positive sample subset includes Q + R first positive samples.
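The division described above may be sketched as follows. The function name `partition` is hypothetical, and splitting the samples in their stored order is an assumption, since the description does not say how samples are assigned to subsets.

```python
from typing import List, Sequence, TypeVar

S = TypeVar("S")


def partition(positives: Sequence[S], t: int) -> List[List[S]]:
    """Split into T subsets: T-1 of size Q and one of size Q + R."""
    l = len(positives)
    if not 1 <= t <= l:
        raise ValueError("T must satisfy 1 <= T <= L")
    q, r = divmod(l, t)             # Q: rounded-down quotient, R: remainder
    subsets = [list(positives[i * q:(i + 1) * q]) for i in range(t - 1)]
    subsets.append(list(positives[(t - 1) * q:]))   # last subset holds Q + R samples
    return subsets


# Illustrative figures only: L = 367, T = 4 gives Q = 91 and R = 3.
parts = partition(list(range(367)), 4)
sizes = [len(p) for p in parts]
```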

Step 304, for each first positive sample subset of the first target number of first positive sample subsets, generating a candidate regular expression corresponding to the first positive sample subset based on each first positive sample in the first positive sample subset.

Through step 303, first positive samples have been selected from the first positive sample set to form a first target number of first positive sample subsets. Here, for each of the first positive sample subsets so generated, the execution subject of the first training step may generate a candidate regular expression in various ways based on the first positive samples in that first positive sample subset. Specifically, for each first positive sample in the first positive sample subset, the residence identifications in the historical alarm receiving text of the first positive sample may be obtained according to the start position and the end position in each piece of labeled residence identification position information in the labeled residence identification position information sequence of the first positive sample. Then, a candidate regular expression corresponding to the first positive sample subset is generated based on the residence identifications obtained for the first positive samples in the first positive sample subset. It should be noted that generating a regular expression from at least one text is prior art that is widely studied and applied at present, and is not described in detail here.
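Since the description leaves the generation technique to the prior art, the following is only a deliberately naive stand-in, not the method of the disclosure: it cuts the labeled strings out of each text and joins them into one alternation. The function names, the `(text, spans)` tuple layout, and the reading of positions as zero-based, end-exclusive Python slice indices are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def labeled_strings(text: str, spans: Iterable[Span]) -> List[str]:
    # Assumption: (start, end) behaves like a Python slice, i.e. end-exclusive.
    return [text[start:end] for start, end in spans]


def candidate_regex(subset: Iterable[Tuple[str, List[Span]]]) -> str:
    """Naive generator: an alternation of every distinct labeled string."""
    strings = set()
    for text, spans in subset:
        strings.update(labeled_strings(text, spans))
    if not strings:
        raise ValueError("a positive sample subset must contain labeled strings")
    # Longer alternatives first, so a longer identification is preferred
    # over a shorter one that is its prefix.
    ordered = sorted(strings, key=lambda s: (-len(s), s))
    return "|".join(re.escape(s) for s in ordered)


# Hypothetical English stand-ins for labeled residence identifications.
subset = [
    ("former residence was X", [(0, 16)]),
    ("current residence is Y", [(0, 17)]),
]
pattern = candidate_regex(subset)
```

A real generator would normally generalize beyond the literal strings it has seen; this sketch only shows where the labeled start and end positions enter the process.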

Through step 304, at most a first target number of candidate regular expressions may be generated.

Step 305, testing the generated candidate regular expressions based on the first test sample set to determine an accuracy corresponding to each generated candidate regular expression.

Specifically, the execution subject of the first training step may perform the following first accuracy determination operation for each candidate regular expression generated in step 304.

First, for each first test sample in the first test sample set obtained in step 301, it is determined whether the historical alarm receiving text in the first test sample matches the candidate regular expression.

If it matches, then according to the candidate regular expression the historical alarm receiving text in the first test sample includes a residence identification, and it is further determined whether the labeled residence identification position information sequence in the first test sample is empty. If the sequence is empty, the historical alarm receiving text in the first test sample does not actually include a residence identification, and the first test sample may be determined to be a negative sample relative to the candidate regular expression. If the sequence is not empty, the historical alarm receiving text in the first test sample does include a residence identification, and the first test sample may be determined to be a positive sample relative to the candidate regular expression.

If it does not match, then according to the candidate regular expression the historical alarm receiving text in the first test sample does not include a residence identification, and it is further determined whether the labeled residence identification position information sequence in the first test sample is empty. If the sequence is empty, the historical alarm receiving text in the first test sample indeed does not include a residence identification, and the first test sample may be determined to be a positive sample relative to the candidate regular expression. If the sequence is not empty, the historical alarm receiving text in the first test sample actually includes a residence identification, and the first test sample may be determined to be a negative sample relative to the candidate regular expression.

Finally, the ratio of the number of first test samples in the first test sample set that are positive samples relative to the candidate regular expression to the total number of first test samples in the first test sample set is determined as the accuracy corresponding to the candidate regular expression.
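The first accuracy determination operation may be sketched as follows. A test sample counts as a positive sample relative to the candidate regular expression exactly when the regular expression's verdict (matched or not) agrees with the label (sequence non-empty or empty). Using `re.search` as the matching test, the `(text, spans)` tuple layout, and the function name `accuracy` are assumptions made for illustration.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def accuracy(pattern: str, test_set: Sequence[Tuple[str, List[Span]]]) -> float:
    """Share of test samples on which the candidate agrees with the labeling."""
    if not test_set:
        raise ValueError("the test sample set must not be empty")
    regex = re.compile(pattern)
    agreeing = 0
    for text, spans in test_set:
        matched = regex.search(text) is not None   # verdict of the candidate
        labeled = len(spans) > 0                   # labeled sequence is not empty
        if matched == labeled:                     # positive sample relative to the candidate
            agreeing += 1
    return agreeing / len(test_set)


# Hypothetical test samples with English stand-in texts.
test_set = [
    ("current residence is X", [(0, 17)]),               # matched, labeled
    ("no identification here", []),                      # not matched, not labeled
    ("former residence was Y", [(0, 16)]),               # not matched, labeled
    ("current residence appears but is unlabeled", []),  # matched, not labeled
]
score = accuracy("current residence", test_set)
```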

Step 306, determining the candidate regular expression with the highest accuracy among the generated candidate regular expressions as the residence identification extraction regular expression.
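The selection in step 306 may be sketched as follows. The function name `select_best` and the mapping from candidate regular expression to accuracy are hypothetical. The description does not say how ties are broken; this sketch simply keeps the first candidate that reaches the highest accuracy.

```python
from typing import Dict


def select_best(accuracies: Dict[str, float]) -> str:
    """Return the candidate regular expression with the highest accuracy."""
    if not accuracies:
        raise ValueError("no candidate regular expressions were generated")
    # max() returns the first key encountered among equally accurate candidates.
    return max(accuracies, key=lambda pattern: accuracies[pattern])


# Hypothetical candidates and accuracies, for illustration only.
best = select_best({"pattern_a": 0.50, "pattern_b": 0.90, "pattern_c": 0.70})
```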

The first training step shown in flow 300 can be used to generate the residence identification extraction regular expression automatically, which reduces the labor cost of producing it. Moreover, the way people express themselves changes over time, so the residence identifications appearing in alarm receiving texts may also change, and errors may occur if the residence identifications are always extracted in a fixed manner. In that case, the latest first training sample set and first test sample set can be obtained, and the first training step can be used to regenerate the residence identification extraction regular expression so that it fits the wording of current alarm receiving texts.

Step 203, matching the alarm receiving text of the to-be-extracted residential address information with the address extraction regular expression to obtain an address position information sequence.

In this embodiment, the address extraction regular expression may be a regular expression for extracting an address in a text.

Here, an execution subject (for example, the server shown in fig. 1) of the regular expression-based alarm receiving text residential address information extraction method may match the alarm receiving text of the residential address information to be extracted with the address extraction regular expression to extract address position information. The address position information may include a start position and an end position, which represent the start position and the end position of the extracted address in the alarm receiving text of the residential address information to be extracted. It can be understood that the alarm receiving text of the residential address information to be extracted may contain no address or at least one address, so the address position information of the extracted addresses may form an address position information sequence ordered by where the corresponding addresses appear in the alarm receiving text of the residential address information to be extracted.
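As a non-limiting sketch, a position information sequence can be collected from the match objects of a regular expression. The pattern `City [A-Z]` and the English sample sentence are hypothetical stand-ins, not the address extraction regular expression of the disclosure, and the end positions returned here follow Python's end-exclusive convention, which is an assumption about how positions are counted.

```python
import re
from typing import List, Tuple


def position_sequence(pattern: str, text: str) -> List[Tuple[int, int]]:
    """Return (start, end) for every non-overlapping match, in text order."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


# Hypothetical pattern and text, for illustration only.
text = "lived in City B, now in City C"
addresses = position_sequence(r"City [A-Z]", text)

# A text without any address yields an empty sequence.
no_addresses = position_sequence(r"City [A-Z]", "nothing to extract")
```

`re.finditer` scans from left to right, so the resulting sequence is already ordered by where each address appears in the text.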

For example, suppose the alarm receiving text of the residential address information to be extracted reads, in translation, "a certain Zhang, whose former residence was City B, Province A, and whose current residence is located in District D, City C, has a labor dispute with a certain Li located in a district of City E". Matching this text with the address extraction regular expression yields the address position information sequence {"start position-9; end position-12", "start position-17; end position-21", "start position-29; end position-33"}, where the positions refer to the characters of the original text. That is, "City B, Province A", "District D, City C" and the address in City E are addresses.

In some optional implementations, the address extraction regular expression may be a logical formula for string operations that is used to extract addresses and is formulated by a technician based on statistical analysis of the address portions of a large number of historical alarm receiving texts that include addresses.

In some optional implementations, the address extraction regular expression may also be pre-trained by a second training step as shown in fig. 4. Referring to fig. 4, fig. 4 shows a flow 400 of one embodiment of a second training step according to the present disclosure. The flow 400 of the second training step may include the steps of:

Step 401, obtaining a second training sample set and a second test sample set.

Here, the execution subject of the second training step may be the same as that of the regular expression-based alarm text residential address information extraction method described above. In this way, the execution main body of the second training step may store the address extraction regular expression locally in the execution main body after the address extraction regular expression is obtained through training, and read the trained address extraction regular expression in the process of executing the regular expression-based method for extracting address information of the alarm receiving text residential area.

Here, the execution subject of the second training step may also be different from that of the regular expression-based alarm receiving text residence address information extraction method described above. In this way, the execution main body of the second training step may send the address extraction regular expression to the execution main body of the regular expression-based alarm receiving text residential address information extraction method after the address extraction regular expression is obtained through training. In this way, the execution subject of the regular expression-based alarm receiving text residential address information extraction method may read the address extraction regular expression received from the execution subject of the second training step in the process of executing the regular expression-based alarm receiving text residential address information extraction method.

Here, the execution subject of the second training step may first obtain a second training sample set and a second test sample set. Each second training sample and each second test sample includes a historical alarm receiving and processing text and a corresponding labeled address position information sequence. Each piece of labeled address position information may include a start position and an end position, and indicates that the text between that start position and that end position in the historical alarm receiving and processing text is an address. It should be noted that, in practice, an alarm receiving text may include no address or at least one address. Therefore, the labeled address position information sequence included in a second training sample or a second test sample may be empty or may include at least one piece of labeled address position information.

Here, the labeled address location information sequence in the second training sample and the second testing sample may be obtained by manually labeling the corresponding historical alarm receiving and processing text.

In practice, in order to improve the matching degree of the trained address extraction regular expression to the address, the historical alarm receiving and processing texts in the second training sample and the second test sample obtained here may not include the invalid alarm receiving and processing text. For example, some of the alarm receiving texts do not include any residential address, and have no value in actually extracting the residential address information, and such alarm receiving texts can be considered as invalid alarm receiving texts.

Step 402, generating a second positive sample set from the second training samples in the second training sample set whose labeled address position information sequence is not empty.

If the labeled address position information sequence of a second training sample in the second training sample set is not empty, the historical alarm receiving text of that second training sample includes at least one address, and the second training sample is a second positive sample. Therefore, a second positive sample set may be generated from the second training samples in the second training sample set whose labeled address position information sequence is not empty.

Step 403, selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets.

After obtaining the second set of positive samples in step 402, the performing agent of the second training step may select second positive samples from the second set of positive samples to form a second target number of subsets of positive samples. Here, the second target number may be preset, or may be determined by receiving a user input through an interface provided in the execution main body.

In some optional implementations, step 403 may be performed as follows: a second positive sample subset generation operation is performed a second target number of times to generate a second target number of second positive sample subsets. The second positive sample subset generation operation includes: randomly selecting N' second positive samples from the second positive sample set to form a second positive sample subset, where N' is the integer obtained by rounding down the quotient of L' divided by M', L' is the number of second positive samples in the second positive sample set, and M' is a positive integer greater than or equal to 2 and less than L'. For example, if the second positive sample set includes 392 second positive samples, the second target number is 6, and M' is 3, then L' is 392 and N' is 130, the integer obtained by rounding down the quotient of 392 divided by 3. The following operation is then performed 6 times: randomly selecting 130 second positive samples from the second positive sample set including 392 second positive samples to form a second positive sample subset. Finally, 6 second positive sample subsets are obtained, each including 130 second positive samples.

In some alternative implementations, step 403 may also be performed as follows:

The second positive sample set is divided into a second target number of second positive sample subsets, where the numbers of second positive samples in the second positive sample subsets are as close to one another as possible. Specifically, assume that the second positive sample set includes L' second positive samples, the second target number is T', Q' is the integer obtained by rounding down the quotient of L' divided by T', and R' is the remainder of L' divided by T'. When R' is zero, the second positive sample set may be divided evenly into T' second positive sample subsets, each including Q' second positive samples. When R' is greater than zero, the second positive sample set may be divided into T' second positive sample subsets, where T'-1 second positive sample subsets each include Q' second positive samples and the remaining second positive sample subset includes Q' + R' second positive samples.

Step 404, for each second positive sample subset in a second target number of second positive sample subsets, generating a candidate regular expression corresponding to the second positive sample subset based on each second positive sample in the second positive sample subset.

Through step 403, second positive samples have been selected from the second positive sample set to form a second target number of second positive sample subsets. Here, for each of the second positive sample subsets so generated, the execution subject of the second training step may generate a candidate regular expression in various ways based on the second positive samples in that second positive sample subset. Specifically, for each second positive sample in the second positive sample subset, the addresses in the historical alarm receiving text of the second positive sample may be obtained according to the start position and the end position in each piece of labeled address position information in the labeled address position information sequence of the second positive sample. Then, a candidate regular expression corresponding to the second positive sample subset is generated based on the addresses obtained for the second positive samples in the second positive sample subset. It should be noted that generating a regular expression from at least one text is prior art that is widely studied and applied at present, and is not described in detail here.

Step 405, testing each generated candidate regular expression based on the second set of test samples to determine an accuracy corresponding to each generated candidate regular expression.

Specifically, the execution subject of the second training step may perform the following second accuracy determination operation for each candidate regular expression generated in step 404.

First, for each second test sample in the second test sample set obtained in step 401, it is determined whether the historical alarm receiving text in the second test sample matches the candidate regular expression.

If it matches, then according to the candidate regular expression the historical alarm receiving text in the second test sample includes an address, and it is further determined whether the labeled address position information sequence in the second test sample is empty. If the sequence is empty, the historical alarm receiving text in the second test sample does not actually include an address, and the second test sample may be determined to be a negative sample relative to the candidate regular expression. If the sequence is not empty, the historical alarm receiving text in the second test sample does include an address, and the second test sample may be determined to be a positive sample relative to the candidate regular expression.

If it does not match, then according to the candidate regular expression the historical alarm receiving text in the second test sample does not include an address, and it is further determined whether the labeled address position information sequence in the second test sample is empty. If the sequence is empty, the historical alarm receiving text in the second test sample indeed does not include an address, and the second test sample may be determined to be a positive sample relative to the candidate regular expression. If the sequence is not empty, the historical alarm receiving text in the second test sample actually includes an address, and the second test sample may be determined to be a negative sample relative to the candidate regular expression.

Finally, the ratio of the number of second test samples in the second test sample set that are positive samples relative to the candidate regular expression to the total number of second test samples in the second test sample set is determined as the accuracy corresponding to the candidate regular expression.

Step 406, determining the candidate regular expression with the highest accuracy among the generated candidate regular expressions as the address extraction regular expression.

The second training step shown in flow 400 can be used to generate the address extraction regular expression automatically, which reduces the labor cost of producing it. Moreover, the way people express themselves changes over time, so the way addresses are written in alarm receiving texts may also change, and errors may occur if the addresses are always extracted in a fixed manner. In that case, the latest second training sample set and second test sample set can be obtained, and the second training step can be used to regenerate the address extraction regular expression so that it fits the wording of current alarm receiving texts.

Step 204, performing a residential address information extraction operation for each piece of residence identification position information in the residence identification position information sequence.

In the present embodiment, an execution subject (for example, the server shown in fig. 1) of the regular expression-based alarm receiving text residential address information extraction method may perform the residential address information extraction operation for each piece of residence identification position information in the residence identification position information sequence. Here, the residential address information extraction operation may include the following substeps 2041 to 2043:

Substep 2041, determining the end position in the residence identification position information as a target end position.

Substep 2042, for each piece of address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to the address position information.

Substep 2043, determining the text between the start position and the end position in the target address position information, within the alarm receiving text of the residential address information to be extracted, as the residential address information corresponding to the residence identification position information.

Here, the edit distance corresponding to the target address position information is the smallest among the address position information whose corresponding edit distance is a positive number.
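Substeps 2041 to 2043 may be sketched as follows. The function names are hypothetical, returning `None` when no address follows a residence identification is an assumption (the description does not cover that case), and the final slice treats positions as zero-based, end-exclusive indices, which is also an assumption. The position values are those used in the worked example of this description.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start position, end position)


def edit_distances(target_end: int, addr_positions: Sequence[Span]) -> List[int]:
    """Substep 2042: start position of each address minus the target end position."""
    return [start - target_end for start, _end in addr_positions]


def target_address(id_position: Span, addr_positions: Sequence[Span]) -> Optional[Span]:
    """Pick the address with the smallest positive edit distance, if any."""
    target_end = id_position[1]                                  # substep 2041
    following = [(start - target_end, (start, end))
                 for start, end in addr_positions
                 if start - target_end > 0]                      # positive distances only
    if not following:
        return None  # assumption: no address after this identification
    return min(following, key=lambda item: item[0])[1]


def extract(text: str, id_positions: Sequence[Span],
            addr_positions: Sequence[Span]) -> List[Optional[str]]:
    """Substep 2043 for every residence identification in the sequence."""
    results = []
    for id_position in id_positions:
        target = target_address(id_position, addr_positions)
        results.append(None if target is None else text[target[0]:target[1]])
    return results


id_positions = [(4, 7), (14, 16), (24, 26)]
addr_positions = [(9, 12), (17, 21), (29, 33)]
targets = [target_address(p, addr_positions) for p in id_positions]
```

Restricting the choice to positive edit distances discards addresses that start before the residence identification ends, so each identification is paired with the nearest address that follows it.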

In practice, the corresponding residence address appears after each residence identification. The residence identification and the corresponding residence address may be directly adjacent, or other characters may lie between the two, but the two are not far apart. Therefore, the end position of a residence identification precedes the start position of its corresponding residence address, and the difference between the start position of the address corresponding to the residence identification and the end position of the residence identification should be greater than or equal to zero. To facilitate understanding of the substeps of step 204, an example is given below:

Suppose the alarm receiving text of the residential address information to be extracted reads, in translation, "a certain Zhang, whose former residence was City B, Province A, and whose current residence is located in District D, City C, has a labor dispute with a certain Li located in a district of City E". The residence identification position information sequence {"start position-4; end position-7", "start position-14; end position-16", "start position-24; end position-26"} can be obtained through step 202. The address position information sequence {"start position-9; end position-12", "start position-17; end position-21", "start position-29; end position-33"} can be obtained through step 203. In step 204, the residential address information extraction operation is performed for each piece of residence identification position information in the residence identification position information sequence {"start position-4; end position-7", "start position-14; end position-16", "start position-24; end position-26"}. That is, the residential address information extraction operation is performed for the residence identification position information "start position-4; end position-7", "start position-14; end position-16" and "start position-24; end position-26", respectively.

The residential address information extraction operation for the residence identification position information "start position-4; end position-7" proceeds as follows. First, the end position "7" in the residence identification position information "start position-4; end position-7" is determined as the target end position, i.e., the target end position is 7. Then, for each piece of address position information in the address position information sequence {"start position-9; end position-12", "start position-17; end position-21", "start position-29; end position-33"}, the difference obtained by subtracting the target end position 7 from the start position in the address position information is determined as the edit distance corresponding to the address position information. That is, three edit distances {2, 10, 22} are obtained. Finally, the text between the start position and the end position in the target address position information, within the alarm receiving text of the residential address information to be extracted, is determined as the residential address information corresponding to the residence identification position information. The target address position information is the address position information in the address position information sequence obtained in step 203 whose corresponding edit distance is a positive number and is the smallest among the address position information whose corresponding edit distance is a positive number. From the three edit distances obtained above, "start position-9; end position-12" is the target address position information. Therefore, the text between start position 9 and end position 12 of the alarm receiving text of the residential address information to be extracted, namely "City B, Province A", is determined as the residential address information corresponding to the residence identification position information "start position-4; end position-7".

Identifying location information "home location-14 for the residence; the specific procedure for performing the residential address information extraction operation of end position-16 "is as follows: first, the place of residence is identified with the location information "home location-14; an end position "16" of the end positions-16 "is determined as a target end position, i.e., the target end position is 16. Then, for the address location information sequence { "start location-9; end position-12 "," start position-17; end position-21 "," start position-29; and determining the difference obtained by subtracting the target end position 16 from the start position in the address position information as the editing distance corresponding to the address position information for each address position information in the end position-33' }. Namely, three edit distances { -7, 1, 13} are obtained, respectively. And finally, determining the text between the starting position and the ending position in the target address position information in the process that the address information of the residence to be extracted receives the alarm text "Zhang-a, the place where the residence was in the city of the province, the city of the third city, and the labor resource dispute exists between the current residence and the Liza located in the city of the third city as the residence address information corresponding to the residence identification position information. The destination address information is address position information in the address position information sequence obtained in step 203, the edit distance corresponding to the destination address information is a positive number, and the edit distance corresponding to the destination address position information is the smallest in each address position information whose corresponding edit distance is a positive number. 
From the three edit distances obtained above, the address position information "start position-17; end position-21" is the target address position information. Accordingly, the text "City C, Community D" between start position 17 and end position 21 of the alarm receiving and processing text to be extracted is determined as the residence address information corresponding to the residence identification position information "start position-14; end position-16".

The specific procedure for performing the residence address information extraction operation on the residence identification position information "start position-24; end position-26" is as follows. First, the end position "26" in the residence identification position information "start position-24; end position-26" is determined as the target end position, i.e., the target end position is 26. Then, for each address position information in the address position information sequence {"start position-9; end position-12", "start position-17; end position-21", "start position-29; end position-33"}, the difference obtained by subtracting the target end position 26 from the start position in the address position information is determined as the edit distance corresponding to the address position information. That is, three edit distances {-17, -9, 3} are obtained. Finally, the text between the start position and the end position in the target address position information, within the alarm receiving and processing text to be extracted "Zhang, whose place of residence was formerly Province A, City B, currently resides in City C, Community D, and has a labor dispute with Li, whose current address is in Community E", is determined as the residence address information corresponding to the residence identification position information. Here, the target address position information is address position information in the address position information sequence obtained in step 203 whose corresponding edit distance is a positive number and is the smallest among all address position information whose corresponding edit distance is a positive number.
From the three edit distances obtained above, the address position information "start position-29; end position-33" is the target address position information. Accordingly, the text "Community E" between start position 29 and end position 33 of the alarm receiving and processing text to be extracted is determined as the residence address information corresponding to the residence identification position information "start position-24; end position-26".
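The selection rule walked through above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `select_target_address` and the tuple representation of position information are assumptions introduced here, while the position values are the ones used in the example.

```python
def select_target_address(target_end, address_spans):
    """Return the (start, end) span with the smallest positive edit distance.

    target_end    -- end position of one residence identification
    address_spans -- list of (start, end) address positions
    Returns None when no address has a positive edit distance.
    """
    best = None
    for start, end in address_spans:
        distance = start - target_end  # "edit distance" as defined in step 204
        if distance > 0 and (best is None or distance < best[0]):
            best = (distance, (start, end))
    return None if best is None else best[1]


# Address position information sequence from the example.
ADDRESS_SPANS = [(9, 12), (17, 21), (29, 33)]

# Edit distances for the residence identifications ending at 16 and at 26.
distances_for_16 = [start - 16 for start, _ in ADDRESS_SPANS]
distances_for_26 = [start - 26 for start, _ in ADDRESS_SPANS]

target_for_16 = select_target_address(16, ADDRESS_SPANS)
target_for_26 = select_target_address(26, ADDRESS_SPANS)
```

With the example values, the identification ending at 16 is paired with the span (17, 21) and the identification ending at 26 with the span (29, 33), matching the walkthrough.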

Step 205, determining the residence address information corresponding to each residence identification position information in the residence identification position information sequence as the residence address information set corresponding to the alarm receiving and processing text to be extracted.

Continuing the example in step 204, step 205 yields the residence address information set {"Province A, City B", "City C, Community D", "Community E"} corresponding to the alarm receiving and processing text "Zhang, whose place of residence was formerly Province A, City B, currently resides in City C, Community D, and has a labor dispute with Li, whose current address is in Community E".
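For orientation, steps 202 to 205 can be sketched end to end with Python's standard `re` module. Everything specific below is an assumption made for illustration: the two patterns are hand-written stand-ins (the disclosure obtains its regular expressions by training), the English sample sentence is a stand-in for the alarm text, positions are converted to 1-based inclusive values, and the result is returned as an ordered list rather than a set.

```python
import re

# Hypothetical stand-in patterns, NOT the trained expressions of the disclosure.
RESIDENCE_ID_RE = re.compile(r"formerly lived in|now lives in|current address is")
ADDRESS_RE = re.compile(
    r"\b[A-Z] (?:Province|City|Community)(?: [A-Z] (?:Province|City|Community))*"
)


def spans(pattern, text):
    """Return 1-based, inclusive (start, end) positions of every match."""
    return [(m.start() + 1, m.end()) for m in pattern.finditer(text)]


def extract_residence_addresses(text):
    id_spans = spans(RESIDENCE_ID_RE, text)    # step 202
    address_spans = spans(ADDRESS_RE, text)    # step 203
    results = []
    for _, target_end in id_spans:             # step 204, once per identification
        positive = [
            (start - target_end, start, end)
            for start, end in address_spans
            if start - target_end > 0
        ]
        if positive:
            _, start, end = min(positive)      # smallest positive edit distance
            results.append(text[start - 1:end])
    return results                             # step 205


SAMPLE_TEXT = (
    "Zhang formerly lived in A Province B City, now lives in C City D Community, "
    "and has a labor dispute with Li, whose current address is E Community."
)
addresses = extract_residence_addresses(SAMPLE_TEXT)
```

Each residence identification is paired with the nearest address that starts after it, so the sample sentence yields one address per identification, in order of appearance.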

In practice, a residence identification and its corresponding residence address information are usually not far apart in the text. Therefore, in some optional implementations of this embodiment, for each residence identification position information in step 204, the edit distance corresponding to the target address position information may additionally be required to be smaller than a preset edit distance threshold. Here, the preset edit distance threshold may be set manually.
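A minimal sketch of this optional constraint follows, assuming the same tuple representation of position information as in the worked example. The function name `select_within_threshold` is hypothetical, and the strict "smaller than" comparison follows the wording above.

```python
def select_within_threshold(target_end, address_spans, threshold):
    """Pick the address with the smallest positive edit distance below threshold.

    Returns the (start, end) span, or None if no address qualifies.
    """
    candidates = [
        (start - target_end, (start, end))
        for start, end in address_spans
        if 0 < start - target_end < threshold  # positive and strictly below threshold
    ]
    return min(candidates)[1] if candidates else None


ADDRESS_SPANS = [(9, 12), (17, 21), (29, 33)]
kept = select_within_threshold(26, ADDRESS_SPANS, threshold=4)      # distance 3 < 4
rejected = select_within_threshold(26, ADDRESS_SPANS, threshold=3)  # distance 3 is not < 3
```

Under this reading, an identification whose nearest following address lies too far away simply contributes no address, instead of being paired with a distant and probably unrelated one.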

In some optional implementations of this embodiment, the preset edit distance threshold may be pre-calculated by a third training step as shown in fig. 5. Referring to fig. 5, fig. 5 shows a flow 500 of one embodiment of a third training step according to the present disclosure. The third training step flow 500 may include the following steps:

Step 501, a third training sample set is obtained.

Here, the execution subject of the third training step may be the same as that of the regular expression-based alarm text residential address information extraction method described above. In this way, the execution subject of the third training step may store the preset edit distance threshold locally in the execution subject after the preset edit distance threshold is obtained by training, and read the preset edit distance threshold obtained by training in the process of executing the regular expression-based method for extracting address information of the alarm receiving text residence.

Here, the execution subject of the third training step may also be different from that of the regular expression-based alarm receiving text residence address information extraction method described above. In this case, the execution subject of the third training step may, after obtaining the preset edit distance threshold through training, send it to the execution subject of the regular expression-based alarm receiving text residence address information extraction method. The execution subject of that method may then read the preset edit distance threshold received from the execution subject of the third training step while executing the method.

Here, the third training sample may include a historical alarm receiving and processing text and a corresponding labeled residence information sequence. The labeled residence information may include a residence identification start position, a residence identification end position, an address start position, and an address end position. The labeled residence information is used to represent that a residence identification is located between the residence identification start position and the residence identification end position in the corresponding historical alarm receiving and processing text, and that the residence address information corresponding to that residence identification is the address information between the address start position and the address end position in the historical alarm receiving and processing text.

The labeled residence information sequence in the third training sample may be obtained by manually labeling the corresponding historical alarm receiving and processing text.

It should be noted that, in practice, an alarm receiving and processing text may include no residence information or at least one piece of residence information. Thus, the labeled residence information sequence in a third training sample may be empty or may include at least one piece of labeled residence information.

Step 502, for each third training sample in the third training sample set, determining the maximum value of the edit distances corresponding to the respective labeled residence information in the labeled residence information sequence of the third training sample as the maximum edit distance corresponding to the third training sample.

Here, the edit distance corresponding to labeled residence information is the difference obtained by subtracting the residence identification end position from the address start position in the labeled residence information.

For ease of understanding, the following example illustrates the maximum edit distance corresponding to a third training sample. Suppose the historical alarm receiving and processing text in the third training sample is "Zhang, whose place of residence was formerly Province A, City B, currently resides in City C, Community D, and has a labor dispute with Li, whose current address is in Community E", and the labeled residence information sequence corresponding to the historical alarm receiving and processing text is {"residence identification start position-4, residence identification end position-5, address start position-9, address end position-12", "residence identification start position-14, residence identification end position-16, address start position-17, address end position-21", "residence identification start position-24, residence identification end position-26, address start position-29, address end position-33"}, wherein:

The labeled residence information "residence identification start position-4, residence identification end position-5, address start position-9, address end position-12" is used for representing that the residence address information corresponding to the residence identification "place of residence" is "Province A, City B". The edit distance corresponding to this labeled residence information is 4, the difference obtained by subtracting the residence identification end position 5 from the address start position 9.

The labeled residence information "residence identification start position-14, residence identification end position-16, address start position-17, address end position-21" is used for representing that the residence address information corresponding to the residence identification "currently resides in" is "City C, Community D". The edit distance corresponding to this labeled residence information is 1, the difference obtained by subtracting the residence identification end position 16 from the address start position 17.

The labeled residence information "residence identification start position-24, residence identification end position-26, address start position-29, address end position-33" is used for representing that the residence address information corresponding to the residence identification "current address" is "Community E". The edit distance corresponding to this labeled residence information is 3, the difference obtained by subtracting the residence identification end position 26 from the address start position 29.

Therefore, the edit distances corresponding to the three pieces of labeled residence information in the labeled residence information sequence of the third training sample are 4, 1, and 3, respectively. The maximum of these is 4, so 4 is determined as the maximum edit distance corresponding to the third training sample.

Step 503, determining a maximum value of the maximum edit distances corresponding to each third training sample in the third training sample set as a preset edit distance threshold.

In step 502, a corresponding maximum edit distance is determined for each third training sample in the third training sample set, and therefore, in step 503, a maximum value of the corresponding maximum edit distances in each third training sample in the third training sample set may be determined as a preset edit distance threshold.
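Steps 502 and 503 amount to a maximum over per-sample maxima, which can be sketched as follows. The names `max_edit_distance` and `train_threshold` and the tuple layout are assumptions introduced for illustration. Skipping samples whose labeled sequence is empty is also an assumption, since the text does not say how such samples are handled.

```python
def max_edit_distance(labels):
    """Step 502: largest (address start - identification end) within one sample.

    labels -- list of (id_start, id_end, addr_start, addr_end) tuples
    """
    return max(addr_start - id_end for _, id_end, addr_start, _ in labels)


def train_threshold(samples):
    """Step 503: maximum of the per-sample maximum edit distances.

    samples -- list of (text, labels) pairs; samples with no labels are
    skipped (an assumption). Returns None if nothing is labeled.
    """
    per_sample = [max_edit_distance(labels) for _, labels in samples if labels]
    return max(per_sample, default=None)


# Labeled residence information sequence from the example above.
EXAMPLE_LABELS = [(4, 5, 9, 12), (14, 16, 17, 21), (24, 26, 29, 33)]
example_max = max_edit_distance(EXAMPLE_LABELS)
threshold = train_threshold([("text 1", EXAMPLE_LABELS), ("text 2", [])])
```

For the example sequence the per-sample maximum is 4, and with only that sample carrying labels the resulting threshold is also 4.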

Because the preset edit distance threshold produced by the third training step is obtained through statistical analysis of a large number of historical alarm receiving and processing texts, constraining the target address position information by this threshold while extracting residence address information from an alarm receiving and processing text can improve the accuracy of the extracted residence address information.

The method provided by the above embodiment of the present disclosure extracts each piece of residence address information from the alarm receiving and processing text to be extracted by using the residence identification extraction regular expression and the address extraction regular expression. This realizes automatic extraction of residence address information from alarm receiving and processing texts without manual operation, reduces the cost of the extraction, and increases its speed.

With further reference to fig. 6, as an implementation of the methods shown in the above diagrams, the present disclosure provides an embodiment of a regular expression-based device for extracting address information of a place where an alarm is received and processed, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be applied to various electronic devices.

As shown in fig. 6, the regular expression-based alarm receiving and processing text residence address information extraction apparatus 600 of the present embodiment includes: an acquisition unit 601, a first matching unit 602, a second matching unit 603, an extraction unit 604, and a determination unit 605. The acquisition unit 601 is configured to acquire an alarm receiving and processing text from which residence address information is to be extracted; the first matching unit 602 is configured to match the alarm receiving and processing text to be extracted with a residence identification extraction regular expression to obtain a residence identification position information sequence; the second matching unit 603 is configured to match the alarm receiving and processing text to be extracted with an address extraction regular expression to obtain an address position information sequence; the extraction unit 604 is configured to perform the following residence address information extraction operation for each residence identification position information in the residence identification position information sequence: determining the end position in the residence identification position information as a target end position; for each address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to the address position information; and determining the text between the start position and the end position in target address position information, within the alarm receiving and processing text to be extracted, as the residence address information corresponding to the residence identification position information, wherein the edit distance corresponding to the target address position information is the smallest among all address position information whose corresponding edit distance is positive; and the determination unit 605 is configured to determine the residence address information corresponding to each residence identification position information in the residence identification position information sequence as the residence address information set corresponding to the alarm receiving and processing text to be extracted.

In this embodiment, for the specific processing of the acquisition unit 601, the first matching unit 602, the second matching unit 603, the extraction unit 604, and the determination unit 605 of the regular expression-based alarm receiving and processing text residence address information extraction apparatus 600, and the technical effects thereof, reference may be made to the descriptions of step 201, step 202, step 203, step 204, and step 205 in the embodiment corresponding to fig. 2, which are not repeated here.

In some optional implementations of this embodiment, the residence identification extraction regular expression is obtained by pre-training through the following first training step: acquiring a first training sample set and a first test sample set, wherein the first training sample and the first test sample each comprise a historical alarm receiving and processing text and a corresponding labeled residence identification position information sequence, the labeled residence identification position information comprises a start position and an end position, and the labeled residence identification position information is used for representing that a residence identification is located between the start position and the end position in the historical alarm receiving and processing text; generating a first positive sample set from each first training sample in the first training sample set whose labeled residence identification position information sequence is not empty; selecting first positive samples from the first positive sample set to form a first target number of first positive sample subsets; for each first positive sample subset in the first target number of first positive sample subsets, generating a candidate regular expression corresponding to the first positive sample subset based on each first positive sample in the first positive sample subset; testing each generated candidate regular expression based on the first test sample set to determine the accuracy corresponding to each generated candidate regular expression; and determining the candidate regular expression with the highest accuracy among the generated candidate regular expressions as the residence identification extraction regular expression.

In some optional implementations of this embodiment, the selecting the first positive samples from the first positive sample set to form a first target number of first positive sample subsets includes: performing the first target number of times a first positive sample subset generating operation to generate the first target number of first positive sample subsets, the first positive sample subset generating operation comprising: and randomly selecting N first positive samples from the first positive sample set to form a first positive sample subset, wherein N is an integer obtained by rounding down a quotient of L divided by M, L is the number of the first positive samples in the first positive sample set, and M is a positive integer which is greater than or equal to 2 and smaller than L.
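The subset generation operation can be sketched with Python's standard `random` module. The function name `make_positive_subsets` and the `seed` parameter are hypothetical. Sampling without replacement inside each subset is an assumption, since the text only says the N samples are selected randomly.

```python
import random


def make_positive_subsets(positive_samples, m, target_number, seed=None):
    """Build `target_number` random subsets, each of size N = floor(L / M).

    positive_samples -- the first positive sample set (L items)
    m                -- the divisor M, required to satisfy 2 <= M < L
    """
    l = len(positive_samples)
    if not 2 <= m < l:
        raise ValueError("M must be at least 2 and smaller than L")
    n = l // m  # round the quotient L / M down
    rng = random.Random(seed)
    # One subset generating operation per iteration, repeated target_number times.
    return [rng.sample(positive_samples, n) for _ in range(target_number)]


# Ten hypothetical positive samples, M = 3, four subsets of size 10 // 3 = 3.
subsets = make_positive_subsets(list(range(10)), m=3, target_number=4, seed=0)
```

Because each subset is drawn independently, the same sample may appear in several subsets, and each subset then yields its own candidate regular expression.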

In some optional implementation manners of this embodiment, the address extraction regular expression is obtained by pre-training through the following second training step: acquiring a second training sample set and a second test sample set, wherein the second training sample and the second test sample both comprise a historical alarm receiving and processing text and a corresponding labeled address position information sequence, the labeled address position information comprises a starting position and an ending position, and the labeled address position information is used for representing that an address is arranged between the starting position and the ending position in the historical alarm receiving and processing text; generating a second positive sample set by using each second training sample marked with an address position information sequence which is not empty in the second training sample set; selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets; for each second positive sample subset in the second target number of second positive sample subsets, generating a candidate regular expression corresponding to the second positive sample subset based on each second positive sample in the second positive sample subset; testing each generated candidate regular expression based on the second test sample set to determine the accuracy corresponding to each generated candidate regular expression; and determining the candidate regular expression with the highest accuracy in the generated candidate regular expressions as the address extraction regular expression.

In some optional implementations of this embodiment, the selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets includes: performing the second target number of times a second positive sample subset generating operation to generate the second target number of second positive sample subsets, the second positive sample subset generating operation comprising: and randomly selecting N ' second positive samples from the second positive sample set to form a second positive sample subset, wherein N ' is an integer obtained by rounding down a quotient obtained by dividing L ' by M ', L ' is the number of the second positive samples in the second positive sample set, and M ' is a positive integer which is greater than or equal to 2 and smaller than L '.

In some optional implementation manners of this embodiment, an edit distance corresponding to the target address location information is smaller than a preset edit distance threshold.

In some optional implementations of this embodiment, the preset edit distance threshold is pre-calculated by the following third training step: acquiring a third training sample set, wherein the third training sample comprises a historical alarm receiving and processing text and a corresponding labeled residence information sequence, the labeled residence information comprises a residence identification start position, a residence identification end position, an address start position, and an address end position, the labeled residence information is used for representing that a residence identification is located between the residence identification start position and the residence identification end position in the historical alarm receiving and processing text, and the residence address information corresponding to the residence identification is the address information between the address start position and the address end position in the historical alarm receiving and processing text; for each third training sample in the third training sample set, determining the maximum value of the edit distances corresponding to the respective labeled residence information in the labeled residence information sequence of the third training sample as the maximum edit distance corresponding to the third training sample, wherein the edit distance corresponding to labeled residence information is the difference obtained by subtracting the residence identification end position from the address start position in the labeled residence information; and determining the maximum value of the maximum edit distances corresponding to the third training samples in the third training sample set as the preset edit distance threshold.

It should be noted that, for details and technical effects of implementation of each unit in the regular expression-based alarm receiving text residential address information extraction device according to the embodiment of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic devices of embodiments of the present disclosure. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An Input/Output (I/O) interface 705 is also connected to the bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a touch panel, a tablet, a keyboard, a mouse, or the like; an output section 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present disclosure. It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first matching unit, a second matching unit, an extraction unit, and a determination unit. Here, the names of the units do not constitute a limitation to the unit itself in some cases, and for example, the acquisition unit may also be described as a "unit that acquires a living address information alarm text to be extracted".

As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire an alarm receiving and processing text from which residence address information is to be extracted; match the text against a residence identification extraction regular expression to obtain a residence identification position information sequence; match the text against an address extraction regular expression to obtain an address position information sequence; for each piece of residence identification position information in the residence identification position information sequence, perform the following residence address information extraction operation: determine the end position in the residence identification position information as a target end position; for each piece of address position information in the address position information sequence, determine the difference obtained by subtracting the target end position from the start position in that address position information as the edit distance corresponding to that address position information; and determine the text between the start position and the end position of target address position information in the alarm receiving and processing text as the residence address information corresponding to the residence identification position information, wherein the target address position information is the address position information whose edit distance is the smallest among all address position information having a positive edit distance; and determine the residence address information corresponding to each piece of residence identification position information in the residence identification position information sequence as the residence address information set corresponding to the alarm receiving and processing text.
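As an illustrative, non-limiting sketch, the extraction operation described above may be expressed in Python as follows. The two regular expressions, the sample sentence, and the function names are hypothetical and chosen only for demonstration; the disclosure obtains its expressions through training and does not publish them. Note that the "edit distance" here is simply the offset from the end of a residence identification to the start of an address, as defined above. The sketch also assumes that a residence identification with no address after it contributes nothing to the result, a case the text does not address, and it uses Python's exclusive end offsets, so an address immediately adjacent to the identification would have a distance of 0 and would not count as positive.

```python
import re

# Hypothetical patterns, for illustration only (not from the disclosure).
RESIDENCE_ID_RE = re.compile(r"lives at|resides at")
ADDRESS_RE = re.compile(r"\d+ \w+ (?:Road|Street)")


def match_spans(pattern, text):
    """Return the (start, end) offsets of every match, in text order."""
    return [m.span() for m in pattern.finditer(text)]


def extract_residence_addresses(text):
    """Pair each residence identification with the nearest following address."""
    id_spans = match_spans(RESIDENCE_ID_RE, text)
    addr_spans = match_spans(ADDRESS_RE, text)
    results = []
    for _, target_end in id_spans:
        # Distance = address start minus identification end; keep only
        # addresses with a strictly positive distance.
        candidates = [
            (start - target_end, start, end)
            for start, end in addr_spans
            if start - target_end > 0
        ]
        if not candidates:
            continue  # assumption: no following address, nothing extracted
        _, start, end = min(candidates)  # smallest positive distance wins
        results.append(text[start:end])
    return results


sample = "Caller lives at 12 Elm Road and reports a theft at 99 Oak Street."
print(extract_residence_addresses(sample))
```

In this sample both addresses follow the identification "lives at", and the one with the smaller positive distance is selected, so the location of the reported incident is not mistaken for the caller's residence.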

The foregoing description is only an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept defined above, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims

1. A regular expression-based method for extracting residence address information from an alarm receiving and processing text, comprising the following steps:

acquiring a residence address information alarm receiving and processing text to be extracted;

matching the alarm receiving and handling text of the to-be-extracted residence address information with a residence identification extraction regular expression to obtain a residence identification position information sequence;

matching the alarm receiving and processing text of the to-be-extracted residential address information with an address extraction regular expression to obtain an address position information sequence;

for each piece of residence identification position information in the residence identification position information sequence, performing the following residence address information extraction operation: determining the end position in the residence identification position information as a target end position; for each piece of address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to the address position information; and determining the text between the start position and the end position of target address position information in the to-be-extracted residence address information alarm receiving and processing text as the residence address information corresponding to the residence identification position information, wherein the edit distance corresponding to the target address position information is the smallest among the address position information having a positive edit distance;

and determining the residence address information corresponding to each residence identification position information in the residence identification position information sequence as a residence address information set corresponding to the to-be-extracted residence address information alarm receiving text.

2. The method of claim 1, wherein the residence identification extraction regular expression is pre-trained by a first training step of:

acquiring a first training sample set and a first test sample set, wherein each first training sample and each first test sample comprises a historical alarm receiving and processing text and a corresponding labeled residence identification position information sequence, and each piece of labeled residence identification position information comprises a starting position and an ending position and indicates that a residence identification is located between the starting position and the ending position in the historical alarm receiving and processing text;

generating a first positive sample set from the first training samples in the first training sample set whose labeled residence identification position information sequences are not empty;

selecting first positive samples from the first positive sample set to form a first target number of first positive sample subsets;

for each first positive sample subset in the first target number of first positive sample subsets, generating a candidate regular expression corresponding to the first positive sample subset based on each first positive sample in the first positive sample subset;

testing each generated candidate regular expression based on the first test sample set to determine an accuracy corresponding to each generated candidate regular expression;

and determining the candidate regular expression with the highest accuracy in the generated candidate regular expressions as the residence identification extraction regular expression.

3. The method of claim 2, wherein selecting first positive samples from the first positive sample set to form a first target number of first positive sample subsets comprises:

performing a first positive sample subset generation operation the first target number of times to generate the first target number of first positive sample subsets, the first positive sample subset generation operation comprising: randomly selecting N first positive samples from the first positive sample set to form a first positive sample subset, wherein N is the integer obtained by rounding down the quotient of L divided by M, L is the number of first positive samples in the first positive sample set, and M is a positive integer greater than or equal to 2 and smaller than L.
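By way of a hedged example only, the subset generation and candidate selection steps of claims 2 and 3 might be sketched in Python as below. The function names, the sample representation (a text paired with a list of labeled `(start, end)` spans), and the definition of accuracy as the fraction of test samples whose matched spans equal the labeled spans are all assumptions; the claims do not specify how accuracy is computed or how a candidate regular expression is generated from a subset, so candidate generation is deliberately left out of the sketch.

```python
import random
import re


def make_subsets(positive_samples, m, target_number, seed=None):
    """Draw `target_number` random subsets, each of size floor(L / M)."""
    total = len(positive_samples)  # L in the claim
    if not (2 <= m < total):
        raise ValueError("M must satisfy 2 <= M < L")
    n = total // m  # N = floor(L / M)
    rng = random.Random(seed)
    # random.Random.sample selects without replacement within one subset.
    return [rng.sample(positive_samples, n) for _ in range(target_number)]


def accuracy(pattern, test_samples):
    """Assumed metric: share of samples whose matches equal the labels."""
    compiled = re.compile(pattern)
    hits = sum(
        [match.span() for match in compiled.finditer(text)] == list(labeled)
        for text, labeled in test_samples
    )
    return hits / len(test_samples)


def select_best(candidates, test_samples):
    """Return the candidate regular expression with the highest accuracy."""
    return max(candidates, key=lambda p: accuracy(p, test_samples))


# Hypothetical test samples: (text, labeled residence identification spans).
test_samples = [
    ("he lives at 1 A Road", [(3, 11)]),
    ("no marker here", []),
]
print(select_best(["at", "lives at"], test_samples))
```

Drawing several small random subsets, rather than using the whole positive sample set at once, yields multiple competing candidates whose quality can then be compared on the held-out test samples.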

4. The method of claim 1, wherein the address extraction regular expression is pre-trained by a second training step of:

acquiring a second training sample set and a second test sample set, wherein each second training sample and each second test sample comprises a historical alarm receiving and processing text and a corresponding labeled address position information sequence, and each piece of labeled address position information comprises a starting position and an ending position and indicates that an address is located between the starting position and the ending position in the historical alarm receiving and processing text;

generating a second positive sample set from the second training samples in the second training sample set whose labeled address position information sequences are not empty;

selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets;

for each second positive sample subset in the second target number of second positive sample subsets, generating a candidate regular expression corresponding to the second positive sample subset based on each second positive sample in the second positive sample subset;

testing each generated candidate regular expression based on the second test sample set to determine an accuracy corresponding to each generated candidate regular expression;

and determining the candidate regular expression with the highest accuracy in the generated candidate regular expressions as the address extraction regular expression.

5. The method of claim 4, wherein selecting second positive samples from the second positive sample set to form a second target number of second positive sample subsets comprises:

performing a second positive sample subset generation operation the second target number of times to generate the second target number of second positive sample subsets, the second positive sample subset generation operation comprising: randomly selecting N' second positive samples from the second positive sample set to form a second positive sample subset, wherein N' is the integer obtained by rounding down the quotient of L' divided by M', L' is the number of second positive samples in the second positive sample set, and M' is a positive integer greater than or equal to 2 and smaller than L'.

6. The method of claim 1, wherein the edit distance corresponding to the target address location information is less than a preset edit distance threshold.

7. The method of claim 6, wherein the preset edit distance threshold is pre-calculated by a third training step of:

acquiring a third training sample set, wherein each third training sample comprises a historical alarm receiving and processing text and a corresponding residence labeling information sequence, each piece of residence labeling information comprises a residence identification starting position, a residence identification ending position, an address starting position and an address ending position, and the residence labeling information indicates that a residence identification is located between the residence identification starting position and the residence identification ending position in the historical alarm receiving and processing text, and that the residence address information corresponding to the residence identification is the address information between the address starting position and the address ending position in the historical alarm receiving and processing text;

for each third training sample in the third training sample set, determining the maximum of the edit distances corresponding to the pieces of residence labeling information in the residence labeling information sequence of the third training sample as the maximum edit distance corresponding to the third training sample, wherein the edit distance corresponding to a piece of residence labeling information is the difference obtained by subtracting the residence identification ending position from the address starting position in that residence labeling information;

and determining the maximum value of the maximum edit distances corresponding to the third training samples in the third training sample set as the preset edit distance threshold.
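The threshold computation of claim 7 can be illustrated, under stated assumptions, by the short Python sketch below. The tuple layout `(id_start, id_end, addr_start, addr_end)`, the function name, and the choice to skip samples whose labeling sequence is empty are hypothetical; the positions in the example data are invented for demonstration.

```python
def edit_distance_threshold(samples):
    """Return the largest labeled edit distance over all training samples.

    Each sample is (text, labels); each label is a tuple
    (id_start, id_end, addr_start, addr_end). The edit distance of a label
    is addr_start - id_end, as defined in the claim.
    """
    per_sample_max = [
        max(addr_start - id_end for _, id_end, addr_start, _ in labels)
        for _, labels in samples
        if labels  # assumption: samples without labels are ignored
    ]
    return max(per_sample_max)


# Hypothetical labeled data, for illustration only.
samples = [
    ("text one", [(0, 5, 6, 12), (20, 25, 30, 40)]),  # distances 1 and 5
    ("text two", [(3, 8, 10, 15)]),                   # distance 2
]
print(edit_distance_threshold(samples))
```

Under claim 6, a candidate address is then accepted only if its edit distance is less than this threshold, which bounds how far after a residence identification an address may begin.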

8. A regular expression-based device for extracting residence address information from an alarm receiving and processing text, comprising:

an acquisition unit configured to acquire a residence address information alarm receiving text to be extracted;

a first matching unit configured to match the to-be-extracted residence address information alarm receiving and processing text with a residence identification extraction regular expression to obtain a residence identification position information sequence;

a second matching unit configured to match the to-be-extracted residence address information alarm receiving and processing text with an address extraction regular expression to obtain an address position information sequence;

an extraction unit configured to perform the following residence address information extraction operation for each piece of residence identification position information in the residence identification position information sequence: determining the end position in the residence identification position information as a target end position; for each piece of address position information in the address position information sequence, determining the difference obtained by subtracting the target end position from the start position in the address position information as the edit distance corresponding to the address position information; and determining the text between the start position and the end position of target address position information in the to-be-extracted residence address information alarm receiving and processing text as the residence address information corresponding to the residence identification position information, wherein the edit distance corresponding to the target address position information is the smallest among the address position information having a positive edit distance;

a determining unit configured to determine the residence address information corresponding to each piece of residence identification position information in the residence identification position information sequence as the residence address information set corresponding to the to-be-extracted residence address information alarm receiving and processing text.

9. The apparatus of claim 8, wherein the residence identification extraction regular expression is pre-trained by a first training step of:

10. The apparatus of claim 9, wherein said selecting first positive samples in the first set of positive samples constitutes a first target number of first subsets of positive samples, comprising:

11. The apparatus of claim 8, wherein the address extraction regular expression is pre-trained by a second training step of:

12. The apparatus of claim 11, wherein said selecting second positive samples in the second set of positive samples constitutes a second target number of second subsets of positive samples, comprising:

13. The apparatus of claim 8, wherein the edit distance corresponding to the target address location information is less than a preset edit distance threshold.

14. The apparatus of claim 13, wherein the preset edit distance threshold is pre-calculated by a third training step of:

15. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.

16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.