CN111309861B - Site extraction method, apparatus, electronic device, and computer-readable storage medium - Google Patents


Info

Publication number
CN111309861B
CN111309861B (application CN202010083644.7A)
Authority
CN
China
Prior art keywords
place
character
places
label
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010083644.7A
Other languages
Chinese (zh)
Other versions
CN111309861A (en)
Inventor
席丽娜
王文军
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co ltd
Original Assignee
Dingfu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co ltd filed Critical Dingfu Intelligent Technology Co ltd
Priority to CN202010083644.7A priority Critical patent/CN111309861B/en
Publication of CN111309861A publication Critical patent/CN111309861A/en
Application granted granted Critical
Publication of CN111309861B publication Critical patent/CN111309861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a place extraction method and apparatus, an electronic device, and a computer-readable storage medium, and belongs to the field of text processing. The method comprises the following steps: after a text to be processed is obtained, it is input into a pre-created sequence model, and the sequence model screens out the places. Because the training set used to train the sequence model includes place labels of each granularity, the resulting sequence model can fully learn the features of places of each range class. After a decision document text is input into the sequence model, the model can screen out places of different range classes, avoiding situations where many small places go unextracted or long places are extracted incompletely.

Description

Site extraction method, apparatus, electronic device, and computer-readable storage medium
Technical Field
The application belongs to the field of text processing, and particularly relates to a place extraction method, a place extraction device, electronic equipment and a computer readable storage medium.
Background
In the text of a criminal decision document, a large number of words are often needed to describe a complex case. Among these words are the places related to the case. Because the wording used to describe these places is generally complicated, readers trying to trace the facts of a case can easily become confused, which greatly increases their workload. To solve this problem, the places in a decision document text are usually extracted by a trained neural network model.
In the prior art, when training such a neural network model, labels indicating whether the text is a place are generally added to a large corpus of decision document text, and the neural network model is then trained using the labeled corpus as a training set.
However, in real decision texts the expressions of places are diverse and vary in length. When the training set is built, every place in the corpus is marked with the same label, so that single label covers many different kinds of places. This invisibly increases the learning difficulty of the neural network model, with the result that when places are extracted from a decision text by the trained model, many small places go unextracted or long places are extracted incompletely, so the best extraction effect cannot be achieved.
Disclosure of Invention
In view of the above, an object of the present application is to provide a place extraction method and apparatus, an electronic device, and a computer-readable storage medium, which can screen out places of different range classes and avoid situations in which many small places go unextracted or long places are extracted incompletely.
Embodiments of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides a place extraction method, the method comprising: acquiring a text to be processed; and inputting the text to be processed into a pre-created sequence model to screen out places of different granularities. Each character of each sample in the training set used to train the sequence model carries a place label, the place label indicating whether the character is part of a place, and the place labels contained in the plurality of samples composing the training set differ in granularity. Because the training set includes place labels of each granularity, the resulting sequence model can fully learn the features of places of each range class. After a decision document text is input into the sequence model, the model can screen out places of different range classes, avoiding situations where many small places go unextracted or long places are extracted incompletely.
With reference to the first aspect embodiment, in a possible implementation manner, before the obtaining of the text to be processed, the method further includes: acquiring the samples; in response to a label-adding instruction from a user, adding the place label to each character included in a sample, the place label indicating whether the character is part of a place; and forming a plurality of samples into the training set and inputting the training set into a first network model for training, to obtain the sequence model for place screening.
With reference to the first aspect embodiment, in a possible implementation manner, before responding to the user's label-adding instruction, the method further includes: correcting erroneous characters included in the sample, so as to avoid unduly degrading the model obtained by subsequent training.
With reference to the first aspect embodiment, in a possible implementation manner, after the place label is added to each character included in the sample, the method further includes: adding a classification label to each place included in the sample, the classification label indicating whether the place is a needed place; and inputting the training set with the added classification labels into a second network model for training, to obtain a classification model for classifying places. Once added, the classification labels can be used to train the classification model.
With reference to the first aspect embodiment, in a possible implementation manner, after the places of different granularities are screened out, the method further includes: inputting the places of different granularities into the classification model and screening out the needed places. The trained classification model can separate the places in the text to be processed that are related to the case from those that are not, thereby screening out the places related to the case.
With reference to the first aspect embodiment, in a possible implementation manner, after the needed places are screened out, the method further includes: highlighting the needed places in the text to be processed. After highlighting, readers of the text to be processed can conveniently comb through the case.
In a second aspect, an embodiment of the present application provides a place extraction apparatus comprising an acquisition module and a screening module. The acquisition module is used to acquire the text to be processed; the screening module is used to input the text to be processed into a pre-created sequence model and screen out places of different granularities. Each character of each sample in the training set used to train the sequence model carries a place label, the place label indicating whether the character is part of a place, and the place labels contained in the plurality of samples composing the training set differ in granularity.
With reference to the second aspect of the embodiment, in a possible implementation manner, the location extraction device further includes a response module and a training module; the acquisition module is further used for acquiring the sample; the response module is used for responding to a label adding instruction of a user, adding the place label to each character included in the sample, and the place label is used for representing whether the character is a place or not; the training module is used for forming a plurality of samples into the training set and inputting the training set into a first network model for training to obtain the sequence model for site screening.
With reference to the second aspect of the embodiment, in a possible implementation manner, the location extraction device further includes a correction module, configured to correct an error character included in the sample.
With reference to the second aspect of the embodiment, in a possible implementation manner, the response module is further configured to add a classification tag to each location included in the sample, where the classification tag is used to characterize whether the location is a desired location; the training module is further configured to input the training set added with the classification label into a second network model for training, so as to obtain a classification model for performing location classification.
With reference to the second aspect of the embodiment, in a possible implementation manner, the screening module is further configured to input the locations with different granularities into the classification model to screen out a required location.
With reference to the second aspect of the embodiment, in a possible implementation manner, the location extraction device further includes a display module, configured to highlight the required location in the text to be processed.
In a third aspect, an embodiment of the present application further provides an electronic device, including: the device comprises a memory and a processor, wherein the memory is connected with the processor; the memory is used for storing programs; the processor invokes a program stored in the memory to perform the above-described first aspect embodiment and/or the method provided in connection with any one of the possible implementations of the first aspect embodiment.
In a fourth aspect, embodiments of the present application further provide a non-volatile computer readable storage medium (hereinafter referred to as computer readable storage medium), on which a computer program is stored, which when executed by a computer performs the above-described embodiments of the first aspect and/or the method provided in connection with any one of the possible implementations of the embodiments of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The above and other objects, features and advantages of the present application will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the several views of the drawings. The drawings are not intended to be drawn to scale, with emphasis instead being placed upon illustrating the principles of the application.
Fig. 1 shows a flowchart of a location extraction method according to an embodiment of the present application.
Fig. 2 shows a training flowchart of a sequence model provided by an embodiment of the present application.
Fig. 3 shows a schematic labeling diagram of a location tag according to an embodiment of the present application.
Fig. 4 shows a training flowchart of the classification model provided by the embodiment of the application.
Fig. 5 shows a block diagram of a location extraction apparatus according to an embodiment of the present application.
Fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 100-an electronic device; 110-a processor; 120-memory; 400-place extraction means; 410-an acquisition module; 420-a screening module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely to distinguish one entity or action from another entity or action in the description of the application without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Furthermore, the term "and/or" in the present application is merely an association relationship describing the association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone.
Furthermore, the defects of the existing neural network models for extracting places from decision document text were identified by the applicant through practice and careful study. Therefore, both the discovery of these defects and the solutions that the embodiments of the present application propose for them below should be regarded as contributions made by the applicant in the course of the present application.
In order to solve the defects in the prior art, the embodiment of the application provides a place extraction method, a place extraction device, electronic equipment and a computer readable storage medium, which can construct a sequence model with good extraction effect on various places.
The technology can be realized by adopting corresponding software, hardware and a combination of the software and the hardware. Embodiments of the present application are described in detail below.
The following describes a location extraction method provided by the present application.
Referring to fig. 1, an embodiment of the present application provides a location extraction method applied to an electronic device. The steps involved will be described below in connection with fig. 1.
Step S101: and acquiring a text to be processed.
Step S102: inputting the text to be processed into a pre-created sequence model, and screening out places with different granularities.
In the embodiment of the application, the text to be processed may be a decision document text.
After the text to be processed is obtained, the text to be processed is input into a sequence model, so that the sequence model screens out places with different granularities.
Referring to fig. 2, the sequence model is trained as follows.
Step S110: a sample is obtained.
In the embodiment of the present application, the sample may be a decision document text (TXT file) of various real cases.
In addition, since decision document texts at the publication stage are generally PDF (Portable Document Format) files, errors such as garbled characters, redundant line breaks, and redundant spaces are unavoidable when a PDF file is parsed into a TXT file. Therefore, to avoid unduly degrading the model obtained by subsequent training, in an alternative embodiment the erroneous characters included in the sample may be corrected after the sample is obtained, for example by deleting erroneous characters, deleting redundant characters, and the like.
The correction may be performed by a pre-stored correction model, or the errors in the sample may be modified directly in response to instructions input by a user.
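As an illustrative sketch of this correction step, the following Python function applies a few assumed cleanup rules (stripping control and replacement characters, collapsing redundant spaces and line breaks) to text parsed from a PDF. The function name and the specific rules are assumptions for illustration, not details fixed by the patent.

```python
import re

def clean_parsed_text(raw: str) -> str:
    """Illustrative cleanup of text parsed from a PDF decision document."""
    # Drop control characters and Unicode replacement characters (garbled codes).
    text = re.sub(r"[\x00-\x08\x0b-\x1f\ufffd]", "", raw)
    # Collapse redundant spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse redundant blank lines into a single line break.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()
```

In practice such rules would be tuned to the artifacts actually produced by the PDF parser in use.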
Step S120: in response to a label-adding instruction from a user, adding a place label to each character included in the sample, the place label indicating whether the character is part of a place.
Each decision document text serving as a sample contains a large number of sentences, each consisting of multiple characters. In the embodiment of the present application, place tags are added sentence by sentence to each character of the sample. For each character, its place tag indicates whether the character itself forms part of a place.
In addition, in the embodiment of the application, places are divided into different range classes according to their completeness and level of detail, and the place label types correspondingly include labels of several granularities, so that places of different range classes can be marked.
In an alternative embodiment, places are divided into complete, detailed places (e.g., "XX residential quarter, XX district, XX city"), places that are neither fully complete nor fully detailed (e.g., "XX road"), and minimum-level places (e.g., "XX building"); the place tag types may accordingly include Large, Medium, Tiny, and Out.
Out indicates that, within its sentence, the character does not form part of any place; Large, Medium, and Tiny indicate that the character does form part of a place in its sentence.
Specifically, Large denotes a complete, detailed place. When a character's place tag is Large, the tag indicates that the character forms part of a complete, detailed place in its sentence.
Medium denotes a place that is neither fully complete nor fully detailed. When a character's place tag is Medium, the tag indicates that the character forms part of such a place in its sentence.
Tiny denotes a minimum-level place. When a character's place tag is Tiny, the tag indicates that the character forms part of a minimum-level place in its sentence.
In addition, when place tags are added, if a character's place tag indicates that the character forms part of a place, an identifier marking the start of the place also needs to be added to the tag. For example, the letter B marks the beginning character of a place, while the letter I marks its middle and ending characters.
With B marking the beginning character of a place and I marking its middle and ending characters, if a sentence contains a complete, detailed place, the first character of that place is given the tag B-Large and the other characters composing it are given the tag I-Large. Referring to fig. 3, take the sentence "Xiao Wang is at D quarter, C road, B district, A city …" as an example. In this sentence, the characters for "Xiao", "Wang", and "is at" do not compose a place, so each of these three characters receives the tag Out. The place composed of the characters "A", "city", "B", "district", "C", "road", "D", "small", and "quarter" is a complete, detailed place, so each of these characters receives a Large tag: the character "A" is the first character of the place, so its place tag is B-Large, and the remaining characters composing the place ("city", "B", "district", "C", "road", "D", "small", "quarter") receive the tag I-Large.
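The sentence-level B-/I-/Out tagging described above can be sketched as follows. The helper function and the (start, end) span representation are illustrative assumptions; the tag scheme itself (B-Large, I-Large, Out) follows the patent.

```python
def tag_sentence(chars, place_span, granularity):
    """Assign a place tag to each character of one sentence.

    `chars` is the character list of the sentence, `place_span` the
    (start, end) index range of a place within it, and `granularity`
    one of "Large", "Medium", "Tiny".
    """
    start, end = place_span
    tags = []
    for i, _ in enumerate(chars):
        if i == start:
            tags.append(f"B-{granularity}")   # first character of the place
        elif start < i < end:
            tags.append(f"I-{granularity}")   # middle/ending characters
        else:
            tags.append("Out")                # not part of any place
    return tags

# The fig. 3 example: the first three characters ("Xiao", "Wang", "is at")
# are Out, and the complete, detailed place starts at index 3.
chars = list("小王在A市B区C路D小区")
print(tag_sentence(chars, (3, len(chars)), "Large"))
```

A real annotation tool would of course handle multiple places per sentence and mixed granularities; this sketch shows only the single-place case.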
Step S130: after a plurality of samples are formed into a training set, the training set, which contains place labels of different granularities, is input into a first network model for training to obtain a sequence model for place screening.
After each decision document text serving as a sample is labeled, a plurality of samples form the training set used to train the model. The training set includes place tags of each granularity.
In the embodiment of the application, a training set is input into a first network model for training to obtain a sequence model for site screening.
The model structure of the first network model may comprise ALBERT (A Lite BERT, a lightweight pre-trained model) + Bi-LSTM (bidirectional Long Short-Term Memory network) + CRF (Conditional Random Field).
After the training set is input into the first network model, ALBERT encodes each sample to produce a 768-dimensional encoding for each character, and the Bi-LSTM learns from these character encodings to obtain a large number of sequential features. Next, the CRF learns the constraint relationships between place labels and thereby predicts the place label of each character. This process is iterated in a loop to finally obtain the sequence model.
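One kind of constraint the CRF layer captures is that label sequences must be internally consistent: an I- tag can only continue a place of the same granularity. The hand-written validity check below illustrates that constraint; a real CRF learns a transition matrix from data rather than using fixed rules, so this is only an illustrative sketch.

```python
LABELS = ["Out", "B-Large", "I-Large", "B-Medium", "I-Medium", "B-Tiny", "I-Tiny"]

def valid_transition(prev: str, cur: str) -> bool:
    """Return True if `cur` may follow `prev` under B-/I-/Out constraints:
    an I- tag must continue a place of the same granularity."""
    if cur.startswith("I-"):
        gran = cur[2:]
        return prev in (f"B-{gran}", f"I-{gran}")
    return True  # Out and B- tags may follow any label

# A place cannot start with an inside tag:
print(valid_transition("Out", "I-Large"))      # invalid
print(valid_transition("B-Large", "I-Large"))  # valid
```

During decoding, the CRF effectively assigns very low scores to transitions like Out → I-Large, so the predicted tag sequence respects the place boundaries.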
Because the training set includes place labels of each granularity, the resulting sequence model can fully learn the features of places of each range class. After the text to be processed (a decision document text) is input into the sequence model, the model can screen out places of different range classes from it, avoiding situations where many small places go unextracted or long places are extracted incompletely.
In addition, among the places included in a decision document text there may be many that must be described but are irrelevant to the case. These irrelevant places hinder readers trying to trace the facts of the case and greatly increase their workload.
To avoid the above problem, a neural network model capable of screening out the places related to a case from a decision document is needed.
To achieve the above effect, please refer to fig. 4, after step S120, the method further includes:
step S140: and adding a classification label to each place included in the sample, wherein the classification label is used for representing whether the place is a needed place or not.
In an alternative embodiment, the classification tags comprise the numbers 0 and 1, where 0 marks places not related to the case, i.e. places that are not needed, and 1 marks places related to the case, i.e. places that are needed.
When a place's classification label is 0, the label indicates that, within its sentence, the place is not a needed place; when a place's classification label is 1, the label indicates that the place is a needed place.
Step S150: and inputting the training set added with the classification labels into a second network model for training to obtain a classification model for classifying places.
The model structure of the second network model may comprise ALBERT + CNN (Convolutional Neural Network) + Sigmoid (activation function).
After the training set with the added classification labels is input into the second network model, ALBERT encodes each sample to produce a 768-dimensional encoding for each character; the CNN then learns from these character encodings to extract the important local features; finally, the Sigmoid performs the classification, predicting whether each place is a needed place. This process is iterated in a loop to finally obtain the binary classification model.
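At inference time, the Sigmoid output for one extracted place can be thresholded into the 0/1 classification label described above. The sketch below assumes a single raw score per place and a 0.5 threshold; both are illustrative choices, not details fixed by the patent.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a raw model score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify_place(score: float, threshold: float = 0.5) -> int:
    """Map a raw score for one extracted place to a classification label:
    1 = needed (case-related), 0 = not needed."""
    return 1 if sigmoid(score) >= threshold else 0

# A clearly positive raw score maps to label 1, a clearly negative one to 0.
print(classify_place(2.3), classify_place(-1.7))
```

In the trained model the raw score would come from the CNN's output layer rather than being supplied directly.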
Once the classification model is obtained, as an optional embodiment, the places of different granularities produced by inputting the text to be processed into the sequence model can in turn be input into the classification model, so that the needed places are screened out by the classification model.
Through the screening, the places with different range grades related to the case can be obtained from the text to be processed.
In addition, as an alternative implementation, after the places of different range classes related to the case are acquired, the needed places (those of different range classes related to the case) can be highlighted in the text to be processed, for example by highlighting, marking in red, or underlining, so that readers of the text can conveniently comb through the case.
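A minimal sketch of this highlighting step, using bracket markers as a stand-in for visual emphasis such as colour or underlining; the function name and the markers are illustrative assumptions.

```python
def highlight_places(text, places, open_mark="【", close_mark="】"):
    """Wrap each needed place in marker characters so it stands out.

    Places are processed longest first so that a shorter place contained
    in a longer one is not wrapped twice.
    """
    for place in sorted(places, key=len, reverse=True):
        text = text.replace(place, f"{open_mark}{place}{close_mark}")
    return text

print(highlight_places("小王在A市B区C路D小区被盗", ["A市B区C路D小区"]))
```

A viewer-based implementation would instead attach display attributes (colour, underline) to the matched character ranges rather than inserting marker characters.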
In summary, after the text to be processed is obtained, it is input into a pre-created sequence model, and the sequence model screens out the places. Because the training set used to train the sequence model includes place labels of each granularity, the resulting sequence model can fully learn the features of places of each range class. After a decision document text is input into the sequence model, the model can screen out places of different range classes, avoiding situations where many small places go unextracted or long places are extracted incompletely.
As shown in fig. 5, the embodiment of the present application further provides a location extraction device 400, where the location extraction device 400 may include: an acquisition module 410 and a screening module 420.
An obtaining module 410, configured to obtain a text to be processed;
and the screening module 420 is used to input the text to be processed into a pre-created sequence model and screen out places of different granularities. Each character of each sample in the training set used to train the sequence model carries a place label, the place label indicating whether the character is part of a place, and the place labels contained in the plurality of samples composing the training set differ in granularity.
In one possible implementation, the location extraction device 400 may further include a response module and a training module.
The obtaining module 410 is further configured to obtain the sample;
the response module is used for responding to a label adding instruction of a user, adding the place label to each character included in the sample, and the place label is used for representing whether the character is a place or not;
the training module is used for forming a plurality of samples into the training set and inputting the training set into a first network model for training to obtain the sequence model for site screening.
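Before the labeled samples can be fed to the first network model, they must be assembled into a training set. The index-encoding scheme below is an assumption for illustration; the patent does not specify how characters and labels are encoded or which network architecture is used.

```python
# Sketch of assembling labeled samples into a training set for the first
# network model. The vocabulary-building scheme is an assumption.

def build_training_set(samples):
    """`samples` is a list of (characters, labels) pairs, one label per
    character. Returns index-encoded pairs plus both vocabularies."""
    char_vocab, label_vocab = {"<unk>": 0}, {}
    encoded = []
    for chars, labels in samples:
        # setdefault assigns the next free index to unseen entries.
        char_ids = [char_vocab.setdefault(c, len(char_vocab)) for c in chars]
        label_ids = [label_vocab.setdefault(l, len(label_vocab)) for l in labels]
        encoded.append((char_ids, label_ids))
    return encoded, char_vocab, label_vocab
```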
In a possible implementation, the location extraction device 400 further includes a correction module for correcting erroneous characters included in the sample.
In a possible implementation manner, the response module is further configured to add a classification tag to each location included in the sample, where the classification tag is used to characterize whether the location is a required location;
the training module is further configured to input the training set added with the classification label into a second network model for training, so as to obtain a classification model for performing location classification.
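The second-stage screening can be sketched as follows: each candidate place produced by the sequence model is passed to a binary classifier that decides whether it is a required place. The keyword rule standing in for the trained classification model, and all names here, are illustrative assumptions; the patent does not specify the classifier.

```python
# Sketch of second-stage screening: a binary classification model keeps
# only the required places. The keyword rule below is a stand-in for the
# trained model and is purely illustrative.

def is_required(place):
    # Stand-in for the classification model: treat street-level
    # addresses as "required" for this toy example.
    return "Street" in place or "Road" in place

def screen_places(candidates):
    """Return the subset of candidate places classified as required."""
    return [p for p in candidates if is_required(p)]

kept = screen_places(["Hefei City", "3333 Xiyou Road, Hefei City"])
```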
In a possible implementation manner, the screening module 420 is further configured to input the locations with different granularities into the classification model to screen out the required locations.
In a possible implementation, the location extraction device 400 further includes a display module for highlighting the required location in the text to be processed.
The location extraction device 400 according to the embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where the device embodiment is silent, reference may be made to the corresponding content of the foregoing method embodiment.
In addition, the embodiment of the application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, performs the steps included in the location extraction method.
In addition, referring to fig. 6, the embodiment of the application further provides an electronic device 100 for implementing the location extraction method and apparatus according to the embodiment of the application.
The electronic device 100 may run the sequence model and the classification model described above.
Alternatively, the electronic device 100 may be, but is not limited to, a personal computer (Personal computer, PC), a smart phone, a tablet computer, a mobile internet device (Mobile Internet Device, MID), a personal digital assistant, or the like.
Wherein the electronic device 100 may include: a processor 110, a memory 120.
It should be noted that the components and structures of the electronic device 100 shown in fig. 6 are exemplary only and not limiting, as the electronic device 100 may have other components and structures as desired.
The processor 110, the memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, either directly or indirectly, to enable transmission or interaction of data. For example, the processor 110, the memory 120, and possibly other components may be electrically connected to each other by one or more communication buses or signal lines.
The memory 120 is used to store a program, for example, a program corresponding to the location extraction method or the location extraction device described above. When the location extraction device is stored in the memory 120, it includes at least one software function module that may be stored in the memory 120 in the form of software or firmware.
Alternatively, the software functional module included in the location extraction device may be solidified in an Operating System (OS) of the electronic apparatus 100.
The processor 110 is configured to execute executable modules stored in the memory 120, such as the software function modules or computer programs included in the location extraction device. When the processor 110 receives an execution instruction, it may execute the computer program, for example, to perform: acquiring a text to be processed; inputting the text to be processed into a pre-created sequence model and screening out places of different granularities; wherein a place label is added to each character of each sample in the training set used to train the sequence model, the place label is used for characterizing whether the character is a place, and the granularities of the place labels in the samples composing the training set differ.
Of course, the methods disclosed in any of the embodiments of the present application may be applied to the processor 110 or implemented by the processor 110.
In summary, according to the location extraction method, device, electronic device, and computer-readable storage medium of the embodiments of the present application, after the text to be processed is acquired, it is input into a pre-created sequence model, and the places are screened out by the sequence model. Since the training set used to train the sequence model includes place labels of various granularities, the obtained sequence model can fully learn the features of places of various range grades. After the text of a judgment document is input into the sequence model, the sequence model can screen out places of different range grades, avoiding the situation in which many small places cannot be extracted or long places are extracted incompletely.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application.

Claims (9)

1. A method of location extraction, the method comprising:
acquiring a text to be processed;
inputting the text to be processed into a pre-created sequence model, and screening out places with different granularities;
a place label is added to each character of each sample in a training set used to train the sequence model, the place label being used for characterizing whether the character is a place; the granularities of the place labels in the plurality of samples composing the training set differ, forming place labels of multiple granularities that mark places of different range grades; the place range grade types include: complete detailed places, places that are incomplete or not detailed, and places of the minimum level; when the place label corresponding to a character is the complete detailed place label, the character is characterized as a character composing a complete detailed place in the sentence to which it belongs; when the place label corresponding to a character is the incomplete-or-not-detailed place label, the character is characterized as a character composing an incomplete or not-detailed place in the sentence to which it belongs; when the place label corresponding to a character is the minimum-level place label, the character is characterized as a character composing a place of the minimum level in the sentence to which it belongs;
inputting the places with different granularities into a classification model, and screening out the places required;
a classification label is added to each place of each sample in the training set used to train the classification model, the classification label being used for characterizing whether the place is a required place.
2. The method of claim 1, wherein prior to the obtaining text to be processed, the method further comprises:
acquiring the sample;
responding to a label adding instruction of a user, adding the place label to each character included in the sample, wherein the place label is used for representing whether the character is a place or not;
and forming a plurality of samples into the training set, inputting the training set into a first network model for training, and obtaining the sequence model for site screening.
3. The method of claim 2, wherein prior to said responding to the user's tag-add instruction, the method further comprises:
and correcting the error characters included in the sample.
4. The method of claim 2, wherein after the adding the place tag to each character included in the sample, the method further comprises:
adding a classification label to each place included in the sample, wherein the classification label is used for representing whether the place is a required place or not;
and inputting the training set added with the classification labels into a second network model for training to obtain a classification model for classifying places.
5. The method of claim 1, wherein after said screening out the desired location, the method further comprises:
highlighting the required place in the text to be processed.
6. A location extraction device, the location extraction device comprising: the device comprises an acquisition module and a screening module;
the acquisition module is used for acquiring the text to be processed;
the screening module is used for inputting the text to be processed into a pre-created sequence model and screening out places with different granularities;
a place label is added to each character of each sample in a training set used to train the sequence model, the place label being used for characterizing whether the character is a place; the granularities of the place labels in the plurality of samples composing the training set differ, forming place labels of multiple granularities that mark places of different range grades; the place range grade types include: complete detailed places, places that are incomplete or not detailed, and places of the minimum level; when the place label corresponding to a character is the complete detailed place label, the character is characterized as a character composing a complete detailed place in the sentence to which it belongs; when the place label corresponding to a character is the incomplete-or-not-detailed place label, the character is characterized as a character composing an incomplete or not-detailed place in the sentence to which it belongs; when the place label corresponding to a character is the minimum-level place label, the character is characterized as a character composing a place of the minimum level in the sentence to which it belongs;
the screening module is also used for inputting the places of different granularities into a classification model to screen out the required places; a classification label is added to each place of each sample in the training set used to train the classification model, the classification label being used for characterizing whether the place is a required place.
7. The location extraction device of claim 6, further comprising a response module and a training module;
the acquisition module is further used for acquiring the sample;
the response module is used for responding to a label adding instruction of a user, adding the place label to each character included in the sample, and the place label is used for representing whether the character is a place or not;
the training module is used for forming a plurality of samples into the training set and inputting the training set into a first network model for training to obtain the sequence model for site screening.
8. An electronic device, comprising: the device comprises a memory and a processor, wherein the memory is connected with the processor;
the memory is used for storing programs;
the processor invokes a program stored in the memory to perform the method of any one of claims 1-5.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being run by a computer, performs the method according to any of claims 1-5.
CN202010083644.7A 2020-02-07 2020-02-07 Site extraction method, apparatus, electronic device, and computer-readable storage medium Active CN111309861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010083644.7A CN111309861B (en) 2020-02-07 2020-02-07 Site extraction method, apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010083644.7A CN111309861B (en) 2020-02-07 2020-02-07 Site extraction method, apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111309861A CN111309861A (en) 2020-06-19
CN111309861B true CN111309861B (en) 2023-08-22

Family

ID=71161707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010083644.7A Active CN111309861B (en) 2020-02-07 2020-02-07 Site extraction method, apparatus, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111309861B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269881A (en) * 2020-11-05 2021-01-26 北京小米松果电子有限公司 Multi-label text classification method and device and storage medium
US20220147669A1 (en) * 2020-11-07 2022-05-12 International Business Machines Corporation Scalable Modeling for Large Collections of Time Series
CN113570480A (en) * 2021-07-19 2021-10-29 北京华宇元典信息服务有限公司 Judging document address information identification method and device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160050541A1 (en) * 2014-05-29 2016-02-18 Egypt-Japan University Of Science And Technology Fine-Grained Indoor Location-Based Social Network
CN110069626B (en) * 2017-11-09 2023-08-04 菜鸟智能物流控股有限公司 Target address identification method, classification model training method and equipment
CN109740150A (en) * 2018-12-20 2019-05-10 出门问问信息科技有限公司 Address resolution method, device, computer equipment and computer readable storage medium
CN110298039B (en) * 2019-06-20 2023-05-30 北京百度网讯科技有限公司 Event place identification method, system, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN111309861A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111309861B (en) Site extraction method, apparatus, electronic device, and computer-readable storage medium
CN109460551B (en) Signature information extraction method and device
US11061953B2 (en) Method and system for extraction of relevant sections from plurality of documents
CN111985229A (en) Sequence labeling method and device and computer equipment
WO2019060010A1 (en) Content pattern based automatic document classification
CN111897781B (en) Knowledge graph data extraction method and system
CN111275133A (en) Fusion method and device of classification models and storage medium
US12008830B2 (en) System for template invariant information extraction
US11763588B2 (en) Computing system for extraction of textual elements from a document
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN112287272B (en) Method, system and storage medium for classifying website list pages
CN111860653A (en) Visual question answering method and device, electronic equipment and storage medium
CN111581972A (en) Method, device, equipment and medium for identifying corresponding relation between symptom and part in text
Tymoshenko et al. Real-Time Ukrainian Text Recognition and Voicing.
KR102516560B1 (en) Managing system for handwritten document
CN117152770A (en) Handwriting input-oriented writing capability intelligent evaluation method and system
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
US20220129795A1 (en) Systems and methods for cognitive information mining
CN111339301B (en) Label determining method, label determining device, electronic equipment and computer readable storage medium
Asha et al. Artificial Neural Networks based DIGI Writing
CN112364131B (en) Corpus processing method and related device thereof
Lafia et al. Digitizing and parsing semi-structured historical administrative documents from the GI Bill mortgage guarantee program
Batomalaque et al. Image to text conversion technique for anti-plagiarism system
CN111860862A (en) Performing hierarchical simplification of learning models
US20240126978A1 (en) Determining attributes for elements of displayable content and adding them to an accessibility tree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant