CN114138928A - Method, system, device, electronic equipment and medium for extracting text content - Google Patents

Method, system, device, electronic equipment and medium for extracting text content

Info

Publication number
CN114138928A
Authority
CN
China
Prior art keywords
text
extracted
paragraph
crf
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111138827.5A
Other languages
Chinese (zh)
Inventor
杨婉琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202111138827.5A
Publication of CN114138928A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, a system, a device, electronic equipment and a medium for extracting text content. In the application, a text to be extracted can be obtained; the text to be extracted is recognized by a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; a target text paragraph carrying a target text meaning is determined from the text paragraphs by an entity recognition model generated based on BILSTM-CRF; and the target text paragraph is extracted from the text to be extracted. By applying the technical solution of the application, each paragraph of a text can be split out automatically by the natural paragraph splitting model, and text paragraphs with a specific meaning can be recognized and extracted automatically by the entity recognition model built on a bidirectional LSTM layer. The method and the device make machine learning models the dominant factor in the design of the detection system, and gradually reduce the proportion of manual intervention as the algorithms are optimized and the data models are refined, so that the accuracy of text content extraction is higher.

Description

Method, system, device, electronic equipment and medium for extracting text content
Technical Field
The present application relates to data processing technologies, and in particular, to a method, a system, an apparatus, an electronic device, and a medium for extracting text content.
Background
Text serves as a bridge for transferring information between people, and with the rapid development of the internet, textual information is communicated and spread everywhere.
However, in the related art, the quality of the text content uploaded by users is uneven and the number of texts is huge. When a text is received from a user and certain content information needs to be extracted from it, the platform often has to screen a large number of uploaded texts one by one before the required text content can be selected and returned to the client. It can be understood that if the information required by the user is screened from massive texts by manpower alone, the workload is large and the efficiency is low.
Disclosure of Invention
The embodiment of the application provides a method, a system, a device, electronic equipment and a medium for extracting text content. The method and the device are used for solving the problems of large workload and low efficiency in manually screening the information required by the user from massive texts in the related technology.
According to an aspect of the embodiments of the present application, a method for extracting text content is provided, which includes:
acquiring a text to be extracted;
identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted;
determining a target text paragraph with target text meaning from each text paragraph by using an entity recognition model generated based on a BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer;
and extracting the target text paragraph from the text to be extracted.
Optionally, in another embodiment based on the foregoing method of the present application, the identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted includes:
acquiring a TF-IDF keyword extraction model, and extracting each key field in the text to be extracted by using the TF-IDF model;
inputting each key field into a probability graph model to obtain a probability value classification result corresponding to each key field, wherein the probability value classification result is used for representing whether the key field is a paragraph distribution field to be extracted or not;
selecting a first key field with the probability value classification result higher than a preset threshold value, and taking a text paragraph where the first key field is located as each text paragraph of the text to be extracted.
Optionally, in another embodiment based on the foregoing method of the present application, the probability value classification result corresponding to each key field is obtained by using the following formula:
P(s,p,o)=P(s)P(o|s)P(p|s,o);
wherein P(s, p, o) is the probability value classification result, s corresponds to the first key field, o corresponds to the second key field, and p corresponds to the third key field.
Optionally, in another embodiment based on the above method of the present application, before the determining, by using the entity recognition model generated based on the BILSTM-CRF, a target text passage having a target text meaning from each text passage, the method further includes:
obtaining an initial BILSTM-CRF model;
setting a first layer of the initial BILSTM-CRF model as a word vector layer, wherein the word vector layer is used for identifying key field vectors corresponding to the meanings of the text paragraphs; and setting a second layer of the initial BILSTM-CRF model as a bidirectional LSTM layer; setting the third layer of the initial BILSTM-CRF model as a CRF layer to obtain a BILSTM-CRF model to be trained;
and training the BILSTM-CRF model to be trained to be convergent by using sample data, and generating the entity recognition model generated based on the BILSTM-CRF.
Optionally, in another embodiment based on the foregoing method of the present application, after the generating of the entity identification model generated based on the BILSTM-CRF, the method further includes:
inputting each text paragraph into the BILSTM-CRF entity recognition model;
obtaining key field vectors corresponding to each text paragraph by using a word vector layer of the BILSTM-CRF entity recognition model;
inputting the key field vectors corresponding to the text paragraphs into a bidirectional LSTM layer of the BILSTM-CRF entity recognition model to obtain a first hidden state sequence output by a forward LSTM layer and a second hidden state sequence output by a reverse LSTM layer;
splicing the first hidden state sequence and the second hidden state sequence to obtain a target hidden state sequence;
and identifying the target hidden state sequence through a CRF layer of the BILSTM-CRF entity identification model to obtain the target text paragraph.
Optionally, in another embodiment based on the foregoing method of the present application, after the obtaining the text to be extracted, the method further includes:
performing target word segmentation elimination on the text to be extracted, wherein the target word segmentation corresponds to stop words or specified parts of speech;
and removing noise words from the text to be extracted after the target word segmentation is eliminated by utilizing clustering operation.
Optionally, in another embodiment based on the foregoing method of the present application, after the determining a target text passage having a target text meaning from among the text passages, the method further includes:
carrying out unique serial number labeling on the target text paragraph of the text to be extracted;
and if an extraction instruction is received, extracting the target text paragraph from the text to be extracted according to the unique serial number label.
According to another aspect of the embodiments of the present application, there is provided an apparatus for extracting text content, including:
the acquisition module is configured to acquire a text to be extracted;
the generation module is configured to identify the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted;
the determining module is configured to determine a target text paragraph with a target text meaning from each text paragraph by using an entity recognition model generated based on BILSTM-CRF;
an extraction module configured to extract the target text paragraph from the text to be extracted.
According to another aspect of the embodiments of the present application, there is provided an electronic device including:
a memory for storing executable instructions; and
a processor, configured to communicate with the memory to execute the executable instructions to perform the operations of any of the methods of text content extraction described above.
According to a further aspect of the embodiments of the present application, there is provided a computer-readable storage medium for storing computer-readable instructions, which when executed, perform the operations of any one of the above-mentioned text content extraction methods.
In the application, the text to be extracted can be obtained; the text to be extracted is recognized by a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; a target text paragraph carrying the target text meaning is determined from the text paragraphs by an entity recognition model generated based on BILSTM-CRF; and the target text paragraph is extracted from the text to be extracted. By applying the technical solution of the application, each paragraph of a text can be split out automatically by the natural paragraph splitting model, and text paragraphs with a specific meaning can be recognized and extracted automatically by the entity recognition model built on a bidirectional LSTM layer. The method and the device make machine learning models the dominant factor in the design of the detection system, and gradually reduce the proportion of manual intervention as the algorithms are optimized and the data models are refined, so that the accuracy of text content extraction is higher.
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
fig. 1 is a schematic diagram of a system for extracting text content according to the present application;
fig. 2 is a schematic diagram of a text content extraction method proposed in the present application;
fig. 3 is a schematic structural diagram of an apparatus for text content extraction according to the present application;
fig. 4 is a schematic structural diagram of an electronic device for text content extraction according to the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the respective portions shown in the drawings are not drawn to scale.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In addition, technical solutions between the various embodiments of the present application may be combined with each other, but only on the basis that they can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, such a combination should be considered not to exist and not to fall within the protection scope of the present application.
It should be noted that all the directional indicators (such as upper, lower, left, right, front and rear) in the embodiments of the present application are only used to explain the relative position relationship, motion situation, etc. between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.
A method for performing textual content extraction according to an exemplary embodiment of the present application is described below in conjunction with fig. 1-2. It should be noted that the following application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the method of text content extraction or the apparatus of text content extraction of an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.
The terminal apparatuses 101, 102, 103 in the present application may be terminal apparatuses that provide various services. For example, the user via terminal 103 (which may also be terminal 101 or 102): acquiring a text to be extracted; identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; determining a target text paragraph with target text meaning from each text paragraph by using an entity recognition model generated based on a BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer; and extracting the target text paragraph from the text to be extracted.
It should be noted that the method for extracting text content provided in the embodiments of the present application may be executed by one or more of the terminal devices 101, 102, and 103, and/or the server 105, and accordingly, the apparatus for extracting text content provided in the embodiments of the present application is generally disposed in the corresponding terminal device, and/or the server 105, but the present application is not limited thereto.
The application also provides a text content extraction method, a text content extraction device, a target terminal and a medium.
Fig. 2 schematically shows a flow chart of a method for text content extraction according to an embodiment of the present application. As shown in fig. 2, the method includes:
s101, acquiring a text to be extracted.
S102, identifying the text to be extracted by using the natural paragraph splitting model to obtain each text paragraph of the text to be extracted.
First, it should be noted that the terminal for obtaining the text to be extracted is not specifically limited in this application, and may be, for example, an intelligent device or a server. The intelligent device may be a PC (Personal Computer), a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a portable computer, a mobile terminal device with a display function, or the like.
In order to avoid the problems of large workload and low efficiency that arise in the related art when information required by a user is screened manually from a large amount of text, each text paragraph of the text to be extracted can be extracted in this application based on a preset natural paragraph splitting model.
In one mode, each key field in the text to be extracted can be extracted through a preset TF-IDF keyword extraction model, and each key field can be input into a probability graph model to obtain a probability value classification result corresponding to each key field and used for representing the distribution of the paragraph to which the key field belongs, so that each text paragraph of the text to be extracted is determined according to the probability value classification result in the following process.
Specifically, each key field in the text to be screened can be extracted based on the extraction model; by establishing a preset probability graph model, each key field is made to output a probability value indicating how likely it is to belong to a text paragraph that should be extracted, and the text paragraph containing the highest-scoring key field is selected as the paragraph to extract.
It can be understood that, a part of the text segments are useless paragraphs, and extracting the useless paragraphs wastes the processing performance of the system, so that the method and the device can determine whether the key fields are important key fields according to the probability value classification results corresponding to the key fields, and thus determine whether the text paragraphs in which the key fields are located are text paragraphs that need to be extracted.
Specifically, information (such as the name of the text, the date, the area where the text is located, punctuation marks and the like) with multiple channels for describing each type of key field is input into the CopyNetWork model for word segmentation, and a plurality of key fields are obtained. And predicting the paragraph of each key field according to a probability graph model based on the Seq2Seq, wherein the probability graph model formula is as follows:
P(s,p,o)=P(s)P(o|s)P(p|s,o);
wherein P(s, p, o) is the probability value classification result, s corresponds to the first key field, o corresponds to the second key field, and p corresponds to the third key field.
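The following is a minimal sketch, not the patent's actual implementation, of how key fields could be pulled out with a TF-IDF model and scored with the factorized probability above. The scikit-learn TfidfVectorizer, the helper names and the 0.5 threshold are assumptions made for illustration; the three component probabilities would in practice come from the trained Seq2Seq probability graph model.

```python
# Sketch only: TF-IDF key-field extraction plus the factorized probability score.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_key_fields(paragraphs, top_k=5):
    """Return the top_k highest TF-IDF weighted terms of each paragraph."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(paragraphs)        # shape: (n_paragraphs, n_terms)
    terms = vectorizer.get_feature_names_out()
    key_fields = []
    for row in tfidf.toarray():
        top_idx = row.argsort()[::-1][:top_k]
        key_fields.append([terms[i] for i in top_idx if row[i] > 0])
    return key_fields

def key_field_score(p_s, p_o_given_s, p_p_given_so, threshold=0.5):
    """Factorized score P(s, p, o) = P(s) * P(o|s) * P(p|s, o); paragraphs whose
    key field scores above the threshold are kept for extraction."""
    score = p_s * p_o_given_s * p_p_given_so
    return score, score > threshold
```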
Wherein, the keyword extraction model can be a TF-IDF-based model. Further, since the text to be extracted may relate to various industry types, when extracting the content, a multi-label classification model (multi-label classification) may be established first. The model is used for classifying the industry categories of each text paragraph, for example, positive samples (feasibility study report text data belonging to the industry category) and negative samples (feasibility study report text data not belonging to the industry category) of corresponding project feasibility study reports can be collected for N industry types;
the method comprises the following steps that industry type data of multiple industries can be collected in advance, and two classification models are made for the multiple industry type data;
adding the plurality of secondary classification results into multivariate classifier training, and realizing classifier superposition to obtain a multi-label multivariate classification prediction model; it should be noted that the classification model employs textcnn two-class and multi-class.
For example, suppose the total number of tags for the industry-type data is n (n = 5), namely m1, m2, m3, m4 and m5, and suppose there are two samples: one sample may have the tags [m1, m4], and the other sample the tags [m2, m4, m5].
The label set of each sample can be regarded as a label co-occurrence pattern; different sample label sets correspond to different label co-occurrence patterns (the number of samples can be unlimited, but there are at most 2^n distinct label patterns), and the weight parameters of the formula are then set accordingly.
The input layer obtains key fields after the text is segmented into words, and the word vectors trained by word2vec are initialized and combined into a matrix as input; convolution kernels are applied in the convolution layer; texts of different lengths are converted into a fixed-length representation through the pooling layer; and finally the class probabilities are output through the fully connected layer. It is understood that the category with the highest probability can be determined as the paragraph number to which the key field belongs.
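A hedged sketch of a TextCNN classifier consistent with the layer-by-layer description above (word2vec-initialized embedding matrix as input, convolution layer, pooling to a fixed-length representation, fully connected output layer). All hyperparameters, class counts and the optional pretrained-embedding argument are illustrative assumptions rather than values from the patent; a multi-label variant would replace the final softmax with per-label sigmoids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=5,
                 kernel_sizes=(2, 3, 4), num_filters=64, pretrained=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if pretrained is not None:                        # word2vec-trained vectors, if supplied
            self.embedding.weight.data.copy_(pretrained)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                           # convolution layer output
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))
        features = torch.cat(pooled, dim=1)               # fixed-length representation
        return torch.softmax(self.fc(features), dim=1)    # class probabilities
```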
S103, determining a target text paragraph with the target text meaning from each text paragraph by using an entity recognition model generated based on the BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer.
Boundaries and categories of entity mentions in natural text can be identified through the entity recognition model generated based on BILSTM-CRF, where entity recognition is a sequence labeling problem. The paragraph meanings required in this application have characteristic boundaries and can be identified as proper nouns by named entity recognition.
Further, for training the entity recognition model generated based on the BILSTM-CRF, the following settings can be included:
the first layer of the model is a word vector layer, and each sentence is represented as a word vector and a word vector corresponding to the meaning of a text paragraph;
the second layer of the model is a bidirectional LSTM layer, a vector of each word of a key field (sentence or phrase) corresponding to the meaning of a text paragraph is used as input, and a hidden state sequence output by forward LSTM and a hidden state sequence output by reverse LSTM are spliced to obtain a complete hidden state sequence;
and the third layer of the model is a CRF layer, and sentence-level sequence labeling is carried out according to the extracted sentence characteristics to obtain a text paragraph meaning identification model.
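A minimal PyTorch sketch of the three layers just described (word vector layer, bidirectional LSTM layer, CRF layer). The use of the third-party pytorch-crf package for the CRF layer, and all dimensions, are assumptions of this sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # from the pytorch-crf package (an assumption of this sketch)

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)           # layer 1: word vectors
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                      # layer 2: forward + backward LSTM
        self.hidden2tag = nn.Linear(hidden_dim * 2, num_tags)          # emission scores S_{i,j}
        self.crf = CRF(num_tags, batch_first=True)                     # layer 3: CRF with transition matrix A

    def emissions(self, token_ids):
        embeds = self.embedding(token_ids)
        # the bidirectional LSTM already concatenates the forward and backward
        # hidden state sequences along the feature dimension
        lstm_out, _ = self.bilstm(embeds)
        return self.hidden2tag(lstm_out)

    def loss(self, token_ids, tags, mask=None):
        return -self.crf(self.emissions(token_ids), tags, mask=mask)   # negative log-likelihood

    def decode(self, token_ids, mask=None):
        return self.crf.decode(self.emissions(token_ids), mask=mask)   # best tag sequence
```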
Specifically, for the CRF layer, the natural paragraph codes obtained in step 1 may be added as features to the sequence labels of each input word, so that the model can learn complete paragraph features; the keyword features of the top 20 keywords of the different-meaning paragraphs obtained in step 2 are also added (that is, when a corresponding keyword appears in a paragraph, that word is given a corresponding KEY label) and matched with the probability output by the LSTM layer, serving as a second-layer guarantee for extracting specific-meaning paragraphs, so as to obtain the specific-meaning paragraph extraction model.
In addition, the output matrix of the BILSTM layer is denoted as S, where S_{i,j} represents the unnormalized probability of mapping the word w_i to tag_j, which is analogous to the emission probability matrix in a CRF model. The CRF layer holds a transition probability matrix A, where A_{i,j} represents the probability of transitioning from tag_i to tag_j.
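Written out in the standard BiLSTM-CRF form, the emission matrix S and the transition matrix A combine into a path score; the block below is a reconstruction of the usual formulation rather than a quotation of the patent.

```latex
\begin{aligned}
\mathrm{score}(x, y) &= \sum_{i=1}^{n} A_{y_{i-1},\,y_i} + \sum_{i=1}^{n} S_{i,\,y_i},\\
P(y \mid x) &= \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}.
\end{aligned}
```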
Furthermore, learning is carried out by a gradient descent method to obtain the optimal parameters θ, and the CRF model is obtained by training with these optimal parameters:
θ* = argmax_θ Σ_i log P(y_i | x_i; θ)
where x_i is an input sequence, y_i is the output tag sequence corresponding to that input sequence, and the predicted tag sequence is obtained as y* = argmax_{y'} score(x, y').
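A short training sketch for the BiLSTMCRF model sketched earlier: the summed log-likelihood is maximized by gradient descent and the best tag sequence is then decoded. The random mini-batch merely stands in for a real corpus of labelled paragraphs, and all sizes are illustrative assumptions.

```python
import torch

model = BiLSTMCRF(vocab_size=5000, num_tags=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # plain gradient descent

token_ids = torch.randint(0, 5000, (4, 20))                 # 4 paragraphs, 20 tokens each
tags = torch.randint(0, 7, (4, 20))                         # gold tag sequences y_i

for step in range(100):
    optimizer.zero_grad()
    loss = model.loss(token_ids, tags)                       # -sum_i log P(y_i | x_i; theta)
    loss.backward()
    optimizer.step()

predicted = model.decode(token_ids)                          # y* = argmax_{y'} score(x, y')
```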
And S104, extracting a target text paragraph from the text to be extracted.
In the application, the text to be extracted can be obtained; the text to be extracted is recognized by a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; a target text paragraph carrying the target text meaning is determined from the text paragraphs by an entity recognition model generated based on BILSTM-CRF; and the target text paragraph is extracted from the text to be extracted. By applying the technical solution of the application, each paragraph of a text can be split out automatically by the natural paragraph splitting model, and text paragraphs with a specific meaning can be recognized and extracted automatically by the entity recognition model built on a bidirectional LSTM layer. The method and the device make machine learning models the dominant factor in the design of the detection system, and gradually reduce the proportion of manual intervention as the algorithms are optimized and the data models are refined, so that the accuracy of text content extraction is higher.
Optionally, in another embodiment based on the foregoing method of the present application, identifying the text to be extracted by using a natural paragraph splitting model, to obtain each text paragraph of the text to be extracted includes:
acquiring a TF-IDF keyword extraction model, and extracting each key field in the text to be extracted by using the TF-IDF model;
inputting each key field into a probability graph model to obtain a probability value classification result corresponding to each key field, wherein the probability value classification result is used for representing whether the key field is a paragraph distribution field to be extracted or not;
and obtaining each text paragraph of the text to be extracted according to the probability value classification result.
Optionally, in another embodiment based on the foregoing method of the present application, the probability value classification result corresponding to each key field is obtained by using the following formula:
P(s,p,o)=P(s)P(o|s)P(p|s,o);
wherein P(s, p, o) is the probability value classification result, s corresponds to the first key field, o corresponds to the second key field, and p corresponds to the third key field.
Optionally, in another embodiment based on the above method of the present application, before the determining, by using the entity recognition model generated based on the BILSTM-CRF, a target text passage having a target text meaning from each text passage, the method further includes:
obtaining an initial BILSTM-CRF model;
setting a first layer of the initial BILSTM-CRF model as a word vector layer, wherein the word vector layer is used for identifying key field vectors corresponding to the meanings of the text paragraphs; and setting a second layer of the initial BILSTM-CRF model as a bidirectional LSTM layer; setting the third layer of the initial BILSTM-CRF model as a CRF layer to obtain a BILSTM-CRF model to be trained;
and training the BILSTM-CRF model to be trained to be convergent by using sample data, and generating the entity recognition model generated based on the BILSTM-CRF.
It can be understood that the application first needs to obtain an initial blank BILSTM-CRF model and set its architecture as a word vector layer, a bidirectional LSTM layer and a CRF layer, so that the entity recognition model generated based on BILSTM-CRF proposed in the application can be obtained only after the model is trained to convergence.
Optionally, in another embodiment based on the foregoing method of the present application, after the generating of the entity identification model generated based on the BILSTM-CRF, the method further includes:
inputting each text paragraph into the BILSTM-CRF entity recognition model;
obtaining key field vectors corresponding to each text paragraph by using a word vector layer of the BILSTM-CRF entity recognition model;
inputting the key field vectors corresponding to the text paragraphs into a bidirectional LSTM layer of the BILSTM-CRF entity recognition model to obtain a first hidden state sequence output by a forward LSTM layer and a second hidden state sequence output by a reverse LSTM layer;
splicing the first hidden state sequence and the second hidden state sequence to obtain a target hidden state sequence;
and identifying the target hidden state sequence through a CRF layer of the BILSTM-CRF entity identification model to obtain the target text paragraph.
Optionally, in another embodiment based on the foregoing method of the present application, after the obtaining the text to be extracted, the method further includes:
performing target word segmentation elimination on the text to be extracted, wherein the target word segmentation corresponds to stop words or specified parts of speech;
and removing noise words from the text to be extracted after the target word segmentation is eliminated by utilizing clustering operation.
Further, in the process of preprocessing the text, the application may perform word segmentation on the text to be screened to obtain stop words, where the stop words include at least one of prepositions, auxiliary words, conjunctions, and interjections.
It can be understood that words of the above parts of speech are unlikely to become key fields, so they can be eliminated to obtain a text with those parts of speech removed.
Further, the text with the parts of speech removed can be processed by a density clustering algorithm; for example, a neighborhood distance threshold and a minimum sample number (ε, MinPts) can be selected, and principal-component keywords and noise words are obtained through the clustering operation, so that the noise words can be removed.
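The preprocessing just described could look roughly like the following sketch: tokens whose part of speech marks them as stop words are dropped, and DBSCAN with parameters (ε, MinPts) separates principal-component keywords from noise words. The use of jieba for part-of-speech tagging, the specific tag set, and the word2vec vector lookup are assumptions for illustration, not requirements of the patent.

```python
import numpy as np
import jieba.posseg as pseg
from sklearn.cluster import DBSCAN

# jieba POS tags assumed here: p = preposition, u = auxiliary, c = conjunction, e = interjection
STOP_POS = {"p", "u", "c", "e"}

def remove_stop_pos(text):
    """Segment the text and drop tokens whose part of speech is a stop-word class."""
    return [word for word, flag in pseg.cut(text) if flag not in STOP_POS]

def remove_noise_words(words, word_vectors, eps=0.5, min_pts=3):
    """Cluster word vectors with DBSCAN; points labelled -1 are treated as noise words."""
    kept = [w for w in words if w in word_vectors]          # words with a word2vec vector
    vectors = np.array([word_vectors[w] for w in kept])
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
    keywords = [w for w, lbl in zip(kept, labels) if lbl != -1]
    noise = [w for w, lbl in zip(kept, labels) if lbl == -1]
    return keywords, noise
```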
Optionally, in another embodiment based on the foregoing method of the present application, after the determining a target text passage having a target text meaning from among the text passages, the method further includes:
carrying out unique serial number labeling on the target text paragraph of the text to be extracted;
and if an extraction instruction is received, extracting the target text paragraph from the text to be extracted according to the unique serial number label.
Further, after determining the target text paragraph with the target meaning, the application may choose to label it with a unique serial number rather than extracting it immediately. Then, after an extraction instruction generated by the user is received, the target text paragraph is extracted from the text to be extracted according to the unique serial number label.
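A minimal sketch of this deferred-extraction step: target paragraphs receive unique serial-number labels first, and extraction happens only when an extraction instruction naming those serial numbers arrives. The UUID-based labels and the function names are illustrative assumptions.

```python
import uuid

def label_target_paragraphs(target_paragraphs):
    """Attach a unique serial number to each target text paragraph."""
    return {str(uuid.uuid4()): paragraph for paragraph in target_paragraphs}

def extract_on_instruction(labelled, instruction_serials):
    """Return only the paragraphs whose serial numbers appear in the extraction instruction."""
    return [labelled[s] for s in instruction_serials if s in labelled]
```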
Optionally, in another embodiment of the present application, as shown in fig. 3, the present application further provides a text content extraction apparatus. Which comprises the following steps:
an obtaining module 201 configured to obtain a text to be extracted;
a generating module 202, configured to identify the text to be extracted by using a natural paragraph splitting model, so as to obtain each text paragraph of the text to be extracted;
a determining module 203, configured to determine a target text paragraph with a target text meaning from each text paragraph by using an entity recognition model generated based on the BILSTM-CRF;
an extracting module 204 configured to extract the target text passage from the text to be extracted.
In the application, the text to be extracted can be obtained; the text to be extracted is recognized by a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; a target text paragraph carrying the target text meaning is determined from the text paragraphs by an entity recognition model generated based on BILSTM-CRF; and the target text paragraph is extracted from the text to be extracted. By applying the technical solution of the application, each paragraph of a text can be split out automatically by the natural paragraph splitting model, and text paragraphs with a specific meaning can be recognized and extracted automatically by the entity recognition model built on a bidirectional LSTM layer. The method and the device make machine learning models the dominant factor in the design of the detection system, and gradually reduce the proportion of manual intervention as the algorithms are optimized and the data models are refined, so that the accuracy of text content extraction is higher.
In another embodiment of the present application, the obtaining module 201 further includes:
an obtaining module 201, configured to obtain a TF-IDF keyword extraction model, and extract each key field in the text to be extracted by using the TF-IDF model;
the obtaining module 201 is configured to input each key field into a probability graph model, and obtain a probability value classification result corresponding to each key field, where the probability value classification result is used to represent whether the key field is a to-be-extracted paragraph distribution field;
the obtaining module 201 is configured to obtain each text paragraph of the text to be extracted according to the probability value classification result.
In another embodiment of the present application, the method further includes: obtaining probability value classification results corresponding to the key fields by using the following formula:
P(s,p,o)=P(s)P(o|s)P(p|s,o);
wherein P(s, p, o) is the probability value classification result, s corresponds to the first key field, o corresponds to the second key field, and p corresponds to the third key field.
In another embodiment of the present application, the obtaining module 201 further includes:
an acquisition module 201 configured to acquire an initial BILSTM-CRF model; setting a first layer of the initial BILSTM-CRF model as a word vector layer, wherein the word vector layer is used for identifying key field vectors corresponding to the meanings of the text paragraphs; and setting a second layer of the initial BILSTM-CRF model as a bidirectional LSTM layer; setting the third layer of the initial BILSTM-CRF model as a CRF layer to obtain a BILSTM-CRF model to be trained;
the obtaining module 201 is configured to train the to-be-trained BILSTM-CRF model to converge by using sample data, and generate the entity identification model generated based on the BILSTM-CRF.
In another embodiment of the present application, the obtaining module 201 further includes:
an obtaining module 201 configured to input the text paragraphs into the BILSTM-CRF entity recognition model;
an obtaining module 201, configured to obtain, by using a word vector layer of the BILSTM-CRF entity recognition model, a key field vector corresponding to each text paragraph;
an obtaining module 201, configured to input the key field vectors corresponding to the text paragraphs into a bidirectional LSTM layer of the BILSTM-CRF entity recognition model, to obtain a first hidden state sequence output by a forward LSTM layer and a second hidden state sequence output by a reverse LSTM layer;
an obtaining module 201, configured to splice the first hidden state sequence and the second hidden state sequence to obtain a target hidden state sequence;
an obtaining module 201 configured to identify the target hidden state sequence through a CRF layer of the BILSTM-CRF entity identification model, so as to obtain the target text paragraph.
In another embodiment of the present application, the obtaining module 201 further includes:
the obtaining module 201 is configured to perform target word segmentation elimination on the text to be extracted, where the target word segmentation corresponds to a stop word or a specified part of speech;
the obtaining module 201 is configured to perform noise word removal on the text to be extracted after the target word segmentation is eliminated by using clustering operation.
In another embodiment of the present application, the obtaining module 201 further includes:
the obtaining module 201 is configured to label a unique serial number of a target text paragraph of the text to be extracted;
the obtaining module 201 is configured to, if an extracting instruction is received, extract the target text paragraph from the text to be extracted according to the unique serial number label.
Fig. 4 is a block diagram illustrating a logical structure of an electronic device in accordance with an exemplary embodiment. For example, the electronic device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory, including instructions executable by a processor of an electronic device to perform the method of text content extraction, the method comprising: acquiring a text to be extracted; identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; determining a target text paragraph with target text meaning from each text paragraph by using an entity recognition model generated based on a BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer; and extracting the target text paragraph from the text to be extracted. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application/computer program product including one or more instructions executable by a processor of an electronic device to perform the method of text content extraction described above, the method comprising: acquiring a text to be extracted; identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted; determining a target text paragraph with target text meaning from each text paragraph by using an entity recognition model generated based on a BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer; and extracting the target text paragraph from the text to be extracted. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above.
Fig. 4 is an exemplary diagram of the computer device 30. Those skilled in the art will appreciate that the diagram in fig. 4 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30, which may include more or fewer components than those shown, or combine certain components, or have different components; for example, the computer device 30 may also include input and output devices, network access devices, buses, etc.
The Processor 302 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor may be a microprocessor or the processor 302 may be any conventional processor or the like, the processor 302 being the control center for the computer device 30 and connecting the various parts of the overall computer device 30 using various interfaces and lines.
The memory 301 may be used to store computer-readable instructions 303, and the processor 302 may implement various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and by invoking the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer device 30, and the like. In addition, the memory 301 may include a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one disk storage device, a Flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
The modules integrated by the computer device 30 may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by hardware related to computer readable instructions, which may be stored in a computer readable storage medium, and when the computer readable instructions are executed by a processor, the steps of the method embodiments may be implemented.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of textual content extraction, comprising:
acquiring a text to be extracted;
identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted;
determining a target text paragraph with a target text meaning from each text paragraph by using an entity recognition model generated based on a BILSTM-CRF, wherein the BILSTM-CRF entity recognition model is a model generated based on a bidirectional LSTM layer;
and extracting the target text paragraph from the text to be extracted.
2. The method of claim 1, wherein the identifying the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted comprises:
acquiring a TF-IDF keyword extraction model, and extracting each key field in the text to be extracted by using the TF-IDF model;
inputting each key field into a probability graph model to obtain a probability value classification result corresponding to each key field, wherein the probability value classification result is used for representing whether the key field is a paragraph distribution field to be extracted or not;
selecting a first key field with the probability value classification result higher than a preset threshold value, and taking a text paragraph where the first key field is located as each text paragraph of the text to be extracted.
3. The method of claim 2, wherein the probability value classification result corresponding to each key field is obtained by using the following formula:
P(s,p,o)=P(s)P(o|s)P(p|s,o);
wherein P(s, p, o) is the probability value classification result, s corresponds to the first key field, o corresponds to the second key field, and p corresponds to the third key field.
4. The method of claim 1, wherein prior to determining a target passage of text having a target textual meaning from among the passages of text using the entity recognition model generated based on BILSTM-CRF, further comprising:
obtaining an initial BILSTM-CRF model;
setting a first layer of the initial BILSTM-CRF model as a word vector layer, wherein the word vector layer is used for identifying key field vectors corresponding to the meanings of the text paragraphs; and setting a second layer of the initial BILSTM-CRF model as a bidirectional LSTM layer; setting the third layer of the initial BILSTM-CRF model as a CRF layer to obtain a BILSTM-CRF model to be trained;
and training the BILSTM-CRF model to be trained to be convergent by using sample data, and generating the entity recognition model generated based on the BILSTM-CRF.
5. The method of claim 4, wherein after the generating the entity recognition model generated based on the BILSTM-CRF, further comprising:
inputting each text paragraph into the BILSTM-CRF entity recognition model;
obtaining key field vectors corresponding to each text paragraph by using a word vector layer of the BILSTM-CRF entity recognition model;
inputting the key field vectors corresponding to the text paragraphs into a bidirectional LSTM layer of the BILSTM-CRF entity recognition model to obtain a first hidden state sequence output by a forward LSTM layer and a second hidden state sequence output by a reverse LSTM layer;
splicing the first hidden state sequence and the second hidden state sequence to obtain a target hidden state sequence;
and identifying the target hidden state sequence through a CRF layer of the BILSTM-CRF entity identification model to obtain the target text paragraph.
6. The method of claim 1, after the obtaining text to be extracted, further comprising:
performing target word segmentation elimination on the text to be extracted, wherein the target word segmentation corresponds to stop words or specified parts of speech;
and removing noise words from the text to be extracted after the target word segmentation is eliminated by utilizing clustering operation.
7. The method of claim 1, wherein after said determining a target passage of text having a target textual meaning from among the respective passages of text, further comprising:
carrying out unique serial number labeling on the target text paragraph of the text to be extracted;
and if an extraction instruction is received, extracting the target text paragraph from the text to be extracted according to the unique serial number label.
8. An apparatus for text content extraction, comprising:
the acquisition module is configured to acquire a text to be extracted;
the generation module is configured to identify the text to be extracted by using a natural paragraph splitting model to obtain each text paragraph of the text to be extracted;
a determining module configured to determine a target text paragraph with a target text meaning from the text paragraphs by using an entity recognition model generated based on the BILSTM-CRF;
an extraction module configured to extract the target text paragraph from the text to be extracted.
9. An electronic device, comprising:
a memory for storing executable instructions; and
a processor, configured to communicate with the memory to execute the executable instructions to perform the operations of the method of text content extraction of any one of claims 1-7.
10. A computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the method of text content extraction of any of claims 1-7.
CN202111138827.5A 2021-09-27 2021-09-27 Method, system, device, electronic equipment and medium for extracting text content Pending CN114138928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138827.5A CN114138928A (en) 2021-09-27 2021-09-27 Method, system, device, electronic equipment and medium for extracting text content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138827.5A CN114138928A (en) 2021-09-27 2021-09-27 Method, system, device, electronic equipment and medium for extracting text content

Publications (1)

Publication Number Publication Date
CN114138928A true CN114138928A (en) 2022-03-04

Family

ID=80393974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138827.5A Pending CN114138928A (en) 2021-09-27 2021-09-27 Method, system, device, electronic equipment and medium for extracting text content

Country Status (1)

Country Link
CN (1) CN114138928A (en)

Similar Documents

Publication Publication Date Title
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
CN108121699B (en) Method and apparatus for outputting information
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112988963B (en) User intention prediction method, device, equipment and medium based on multi-flow nodes
CN113627797B (en) Method, device, computer equipment and storage medium for generating staff member portrait
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN108776677B (en) Parallel sentence library creating method and device and computer readable storage medium
CN112214576B (en) Public opinion analysis method, public opinion analysis device, terminal equipment and computer readable storage medium
CN110209772B (en) Text processing method, device and equipment and readable storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN110610003A (en) Method and system for assisting text annotation
CN113360660A (en) Text type identification method and device, electronic equipment and storage medium
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN112632950A (en) PPT generation method, device, equipment and computer-readable storage medium
CN110309355A (en) Generation method, device, equipment and the storage medium of content tab
CN112989050A (en) Table classification method, device, equipment and storage medium
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN111274384B (en) Text labeling method, equipment and computer storage medium thereof
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN116010545A (en) Data processing method, device and equipment
CN114021004A (en) Method, device and equipment for recommending science similar questions and readable storage medium
CN110457436B (en) Information labeling method and device, computer readable storage medium and electronic equipment
CN114067343A (en) Data set construction method, model training method and corresponding device
CN114138928A (en) Method, system, device, electronic equipment and medium for extracting text content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination