WO2021051871A1 - Text extraction method, apparatus, and device, and storage medium - Google Patents

Text extraction method, apparatus, and device, and storage medium

Info

Publication number
WO2021051871A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
text
extracted
extraction
target
Prior art date
Application number
PCT/CN2020/093466
Other languages
French (fr)
Chinese (zh)
Inventor
郝正鸿
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051871A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • This application relates to the field of text processing technology, and in particular to a text extraction method, device, equipment, and storage medium.
  • Information extraction is the process of automatically extracting unstructured data from documents (in business scenarios such as resumes, insurance clauses, encyclopedia entries, and contracts) and converting it into structured data; for example, extracting unstructured data such as the names of the contracting parties, the contract time, and the contract address from a lease contract.
  • Information extraction can be classified by extraction content into entity extraction, relationship extraction, and event extraction; by extraction length, mainly into vocabulary extraction and field/paragraph extraction; and by domain into open-domain and closed-domain information extraction. With the development of deep neural networks and the growth of computing power, existing information extraction methods mainly train end-to-end deep learning models with large numbers of parameters on large-scale labeled data, and then extract text information in different business scenarios based on the trained models. The inventor found that such methods do not classify extraction by extraction length, so the final extraction results are poorly targeted and inaccurate, and the efficiency of information extraction is reduced.
  • The main purpose of this application is to provide a text extraction method, device, equipment, and storage medium that solve the technical problems of existing information extraction technology: extraction results that are poorly targeted, low accuracy, and low extraction efficiency.
  • this application provides a text extraction method, which includes the following steps:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
  • this application also proposes a text extraction device, which includes:
  • the text acquisition module is used to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;
  • the sentence segmentation module is configured to call a multi-threaded processing script to segment the text to be extracted into sentence sets when it is detected that the extraction type is identified as field extraction;
  • the vector conversion module is used to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script
  • the vector splicing module is used for splicing the sentence vector to obtain the target sentence vector
  • a model prediction module configured to input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
  • the text extraction module is used to extract the target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
  • this application also proposes a text extraction device, which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • when the computer-readable instructions are executed by the processor, the following steps are implemented:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted
  • this application also proposes a computer-readable storage medium having computer-readable instructions stored therein, and the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the following steps:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
  • This application reads the text to be extracted and extracts the extraction type identification it contains; when the extraction type identification is detected as field extraction, a multi-threaded processing script is called to divide the text to be extracted into a sentence set; the multi-threaded processing script converts the sentences in the sentence set into sentence vectors; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input into the first conditional random field model to obtain the first prediction result output by that model; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • This application determines the extraction length according to the extraction type identification and selects the corresponding conditional random field model for each extraction length, which makes text extraction more targeted.
  • At the same time, this application uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm ensures the accuracy of target field extraction.
  • FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.
  • FIG. 5 is a structural block diagram of the first embodiment of the text extraction device of this application.
  • FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the application.
  • the text extraction device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface).
  • the memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as disk storage.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • the structure shown in FIG. 1 does not constitute a limitation on the text extraction device, which may include more or fewer components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the memory 1005 as a computer-readable storage medium may include an operating system, a data storage module, a network communication module, a user interface module, and a text extraction program.
  • the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with users; the processor 1001 and the memory 1005 of this application can be set in the text extraction device, which calls the text extraction program stored in the memory 1005 through the processor 1001 and executes the text extraction method provided in the embodiments of the present application.
  • FIG. 2 is a schematic flowchart of the first embodiment of the text extraction method of this application.
  • the text extraction method includes the following steps:
  • Step S10 Read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;
  • the execution subject of the method in this embodiment can be a computing service device with data processing, network communication, and program running functions, such as a smart phone, tablet, or personal computer, or it can be a text extraction tool pre-loaded on such a computing service device.
  • In addition, in the specific implementation scenario of this embodiment, the user first needs to upload sample documents to the text extraction tool, and the sample documents are marked with the paragraphs/fields or vocabulary that need to be extracted. Based on these sample documents, the text extraction tool trains an untrained initial conditional random field (Conditional Random Field, CRF) model to obtain a CRF model dedicated to field extraction or a CRF model dedicated to vocabulary extraction, and then performs paragraph/field extraction or vocabulary extraction based on these trained CRF models.
  • the extraction type identification includes field extraction and vocabulary extraction.
  • the user only needs to annotate a small number (a few to a dozen) of sample documents to extract the same vocabulary or paragraphs from similar documents with high accuracy.
  • the extraction type identification in this step needs to be selected by the user when uploading the text to be extracted, so that the text to be extracted carries an identification or mark for determining the specific extraction type of the text.
  • the text extraction tool reads the text to be extracted uploaded by the user, and extracts the extraction type identification contained in the text to be extracted.
  • Step S20 when it is detected that the extraction type is identified as field extraction, call a multi-threaded processing script to divide the text to be extracted into sentence sets;
  • the text extraction tool in this embodiment may first segment the text to be extracted according to sentence dimensions, obtain a number of sentences corresponding to the text to be extracted, and then combine these segmented sentences into a sentence set.
  • the multi-thread processing script may be a pre-written computer readable instruction or code file that enables multiple threads to concurrently execute a text segmentation operation.
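The segmentation step described above can be illustrated with a minimal sketch. The patent does not disclose the script itself, so everything here is an assumption: the function names `split_paragraph` and `segment_text`, the choice of sentence terminators, the use of one paragraph per worker task, and the thread-pool size are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed sentence terminators (Chinese and Western); the source does not specify them.
_SENTENCE_END = re.compile(r"(?<=[。！？.!?])\s*")


def split_paragraph(paragraph: str) -> list[str]:
    """Split one paragraph into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


def segment_text(text: str, max_workers: int = 4) -> list[str]:
    """Segment text into a sentence set, one paragraph per worker task."""
    paragraphs = [p for p in text.splitlines() if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so sentence order is preserved.
        per_paragraph = list(pool.map(split_paragraph, paragraphs))
    return [sentence for group in per_paragraph for sentence in group]
```

Because `map()` keeps the paragraph order, the resulting sentence set can later be spliced in text order without re-sorting. In CPython, threads mainly help when the per-task work releases the interpreter lock, so the speed-up of such a script depends on what each task actually does.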
  • Step S30 Transforming the sentences in the sentence set into sentence vectors through the multi-thread processing script
  • the sentence is converted into a sentence vector by first performing word segmentation on the sentence through the multi-threaded processing script and then obtaining the vocabulary dimension after word segmentation (for example, for the sentence "I like watching TV and don't like watching movies", the vocabulary dimension is: I, like, watch, TV, movie, no, also). The word frequency of each vocabulary item is then counted ("I 1, like 2, watch 2, TV 1, movie 1, no 1, also 0"; a word with frequency 0, such as "also", is part of the vocabulary dimension but does not occur in this sentence), and finally the sentence is converted according to these word frequencies to obtain the sentence vector "[1, 2, 2, 1, 1, 1, 0]".
  • the specific sentence vectorization method can also be other methods, and this embodiment does not specifically limit this.
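The word-frequency vectorization in the example above can be reproduced with a short sketch. The token list stands in for the output of a word segmenter, which the patent does not name; `word_frequency_vector` is a hypothetical helper.

```python
from collections import Counter


def word_frequency_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in the tokenized sentence."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, e.g. "also" below.
    return [counts[word] for word in vocabulary]


# Vocabulary dimension taken from the example in the text.
vocabulary = ["I", "like", "watch", "TV", "movie", "no", "also"]
# Assumed segmenter output for the example sentence.
tokens = ["I", "like", "watch", "TV", "no", "like", "watch", "movie"]
vector = word_frequency_vector(tokens, vocabulary)
```

With these inputs, `vector` matches the sentence vector given in the text.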
  • Step S40 splicing the sentence vectors to obtain a target sentence vector
  • the text extraction tool in this embodiment will also splice the sentence vectors corresponding to each sentence in the paragraph order of the text, to obtain the target sentence vector that is finally input into the CRF model.
  • considering how the BERT model (a method of pre-training language representations, which yields a general "language understanding" model trained on a large text corpus such as Wikipedia) compares with other language models in natural language processing, this embodiment preferably uses the BERT model to vectorize the sentences.
  • the sentences in the sentence set may be input to a pre-trained language model (i.e., the above-mentioned BERT model) through the multi-threaded processing script to obtain the sentence vector that the pre-trained language model outputs for each sentence; the text position information of each sentence in the text to be extracted is then acquired, and the sentence order is determined according to the text position information; finally, the sentence vectors are spliced in the sentence order to obtain the target sentence vector.
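The ordering-and-splicing step can be sketched as follows. This is only an illustration under stated assumptions: `embed` is a placeholder for whatever encoder produces sentence vectors (the text prefers BERT, which is not called here), `toy_embed` is a deliberately trivial stand-in, and text position is taken to be the character offset of the first occurrence of each sentence.

```python
from typing import Callable


def splice_sentence_vectors(
    text: str,
    sentences: list[str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    """Concatenate sentence vectors in the order the sentences occur in text."""
    # str.find gives the offset of the first occurrence; this assumes every
    # sentence occurs in the text and that the sentences are distinct.
    ordered = sorted(sentences, key=text.find)
    target: list[float] = []
    for sentence in ordered:
        target.extend(embed(sentence))
    return target


def toy_embed(sentence: str) -> list[float]:
    """Hypothetical stand-in for a real encoder: character and word counts."""
    return [float(len(sentence)), float(len(sentence.split()))]
```

Splicing by concatenation makes the length of the target sentence vector grow with the number of sentences; whether the original implementation pads or truncates it is not stated in the source.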
  • Step S50 Input the target sentence vector to a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
  • the text extraction tool can obtain a number of user-labeled documents and vectorize them to obtain a labeled text vector, which contains an observation text sequence; the labeled text vector is input to the initial conditional random field model, so that the initial conditional random field model is trained based on the observation text sequence to obtain a conditional random field model to be verified; model evaluation is then performed on the conditional random field model to be verified, and when the evaluation result meets the preset condition, the conditional random field model to be verified is used as the first conditional random field model.
  • the preset condition may be that the evaluation result of the model (for example, the accuracy of the prediction result) meets the use standard, for example, the accuracy of the prediction result exceeds 95%, which is not limited in this embodiment.
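A minimal sketch of such an acceptance check is shown below, assuming that the evaluation result is plain accuracy over a held-out set of labels and that the preset condition is "accuracy exceeds 95%". Both function names are hypothetical.

```python
def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of predictions that equal the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def meets_preset_condition(
    predicted: list[str], expected: list[str], threshold: float = 0.95
) -> bool:
    """Accept the model to be verified only if accuracy exceeds the threshold."""
    return accuracy(predicted, expected) > threshold
```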
  • the CRF model is an undirected graph learning model proposed on the basis of the maximum entropy model and the hidden Markov model; it is a conditional probability model used to label and segment ordered data.
  • the identification sequence obtained by the conditional random field model in this embodiment makes the corresponding observation sequence the same as, or most similar to, the observation sequence pre-marked by the user in the sample document (that is, it has the maximum conditional probability), so that the target field is extracted accurately.
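For reference, the conditional probability that a CRF maximizes can be written out for the common linear-chain case. The source does not say which CRF topology is used, so the linear-chain form and the symbols below (observation sequence x, identification sequence y, feature functions f_k, weights λ_k) are assumptions.

```latex
P(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big),
\qquad
\hat{y} = \arg\max_{y} P(y \mid x)
```

Training fits the weights λ_k on the user-labeled sequences, and prediction returns the sequence with the highest conditional probability.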
  • CRF model training can be as follows:
  • the observation sequence is the field or vocabulary marked by the user
  • the identification sequence is a text sequence that the text extraction tool automatically generates from the observation sequence using the OBIE (ontology-based information extraction) method
  • the above observation text sequence is a text sequence after the observation sequence is vectorized.
  • the text extraction tool may input the spliced target sentence vector into the first conditional random field model, and then obtain the first prediction result output by the first conditional random field model.
  • the first prediction result output by the first conditional random field usually includes multiple conditional probabilities, such as field 1.
  • Step S60 Use an exact matching search algorithm to extract a target field from the text to be extracted according to the first prediction result.
  • exact matching retrieval, also called exact matching search, refers to a search method in which the search term is exactly the same as a certain field in the resource database.
  • Exact matching means searching for the input search term as a fixed phrase.
  • the text extraction tool can search the field corresponding to the conditional probability in the prediction result as a "fixed phrase" to extract the retrieved target field.
  • the text extraction tool can sort the conditional probabilities in the first prediction result from high to low, select the one or more top-ranked conditional probabilities, and then use the fields corresponding to these conditional probabilities as target fields for text extraction through exact matching search.
  • the text extraction tool can also filter the conditional probabilities contained in the prediction result according to a preset conditional probability threshold; for example, all conditional probabilities whose value is higher than the threshold are used as target conditional probabilities, the target field is determined according to the target conditional probabilities, and text extraction is then performed based on the target field through exact matching search.
  • This embodiment does not specifically limit the method used to determine the target field according to the first prediction result.
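The two selection strategies and the exact matching step can be sketched together. The representation of the prediction result as a mapping from candidate field text to conditional probability is an assumption, as are the names `select_target_fields` and `exact_match_extract`.

```python
from typing import Optional


def select_target_fields(
    prediction: dict[str, float],
    threshold: Optional[float] = None,
    top_k: int = 1,
) -> list[str]:
    """Pick target fields by probability threshold, or else by top-k rank."""
    ranked = sorted(prediction.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        return [field for field, prob in ranked if prob > threshold]
    return [field for field, _ in ranked[:top_k]]


def exact_match_extract(text: str, fields: list[str]) -> dict[str, list[int]]:
    """Return the character offsets of every exact occurrence of each field."""
    hits: dict[str, list[int]] = {}
    for field in fields:
        offsets = []
        start = text.find(field)
        while start != -1:
            offsets.append(start)
            # Continue searching one character further to catch overlaps.
            start = text.find(field, start + 1)
        hits[field] = offsets
    return hits
```

Each field is searched as a fixed phrase, so a candidate that differs from the text by even one character yields no hit; this is what distinguishes exact matching from fuzzy retrieval.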
  • In this embodiment, the text to be extracted is read and the extraction type identification it contains is extracted; when the extraction type identification is detected as field extraction, the multi-threaded processing script is called to divide the text to be extracted into a sentence set; the multi-threaded processing script converts the sentences in the sentence set into sentence vectors; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input to the first conditional random field model to obtain the first prediction result output by that model; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • the extraction length is determined according to the extraction type identification, and the corresponding conditional random field model is selected for different extraction lengths to extract the text to make the text extraction more targeted.
  • At the same time, this embodiment uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm ensures the accuracy of target field extraction.
  • FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.
  • the method further includes:
  • Step S201 When it is detected that the extraction type is identified as vocabulary extraction, call a multi-threaded processing script to divide the text to be extracted into several sentences;
  • the vocabulary extraction is also called point extraction, that is, the extraction of characters or words.
  • the user needs to mark the vocabulary to be extracted in the sample document, such as vocabulary of different dimensions (contract signatory, contract time, contract address, etc.), and configure different label categories for vocabulary of different dimensions, such as person, time, and address.
  • when the text extraction tool determines, according to the extraction type identification or mark carried in the text to be extracted, that vocabulary extraction is required, it can call a multi-threaded processing script to divide the text to be extracted into several sentences.
  • Step S301 Obtain the similarity between each sentence and the sample sentence
  • the user before using the text extraction tool to extract vocabulary from the text to be extracted, the user also needs to use the text extraction tool to train the CRF model based on pre-labeled sample documents (documents containing labeled characters or vocabulary). Therefore, in this embodiment, the sentence carrying the marked characters or vocabulary in the sample document is used as the sample sentence.
  • the text extraction method of this embodiment first searches for sentences that are similar to the sample sentence and then extracts the target vocabulary from the similar sentences found.
  • the word frequency statistics technique can be used to count the word frequency of each vocabulary item in each sentence; the keyword (set) corresponding to each sentence is then determined according to the statistical result; and the similarity between sentence keywords (sets) is regarded as the similarity between sentences, which can improve the accuracy of the similarity calculation.
  • the current similarity calculation algorithms include cosine similarity algorithm, Euclidean distance algorithm, Pearson correlation coefficient and so on.
  • the similarity calculation algorithm described in this embodiment is preferably a cosine similarity algorithm that calculates the similarity by calculating the angle between vectors.
  • the text extraction tool performs word segmentation on the segmented sentences and, based on the TF-IDF algorithm, obtains the word frequency-inverse text frequency index value (that is, the TF-IDF value) corresponding to each vocabulary item after word segmentation; it then determines, according to the word frequency-inverse text frequency index value, the sentence keywords of the sentence to which each vocabulary item belongs; finally, based on the sentence keywords, it obtains the similarity between that sentence and the sample sentence.
  • the step of obtaining the similarity between the sentence to which each vocabulary item belongs and the sample sentence based on the sentence keywords may specifically include: obtaining the word frequency vector corresponding to the sentence keywords, and then using a cosine similarity algorithm to calculate the cosine similarity between the word frequency vector of the sentence to which each vocabulary item belongs and the word frequency vector of the sample sentence. The greater the cosine similarity value, the more similar the two sentences; the smaller the value, the less similar.
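The cosine similarity between keyword word-frequency vectors can be sketched as follows. The helper names are hypothetical, and building the shared vocabulary from the union of both keyword lists is an assumption; the source only says that word frequency vectors of the keywords are compared.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_similarity(keywords_a: list[str], keywords_b: list[str]) -> float:
    """Compare two keyword lists through their word-frequency vectors."""
    vocabulary = sorted(set(keywords_a) | set(keywords_b))
    count_a, count_b = Counter(keywords_a), Counter(keywords_b)
    return cosine_similarity(
        [count_a[w] for w in vocabulary],
        [count_b[w] for w in vocabulary],
    )
```

Sentences whose similarity to the sample sentence is high enough can then be kept as target sentences; the cut-off used for that filtering is not given in the source.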
  • Step S401 filter out several target sentences corresponding to the sample sentence from the segmented sentences based on the similarity
  • the text extraction tool of this embodiment needs to first filter out several target sentences corresponding to the sample sentences from the segmented sentences according to the calculated similarity, and then extract the final target vocabulary from these target sentences.
  • Step S501 Construct a candidate sentence set according to the target sentence, and input the sentence in the candidate sentence set into a second conditional random field model after vectorization;
  • this embodiment uses a pre-trained CRF model dedicated to vocabulary extraction as the second conditional random field model.
  • the text extraction tool can construct a candidate sentence set according to the target sentences, input the sentences in the candidate sentence set into the BERT model, and obtain the sentence vectors output by the BERT model. After obtaining these sentence vectors, the text extraction tool can input them into the second conditional random field model to predict the conditional probabilities.
  • Step S601 Obtain a second prediction result output by the second conditional random field model, and use an exact matching retrieval algorithm to extract a target vocabulary from the text to be extracted according to the second prediction result.
  • after the text extraction tool obtains the second prediction result output by the second conditional random field model, it can determine the target vocabulary to be extracted according to the conditional probability values contained in the second prediction result, and then retrieve and extract all occurrences of the target vocabulary from the text to be extracted through the exact matching retrieval algorithm.
  • In this embodiment, when it is detected that the extraction type identification is vocabulary extraction, the multi-threaded processing script is called to divide the text to be extracted into several sentences; the similarity between each sentence and the sample sentence is obtained; several target sentences corresponding to the sample sentence are selected from the segmented sentences based on the similarity; a candidate sentence set is constructed according to the target sentences, and the sentences in the candidate sentence set are vectorized and input to the second conditional random field model; the second prediction result output by the second conditional random field model is obtained; and, according to the second prediction result, the exact matching retrieval algorithm is used to extract the target vocabulary from the text to be extracted.
  • the multi-threaded processing script is used to segment the text to be extracted, which improves the efficiency of segmentation.
  • the target sentences are selected according to similarity to construct the candidate sentence set, which ensures that the sentences input to the conditional random field model are closer to the sample sentences, reducing the amount of model computation and improving the accuracy of vocabulary extraction.
  • FIG. 4 is a schematic flowchart of a third embodiment of a text extraction method of this application.
  • the text extraction method of this embodiment further includes:
  • Step S01 Obtain a number of user-labeled documents, and the user-labeled documents contain label sentences of multiple preset label categories;
  • the document marked by the user in this embodiment is text marked by characters or vocabulary in advance by the user.
  • the preset label category may be pre-configured to distinguish between characters or vocabulary of different dimensions.
  • for example, the label corresponding to characters or vocabulary denoting the two parties to a contract is configured as "person"; the label corresponding to characters or vocabulary denoting a time of occurrence, a time, or a duration is configured as "time"; and the label corresponding to characters or vocabulary denoting a place or occasion is configured as "address".
  • each user-labeled document can be labeled with multiple different label categories by the user, and there can be multiple label sentences corresponding to each label category.
  • Step S02 Perform word segmentation processing on the label sentence through the multi-thread processing script, and construct a vocabulary dictionary according to the sentence vocabulary after the word segmentation;
  • the text extraction tool can perform word segmentation on each label sentence contained in the user-labeled document through a multi-threaded processing script, and then remove the stop words contained in the resulting sentence vocabulary. After removing the stop words, the text extraction tool can construct a vocabulary dictionary from the remaining sentence vocabulary. For example, if the user-labeled document a contains n label sentences with the label category b, the text extraction tool can segment the n label sentences, remove the stop words, and obtain a vocabulary dictionary containing v words.
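Building the vocabulary dictionary from already segmented label sentences can be sketched as below. The word segmenter and the stop-word list are not specified in the source, so the inputs are assumed to be pre-tokenized, and `build_vocabulary` is a hypothetical helper that maps each word to an index.

```python
def build_vocabulary(
    tokenized_sentences: list[list[str]], stop_words: set[str]
) -> dict[str, int]:
    """Map each non-stop word to a stable index in order of first appearance."""
    vocabulary: dict[str, int] = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in stop_words and token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary
```

The number of entries in the returned dictionary corresponds to v in the text, and the number of input sentences corresponds to n.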
  • Step S03 Calculate the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result;
  • the text extraction tool can calculate the word frequency-inverse text frequency index value (TF-IDF value) of each word in the vocabulary dictionary through the TF-IDF algorithm, and then build a TF-IDF matrix of order v*n from the calculated TF-IDF values.
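A v*n TF-IDF matrix can be built with the standard library alone, as sketched below. The source does not give its TF-IDF formula, so this sketch assumes one common unsmoothed variant: term frequency is the count divided by sentence length, and inverse frequency is the natural logarithm of n divided by the number of sentences containing the word. `tfidf_matrix` is a hypothetical name.

```python
import math


def tfidf_matrix(
    tokenized_sentences: list[list[str]],
) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, matrix) where matrix[i][j] is the TF-IDF value
    of vocabulary word i in sentence j, giving a v x n matrix."""
    vocabulary = sorted({w for s in tokenized_sentences for w in s})
    n = len(tokenized_sentences)
    # Number of sentences in which each word occurs (always >= 1 here).
    doc_freq = {w: sum(1 for s in tokenized_sentences if w in s) for w in vocabulary}
    matrix = []
    for word in vocabulary:
        idf = math.log(n / doc_freq[word])
        row = []
        for sentence in tokenized_sentences:
            tf = sentence.count(word) / len(sentence) if sentence else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix
```

Under this variant, a word that occurs in every sentence gets an inverse frequency of zero and therefore contributes nothing to any sentence vector.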
  • Step S04 Obtain the sentence vector corresponding to the labeled sentence according to the word frequency-inverse text frequency index value matrix
  • the corresponding TF-IDF matrix may be more complex.
  • the more complex the matrix, the more computing resources it occupies during processing, which reduces computing efficiency and makes it harder to filter out the more important data from the matrix. Therefore, after acquiring the above-mentioned TF-IDF matrix, the text extraction tool in this embodiment also performs dimensionality reduction on the TF-IDF matrix.
  • the text extraction tool may perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; then select a preset number of target singular values from the set of singular values and reconstruct the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; finally, a sentence vector corresponding to the labeled sentence is obtained based on the target matrix.
  • the singular values obtained from the singular value decomposition are generally arranged in descending order of value.
  • the larger the singular value, the better it characterizes the information of the original matrix; that is, the higher its information content and the stronger its representativeness. Therefore, after obtaining the singular value set, the text extraction tool of this embodiment can select a preset number of target singular values (for example, the 60 or 120 largest singular values) from the singular value set to reconstruct the matrix, thereby effectively reducing the dimension of the TF-IDF matrix without losing the main matrix information.
  • the preset number can be set according to actual conditions, which is not limited in this embodiment.
  • the text extraction tool can obtain the sentence vector corresponding to each labeled sentence based on the dimensionality reduction matrix after performing SVD dimensionality reduction on the word frequency-inverse text frequency index value matrix.
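The dimensionality reduction can be written out as a truncated singular value decomposition. The notation is an assumption: A is the v*n TF-IDF matrix, r is its rank, and k is the preset number of target singular values. Reading the sentence vectors off the columns is one possible interpretation; the source does not state how they are taken from the target matrix.

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,
\qquad
A_k = U_k \Sigma_k V_k^{\top} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top},
\quad k \le r
```

Here A_k is the reconstructed target matrix, and the j-th column of the k*n matrix Σ_k V_k^T gives a k-dimensional vector for the j-th label sentence.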
  • Step S05 Input the sentence vector to the conditional random field model to be trained for training, and obtain the second conditional random field model.
  • the text extraction tool can input the obtained sentence vector into the conditional random field model to be trained for training, thereby obtaining a second conditional random field model for predicting the similarity of words based on the words marked in the sample sentences.
  • In this embodiment, a number of user-labeled documents are obtained, which contain label sentences of multiple preset label categories; word segmentation is performed on the label sentences through the multi-threaded processing script, and a vocabulary dictionary is constructed from the sentence vocabulary after word segmentation; the word frequency-inverse text frequency index value of each word in the vocabulary dictionary is calculated, and a word frequency-inverse text frequency index value matrix is constructed from the calculation result; the sentence vector corresponding to each label sentence is obtained according to the matrix; and the sentence vector is input to the conditional random field model to be trained, yielding the second conditional random field model. Because the sentence vectors are obtained from a matrix built from the word frequency-inverse text frequency index value of each word, and the conditional random field model is then trained on these sentence vectors, the trained model has high accuracy.
  • An embodiment of the present application also proposes a computer-readable storage medium.
  • The computer-readable storage medium may be non-volatile or volatile, and a text extraction program is stored on it.
  • When the text extraction program is executed by a processor, the steps of the text extraction method described above are implemented.
  • Fig. 5 is a structural block diagram of the first embodiment of the text extraction apparatus of this application.
  • The text extraction apparatus proposed in the embodiment of the present application includes:
  • the text acquisition module 501, configured to read the text to be extracted and extract the extraction type identifier contained in the text to be extracted;
  • the sentence segmentation module 502, configured to call a multi-threaded processing script to segment the text to be extracted into a sentence set when it is detected that the extraction type identifier is field extraction;
  • the vector conversion module 503, configured to convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
  • the vector splicing module 504, configured to splice the sentence vectors to obtain a target sentence vector;
  • the model prediction module 505, configured to input the target sentence vector into a first conditional random field model and obtain the first prediction result output by the first conditional random field model;
  • the text extraction module 506, configured to extract a target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
  • In this embodiment, the text to be extracted is read and the extraction type identifier contained in it is extracted; when the extraction type identifier is detected to be field extraction, the multi-threaded processing script is called to segment the text to be extracted into a sentence set; the sentences in the sentence set are converted into sentence vectors through the multi-threaded processing script; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input to the first conditional random field model, and the first prediction result output by the model is obtained; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • The extraction length is determined according to the extraction type identifier, and the corresponding conditional random field model is selected for each extraction length, which makes the text extraction more targeted.
  • At the same time, this embodiment uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.
  • The vector conversion module 503 is further configured to input the sentences in the sentence set into the pre-trained language model through the multi-threaded processing script, to obtain the sentence vector corresponding to each sentence output by the pre-trained language model.
  • The vector splicing module 504 is also configured to obtain the text position information of each sentence in the text to be extracted, determine the sentence order corresponding to each sentence according to the text position information, and splice the sentence vectors in that sentence order to obtain the target sentence vector.
  • The text extraction apparatus of this embodiment further includes a model training module, configured to: obtain several user-labeled documents and vectorize them to obtain a labeled text vector, the labeled text vector containing an observation text sequence; input the labeled text vector to an initial conditional random field model, so that the initial conditional random field model performs model training based on the observation text sequence to obtain a conditional random field model to be verified; and perform model evaluation on the conditional random field model to be verified and, when the evaluation result meets a preset condition, use the conditional random field model to be verified as the first conditional random field model.
  • The text extraction apparatus of this embodiment further includes a vocabulary extraction module, configured to: call the multi-threaded processing script to segment the text to be extracted into several sentences when it is detected that the extraction type identifier is vocabulary extraction; obtain the similarity between each segmented sentence and a sample sentence; select, based on the similarity, several target sentences corresponding to the sample sentence from the segmented sentences; construct a candidate sentence set from the target sentences, vectorize the sentences in the candidate sentence set, and input them to the second conditional random field model; and obtain the second prediction result output by the second conditional random field model and, according to the second prediction result, use the exact matching retrieval algorithm to extract the target vocabulary from the text to be extracted.
  • The vocabulary extraction module is also configured to perform word segmentation on the segmented sentences and obtain the word frequency-inverse text frequency index value corresponding to each word after segmentation; determine, according to these values, the sentence keywords of the sentence to which each word belongs; and obtain, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.
  • The model training module is also configured to obtain several user-labeled documents, the user-labeled documents containing multiple labeled sentences of preset label categories; perform word segmentation on the labeled sentences through the multi-threaded processing script, and construct a vocabulary dictionary from the segmented sentence vocabulary; calculate the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result; obtain the sentence vector corresponding to each labeled sentence according to this matrix; and input the sentence vectors to the conditional random field model to be trained to obtain the second conditional random field model.
  • The model training module is also configured to perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; select a preset number of target singular values from the set and perform matrix reconstruction on the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; and obtain the sentence vector corresponding to each labeled sentence based on the target matrix.
  • The technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory/random access memory, a magnetic disk, or an optical disk) and includes several instructions that cause a device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.
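The word frequency-inverse text frequency index value (TF-IDF) matrix used for training the second conditional random field model can be illustrated with a minimal sketch. The function names, the token lists, and the weighting (term frequency as count divided by sentence length, inverse frequency as log(N / df)) are assumptions for illustration; the source does not state which TF-IDF variant it uses. The singular value reduction step is left out here.

```python
import math
from collections import Counter


def build_vocabulary(sentences: list[list[str]]) -> list[str]:
    """Collect every distinct token, sorted for a stable column order."""
    return sorted({token for sentence in sentences for token in sentence})


def tfidf_matrix(sentences: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, matrix); row i is the TF-IDF vector of sentence i.

    Assumed weighting: tf = count / sentence length, idf = log(N / df).
    """
    vocabulary = build_vocabulary(sentences)
    n = len(sentences)
    # Document frequency: number of sentences that contain each token.
    df = Counter(token for sentence in sentences for token in set(sentence))
    matrix = []
    for sentence in sentences:
        counts = Counter(sentence)
        total = len(sentence) or 1  # avoid division by zero for empty sentences
        matrix.append(
            [(counts[w] / total) * math.log(n / df[w]) for w in vocabulary]
        )
    return vocabulary, matrix


# Hypothetical, already word-segmented labeled sentences.
labeled = [["lease", "contract", "party"], ["lease", "term"]]
vocabulary, matrix = tfidf_matrix(labeled)
```

With this weighting, a word that occurs in every sentence (here "lease") gets a weight of zero, so only the distinguishing words contribute to each sentence vector.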

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A text extraction method, apparatus and device, and a storage medium. The method comprises: reading text to be extracted, and extracting an extraction type identifier comprised in the text to be extracted (S10); upon detection that the extraction type identifier is field extraction, invoking a multi-threaded process script to segment the text to be extracted into sentence sets (S20); converting sentences in the sentence sets into sentence vectors by means of the multi-threaded process script (S30); splicing the sentence vectors to obtain a target sentence vector (S40); inputting the target sentence vector into a first conditional random field model to obtain a first prediction result output by the first conditional random field model (S50); and extracting a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm (S60). According to the method, an extraction length is determined according to an extraction type identifier, and corresponding conditional random field models are selected for text extraction depending on different extraction lengths so that text extraction is more targeted; furthermore, a multi-threaded process script is used for text segmentation so that the overall efficiency of text extraction is improved, and target field extraction by means of an exact matching retrieval algorithm also guarantees the accuracy of target field extraction.

Description

Text extraction method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application No. 201910885399.9, filed on September 18, 2019 and titled "Text extraction method, device, equipment and storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of text processing technology, and in particular to a text extraction method, apparatus, device, and storage medium.
Background
Information extraction is the process of automatically extracting unstructured data from documents (such as resumes, insurance clauses, encyclopedia entries, contracts, and documents from other business scenarios) and converting it into structured data; for example, extracting and converting unstructured data such as the names of the contracting parties, the signing time, and the signing address in a lease contract.
In terms of extracted content, information extraction mainly includes entity extraction, relation extraction, and event extraction; in terms of extraction length, it mainly includes vocabulary extraction and field/paragraph extraction. It is also divided into open-domain and closed-domain information extraction. With the development of deep neural networks and the growth of computing power, existing information extraction methods mainly train end-to-end deep learning models with large numbers of parameters on large-scale labeled data, and then extract text information in different business scenarios based on the trained models. The inventor found that this approach does not perform classified extraction for different extraction lengths, so the final extraction results are poorly targeted and not very accurate, and the efficiency of information extraction is reduced.
The above content is only used to assist in understanding the technical solution of this application, and does not constitute an admission that the above content is prior art.
Technical problem
The main purpose of this application is to provide a text extraction method, apparatus, device, and storage medium, aiming to solve the technical problem that the extraction results of existing information extraction technology are poorly targeted, not very accurate, and inefficient to obtain.
Technical solution
To achieve the above objective, this application provides a text extraction method, which includes the following steps:
Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;
When it is detected that the extraction type identifier is field extraction, call a multi-threaded processing script to segment the text to be extracted into a sentence set;
Convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
Splice the sentence vectors to obtain a target sentence vector;
Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
According to the first prediction result, use an exact matching retrieval algorithm to extract a target field from the text to be extracted.
In addition, to achieve the above objective, this application also proposes a text extraction apparatus, which includes:
A text acquisition module, configured to read the text to be extracted and extract the extraction type identifier contained in the text to be extracted;
A sentence segmentation module, configured to call a multi-threaded processing script to segment the text to be extracted into a sentence set when it is detected that the extraction type identifier is field extraction;
A vector conversion module, configured to convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
A vector splicing module, configured to splice the sentence vectors to obtain a target sentence vector;
A model prediction module, configured to input the target sentence vector into a first conditional random field model and obtain a first prediction result output by the first conditional random field model;
A text extraction module, configured to extract a target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
In addition, to achieve the above objective, this application also proposes a text extraction device. The text extraction device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:
Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;
When it is detected that the extraction type identifier is field extraction, call a multi-threaded processing script to segment the text to be extracted into a sentence set;
Convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
Splice the sentence vectors to obtain a target sentence vector;
Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
According to the first prediction result, use an exact matching retrieval algorithm to extract a target field from the text to be extracted.
In addition, to achieve the above objective, this application also proposes a computer-readable storage medium. Computer-readable instructions are stored in the computer-readable storage medium and can be executed by at least one processor, so that the at least one processor performs the following steps:
Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;
When it is detected that the extraction type identifier is field extraction, call a multi-threaded processing script to segment the text to be extracted into a sentence set;
Convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
Splice the sentence vectors to obtain a target sentence vector;
Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
According to the first prediction result, use an exact matching retrieval algorithm to extract a target field from the text to be extracted.
Beneficial effect
This application reads the text to be extracted and extracts the extraction type identifier contained in it; when the extraction type identifier is detected to be field extraction, it calls a multi-threaded processing script to segment the text to be extracted into a sentence set; converts the sentences in the sentence set into sentence vectors through the multi-threaded processing script; splices the sentence vectors to obtain a target sentence vector; inputs the target sentence vector into the first conditional random field model and obtains the first prediction result output by that model; and, according to the first prediction result, uses an exact matching retrieval algorithm to extract the target field from the text to be extracted. This application determines the extraction length according to the extraction type identifier and selects the corresponding conditional random field model for each extraction length, which makes text extraction more targeted. At the same time, it uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.
Description of the drawings
FIG. 1 is a schematic structural diagram of a text extraction device in the hardware operating environment involved in the solutions of the embodiments of this application;
FIG. 2 is a schematic flowchart of the first embodiment of the text extraction method of this application;
FIG. 3 is a schematic flowchart of the second embodiment of the text extraction method of this application;
FIG. 4 is a schematic flowchart of the third embodiment of the text extraction method of this application;
FIG. 5 is a structural block diagram of the first embodiment of the text extraction apparatus of this application.
The realization of the purpose, the functional characteristics, and the advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Embodiments of the present invention
It should be understood that the specific embodiments described here are only used to explain this application, and are not used to limit it.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a text extraction device in the hardware operating environment involved in the solutions of the embodiments of this application.
As shown in FIG. 1, the text extraction device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the text extraction device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
As shown in FIG. 1, the memory 1005, as a computer-readable storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a text extraction program.
In the text extraction device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be arranged in the text extraction device, which calls the text extraction program stored in the memory 1005 through the processor 1001 and executes the text extraction method provided in the embodiments of this application.
An embodiment of this application provides a text extraction method. Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the text extraction method of this application.
In this embodiment, the text extraction method includes the following steps:
Step S10: Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;
It should be noted that the method of this embodiment may be executed by a computing service device with data processing, network communication, and program execution functions, such as a smartphone, tablet computer, or personal computer, or by a text extraction tool pre-installed on such a device. In addition, in a specific implementation scenario of this embodiment, the user first uploads sample documents to the text extraction tool. The sample documents are annotated with the paragraphs/fields or vocabulary to be extracted, and the text extraction tool uses them to train an untrained initial conditional random field (CRF) model, obtaining a CRF model dedicated to field extraction or a CRF model dedicated to vocabulary extraction; paragraph/field extraction or vocabulary extraction is then performed based on these trained CRF models.
It should be understood that the extraction type identifier includes field extraction and vocabulary extraction. For these two different application scenarios, the user only needs to annotate a small number of sample documents (a few to a dozen or so) to extract the same vocabulary or paragraphs from documents of the same kind with high accuracy. In addition, the extraction type identifier in this step must be selected by the user when uploading the text to be extracted, so that the text to be extracted carries an identifier or mark for determining its specific extraction type.
In a specific implementation, the text extraction tool reads the text to be extracted uploaded by the user, and extracts the extraction type identifier contained in it.
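Step S10 amounts to reading a type marker and choosing which trained model to use. The sketch below shows one way this dispatch could look; the dictionary layout of the uploaded document, the identifier values, and the pipeline names are all hypothetical, since the source only states that the identifier distinguishes field extraction from vocabulary extraction.

```python
from enum import Enum


class ExtractionType(Enum):
    """Hypothetical identifier values; the source only names the two types."""
    FIELD = "field"
    VOCABULARY = "vocabulary"


def read_extraction_type(document: dict) -> ExtractionType:
    """Read the identifier the user selected when uploading the text.

    The layout {"text": ..., "extraction_type": ...} is an assumption.
    """
    return ExtractionType(document["extraction_type"])


def choose_pipeline(document: dict) -> str:
    """Return the (hypothetical) name of the CRF pipeline for this document."""
    kind = read_extraction_type(document)
    if kind is ExtractionType.FIELD:
        return "first_crf_field_extraction"
    return "second_crf_vocabulary_extraction"
```

An unknown identifier value raises `ValueError` from the enum lookup, which keeps a mislabeled upload from silently falling into either pipeline.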
Step S20: When it is detected that the extraction type identifier is field extraction, call a multi-threaded processing script to segment the text to be extracted into a sentence set;
It should be understood that field extraction means extracting paragraphs or sentences. Therefore, the text extraction tool in this embodiment may first segment the text to be extracted at the sentence level to obtain the sentences it contains, and then combine these segmented sentences into a sentence set. The multi-threaded processing script may be pre-written computer-readable instructions or a code file that lets multiple threads execute the text segmentation operation concurrently.
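A multi-threaded segmentation script of the kind described in step S20 might look like the following sketch. The choice of sentence-ending punctuation, the per-paragraph work split, and the function names are assumptions; the source does not specify how the script divides the work between threads.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Split after Chinese or Western sentence-ending marks; a Western full stop
# only ends a sentence when followed by whitespace (illustrative choice).
_SENTENCE_END = re.compile(r"(?<=[。！？!?])\s*|(?<=\.)\s+")


def split_paragraph(paragraph: str) -> list[str]:
    """Split one paragraph into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


def segment_text(text: str, max_workers: int = 4) -> list[str]:
    """Segment text into a sentence set using several threads.

    Paragraphs are processed concurrently; pool.map yields results in input
    order, so the sentences keep their original document order.
    """
    paragraphs = [p for p in text.splitlines() if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_paragraph = list(pool.map(split_paragraph, paragraphs))
    return [sentence for sentences in per_paragraph for sentence in sentences]
```

Preserving input order matters here because a later step splices the sentence vectors according to each sentence's position in the text.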
Step S30: Convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
It should be noted that, in this embodiment, a sentence may be converted into a sentence vector by first performing word segmentation on the sentence through the multi-threaded processing script and then obtaining the vocabulary dimensions after segmentation (for example, for the sentence "I like watching TV, I don't like watching movies", the vocabulary dimensions are: I, like, watch, TV, movie, not, also). The word frequency of each word is then counted ("I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0"), and finally the sentence is vectorized according to these word frequencies to obtain the sentence vector "[1, 2, 2, 1, 1, 1, 0]". Of course, other sentence vectorization methods may also be used, and this embodiment places no specific limitation on this.
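The word-frequency example above can be reproduced with a few lines of code. In this sketch the sentence is assumed to be already word-segmented (English stand-ins are used for the tokens), and the function name is illustrative.

```python
from collections import Counter


def count_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Map a tokenized sentence to word counts over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Tokens of the example sentence, segmented by hand for illustration.
tokens = ["I", "like", "watch", "TV", "not", "like", "watch", "movie"]
vocabulary = ["I", "like", "watch", "TV", "movie", "not", "also"]
vector = count_vector(tokens, vocabulary)  # [1, 2, 2, 1, 1, 1, 0]
```

The vocabulary order fixes the meaning of each vector position, so the same vocabulary list has to be reused for every sentence that is compared or fed to the same model.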
Step S40: Splice the sentence vectors to obtain a target sentence vector;
It should be understood that, in order to perform field extraction over the entire document and avoid missing target fields that need to be extracted, the text extraction tool in this embodiment also splices the sentence vectors of the individual sentences in the paragraph order of the text, obtaining the target sentence vector that is finally input to the CRF model.
Further, considering that the BERT model (a method of pre-training language representations: a general-purpose "language understanding" model trained on a large text corpus such as Wikipedia) has clear advantages over other language models in natural language processing, this embodiment preferably uses the BERT model to vectorize sentences.
Specifically, the sentences in the sentence set may be input to a pre-trained language model (that is, the BERT model above) through the multi-threaded processing script to obtain the sentence vector that the pre-trained language model outputs for each sentence; the text position information of each sentence in the text to be extracted is then obtained, and the sentence order corresponding to each sentence is determined according to the text position information; the sentence vectors are then spliced in that sentence order to obtain the target sentence vector.
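The ordering and splicing part of step S40 can be sketched without a language model by using placeholder vectors. The source does not specify whether splicing means a flat concatenation or a sequence of per-sentence vectors; the sketch assumes flat concatenation, and the position values and function name are hypothetical.

```python
from itertools import chain


def splice_sentence_vectors(
    positioned_vectors: list[tuple[int, list[float]]],
) -> list[float]:
    """Concatenate sentence vectors in the order the sentences appear.

    Each item is (text_position, sentence_vector); text_position is assumed
    to be the sentence's index or character offset in the source text.
    """
    ordered = sorted(positioned_vectors, key=lambda item: item[0])
    return list(chain.from_iterable(vector for _, vector in ordered))


# Placeholder vectors standing in for language-model outputs, out of order.
vectors = [(2, [0.5, 0.6]), (0, [0.1, 0.2]), (1, [0.3, 0.4])]
target = splice_sentence_vectors(vectors)
```

Sorting by position makes the result independent of the order in which worker threads finish producing the individual sentence vectors.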
Step S50: inputting the target sentence vector into a first conditional random field model, and obtaining a first prediction result output by the first conditional random field model;
It should be noted that the application scenarios of field extraction and vocabulary extraction may differ, and different application scenarios may also impose different requirements on, for example, the accuracy of the text extraction results. Therefore, in this embodiment, when extracting text information with the text extraction tool, the user may train a different CRF model for each text extraction type. This embodiment takes the CRF model dedicated to paragraph/field extraction as the first conditional random field model.
In addition, before step S10 of this embodiment is performed, the user needs to train an initial CRF model on the text extraction tool according to actual requirements. Specifically, the text extraction tool may obtain several user-annotated documents and vectorize them to obtain annotated text vectors, which contain observation text sequences; input the annotated text vectors into an initial conditional random field model, so that the initial model is trained on the observation text sequences to obtain a conditional random field model to be verified; and evaluate the model to be verified, taking it as the first conditional random field model when the evaluation result satisfies a preset condition. The preset condition may be that the evaluation result of the model (for example, the accuracy of its predictions) meets the standard for use, such as a prediction accuracy above 95%; this embodiment places no limitation on this.
It should be understood that the CRF model, that is, the conditional random field model, is an undirected graphical learning model proposed on the basis of the maximum entropy model and the hidden Markov model, and is a conditional probability model for labeling and segmenting sequential data. The conditional probability that the model ultimately computes is P(y1, …, yn | x); that is, the model seeks an identification sequence y1, …, yn for the text such that the probability of that sequence is maximal given the observation sequence x (namely, the field annotated by the user). In other words, the identification sequence obtained by the conditional random field model in this embodiment makes the corresponding observation sequence identical or most similar to the observation sequence pre-annotated by the user in the sample document (that is, the conditional probability is maximal), thereby achieving accurate extraction of the target field.
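For reference, the conditional probability can be written out in the standard linear-chain form. This form is an assumption: the description does not state which CRF structure is used, and the feature functions f_k and weights λ_k below are generic placeholders rather than quantities defined in this application.

```latex
% Linear-chain CRF (assumed form): x is the observation sequence,
% y = (y_1, ..., y_n) the identification sequence.
P(y_1, \dots, y_n \mid x)
  = \frac{1}{Z(x)}
    \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)

% Prediction selects the identification sequence with maximal conditional probability.
\hat{y} = \arg\max_{y} \; P(y_1, \dots, y_n \mid x)
```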
In practical applications, the CRF model may be trained as follows:
(1) Annotate the fields or words to be extracted in the sample documents as follows. For example, if the field to be extracted is "承租人:张三(中国)投资有限公司" ("Lessee: Zhang San (China) Investment Co., Ltd."), the user needs to annotate every occurrence of this field in the sample documents (that is, the observation sequence below), for example:
Observation sequence: 承租人:张三(中国)投资有限公司
Identification sequence: O O O O B I I I I I I I I I I E
(2) Input the annotated sample documents into the initial CRF model for training, so that the initial CRF model learns the conditional probability (function) from multiple sample documents containing such annotations, and the trained CRF model can predict the correct identification sequence from an observation sequence.
Here, the observation sequence is the field or word annotated by the user; the identification sequence is the text sequence that the text extraction tool generates automatically from the observation sequence using the OBIE (ontology-based information extraction) method; and the observation text sequence mentioned above is the text sequence obtained after the observation sequence is vectorized.
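The generation of an identification sequence from an annotated span can be sketched as follows. This is an illustration only: the function name `tag_sequence` is hypothetical, one tag is assumed per character, and a single-character span is assumed to receive the tag B, a case the description does not address.

```python
def tag_sequence(text, span):
    """Tag each character: O outside the span, B/I/E for its begin/inside/end.

    Every occurrence of the annotated span in the text is tagged.
    """
    tags = ["O"] * len(text)
    if not span:
        return tags
    start = text.find(span)
    while start != -1:
        end = start + len(span) - 1
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        if end > start:
            tags[end] = "E"
        start = text.find(span, end + 1)
    return tags


# The annotated span is the company name; the prefix "承租人:" stays outside.
span = "张三(中国)投资有限公司"
text = "承租人:" + span

print(" ".join(tag_sequence(text, span)))
# O O O O B I I I I I I I I I I E
```

The output has 16 tags for the 16 characters of the example, with four O tags for the prefix and a B, ten I tags, and an E for the twelve-character company name.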
In a specific implementation, the text extraction tool may input the spliced target sentence vector into the first conditional random field model and then obtain the first prediction result output by that model. It can be understood that the document to be extracted will usually contain multiple fields that are identical or similar to the observation sequence, so the first prediction result output by the first conditional random field model usually includes multiple conditional probabilities, for example a conditional probability P1 of 98% for field 1, a conditional probability P2 of 95% for field 2, a conditional probability P3 of 90% for field 3, and so on.
Step S60: extracting a target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
It can be understood that the exact matching retrieval algorithm, also called exact matching retrieval, refers to a retrieval method in which the search term is exactly identical to a field in the resource library. Exact matching means that the input search term is searched as a fixed phrase. In this embodiment, the text extraction tool may search for the fields corresponding to the conditional probabilities in the prediction result as "fixed phrases", and thereby extract the retrieved target fields.
Specifically, the text extraction tool may sort the conditional probabilities in the first prediction result from high to low, select the one or more top-ranked conditional probabilities, and take the fields corresponding to them as target fields for text extraction by exact matching retrieval. Alternatively, the text extraction tool may filter the conditional probabilities in the prediction result against a preset conditional probability threshold, for example by taking every conditional probability above the threshold as a target conditional probability, determining the target fields from the target conditional probabilities, and then performing text extraction by exact matching retrieval based on the target fields. This embodiment places no specific limitation on how the target field is determined from the first prediction result.
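The two selection strategies and the subsequent exact matching retrieval can be sketched as follows. This is a minimal sketch under assumptions: the prediction result is represented as a list of (field text, conditional probability) pairs, and the names `select_fields` and `exact_match` are hypothetical.

```python
def select_fields(predictions, top_k=None, threshold=None):
    """Pick target fields from (field_text, probability) pairs.

    Candidates are ranked from high to low probability; `threshold` keeps
    only those strictly above it, and `top_k` keeps only the first k.
    """
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [field for field, _ in ranked]


def exact_match(text, field):
    """Return the start offset of every exact occurrence of field in text."""
    offsets = []
    if not field:
        return offsets
    start = text.find(field)
    while start != -1:
        offsets.append(start)
        start = text.find(field, start + 1)
    return offsets


predictions = [("field 2", 0.95), ("field 1", 0.98), ("field 3", 0.90)]
print(select_fields(predictions, top_k=1))         # ['field 1']
print(select_fields(predictions, threshold=0.92))  # ['field 1', 'field 2']
print(exact_match("abcabc", "bc"))                 # [1, 4]
```

Exact matching treats the selected field as a fixed phrase, so only verbatim occurrences in the text to be extracted are returned.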
In this embodiment, the text to be extracted is read and the extraction type identifier contained in it is extracted; when the extraction type identifier is detected to indicate field extraction, the multi-threaded processing script is called to split the text to be extracted into a sentence set; the sentences in the sentence set are converted into sentence vectors by the multi-threaded processing script; the sentence vectors are spliced to obtain a target sentence vector; the target sentence vector is input into the first conditional random field model, and the first prediction result output by that model is obtained; and the target field is extracted from the text to be extracted by using the exact matching retrieval algorithm according to the first prediction result. This embodiment determines the extraction length according to the extraction type identifier and selects the corresponding conditional random field model for each extraction length, which makes text extraction more targeted. Using the multi-threaded processing script for text segmentation also improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm ensures the accuracy of target field extraction.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.
Based on the foregoing first embodiment, in this embodiment, after step S10, the method further includes:
Step S201: when the extraction type identifier is detected to indicate vocabulary extraction, calling the multi-threaded processing script to split the text to be extracted into several sentences;
It should be understood that vocabulary extraction, also called point extraction, means extracting characters or words. Likewise, before vocabulary extraction the user needs to annotate the words to be extracted in the sample documents, such as words of different dimensions like the contracting parties, the signing time, and the signing address, and to configure a different label category for the words of each dimension, such as person, time, and address.
In a specific implementation, when the text extraction tool determines, from the extraction type identifier or mark carried in the text to be extracted, that vocabulary extraction is required, it may call the multi-threaded processing script to split the text to be extracted into several sentences.
Step S301: obtaining the similarity between each sentence and a sample sentence;
It should be noted that, before performing vocabulary extraction on the text to be extracted, the user likewise needs to train a CRF model with the text extraction tool on pre-annotated sample documents (documents containing annotated characters or words). Therefore, this embodiment takes the sentences in the sample documents that carry annotated characters or words as the sample sentences.
It should be understood that, in general, the more similar two sentences are, the closer the words they contain. The text extraction method of this embodiment therefore first finds sentences similar to the sample sentences and then extracts the target words from the similar sentences found.
Specifically, when calculating the similarity between sentences, this embodiment may first count the frequency of each word in each sentence using a word frequency statistics technique, then determine the keyword (set) of each sentence from the statistics, and finally take the similarity between the sentence keywords (sets) as the similarity between the sentences, which improves the accuracy of the sentence similarity calculation.
Current similarity calculation algorithms include the cosine similarity algorithm, the Euclidean distance algorithm, the Pearson correlation coefficient, and so on. To improve the efficiency of the similarity calculation and reduce the amount of computation, the similarity calculation algorithm in this embodiment is preferably the cosine similarity algorithm, which calculates similarity from the angle between vectors.
Further, although existing word frequency statistics techniques are simple and convenient, their drawbacks are also obvious. For example, in a document processed with word frequency statistics, very frequent words such as "我" (I) and "的" (a structural particle) are usually given high weights even though they carry little meaning, which affects the determination of sentence keywords to some extent. This embodiment therefore preferably uses the term frequency-inverse document frequency (TF-IDF) algorithm to overcome this drawback.
Specifically, the text extraction tool performs word segmentation on the split sentences and obtains, based on the TF-IDF algorithm, the term frequency-inverse document frequency value (that is, the TF-IDF value) of each word after segmentation; it then determines, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs; and finally it obtains, based on the sentence keywords, the similarity between that sentence and the sample sentence.
The step of obtaining, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence may specifically include: obtaining the word frequency vector corresponding to the sentence keywords, and then using the cosine similarity algorithm to calculate the cosine similarity between the word frequency vector of the sentence and the word frequency vector of the sample sentence. The larger the cosine similarity value, the more similar the two sentences; the smaller the value, the less similar they are.
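The cosine similarity between two word frequency vectors can be computed as in the following sketch. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for illustration; both vectors are assumed to be built over the same keyword dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length word frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: a sentence with no keywords matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


sentence_vector = [1, 2, 2, 1, 1, 1, 0]
sample_vector = [1, 1, 1, 0, 1, 1, 1]
similarity = cosine_similarity(sentence_vector, sample_vector)
print(round(similarity, 3))
```

For non-negative word frequency vectors the value lies between 0 and 1, with 1 indicating vectors that point in the same direction.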
Step S401: selecting, based on the similarity, several target sentences corresponding to the sample sentence from the split sentences;
It should be understood that, for each sample sentence in the sample documents, the text to be extracted may contain multiple similar target sentences. The text extraction tool of this embodiment therefore first selects, according to the calculated similarity, several target sentences corresponding to the sample sentence from the split sentences, and then extracts the final target words from these target sentences.
Step S501: constructing a candidate sentence set from the target sentences, vectorizing the sentences in the candidate sentence set, and inputting them into a second conditional random field model;
It should be noted that this embodiment takes a pre-trained CRF model dedicated to vocabulary extraction as the second conditional random field model.
In a specific implementation, the text extraction tool may construct the candidate sentence set from the target sentences, input the sentences in the set into the BERT model, and obtain the sentence vectors output by the BERT model. Once these sentence vectors have been obtained, the text extraction tool can input them into the second conditional random field model to predict the conditional probabilities.
Step S601: obtaining a second prediction result output by the second conditional random field model, and extracting target words from the text to be extracted by using the exact matching retrieval algorithm according to the second prediction result.
In a specific implementation, after obtaining the second prediction result output by the second conditional random field model, the text extraction tool can determine the target words to be extracted according to the conditional probability values contained in the second prediction result, and then use the exact matching retrieval algorithm to extract every retrieved occurrence of the determined target words from the text to be extracted.
In this embodiment, when the extraction type identifier is detected to indicate vocabulary extraction, the multi-threaded processing script is called to split the text to be extracted into several sentences; the similarity between each sentence and the sample sentence is obtained; several target sentences corresponding to the sample sentence are selected from the split sentences based on the similarity; a candidate sentence set is constructed from the target sentences, and the sentences in the candidate sentence set are vectorized and input into the second conditional random field model; and the second prediction result output by the second conditional random field model is obtained, and the target words are extracted from the text to be extracted by using the exact matching retrieval algorithm according to the second prediction result. Splitting the text with the multi-threaded processing script improves segmentation efficiency. Selecting target sentences by their similarity to the sample sentence to construct the candidate sentence set ensures that the sentences input into the conditional random field model are close to the sample sentences, which reduces the computation required of the model while improving the accuracy of vocabulary extraction.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third embodiment of the text extraction method of this application.
Based on the foregoing second embodiment, the text extraction method of this embodiment further includes, before step S10:
Step S01: obtaining several user-annotated documents, the user-annotated documents containing label sentences of multiple preset label categories;
It should be understood that the user-annotated documents in this embodiment are texts in which the user has annotated characters or words in advance. The preset label categories may be pre-configured identifiers that distinguish characters or words of different dimensions. For example, the label for characters or words naming the contracting parties may be configured as "person", the label for characters or words expressing a time, moment, or duration as "time", and the label for characters or words expressing a place or occasion as "address".
In practical applications, the user may annotate each user-annotated document with multiple different label categories, and each label category may correspond to multiple label sentences.
Step S02: performing word segmentation on the label sentences through the multi-threaded processing script, and constructing a vocabulary dictionary from the sentence words obtained after segmentation;
In a specific implementation, the text extraction tool may perform word segmentation, through the multi-threaded processing script, on each label sentence contained in the user-annotated documents, and then remove stop words such as "的" and "在" from the segmented sentence words. After the stop words are removed, the text extraction tool can construct the vocabulary dictionary from the remaining sentence words. For example, if user-annotated document a contains n label sentences of label category b, the text extraction tool may segment these n label sentences and remove their stop words to obtain a vocabulary dictionary containing v words.
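Dictionary construction after stop-word removal can be sketched as follows. This is an illustration under assumptions: the label sentences are taken as already segmented, the stop-word list is a small hypothetical one, and the function name `build_vocabulary` and the word-to-index layout are not taken from the application.

```python
def build_vocabulary(tokenized_sentences, stop_words):
    """Map each non-stop word to an index, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in stop_words and token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


# Assumed segmentation results for n = 2 label sentences.
label_sentences = [
    ["合同", "的", "签约", "时间"],
    ["签约", "地址", "在", "深圳"],
]
stop_words = {"的", "在"}

vocabulary = build_vocabulary(label_sentences, stop_words)
print(vocabulary)
# {'合同': 0, '签约': 1, '时间': 2, '地址': 3, '深圳': 4}
print(len(vocabulary))  # v = 5
```

The size of the resulting dictionary is the quantity v used for the row count of the TF-IDF matrix in step S03.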
Step S03: calculating the term frequency-inverse document frequency value of each word in the vocabulary dictionary, and constructing a term frequency-inverse document frequency matrix from the calculation results;
In a specific implementation, the text extraction tool may use the TF-IDF algorithm to calculate the term frequency-inverse document frequency value (TF-IDF value) of each word in the vocabulary dictionary, and then construct a TF-IDF matrix of dimension v × n from the calculated TF-IDF values.
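Construction of the v × n matrix can be sketched as follows. The description does not say which TF-IDF variant is used, so this sketch assumes one common definition: term frequency as the word's share of the sentence's tokens, and inverse document frequency as log(n / df), where df is the number of label sentences containing the word. The function name `tfidf_matrix` is hypothetical.

```python
import math


def tfidf_matrix(tokenized_sentences, vocabulary):
    """Return a v x n matrix: one row per word, one column per sentence."""
    n = len(tokenized_sentences)
    matrix = [[0.0] * n for _ in range(len(vocabulary))]
    # Document frequency: number of sentences in which each word occurs.
    doc_freq = {
        word: sum(1 for tokens in tokenized_sentences if word in tokens)
        for word in vocabulary
    }
    for col, tokens in enumerate(tokenized_sentences):
        if not tokens:
            continue
        for word, row in vocabulary.items():
            if doc_freq[word] == 0:
                continue
            tf = tokens.count(word) / len(tokens)
            idf = math.log(n / doc_freq[word])
            matrix[row][col] = tf * idf
    return matrix


# Two label sentences after segmentation and stop-word removal (assumed).
sentences = [["签约", "时间"], ["签约", "地址"]]
vocab = {"签约": 0, "时间": 1, "地址": 2}

matrix = tfidf_matrix(sentences, vocab)
for row in matrix:
    print([round(value, 3) for value in row])
```

Under this variant a word that occurs in every sentence, such as "签约" here, receives a weight of zero, which is the effect the TF-IDF weighting is meant to have on uninformative frequent words.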
Step S04: obtaining the sentence vectors corresponding to the label sentences according to the term frequency-inverse document frequency matrix;
It should be understood that, for a document with a large vocabulary, the corresponding TF-IDF matrix may be complex. The more complex the matrix, the more computing resources it consumes during processing, which lowers computational efficiency and makes it harder to pick out the more important matrix data. Therefore, after obtaining the TF-IDF matrix, the text extraction tool in this embodiment also performs dimensionality reduction on it.
Specifically, the text extraction tool may perform singular value decomposition on the term frequency-inverse document frequency matrix to obtain a set of singular values; select a preset number of target singular values from that set and reconstruct the matrix from the target singular values to obtain a target matrix; and finally obtain the sentence vectors corresponding to the label sentences based on the target matrix.
It should be understood that the singular values returned by a singular value decomposition (SVD) function are generally arranged in descending order. The larger a singular value, the more of the original matrix's information it captures; that is, the higher its information content and the more representative it is. Therefore, after obtaining the singular value set, the text extraction tool of this embodiment may select a preset number of target singular values (for example, the 60 or 120 largest) from the set to reconstruct the matrix, thereby effectively reducing the dimensionality of the TF-IDF matrix without losing its main information. The preset number may be set according to the actual situation, and this embodiment places no limitation on it.
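The truncation can be sketched with NumPy as follows. This is a sketch under assumptions: the function name `truncate_svd` is hypothetical, and taking the columns of diag(S_k) · V_kᵀ as the reduced sentence vectors is one reasonable reading of "obtaining sentence vectors based on the target matrix", not something the description specifies.

```python
import numpy as np


def truncate_svd(matrix, k):
    """Keep the k largest singular values of a v x n matrix.

    Returns the rank-k reconstruction (v x n) and k-dimensional sentence
    vectors (shape k x n, one column per sentence).
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # np.linalg.svd returns the singular values in descending order,
    # so the first k entries are the k largest.
    reconstructed = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    sentence_vectors = np.diag(s[:k]) @ vt[:k, :]
    return reconstructed, sentence_vectors


# A small stand-in for a v x n TF-IDF matrix (v = 4 words, n = 3 sentences).
tfidf = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 5.0],
    [0.0, 6.0, 0.0],
])

low_rank, sentence_vectors = truncate_svd(tfidf, k=2)
print(low_rank.shape)          # (4, 3)
print(sentence_vectors.shape)  # (2, 3)
```

Keeping all singular values reproduces the original matrix; keeping fewer yields the closest lower-rank approximation in the least-squares sense.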
In a specific implementation, after reducing the dimensionality of the term frequency-inverse document frequency matrix by SVD, the text extraction tool may obtain the sentence vector corresponding to each label sentence from the reduced matrix.
Step S05: inputting the sentence vectors into a conditional random field model to be trained for training, to obtain the second conditional random field model.
In a specific implementation, the text extraction tool may input the obtained sentence vectors into the conditional random field model to be trained, thereby obtaining a second conditional random field model that predicts word similarity with the words annotated in the sample sentences as the reference.
In this embodiment, several user-annotated documents are obtained, the user-annotated documents containing label sentences of multiple preset label categories; word segmentation is performed on the label sentences through the multi-threaded processing script, and a vocabulary dictionary is constructed from the segmented sentence words; the term frequency-inverse document frequency value of each word in the vocabulary dictionary is calculated, and a term frequency-inverse document frequency matrix is constructed from the calculation results; the sentence vectors corresponding to the label sentences are obtained according to that matrix; and the sentence vectors are input into the conditional random field model to be trained to obtain the second conditional random field model. Because the sentence vectors of the label sentences are obtained from a matrix built from the term frequency-inverse document frequency value of each word, and the conditional random field model is then trained on these sentence vectors, the trained model can be expected to have high accuracy.
In addition, an embodiment of this application further provides a computer-readable storage medium, which may be non-volatile or volatile. A text extraction program is stored on the computer-readable storage medium, and when the text extraction program is executed by a processor, the steps of the text extraction method described above are implemented.
Referring to FIG. 5, FIG. 5 is a structural block diagram of a first embodiment of the text extraction apparatus of this application.
As shown in FIG. 5, the text extraction apparatus provided in this embodiment of the application includes:
a text acquisition module 501, configured to read the text to be extracted and extract the extraction type identifier contained in the text to be extracted;
a sentence segmentation module 502, configured to call the multi-threaded processing script to split the text to be extracted into a sentence set when the extraction type identifier is detected to indicate field extraction;
a vector conversion module 503, configured to convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
a vector splicing module 504, configured to splice the sentence vectors to obtain a target sentence vector;
a model prediction module 505, configured to input the target sentence vector into the first conditional random field model and obtain the first prediction result output by the first conditional random field model;
a text extraction module 506, configured to extract the target field from the text to be extracted by using the exact matching retrieval algorithm according to the first prediction result.
In this embodiment, the text to be extracted is read and the extraction type identifier contained in it is extracted; when the extraction type identifier is detected to indicate field extraction, the multi-threaded processing script is called to split the text to be extracted into a sentence set; the sentences in the sentence set are converted into sentence vectors by the multi-threaded processing script; the sentence vectors are spliced to obtain a target sentence vector; the target sentence vector is input into the first conditional random field model, and the first prediction result output by that model is obtained; and the target field is extracted from the text to be extracted by using the exact matching retrieval algorithm according to the first prediction result. This embodiment determines the extraction length according to the extraction type identifier and selects the corresponding conditional random field model for each extraction length, which makes text extraction more targeted. Using the multi-threaded processing script for text segmentation also improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm ensures the accuracy of target field extraction.
Based on the first embodiment of the text extraction apparatus of this application described above, a second embodiment of the text extraction apparatus of this application is proposed.
In this embodiment, the vector conversion module 503 is further configured to input the sentences in the sentence set into a pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-trained language model; correspondingly, the vector splicing module 504 is further configured to obtain text position information of each sentence in the text to be extracted, determine the sentence order of the sentences according to the text position information, and splice the sentence vectors in the sentence order to obtain the target sentence vector.
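The conversion and splicing described above can be sketched as follows. This is an illustrative sketch only: `embed` is a hypothetical stand-in for the pre-trained language model (it returns a toy fixed-size vector), the character offset of each sentence is assumed to serve as the text position information, and a standard-library thread pool is assumed to play the role of the multi-threaded processing script.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, sentence) pairs; the offset is the position information."""
    pattern = r"[^.!?。！？]+[.!?。！？]?"
    return [
        (m.start(), m.group().strip())
        for m in re.finditer(pattern, text)
        if m.group().strip()
    ]


def embed(sentence: str, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a pre-trained language model (toy vector)."""
    vec = [0.0] * dim
    for i, ch in enumerate(sentence):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def target_vector(text: str, workers: int = 4) -> List[float]:
    """Embed sentences in worker threads, then splice the vectors in document order."""
    pairs = split_sentences(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Keep each sentence's offset next to its pending result.
        futures = [(start, pool.submit(embed, s)) for start, s in pairs]
        results = [(start, f.result()) for start, f in futures]
    results.sort(key=lambda item: item[0])  # restore the sentence order
    joined: List[float] = []
    for _, vec in results:
        joined.extend(vec)
    return joined


vector = target_vector("A b. C d. E f!")
```

Sorting by offset before splicing means the target sentence vector does not depend on the order in which the worker threads happen to finish.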
Further, the text extraction apparatus of this embodiment also includes a model training module, configured to: acquire a number of user-annotated documents and vectorize them to obtain annotated text vectors, the annotated text vectors containing an observed text sequence; input the annotated text vectors into an initial conditional random field model, so that the initial conditional random field model is trained based on the observed text sequence to obtain a conditional random field model to be verified; and evaluate the conditional random field model to be verified and, when the evaluation result meets a preset condition, use the conditional random field model to be verified as the first conditional random field model.
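The evaluation gate at the end of this training procedure can be illustrated as follows. The text does not specify the evaluation metric or the preset condition, so this sketch assumes a token-level F1 score on held-out annotated sequences and a hypothetical threshold of 0.8; the function names are likewise illustrative.

```python
from typing import Sequence


def token_f1(
    gold: Sequence[Sequence[str]],
    pred: Sequence[Sequence[str]],
    outside: str = "O",
) -> float:
    """Token-level F1 over all non-outside labels (assumed evaluation metric)."""
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        for g, p in zip(g_seq, p_seq):
            if p != outside and p == g:
                tp += 1
            else:
                if p != outside:
                    fp += 1  # predicted a label that is wrong
                if g != outside:
                    fn += 1  # missed (or mislabeled) a gold label
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accept_model(gold, pred, threshold: float = 0.8) -> bool:
    """Promote the model to be verified only if it meets the preset condition."""
    return token_f1(gold, pred) >= threshold


# Hypothetical held-out labels and the predictions of the model to be verified.
gold_tags = [["O", "B", "I", "O"]]
pred_tags = [["O", "B", "O", "O"]]
score = token_f1(gold_tags, pred_tags)
accepted = accept_model(gold_tags, pred_tags)
```

In this example the model recovers one of the two labeled tokens, so it falls below the assumed threshold and would not be adopted as the first conditional random field model.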
Further, the text extraction apparatus of this embodiment also includes a vocabulary extraction module, configured to: when the extraction type identifier is detected to indicate vocabulary extraction, call the multi-threaded processing script to split the text to be extracted into a number of sentences; obtain the similarity between each sentence and a sample sentence; select, based on the similarity, a number of target sentences corresponding to the sample sentence from the segmented sentences; construct a candidate sentence set from the target sentences, vectorize the sentences in the candidate sentence set, and input them into a second conditional random field model; and obtain a second prediction result output by the second conditional random field model, and extract target vocabulary from the text to be extracted, according to the second prediction result, by using the exact matching retrieval algorithm.
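The similarity-based screening of target sentences might look like the following sketch. The similarity function here is a placeholder (the ratio from Python's `difflib.SequenceMatcher`), and the `top_n` and `min_score` parameters are assumptions introduced for illustration; the text only states that target sentences are selected based on similarity.

```python
import heapq
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; not the measure used by this application."""
    return SequenceMatcher(None, a, b).ratio()


def select_targets(
    sentences: List[str],
    sample: str,
    top_n: int = 2,
    min_score: float = 0.3,
) -> List[str]:
    """Keep at most top_n sentences whose similarity to the sample is high enough."""
    scored = [(similarity(s, sample), s) for s in sentences]
    best = heapq.nlargest(top_n, scored, key=lambda item: item[0])
    return [s for score, s in best if score >= min_score]


# Hypothetical contract sentences and a sample sentence for the wanted vocabulary.
sentences = [
    "The lessee is Acme Ltd.",
    "Payment is due monthly.",
    "The lessor is Beta Inc.",
]
targets = select_targets(sentences, "The lessee is Foo Corp.")
```

Narrowing the input to a small candidate set before it reaches the second conditional random field model keeps the model focused on sentences that resemble the sample.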
Further, the vocabulary extraction module is also configured to: perform word segmentation on the segmented sentences, and obtain the term frequency-inverse document frequency (TF-IDF) value of each word after word segmentation; determine, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs; and obtain, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.
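A keyword-based similarity of this kind can be sketched as follows. Several details are assumptions made for illustration: whitespace splitting stands in for word segmentation (Chinese text would need a dedicated segmenter), TF-IDF uses the plain `tf * log(N / df)` variant, the top `k` words of each sentence are taken as its keywords, and the Jaccard overlap of keyword sets serves as the similarity.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-sentence TF-IDF scores using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return scores


def keywords(score: Dict[str, float], k: int = 2) -> Set[str]:
    """Top-k words by TF-IDF; ties are broken alphabetically for determinism."""
    ranked = sorted(score, key=lambda t: (-score[t], t))
    return set(ranked[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two keyword sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sentences = ["tenant pays rent monthly", "landlord owns the building"]
sample = "tenant pays rent yearly"

# Score the sentences together with the sample so they share document frequencies.
scores = tfidf([s.split() for s in sentences + [sample]])
sample_keywords = keywords(scores[-1])
sims = [jaccard(keywords(sc), sample_keywords) for sc in scores[:-1]]
```

Here the first sentence shares a keyword with the sample and the second shares none, so only the first would be a plausible target sentence.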
Further, the model training module is also configured to: acquire a number of user-annotated documents, the user-annotated documents containing labeled sentences of multiple preset label categories; perform word segmentation on the labeled sentences through the multi-threaded processing script, and construct a vocabulary dictionary from the words of the segmented sentences; calculate the term frequency-inverse document frequency (TF-IDF) value of each word in the vocabulary dictionary, and construct a TF-IDF matrix from the calculation results; obtain the sentence vectors corresponding to the labeled sentences according to the TF-IDF matrix; and input the sentence vectors into a conditional random field model to be trained for training, to obtain the second conditional random field model.
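The vocabulary dictionary and TF-IDF matrix can be sketched as follows. This is an illustration under assumptions: whitespace splitting stands in for word segmentation, the matrix has one row per labeled sentence and one column per dictionary word, and TF-IDF again uses the plain `tf * log(N / df)` variant. The function names are hypothetical.

```python
import math
from typing import Dict, List


def build_vocab(tokenized: List[List[str]]) -> Dict[str, int]:
    """Map each distinct word to a column index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def tfidf_matrix(tokenized: List[List[str]], vocab: Dict[str, int]) -> List[List[float]]:
    """Rows are sentences, columns are dictionary words, entries are TF-IDF values."""
    n = len(tokenized)
    df = [0] * len(vocab)
    for tokens in tokenized:
        for tok in set(tokens):
            df[vocab[tok]] += 1
    matrix = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for tok in tokens:
            row[vocab[tok]] += 1.0 / len(tokens)  # term frequency
        matrix.append([tf * math.log(n / df[j]) for j, tf in enumerate(row)])
    return matrix


# Hypothetical labeled sentences from user-annotated documents.
labeled = ["rent is due monthly", "deposit is refundable"]
tokenized = [s.split() for s in labeled]
vocab = build_vocab(tokenized)
matrix = tfidf_matrix(tokenized, vocab)
```

Each row of `matrix` can then be read as the sentence vector of the corresponding labeled sentence; words that appear in every sentence, such as "is" here, receive a weight of zero.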
Further, the model training module is also configured to: perform singular value decomposition on the term frequency-inverse document frequency (TF-IDF) matrix to obtain a set of singular values; select a preset number of target singular values from the set of singular values, and reconstruct the TF-IDF matrix according to the target singular values to obtain a target matrix; and obtain the sentence vectors corresponding to the labeled sentences based on the target matrix.
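In standard notation, the decomposition and reconstruction can be written as below. The symbols are introduced here for illustration: $A \in \mathbb{R}^{m \times n}$ denotes the TF-IDF matrix with one row per labeled sentence, $r$ its rank, and $k$ the preset number of target singular values. Taking the $k$ largest singular values is an assumption; the text only says that a preset number of them is selected.

```latex
\[
\begin{aligned}
A   &= U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
      \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0, \\
A_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top} = U_k \Sigma_k V_k^{\top},
      \qquad k \le r .
\end{aligned}
\]
```

Under that assumption, $A_k$ is the best rank-$k$ approximation of $A$ in the Frobenius norm (the Eckart-Young theorem), and row $j$ of $A_k$ can be read as the sentence vector of the $j$-th labeled sentence.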
For other embodiments or specific implementations of the text extraction apparatus of this application, refer to the method embodiments above; details are not repeated here.
It should be noted that, in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or system. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
The serial numbers of the foregoing embodiments of this application are for description only and do not represent the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory/random access memory, a magnetic disk, or an optical disc) and includes several instructions for causing a text extraction tool device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A text extraction method, wherein the method comprises:
    reading the text to be extracted, and extracting the extraction type identifier contained in the text to be extracted;
    when the extraction type identifier is detected to indicate field extraction, calling a multi-threaded processing script to split the text to be extracted into a sentence set;
    converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
    splicing the sentence vectors to obtain a target sentence vector;
    inputting the target sentence vector into a first conditional random field model, and obtaining a first prediction result output by the first conditional random field model;
    extracting a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm.
  2. The method according to claim 1, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:
    inputting the sentences in the sentence set into a pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-trained language model;
    and the step of splicing the sentence vectors to obtain a target sentence vector comprises:
    acquiring text position information of each sentence in the text to be extracted, and determining the sentence order of the sentences according to the text position information;
    splicing the sentence vectors in the sentence order to obtain the target sentence vector.
  3. The method according to claim 1, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain annotated text vectors, the annotated text vectors containing an observed text sequence;
    inputting the annotated text vectors into an initial conditional random field model, so that the initial conditional random field model is trained based on the observed text sequence to obtain a conditional random field model to be verified;
    evaluating the conditional random field model to be verified, and when the evaluation result meets a preset condition, using the conditional random field model to be verified as the first conditional random field model.
  4. The method according to claim 1, wherein after the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    when the extraction type identifier is detected to indicate vocabulary extraction, calling the multi-threaded processing script to split the text to be extracted into a number of sentences;
    obtaining the similarity between each sentence and a sample sentence;
    selecting, based on the similarity, a number of target sentences corresponding to the sample sentence from the segmented sentences;
    constructing a candidate sentence set from the target sentences, vectorizing the sentences in the candidate sentence set, and inputting them into a second conditional random field model;
    obtaining a second prediction result output by the second conditional random field model, and extracting target vocabulary from the text to be extracted according to the second prediction result by using the exact matching retrieval algorithm.
  5. The method according to claim 4, wherein the step of obtaining the similarity between each sentence and a sample sentence comprises:
    performing word segmentation on the segmented sentences, and obtaining the term frequency-inverse document frequency (TF-IDF) value of each word after word segmentation;
    determining, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs;
    obtaining, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.
  6. The method according to claim 4, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    acquiring a number of user-annotated documents, the user-annotated documents containing labeled sentences of multiple preset label categories;
    performing word segmentation on the labeled sentences through the multi-threaded processing script, and constructing a vocabulary dictionary from the words of the segmented sentences;
    calculating the term frequency-inverse document frequency (TF-IDF) value of each word in the vocabulary dictionary, and constructing a TF-IDF matrix from the calculation results;
    obtaining the sentence vectors corresponding to the labeled sentences according to the TF-IDF matrix;
    inputting the sentence vectors into a conditional random field model to be trained for training, to obtain the second conditional random field model.
  7. The method according to claim 6, wherein the step of obtaining the sentence vectors corresponding to the labeled sentences according to the term frequency-inverse document frequency (TF-IDF) matrix comprises:
    performing singular value decomposition on the TF-IDF matrix to obtain a set of singular values;
    selecting a preset number of target singular values from the set of singular values, and reconstructing the TF-IDF matrix according to the target singular values to obtain a target matrix;
    obtaining the sentence vectors corresponding to the labeled sentences based on the target matrix.
  8. A text extraction apparatus, wherein the apparatus comprises:
    a text acquisition module, configured to read the text to be extracted and extract the extraction type identifier contained in the text to be extracted;
    a sentence segmentation module, configured to call a multi-threaded processing script to split the text to be extracted into a sentence set when the extraction type identifier is detected to indicate field extraction;
    a vector conversion module, configured to convert the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
    a vector splicing module, configured to splice the sentence vectors to obtain a target sentence vector;
    a model prediction module, configured to input the target sentence vector into a first conditional random field model and obtain a first prediction result output by the first conditional random field model;
    a text extraction module, configured to extract a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm.
  9. A text extraction device, wherein the text extraction device comprises a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the computer-readable instructions, when executed by the processor, implement the following steps:
    reading the text to be extracted, and extracting the extraction type identifier contained in the text to be extracted;
    when the extraction type identifier is detected to indicate field extraction, calling a multi-threaded processing script to split the text to be extracted into a sentence set;
    converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
    splicing the sentence vectors to obtain a target sentence vector;
    inputting the target sentence vector into a first conditional random field model, and obtaining a first prediction result output by the first conditional random field model;
    extracting a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm.
  10. The text extraction device according to claim 9, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:
    inputting the sentences in the sentence set into a pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-trained language model;
    and the step of splicing the sentence vectors to obtain a target sentence vector comprises:
    acquiring text position information of each sentence in the text to be extracted, and determining the sentence order of the sentences according to the text position information;
    splicing the sentence vectors in the sentence order to obtain the target sentence vector.
  11. The text extraction device according to claim 9, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain annotated text vectors, the annotated text vectors containing an observed text sequence;
    inputting the annotated text vectors into an initial conditional random field model, so that the initial conditional random field model is trained based on the observed text sequence to obtain a conditional random field model to be verified;
    evaluating the conditional random field model to be verified, and when the evaluation result meets a preset condition, using the conditional random field model to be verified as the first conditional random field model.
  12. The text extraction device according to claim 9, wherein after the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    when the extraction type identifier is detected to indicate vocabulary extraction, calling the multi-threaded processing script to split the text to be extracted into a number of sentences;
    obtaining the similarity between each sentence and a sample sentence;
    selecting, based on the similarity, a number of target sentences corresponding to the sample sentence from the segmented sentences;
    constructing a candidate sentence set from the target sentences, vectorizing the sentences in the candidate sentence set, and inputting them into a second conditional random field model;
    obtaining a second prediction result output by the second conditional random field model, and extracting target vocabulary from the text to be extracted according to the second prediction result by using the exact matching retrieval algorithm.
  13. The text extraction device according to claim 12, wherein the step of obtaining the similarity between each sentence and a sample sentence comprises:
    performing word segmentation on the segmented sentences, and obtaining the term frequency-inverse document frequency (TF-IDF) value of each word after word segmentation;
    determining, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs;
    obtaining, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.
  14. The text extraction device according to claim 12, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    acquiring a number of user-annotated documents, the user-annotated documents containing labeled sentences of multiple preset label categories;
    performing word segmentation on the labeled sentences through the multi-threaded processing script, and constructing a vocabulary dictionary from the words of the segmented sentences;
    calculating the term frequency-inverse document frequency (TF-IDF) value of each word in the vocabulary dictionary, and constructing a TF-IDF matrix from the calculation results;
    obtaining the sentence vectors corresponding to the labeled sentences according to the TF-IDF matrix;
    inputting the sentence vectors into a conditional random field model to be trained for training, to obtain the second conditional random field model.
  15. The text extraction device according to claim 14, wherein the step of obtaining the sentence vectors corresponding to the labeled sentences according to the term frequency-inverse document frequency (TF-IDF) matrix comprises:
    performing singular value decomposition on the TF-IDF matrix to obtain a set of singular values;
    selecting a preset number of target singular values from the set of singular values, and reconstructing the TF-IDF matrix according to the target singular values to obtain a target matrix;
    obtaining the sentence vectors corresponding to the labeled sentences based on the target matrix.
  16. A computer-readable storage medium, wherein computer-readable instructions are stored in the computer-readable storage medium, and the computer-readable instructions are executable by at least one processor to cause the at least one processor to perform the following steps:
    reading the text to be extracted, and extracting the extraction type identifier contained in the text to be extracted;
    when the extraction type identifier is detected to indicate field extraction, calling a multi-threaded processing script to split the text to be extracted into a sentence set;
    converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script;
    splicing the sentence vectors to obtain a target sentence vector;
    inputting the target sentence vector into a first conditional random field model, and obtaining a first prediction result output by the first conditional random field model;
    extracting a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm.
  17. The computer-readable storage medium according to claim 16, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:
    inputting the sentences in the sentence set into a pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-trained language model;
    and the step of splicing the sentence vectors to obtain a target sentence vector comprises:
    acquiring text position information of each sentence in the text to be extracted, and determining the sentence order of the sentences according to the text position information;
    splicing the sentence vectors in the sentence order to obtain the target sentence vector.
  18. The computer-readable storage medium according to claim 16, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain annotated text vectors, the annotated text vectors containing an observed text sequence;
    inputting the annotated text vectors into an initial conditional random field model, so that the initial conditional random field model is trained based on the observed text sequence to obtain a conditional random field model to be verified;
    evaluating the conditional random field model to be verified, and when the evaluation result meets a preset condition, using the conditional random field model to be verified as the first conditional random field model.
  19. The computer-readable storage medium according to claim 16, wherein after the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:
    when the extraction type identifier is detected to indicate vocabulary extraction, calling the multi-threaded processing script to split the text to be extracted into a number of sentences;
    obtaining the similarity between each sentence and a sample sentence;
    selecting, based on the similarity, a number of target sentences corresponding to the sample sentence from the segmented sentences;
    constructing a candidate sentence set from the target sentences, vectorizing the sentences in the candidate sentence set, and inputting them into a second conditional random field model;
    obtaining a second prediction result output by the second conditional random field model, and extracting target vocabulary from the text to be extracted according to the second prediction result by using the exact matching retrieval algorithm.
  20. The computer-readable storage medium according to claim 19, wherein the step of obtaining the similarity between each sentence and a sample sentence comprises:
    performing word segmentation on the segmented sentences, and obtaining the term frequency-inverse document frequency (TF-IDF) value of each word after word segmentation;
    determining, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs;
    obtaining, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.
PCT/CN2020/093466 2019-09-18 2020-05-29 Text extraction method, apparatus, and device, and storage medium WO2021051871A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910885399.9 2019-09-18
CN201910885399.9A CN110781276B (en) 2019-09-18 2019-09-18 Text extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021051871A1 true WO2021051871A1 (en) 2021-03-25

Family

ID=69383817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093466 WO2021051871A1 (en) 2019-09-18 2020-05-29 Text extraction method, apparatus, and device, and storage medium

Country Status (2)

Country Link
CN (1) CN110781276B (en)
WO (1) WO2021051871A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781276B (en) * 2019-09-18 2023-09-19 平安科技(深圳)有限公司 Text extraction method, device, equipment and storage medium
CN111506696A (en) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on small number of training samples
CN113343645A (en) * 2020-03-03 2021-09-03 北京沃东天骏信息技术有限公司 Information extraction model establishing method and device, storage medium and electronic equipment
CN111401050A (en) * 2020-03-28 2020-07-10 苏州机数芯微科技有限公司 Chemical reaction extractor and extraction method based on template generation
CN111753060B (en) * 2020-07-29 2023-09-26 腾讯科技(深圳)有限公司 Information retrieval method, apparatus, device and computer readable storage medium
CN112069785A (en) * 2020-08-06 2020-12-11 北京明略昭辉科技有限公司 Text sampling method and device for improving labeling efficiency
CN112069307B (en) * 2020-08-25 2023-07-04 中国人民大学 Legal provision quotation information extraction system
CN112463959A (en) * 2020-10-29 2021-03-09 中国人寿保险股份有限公司 Service processing method based on uplink short message and related equipment
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN113033216B (en) * 2021-03-03 2024-05-28 东软集团股份有限公司 Text preprocessing method and device, storage medium and electronic equipment
CN113505201A (en) * 2021-07-29 2021-10-15 宁波薄言信息技术有限公司 Contract extraction method based on SegaBert pre-training model
CN114490998B (en) * 2021-12-28 2022-11-08 北京百度网讯科技有限公司 Text information extraction method and device, electronic equipment and storage medium
CN115952279B (en) * 2022-12-02 2023-09-12 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium
CN115841113B (en) * 2023-02-24 2023-05-12 山东云天安全技术有限公司 Domain name label detection method, storage medium and electronic equipment
CN116628168B (en) * 2023-06-12 2023-11-14 深圳市逗娱科技有限公司 User personality analysis processing method and system based on big data and cloud platform
CN116450813B (en) * 2023-06-19 2023-09-19 深圳得理科技有限公司 Text key information extraction method, device, equipment and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
CN107808011A (en) * 2017-11-20 2018-03-16 北京大学深圳研究院 Classification abstracting method, device, computer equipment and the storage medium of information
CN109189820A (en) * 2018-07-30 2019-01-11 北京信息科技大学 A kind of mine safety accidents Ontological concept abstracting method
CN110019794A (en) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Classification method, device, storage medium and the electronic device of textual resources
CN110781276A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Text extraction method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6178208B2 (en) * 2013-10-28 2017-08-09 株式会社Nttドコモ Question field judgment device and question field judgment method
CN105912570B (en) * 2016-03-29 2019-11-15 北京工业大学 Resume critical field abstracting method based on hidden Markov model
US10157177B2 (en) * 2016-10-28 2018-12-18 Kira Inc. System and method for extracting entities in electronic documents
CN107729526B (en) * 2017-10-30 2020-04-07 清华大学 Text structuring method
CN109766524B (en) * 2018-12-28 2022-11-25 重庆邮电大学 Method and system for extracting combined purchasing recombination type notice information
CN109960756B (en) * 2019-03-19 2021-04-09 国家计算机网络与信息安全管理中心 News event information induction method
CN110162786B (en) * 2019-04-23 2024-02-27 百度在线网络技术(北京)有限公司 Method and device for constructing configuration file and extracting structured information

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926332A (en) * 2021-03-30 2021-06-08 善诊(上海)信息技术有限公司 Entity relationship joint extraction method and device
CN113268452A (en) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113268452B (en) * 2021-05-25 2024-02-02 联仁健康医疗大数据科技股份有限公司 Entity extraction method, device, equipment and storage medium
CN113408296A (en) * 2021-06-24 2021-09-17 东软集团股份有限公司 Text information extraction method, device and equipment
CN113408296B (en) * 2021-06-24 2024-02-13 东软集团股份有限公司 Text information extraction method, device and equipment
CN113407584A (en) * 2021-06-29 2021-09-17 微民保险代理有限公司 Label extraction method, device, equipment and storage medium
CN115145928A (en) * 2022-08-01 2022-10-04 支付宝(杭州)信息技术有限公司 Model training method and device and structured abstract acquisition method and device
CN116248375A (en) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium
CN116248375B (en) * 2023-02-01 2023-12-15 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium
CN117176471A (en) * 2023-10-25 2023-12-05 北京派网科技有限公司 Dual high-efficiency detection method, device and storage medium for anomaly of text and digital network protocol
CN117176471B (en) * 2023-10-25 2023-12-29 北京派网科技有限公司 Dual high-efficiency detection method, device and storage medium for anomaly of text and digital network protocol
CN117573815A (en) * 2024-01-17 2024-02-20 之江实验室 Retrieval enhancement generation method based on vector similarity matching optimization
CN117573815B (en) * 2024-01-17 2024-04-30 之江实验室 Retrieval enhancement generation method based on vector similarity matching optimization

Also Published As

Publication number Publication date
CN110781276B (en) 2023-09-19
CN110781276A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
WO2021051871A1 (en) Text extraction method, apparatus, and device, and storage medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
US11544459B2 (en) Method and apparatus for determining feature words and server
KR102288249B1 (en) Information processing method, terminal, and computer storage medium
WO2020082560A1 (en) Method, apparatus and device for extracting text keyword, as well as computer readable storage medium
CN109284399B (en) Similarity prediction model training method and device and computer readable storage medium
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN108038208B (en) Training method and device of context information recognition model and storage medium
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
CN110399547B (en) Method, apparatus, device and storage medium for updating model parameters
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN113297351A (en) Text data labeling method and device, electronic equipment and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
WO2021056750A1 (en) Search method and device, and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN117520523B (en) Data processing method, device, equipment and storage medium
CN114722837A (en) Multi-turn dialog intention recognition method and device and computer readable storage medium
CN112100360B (en) Dialogue response method, device and system based on vector retrieval
CN109753646B (en) Article attribute identification method and electronic equipment
CN113536784A (en) Text processing method and device, computer equipment and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN115906797A (en) Text entity alignment method, device, equipment and medium
CN115169345A (en) Training method, device and equipment for text emotion analysis model and storage medium
KR102215259B1 (en) Method of analyzing relationships of words or documents by subject and device implementing the same

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20865090; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20865090; Country of ref document: EP; Kind code of ref document: A1)