WO2021051871A1

WO2021051871A1 - Text extraction method, apparatus, and device, and storage medium

Info

Publication number: WO2021051871A1
Application number: PCT/CN2020/093466
Authority: WO
Inventors: 郝正鸿; 许开河; 王少军
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-18
Filing date: 2020-05-29
Publication date: 2021-03-25
Also published as: CN110781276B; CN110781276A

Abstract

A text extraction method, apparatus and device, and a storage medium. The method comprises: reading text to be extracted, and extracting an extraction type identifier comprised in the text to be extracted (S10); upon detection that the extraction type identifier is field extraction, invoking a multi-threaded process script to segment the text to be extracted into sentence sets (S20); converting sentences in the sentence sets into sentence vectors by means of the multi-threaded process script (S30); splicing the sentence vectors to obtain a target sentence vector (S40); inputting the target sentence vector into a first conditional random field model to obtain a first prediction result output by the first conditional random field model (S50); and extracting a target field from the text to be extracted according to the first prediction result by using an exact matching retrieval algorithm (S60). According to the method, an extraction length is determined according to an extraction type identifier, and corresponding conditional random field models are selected for text extraction depending on different extraction lengths, so that text extraction is more targeted; furthermore, a multi-threaded process script is used for text segmentation, so that the overall efficiency of text extraction is improved, and target field extraction by means of an exact matching retrieval algorithm also guarantees the accuracy of target field extraction.

Description

Text extraction method, device, equipment and storage medium

This application claims priority to the Chinese patent application filed on September 18, 2019, with application number 201910885399.9 and titled "Text extraction method, device, equipment and storage medium", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of text processing technology, and in particular to a text extraction method, device, equipment, and storage medium.

Background technique

Information extraction is the process of automatically extracting unstructured data from documents (in business scenarios such as resumes, insurance clauses, encyclopedias and contracts) and converting it into structured data, for example extracting and converting unstructured data in a lease contract such as the names of the contracting parties, the contract time, and the contract address.

From the perspective of extraction content, information extraction is divided into entity extraction, relationship extraction, and event extraction. From the perspective of extraction length, it mainly includes vocabulary extraction and field/paragraph extraction. In addition, it is also divided into open domain information extraction and closed domain information extraction. With the development of deep neural networks and the growth of computing power, existing information extraction methods mainly train end-to-end deep learning models with a large number of parameters on large-scale labeled data, and then perform text information extraction in different business scenarios based on the trained models. The inventor found that this approach does not classify extraction by extraction length, so the final extraction results are poorly targeted and of low accuracy, and the efficiency of information extraction is reduced.

The above content is only used to assist the understanding of the technical solutions of this application, and does not mean that the above content is recognized as prior art.

Technical problem

The main purpose of this application is to provide a text extraction method, device, equipment and storage medium, aiming to solve the technical problems that existing information extraction techniques produce poorly targeted results, have low accuracy, and have low extraction efficiency.

Technical solutions

In order to achieve the above objective, this application provides a text extraction method, which includes the following steps:

Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;

When it is detected that the extraction type is identified as field extraction, calling a multi-threaded processing script to divide the text to be extracted into sentence sets;

Converting the sentences in the sentence set into sentence vectors through the multi-thread processing script;

Splicing the sentence vectors to obtain a target sentence vector;

Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

According to the first prediction result, an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.

In addition, in order to achieve the above objective, this application also proposes a text extraction device, which includes:

The text acquisition module is used to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;

The sentence segmentation module is configured to call a multi-threaded processing script to segment the text to be extracted into sentence sets when it is detected that the extraction type is identified as field extraction;

The vector conversion module is used to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script;

The vector splicing module is used for splicing the sentence vector to obtain the target sentence vector;

A model prediction module, configured to input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

The text extraction module is used to extract the target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.

In addition, in order to achieve the above-mentioned object, this application also proposes text extraction equipment, which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:

Splicing the sentence vectors to obtain a target sentence vector;

According to the first prediction result, an exact matching retrieval algorithm is used to extract a target field from the text to be extracted

In addition, in order to achieve the above-mentioned object, this application also proposes a computer-readable storage medium having computer-readable instructions stored therein, and the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the following steps:

Splicing the sentence vectors to obtain a target sentence vector;

Beneficial effect

This application reads the text to be extracted and extracts the extraction type identification contained in it; when the extraction type identification is detected to be field extraction, calls a multi-threaded processing script to divide the text to be extracted into sentence sets; converts the sentences in the sentence set into sentence vectors through the multi-threaded processing script; splices the sentence vectors to obtain the target sentence vector; inputs the target sentence vector into the first conditional random field model to obtain the first prediction result output by the first conditional random field model; and, according to the first prediction result, uses an exact matching retrieval algorithm to extract the target field from the text to be extracted. This application determines the extraction length according to the extraction type identification and selects the corresponding conditional random field model for each extraction length, which makes text extraction more targeted. At the same time, this application uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.

Description of the drawings

FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the present application;

FIG. 2 is a schematic flowchart of the first embodiment of the text extraction method of this application;

FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application;

FIG. 4 is a schematic flowchart of a third embodiment of the text extraction method of this application;

FIG. 5 is a structural block diagram of the first embodiment of the text extraction device of this application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Embodiments of the present invention

It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the application.

As shown in FIG. 1, the text extraction device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM), or a stable non-volatile memory (Non-Volatile Memory, NVM), such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the text extraction device, and may include more or less components than shown in the figure, or a combination of certain components, or different component arrangements.

As shown in FIG. 1, the memory 1005 as a computer-readable storage medium may include an operating system, a data storage module, a network communication module, a user interface module, and a text extraction program.

In the text extraction device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with users. The processor 1001 and the memory 1005 of this application can be set in a text extraction device; the text extraction device calls the text extraction program stored in the memory 1005 through the processor 1001 and executes the text extraction method provided in the embodiments of the present application.

An embodiment of the present application provides a text extraction method. Refer to FIG. 2, which is a schematic flowchart of the first embodiment of the text extraction method of this application.

In this embodiment, the text extraction method includes the following steps:

Step S10: Read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;

It should be noted that the execution subject of the method in this embodiment may be a computing service device with data processing, network communication and program running functions, such as a smart phone, tablet or personal computer, or it may be a text extraction tool pre-loaded on such a computing service device. In addition, in the specific implementation scenario of this embodiment, the user first needs to upload sample documents to the text extraction tool, in which the paragraphs/fields or vocabulary to be extracted are marked. Based on these sample documents, the text extraction tool trains an untrained initial conditional random field (Conditional Random Field, CRF) model to obtain a CRF model dedicated to field extraction or a CRF model dedicated to vocabulary extraction, and then performs paragraph/field extraction or vocabulary extraction based on the trained CRF models.

It should be understood that the extraction type identification includes field extraction and vocabulary extraction. In this embodiment, for the two different application scenarios of field extraction and vocabulary extraction, the user only needs to annotate a small number (a few or a dozen) of sample documents to extract the same vocabulary or paragraphs from similar documents with high accuracy. In addition, the extraction type identification in this step needs to be selected by the user when uploading the text to be extracted, so that the text to be extracted carries an identification or mark that determines its specific extraction type.

In a specific implementation, the text extraction tool reads the text to be extracted uploaded by the user, and extracts the extraction type identification contained in the text to be extracted.

Step S20: when it is detected that the extraction type is identified as field extraction, call a multi-threaded processing script to divide the text to be extracted into sentence sets;

It should be understood that the field extraction refers to the extraction of paragraphs or sentences. Therefore, the text extraction tool in this embodiment may first segment the text to be extracted according to sentence dimensions, obtain a number of sentences corresponding to the text to be extracted, and then combine these segmented sentences into a sentence set. The multi-thread processing script may be a pre-written computer readable instruction or code file that enables multiple threads to concurrently execute a text segmentation operation.
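As an illustration of step S20, the following is a minimal sketch of concurrent sentence segmentation. The patent does not disclose the script itself, so everything here is an assumption: the function names `split_paragraph` and `split_into_sentences` are hypothetical, paragraphs are taken to be separated by line breaks, and sentences are split with a simple punctuation rule.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed splitting rule: break right after full-width CJK sentence marks,
# or after ASCII sentence marks that are followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")


def split_paragraph(paragraph: str) -> list[str]:
    """Split one paragraph into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(paragraph) if s.strip()]


def split_into_sentences(text: str, max_workers: int = 4) -> list[str]:
    """Segment paragraphs concurrently and return one ordered sentence set."""
    paragraphs = [p for p in text.splitlines() if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so document order is kept.
        per_paragraph = list(pool.map(split_paragraph, paragraphs))
    return [sentence for sentences in per_paragraph for sentence in sentences]
```

In CPython, threads mainly help when the per-paragraph work is I/O-bound; for CPU-heavy segmentation a process pool may be the better fit.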

Step S30: Transforming the sentences in the sentence set into sentence vectors through the multi-thread processing script;

It should be noted that, in this embodiment, a sentence may be converted into a sentence vector as follows. The multi-threaded processing script first performs word segmentation on the sentence and obtains the vocabulary dimension after segmentation (for example, for the sentence "I like watching TV, but do not like watching movies", the vocabulary dimension is: I, like, watch, TV, movie, no, also). It then counts the word frequency of each vocabulary item after segmentation ("I 1, like 2, watch 2, TV 1, movie 1, no 1, also 0"), and finally builds the sentence vector from these word frequencies, obtaining "[1, 2, 2, 1, 1, 1, 0]". Of course, other sentence vectorization methods may also be used, and this embodiment does not specifically limit this.
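The word-frequency vectorization in this example can be reproduced with a short sketch. The function name `term_frequency_vector` is hypothetical, and the token list is segmented by hand here, since the patent does not name a word segmentation tool.

```python
from collections import Counter


def term_frequency_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary item occurs in the segmented sentence."""
    counts = Counter(tokens)
    # Counter returns 0 for items that never occur, e.g. "also" below.
    return [counts[word] for word in vocabulary]


# Vocabulary dimension and hand-segmented tokens for the example sentence.
vocabulary = ["I", "like", "watch", "TV", "movie", "no", "also"]
tokens = ["I", "like", "watch", "TV", "no", "like", "watch", "movie"]

print(term_frequency_vector(tokens, vocabulary))  # [1, 2, 2, 1, 1, 1, 0]
```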

Step S40: splicing the sentence vectors to obtain a target sentence vector;

It should be understood that, in order to extract fields from the entire document to be extracted and avoid omitting target fields that need to be extracted, the text extraction tool in this embodiment also splices the sentence vectors corresponding to each sentence in the paragraph order of the text, to obtain the target sentence vector that is finally input into the CRF model.

Further, the BERT model (a method of pre-training language representations, i.e. a general "language understanding" model trained on a large text corpus such as Wikipedia) has obvious advantages over other language models in natural language processing, so this embodiment preferably uses the BERT model to vectorize the sentences.

Specifically, the sentences in the sentence set may be input to a pre-trained language model (i.e., the above-mentioned BERT model) through the multi-threaded processing script to obtain the sentence vector that the pre-trained language model outputs for each sentence; the text position information of each sentence in the text to be extracted is then acquired, and the sentence order corresponding to each sentence is determined according to the text position information; finally, the sentence vectors are spliced in the sentence order to obtain the target sentence vector.
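The ordering and splicing step might look like the following minimal sketch. It assumes that "splicing" means end-to-end concatenation and that each sentence vector arrives paired with the character offset of its sentence in the text; the function name `splice_sentence_vectors` is hypothetical, and the vectors are placeholders rather than real BERT outputs.

```python
from itertools import chain


def splice_sentence_vectors(
    positioned_vectors: list[tuple[int, list[float]]],
) -> list[float]:
    """Order sentence vectors by text position and concatenate them."""
    # Worker threads may finish out of order, so sort by position first.
    ordered = sorted(positioned_vectors, key=lambda item: item[0])
    return list(chain.from_iterable(vector for _, vector in ordered))


# (text position, sentence vector) pairs, deliberately out of order.
pairs = [(42, [0.3, 0.4]), (0, [0.1, 0.2]), (17, [0.5, 0.6])]
target_sentence_vector = splice_sentence_vectors(pairs)
```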

Step S50: Input the target sentence vector to a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

It should be noted that the application scenarios of field extraction and vocabulary extraction may differ, and different application scenarios may have different requirements on the accuracy of the text extraction results. Therefore, in this embodiment, when the user extracts text information through the text extraction tool, different CRF models can be trained for different text extraction types. In this embodiment, the CRF model dedicated to paragraph/field extraction is used as the first conditional random field model.

In addition, before step S10 is performed in this embodiment, the user needs to train the initial CRF model on the text extraction tool according to actual needs. Specifically, the text extraction tool can obtain a number of user-labeled documents and vectorize them to obtain a labeled text vector that contains an observation text sequence; input the labeled text vector into the initial conditional random field model, so that the initial conditional random field model is trained on the observation text sequence to obtain a conditional random field model to be verified; and perform model evaluation on the conditional random field model to be verified and, when the evaluation result meets a preset condition, use the conditional random field model to be verified as the first conditional random field model. The preset condition may be that the evaluation result of the model (for example, the accuracy of the prediction result) meets the usage standard, for example that the accuracy of the prediction result exceeds 95%; this embodiment does not limit this.
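The evaluation gate described above can be sketched as follows. This is only an assumed reading in which accuracy is measured per tag over a held-out labeled set; `tag_accuracy` and `accept_model` are hypothetical names, and `predict` stands in for whatever trained CRF model is being verified.

```python
from typing import Callable, Sequence


def tag_accuracy(
    predicted: Sequence[Sequence[str]], expected: Sequence[Sequence[str]]
) -> float:
    """Fraction of tags that match the user-provided labels."""
    total = sum(len(tags) for tags in expected)
    correct = sum(
        p == e
        for pred_tags, exp_tags in zip(predicted, expected)
        for p, e in zip(pred_tags, exp_tags)
    )
    return correct / total if total else 0.0


def accept_model(
    predict: Callable[[Sequence[str]], Sequence[str]],
    eval_inputs: Sequence[Sequence[str]],
    eval_tags: Sequence[Sequence[str]],
    threshold: float = 0.95,
) -> bool:
    """Accept the model to be verified only if its accuracy exceeds the threshold."""
    predicted = [predict(tokens) for tokens in eval_inputs]
    return tag_accuracy(predicted, eval_tags) > threshold
```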

It should be understood that the CRF model, i.e. the conditional random field model, is an undirected graphical model proposed on the basis of the maximum entropy model and the hidden Markov model; it is a conditional probability model used to label and segment sequential data. The conditional probability that the model finally obtains is P(y1, ..., yn | x); that is, an identification sequence y1, ..., yn is obtained from the text such that this identification sequence has the greatest probability given the observation sequence x (that is, the field marked by the user). In other words, the identification sequence obtained by the conditional random field model in this embodiment makes the corresponding observation sequence the same as, or most similar to, the observation sequence pre-marked by the user in the sample document (that is, the conditional probability is maximal), so that the target field is extracted accurately.
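For reference, the standard linear-chain form of this conditional probability is given below. The patent does not state which CRF parameterization it uses, so the feature functions f_k and weights λ_k are the usual textbook symbols, introduced here as an assumption.

```latex
P(y_1, \dots, y_n \mid x)
  = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'_1, \dots, y'_n} \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

\hat{y} = \arg\max_{y_1, \dots, y_n} P(y_1, \dots, y_n \mid x)
```

Here Z(x) normalizes over all candidate identification sequences, and the predicted sequence is the one that maximizes the conditional probability.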

In practical applications, CRF model training can be as follows:

(1) Mark the fields or vocabulary that need to be extracted in the sample document in the following way. For example, if the field to be extracted is "Lessee: Zhang San (China) Investment Co., Ltd.", the user needs to mark every occurrence of the "Lessee: Zhang San (China) Investment Co., Ltd." field (that is, the observation sequence below), such as:

Observation sequence: Lessee: Zhang San (China) Investment Co., Ltd.

Identification sequence: O O O O B I I I I I I I I I I E

(2) Input the annotated sample documents into the initial CRF model for training, so that the initial CRF model learns the conditional probabilities (functions) from multiple sample documents containing the above annotations, and the trained CRF model can predict the correct identification sequence from an observation sequence.

Among them, the observation sequence is the field or vocabulary marked by the user; the identification sequence is a text sequence that the text extraction tool automatically generates from the observation sequence using the OBIE (ontology-based information extraction) method; and the above observation text sequence is the text sequence obtained after the observation sequence is vectorized.
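Generating an identification sequence from a marked span can be sketched as below. The sketch reads the tags in the example as O (outside the field), B (first unit of the field), I (inside the field) and E (last unit of the field), which is an assumption about the labelling convention; `obie_tags` is a hypothetical helper, and the example is tokenized by hand at word level.

```python
def obie_tags(tokens: list[str], start: int, end: int) -> list[str]:
    """Tag tokens[start:end] as the marked field and everything else as O."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty range inside the token list")
    tags = ["O"] * len(tokens)
    tags[start] = "B"
    for index in range(start + 1, end - 1):
        tags[index] = "I"
    if end - start > 1:
        # A single-unit field keeps only the B tag (an assumed convention).
        tags[end - 1] = "E"
    return tags


tokens = ["Lessee", ":", "Zhang", "San", "(", "China", ")",
          "Investment", "Co.", ",", "Ltd."]
# The marked field is the company name, i.e. tokens[2:11].
identification_sequence = obie_tags(tokens, 2, 11)
```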

In a specific implementation, the text extraction tool may input the spliced target sentence vector into the first conditional random field model, and then obtain the first prediction result output by the first conditional random field model. It is understandable that, under normal circumstances, the document to be extracted may contain multiple fields that are the same as or similar to the observation sequence. Therefore, the first prediction result output by the first conditional random field model usually includes multiple conditional probabilities, such as the conditional probability P1 of field 1: 98%, the conditional probability P2 of field 2: 95%, the conditional probability P3 of field 3: 90%, etc.

Step S60: Use an exact matching search algorithm to extract a target field from the text to be extracted according to the first prediction result.

It is understandable that the exact matching search algorithm, also called exact matching search, refers to a search method in which the search term is exactly the same as a certain field in the resource database. Exact matching refers to searching the input search term as a fixed phrase. In this embodiment, the text extraction tool can search the field corresponding to the conditional probability in the prediction result as a "fixed phrase" to extract the retrieved target field.

Specifically, the text extraction tool can sort the conditional probabilities in the first prediction result from high to low, select the one or more top-ranked conditional probabilities, take the fields corresponding to these conditional probabilities as the target fields, and then perform text extraction through exact matching search. Of course, the text extraction tool can also filter the conditional probabilities contained in the prediction result according to a preset conditional probability threshold: for example, all conditional probabilities whose value is higher than the threshold are used as target conditional probabilities, the target fields are determined according to the target conditional probabilities, and text extraction is then performed based on the target fields through exact matching search. This embodiment does not specifically limit the method used to determine the target field according to the first prediction result.
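Both selection strategies, followed by the exact matching search, can be illustrated with a minimal sketch. It assumes the prediction result is available as a mapping from candidate field text to conditional probability; `select_target_fields` and `exact_match_positions` are hypothetical names.

```python
from typing import Optional


def select_target_fields(
    predictions: dict[str, float],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> list[str]:
    """Rank candidate fields by probability, then keep the top k and/or those above a threshold."""
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(field, p) for field, p in ranked if p > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [field for field, _ in ranked]


def exact_match_positions(text: str, phrase: str) -> list[int]:
    """Return the start offset of every verbatim occurrence of the phrase."""
    if not phrase:
        return []
    positions = []
    start = text.find(phrase)
    while start != -1:
        positions.append(start)
        start = text.find(phrase, start + 1)
    return positions
```

For example, with the probabilities 98%, 95% and 90% mentioned above, `top_k=2` and `threshold=0.92` both keep field 1 and field 2.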

In this embodiment, the text to be extracted is read and the extraction type identification contained in it is extracted; when the extraction type identification is detected to be field extraction, a multi-threaded processing script is called to divide the text to be extracted into sentence sets; the multi-threaded processing script converts the sentences in the sentence set into sentence vectors; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input to the first conditional random field model, and the first prediction result output by the first conditional random field model is obtained; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted. In this embodiment, the extraction length is determined according to the extraction type identification, and the corresponding conditional random field model is selected for each extraction length, which makes text extraction more targeted. At the same time, this embodiment uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.

Based on the foregoing first embodiment, in this embodiment, after the step S10, the method further includes:

Step S201: When it is detected that the extraction type is identified as vocabulary extraction, call a multi-threaded processing script to divide the text to be extracted into several sentences;

It should be understood that vocabulary extraction is also called point extraction, that is, the extraction of characters or words. Similarly, before performing vocabulary extraction, the user needs to mark the vocabulary to be extracted in the sample document, such as vocabulary of different dimensions like contract signatory, contract time and contract address, and configure different label categories for vocabulary of different dimensions, such as person, time, address, etc.

In a specific implementation, when the text extraction tool determines, according to the extraction type identification or mark carried in the text to be extracted, that the text requires vocabulary extraction, it can call a multi-threaded processing script to divide the text to be extracted into several sentences.

Step S301: Obtain the similarity between each sentence and the sample sentence;

It should be noted that before using the text extraction tool to extract vocabulary from the text to be extracted, the user also needs to use the text extraction tool to train the CRF model based on pre-labeled sample documents (documents containing labeled characters or vocabulary). Therefore, in this embodiment, the sentence carrying the marked characters or vocabulary in the sample document is used as the sample sentence.

It should be understood that, under normal circumstances, the more similar two sentences are, the more similar the vocabulary they contain. Therefore, the text extraction method of this embodiment first searches for sentences that are similar to the sample sentence, and then extracts the target vocabulary from the similar sentences it finds.

Specifically, when calculating the similarity between sentences in this embodiment, the word frequency statistics technique can be used to count the word frequency of each vocabulary in each sentence; then the keyword (set) corresponding to each sentence is determined according to the statistical result; Then the similarity between sentence keywords (sets) is regarded as the similarity between sentences, which can improve the accuracy of calculation of similarity between sentences.

Current similarity calculation algorithms include the cosine similarity algorithm, the Euclidean distance algorithm, the Pearson correlation coefficient and so on. In order to improve the efficiency of similarity calculation and reduce the amount of calculation, the similarity calculation algorithm described in this embodiment is preferably the cosine similarity algorithm, which measures similarity by the cosine of the angle between two vectors.

Furthermore, although the existing word frequency statistics technique is simple and convenient, its defects are also obvious. For example, when word frequencies are counted over a document, frequently occurring words such as "I" and "的" (a common Chinese particle) are usually assigned a higher weight, but these words carry little meaning by themselves, which affects the determination of sentence keywords to a certain extent. Therefore, in this embodiment, it is preferable to use the term frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) algorithm to overcome the above-mentioned shortcoming of the word frequency statistics technique.

Specifically, the text extraction tool performs word segmentation on the segmented sentences, and obtains the term frequency-inverse document frequency value (that is, the TF-IDF value) corresponding to each vocabulary item after word segmentation based on the TF-IDF algorithm; then, according to the TF-IDF values, it determines the sentence keywords of the sentence to which each vocabulary item belongs; finally, based on the sentence keywords, it obtains the similarity between the sentence to which each vocabulary item belongs and the sample sentence.
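A minimal sketch of keyword selection by TF-IDF follows. It uses one common variant of the formula, tf = count / sentence length and idf = log(N / df), and treats each sentence as one "document"; both choices are assumptions, since the patent does not fix them. The names `tf_idf_scores` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(sentences: list[list[str]]) -> list[dict[str, float]]:
    """Compute a TF-IDF score for every word of every tokenized sentence."""
    n_sentences = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(word for tokens in sentences for word in set(tokens))
    scores = []
    for tokens in sentences:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_sentences / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(score: dict[str, float], k: int = 3) -> list[str]:
    """Pick the k highest-scoring words, breaking ties alphabetically."""
    return sorted(score, key=lambda word: (-score[word], word))[:k]
```

A word such as "the" that occurs in every sentence gets idf = log(1) = 0, so it is never chosen as a keyword, which is the effect the paragraph above describes.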

The step of obtaining, based on the sentence keywords, the similarity between the sentence to which each vocabulary item belongs and the sample sentence may specifically include: obtaining the word frequency vector corresponding to the sentence keywords, and then using the cosine similarity algorithm to calculate the cosine similarity between the word frequency vector of the sentence to which each vocabulary item belongs and the word frequency vector of the sample sentence. The greater the cosine similarity value, the more similar the two sentences; the smaller the value, the less similar they are.
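The cosine similarity of two word frequency vectors A and B is cos θ = (A · B) / (|A| |B|). A minimal sketch, with the hypothetical name `cosine_similarity` and an assumed convention of returning 0.0 when either vector is all zeros:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length word frequency vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty sentence matches nothing
    return dot / (norm_a * norm_b)


# The first vector is the word frequency vector from the first embodiment;
# the second is a made-up vector for a hypothetical sample sentence.
similarity = cosine_similarity([1, 2, 2, 1, 1, 1, 0], [1, 2, 2, 1, 1, 2, 1])
```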

Step S401: filter out several target sentences corresponding to the sample sentence from the segmented sentences based on the similarity;

It should be understood that for each sample sentence in the sample document, there may be multiple similar target sentences in the text to be extracted. Therefore, the text extraction tool of this embodiment needs to first filter out several target sentences corresponding to the sample sentences from the segmented sentences according to the calculated similarity, and then extract the final target vocabulary from these target sentences.
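The filtering in step S401 could be sketched as follows. The patent does not say how the target sentences are chosen from the similarities, so the cut-off `threshold`, the optional `max_results` cap and the function name `filter_target_sentences` are all assumptions made for illustration.

```python
from typing import Optional


def filter_target_sentences(
    sentences: list[str],
    similarities: list[float],
    threshold: float = 0.8,
    max_results: Optional[int] = None,
) -> list[str]:
    """Keep sentences whose similarity to the sample sentence reaches the threshold."""
    scored = [
        (sentence, similarity)
        for sentence, similarity in zip(sentences, similarities)
        if similarity >= threshold
    ]
    # Most similar first, so that a cap keeps the best candidates.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if max_results is not None:
        scored = scored[:max_results]
    return [sentence for sentence, _ in scored]
```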

Step S501: Construct a candidate sentence set according to the target sentence, and input the sentence in the candidate sentence set into a second conditional random field model after vectorization;

It should be noted that this embodiment uses a pre-trained CRF model dedicated to vocabulary extraction as the second conditional random field model.

In a specific implementation, the text extraction tool can construct a candidate sentence set from the target sentences, and then input the sentences in the candidate sentence set into the BERT model and obtain the sentence vectors output by the BERT model. After these sentence vectors are obtained, the text extraction tool can input them into the second conditional random field model to predict the conditional probabilities.

Step S601: Obtain a second prediction result output by the second conditional random field model, and use an exact matching retrieval algorithm to extract a target vocabulary from the text to be extracted according to the second prediction result.

In a specific implementation, after the text extraction tool obtains the second prediction result output by the second conditional random field model, it can determine the target vocabulary to be extracted according to the conditional probability values contained in the second prediction result, and then search the text to be extracted for the target vocabulary through the exact matching retrieval algorithm and extract all occurrences of the target vocabulary that are retrieved.

In this embodiment, when it is detected that the extraction type identifier is vocabulary extraction, the multi-threaded processing script is called to divide the text to be extracted into several sentences; the similarity between each sentence and the sample sentence is obtained; several target sentences corresponding to the sample sentences are selected from the segmented sentences based on the similarity; a candidate sentence set is constructed according to the target sentences, and the sentences in the candidate sentence set are vectorized and input into the second conditional random field model; the second prediction result output by the second conditional random field model is obtained, and the exact matching retrieval algorithm is used to extract the target vocabulary from the text to be extracted according to the second prediction result. Because the multi-threaded processing script is used to segment the text to be extracted, segmentation efficiency is improved. At the same time, because the target sentences used to construct the candidate sentence set are selected according to the similarity between each sentence and the sample sentence, the sentences input into the conditional random field model are closer to the sample sentences, which reduces the amount of model computation and improves the accuracy of vocabulary extraction.

Referring to FIG. 4, FIG. 4 is a schematic flowchart of a third embodiment of a text extraction method of this application.

Based on the foregoing second embodiment, before the foregoing step S10, the text extraction method of this embodiment further includes:

Step S01: Obtain a number of user-labeled documents, and the user-labeled documents contain label sentences of multiple preset label categories;

It should be understood that the user-labeled document in this embodiment is text in which the user has labeled characters or vocabulary in advance. The preset label categories may be pre-configured to distinguish characters or vocabulary of different dimensions. For example, the label for characters or vocabulary denoting the two parties to a contract is configured as "person"; the label for characters or vocabulary denoting a time of occurrence, a time, or a duration is configured as "time"; the label for characters or vocabulary denoting a place or occasion is configured as "address"; and so on.

In practical applications, each user-labeled document can be labeled with multiple different label categories by the user, and there can be multiple label sentences corresponding to each label category.

Step S02: Perform word segmentation processing on the label sentence through the multi-thread processing script, and construct a vocabulary dictionary according to the sentence vocabulary after the word segmentation;

In a specific implementation, the text extraction tool can perform word segmentation on each label sentence contained in the user-labeled document through the multi-threaded processing script, and then remove stop words, such as "的" and "在", from the resulting sentence vocabulary. The text extraction tool can then construct a vocabulary dictionary from the remaining sentence vocabulary. For example, if user-labeled document a contains n label sentences of label category b, the text extraction tool can segment the n label sentences, remove the stop words, and obtain a vocabulary dictionary containing v words.
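The following sketch illustrates step S02 under simplifying assumptions. The `segment` function is a placeholder that splits on whitespace (real Chinese text would need a proper word segmenter), the stop-word list holds only the two examples given above, and a standard-library thread pool stands in for the multi-threaded processing script. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Only the two stop words mentioned in the text; a real list would be longer.
STOP_WORDS = frozenset({"的", "在"})


def segment(sentence):
    """Placeholder segmenter: splits on whitespace (an assumption for illustration)."""
    return sentence.split()


def build_vocabulary(label_sentences, stop_words=STOP_WORDS, max_workers=4):
    """Segment each label sentence, drop stop words, and index the remaining words."""

    def clean(sentence):
        return [word for word in segment(sentence) if word not in stop_words]

    # pool.map preserves input order, so tokenized[i] belongs to label_sentences[i].
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tokenized = list(pool.map(clean, label_sentences))

    vocabulary = {}
    for words in tokenized:
        for word in words:
            # Assign the next free index the first time a word is seen.
            vocabulary.setdefault(word, len(vocabulary))
    return tokenized, vocabulary
```

With n label sentences as input, `len(vocabulary)` corresponds to the number of words v in the example above.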

Step S03: Calculate the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result;

In a specific implementation, the text extraction tool can calculate the word frequency-inverse text frequency index value (TF-IDF value) of each word in the vocabulary dictionary through the TF-IDF algorithm, and then build a TF-IDF matrix of order v*n from the calculated TF-IDF values.
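As an illustration, a v*n TF-IDF matrix could be built as sketched below. TF-IDF has several common variants; this sketch assumes term frequency is the word count divided by sentence length and inverse frequency is log(n / number of sentences containing the word). The application does not state which variant is used, and the function name `tfidf_matrix` is hypothetical.

```python
import math


def tfidf_matrix(tokenized, vocabulary):
    """Build a v x n matrix: one row per vocabulary word, one column per label sentence.

    tokenized:  list of n word lists (segmented, stop words removed).
    vocabulary: mapping of word -> row index, covering every word in tokenized.
    """
    n = len(tokenized)
    v = len(vocabulary)

    # Number of sentences in which each word occurs at least once.
    doc_freq = [0] * v
    for words in tokenized:
        for word in set(words):
            doc_freq[vocabulary[word]] += 1

    # Term frequency: occurrences of the word divided by the sentence length.
    matrix = [[0.0] * n for _ in range(v)]
    for j, words in enumerate(tokenized):
        for word in words:
            matrix[vocabulary[word]][j] += 1.0 / len(words)

    # Scale each row by its inverse frequency (0 for words that never occur).
    for i in range(v):
        idf = math.log(n / doc_freq[i]) if doc_freq[i] else 0.0
        for j in range(n):
            matrix[i][j] *= idf
    return matrix
```

Under this variant, a word that occurs in every label sentence receives a weight of zero, because it does not help distinguish one sentence from another.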

Step S04: Obtain the sentence vector corresponding to the labeled sentence according to the word frequency-inverse text frequency index value matrix;

It should be understood that for a document with a large vocabulary, the corresponding TF-IDF matrix may be complex. The more complex the matrix, the more computing resources it occupies during processing, which reduces computing efficiency and makes it harder to filter the more important data out of the matrix. Therefore, after acquiring the above TF-IDF matrix, the text extraction tool in this embodiment also performs dimensionality reduction on it.

Specifically, the text extraction tool may perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; then select a preset number of target singular values from the set of singular values and reconstruct the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; and finally obtain the sentence vector corresponding to each label sentence based on the target matrix.

It should be understood that the singular values returned by a singular value decomposition (SVD) function are generally arranged in descending order. The larger a singular value, the more of the original matrix's information it characterizes, that is, the higher its information content and the stronger its representativeness. Therefore, after obtaining the singular value set, the text extraction tool of this embodiment can select a preset number of target singular values (for example, the 60 or 120 largest singular values) from the set to reconstruct the matrix, thereby effectively reducing the dimension of the TF-IDF matrix without losing its main information. The preset number can be set according to actual conditions and is not limited in this embodiment.

In a specific implementation, after performing SVD dimensionality reduction on the word frequency-inverse text frequency index value matrix, the text extraction tool can obtain the sentence vector corresponding to each label sentence from the dimensionality-reduced matrix.
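The dimensionality reduction of step S04 can be written out as a truncated singular value decomposition. In the sketch below, A denotes the v*n TF-IDF matrix, r its rank, and k the preset number of target singular values. The last line, which takes the sentence vector of the j-th label sentence as the j-th column of the reduced representation, is an assumption; this application only states that the sentence vector is obtained based on the target matrix.

```latex
\begin{align*}
A &= U \Sigma V^{\top}, \qquad
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
  \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
A_k &= U_k \Sigma_k V_k^{\top} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad k \le r \\
\lVert A - A_k \rVert_F^2 &= \sum_{i=k+1}^{r} \sigma_i^2 \\
s_j &= \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}, \qquad j = 1, \dots, n
\end{align*}
```

Here A_k is the target matrix reconstructed from the k largest singular values, u_i and v_i are the i-th columns of U and V, and e_j is the j-th standard basis vector. The third line shows why keeping the largest singular values preserves the main information: the squared reconstruction error equals the sum of the squares of the discarded singular values.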

Step S05: Input the sentence vector to the conditional random field model to be trained for training, and obtain the second conditional random field model.

In a specific implementation, the text extraction tool can input the obtained sentence vectors into the conditional random field model to be trained, thereby obtaining a second conditional random field model that predicts word similarity based on the words labeled in the sample sentences.

In this embodiment, several user-labeled documents are obtained, the user-labeled documents containing label sentences of multiple preset label categories; word segmentation is performed on the label sentences through the multi-threaded processing script, and a vocabulary dictionary is constructed according to the sentence vocabulary after word segmentation; the word frequency-inverse text frequency index value of each word in the vocabulary dictionary is calculated, and a word frequency-inverse text frequency index value matrix is constructed according to the calculation result; the sentence vector corresponding to each label sentence is obtained according to the word frequency-inverse text frequency index value matrix; and the sentence vector is input into the conditional random field model to be trained for training, to obtain the second conditional random field model. Because the sentence vector corresponding to each label sentence is obtained from a matrix constructed from the word frequency-inverse text frequency index value of each word, and the conditional random field model is then trained on these sentence vectors, the trained model can achieve high accuracy.

In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile, and a text extraction program is stored on the computer-readable storage medium. When the text extraction program is executed by the processor, the steps of the text extraction method as described above are realized.

Referring to FIG. 5, FIG. 5 is a structural block diagram of the first embodiment of the text extraction device of this application.

As shown in FIG. 5, the text extraction device proposed in the embodiment of the present application includes:

The text acquisition module 501 is configured to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;

The sentence segmentation module 502 is configured to call a multi-thread processing script to segment the to-be-extracted text into sentence sets when it is detected that the extraction type identification is field extraction;

The vector conversion module 503 is configured to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script;

The vector splicing module 504 is used to splice the sentence vectors to obtain a target sentence vector;

The model prediction module 505 is configured to input the target sentence vector into a first conditional random field model, and obtain the first prediction result output by the first conditional random field model;

The text extraction module 506 is configured to extract a target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.

Based on the first embodiment of the above-mentioned text extraction device of this application, a second embodiment of the text extraction device of this application is proposed.

In this embodiment, the vector conversion module 503 is further configured to input the sentences in the sentence set into the pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-trained language model. Correspondingly, the vector splicing module 504 is further configured to obtain the text position information of each sentence in the text to be extracted, determine the sentence order corresponding to each sentence according to the text position information, and splice the sentence vectors in the sentence order to obtain the target sentence vector.
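The splicing performed by the vector splicing module 504 can be sketched as follows. The sketch assumes that "splicing" means concatenating the sentence vectors end to end, and that the text position of a sentence is the offset of its first occurrence in the text to be extracted; both are assumptions, and the function name `splice_sentence_vectors` is hypothetical.

```python
def splice_sentence_vectors(text, sentences, vectors):
    """Concatenate sentence vectors in the order the sentences appear in text.

    sentences and vectors are parallel lists; they may arrive in any order,
    for example because worker threads finished at different times.
    """
    positions = []
    for sentence in sentences:
        position = text.find(sentence)
        if position == -1:
            raise ValueError(f"sentence not found in text: {sentence!r}")
        positions.append(position)

    # Indices of the sentences, sorted by their position in the source text.
    order = sorted(range(len(sentences)), key=lambda i: positions[i])

    target = []
    for i in order:
        target.extend(vectors[i])
    return target
```

Because `str.find` returns the first occurrence, a text containing the same sentence twice would need the offsets to be recorded during segmentation instead of being recovered afterwards.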

Further, the text extraction device of this embodiment further includes a model training module, configured to: acquire a number of user-labeled documents and vectorize them to obtain labeled text vectors, the labeled text vectors containing an observation text sequence; input the labeled text vectors into an initial conditional random field model, so that the initial conditional random field model performs model training based on the observation text sequence to obtain a conditional random field model to be verified; and perform model evaluation on the conditional random field model to be verified and, when the evaluation result meets a preset condition, use the conditional random field model to be verified as the first conditional random field model.

Further, the text extraction device of this embodiment further includes a vocabulary extraction module, configured to: call the multi-threaded processing script to divide the text to be extracted into several sentences when it is detected that the extraction type identifier is vocabulary extraction; obtain the similarity between each sentence and the sample sentence; select, based on the similarity, several target sentences corresponding to the sample sentences from the segmented sentences; construct a candidate sentence set according to the target sentences, and input the sentences in the candidate sentence set into the second conditional random field model after vectorization; and obtain the second prediction result output by the second conditional random field model, and use the exact matching retrieval algorithm to extract the target vocabulary from the text to be extracted according to the second prediction result.

Further, the vocabulary extraction module is also configured to: perform word segmentation on the segmented sentences and obtain the word frequency-inverse text frequency index value corresponding to each word after word segmentation; determine, according to the word frequency-inverse text frequency index values, the sentence keywords of the sentence to which each word belongs; and obtain, based on the sentence keywords, the similarity between the sentence to which each word belongs and the sample sentence.

Further, the model training module is also configured to: obtain several user-labeled documents, the user-labeled documents containing label sentences of multiple preset label categories; perform word segmentation on the label sentences through the multi-threaded processing script, and construct a vocabulary dictionary according to the sentence vocabulary after word segmentation; calculate the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result; obtain the sentence vector corresponding to each label sentence according to the word frequency-inverse text frequency index value matrix; and input the sentence vector into the conditional random field model to be trained for training, to obtain the second conditional random field model.

Further, the model training module is also configured to: perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; select a preset number of target singular values from the set of singular values, and perform matrix reconstruction on the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; and obtain the sentence vector corresponding to each label sentence based on the target matrix.

For other embodiments or specific implementations of the text extraction device of the present application, reference may be made to the foregoing method embodiments, and details are not described herein again.

It should be noted that in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article, or system. Without further restriction, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.

The serial numbers of the foregoing embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

Through the description of the above implementation manners, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as read-only memory/random access memory, a magnetic disk, or an optical disk) and includes several instructions that cause a text extraction device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

1. A text extraction method, wherein the method includes:

Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;

When it is detected that the extraction type is identified as field extraction, calling a multi-threaded processing script to divide the text to be extracted into sentence sets;

Converting the sentences in the sentence set into sentence vectors through the multi-thread processing script;

Splicing the sentence vectors to obtain a target sentence vector;

Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

According to the first prediction result, an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
2. The method of claim 1, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:

Input the sentences in the sentence set to the pre-training language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-training language model;

The step of splicing the sentence vectors to obtain a target sentence vector includes:

Acquiring the text position information of each sentence in the to-be-extracted text, and determining the sentence order corresponding to each sentence according to the text position information;

The sentence vectors are spliced in the sentence order to obtain the target sentence vector.
3. The method according to claim 1, wherein before the step of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

Acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain annotated text vector, where the annotated text vector includes an observation text sequence;

Inputting the labeled text vector into an initial condition random field model, so that the initial condition random field model performs model training based on the observed text sequence to obtain a conditional random field model to be verified;

Model evaluation is performed on the conditional random field model to be verified, and when the evaluation result meets a preset condition, the conditional random field model to be verified is used as the first conditional random field model.
4. The method according to claim 1, wherein after the steps of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

When it is detected that the extraction type is identified as vocabulary extraction, calling a multi-threaded processing script to divide the text to be extracted into several sentences;

Obtain the similarity between each sentence and the sample sentence;

Filtering out several target sentences corresponding to the sample sentences from the segmented sentences based on the similarity;

Construct a candidate sentence set according to the target sentence, and input the sentence in the candidate sentence set into a second conditional random field model after vectorization;

A second prediction result output by the second conditional random field model is obtained, and an exact matching retrieval algorithm is used to extract a target vocabulary from the text to be extracted according to the second prediction result.
5. The method according to claim 4, wherein the step of obtaining the similarity between each sentence and the sample sentence comprises:

Perform word segmentation on the segmented sentence, and obtain the word frequency-inverse text frequency index value corresponding to each vocabulary after word segmentation;

Determine the sentence keyword corresponding to the sentence to which each vocabulary belongs according to the word frequency-inverse text frequency index value;

The similarity between the sentence to which each vocabulary belongs and the sample sentence is obtained based on the sentence keywords.
6. The method according to claim 4, wherein before the step of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

Acquiring a number of user-labeled documents, the user-labeled documents containing multiple label sentences of preset label categories;

Performing word segmentation processing on the label sentence through the multi-threaded processing script, and constructing a vocabulary dictionary according to the sentence vocabulary after the word segmentation;

Calculating the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and constructing a word frequency-inverse text frequency index value matrix according to the calculation result;

Obtaining the sentence vector corresponding to the label sentence according to the word frequency-inverse text frequency index value matrix;

The sentence vector is input to the conditional random field model to be trained for training, and the second conditional random field model is obtained.
7. The method of claim 6, wherein the step of obtaining the sentence vector corresponding to the labeled sentence according to the word frequency-inverse text frequency index value matrix comprises:

Performing singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values;

Selecting a preset number of target singular values from the set of singular values, and performing matrix reconstruction on the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix;

Obtain the sentence vector corresponding to the labeled sentence based on the target matrix.
8. A text extraction device, wherein the device includes:

The text acquisition module is used to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;

The sentence segmentation module is configured to call a multi-threaded processing script to segment the text to be extracted into sentence sets when it is detected that the extraction type is identified as field extraction;

The vector conversion module is used to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script;

The vector splicing module is used for splicing the sentence vector to obtain the target sentence vector;

A model prediction module, configured to input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

The text extraction module is used to extract the target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
9. A text extraction device, wherein the text extraction device includes a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor, and when the computer-readable instructions are executed by the processor, the following steps are implemented:

Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;

When it is detected that the extraction type is identified as field extraction, calling a multi-threaded processing script to divide the text to be extracted into sentence sets;

Converting the sentences in the sentence set into sentence vectors through the multi-thread processing script;

Splicing the sentence vectors to obtain a target sentence vector;

Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

According to the first prediction result, an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
10. The text extraction device according to claim 9, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:

Input the sentences in the sentence set to the pre-training language model through the multi-threaded processing script, so as to obtain the sentence vector corresponding to each sentence output by the pre-training language model;

The step of splicing the sentence vectors to obtain a target sentence vector includes:

Acquiring the text position information of each sentence in the to-be-extracted text, and determining the sentence order corresponding to each sentence according to the text position information;

The sentence vectors are spliced in the sentence order to obtain the target sentence vector.
11. The text extraction device according to claim 9, wherein before the step of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

Acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain annotated text vector, where the annotated text vector includes an observation text sequence;

Inputting the labeled text vector into an initial condition random field model, so that the initial condition random field model performs model training based on the observed text sequence to obtain a conditional random field model to be verified;

Model evaluation is performed on the conditional random field model to be verified, and when the evaluation result meets a preset condition, the conditional random field model to be verified is used as the first conditional random field model.
12. The text extraction device according to claim 9, wherein after the steps of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

When it is detected that the extraction type is identified as vocabulary extraction, calling a multi-threaded processing script to divide the text to be extracted into several sentences;

Obtain the similarity between each sentence and the sample sentence;

Filtering out several target sentences corresponding to the sample sentences from the segmented sentences based on the similarity;

Construct a candidate sentence set according to the target sentence, and input the sentence in the candidate sentence set into a second conditional random field model after vectorization;

A second prediction result output by the second conditional random field model is obtained, and an exact matching retrieval algorithm is used to extract a target vocabulary from the text to be extracted according to the second prediction result.
13. The text extraction device according to claim 12, wherein the step of obtaining the similarity between each sentence and the sample sentence comprises:

Perform word segmentation on the segmented sentence, and obtain the word frequency-inverse text frequency index value corresponding to each vocabulary after word segmentation;

Determine the sentence keyword corresponding to the sentence to which each vocabulary belongs according to the word frequency-inverse text frequency index value;

The similarity between the sentence to which each vocabulary belongs and the sample sentence is obtained based on the sentence keywords.
14. The text extraction device according to claim 12, wherein before the step of reading the text to be extracted and extracting the extraction type identification contained in the text to be extracted, the method further comprises:

Acquiring a number of user-labeled documents, the user-labeled documents containing multiple label sentences of preset label categories;

Performing word segmentation processing on the label sentence through the multi-threaded processing script, and constructing a vocabulary dictionary according to the sentence vocabulary after the word segmentation;

Calculating the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and constructing a word frequency-inverse text frequency index value matrix according to the calculation result;

Obtaining the sentence vector corresponding to the label sentence according to the word frequency-inverse text frequency index value matrix;

The sentence vector is input to the conditional random field model to be trained for training, and the second conditional random field model is obtained.
15. The text extraction device according to claim 14, wherein the step of obtaining the sentence vector corresponding to the labeled sentence according to the word frequency-inverse text frequency index value matrix comprises:

Performing singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values;

Selecting a preset number of target singular values from the set of singular values, and performing matrix reconstruction on the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix;

Obtain the sentence vector corresponding to the labeled sentence based on the target matrix.
16. A computer-readable storage medium, wherein computer-readable instructions are stored in the computer-readable storage medium, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the following steps:

Read the text to be extracted, and extract the extraction type identifier contained in the text to be extracted;

When it is detected that the extraction type is identified as field extraction, calling a multi-threaded processing script to divide the text to be extracted into sentence sets;

Converting the sentences in the sentence set into sentence vectors through the multi-thread processing script;

Splicing the sentence vectors to obtain a target sentence vector;

Input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;

According to the first prediction result, an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
17. The computer-readable storage medium according to claim 16, wherein the step of converting the sentences in the sentence set into sentence vectors through the multi-threaded processing script comprises:

Inputting the sentences in the sentence set into a pre-trained language model through the multi-threaded processing script, so as to obtain the sentence vector output by the pre-trained language model for each sentence;

The step of splicing the sentence vectors to obtain a target sentence vector comprises:

Acquiring the text position information of each sentence in the text to be extracted, and determining the sentence order corresponding to each sentence according to the text position information;

Splicing the sentence vectors in the sentence order to obtain the target sentence vector.
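The splicing step can be illustrated with a small sketch. It assumes that the text position information is simply the character offset of each sentence in the text to be extracted and that splicing means concatenation; neither detail is fixed by the claim, and the function name is hypothetical.

```python
from typing import List, Sequence


def splice_in_order(
    text: str,
    sentences: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate sentence vectors in the order the sentences occur in text."""
    if len(sentences) != len(vectors):
        raise ValueError("each sentence needs exactly one vector")
    positions = []
    for sentence in sentences:
        pos = text.find(sentence)  # character offset used as position info
        if pos == -1:
            raise ValueError(f"sentence not found in text: {sentence!r}")
        positions.append(pos)
    # Sort sentence indices by offset, then concatenate their vectors.
    order = sorted(range(len(sentences)), key=lambda i: positions[i])
    target: List[float] = []
    for i in order:
        target.extend(vectors[i])
    return target
```

Sorting by offset matters because worker threads may return sentence vectors in any order. Note that `str.find` returns the first occurrence, so a text with repeated identical sentences would need offsets recorded at segmentation time instead.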
17. The computer-readable storage medium according to claim 16, wherein before the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:

Acquiring a number of user-annotated documents, and vectorizing the user-annotated documents to obtain an annotated text vector, wherein the annotated text vector includes an observed text sequence;

Inputting the annotated text vector into an initial conditional random field model, so that the initial conditional random field model performs model training based on the observed text sequence to obtain a conditional random field model to be verified;

Performing model evaluation on the conditional random field model to be verified, and when the evaluation result meets a preset condition, using the conditional random field model to be verified as the first conditional random field model.
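The evaluation gate in the last step might look like the sketch below. The claim only requires that the evaluation result meet a preset condition; the per-label F1 metric, the `FIELD` label name, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """Token-level F1 score for one label over aligned label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accept_model(
    gold: Sequence[str],
    predicted: Sequence[str],
    label: str = "FIELD",    # hypothetical label name
    threshold: float = 0.9,  # hypothetical preset condition
) -> bool:
    """Return True when the model to be verified passes the preset condition."""
    return f1_for_label(gold, predicted, label) >= threshold
```

A model that fails the gate would be trained further or retrained rather than being used as the first conditional random field model.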
18. The computer-readable storage medium according to claim 16, wherein after the step of reading the text to be extracted and extracting the extraction type identifier contained in the text to be extracted, the method further comprises:

When it is detected that the extraction type identifier indicates vocabulary extraction, calling a multi-threaded processing script to divide the text to be extracted into several sentences;

Obtaining the similarity between each sentence and a sample sentence;

Filtering out, from the segmented sentences, several target sentences corresponding to the sample sentence based on the similarity;

Constructing a candidate sentence set from the target sentences, vectorizing the sentences in the candidate sentence set, and inputting the vectorized sentences into a second conditional random field model;

Obtaining a second prediction result output by the second conditional random field model, and extracting a target vocabulary from the text to be extracted according to the second prediction result by using an exact matching retrieval algorithm.
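One plausible reading of the final step is sketched below: the prediction result is taken to be a B/I/O label per token, labeled tokens are assembled into candidate strings, and each candidate is kept only if it occurs verbatim in the text to be extracted. The label scheme, the space-joined tokens, and the function names are assumptions, not details given in the claim.

```python
from typing import List, Sequence, Tuple


def candidates_from_labels(tokens: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Join tokens labeled B (begin) and I (inside) into candidate strings."""
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def exact_match(text: str, candidates: Sequence[str]) -> List[Tuple[str, int]]:
    """Keep only candidates that appear verbatim in text, with their offsets."""
    hits = []
    for candidate in candidates:
        start = text.find(candidate)
        if start != -1:
            hits.append((candidate, start))
    return hits
```

Requiring a verbatim hit discards any candidate that the model assembled from tokens that are not actually adjacent in the source text.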
19. The computer-readable storage medium of claim 18, wherein the step of obtaining the similarity between each sentence and the sample sentence comprises:

Performing word segmentation on each segmented sentence, and obtaining the term frequency-inverse document frequency (TF-IDF) value corresponding to each word after word segmentation;

Determining, according to the TF-IDF values, the sentence keywords of the sentence to which each word belongs;

Obtaining the similarity between the sentence to which each word belongs and the sample sentence based on the sentence keywords.
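The three steps above can be sketched as follows. The sketch makes several assumptions that the claim leaves open: whitespace splitting stands in for a real word segmenter, the top-scoring words by TF-IDF are taken as the sentence keywords, and keyword overlap (Jaccard similarity) is used as the similarity measure.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def tfidf_keywords(sentences: Sequence[str], top_k: int = 2) -> List[Set[str]]:
    """Return the top_k words by TF-IDF for each sentence."""
    docs = [s.lower().split() for s in sentences]  # stand-in for word segmentation
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    keywords: List[Set[str]] = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(set(ranked[:top_k]))
    return keywords


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword-overlap similarity in [0, 1]; two empty sets score 0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Sentences whose keyword similarity to the sample sentence exceeds a chosen cutoff would then be kept as target sentences for the candidate sentence set.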