WO2021051871A1 - Text extraction method, apparatus and device, and storage medium - Google Patents

Text extraction method, apparatus and device, and storage medium

Info

Publication number
WO2021051871A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
text
extracted
extraction
target
Prior art date
Application number
PCT/CN2020/093466
Other languages
English (en)
Chinese (zh)
Inventor
郝正鸿
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051871A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • This application relates to the field of text processing technology, and in particular to a text extraction method, device, equipment, and storage medium.
  • Information extraction is the process of automatically extracting unstructured data from documents (such as resumes, insurance clauses, encyclopedias, contracts, and documents from other business scenarios) and converting it into structured data, for example, extracting and converting unstructured data such as the names of the contracting parties, the contract time, and the contract address in a lease contract.
  • Information extraction can be divided by extraction content into entity extraction, relationship extraction, and event extraction. By extraction length, it mainly includes vocabulary extraction and field/paragraph extraction. It can also be divided into open-domain and closed-domain information extraction. With the development of deep neural networks and the growth of computing power, existing information extraction methods mainly train end-to-end deep learning models with large numbers of parameters on large-scale labeled data, and then use the trained models to extract text information in different business scenarios. The inventor found that this approach does not classify extraction by extraction length, so the final extraction results are poorly targeted, accuracy is low, and extraction efficiency is reduced.
  • The main purpose of this application is to provide a text extraction method, device, equipment, and storage medium, aiming to solve the technical problems that existing information extraction techniques produce poorly targeted results, have low accuracy, and have low extraction efficiency.
  • this application provides a text extraction method, which includes the following steps:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
  • this application also proposes a text extraction device, which includes:
  • the text acquisition module is used to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;
  • the sentence segmentation module is configured to call a multi-threaded processing script to segment the text to be extracted into sentence sets when it is detected that the extraction type is identified as field extraction;
  • the vector conversion module is used to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script
  • the vector splicing module is used for splicing the sentence vector to obtain the target sentence vector
  • a model prediction module configured to input the target sentence vector into a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
  • the text extraction module is used to extract the target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
  • this application also proposes a text extraction device; the text extraction device includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • When the computer-readable instructions are executed by the processor, the following steps are implemented:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted
  • this application also proposes a computer-readable storage medium having computer-readable instructions stored in it, and the computer-readable instructions can be executed by at least one processor to make the at least one processor execute the following steps:
  • an exact matching retrieval algorithm is used to extract a target field from the text to be extracted.
  • This application reads the text to be extracted and extracts the extraction type identification contained in it; when the extraction type identification is detected as field extraction, a multi-threaded processing script is called to divide the text to be extracted into a sentence set; the sentences in the sentence set are converted into sentence vectors through the multi-threaded processing script; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input into the first conditional random field model to obtain the first prediction result output by that model; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • This application determines the extraction length according to the extraction type identification, and selects the corresponding conditional random field model for different extraction lengths to extract text to make the text extraction more targeted.
  • this application uses a multi-threaded processing script for text segmentation to improve the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.
  • FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.
  • FIG. 5 is a structural block diagram of the first embodiment of the text extraction device of this application.
  • FIG. 1 is a schematic structural diagram of a text extraction device of a hardware operating environment involved in a solution of an embodiment of the application.
  • the text extraction device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface).
  • the memory 1005 may be a high-speed random access memory (Random Access Memory, RAM), or a stable non-volatile memory (Non-Volatile Memory, NVM), such as disk storage.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • The structure shown in FIG. 1 does not constitute a limitation on the text extraction device, which may include more or fewer components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the memory 1005 as a computer-readable storage medium may include an operating system, a data storage module, a network communication module, a user interface module, and a text extraction program.
  • the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with users; the processor 1001 and the memory 1005 in the text extraction device of this application can be set in a text extraction device, which calls the text extraction program stored in the memory 1005 through the processor 1001 and executes the text extraction method provided in the embodiment of the present application.
  • FIG. 2 is a schematic flowchart of the first embodiment of the text extraction method of this application.
  • the text extraction method includes the following steps:
  • Step S10 Read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;
  • the execution subject of the method in this example can be a computing service device with data processing, network communication, and program operation functions, such as a smartphone, tablet, or personal computer, or it can be a text extraction tool pre-loaded on such a computing service device. In the specific implementation scenario of this embodiment, the user first needs to upload sample documents to the text extraction tool, and the sample documents are marked with the paragraphs/fields or vocabulary that need to be extracted. Based on these sample documents, the text extraction tool trains an untrained initial conditional random field (Conditional Random Field, CRF) model to obtain a CRF model dedicated to field extraction or a CRF model dedicated to vocabulary extraction, and then performs paragraph/field extraction or vocabulary extraction based on these trained CRF models.
  • the extraction type identification includes field extraction and vocabulary extraction.
  • the user only needs to annotate a small number (a few or a dozen) of sample documents to extract the same vocabulary or paragraphs from similar documents with high accuracy.
  • the extraction type identification in this step needs to be selected by the user when uploading the text to be extracted, so that the text to be extracted carries an identification or mark for determining the specific extraction type of the text.
  • the text extraction tool reads the text to be extracted uploaded by the user, and extracts the extraction type identification contained in the text to be extracted.
  • Step S20 when it is detected that the extraction type is identified as field extraction, call a multi-threaded processing script to divide the text to be extracted into sentence sets;
  • the text extraction tool in this embodiment may first segment the text to be extracted according to sentence dimensions, obtain a number of sentences corresponding to the text to be extracted, and then combine these segmented sentences into a sentence set.
  • the multi-thread processing script may be a pre-written computer readable instruction or code file that enables multiple threads to concurrently execute a text segmentation operation.
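  • As a minimal sketch of such a multi-threaded segmentation script: the patent does not specify the splitting rule or the threading API, so the punctuation set, the per-paragraph task split, and the names `split_paragraph` and `segment_text` below are assumptions made for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed sentence rule: a run of text ending in ., !, ?, ; or a full-width equivalent.
_SENTENCE = re.compile(r"[^.!?;。!?;]+[.!?;。!?;]?")


def split_paragraph(paragraph):
    """Split one paragraph into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def segment_text(text, max_workers=4):
    """Segment text into a sentence set, one paragraph per worker task."""
    paragraphs = [p for p in text.splitlines() if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so text order is preserved.
        per_paragraph = list(pool.map(split_paragraph, paragraphs))
    return [s for sentences in per_paragraph for s in sentences]


text = "The lease starts in May. The tenant pays monthly!\nThe address is in Shenzhen."
sentences = segment_text(text)
```

  • Note that in CPython, threads mainly help when the per-task work releases the global interpreter lock (for example I/O); for purely CPU-bound splitting a process pool may be the better fit.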
  • Step S30 Transforming the sentences in the sentence set into sentence vectors through the multi-thread processing script
  • the sentence is converted into a sentence vector as follows: the multi-threaded processing script first performs word segmentation on the sentence and obtains the vocabulary dimensions after segmentation (for example, for the sentence "I like watching TV, and do not like watching movies", the vocabulary dimensions are: I, like, watch, TV, movie, no, also); it then counts the word frequency of each vocabulary item after segmentation ("I 1, like 2, watch 2, TV 1, movie 1, no 1, also 0"); finally, the sentence vector "[1, 2, 2, 1, 1, 1, 0]" is obtained from the word frequency of each vocabulary item.
  • the specific sentence vectorization method can also be other methods, and this embodiment does not specifically limit this.
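  • The word-frequency vectorization in the example above can be sketched as follows. The token list is an assumed word segmentation of the example sentence, and `sentence_vector` is a hypothetical helper name, not something defined in the application.

```python
from collections import Counter


def sentence_vector(tokens, vocabulary):
    """Count how often each vocabulary dimension occurs in the segmented sentence."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, such as "also" here.
    return [counts[word] for word in vocabulary]


# Assumed segmentation of "I like watching TV, and do not like watching movies".
tokens = ["I", "like", "watch", "TV", "no", "like", "watch", "movie"]
vocabulary = ["I", "like", "watch", "TV", "movie", "no", "also"]

vector = sentence_vector(tokens, vocabulary)
print(vector)  # [1, 2, 2, 1, 1, 1, 0]
```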
  • Step S40 splicing the sentence vectors to obtain a target sentence vector
  • the text extraction tool in this embodiment will also splice the sentence vectors corresponding to each sentence in the paragraph order of the text, to obtain the target sentence vector that is finally input into the CRF model.
  • because the BERT model (a method of pre-training language representations, that is, a general "language understanding" model trained on a large text corpus such as Wikipedia) compares favorably with other language models in natural language processing, this embodiment preferably uses the BERT model to vectorize the sentences.
  • the sentences in the sentence set may be input to a pre-trained language model (i.e., the above-mentioned BERT model) through the multi-threaded processing script to obtain the sentence vector corresponding to each sentence output by the pre-trained language model; the text position information of each sentence in the text to be extracted is then acquired, and the sentence order corresponding to each sentence is determined according to the text position information; the sentence vectors are then spliced in the sentence order to obtain the target sentence vector.
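  • The ordering and splicing step can be sketched as below. The sketch assumes each sentence vector arrives paired with the character offset of its sentence in the text; the vectors are small placeholders rather than real BERT outputs, and `splice_in_text_order` is a hypothetical name.

```python
def splice_in_text_order(items):
    """Concatenate sentence vectors in ascending order of text position.

    items: list of (position, vector) pairs, in any order, for example as
    returned by worker threads that finish at different times.
    """
    ordered = sorted(items, key=lambda item: item[0])
    target = []
    for _, vector in ordered:
        target.extend(vector)
    return target


# Placeholder vectors keyed by the offset of each sentence in the text.
items = [(42, [0.3, 0.4]), (0, [0.1, 0.2]), (17, [0.5, 0.6])]
target_sentence_vector = splice_in_text_order(items)
```

  • Sorting by position before splicing matters because concurrent workers may return results out of order, while the CRF model expects the sequence in text order.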
  • Step S50 Input the target sentence vector to a first conditional random field model, and obtain a first prediction result output by the first conditional random field model;
  • the text extraction tool can obtain a number of user-labeled documents and vectorize them to obtain a labeled text vector that contains an observation text sequence; input the labeled text vector into the initial conditional random field model, so that the initial conditional random field model is trained on the observation text sequence to obtain a conditional random field model to be verified; and perform model evaluation on the conditional random field model to be verified, using it as the first conditional random field model when the evaluation result meets the preset condition.
  • the preset condition may be that the evaluation result of the model (for example, the accuracy of the prediction result) meets the use standard, for example, the accuracy of the prediction result exceeds 95%, which is not limited in this embodiment.
  • the CRF model is an undirected graph learning model proposed on the basis of the maximum entropy model and the hidden Markov model; it is a conditional probability model used to label and segment ordered data.
  • the identification sequence obtained by the conditional random field model in this embodiment can make the corresponding observation sequence the same as, or the most similar to, the observation sequence pre-marked by the user in the sample document (that is, it has the maximum conditional probability), so as to achieve accurate extraction of the target field.
  • CRF model training can be as follows:
  • the observation sequence is the field or vocabulary marked by the user
  • the identification sequence is a text sequence automatically generated by the text extraction tool from the observation sequence using the OBIE (ontology-based information extraction) method
  • the above observation text sequence is a text sequence after the observation sequence is vectorized.
  • the text extraction tool may input the spliced target sentence vector into the first conditional random field model, and then obtain the first prediction result output by the first conditional random field model.
  • the first prediction result output by the first conditional random field usually includes multiple conditional probabilities, such as field 1.
  • Step S60 Use an exact matching search algorithm to extract a target field from the text to be extracted according to the first prediction result.
  • exact matching retrieval, also called exact matching search, refers to a search method in which the search term is exactly the same as a certain field in the resource database.
  • Exact matching refers to searching the input search term as a fixed phrase.
  • the text extraction tool can search the field corresponding to the conditional probability in the prediction result as a "fixed phrase" to extract the retrieved target field.
  • the text extraction tool can sort the conditional probabilities in the first prediction result from high to low, select the one or more highest-ranked conditional probabilities, take the fields corresponding to these conditional probabilities as the target fields, and then perform text extraction through exact matching search.
  • the text extraction tool can also filter the conditional probabilities contained in the prediction result according to a preset conditional probability threshold, for example, taking all conditional probabilities whose value is higher than the threshold as target conditional probabilities, then determining the target fields according to the target conditional probabilities, and then performing text extraction based on the target fields through exact matching search.
  • This embodiment does not specifically limit the method used to determine the target field according to the first prediction result.
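  • A minimal sketch of the two selection strategies and the exact matching step follows. It assumes the prediction result can be represented as a mapping from candidate field text to conditional probability; that representation, the sample values, and all function names are illustrative assumptions rather than the format used by the application.

```python
def select_by_rank(predictions, top_k=1):
    """Keep the fields with the top_k highest conditional probabilities."""
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    return [field for field, _ in ranked[:top_k]]


def select_by_threshold(predictions, threshold):
    """Keep every field whose conditional probability exceeds the threshold."""
    return [field for field, prob in predictions.items() if prob > threshold]


def exact_match(text, phrase):
    """Return the start offset of every exact occurrence of phrase in text."""
    offsets = []
    start = text.find(phrase)
    while start != -1:
        offsets.append(start)
        start = text.find(phrase, start + 1)
    return offsets


# Made-up prediction result and text, for illustration only.
predictions = {
    "Party A: Zhang San": 0.97,
    "signed in Shenzhen": 0.91,
    "see appendix": 0.12,
}
text = "Lease contract. Party A: Zhang San. The contract is signed in Shenzhen."

for field in select_by_threshold(predictions, 0.9):
    print(field, exact_match(text, field))
```

  • Treating each selected field as a fixed phrase means only verbatim occurrences are returned, which is what distinguishes exact matching from fuzzy retrieval.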
  • In this embodiment, the text to be extracted is read and the extraction type identification contained in it is extracted; when the extraction type identification is detected as field extraction, the multi-threaded processing script is called to divide the text to be extracted into a sentence set; the sentences in the sentence set are converted into sentence vectors through the multi-threaded processing script; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input to the first conditional random field model to obtain the first prediction result output by that model; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • the extraction length is determined according to the extraction type identification, and the corresponding conditional random field model is selected for different extraction lengths to extract the text to make the text extraction more targeted.
  • this embodiment uses a multi-threaded processing script for text segmentation to improve the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.
  • FIG. 3 is a schematic flowchart of a second embodiment of the text extraction method of this application.
  • the method further includes:
  • Step S201 When it is detected that the extraction type is identified as vocabulary extraction, call a multi-threaded processing script to divide the text to be extracted into several sentences;
  • the vocabulary extraction is also called point extraction, that is, the extraction of characters or words.
  • the user needs to mark the vocabulary to be extracted in the sample document, such as vocabulary of different dimensions like contract signatory, contract time, and contract address, and configure different label categories for vocabulary of different dimensions, such as person, time, and address.
  • when the text extraction tool determines, according to the extraction type identification or mark carried in the text to be extracted, that vocabulary extraction is required, it can call a multi-threaded processing script to divide the text to be extracted into several sentences.
  • Step S301 Obtain the similarity between each sentence and the sample sentence
  • the user before using the text extraction tool to extract vocabulary from the text to be extracted, the user also needs to use the text extraction tool to train the CRF model based on pre-labeled sample documents (documents containing labeled characters or vocabulary). Therefore, in this embodiment, the sentence carrying the marked characters or vocabulary in the sample document is used as the sample sentence.
  • the text extraction method of this embodiment first searches for sentences that are similar to the sample sentence, and then extracts the target vocabulary from the similar sentences found.
  • the word frequency statistics technique can be used to count the word frequency of each vocabulary in each sentence; then the keyword (set) corresponding to each sentence is determined according to the statistical result; Then the similarity between sentence keywords (sets) is regarded as the similarity between sentences, which can improve the accuracy of calculation of similarity between sentences.
  • the current similarity calculation algorithms include cosine similarity algorithm, Euclidean distance algorithm, Pearson correlation coefficient and so on.
  • the similarity calculation algorithm described in this embodiment is preferably a cosine similarity algorithm that calculates the similarity by calculating the angle between vectors.
  • the text extraction tool performs word segmentation processing on the segmented sentences, and obtains the word frequency-inverse text frequency index value (that is, the TF-IDF value) corresponding to each vocabulary item after the word segmentation based on the TF-IDF algorithm; it then determines, according to the word frequency-inverse text frequency index value, the sentence keywords corresponding to the sentence to which each vocabulary item belongs; finally, based on the sentence keywords, it obtains the similarity between the sentence to which each vocabulary item belongs and the sample sentence.
  • the step of obtaining the similarity between the sentence to which each vocabulary item belongs and the sample sentence based on the sentence keywords may specifically include: obtaining the word frequency vector corresponding to the sentence keywords, and then using a cosine similarity algorithm to calculate the cosine similarity between the word frequency vector of the sentence to which each vocabulary item belongs and the word frequency vector of the sample sentence. The greater the cosine similarity value, the more similar the two sentences; otherwise, the less similar.
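  • The cosine similarity between two keyword word-frequency vectors can be sketched as follows. The keyword lists are made-up examples, and building both vectors over the union of the two keyword sets is an assumption about how the shared dimensions are chosen.

```python
import math
from collections import Counter


def keyword_frequency_vectors(keywords_a, keywords_b):
    """Build two word frequency vectors over the union of both keyword lists."""
    vocabulary = sorted(set(keywords_a) | set(keywords_b))
    counts_a, counts_b = Counter(keywords_a), Counter(keywords_b)
    return ([counts_a[w] for w in vocabulary],
            [counts_b[w] for w in vocabulary])


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sample_keywords = ["contract", "time", "2020"]
candidate_keywords = ["contract", "time", "signed"]
vec_sample, vec_candidate = keyword_frequency_vectors(sample_keywords, candidate_keywords)
similarity = cosine_similarity(vec_sample, vec_candidate)
```

  • Here two of the three keywords are shared, so the similarity is 2/3; identical keyword lists give 1 and disjoint lists give 0.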
  • Step S401 filter out several target sentences corresponding to the sample sentence from the segmented sentences based on the similarity
  • the text extraction tool of this embodiment needs to first filter out several target sentences corresponding to the sample sentences from the segmented sentences according to the calculated similarity, and then extract the final target vocabulary from these target sentences.
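  • The application does not state the filtering rule, so the sketch below shows one plausible reading: keep sentences whose similarity reaches a threshold, optionally capped at the top_k most similar. The threshold value, the `top_k` parameter, and the function name are all assumptions.

```python
def filter_target_sentences(sentences, similarities, threshold=0.5, top_k=None):
    """Return sentences with similarity >= threshold, most similar first."""
    scored = [(sim, s) for s, sim in zip(sentences, similarities) if sim >= threshold]
    # Sort on the similarity only, so ties keep their original text order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if top_k is not None:
        scored = scored[:top_k]
    return [s for _, s in scored]


sentences = ["The contract was signed in May.", "See the appendix.", "Signed by both parties."]
similarities = [0.9, 0.2, 0.6]
candidate_sentence_set = filter_target_sentences(sentences, similarities)
```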
  • Step S501 Construct a candidate sentence set according to the target sentence, and input the sentence in the candidate sentence set into a second conditional random field model after vectorization;
  • this embodiment uses a pre-trained CRF model dedicated to vocabulary extraction as the second conditional random field model.
  • the text extraction tool can construct a candidate sentence set according to the target sentences, and then input the sentences in the candidate sentence set into the BERT model and obtain the sentence vectors output by the BERT model. After these sentence vectors are obtained, the text extraction tool can input them into the second conditional random field model to predict the conditional probability.
  • Step S601 Obtain a second prediction result output by the second conditional random field model, and use an exact matching retrieval algorithm to extract a target vocabulary from the text to be extracted according to the second prediction result.
  • after the text extraction tool obtains the second prediction result output by the second conditional random field model, it can determine the target vocabulary to be extracted according to the conditional probability values contained in the second prediction result, and then use the exact matching retrieval algorithm to extract all retrieved occurrences of the target vocabulary from the text to be extracted.
  • In this embodiment, when it is detected that the extraction type identification is vocabulary extraction, the multi-threaded processing script is called to divide the text to be extracted into several sentences; the similarity between each sentence and the sample sentence is obtained; several target sentences corresponding to the sample sentence are filtered out from the segmented sentences based on the similarity; a candidate sentence set is constructed according to the target sentences, and the sentences in the candidate sentence set are vectorized and input to the second conditional random field model; the second prediction result output by the second conditional random field model is obtained, and the exact matching retrieval algorithm is used to extract the target vocabulary from the text to be extracted according to the second prediction result.
  • using the multi-threaded processing script to segment the text to be extracted improves the efficiency of segmentation.
  • selecting target sentences by similarity to construct the candidate sentence set ensures that the sentences input to the conditional random field model are close to the sample sentences, which reduces the amount of model computation and improves the accuracy of vocabulary extraction.
  • FIG. 4 is a schematic flowchart of a third embodiment of a text extraction method of this application.
  • the text extraction method of this embodiment further includes:
  • Step S01 Obtain a number of user-labeled documents, and the user-labeled documents contain label sentences of multiple preset label categories;
  • the user-labeled document in this embodiment is text in which the user has marked characters or vocabulary in advance.
  • the preset label category may be pre-configured to distinguish between characters or vocabulary of different dimensions.
  • for example, the label corresponding to characters or vocabulary for the two parties to the contract is configured as "person",
  • the label corresponding to characters or vocabulary for the time of occurrence, point in time, and duration is configured as "time",
  • and the label corresponding to characters or vocabulary for the place and occasion is configured as "address", etc.
  • each user-labeled document can be labeled with multiple different label categories by the user, and there can be multiple label sentences corresponding to each label category.
  • Step S02 Perform word segmentation processing on the label sentence through the multi-thread processing script, and construct a vocabulary dictionary according to the sentence vocabulary after the word segmentation;
  • the text extraction tool can perform word segmentation processing on each label sentence contained in the user-labeled documents through a multi-threaded processing script, and then remove stop words from the segmented sentence vocabulary. After removing the stop words, the text extraction tool can construct a vocabulary dictionary based on the remaining sentence vocabulary. For example, if user-labeled document a contains n label sentences with label category b, the text extraction tool can segment the n label sentences, remove stop words, and obtain a vocabulary dictionary containing v words.
  • Step S03 Calculate the word frequency-inverse text frequency index value of each word in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result;
  • the text extraction tool can calculate the word frequency-inverse text frequency index value (TF-IDF value) of each word in the vocabulary dictionary through the TF-IDF algorithm, and then build a TF-IDF matrix of order v*n based on the calculated TF-IDF values.
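  • A v*n TF-IDF matrix of this kind can be sketched in plain Python as below. The sketch assumes the unsmoothed textbook weighting tf * log(n / df), with tf taken as the relative frequency within a sentence; the application does not state which TF-IDF variant it uses, and `tfidf_matrix` is a hypothetical name.

```python
import math


def tfidf_matrix(sentences):
    """Build a v*n TF-IDF matrix from n segmented label sentences.

    sentences: list of token lists (stop words already removed).
    Returns (vocabulary, matrix), where matrix[i][j] is the TF-IDF value of
    vocabulary[i] in sentence j.
    """
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    n = len(sentences)
    matrix = []
    for word in vocabulary:
        # Document frequency: number of sentences that contain the word.
        df = sum(1 for sentence in sentences if word in sentence)
        idf = math.log(n / df)
        row = [
            (sentence.count(word) / len(sentence)) * idf if sentence else 0.0
            for sentence in sentences
        ]
        matrix.append(row)
    return vocabulary, matrix


label_sentences = [
    ["contract", "time"],
    ["contract", "address"],
    ["contract", "person"],
]
vocabulary, matrix = tfidf_matrix(label_sentences)  # v = 4 rows, n = 3 columns
```

  • A word such as "contract" that appears in every sentence gets an IDF of log(1) = 0 under this weighting, so its whole row is zero; smoothed variants avoid that.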
  • Step S04 Obtain the sentence vector corresponding to the labeled sentence according to the word frequency-inverse text frequency index value matrix
  • the corresponding TF-IDF matrix may be quite complex, and the more complex the matrix, the more computing resources it occupies during processing, which lowers computing efficiency and makes it harder to filter out the more important data from the matrix. Therefore, after acquiring the above-mentioned TF-IDF matrix, the text extraction tool in this embodiment also performs dimensionality reduction on it.
  • the text extraction tool may perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; then select a preset number of target singular values from the set; then reconstruct the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; and finally obtain the sentence vector corresponding to each labeled sentence based on the target matrix.
  • the singular values obtained from singular value decomposition are generally arranged in descending order, and the larger a singular value, the better it characterizes the information of the original matrix, that is, the higher its information content and the stronger its representativeness. Therefore, after obtaining the singular value set, the text extraction tool of this embodiment can select a preset number of target singular values (for example, the 60 or 120 largest singular values) from the set to reconstruct the matrix, thereby effectively reducing the dimension of the TF-IDF matrix without losing the main matrix information.
  • the preset number can be set according to actual conditions, which is not limited in this embodiment.
  • the text extraction tool can obtain the sentence vector corresponding to each labeled sentence based on the dimensionality reduction matrix after performing SVD dimensionality reduction on the word frequency-inverse text frequency index value matrix.
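  • In standard notation, the truncated singular value decomposition described above can be written as follows. The symbols are introduced here for illustration: A is the v*n TF-IDF matrix, r its rank, k the preset number of target singular values, and e_j the j-th standard basis vector. Taking the j-th column of \Sigma_k V_k^T as the vector of sentence j is one common choice (as in latent semantic analysis) and is an assumption, since the application only says the sentence vector is obtained from the target matrix.

```latex
% Singular value decomposition with singular values in descending order
A = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Target matrix rebuilt from the k largest singular values
A_k = U_k \Sigma_k V_k^{\top} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad k \le r

% k-dimensional vector for the j-th labeled sentence (assumed convention)
s_j = \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}
```

  • Among all matrices of rank at most k, A_k is the closest to A in the Frobenius norm (the Eckart-Young theorem), which is why keeping the largest singular values preserves the main matrix information.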
  • Step S05 Input the sentence vector to the conditional random field model to be trained for training, and obtain the second conditional random field model.
  • the text extraction can input the obtained sentence vector into the conditional random field model to be trained for training, thereby obtaining a second conditional random field model for predicting the similarity of words based on the words marked in the sample sentences.
  • In this embodiment, several user-labeled documents are acquired, each containing multiple label sentences of preset label categories; word segmentation is performed on the label sentences by the multi-threaded processing script, and a vocabulary dictionary is constructed from the segmented sentence vocabulary; the word frequency-inverse text frequency index value of each vocabulary item in the dictionary is calculated, and a word frequency-inverse text frequency index value matrix is constructed from the calculation results; the sentence vector corresponding to each label sentence is obtained from that matrix; and the sentence vector is input to the conditional random field model to be trained to obtain the second conditional random field model. Because the sentence vectors are derived from a matrix built from the word frequency-inverse text frequency index value of each vocabulary item, and the conditional random field model is then trained on those sentence vectors, this helps ensure that the trained model has high accuracy.
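To make the dictionary and matrix construction concrete, here is a minimal sketch using only the Python standard library. It assumes the sentences are already segmented into words and uses the unsmoothed variant idf = log(N / df); the source does not state which TF-IDF variant is used, so that choice and the function name `build_tfidf` are assumptions.

```python
import math
from collections import Counter


def build_tfidf(tokenized):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF of word j in sentence i."""
    # Vocabulary dictionary: every distinct word, in a fixed (sorted) order.
    vocab = sorted({word for sent in tokenized for word in sent})
    n = len(tokenized)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for sent in tokenized for word in set(sent))
    matrix = []
    for sent in tokenized:
        counts = Counter(sent)
        total = len(sent) or 1  # guard against empty sentences
        matrix.append([(counts[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, matrix


sentences = [["party", "a", "leases", "house"],
             ["party", "b", "pays", "rent"]]
vocab, matrix = build_tfidf(sentences)
print(vocab)
print(matrix[0])
```

A word that occurs in every sentence ("party" here) gets weight 0 under this variant, which is why smoothed formulas are common in practice.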
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and a text extraction program is stored on the computer-readable storage medium.
  • When the text extraction program is executed by a processor, the steps of the text extraction method described above are implemented.
  • Fig. 5 is a structural block diagram of the first embodiment of the text extraction device of this application.
  • the text extraction device proposed in the embodiment of the present application includes:
  • the text acquisition module 501 is configured to read the text to be extracted, and extract the extraction type identification contained in the text to be extracted;
  • the sentence segmentation module 502 is configured to call a multi-thread processing script to segment the to-be-extracted text into sentence sets when it is detected that the extraction type identification is field extraction;
  • the vector conversion module 503 is configured to convert the sentences in the sentence set into sentence vectors through the multi-thread processing script;
  • the vector splicing module 504 is used to splice the sentence vectors to obtain a target sentence vector;
  • the model prediction module 505 is configured to input the target sentence vector into a first conditional random field model, and obtain the first prediction result output by the first conditional random field model;
  • the text extraction module 506 is configured to extract a target field from the text to be extracted by using an exact matching retrieval algorithm according to the first prediction result.
  • In this embodiment, the text to be extracted is read and the extraction type identification it contains is extracted; when the extraction type identification is detected as field extraction, the multi-threaded processing script is called to divide the text to be extracted into a sentence set; the multi-threaded processing script converts the sentences in the sentence set into sentence vectors; the sentence vectors are spliced to obtain the target sentence vector; the target sentence vector is input to the first conditional random field model, and the first prediction result output by the first conditional random field model is obtained; and, according to the first prediction result, an exact matching retrieval algorithm is used to extract the target field from the text to be extracted.
  • The extraction length is determined according to the extraction type identification, and the conditional random field model corresponding to each extraction length is selected to extract the text, which makes the text extraction more targeted.
  • In addition, this embodiment uses a multi-threaded processing script for text segmentation, which improves the overall efficiency of text extraction, and extracting the target field through the exact matching retrieval algorithm also ensures the accuracy of target field extraction.
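The segmentation and exact-matching steps can be sketched as below. This is only an illustration under stated assumptions: the "multi-threaded processing script" is stood in for by a `ThreadPoolExecutor` that splits paragraphs in parallel, sentence boundaries are found with a simple punctuation rule, and the CRF prediction is assumed to be a plain string that is then located verbatim in the text. The names `split_sentences`, `segment_text`, and `exact_match` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A run of non-terminator characters followed by an optional terminator.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(paragraph):
    """Split one paragraph into sentences with a simple punctuation rule."""
    return [s.strip() for s in SENTENCE.findall(paragraph) if s.strip()]


def segment_text(text, workers=4):
    """Split paragraphs in parallel; map() keeps them in document order."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        groups = list(pool.map(split_sentences, paragraphs))
    return [sentence for group in groups for sentence in group]


def exact_match(text, predicted):
    """Return the start offset of every verbatim occurrence of predicted."""
    if not predicted:
        return []
    return [m.start() for m in re.finditer(re.escape(predicted), text)]


text = "Party A: Acme Ltd. Party B: Bolt Inc.\nThe lease starts on 2020-05-29."
print(segment_text(text))
print(exact_match(text, "Acme Ltd"))  # [9]
```

Threads in CPython mainly help when the per-sentence work releases the interpreter lock (for example, calls into a model server); for purely CPU-bound splitting a process pool may be the better fit.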
  • The vector conversion module 503 is further configured to input the sentences in the sentence set into the pre-trained language model through the multi-threaded processing script to obtain the sentence vectors output by the pre-trained language model.
  • The vector splicing module 504 is also used to obtain the text position information of each sentence in the to-be-extracted text, determine the sentence order corresponding to each sentence according to the text position information, and splice the sentence vectors in that sentence order to obtain the target sentence vector.
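Splicing in sentence order can be illustrated with a few lines of Python. The sketch assumes each sentence vector arrives paired with the character offset of its sentence in the original text (for example, because worker threads finished out of order) and that "splicing" means end-to-end concatenation; both are assumptions, as is the name `splice_by_position`.

```python
from typing import List, Sequence, Tuple


def splice_by_position(items: Sequence[Tuple[int, List[float]]]) -> List[float]:
    """Concatenate sentence vectors in ascending order of text offset."""
    target: List[float] = []
    for _, vector in sorted(items, key=lambda item: item[0]):
        target.extend(vector)
    return target


# (offset of the sentence in the text, sentence vector), in arrival order.
items = [(42, [0.3, 0.4]), (0, [0.1, 0.2]), (17, [0.5, 0.6])]
print(splice_by_position(items))  # [0.1, 0.2, 0.5, 0.6, 0.3, 0.4]
```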
  • The text extraction device of this embodiment further includes a model training module, which is used to acquire several user-labeled documents and vectorize them to obtain a labeled text vector containing an observation text sequence; input the labeled text vector to the initial conditional random field model, so that the initial conditional random field model performs model training based on the observation text sequence to obtain the conditional random field model to be verified; and perform model evaluation on the conditional random field model to be verified and, when the evaluation result meets a preset condition, use the conditional random field model to be verified as the first conditional random field model.
  • The text extraction device of this embodiment further includes a vocabulary extraction module, which is used to call a multi-threaded processing script to divide the text to be extracted into several sentences when it is detected that the extraction type identification is vocabulary extraction; obtain the similarity between each segmented sentence and a sample sentence; select, based on the similarity, a number of target sentences corresponding to the sample sentence from the segmented sentences; construct a candidate sentence set from the target sentences, vectorize the sentences in the candidate sentence set, and input them to the second conditional random field model; and obtain the second prediction result output by the second conditional random field model, and use the exact matching retrieval algorithm to extract the target vocabulary from the text to be extracted according to the second prediction result.
  • a vocabulary extraction module which is used to call a multi-threaded processing script to divide the text to be extracted into several sentences when it is detected that the extraction type is identified as vocabulary extraction
  • the similarity between a sentence and a sample sentence based on the similarity, a number of target sentences corresponding to the sample sentence are
  • The vocabulary extraction module is also used to perform word segmentation on the segmented sentences and obtain the word frequency-inverse text frequency index value corresponding to each vocabulary item after word segmentation; determine, according to the word frequency-inverse text frequency index value, the sentence keywords of the sentence to which each vocabulary item belongs; and obtain, based on the sentence keywords, the similarity between the sentence to which each vocabulary item belongs and the sample sentence.
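One way to picture keyword-based sentence selection is shown below. The source does not name the similarity measure, so this sketch assumes Jaccard overlap between keyword sets, picks each sentence's keywords as its top-scoring words under an unsmoothed TF-IDF, and uses a hypothetical threshold; `sentence_keywords`, `jaccard`, and `select_targets` are illustrative names only.

```python
import math
from collections import Counter


def sentence_keywords(tokenized, top_k=3):
    """Pick the top_k words of each sentence by TF-IDF (ties broken alphabetically)."""
    n = len(tokenized)
    df = Counter(word for sent in tokenized for word in set(sent))
    keywords = []
    for sent in tokenized:
        counts = Counter(sent)
        scores = {w: (c / len(sent)) * math.log(n / df[w]) for w, c in counts.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(set(ranked[:top_k]))
    return keywords


def jaccard(a, b):
    """Assumed similarity measure: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_targets(tokenized, sample_keywords, top_k=3, threshold=0.2):
    """Return indices of sentences whose keywords overlap the sample's enough."""
    keys = sentence_keywords(tokenized, top_k)
    return [i for i, k in enumerate(keys) if jaccard(k, sample_keywords) >= threshold]


candidates = [["tenant", "pays", "rent", "monthly"],
              ["landlord", "repairs", "roof"],
              ["rent", "is", "due", "monthly"]]
print(select_targets(candidates, {"rent", "monthly"}))  # [0, 2]
```

The selected sentences would form the candidate sentence set that is vectorized and passed to the second conditional random field model.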
  • The model training module is also used to obtain several user-labeled documents, each containing multiple label sentences of preset label categories; perform word segmentation on the label sentences through the multi-threaded processing script, and construct a vocabulary dictionary according to the sentence vocabulary after word segmentation; calculate the word frequency-inverse text frequency index value of each vocabulary item in the vocabulary dictionary, and construct a word frequency-inverse text frequency index value matrix according to the calculation result; obtain the sentence vector corresponding to the label sentence according to the word frequency-inverse text frequency index value matrix; and input the sentence vector to the conditional random field model to be trained for training to obtain the second conditional random field model.
  • The model training module is also used to perform singular value decomposition on the word frequency-inverse text frequency index value matrix to obtain a set of singular values; select a preset number of target singular values from the set of singular values and perform matrix reconstruction on the word frequency-inverse text frequency index value matrix according to the target singular values to obtain a target matrix; and obtain a sentence vector corresponding to the labeled sentence based on the target matrix.
  • The technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as read-only memory/random access memory, a magnetic disk, or an optical disk) and includes several instructions that cause a text extraction device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a text extraction method, apparatus and device, and a storage medium. The method comprises: reading a text to be extracted and extracting an extraction type identifier contained in the text to be extracted (S10); upon detecting that the extraction type identifier is field extraction, invoking a multi-threaded processing script to segment the text to be extracted into sentence sets (S20); converting the sentences in the sentence sets into sentence vectors by means of the multi-threaded processing script (S30); splicing the sentence vectors to obtain a target sentence vector (S40); inputting the target sentence vector into a first conditional random field model to obtain a first prediction result output by the first conditional random field model (S50); and extracting a target field from the text to be extracted, according to the first prediction result, using an exact matching retrieval algorithm (S60). According to this method, an extraction length is determined according to the extraction type identifier, and corresponding conditional random field models are selected for text extraction according to different extraction lengths, so that text extraction is more targeted; in addition, a multi-threaded processing script is used for text segmentation, so that the overall efficiency of text extraction is improved, and extracting the target field by means of an exact matching retrieval algorithm also ensures the accuracy of target field extraction.
PCT/CN2020/093466 2019-09-18 2020-05-29 Procédé, appareil et dispositif d"extraction de texte, et support d'enregistrement WO2021051871A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910885399.9A CN110781276B (zh) 2019-09-18 2019-09-18 文本抽取方法、装置、设备及存储介质
CN201910885399.9 2019-09-18

Publications (1)

Publication Number Publication Date
WO2021051871A1 true WO2021051871A1 (fr) 2021-03-25

Family

ID=69383817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093466 WO2021051871A1 (fr) 2019-09-18 2020-05-29 Procédé, appareil et dispositif d"extraction de texte, et support d'enregistrement

Country Status (2)

Country Link
CN (1) CN110781276B (fr)
WO (1) WO2021051871A1 (fr)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926332A (zh) * 2021-03-30 2021-06-08 善诊(上海)信息技术有限公司 一种实体关系联合抽取方法及装置
CN113011189A (zh) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 开放式实体关系的抽取方法、装置、设备及存储介质
CN113268452A (zh) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 一种实体抽取的方法、装置、设备和存储介质
CN113407584A (zh) * 2021-06-29 2021-09-17 微民保险代理有限公司 标签抽取方法、装置、设备及存储介质
CN113408296A (zh) * 2021-06-24 2021-09-17 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN114860873A (zh) * 2022-04-22 2022-08-05 北京北大软件工程股份有限公司 一种生成文本摘要的方法、装置及存储介质
CN115145928A (zh) * 2022-08-01 2022-10-04 支付宝(杭州)信息技术有限公司 模型训练方法及装置、结构化摘要获取方法及装置
CN116248375A (zh) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN117176471A (zh) * 2023-10-25 2023-12-05 北京派网科技有限公司 一种文、数网络协议异常的双重高效检测方法、装置和存储介质
CN117573815A (zh) * 2024-01-17 2024-02-20 之江实验室 一种基于向量相似度匹配优化的检索增强生成方法
CN117973378A (zh) * 2024-01-05 2024-05-03 北京语言大学 一种基于bert模型的词语搭配提取方法及装置

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781276B (zh) * 2019-09-18 2023-09-19 平安科技(深圳)有限公司 文本抽取方法、装置、设备及存储介质
CN111506696A (zh) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 基于少量训练样本的信息抽取方法及装置
CN113343645A (zh) * 2020-03-03 2021-09-03 北京沃东天骏信息技术有限公司 信息提取模型的建立方法及装置、存储介质及电子设备
CN111401050A (zh) * 2020-03-28 2020-07-10 苏州机数芯微科技有限公司 一种基于模板生成的化学反应抽取器和抽取方法
CN111753060B (zh) * 2020-07-29 2023-09-26 腾讯科技(深圳)有限公司 信息检索方法、装置、设备及计算机可读存储介质
CN112069785A (zh) * 2020-08-06 2020-12-11 北京明略昭辉科技有限公司 用于提高标注效率的文本抽样方法及装置
CN112069307B (zh) * 2020-08-25 2023-07-04 中国人民大学 一种法律法条引用信息抽取系统
CN112463959A (zh) * 2020-10-29 2021-03-09 中国人寿保险股份有限公司 一种基于上行短信的业务处理方法及相关设备
CN112380348B (zh) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 元数据处理方法、装置、电子设备及计算机可读存储介质
CN113033216B (zh) * 2021-03-03 2024-05-28 东软集团股份有限公司 文本预处理方法、装置、存储介质及电子设备
CN113505201A (zh) * 2021-07-29 2021-10-15 宁波薄言信息技术有限公司 一种基于SegaBert预训练模型的合同抽取方法
CN114490998B (zh) * 2021-12-28 2022-11-08 北京百度网讯科技有限公司 文本信息的抽取方法、装置、电子设备和存储介质
CN115952279B (zh) * 2022-12-02 2023-09-12 杭州瑞成信息技术股份有限公司 文本大纲的提取方法、装置、电子装置和存储介质
CN115841113B (zh) * 2023-02-24 2023-05-12 山东云天安全技术有限公司 一种域名标号检测方法、存储介质及电子设备
CN116628168B (zh) * 2023-06-12 2023-11-14 深圳市逗娱科技有限公司 基于大数据的用户个性分析处理方法、系统及云平台
CN116450813B (zh) * 2023-06-19 2023-09-19 深圳得理科技有限公司 文本关键信息提取方法、装置、设备以及计算机存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
CN107808011A (zh) * 2017-11-20 2018-03-16 北京大学深圳研究院 信息的分类抽取方法、装置、计算机设备和存储介质
CN109189820A (zh) * 2018-07-30 2019-01-11 北京信息科技大学 一种煤矿安全事故本体概念抽取方法
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 文本资源的分类方法、装置、存储介质及电子装置
CN110781276A (zh) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 文本抽取方法、装置、设备及存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6178208B2 (ja) * 2013-10-28 2017-08-09 株式会社Nttドコモ 質問分野判定装置及び質問分野判定方法
CN105912570B (zh) * 2016-03-29 2019-11-15 北京工业大学 基于隐马尔可夫模型的英文简历关键字段抽取方法
US10157177B2 (en) * 2016-10-28 2018-12-18 Kira Inc. System and method for extracting entities in electronic documents
CN107729526B (zh) * 2017-10-30 2020-04-07 清华大学 一种文本结构化的方法
CN109766524B (zh) * 2018-12-28 2022-11-25 重庆邮电大学 一种并购重组类公告信息抽取方法及系统
CN109960756B (zh) * 2019-03-19 2021-04-09 国家计算机网络与信息安全管理中心 新闻事件信息归纳方法
CN110162786B (zh) * 2019-04-23 2024-02-27 百度在线网络技术(北京)有限公司 构建配置文件以及抽取结构化信息的方法、装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 文本资源的分类方法、装置、存储介质及电子装置
CN107808011A (zh) * 2017-11-20 2018-03-16 北京大学深圳研究院 信息的分类抽取方法、装置、计算机设备和存储介质
CN109189820A (zh) * 2018-07-30 2019-01-11 北京信息科技大学 一种煤矿安全事故本体概念抽取方法
CN110781276A (zh) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 文本抽取方法、装置、设备及存储介质

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011189A (zh) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 开放式实体关系的抽取方法、装置、设备及存储介质
CN112926332A (zh) * 2021-03-30 2021-06-08 善诊(上海)信息技术有限公司 一种实体关系联合抽取方法及装置
CN113268452B (zh) * 2021-05-25 2024-02-02 联仁健康医疗大数据科技股份有限公司 一种实体抽取的方法、装置、设备和存储介质
CN113268452A (zh) * 2021-05-25 2021-08-17 联仁健康医疗大数据科技股份有限公司 一种实体抽取的方法、装置、设备和存储介质
CN113408296B (zh) * 2021-06-24 2024-02-13 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN113408296A (zh) * 2021-06-24 2021-09-17 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN113407584A (zh) * 2021-06-29 2021-09-17 微民保险代理有限公司 标签抽取方法、装置、设备及存储介质
CN114860873A (zh) * 2022-04-22 2022-08-05 北京北大软件工程股份有限公司 一种生成文本摘要的方法、装置及存储介质
CN115145928A (zh) * 2022-08-01 2022-10-04 支付宝(杭州)信息技术有限公司 模型训练方法及装置、结构化摘要获取方法及装置
CN116248375A (zh) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN116248375B (zh) * 2023-02-01 2023-12-15 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN117176471A (zh) * 2023-10-25 2023-12-05 北京派网科技有限公司 一种文、数网络协议异常的双重高效检测方法、装置和存储介质
CN117176471B (zh) * 2023-10-25 2023-12-29 北京派网科技有限公司 一种文、数网络协议异常的双重高效检测方法、装置和存储介质
CN117973378A (zh) * 2024-01-05 2024-05-03 北京语言大学 一种基于bert模型的词语搭配提取方法及装置
CN117573815A (zh) * 2024-01-17 2024-02-20 之江实验室 一种基于向量相似度匹配优化的检索增强生成方法
CN117573815B (zh) * 2024-01-17 2024-04-30 之江实验室 一种基于向量相似度匹配优化的检索增强生成方法

Also Published As

Publication number Publication date
CN110781276A (zh) 2020-02-11
CN110781276B (zh) 2023-09-19

Similar Documents

Publication Publication Date Title
WO2021051871A1 (fr) Procédé, appareil et dispositif d"extraction de texte, et support d'enregistrement
CN108363790B (zh) 用于对评论进行评估的方法、装置、设备和存储介质
US11544459B2 (en) Method and apparatus for determining feature words and server
KR102288249B1 (ko) 정보 처리 방법, 단말기, 및 컴퓨터 저장 매체
WO2020082560A1 (fr) Procédé, appareil et dispositif d'extraction de mot-clé de texte, ainsi que support de stockage lisible par ordinateur
CN109284399B (zh) 相似度预测模型训练方法、设备及计算机可读存储介质
CN112069298A (zh) 基于语义网和意图识别的人机交互方法、设备及介质
CN108038208B (zh) 上下文信息识别模型的训练方法、装置和存储介质
CN113094578A (zh) 基于深度学习的内容推荐方法、装置、设备及存储介质
CN115544240B (zh) 文本类敏感信息识别方法、装置、电子设备和存储介质
CN110399547B (zh) 用于更新模型参数的方法、装置、设备和存储介质
CN117520523B (zh) 数据处理方法、装置、设备及存储介质
CN112183102A (zh) 基于注意力机制与图注意力网络的命名实体识别方法
CN113297351A (zh) 文本数据标注方法及装置、电子设备及存储介质
WO2021056750A1 (fr) Dispositif et procédé de recherche et support de stockage
CN114817478A (zh) 基于文本的问答方法、装置、计算机设备及存储介质
CN114722837A (zh) 一种多轮对话意图识别方法、装置及计算机可读存储介质
CN112100360B (zh) 一种基于向量检索的对话应答方法、装置和系统
CN109753646B (zh) 一种文章属性识别方法以及电子设备
CN113536784A (zh) 文本处理方法、装置、计算机设备和存储介质
CN114547257B (zh) 类案匹配方法、装置、计算机设备及存储介质
CN113095073B (zh) 语料标签生成方法、装置、计算机设备和存储介质
CN115906797A (zh) 文本实体对齐方法、装置、设备及介质
CN115169345A (zh) 文本情感分析模型的训练方法、装置、设备及存储介质
KR102215259B1 (ko) 주제별 단어 또는 문서의 관계성 분석 방법 및 이를 구현하는 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 20865090; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 20865090; Country of ref document: EP; Kind code of ref document: A1