WO2022095385A1 - Document knowledge extraction method, apparatus, computer device and readable storage medium - Google Patents


Info

Publication number
WO2022095385A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sample
processed
entity
target data
Prior art date
Application number
PCT/CN2021/091435
Other languages
English (en)
French (fr)
Inventor
梁烨
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022095385A1 publication Critical patent/WO2022095385A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of natural language processing of artificial intelligence, and in particular, to a document knowledge extraction method, apparatus, computer equipment and readable storage medium.
  • Intelligent question answering applied to customer service robots has great prospects.
  • Intelligent question answering mainly understands user questions described in natural language and returns a concise, accurately matched correct answer. In the insurance industry in particular, customer service robots can effectively handle customers' daily consultation, claims, and renewal services.
  • The corpus or question-and-answer knowledge base behind current customer service robots requires a large number of QA (Question & Answer) pairs to support user consultation.
  • The purpose of the present application is to provide a document knowledge extraction method, device, computer equipment and readable storage medium, which are used to solve the problem that manual extraction from existing structured documents involves a heavy workload and low efficiency.
  • the present application provides a document knowledge extraction method, including:
  • the correlation between the entity data and the second processing data is calculated, and target data is generated according to the calculation result.
  • the present application also provides a document knowledge extraction device, including:
  • an acquisition module configured to acquire a structured document to be processed, perform data extraction on the structured document to be processed, and obtain the paragraph where the target data is located as the first processed data
  • a matching module configured to obtain entity data matching the structured document to be processed from a preset entity library according to the type of the structured document to be processed;
  • an extraction module configured to perform data extraction on the first processing data according to the entity data, and obtain a statement containing the target data as the second processing data;
  • a generating module is configured to calculate the correlation between the entity data and the second processed data, and generate target data according to the calculation result.
  • the present application also provides a computer device, the computer device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
  • the correlation between the entity data and the second processing data is calculated, and target data is generated according to the calculation result.
  • the present application also provides a computer-readable storage medium, which includes multiple storage media, each storing a computer program; when the computer programs stored in the multiple storage media are executed by a processor, the following steps are implemented:
  • the correlation between the entity data and the second processing data is calculated, and target data is generated according to the calculation result.
  • The document knowledge extraction method, device, computer equipment and readable storage medium provided by the present application extract data from the structured document to be processed to obtain, as the first processed data, the paragraphs in which QA pairs may appear in the text to be processed; perform data extraction after matching against the preset entity library to obtain, as the second processed data, the statements matching the entity data; and calculate the similarity between the second processed data and the entity data to obtain the target data. This solves the problem that manual extraction from existing structured documents involves a heavy workload and low efficiency.
  • FIG. 1 is a flowchart of Embodiment 1 of the document knowledge extraction method described in this application;
  • FIG. 2 is a flowchart of performing data extraction on the to-be-processed structured document in Embodiment 1 of the document knowledge extraction method described in the present application, and obtaining the paragraph where the target data is located as the first processing data;
  • FIG. 3 is a flowchart, in Embodiment 1 of the document knowledge extraction method described in this application, of training the first attention model before the first attention model is used to assign weights to the encoded data corresponding to each paragraph in the document to be processed;
  • FIG. 4 is a flow chart of calculating the correlation between the entity data and the second processing data in Embodiment 1 of the document knowledge extraction method described in the present application, and generating target data according to the calculation result;
  • FIG. 5 is a flowchart, in Embodiment 1 of the document knowledge extraction method described in the present application, of using the second attention model to calculate the correlation between the entity data and each word in the word set, and obtaining the data corresponding to the entity data according to the correlation;
  • FIG. 6 is a schematic diagram of a program module of Embodiment 2 of the document knowledge extraction apparatus described in this application;
  • FIG. 7 is a schematic diagram of a hardware structure of a computer device in Embodiment 3 of the computer device of the present application.
  • the document knowledge extraction method, device, computer equipment and readable storage medium provided by this application are suitable for the natural language processing field of artificial intelligence, and provide a document knowledge extraction method based on an acquisition module, a matching module, an extraction module and a generation module.
  • This application is used on the server side to perform knowledge extraction on text data with a certain structure, to obtain QA pairs for customer service robots. Data is extracted from the structured document to be processed to obtain, as the first processed data, the paragraphs in which QA pairs may appear in the text to be processed; a second data extraction is then performed after matching against the preset entity database, obtaining as the second processed data the sentence matching the entity data (that is, the sentence where the target data corresponding to the entity data is located).
  • The relationship data and associated data corresponding to the entity data are obtained by calculating the similarity between each word of the second processed data and the entity data; finally, the similarity of the triple of entity data, relationship data and associated data is calculated and the target data is generated. This solves the problems of existing structured document extraction methods, which require a lot of manpower, have low work efficiency and generate poor-quality QA pairs, and can greatly reduce the time cost of manually entering QA pairs.
  • A target detection model is also used to check the rationality of the target data, and to adjust target data with defects such as a missing subject-verb-object structure or repetition, to further improve the accuracy of the obtained target results.
  • The document knowledge extraction method in this embodiment is applied to scenarios such as insurance business, performing knowledge extraction on text data with a certain structure, such as insurance contracts and official publicity documents; see FIG. 1. The method includes the following steps:
  • S100 Obtain a structured document to be processed, perform data extraction on the structured document to be processed, and obtain a paragraph where the target data is located as the first processing data;
  • The above knowledge extraction method is mainly used in customer service robot application scenarios in the insurance industry, handling the daily consultation, claim settlement, insurance renewal and other services of insured customers.
  • The documents to be processed are mainly insurance contracts, official publicity documents, news, encyclopedias, etc.
  • the above-mentioned documents to be processed can come from multiple channels.
  • The above types of documents have certain prompts and structure. As an example and not a limitation, if a user needs to ask about insurance costs, the answers related to insurance costs can be obtained from the "Insurance Rates and Premiums" section of the insurance contract.
  • step S100 data extraction is performed on the structured document to be processed, and the paragraph where the target data is located is obtained as the first processing data, referring to FIG. 2, which specifically includes the following steps:
  • S110 Perform semantic encoding on the to-be-processed file to obtain encoded data corresponding to the to-be-processed file;
  • the above semantic encoding corresponds to the following semantic decoding, which can be implemented by existing neural networks, including but not limited to common ones such as CNN/RNN/Bi-RNN/GRU/LSTM/Bi-LSTM.
  • the attention model (including the above-mentioned first attention model and the second attention model) is a resource allocation model that uses weighted changes to target data to achieve target data acquisition.
  • The main purpose of the data extraction is to determine the paragraphs in which QA pairs (i.e. target data) may appear in the document to be processed.
  • During neural network processing, the first attention model is used to weight each paragraph of the structured document to be processed, and the paragraphs including the target data are extracted according to the weights as the first processed data.
  • the first attention model is used to extract data from the document to be processed.
  • The attention model relies on the Encoder-Decoder framework. After the Encoder encodes the text to be processed, the input sentence is converted into intermediate semantics through a linear transformation; the attention model is then used to assign weights to the data in the text to be processed, and finally the paragraph data that may generate QA pairs is obtained as the first processed data.
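  • The encode-weight-select pipeline above can be sketched as follows. This is a minimal bag-of-words illustration, not the trained Encoder-Decoder attention model the application describes: counting query terms stands in for learned semantic encoding, and the softmax over raw scores stands in for the learned weight assignment.

```python
import math

def softmax(scores):
    # Normalize raw paragraph scores into attention-style weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_paragraph(paragraphs, query_terms):
    """Score each paragraph by query-term overlap, weight the scores
    with a softmax, and return the highest-weighted paragraph."""
    scores = [sum(p.count(t) for t in query_terms) for p in paragraphs]
    weights = softmax(scores)
    best = max(range(len(paragraphs)), key=lambda i: weights[i])
    return paragraphs[best], weights

# Toy "structured document": one paragraph per first-level title.
paragraphs = [
    "insurance objects: the insured shall be a legally resident adult",
    "insurance rates and premiums: the premium is 2% of the sum insured",
    "claims settlement terms: report the accident within 48 hours",
]
best, weights = select_paragraph(paragraphs, ["premium", "rates"])
```

  • Here the "insurance rates and premiums" paragraph receives the largest weight and is returned as the first processed data.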
  • Before the first attention model is used to assign weights to the encoded data corresponding to each paragraph in the file to be processed, the method also includes training the first attention model; referring to FIG. 3, this includes the following:
  • Let the training sample be an insurance contract of XX.
  • S122 Semantically encode the sample data to be processed
  • S130 Perform semantic decoding on the to-be-processed file according to the weight, and obtain paragraph data including the target data as the first processing data.
  • The document to be processed contains hierarchical titles, such as the first-level titles "insurance objects", "insurance rates and premiums", and "claims settlement terms". For the document to be processed, the first attention model can be used to obtain the position of the paragraph corresponding to the insurance fee, and all the data contained in that paragraph are used to generate the QA pair associated with the insurance fee.
  • S200 Obtain entity data matching the structured document to be processed from a preset entity library according to the type of the structured document to be processed;
  • A common entity database for the insurance industry is pre-built in this implementation scenario; it can be generated by manual input and data mining.
  • the entities include but are not limited to the specific names of insurance types, common terms in the industry, etc.
  • After the entity database is constructed, it is used, when the second processed data is subsequently analyzed, to discover the entities extracted from the second processed data and the relationships between those entities, so as to facilitate the subsequent generation of QA pairs.
  • The preset entity library can be regarded as QA-alignment question data, so that entity extraction and the correlation calculation in step S300 can subsequently be performed according to the entity data.
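  • A minimal sketch of the type-keyed lookup in step S200 follows. The document types, entity names, and dictionary structure are illustrative assumptions, not the patent's actual entity library:

```python
# Hypothetical preset entity library, keyed by document type.
ENTITY_LIBRARY = {
    "insurance_contract": ["premium", "insured amount", "claim", "grace period"],
    "publicity_document": ["product launch", "coverage", "promotion"],
}

def match_entities(doc_type, document_text):
    """Return the library entities for this document type that
    actually occur in the document text."""
    candidates = ENTITY_LIBRARY.get(doc_type, [])
    return [e for e in candidates if e in document_text]

text = "The premium shall be paid before the grace period expires."
matched = match_entities("insurance_contract", text)
```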
  • S300 Perform data extraction on the first processing data according to the entity data, and obtain a statement containing target data as the second processing data;
  • the above target data is the final QA pair that needs to be extracted.
  • The first processed data is extracted based on the preset entity library, and the sentence corresponding to an entity in the entity library that matches the first processed data is obtained as the second processed data. As an example and not a limitation, suppose the first processed data obtained is "claims settlement clauses include the data of acceptance and report, on-site investigation, settlement, and claim statistics; the claim statistics include that Party B shall pay Party A's claim settlement fee within 7 working days of each quarter". Then the sentence matching "claim statistics" is "claim statistics include that Party B shall pay Party A's claim settlement fee within 7 working days of each quarter", and the analysis of the second processed data in S400 is then used to generate the target data associated with the claim settlement time.
  • The first processed data can also be processed using a pre-trained attention model, as in step S100 above. The specific process is similar to steps S110-S130, with the text to be processed replaced by the first processed data, and the training samples replaced accordingly during training of the attention model.
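  • The sentence-level extraction of step S300 can be sketched as below: split the first processed data into sentences and keep those mentioning a matched entity. Splitting on periods is a simplifying assumption; as noted above, the application also allows a pre-trained attention model here.

```python
import re

def extract_sentences(first_processed_data, entities):
    """Split the paragraph into sentences and keep those that mention
    any matched entity; the result is the 'second processed data'."""
    sentences = [s.strip() for s in re.split(r"[.。]", first_processed_data)
                 if s.strip()]
    return [s for s in sentences
            if any(e.lower() in s.lower() for e in entities)]

paragraph = ("Claims settlement clauses include acceptance and report. "
             "Claim statistics include that Party B shall pay Party A's "
             "claim settlement fee within 7 working days of each quarter.")
second = extract_sentences(paragraph, ["claim statistics"])
```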
  • S400 Calculate the correlation between the entity data and the second processed data, and generate target data according to the calculation result.
  • The above-mentioned correlation analysis refers to the analysis of two or more correlated variable elements, so as to measure the degree of correlation between the variables.
  • The correlation analysis is used to find the degree of correlation among the entity data, relationship data and associated data (that is, to judge whether the QA pair is established).
  • step S400 calculates the correlation between the entity data and the second processing data, and generates target data according to the calculation result. Referring to FIG. 4 , the following steps are included:
  • the entity data corresponding to the second processing data is the entity data matched from the preset database in the above step S200.
  • The above splitting of the second processed data may be based on preset rules, for example splitting by characters and words, or performing semantic analysis on the second processed data and splitting according to semantics; it can also be implemented autonomously through deep learning models.
  • S430 Calculate the correlation between the entity data and each word in the word set using a second attention model, and obtain relationship data and associated data corresponding to the entity data according to the correlation;
  • the above-mentioned associated data is the candidate entity corresponding to the above-mentioned entity data
  • the relationship data is the relationship between the entity data and the candidate entity.
  • The correlation between the entity data and each word is calculated; for any word, the entity data, the word, and the relationship data between them form a triple, for example <entity, relationship, word> or <entity, attribute, attribute value>. The former represents the relationship between an entity and a word; the latter represents an attribute relationship within the description of the entity. The second attention model is used to process these triples, assigning weights to them to obtain the correlation results.
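  • A toy version of the triple construction just described. The fixed relation-word list is an assumption standing in for the second attention model's learned relation scoring:

```python
def build_triples(entity, sentence, relation_words=("is", "include", "includes")):
    """Split the sentence into words and pair the entity with each word,
    using the first relation word found as the relation slot of the
    <entity, relationship, word> triple."""
    words = sentence.lower().split()
    relation = next((w for w in words if w in relation_words), None)
    return [(entity, relation, w) for w in words if w != entity]

triples = build_triples("level", "level is customer manager")
```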
  • Before use, the second attention model is trained; see Figure 5, which includes the following:
  • The training samples include sample data labeled with sample entity data, sample relationship data, and sample associated data;
  • the triplet of sample entity data, sample relationship data and sample associated data can be obtained by collecting entities commonly used in the existing insurance industry.
  • For example, "level"-"is"-"customer manager" is a triple.
  • S432 Calculate the correlation between the entity data in the sample data and each word in the sample data
  • A weighted average method can be used for the correlation: weighted summation is performed on each component of the similarity score vector to obtain the final similarity between the entity and each word.
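  • The weighted-average step above amounts to a weighted sum over the components of a similarity score vector; the component values and weights below are made up for illustration:

```python
def weighted_similarity(score_vector, weights):
    """Weighted summation over each component of the similarity score
    vector, normalized by the total weight."""
    assert len(score_vector) == len(weights)
    total = sum(weights)
    return sum(s * w for s, w in zip(score_vector, weights)) / total

# e.g. three similarity components between an entity and one word
sim = weighted_similarity([0.9, 0.4, 0.7], [0.5, 0.2, 0.3])
```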
  • S434 Compare the sample relationship result and the sample correlation result with the sample relationship data and the sample correlation data respectively, and adjust the second attention model until the training process is completed, and the trained second attention model is obtained.
  • The second attention model in the above steps is used to calculate the similarity between each word in the second processed data and its corresponding entity data, and to obtain the relationship data and associated data (i.e., candidate entities), which can be quickly and accurately located so that the target data can subsequently be generated.
  • S440 Calculate the correlation between the entity data, relationship data, and associated data, and generate target data after the correlation exceeds a preset threshold.
  • Entity data, relationship data, and associated data form triples. If the correlation of a triple exceeds the preset threshold, it is a QA pair that needs to be generated (that is, the target data can be generated). As an example and not a limitation, suppose knowledge extraction is performed on an insurance contract of XX: the entity data corresponds to "insurance", the corresponding word is "meet the requirements", the relationship data between them is "is", and the correlation exceeds the threshold; then, when the user asks "whether it meets the insurance requirements", the triple corresponds to the answer "XX insurance meets the requirements". In this solution, the generated target data is the QA pairs in the text to be processed, which are used by the customer service robot to handle the user's daily consultation, claim settlement, insurance renewal and other services.
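  • Step S440's thresholding can be sketched as follows; the question/answer templates and the threshold value are illustrative assumptions, not the application's actual generation logic:

```python
def generate_qa(entity, relation, word, correlation, threshold=0.8):
    """Emit a QA pair only when the triple's correlation exceeds the
    preset threshold; otherwise discard the candidate."""
    if correlation <= threshold:
        return None
    return {
        "Q": f"{entity} {relation} {word}?",
        "A": f"{entity} {relation} {word}.",
    }

qa = generate_qa("XX insurance", "meets", "the requirements", correlation=0.92)
```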
  • S500 Use a pre-trained target detection model to detect the target data, and adjust the target data according to the detection result.
  • a target detection model is set to detect the rationality of the generated target data.
  • The specific detection includes the following three adjustment strategies: judging the subject, verb and object; identifying and removing duplicate or similar knowledge content; and scoring the input data. This can further improve the quality of the generated target data and facilitate the later maintenance and update of knowledge. After this technology is enabled, the generated QA pairs can also be manually reviewed.
  • The above target detection model needs to be trained before use, using logical sentences commonly used in existing application scenarios as training samples, to judge the subject, verb and object, identify and remove duplicate or similar knowledge content, and score the input data.
  • the above-mentioned pre-trained target detection model is used to detect the target data, and the target data is adjusted according to the detection results, including but not limited to the following adjustment strategies:
  • analyzing the target data obtaining subject-verb-object data corresponding to the target data, performing a correlation score on the subject-verb-object data, and marking the target data with a lower score;
  • the target data is recorded and compared with historical target data to check for duplicates, and when the target data is duplicated with the historical target data, the target data is deleted.
  • The generated target data should have complete subject-verb-object data. If the subject-verb-object data in the target data is incomplete, there may be errors in the data; after marking, manual verification and adjustment, or automatic deletion and re-extraction, can be adopted.
  • The generated target data should not contain duplicates. Therefore, when duplicate data is detected, it needs to be automatically deleted, keeping only one copy.
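  • The adjustment strategies above can be sketched as a single pass over the generated target data. The "fewer than three words" completeness check is a crude stand-in for real subject-verb-object parsing, which the application delegates to a trained target detection model:

```python
def detect_and_adjust(targets, history):
    """Delete targets that duplicate historical target data, and flag
    targets whose answers look structurally incomplete for review."""
    kept, flagged = [], []
    seen = set(history)
    for t in targets:
        if t in seen:
            continue            # duplicate: delete, keep only one copy
        seen.add(t)
        if len(t.split()) < 3:  # toy subject-verb-object completeness check
            flagged.append(t)   # mark for manual verification
        kept.append(t)
    return kept, flagged

kept, flagged = detect_and_adjust(
    ["XX insurance meets the requirements",
     "premium paid",
     "XX insurance meets the requirements"],
    history=[],
)
```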
  • The above text to be processed and target data can be uploaded to the blockchain for subsequent use as reference or training samples. Uploading to the blockchain ensures their security and their fairness and transparency to users. User equipment can download the summary information from the blockchain to verify whether the priority list has been tampered with; the voice file with the corresponding amount of data can also be downloaded from the blockchain for voice broadcast without a generation process, which effectively improves the efficiency of voice processing.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This solution performs knowledge extraction on text data with a certain structure to obtain QA pairs for customer service robots. Through the first attention model in step S100, the text to be processed is structurally analyzed to obtain, as the first processed data, the paragraphs where QA pairs (that is, the target data) may appear, which is equivalent to locating the paragraph data where the target data is located. Data is then extracted according to the preset entity database to obtain, as the second processed data, the sentences matching entity data in the preset entity database. Finally, the second attention model is used to calculate the similarity between each word in the second processed data and its corresponding entity data, obtain the relationship data and associated data, obtain the triple of entity data, relationship data and associated data whose similarity exceeds the threshold, and generate the target data (step S400). This solves the problems of the prior art that structured document extraction requires a lot of manpower, has low work efficiency and produces poor-quality QA pairs, and can greatly reduce the time cost of manually entering QA pairs.
  • a target detection model is also set up in this scheme to further improve the quality of the generated target data and facilitate the maintenance and update of knowledge in the later stage.
  • Embodiment 2:
  • a document knowledge extraction apparatus 6 in this embodiment includes: an acquisition module 61 , a matching module 62 , an extraction module 63 , a generation module 64 and an adjustment module 65 .
  • the obtaining module 61 is configured to obtain a structured document to be processed, perform data extraction on the structured document to be processed, and obtain the paragraph where the target data is located as the first processing data;
  • a matching module 62 configured to obtain entity data matching the structured document to be processed from a preset entity library according to the type of the structured document to be processed;
  • Extraction module 63 configured to perform data extraction on the first processing data according to the entity data, and obtain a statement containing the target data as the second processing data;
  • the generating module 64 is configured to calculate the correlation between the entity data and the second processed data, and generate target data according to the calculation result.
  • the generating module 64 also includes the following:
  • an obtaining unit 641 configured to obtain second processing data and entity data corresponding to the second processing data
  • a splitting unit 642 configured to split the second processing data to obtain a word set corresponding to the second processing data
  • a processing unit 643, configured to calculate the correlation between the entity data and each word in the word set using the second attention model, and obtain relationship data and associated data corresponding to the entity data according to the correlation;
  • the generating unit 644 is configured to calculate the correlation between the entity data, the relationship data and the associated data, and generate target data after the correlation exceeds a preset threshold.
  • The adjustment module 65 is used, after the target data is generated, to detect the target data with a pre-trained target detection model and adjust it according to the detection result, including the following: analyzing the target data to obtain the subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing; and/or analyzing the target data to obtain the corresponding subject-verb-object data, scoring the correlation of the subject-verb-object data, and marking target data with a lower score; and/or recording the target data, comparing it with historical target data for duplication checking, and deleting the target data when it duplicates historical target data.
  • This technical solution is based on natural language processing with semantic parsing in speech semantics, and is applied to scenarios such as insurance.
  • Knowledge extraction is performed on text data with a certain structure to obtain QA pairs for customer service robots. The acquisition module extracts data from the structured document to be processed, obtaining as the first processed data the paragraphs in which QA pairs may appear in the text to be processed, thereby locating the paragraph data where the target data is located; the matching module performs data matching according to the preset entity library; the extraction module then performs the second data extraction, obtaining as the second processed data the statement matching the entity data (that is, the statement where the target data corresponding to the entity data is located); finally, the generating module calculates the similarity between the second processed data and the entity data and generates the target data. This solves the problems that the extraction methods in the existing technology require a lot of manpower, have low work efficiency and generate poor-quality QA pairs, and can greatly reduce the time cost of manually entering QA pairs.
  • The adjustment module uses the target detection model to check the rationality of the target data, including but not limited to judging the subject, verb and object, identifying and removing duplicate or similar knowledge content, and scoring the input data, to further improve the quality of the generated target data and facilitate the later maintenance and update of knowledge.
  • the present application also provides a computer device 7, which may comprise multiple computer devices; the components of the document knowledge extraction apparatus 6 of the second embodiment may be distributed across different computer devices 7. A computer device 7 may be a smartphone, tablet, laptop, desktop computer, rack server, blade server, tower server, or cabinet server (including a standalone server, or a server cluster composed of multiple servers) that executes the program.
  • the computer device in this embodiment at least includes, but is not limited to, a memory 71, a processor 72, a network interface 73, and a document knowledge extraction device 6 that can be communicatively connected to one another through a system bus, as shown in FIG. 7.
  • FIG. 7 only shows a computer device with some of its components, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the memory 71 includes at least one type of computer-readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 71 may be an internal storage unit of a computer device, such as a hard disk or memory of the computer device.
  • the memory 71 may also be an external storage device of a computer device, such as a plug-in hard disk equipped on the computer device, a smart memory card (Smart Media Card, SMC), Secure Digital (SD) card, Flash Card (Flash Card), etc.
  • the memory 71 may also include both the internal storage unit of the computer device and its external storage device.
  • the memory 71 is generally used to store the operating system and various application software installed on the computer equipment, such as the program code of the document knowledge extraction apparatus 6 in the first embodiment, and the like.
  • the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 72 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip.
  • the processor 72 is typically used to control the overall operation of the computer equipment.
  • the processor 72 is configured to run the program code or process data stored in the memory 71 , for example, run the document knowledge extraction device 6 to implement the document knowledge extraction method of the first embodiment.
  • the network interface 73 may include a wireless network interface or a wired network interface, and the network interface 73 is generally used to establish a communication connection between the computer device 7 and other computer devices 7 .
  • the network interface 73 is used to connect the computer device 7 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 7 and the external terminal.
  • the network may be a wireless or wired network such as an intranet (Intranet), the Internet (Internet), the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • FIG. 7 only shows the computer device 7 with components 71-73, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead.
  • the document knowledge extraction device 6 stored in the memory 71 can also be divided into one or more program modules, which are stored in the memory 71 and executed by one or more processors (the processor 72 in this embodiment) to complete the present application.
  • Embodiment 4:
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and includes multiple storage media, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, App application store, etc., on which computer programs are stored; when the programs are executed by the processor 72, the corresponding functions are realized.
  • the computer-readable storage medium of this embodiment is used to store a document knowledge extraction device, and when executed by the processor 72, implements the document knowledge extraction method of the first embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

A document knowledge extraction method, apparatus, computer device, and readable storage medium, relating to the technical field of natural language processing in artificial intelligence, comprising: acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data (S100); acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed (S200); performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data (S300); and computing the correlation between the entity data and the second processed data, and generating target data according to the computation result (S400), solving the problem that manual extraction from structured documents is labor-intensive and inefficient.

Description

Document knowledge extraction method and apparatus, computer device, and readable storage medium
This application claims priority to the Chinese patent application No. 202011228800.0, entitled "Document knowledge extraction method, apparatus, computer device and readable storage medium", filed with the Chinese Patent Office on November 6, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing in artificial intelligence, and in particular to a document knowledge extraction method, apparatus, computer device, and readable storage medium.
Background
With the development of artificial intelligence and natural language processing technology, intelligent question answering for customer service robots shows great promise. Intelligent question answering mainly interprets user questions posed in natural language and returns concise, precise, correctly matched answers by retrieving a corpus or question-answering knowledge base; in the insurance industry in particular, customer service robots can effectively handle customers' routine inquiries, claims, and renewal services. The corpora or question-answering knowledge bases associated with current customer service robots require a large number of QA (Question & Answer) pairs to support user inquiries.
However, the inventor found that existing QA pairs are mainly produced by manually summarizing large numbers of insurance contracts or promotional documents and entering them into a knowledge base by hand for use by customer service robots. These documents exhibit a certain structure and regularity and contain many questions that users frequently ask; this extraction approach requires substantial manpower and is inefficient.
Summary of the Application
The purpose of the present application is to provide a document knowledge extraction method, apparatus, computer device, and readable storage medium for solving the problem that manual extraction from structured documents is labor-intensive and inefficient.
To achieve the above purpose, the present application provides a document knowledge extraction method, comprising:
acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data;
computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
To achieve the above purpose, the present application further provides a document knowledge extraction apparatus, comprising:
an acquisition module, used to acquire a structured document to be processed, perform data extraction on the structured document to be processed, and obtain the paragraph where target data is located as first processed data;
a matching module, used to acquire, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
an extraction module, used to perform data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data;
a generation module, used to compute the correlation between the entity data and the second processed data and to generate target data according to the computation result.
To achieve the above purpose, the present application further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps of the above document knowledge extraction method:
acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data;
computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
To achieve the above purpose, the present application further provides a computer-readable storage medium comprising multiple storage media, each storing a computer program, wherein the computer programs stored on the multiple storage media, when executed by a processor, jointly implement the following steps of the above document knowledge extraction method:
acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data;
computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
The document knowledge extraction method, apparatus, computer device, and readable storage medium provided by the present application perform data extraction on a structured document to be processed to obtain the paragraphs of the text in which QA pairs may appear as first processed data, perform matching against a preset entity library followed by data extraction to obtain the sentences matching the entity data as second processed data, and compute the similarity between the second processed data and the entity data to obtain target data, thereby solving the problem that manual extraction from structured documents is labor-intensive and inefficient.
Brief Description of the Drawings
FIG. 1 is a flowchart of Embodiment 1 of the document knowledge extraction method of the present application;
FIG. 2 is a flowchart of performing data extraction on the structured document to be processed and obtaining the paragraph where the target data is located as the first processed data in Embodiment 1 of the document knowledge extraction method of the present application;
FIG. 3 is a flowchart of training the first attention model before the first attention model is used to assign weights to the encoded data corresponding to each paragraph of the file to be processed, in Embodiment 1 of the document knowledge extraction method of the present application;
FIG. 4 is a flowchart of computing the correlation between the entity data and the second processed data and generating the target data according to the computation result in Embodiment 1 of the document knowledge extraction method of the present application;
FIG. 5 is a flowchart of training the second attention model before the second attention model is used to compute the correlation between the entity data and each word in the word set and to obtain, according to the correlation, the relation data and associated data corresponding to the entity data, in Embodiment 1 of the document knowledge extraction method of the present application;
FIG. 6 is a schematic diagram of the program modules of Embodiment 2 of the document knowledge extraction apparatus of the present application;
FIG. 7 is a schematic diagram of the hardware structure of the computer device in Embodiment 3 of the computer device of the present application.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be noted that the embodiments of the present application and the features of the embodiments may be combined with one another provided there is no conflict.
The document knowledge extraction method, apparatus, computer device, and readable storage medium provided by the present application are applicable to the field of natural language processing in artificial intelligence, providing a document knowledge extraction method based on an acquisition module, a matching module, an extraction module, and a generation module. The present application runs on the server side and performs knowledge extraction on text data with a certain structure to obtain QA pairs for customer service robots: data extraction is performed on the structured document to be processed to obtain the paragraphs of the text in which QA pairs may appear as first processed data; after data matching against a preset entity library, a second data extraction is performed to obtain the sentences matching the entity data (that is, the sentences where the target data corresponding to the entity data is located) as second processed data; the similarity between each word of the second processed data and the entity data is then computed to obtain the relation data and associated data corresponding to the entity data; finally, the similarity of the entity data, relation data, and associated data triple is computed and the target data generated. This solves the problems in the prior art that structured-document extraction requires substantial manpower, works inefficiently, and produces poor-quality QA pairs, and can greatly reduce the time cost of entering QA pairs manually. After the extraction of the target data is completed, a target detection model is also used to check the reasonableness of the target data and to adjust target data with defects such as missing subject-verb-object elements or duplication, further improving the accuracy of the obtained target results.
Embodiment 1
Referring to FIG. 1, a document knowledge extraction method of this embodiment is applied in scenarios such as insurance business to perform knowledge extraction on text data with a certain structure, specifically insurance contracts, official promotional documents, and the like, and comprises the following steps:
S100: acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data.
In this solution, the above knowledge extraction method is mainly applied in customer-service-robot scenarios in the insurance industry to support services such as policyholders' routine inquiries, claims, and renewals. The documents to be processed are mainly insurance contracts, as well as official model documents, news and encyclopedia entries, and so on, and may come from multiple channels. Documents of the above types have certain cues and structure; by way of example and not limitation, if a user asks a question about insurance premiums, an answer related to the premiums can be obtained from the "premium rates and premiums" section of an insurance contract.
Specifically, performing data extraction on the structured document to be processed and obtaining the paragraph where the target data is located as the first processed data in step S100 above, referring to FIG. 2, comprises the following steps:
S110: semantically encoding the file to be processed to obtain encoded data corresponding to the file to be processed.
In the above step, the semantic encoding corresponds to the semantic decoding below and can be implemented with existing neural networks, including but not limited to common ones such as CNN, RNN, Bi-RNN, GRU, LSTM, and Bi-LSTM.
S120: using a first attention model to assign weights to the encoded data corresponding to each paragraph of the file to be processed.
In this solution, an attention model (including the first attention model and the second attention model above) is a resource allocation model that obtains target data by applying weighted transformations to it. The data extraction of step S100 mainly determines the paragraphs of the document in which QA pairs (that is, target data) may appear: after an existing neural network semantically encodes the structured document to be processed, the first attention model assigns weights to the paragraphs of the structured document during the network's processing, and the paragraphs containing the target data are extracted according to the weights as the first processed data.
Specifically, in this solution, the first attention model performs data extraction on the document to be processed. The attention model relies on the Encoder-Decoder framework: the Encoder encodes the text to be processed, converting the input sentences into an intermediate semantic representation through nonlinear transformations; the attention model then assigns weights to the data in the text to be processed; finally, the paragraph data of the text in which QA pairs may be produced is obtained as the first processed data.
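The weight-and-select behavior described above can be sketched as follows. This is a minimal stand-in, not the patented model: the raw scores passed in play the role of the alignment scores a trained first attention model would emit for each paragraph encoding, and the softmax/top-k selection shows how weights turn into the first processed data.

```python
import math

def softmax(scores):
    # Numerically stable softmax over raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_paragraphs(paragraphs, scores, top_k=1):
    """Keep the top_k paragraphs by attention weight.

    `scores` stands in for the alignment scores a trained attention
    layer would produce for each paragraph encoding; in the real
    pipeline they come from the encoder, not from the caller.
    """
    weights = softmax(scores)
    ranked = sorted(zip(weights, paragraphs), reverse=True)
    return [p for _, p in ranked[:top_k]]

# Toy run: the claims-settlement paragraph receives the highest score.
paras = ["insured parties", "premium rates and premiums", "claims settlement terms"]
print(select_paragraphs(paras, [0.1, 0.3, 2.5]))  # ['claims settlement terms']
```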
Specifically, before the first attention model is used to assign weights to the encoded data corresponding to each paragraph of the file to be processed, the method further comprises training the first attention model, referring to FIG. 3, comprising the following:
S121: acquiring training samples, the training samples being sample text to be processed bearing sample result labels.
By way of example and not limitation, suppose a training sample is an XX insurance contract; the paragraph data corresponding to the "claims settlement terms" in the XX insurance contract is marked as the sample result label. More specifically, if the "claims settlement terms" appear on page i, lines x to x+7 of the XX contract, all the data in lines x to x+7 of the contract is marked as the sample result label.
S122: semantically encoding the sample data to be processed.
S123: assigning weights to the paragraphs of the semantically encoded sample data to be processed, and decoding to obtain a sample processing result.
Steps S122 and S123 above are handled in the same way as steps S120 and S130 during processing.
S124: comparing the sample processing result with the sample result labels and adjusting the loss function of the first attention model until the training process is completed, obtaining the trained first attention model.
During processing, the trained first attention model extracts data from the text to be processed and identifies the regions of the document in which valid knowledge is most likely to appear. It can quickly locate these regions containing valid knowledge (that is, regions that may generate QA pairs, i.e., target data) and learn their features, which helps filter knowledge from the text quickly and further improves the accuracy of the subsequently obtained target data (i.e., QA pairs).
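The sample-labeling step (marking lines x to x+7 of the contract as the sample result label, as in S121) can be sketched as a per-line binary mask. The 0-based indexing and the binary encoding are assumptions of this sketch, not details fixed by the description.

```python
def label_sample(lines, start, end):
    """Build a per-line sample result label for one training sample:
    1 for lines inside the annotated span (e.g. the claims-settlement
    clause on lines x..x+7), 0 elsewhere. Indices are 0-based and
    inclusive here - an assumption of this sketch."""
    return [1 if start <= i <= end else 0 for i in range(len(lines))]

contract = ["insured parties ...", "premium rates ...",
            "claims terms begin ...", "claims terms continue ..."]
print(label_sample(contract, 2, 3))  # [0, 0, 1, 1]
```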
S130: semantically decoding the file to be processed according to the weights to obtain the paragraph data containing the target data as the first processed data.
As an example, if the document to be processed contains hierarchical headings, such as the top-level headings "insured parties, premium rates and premiums, claims settlement terms", the first attention model can obtain from the document the location of the paragraph corresponding to premium rates and premiums, together with all the data contained in that paragraph, for generating QA pairs associated with insurance premiums.
S200: acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed.
In this solution, a library of entities common to the insurance industry of this implementation scenario is built in advance and can be produced by manual entry or data mining; the entities include but are not limited to the specific names of insurance products, common industry terms, and so on. Once built, the entity library is used in the subsequent analysis of the second processed data to discover the entities extracted from the second processed data and the relations between entities, facilitating the subsequent generation of QA pairs. It should be noted that the preset entity library can be regarded as the question data of the QA pairs, so that the entity extraction and correlation computation of step S300 can subsequently be performed on the basis of the entity data.
S300: performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data.
It should be noted that the target data above is the QA pairs that ultimately need to be extracted.
In this solution, the first processed data is extracted on the basis of the preset entity library: the entities in the library can be matched against the first processed data to obtain the sentences corresponding to the entities as the second processed data. By way of example and not limitation, if the first processed data obtained is "the claims settlement terms include case acceptance, on-site survey, case closing, claims statistics and other data; claims statistics include that party B shall pay party A's claims expenses within 7 working days of each quarter", then the sentence matching "claims statistics" is "claims statistics include that party B shall pay party A's claims expenses within 7 working days of each quarter", and the analysis of the second processed data in S400 subsequently generates target data associated with the claims settlement time. Alternatively, as in step S100 above, a pre-trained attention model can be used to process the first processed data; the processing is similar to steps S110-S130 above, with the text to be processed replaced by the first processed data and the training samples replaced accordingly during attention-model training.
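The entity-to-sentence matching described above can be sketched as a naive lookup: split the first processed data into sentences and keep, per entity, the sentences that mention it. The delimiter set and the plain substring test are assumptions of this sketch; the description also allows an attention model to do this extraction.

```python
import re

def extract_sentences(first_processed, entities):
    """For each entity from the preset library, return the sentences of
    the first processed data that mention it, as second processed data.
    A simple split on common delimiters stands in for a trained extractor.
    """
    pieces = [s.strip()
              for s in re.split(r"[.;,。；，]", first_processed)
              if s.strip()]
    return {e: [s for s in pieces if e in s] for e in entities}

doc = ("Claims terms include case acceptance and on-site survey. "
       "Claims statistics require party B to pay party A within 7 working days each quarter.")
print(extract_sentences(doc, ["Claims statistics"]))
```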
S400: computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
Specifically, the correlation analysis above refers to analyzing two or more correlated variable elements so as to measure how closely the variable factors are related. In this solution, correlation analysis is used, based on the entity data, to find in the second processed data the associated data (entity 2, the answer) and the relation data corresponding to that entity data (entity 1, the question) (concretely, the QA pair formed by entity 1 - relation - entity 2), and to judge the degree of correlation among the three (that is, to judge whether the QA pair holds).
Specifically, computing the correlation between the entity data and the second processed data and generating the target data according to the computation result in step S400 above, referring to FIG. 4, comprises the following steps:
S410: acquiring the second processed data and the entity data corresponding to the second processed data.
In the above step, the entity data corresponding to the second processed data is the entity data matched from the preset database in step S200 above.
S420: splitting the second processed data to obtain a word set corresponding to the second processed data.
Specifically, the splitting of the second processed data may follow preset rules, for example splitting by characters or by words, or the second processed data may be semantically parsed and then split according to the semantics; it may also be carried out autonomously by a deep learning model.
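The rule-based splitting options just described can be sketched as follows. Whitespace word splitting is an English-language stand-in for proper word segmentation, and the rule names are assumptions of this sketch; the description equally allows semantic or model-driven splitting.

```python
def split_words(sentence, rule="word"):
    """Split the second processed data into a word set by a preset rule.

    rule="word" splits on whitespace (an English stand-in for word-level
    segmentation); rule="char" splits per character, mirroring the
    character-level option in the description.
    """
    if rule == "char":
        return [c for c in sentence if not c.isspace()]
    return sentence.split()

print(split_words("pay claims within 7 working days"))
print(split_words("7 days", rule="char"))  # ['7', 'd', 'a', 'y', 's']
```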
S430: using a second attention model to compute the correlation between the entity data and each word in the word set, and obtaining, according to the correlation, relation data and associated data corresponding to the entity data.
The associated data above is the candidate entity corresponding to the entity data, and the relation data is the relation between the entity data and the candidate entity.
Specifically, to compute the correlation between the entity data and each word, any word is taken, and the entity data, the word, and the relation data between the entity data and that word form a triple, for example <entity, relation, word> or <entity, attribute, attribute value>: the former expresses the relation between two entities, while the latter describes an attribute relation internal to an entity. The second attention model processes the aforementioned triples, assigning weights to them to obtain the correlation results.
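The candidate ranking behind this step can be sketched with toy embeddings. Cosine similarity here is only a stand-in for the weights the second attention model assigns to each <entity, relation, word> triple, and the example vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(entity_vec, word_vecs):
    """Score each candidate word against the entity vector and return
    the words ordered by relevance, most relevant first."""
    scored = {w: cosine(entity_vec, v) for w, v in word_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)

entity = [1.0, 0.0]  # toy embedding of an entity such as "premium"
words = {"amount": [0.9, 0.1], "weather": [0.0, 1.0]}
print(rank_candidates(entity, words))  # ['amount', 'weather']
```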
More specifically, before the second attention model is used to compute the correlation between the entity data and each word in the word set and to obtain, according to the correlation, the relation data and associated data corresponding to the entity data, the second attention model is trained, referring to FIG. 5, comprising the following:
S431: acquiring training samples, the training samples comprising sample data labeled with sample entity data, sample relation data, and sample associated data.
In the above step, the triples of sample entity data, sample relation data, and sample associated data can be obtained by collecting entities commonly used in the existing insurance industry; as an example, "grade" - "is" - "account manager" is one triple.
S432: computing the correlation between the entity data in the sample data and each word in the sample data.
In the above step, the correlation can adopt a weighted-average method, that is, performing a weighted sum over the components of the similarity-score vector to obtain the final similarity between the entity and each word.
S433: obtaining, according to the correlation, a sample relation result and a sample association result corresponding to the entity data.
S434: comparing the sample relation result and the sample association result with the sample relation data and the sample associated data, respectively, and adjusting the second attention model until the training process is completed, obtaining the trained second attention model.
Using the second attention model of the above steps to compute the similarity between each word of the second processed data and its corresponding entity data yields the relation data and associated data (i.e., the candidate entities), allowing quick and precise localization for the subsequent generation of target data.
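The weighted-average correlation of step S432 can be sketched as a weighted sum over the components of a similarity-score vector. Normalizing by the total weight is an assumption of this sketch; the description only specifies a weighted sum of the components.

```python
def weighted_similarity(score_vector, weights):
    """Weighted sum over the components of a similarity-score vector,
    normalized by the total weight (the normalization is an assumption
    of this sketch)."""
    assert len(score_vector) == len(weights) and sum(weights) > 0
    return sum(s * w for s, w in zip(score_vector, weights)) / sum(weights)

# Three similarity components with weights 2, 1, 1.
print(weighted_similarity([1.0, 0.5, 0.0], [2, 1, 1]))  # 0.625
```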
S440: computing the correlation among the entity data, the relation data, and the associated data, and generating the target data when the correlation exceeds a preset threshold.
The entity data, relation data, and associated data form a triple; if the correlation within the triple exceeds the preset threshold, it is a QA pair to be generated (that is, the target data can be generated). By way of example and not limitation, in knowledge extraction on an XX insurance contract, the entity data is "insurance application", the corresponding word is "meets the requirements", and the relation data between them is "is"; if the correlation exceeds the threshold, the triple becomes the answer "the XX insurance application meets the requirements" to the user question "does it meet the application requirements". In this solution, the generated target data is the QA pairs of the text to be processed, used by customer service robots to handle users' routine inquiries, claims, renewals, and other services.
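The threshold gate of step S440 can be sketched as follows. The sentence templates and the default threshold of 0.8 are assumptions of this sketch; the description only requires that a QA pair be emitted when the triple's correlation exceeds a preset threshold.

```python
def generate_qa(entity, relation, candidate, relevance, threshold=0.8):
    """Emit a QA pair only when the triple's correlation clears the
    preset threshold; return None otherwise. The question/answer
    templates are illustrative assumptions, not the patented wording."""
    if relevance <= threshold:
        return None
    question = "Does {} {} {}?".format(entity, relation, candidate)
    answer = "{} {} {}.".format(entity, relation, candidate)
    return (question, answer)

print(generate_qa("the application", "meets", "the requirements", 0.93))
print(generate_qa("the application", "meets", "the requirements", 0.40))  # None
```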
After the target data is generated, the method further comprises the following:
S500: detecting the target data with a pre-trained target detection model, and adjusting the target data according to the detection result.
In this solution, after the extraction of the target data is completed, a target detection model is provided to check the reasonableness of the generated target data. The checks include the following (specifically, the three adjustment strategies below): judging the subject, verb, and object; identifying and removing duplicated or near-identical knowledge content; and scoring the input data. This further improves the quality of the generated target data and facilitates later maintenance and updating of the knowledge; after this technique is enabled, the generated QA pairs can also be reviewed manually.
The target detection model needs to be trained before use, with logically well-formed sentences commonly used in the existing application scenario as training samples; the detection includes judging the subject, verb, and object, identifying and removing duplicated or near-identical knowledge content, and scoring the input data.
Specifically, detecting the target data with the pre-trained target detection model and adjusting the target data according to the detection result includes but is not limited to the following adjustment strategies:
parsing the target data to obtain subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing;
and/or parsing the target data to obtain subject-verb-object data corresponding to the target data, scoring the correlation of the subject-verb-object data, and marking target data with a low score;
and/or recording the target data and checking it against historical target data for duplicates, and deleting the target data when it duplicates historical target data.
In this solution, the generated target data, serving as QA pairs for customer service robots, should have complete subject-verb-object data; if the subject-verb-object data of the target data is incomplete, the data may be erroneous and can be marked for manual verification and adjustment, or automatically deleted and re-extracted. Likewise, the generated target data should not contain duplicates, so when duplicate data is detected the duplicates must be automatically deleted, keeping only one copy. It should be noted that the above three adjustment strategies are concrete examples based on currently common problems; the adjustment strategies executed by the target detection model can be adapted to the actual usage scenario.
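The completeness and duplicate checks among the adjustment strategies above can be sketched over (subject, verb, object) triples as follows. This is a rule-based stand-in for the trained target detection model, and the score-based marking strategy is omitted here for brevity.

```python
def vet_targets(targets, history):
    """Vet generated (subject, verb, object) triples: flag triples with a
    missing part for manual review, drop duplicates of historical targets
    (and of earlier triples in this batch), and keep the rest."""
    kept, flagged = [], []
    seen = set(history)
    for t in targets:
        if any(not part for part in t):
            flagged.append(t)   # incomplete subject-verb-object: mark it
        elif t in seen:
            continue            # duplicate of a recorded target: delete it
        else:
            seen.add(t)
            kept.append(t)
    return kept, flagged

targets = [("application", "meets", "requirements"),
           ("application", "meets", ""),              # incomplete SVO
           ("application", "meets", "requirements")]  # duplicate
print(vet_targets(targets, []))
```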
The text to be processed and the target data can be correspondingly uploaded to a blockchain for later use as reference samples or training samples. Uploading to the blockchain guarantees their security and their fairness and transparency to users: a user device can download the digest information from the blockchain to check whether the priority list has been tampered with, and can later also download from the blockchain the voice file corresponding to the amount data for voice broadcasting without a generation process, effectively improving speech-processing efficiency.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks produced in association with one another using cryptographic methods; each data block contains the information of a batch of network transactions, used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
This solution is applied to knowledge extraction from text data with a certain structure, obtaining QA pairs for customer service robots. In step S100, the first attention model performs a structural analysis of the text to be processed and obtains the paragraphs in which QA pairs (i.e., target data) may appear as the first processed data, which amounts to locating the paragraph data where the target data resides. Data extraction is then performed against the preset entity library to obtain the sentences matching the entity data of the preset entity library as the second processed data. Finally, the second attention model computes the similarity between each word of the second processed data and its corresponding entity data to obtain the relation data and associated data; the entity data, relation data, and associated data triples whose similarity exceeds the threshold are obtained, and the target data is generated (step S400). This solves the problems in the prior art that structured-document extraction requires substantial manpower, works inefficiently, and produces poor-quality QA pairs, and can greatly reduce the time cost of entering QA pairs manually. This solution also provides a target detection model to further improve the quality of the generated target data and to facilitate later maintenance and updating of the knowledge.
Embodiment 2:
Referring to FIG. 6, a document knowledge extraction apparatus 6 of this embodiment comprises: an acquisition module 61, a matching module 62, an extraction module 63, a generation module 64, and an adjustment module 65.
The acquisition module 61 is used to acquire a structured document to be processed, perform data extraction on the structured document to be processed, and obtain the paragraph where target data is located as first processed data;
the matching module 62 is used to acquire, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
the extraction module 63 is used to perform data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data;
the generation module 64 is used to compute the correlation between the entity data and the second processed data and to generate target data according to the computation result.
Preferably, the generation module 64 further comprises the following:
an acquisition unit 641, used to acquire the second processed data and the entity data corresponding to the second processed data;
a splitting unit 642, used to split the second processed data to obtain a word set corresponding to the second processed data;
a processing unit 643, used to compute, with a second attention model, the correlation between the entity data and each word in the word set, and to obtain, according to the correlation, relation data and associated data corresponding to the entity data;
a generation unit 644, used to compute the correlation among the entity data, relation data, and associated data, and to generate target data when the correlation exceeds a preset threshold.
The adjustment module 65 is used to detect the target data with a pre-trained target detection model after the target data is generated and to adjust the target data according to the detection result, including the following: parsing the target data to obtain subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing; and/or parsing the target data to obtain subject-verb-object data corresponding to the target data, scoring the correlation of the subject-verb-object data, and marking target data with a low score; and/or recording the target data, checking it against historical target data for duplicates, and deleting the target data when it duplicates historical target data.
This technical solution is based on natural language processing for semantic parsing within speech semantics, and is applied in scenarios such as insurance to perform knowledge extraction on text data with a certain structure and obtain QA pairs for customer service robots. The acquisition module extracts data from the structured document to be processed and obtains the paragraphs of the text in which QA pairs may appear as first processed data, locating the paragraph data where the target data resides; the matching module performs data matching against the preset entity library; the extraction module then performs a second data extraction to obtain the sentences matching the entity data (that is, the sentences where the target data corresponding to the entity data is located) as second processed data; finally, the generation unit computes the similarity between the second processed data and the entity data and generates the target data. This solves the problems in the prior art that the extraction approach requires substantial manpower, works inefficiently, and produces poor-quality QA pairs, and can greatly reduce the time cost of entering QA pairs manually.
In this solution, after the extraction of the target data is completed, the adjustment module also runs the target detection model's reasonableness check on the target data, including but not limited to judging the subject, verb, and object, identifying and removing duplicated or near-identical knowledge content, and scoring the input data, further improving the quality of the generated target data and facilitating later maintenance and updating of the knowledge.
Embodiment 3:
To achieve the above purpose, the present application further provides a computer device 7, which may comprise multiple computer devices; the components of the document knowledge extraction apparatus 6 of Embodiment 2 may be distributed across different computer devices 7. A computer device 7 may be a smartphone, tablet, laptop, desktop computer, rack server, blade server, tower server, or cabinet server (including a standalone server, or a server cluster composed of multiple servers) that executes the program. The computer device of this embodiment at least includes, but is not limited to, a memory 71, a processor 72, a network interface 73, and a document knowledge extraction apparatus 6 that can be communicatively connected to one another through a system bus, as shown in FIG. 7. It should be pointed out that FIG. 7 only shows a computer device with some of its components, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
In this embodiment, the memory 71 includes at least one type of computer-readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and so on. In some embodiments, the memory 71 may be an internal storage unit of the computer device, such as the hard disk or memory of the computer device. In other embodiments, the memory 71 may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a Secure Digital (SD) card, or a flash card (Flash Card) provided on the computer device. Of course, the memory 71 may also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the memory 71 is generally used to store the operating system and various application software installed on the computer device, such as the program code of the document knowledge extraction apparatus 6 of Embodiment 1. In addition, the memory 71 can also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 72 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is generally used to control the overall operation of the computer device. In this embodiment, the processor 72 is used to run the program code or process the data stored in the memory 71, for example to run the document knowledge extraction apparatus 6, so as to implement the document knowledge extraction method of Embodiment 1.
The network interface 73 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 7 and other computer devices 7. For example, the network interface 73 is used to connect the computer device 7 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 7 and the external terminal. The network may be a wireless or wired network such as an intranet (Intranet), the Internet (Internet), the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be pointed out that FIG. 7 only shows the computer device 7 with components 71-73, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead.
In this embodiment, the document knowledge extraction apparatus 6 stored in the memory 71 may also be divided into one or more program modules, which are stored in the memory 71 and executed by one or more processors (the processor 72 in this embodiment) to complete the present application.
Embodiment 4:
To achieve the above purpose, the present application further provides a computer-readable storage medium, which may be non-volatile or volatile and includes multiple storage media, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, App application store, and so on, on which a computer program is stored; when the program is executed by the processor 72, the corresponding functions are realized. The computer-readable storage medium of this embodiment is used to store the document knowledge extraction apparatus, and when executed by the processor 72, implements the document knowledge extraction method of Embodiment 1.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application; any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A document knowledge extraction method, comprising the following:
    acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
    acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
    performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data; and
    computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
  2. The document knowledge extraction method according to claim 1, wherein performing data extraction on the structured document to be processed and obtaining the paragraph where the target data is located as the first processed data comprises the following:
    semantically encoding the file to be processed to obtain encoded data corresponding to the file to be processed;
    using a first attention model to assign weights to the encoded data corresponding to each paragraph of the file to be processed; and
    semantically decoding the file to be processed according to the weights to obtain the paragraph data containing the target data as the first processed data.
  3. The document knowledge extraction method according to claim 1, wherein computing the correlation between the entity data and the second processed data and generating the target data according to the computation result comprises the following:
    acquiring the second processed data and the entity data corresponding to the second processed data;
    splitting the second processed data to obtain a word set corresponding to the second processed data;
    using a second attention model to compute the correlation between the entity data and each word in the word set, and obtaining, according to the correlation, relation data and associated data corresponding to the entity data; and
    computing the correlation among the entity data, the relation data, and the associated data, and generating the target data when the correlation exceeds a preset threshold.
  4. The document knowledge extraction method according to claim 3, wherein before the second attention model is used to compute the correlation between the entity data and each word in the word set and to obtain, according to the correlation, the relation data and associated data corresponding to the entity data, the second attention model is trained, comprising the following:
    acquiring training samples, the training samples comprising sample data labeled with sample entity data, sample relation data, and sample associated data;
    computing the correlation between the entity data in the sample data and each word in the sample data;
    obtaining, according to the correlation, a sample relation result and a sample association result corresponding to the entity data; and
    comparing the sample relation result and the sample association result with the sample relation data and the sample associated data, respectively, and adjusting the second attention model until the training process is completed, obtaining the trained second attention model.
  5. The document knowledge extraction method according to claim 2, wherein before the first attention model is used to assign weights to the encoded data corresponding to each paragraph of the file to be processed, the method further comprises training the first attention model, comprising the following:
    acquiring training samples, the training samples being sample text to be processed bearing sample result labels;
    semantically encoding the sample data to be processed;
    assigning weights to the paragraphs of the semantically encoded sample data to be processed, and decoding to obtain a sample processing result; and
    comparing the sample processing result with the sample result labels and adjusting the loss function of the first attention model until the training process is completed, obtaining the trained first attention model.
  6. The document knowledge extraction method according to claim 1, wherein after the target data is generated, the method further comprises the following:
    detecting the target data with a pre-trained target detection model, and adjusting the target data according to the detection result.
  7. The document knowledge extraction method according to claim 6, wherein detecting the target data with the pre-trained target detection model and adjusting the target data according to the detection result comprises the following:
    parsing the target data to obtain subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing;
    and/or parsing the target data to obtain subject-verb-object data corresponding to the target data, scoring the correlation of the subject-verb-object data, and marking target data with a low score;
    and/or recording the target data and checking it against historical target data for duplicates, and deleting the target data when it duplicates historical target data.
  8. A document knowledge extraction apparatus, comprising:
    an acquisition module, used to acquire a structured document to be processed, perform data extraction on the structured document to be processed, and obtain the paragraph where target data is located as first processed data;
    a matching module, used to acquire, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
    an extraction module, used to perform data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data; and
    a generation module, used to compute the correlation between the entity data and the second processed data and to generate target data according to the computation result.
  9. A computer device, wherein the computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the following steps of the document knowledge extraction method:
    acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
    acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
    performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data; and
    computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
  10. The computer device according to claim 9, wherein performing data extraction on the structured document to be processed and obtaining the paragraph where the target data is located as the first processed data comprises the following:
    semantically encoding the file to be processed to obtain encoded data corresponding to the file to be processed;
    using a first attention model to assign weights to the encoded data corresponding to each paragraph of the file to be processed; and
    semantically decoding the file to be processed according to the weights to obtain the paragraph data containing the target data as the first processed data.
  11. The computer device according to claim 9, wherein computing the correlation between the entity data and the second processed data and generating the target data according to the computation result comprises the following:
    acquiring the second processed data and the entity data corresponding to the second processed data;
    splitting the second processed data to obtain a word set corresponding to the second processed data;
    using a second attention model to compute the correlation between the entity data and each word in the word set, and obtaining, according to the correlation, relation data and associated data corresponding to the entity data; and
    computing the correlation among the entity data, the relation data, and the associated data, and generating the target data when the correlation exceeds a preset threshold.
  12. The computer device according to claim 11, wherein before the second attention model is used to compute the correlation between the entity data and each word in the word set and to obtain, according to the correlation, the relation data and associated data corresponding to the entity data, the second attention model is trained, comprising the following:
    acquiring training samples, the training samples comprising sample data labeled with sample entity data, sample relation data, and sample associated data;
    computing the correlation between the entity data in the sample data and each word in the sample data;
    obtaining, according to the correlation, a sample relation result and a sample association result corresponding to the entity data; and
    comparing the sample relation result and the sample association result with the sample relation data and the sample associated data, respectively, and adjusting the second attention model until the training process is completed, obtaining the trained second attention model.
  13. The computer device according to claim 9, wherein before a first attention model is used to assign weights to the encoded data corresponding to each paragraph of the file to be processed, the method further comprises training the first attention model, comprising the following:
    acquiring training samples, the training samples being sample text to be processed bearing sample result labels;
    semantically encoding the sample data to be processed;
    assigning weights to the paragraphs of the semantically encoded sample data to be processed, and decoding to obtain a sample processing result; and
    comparing the sample processing result with the sample result labels and adjusting the loss function of the first attention model until the training process is completed, obtaining the trained first attention model.
  14. The computer device according to claim 9, wherein detecting the target data with a pre-trained target detection model and adjusting the target data according to the detection result comprises the following:
    parsing the target data to obtain subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing;
    and/or parsing the target data to obtain subject-verb-object data corresponding to the target data, scoring the correlation of the subject-verb-object data, and marking target data with a low score;
    and/or recording the target data and checking it against historical target data for duplicates, and deleting the target data when it duplicates historical target data.
  15. A computer-readable storage medium comprising multiple storage media, each storing a computer program, wherein the computer programs stored on the multiple storage media, when executed by a processor, jointly implement the following steps of the document knowledge extraction method:
    acquiring a structured document to be processed, performing data extraction on the structured document to be processed, and obtaining the paragraph where target data is located as first processed data;
    acquiring, from a preset entity library, entity data matching the structured document to be processed according to the type of the structured document to be processed;
    performing data extraction on the first processed data according to the entity data to obtain sentences containing target data as second processed data; and
    computing the correlation between the entity data and the second processed data, and generating target data according to the computation result.
  16. The computer-readable storage medium according to claim 15, wherein performing data extraction on the structured document to be processed and obtaining the paragraph where the target data is located as the first processed data comprises the following:
    semantically encoding the file to be processed to obtain encoded data corresponding to the file to be processed;
    using a first attention model to assign weights to the encoded data corresponding to each paragraph of the file to be processed; and
    semantically decoding the file to be processed according to the weights to obtain the paragraph data containing the target data as the first processed data.
  17. The computer-readable storage medium according to claim 15, wherein computing the correlation between the entity data and the second processed data and generating the target data according to the computation result comprises the following:
    acquiring the second processed data and the entity data corresponding to the second processed data;
    splitting the second processed data to obtain a word set corresponding to the second processed data;
    using a second attention model to compute the correlation between the entity data and each word in the word set, and obtaining, according to the correlation, relation data and associated data corresponding to the entity data; and
    computing the correlation among the entity data, the relation data, and the associated data, and generating the target data when the correlation exceeds a preset threshold.
  18. The computer-readable storage medium according to claim 17, wherein before the second attention model is used to compute the correlation between the entity data and each word in the word set and to obtain, according to the correlation, the relation data and associated data corresponding to the entity data, the second attention model is trained, comprising the following:
    acquiring training samples, the training samples comprising sample data labeled with sample entity data, sample relation data, and sample associated data;
    computing the correlation between the entity data in the sample data and each word in the sample data;
    obtaining, according to the correlation, a sample relation result and a sample association result corresponding to the entity data; and
    comparing the sample relation result and the sample association result with the sample relation data and the sample associated data, respectively, and adjusting the second attention model until the training process is completed, obtaining the trained second attention model.
  19. The computer-readable storage medium according to claim 15, wherein before a first attention model is used to assign weights to the encoded data corresponding to each paragraph of the file to be processed, the method further comprises training the first attention model, comprising the following:
    acquiring training samples, the training samples being sample text to be processed bearing sample result labels;
    semantically encoding the sample data to be processed;
    assigning weights to the paragraphs of the semantically encoded sample data to be processed, and decoding to obtain a sample processing result; and
    comparing the sample processing result with the sample result labels and adjusting the loss function of the first attention model until the training process is completed, obtaining the trained first attention model.
  20. The computer-readable storage medium according to claim 15, wherein detecting the target data with a pre-trained target detection model and adjusting the target data according to the detection result comprises the following:
    parsing the target data to obtain subject-verb-object data corresponding to the target data, and marking the target data when part of the subject-verb-object data is missing;
    and/or parsing the target data to obtain subject-verb-object data corresponding to the target data, scoring the correlation of the subject-verb-object data, and marking target data with a low score;
    and/or recording the target data and checking it against historical target data for duplicates, and deleting the target data when it duplicates historical target data.
PCT/CN2021/091435 2020-11-06 2021-04-30 Document knowledge extraction method and apparatus, computer device, and readable storage medium WO2022095385A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011228800.0A CN112347226B (zh) 2020-11-06 2020-11-06 Document knowledge extraction method and apparatus, computer device, and readable storage medium
CN202011228800.0 2020-11-06

Publications (1)

Publication Number Publication Date
WO2022095385A1 true WO2022095385A1 (zh) 2022-05-12

Family

ID=74428363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091435 WO2022095385A1 (zh) 2020-11-06 2021-04-30 文档知识抽取方法、装置、计算机设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN112347226B (zh)
WO (1) WO2022095385A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114942971A * 2022-07-22 2022-08-26 北京拓普丰联信息科技股份有限公司 Structured data extraction method and apparatus
CN117076650A * 2023-10-13 2023-11-17 之江实验室 Intelligent dialogue method, apparatus, medium, and device based on a large language model
CN117743558A * 2024-02-20 2024-03-22 青岛海尔科技有限公司 Large-model-based knowledge processing and knowledge question-answering method, apparatus, and medium
CN118113816A * 2024-04-26 2024-05-31 杭州数云信息技术有限公司 Document knowledge extraction method and apparatus, storage medium, terminal, and computer program product

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347226B (zh) * 2020-11-06 2023-05-26 平安科技(深圳)有限公司 Document knowledge extraction method and apparatus, computer device, and readable storage medium
CN114492409B (zh) * 2022-01-27 2022-12-20 百度在线网络技术(北京)有限公司 File content evaluation method and apparatus, electronic device, and program product
CN117421416B (zh) * 2023-12-19 2024-03-26 数据空间研究院 Interactive retrieval method and apparatus, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457686A * 2019-07-23 2019-11-15 福建奇点时空数字科技有限公司 Deep-learning-based method for extracting entity attributes from information technology data
CN111126058A * 2019-12-18 2020-05-08 中汇信息技术(上海)有限公司 Automatic text information extraction method and apparatus, readable storage medium, and electronic device
US20200234183A1 (en) * 2019-01-22 2020-07-23 Accenture Global Solutions Limited Data transformations for robotic process automation
CN112347226A * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 Document knowledge extraction method and apparatus, computer device, and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226997B2 (en) * 2017-12-05 2022-01-18 International Business Machines Corporation Generating a chatbot from an FAQ
US11494350B2 (en) * 2018-10-05 2022-11-08 Verint Americas Inc. Building of knowledge base and FAQ from voice, chat, email, and social interactions
CN110532369B (zh) * 2019-09-04 2022-02-01 腾讯科技(深圳)有限公司 Question-answer pair generation method and apparatus, and server
CN111046152B (zh) * 2019-10-12 2023-09-29 平安科技(深圳)有限公司 Automatic FAQ question-answer pair construction method and apparatus, computer device, and storage medium
CN110727782A (zh) * 2019-10-22 2020-01-24 苏州思必驰信息科技有限公司 Question-answer corpus generation method and system
CN111143531A (zh) * 2019-12-24 2020-05-12 深圳市优必选科技股份有限公司 Question-answer pair construction method, system, apparatus, and computer-readable storage medium


Also Published As

Publication number Publication date
CN112347226B (zh) 2023-05-26
CN112347226A (zh) 2021-02-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888089

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888089

Country of ref document: EP

Kind code of ref document: A1