CN113128231A - Data quality inspection method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN113128231A
Authority
CN
China
Prior art keywords
text data
quality
entity
attribute information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110448549.7A
Other languages
Chinese (zh)
Inventor
邹阳
唐万祺
陈健
李思雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huize Times Technology Co ltd
Original Assignee
Shenzhen Huize Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huize Times Technology Co ltd
Priority to CN202110448549.7A
Publication of CN113128231A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Abstract

The invention discloses a data quality inspection method, a data quality inspection device, a storage medium and electronic equipment. Text data to be quality-inspected is obtained, and at least one of the following mode one and mode two is performed on it. Mode one: performing regular-expression matching on the text data to be quality-inspected to determine whether it includes violation words. Mode two: recognizing the text data to be quality-inspected with a pre-established entity recognition model to obtain an entity in the text data and attribute information of the entity; querying, in a pre-established knowledge graph, the attribute information of the entity; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct. The invention can perform quality inspection on the explanation content of insurance products with high efficiency and high accuracy.

Description

Data quality inspection method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of data quality inspection, and in particular, to a data quality inspection method, apparatus, storage medium, and electronic device.
Background
In existing intelligent quality inspection systems for insurance, common quality inspection items include forbidden terms, laws and regulations, non-compliant operations, and the like. These items can be inspected through rule configuration or a neural network model. However, because insurance products are time-sensitive and diverse, those quality inspection methods cannot be reused to inspect the explanation content of insurance products. For example, life insurance is time-sensitive: different life insurance products are launched and withdrawn at different times, and life insurance involves a large number of specific product attributes, such as the payment period, the insurance premium, and the premium amount.
Training a neural network model on accumulated corpora requires labeling a large amount of corpus data for a specific time period, and historically accumulated corpora become outdated as insurance products are launched and withdrawn. Quality inspection of the explanation content of insurance products based on this approach is therefore impractical; that is, there is currently no quality inspection method for the explanation content of insurance products.
Disclosure of Invention
In view of the above, the present invention is proposed to overcome or at least partially solve the above problems. In a first aspect, a data quality inspection method comprises:
acquiring text data to be quality-inspected;
performing at least one of the following mode one and mode two on the text data to be quality-inspected;
mode one: performing regular-expression matching on the text data to be quality-inspected to determine whether the text data to be quality-inspected includes violation words;
mode two: recognizing the text data to be quality-inspected with a pre-established entity recognition model to obtain an entity in the text data to be quality-inspected and attribute information of the entity; querying, in a pre-established knowledge graph, the attribute information of the entity in the knowledge graph; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct.
With reference to the first aspect, in certain optional implementations, the method further comprises:
obtaining at least one piece of text data;
performing word segmentation on the at least one piece of text data to obtain a plurality of independent phrases;
filtering the plurality of independent phrases to obtain a plurality of valid independent phrases, wherein the independent phrases that are filtered out include at least preset invalid independent phrases;
combining the valid independent phrases pairwise to obtain a plurality of valid combined phrases;
inputting a plurality of preset violation base words, the plurality of valid independent phrases, and the plurality of valid combined phrases into a pre-trained BERT model, to obtain a coding matrix of each violation base word, a coding matrix of each valid independent phrase, and a coding matrix of each valid combined phrase;
obtaining a plurality of violation expansion words from the valid independent phrases and the valid combined phrases according to the coding matrix of each violation base word, the coding matrix of each valid independent phrase, and the coding matrix of each valid combined phrase;
wherein performing regular-expression matching on the text data to be quality-inspected to determine whether the text data to be quality-inspected includes violation words comprises:
performing regular-expression matching on the text data to be quality-inspected to determine whether it includes a violation word from a violation word library, wherein the violation words in the violation word library include at least the violation base words and the violation expansion words.
With reference to the previous implementation, in some optional implementations, obtaining a plurality of violation expansion words from the valid independent phrases and the valid combined phrases according to the coding matrix of each violation base word, the coding matrix of each valid independent phrase, and the coding matrix of each valid combined phrase comprises:
for each coding matrix, performing: calculating the average of the values in each column of the coding matrix, so that each column corresponds to one average; and arranging the averages according to the position of their columns in the coding matrix to obtain a coding vector matching the coding matrix, wherein the coding vector is a row vector of 1 row and N columns, N is an integer greater than 1, and each coding vector corresponds to one coding matrix;
calculating a first similarity between the coding vector of each violation base word and the coding vector of each valid independent phrase, and obtaining at least one violation expansion word from the valid independent phrases according to the first similarity;
and calculating a second similarity between the coding vector of each violation base word and the coding vector of each valid combined phrase, and obtaining at least one violation expansion word from the valid combined phrases according to the second similarity.
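The column averaging and similarity calculation above can be sketched as follows. This is only an illustrative sketch: it assumes that each row of a coding matrix corresponds to one token and each column to one feature, and it uses cosine similarity as the similarity measure, which this passage does not specify. The names `pool_columns` and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def pool_columns(matrix: Matrix) -> List[float]:
    """Average each column of a coding matrix, keeping column order.

    The result is a 1 x N coding vector, where N is the number of columns.
    """
    if not matrix:
        raise ValueError("coding matrix must have at least one row")
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy matrices standing in for BERT outputs (values are made up).
base_vector = pool_columns([[1.0, 0.0], [3.0, 2.0]])   # violation base word
phrase_vector = pool_columns([[4.0, 2.0]])             # candidate phrase
similarity = cosine_similarity(base_vector, phrase_vector)
```

Averaging over rows gives every phrase a vector of the same length regardless of how many tokens it contains, which is what makes the base words and the candidate phrases directly comparable.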
With reference to the previous implementation, in some optional implementations, obtaining at least one violation expansion word from the valid independent phrases according to the first similarity comprises:
determining, as violation expansion words, the valid independent phrases whose first similarity is greater than a first preset threshold or whose ranking by first similarity is greater than a preset ordinal;
and obtaining at least one violation expansion word from the valid combined phrases according to the second similarity comprises:
determining, as violation expansion words, the valid combined phrases whose second similarity is greater than a second preset threshold or whose ranking by second similarity is greater than a preset ordinal.
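One possible reading of this selection rule is sketched below. It assumes that the ranking condition means "among the top `top_k` phrases when sorted by similarity in descending order"; that interpretation, the function name `select_expansion_words`, and the sample phrases and scores are all assumptions made for illustration.

```python
from typing import Dict, List


def select_expansion_words(
    similarities: Dict[str, float], threshold: float, top_k: int
) -> List[str]:
    """Keep phrases above the threshold OR among the top_k by similarity."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    top = set(ranked[:top_k])
    return [p for p in ranked if similarities[p] > threshold or p in top]


# Hypothetical similarity scores against one violation base word.
scores = {"guaranteed return": 0.92, "risk free": 0.81, "waiting period": 0.30}
expansion_words = select_expansion_words(scores, threshold=0.9, top_k=2)
```

With these sample values, the first phrase passes the threshold and the second is kept only because it ranks within the top two.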
With reference to the first aspect, in some optional implementations, comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct comprises:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model; if the two are consistent, determining that the text data to be quality-inspected is correct; otherwise, determining that the text data to be quality-inspected is incorrect.
With reference to the first aspect, in some optional implementations, the text data to be quality-inspected includes at least advisor answer text;
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct comprises:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to obtain a comparison result, wherein a first comparison result is obtained if the attribute information of the entity in the knowledge graph includes the attribute information obtained by the entity recognition model, and a second comparison result is obtained otherwise;
determining the number of negative words included in the advisor answer text and obtaining a negative word result; if the number of the negative words is an odd number, obtaining an odd result, and otherwise, obtaining an even result;
and determining whether the text data to be subjected to quality inspection is correct or not according to the comparison result and the negative word result.
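A minimal sketch of combining the comparison result with the negative word result follows. The passage does not state the combination rule, so the rule used here is an assumption: an even number of negative words leaves the statement affirmative, so the text is treated as correct when the knowledge graph contains the stated attribute; an odd number negates the statement, so the text is treated as correct when the knowledge graph does not contain it. The negative word list and all function names are hypothetical.

```python
import re
from typing import Set

NEGATIVE_WORDS = {"not", "no", "never"}  # hypothetical negative word list


def negative_word_result(answer: str) -> str:
    """Return "odd" or "even" for the number of negative words in the answer."""
    tokens = re.findall(r"[a-z']+", answer.lower())
    count = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return "odd" if count % 2 == 1 else "even"


def comparison_result(graph_values: Set[str], extracted_value: str) -> str:
    """Return "first" if the graph contains the extracted value, else "second"."""
    return "first" if extracted_value in graph_values else "second"


def is_correct(comparison: str, negation: str) -> bool:
    """Assumed rule: affirmative + contained, or negated + not contained."""
    return (comparison == "first") == (negation == "even")


graph_values = {"180 days"}
negated = is_correct(
    comparison_result(graph_values, "90 days"),
    negative_word_result("The waiting period is not 90 days"),
)
affirmed = is_correct(
    comparison_result(graph_values, "90 days"),
    negative_word_result("The waiting period is 90 days"),
)
```

Here `negated` is true because the advisor correctly denies a wrong value, while `affirmed` is false because the advisor asserts a value the knowledge graph does not contain.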
With reference to the first aspect, in certain optional implementations, the method further comprises:
when only mode one is performed and mode two is not performed: if the text data to be quality-inspected includes a violation word, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
when only mode two is performed and mode one is not performed: if the text data to be quality-inspected is incorrect, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
when both mode one and mode two are performed: if the text data to be quality-inspected includes a violation word and/or is determined to be incorrect, determining that the text data to be quality-inspected is non-compliant; and if the text data to be quality-inspected includes no violation word and is determined to be correct, determining that the text data to be quality-inspected is compliant.
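The three cases above reduce to a single rule: the text is compliant only if every mode that was performed raised no problem. A minimal sketch, in which `None` is an assumed convention meaning "this mode was not performed" and `is_compliant` is a hypothetical name:

```python
from typing import Optional


def is_compliant(
    has_violation_word: Optional[bool], is_text_correct: Optional[bool]
) -> bool:
    """Combine the mode one and mode two results into a compliance decision.

    has_violation_word: mode one result, or None if mode one was not performed.
    is_text_correct: mode two result, or None if mode two was not performed.
    """
    if has_violation_word is None and is_text_correct is None:
        raise ValueError("at least one of mode one and mode two must be performed")
    if has_violation_word:          # mode one found a violation word
        return False
    if is_text_correct is False:    # mode two found incorrect content
        return False
    return True
```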
In a second aspect, a data quality inspection apparatus includes: a data acquisition unit, a trigger unit, a mode one unit, and a mode two unit;
the data acquisition unit is configured to obtain text data to be quality-inspected;
the trigger unit is configured to trigger at least one of the mode one unit and the mode two unit;
the mode one unit is configured to perform regular-expression matching on the text data to be quality-inspected to determine whether the text data to be quality-inspected includes violation words;
the mode two unit is configured to recognize the text data to be quality-inspected with a pre-established entity recognition model to obtain an entity in the text data to be quality-inspected and attribute information of the entity; query, in a pre-established knowledge graph, the attribute information of the entity in the knowledge graph; and compare the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct.
In a third aspect, a storage medium has a program stored thereon, and the program realizes the data quality inspection method according to any one of the above when executed by a processor.
In a fourth aspect, an electronic device includes at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory to execute any one of the above data quality inspection methods.
With the above technical scheme, the data quality inspection method, the data quality inspection device, the storage medium and the electronic equipment provided by the invention obtain text data to be quality-inspected and perform at least one of the following mode one and mode two on it. Mode one: performing regular-expression matching on the text data to be quality-inspected to determine whether it includes violation words. Mode two: recognizing the text data to be quality-inspected with a pre-established entity recognition model to obtain an entity in the text data and attribute information of the entity; querying, in a pre-established knowledge graph, the attribute information of the entity; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct. The invention can therefore perform quality inspection on the explanation content of insurance products with high efficiency and high accuracy.
The foregoing is only an overview of the technical solutions of the present invention. Specific embodiments of the present invention are described below so that the technical means of the invention can be understood more clearly and so that the above and other objects, features, and advantages of the invention are more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a data quality inspection method provided by the present invention;
FIG. 2 is a flow chart illustrating another method for data quality inspection according to the present invention;
FIG. 3 is a schematic diagram of an encoding matrix provided by the present invention;
FIG. 4 is a flow chart illustrating another method for data quality inspection according to the present invention;
FIG. 5 is a flow chart illustrating another method for data quality inspection according to the present invention;
FIG. 6 is a schematic structural diagram of a data quality inspection apparatus according to the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, the present invention provides a data quality inspection method, including:
S100, obtaining text data to be quality-inspected;
Optionally, the text data to be quality-inspected may be text data that has been pre-processed and determined to correspond to explanation content about an insurance product exchanged between the insurance advisor and the client. A conversation between the insurance advisor and the client may cover many topics, such as asking the client's age or occupation. It is therefore necessary to first obtain the chat content between the insurance advisor and the client and then determine whether the current chat content is explanation content about an insurance product. For example, each sentence exchanged between the insurance advisor and the client may be classified by a binary classification model (e.g., a fastText model) to determine the text data corresponding to explanation content about the insurance product. That is, such text data can be identified by the binary classification model and used as the text data to be quality-inspected, which is not limited in the present invention.
For example, the communication between the client and the insurance advisor is as follows:
1. Customer: "is so".
2. Customer: "So how long is the waiting period for Darwin No. 3 critical illness insurance?"
3. Insurance advisor: "You want to ask about Darwin No. 3, this product, right?"
4. Insurance advisor: "The waiting period of this product is 180 days, and most critical illness products have a waiting period of 180 days."
5. Customer: "take good or deep".
For the above communication between the client and the insurance advisor, the binary classification model can determine that items 1 and 5 are not explanation content about the insurance product, while items 2, 3, and 4 are. That is, items 2, 3, and 4 can be used as the text data to be quality-inspected, which is not limited by the present invention.
Optionally, the text data to be quality-inspected may include only the insurance advisor's explanation content about the insurance product, or may also include the communication between the insurance advisor and the client about the insurance product, which is not limited by the present invention.
Optionally, the text data to be quality-inspected may be obtained by converting voice data of the communication between the insurance advisor and the client, or may be obtained directly from chat records between the insurance advisor and the client, such as WeChat records, QQ records, SMS records, and chat records of other social software. The present invention is not limited in this respect.
S110, performing at least one of the following mode one and mode two on the text data to be quality-inspected; mode one: performing regular-expression matching on the text data to be quality-inspected to determine whether the text data to be quality-inspected includes violation words; mode two: recognizing the text data to be quality-inspected with a pre-established entity recognition model to obtain an entity in the text data to be quality-inspected and attribute information of the entity; querying, in a pre-established knowledge graph, the attribute information of the entity in the knowledge graph; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity recognition model to determine whether the text data to be quality-inspected is correct.
Optionally, mode one and mode two perform quality inspection on the text to be quality-inspected from different angles. Mode one checks whether the text contains violation words whose use is restricted in the industry; if it does, the text data to be quality-inspected may be non-compliant.
Optionally, in mode one, quality inspection may be performed only on what the insurance advisor says, to determine whether the advisor's statements are non-compliant. Of course, both what the insurance advisor says and what the client says may also be inspected, which is not limited by the present invention.
Optionally, because mode one can only inspect general explanation content about insurance products, explanation content about specific product attributes mentioned by the insurance advisor needs to be inspected in combination with the insurance knowledge graph. For example, if the insurance advisor mentions specific product attributes of a specific insurance product, such as "severe disease", "payment period", and "insurance area", mode one cannot directly inspect the advisor's wording. Instead, entities must be extracted and their attribute information obtained in combination with the insurance knowledge graph in order to inspect the advisor's wording, which is the scheme of mode two.
Optionally, regular-expression matching is used to match the violation words against the text to be quality-inspected one by one, to determine whether the text data to be quality-inspected includes violation words. Regular-expression matching is a conventional technique; refer to descriptions of it in the art. The present invention is not limited in this respect.
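As an illustration of mode one, the following minimal sketch compiles a list of violation words into one regular expression and reports every match in a piece of text. The sample violation words and the function names are hypothetical; a real violation word library would come from the industry restrictions described above.

```python
import re
from typing import Iterable, List


def build_violation_pattern(words: Iterable[str]) -> "re.Pattern[str]":
    """Compile violation words into a single alternation pattern."""
    escaped = [re.escape(word) for word in words if word]
    if not escaped:
        raise ValueError("the violation word library must not be empty")
    # Longer words first, so that a longer word wins over its own prefix.
    escaped.sort(key=len, reverse=True)
    return re.compile("|".join(escaped))


def find_violation_words(text: str, pattern: "re.Pattern[str]") -> List[str]:
    """Return the violation words found in the text, in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical violation words and advisor statement.
pattern = build_violation_pattern(["guaranteed profit", "absolutely safe"])
hits = find_violation_words(
    "This product is absolutely safe and offers guaranteed profit.", pattern
)
```

An empty result means mode one found no violation word in the text.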
Optionally, the present invention may perform regular-expression matching on the whole text data to be quality-inspected, or may first split the text data to be quality-inspected into a plurality of phrases and then perform regular-expression matching on the split phrases, which is not limited by the present invention.
Optionally, the entity recognition model referred to herein is an entity recognition model from the knowledge graph field, which can recognize each entity included in the text data to be quality-inspected and the attribute information of each such entity.
The pre-established knowledge graph defines the entity nodes of the insurance field, the attribute information of the entity nodes, and the relationships among the nodes. For example, the entities defined in the knowledge graph include: products, waiting period, hesitation period, diseases, payment period, insurance years, severe diseases, intermediate diseases, mild diseases, disclaimer terms, characteristic labels, severe disease interval period, intermediate disease interval period, and the like. The knowledge graph also defines the relationships and attribute information among the entities; for example, the payment period is 3 years.
After the knowledge graph is established, each entity in the knowledge graph and the attribute information of each entity can be queried with Cypher query statements, which is not limited in the present invention.
Optionally, the knowledge graph is used as the "standard", and whether the entities and the attribute information in the text data to be quality-inspected all conform to this "standard" is determined. If the entity information or the attribute information does not conform to the standard, the text data to be quality-inspected is incorrect, which is not limited by the present invention.
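The comparison against the "standard" can be sketched as follows. An in-memory dictionary stands in for the knowledge graph (a real system would query a graph database instead), and the entity and attribute values are taken from the sample conversation above purely for illustration; whether the actual knowledge graph holds these values is an assumption, as is the function name `check_attribute`.

```python
from typing import Dict

# Stand-in for the pre-established knowledge graph: entity -> attribute -> value.
KNOWLEDGE_GRAPH: Dict[str, Dict[str, str]] = {
    "Darwin No. 3": {"waiting period": "180 days"},
}


def check_attribute(
    graph: Dict[str, Dict[str, str]], entity: str, attribute: str, stated_value: str
) -> bool:
    """Return True only if the stated value matches the knowledge graph.

    An unknown entity or attribute does not conform to the "standard",
    so it is treated as incorrect.
    """
    expected = graph.get(entity, {}).get(attribute)
    return expected is not None and expected == stated_value


# Entity and attribute as they might come out of the entity recognition model.
advisor_ok = check_attribute(KNOWLEDGE_GRAPH, "Darwin No. 3", "waiting period", "180 days")
advisor_wrong = check_attribute(KNOWLEDGE_GRAPH, "Darwin No. 3", "waiting period", "90 days")
```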
Optionally, mode one above describes how regular-expression matching is performed, including a scheme in which the text to be quality-inspected is split into a plurality of phrases before matching. That scheme is described in detail below, which is not limited in the present invention.
As shown in fig. 2, in conjunction with the embodiment shown in fig. 1, in some alternative embodiments, the method further comprises:
S200, obtaining at least one piece of text data;
Optionally, to improve the comprehensiveness and accuracy of quality inspection, a very large number of violation words are needed for regular-expression matching, and many phrases with meanings similar to a specific violation word can also serve as violation words. The violation words therefore need to be expanded, for example by obtaining at least one piece of text data from communication between the insurance advisor and the client and extracting new violation words from it. The text data may be historical data of communication between the insurance advisor and the client, which is not limited by the present invention.
Optionally, the embodiment of FIG. 2 may be applied to any text data in the insurance field and is not limited to text data of communication between the insurance advisor and the client. The more text data is used, the more the accuracy of the present invention benefits, which is not limited by the present invention.
Optionally, steps S200 to S250 of the embodiment of FIG. 2 may be executed repeatedly before the embodiment of FIG. 1 is executed, so as to construct a violation word library that meets actual needs. Steps S200 to S250 of the embodiment of FIG. 2 may also be executed in parallel with the embodiment of FIG. 1, so as to continuously update the violation word library, which is not limited by the present invention.
S210, performing word segmentation on the at least one piece of text data to obtain a plurality of independent phrases;
Optionally, because it must later be determined which phrases in the text data can serve as violation words in the violation word library, the text data is segmented in step S210 to obtain a plurality of independent phrases.
Optionally, an existing word segmentation tool may be used. For example, the open-source jieba word segmentation tool may be used, and other word segmentation tools or methods may of course be used as well. Any method that can split a piece of text data into a plurality of independent phrases falls within the protection scope of the present invention, which is not limited in this respect.
S220, filtering the plurality of independent phrases to obtain a plurality of valid independent phrases, wherein the independent phrases that are filtered out include at least preset invalid independent phrases;
Optionally, the independent phrases obtained by segmentation may include some invalid independent phrases, such as "okay" and "yes". Any independent phrase that is unrelated to the explanation content of insurance products may be regarded as an invalid independent phrase. To improve the efficiency and accuracy of expanding the violation word library, the invalid independent phrases may be filtered out in step S220 to reduce the amount of computation in subsequent steps, which is not limited by the present invention.
Optionally, the filtering may be done by regular-expression matching or by other means; any method that can filter out the invalid independent phrases falls within the protection scope of the present invention.
Optionally, after the invalid independent phrases are filtered out, the remaining independent phrases are all related to the explanation content of insurance products and are collectively called valid independent phrases.
S230, combining the effective independent phrases pairwise to obtain a plurality of effective combined phrases;
Optionally, in addition to some effective independent phrases, the violation word library may include phrases obtained by combining effective independent phrases, which are collectively referred to as effective combined phrases. Therefore, in step S230, the plurality of effective independent phrases may be combined pairwise to obtain a plurality of effective combined phrases, so that it can subsequently be determined which effective combined phrases may serve as violation words in the violation word library. The present invention is not limited in this respect.
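A minimal sketch of the pairwise combination in step S230 follows. The source does not say whether a pair is ordered or how the two phrases are joined, so this sketch assumes ordered pairs (both "A B" and "B A" are produced) and a configurable joiner; both choices are assumptions.

```python
from itertools import permutations


def combine_pairwise(valid_phrases, joiner=""):
    """Join every ordered pair of distinct valid phrases into a combined phrase."""
    # De-duplicate while preserving the original order.
    unique = list(dict.fromkeys(valid_phrases))
    return [joiner.join(pair) for pair in permutations(unique, 2)]


combined = combine_pairwise(["highest", "yield", "guaranteed"], joiner=" ")
print(combined)
```

Because the number of ordered pairs grows quadratically with the number of valid phrases, the filtering in the preceding step directly limits how many combined phrases have to be encoded later.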
S240, respectively inputting a plurality of preset violation basic words, a plurality of effective independent phrases and a plurality of effective combined phrases into a pre-trained BERT model, so as to respectively obtain a coding matrix of each violation basic word, a coding matrix of each effective independent phrase and a coding matrix of each effective combined phrase;
Optionally, BERT (full English name: Bidirectional Encoder Representations from Transformers) is a language representation model; as the name indicates, it is built on the bidirectional encoder of the Transformer. Unlike other recent language representation models, BERT pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. The pre-trained BERT representation can therefore be fine-tuned with an additional output layer and is suitable for building state-of-the-art models for a wide range of tasks. The goal of the BERT model is to use large-scale unlabeled corpora to train a representation of text that contains rich semantic information, that is, a semantic representation of the text, which is then fine-tuned for a specific NLP (Natural Language Processing) task and finally applied to that task.
In NLP methods based on deep neural networks, the characters/words in a text are usually represented by one-dimensional vectors (generally called "word vectors"). On this basis, the neural network takes the one-dimensional word vector of each character or word in the text as input and, after a series of complex transformations, outputs a one-dimensional vector as the semantic representation of the text. In particular, it is generally desirable that characters/words with similar semantics lie close together in the feature vector space, so that the text vector derived from the character/word vectors also carries more accurate semantic information. Therefore, the main input of the BERT model is the original word vector of each character/word in the text, which may be randomly initialized or pre-trained with an algorithm such as Word2Vec to serve as an initial value; the output is the vector representation of each character/word after full-text semantic information has been fused in. The present invention is not limited in this respect.
For example, the pre-trained BERT model is used to encode a violation base word, yielding the coding matrix of that violation base word:
$$WV \in \mathbb{R}^{n \times 768}$$
where $\mathbb{R}^{n \times 768}$ denotes the dimensions of the matrix, n is the number of characters in the violation base word, WV is the coding matrix of the violation base word, and 768 is the hidden-layer dimension of the BERT model. In the WV of each violation base word, each row is the coding vector of one character of the violation base word, and each column is one dimension across the characters of the violation base word. The BERT model is applied in the same way to the valid independent phrases and the valid combined phrases to obtain their coding matrices. For example, inputting the phrase "highest cost performance" into the BERT model may yield the coding matrix shown in FIG. 3.
S250, obtaining a plurality of illegal extended words from the effective independent phrases and the effective combination phrases according to the coding matrix of each illegal basic word, the coding matrix of each effective independent phrase and the coding matrix of each effective combination phrase;
Optionally, not every valid independent phrase or valid combined phrase can serve as a violation extension word. According to the coding matrix of each violation base word, the coding matrix of each valid independent phrase, and the coding matrix of each valid combined phrase, the valid independent phrases and valid combined phrases that are similar to a violation base word are determined to be violation extension words. The present invention is not limited in this respect.
Optionally, take the coding matrix of "highest cost performance" as an example. Each column of the coding matrix represents one dimension of information across the characters in the character sequence "highest cost performance", so averaging each column yields the information vector of the character sequence over the different dimensions. This vector is then unitized, that is, converted into a unit vector with a modulus of 1 for inner-product calculation, to produce the coding vector of "highest cost performance". The similarity between each valid independent phrase and a violation base word, and between each valid combined phrase and a violation base word, is subsequently calculated from these coding vectors. The present invention is not limited in this respect.
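The column averaging, unitization and inner-product similarity described here can be sketched in plain Python. The toy matrices have 3 columns instead of 768 and are invented for illustration; the function names are likewise assumptions.

```python
import math


def encode_vector(coding_matrix):
    """Average each column of an n x d coding matrix, then scale to unit length."""
    n = len(coding_matrix)
    means = [sum(column) / n for column in zip(*coding_matrix)]
    norm = math.sqrt(sum(x * x for x in means))
    if norm == 0.0:
        return means  # an all-zero vector cannot be normalized
    return [x / norm for x in means]


def similarity(u, v):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


# Toy coding matrices (rows = characters, columns = dimensions).
base = encode_vector([[1.0, 0.0, 0.0], [3.0, 0.0, 4.0]])  # column means: [2, 0, 2]
cand = encode_vector([[2.0, 0.0, 2.0]])
print(similarity(base, cand))
```

Normalizing once per phrase means every later comparison is a single inner product, which is convenient when each violation base word has to be compared with many candidate phrases.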
In the embodiment of fig. 2, the first mode includes:
performing regular-expression matching on the text data to be quality-checked, and determining whether the text data to be quality-checked comprises an illegal word in an illegal word bank, wherein the illegal words in the illegal word bank at least comprise the illegal basic words and the illegal extended words.
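A minimal sketch of matching text against the illegal word bank with a regular expression is shown below. The word bank contents and function names are hypothetical, and the words are escaped so that they are matched literally.

```python
import re


def build_violation_pattern(violation_words):
    """Compile one alternation pattern from the words in the word bank."""
    # Longer words first, so that overlapping entries prefer the longer match.
    ordered = sorted(set(violation_words), key=len, reverse=True)
    if not ordered:
        return re.compile(r"(?!)")  # never matches when the word bank is empty
    return re.compile("|".join(re.escape(word) for word in ordered))


def find_violations(text, pattern):
    """Return every word-bank entry found in the text, in order of appearance."""
    return pattern.findall(text)


pattern = build_violation_pattern(
    ["guaranteed return", "highest cost performance", "zero risk"]
)
hits = find_violations("This plan has zero risk and a guaranteed return.", pattern)
print(hits)
```

The text includes an illegal word exactly when the returned list is non-empty.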
Optionally, relevant contents of step S100 and step S110 in fig. 2 have already been described in the embodiment of fig. 1, and are not described again here.
Optionally, as shown in fig. 4, in combination with the embodiment shown in fig. 2, in some optional embodiments, the step S250 includes:
S300, for any coding matrix, executing the following steps: calculating an average value of the vector values in each column of the coding matrix, wherein each column corresponds to one average value; arranging the average values of the columns according to the position of each column in the coding matrix, so as to obtain a coding vector matched with the coding matrix, wherein the coding vector is a row vector of 1 row and N columns, N is an integer greater than 1, and one coding vector corresponds to one coding matrix;
optionally, step S300 of the present invention may be performed on the coding matrix of any illegal basic word, the coding matrix of any effective independent word group, and the coding matrix of any effective combined word group, so as to obtain the coding vector of each illegal basic word, the coding vector of each effective independent word group, and the coding vector of each effective combined word group, which is not limited in this respect.
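As a compact restatement of step S300, the coding vector of a coding matrix WV with n rows and N columns can be written as follows. The symbols v, v_j and WV_{ij} are notation introduced here for illustration and are not taken from the original figures.

```latex
v = (v_1, v_2, \dots, v_N), \qquad
v_j = \frac{1}{n} \sum_{i=1}^{n} WV_{ij}, \quad j = 1, \dots, N
```

For instance, a toy coding matrix with rows (1, 2) and (3, 6) gives v = (2, 4). With the 768-dimensional BERT model described above, N would be 768.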
S310, respectively calculating a first similarity between each coding vector corresponding to each illegal basic word and each coding vector corresponding to each effective independent phrase, so as to obtain at least one illegal extended word from the effective independent phrases according to the first similarity;
Optionally, the similarity between the coding vector of an effective independent phrase and the coding vector of an illegal basic word is referred to as a first similarity. Because different effective independent phrases may have different coding vectors, their first similarities to the same violation base word may differ, and their first similarities to different violation base words may also differ. The present invention is not limited in this respect.
Optionally, for a given violation base word, the first similarities between its coding vector and the coding vectors of the effective independent phrases may be sorted, and the effective independent phrases whose first similarity meets the actual requirement are selected as violation extension words. The present invention is not limited in this respect.
For example, in combination with the embodiment shown in fig. 4, in some optional embodiments, the obtaining at least one violation extension word from the valid independent phrases according to the first similarity in step S310 includes:
determining, as the violation extension words, the valid independent phrases whose first similarity is greater than a first preset threshold or whose first-similarity ranking is ahead of a preset ordinal.
Optionally, the first preset threshold is not limited in the present invention, and any feasible manner falls within the protection scope of the present invention.
Optionally, the present invention does not limit the preset ordinal, and any feasible manner is within the protection scope of the present invention. For example, the valid independent phrases whose first similarity ranks in the top three are determined as violation extension words.
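The selection rule (similarity above a preset threshold, or ranking ahead of a preset ordinal) might be sketched as follows. The scores, the threshold of 0.85 and the function name are invented for illustration; only the "top three" example comes from the text.

```python
def select_extension_words(scored_phrases, threshold=None, top_k=None):
    """Keep phrases whose similarity exceeds threshold or that rank within top_k."""
    ranked = sorted(scored_phrases.items(), key=lambda item: item[1], reverse=True)
    selected = []
    for rank, (phrase, score) in enumerate(ranked, start=1):
        above_threshold = threshold is not None and score > threshold
        within_rank = top_k is not None and rank <= top_k
        if above_threshold or within_rank:
            selected.append(phrase)
    return selected


# Hypothetical similarities of candidate phrases to one violation base word.
scores = {"best value": 0.93, "top yield": 0.88, "claim form": 0.41, "no risk": 0.79}
by_threshold = select_extension_words(scores, threshold=0.85)
by_rank = select_extension_words(scores, top_k=3)
print(by_threshold, by_rank)
```

The same function applies unchanged to the second similarity of valid combined phrases, since only the candidate set differs.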
S320, respectively calculating a second similarity between each coding vector corresponding to each violation basic word and each coding vector corresponding to each effective combined phrase, so as to obtain at least one violation expansion word from the effective combined phrases according to the second similarity.
Optionally, the similarity between the coding vector of a valid combined phrase and the coding vector of an illegal basic word is referred to as a second similarity. Because different valid combined phrases may have different coding vectors, their second similarities to the same violation base word may differ, and their second similarities to different violation base words may also differ. The present invention is not limited in this respect.
Optionally, for a given violation base word, the second similarities between its coding vector and the coding vectors of the valid combined phrases may be sorted, and the valid combined phrases whose second similarity meets the actual requirement are selected as violation extension words. The present invention is not limited in this respect.
Optionally, the present invention does not limit the execution sequence of step S310 and step S320, step S310 may be executed first, or step S320 may be executed first, which is not limited in this respect.
Optionally, if step S310 is executed first and then step S320 is executed, step S110 is executed after step S320 is executed; if step S320 is executed first and then step S310 is executed, step S110 is executed after step S310 is executed, which is not limited in the present invention.
Optionally, steps S200, S210, S220, S230, and step S240 in fig. 4 are already described in the embodiment of fig. 2, and are not described again here.
In some optional embodiments, with reference to the embodiment shown in fig. 4, the obtaining, in step S320, at least one violation extension word from the valid combined phrases according to the second similarity includes:
determining, as the violation extension words, the valid combined phrases whose second similarity is greater than a second preset threshold or whose second-similarity ranking is ahead of a preset ordinal.
Optionally, the second preset threshold is not limited in the present invention, and any feasible manner falls within the protection scope of the present invention.
Optionally, the present invention does not limit the preset ordinal, and any feasible manner is within the protection scope of the present invention. For example, the valid combined phrases whose second similarity ranks in the top three are determined as violation extension words.
Optionally, the violation words in the violation word library referred to herein are phrases that have been identified as violation words. These phrases come from manual settings, from the valid independent phrases described above, and from the valid combined phrases described above. A violation base word mentioned herein may be a violation word already included in the current violation word library, or a phrase manually set as a violation word, which is not limited in the present invention.
Optionally, if at least one phrase matched with the illegal word exists in the text data to be quality-checked, it is indicated that the text data to be quality-checked includes the illegal word; if the word group matched with the illegal word does not exist in the text data to be quality-tested, the text data to be quality-tested does not include the illegal word, and the method is not limited in this respect.
With reference to the embodiment shown in fig. 1, in some optional embodiments, in the second mode, comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity identification model to determine whether the text data to be quality-checked is correct includes:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity identification model; if the two are consistent, determining that the text data to be quality-checked is correct; otherwise, determining that the text data to be quality-checked is incorrect.
Optionally, each entity established in the knowledge graph and attribute information of each entity are used as "standard". The entity identification model identifies a specific entity and attribute information thereof from the text data to be quality-checked, if the attribute information of the entity in the knowledge graph is inconsistent with or greatly different from the attribute information obtained by the entity identification model, the description of the attribute information of the entity in the text data to be quality-checked is wrong, namely the text data to be quality-checked is incorrect, and the invention does not limit the description.
With reference to the embodiment shown in fig. 1, in some optional embodiments, the text data to be quality-checked at least includes: the advisor answers the text;
in the second mode, comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity identification model to determine whether the text data to be quality-checked is correct includes: step one, step two and step three;
Step one, comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity identification model, and obtaining a comparison result; if the attribute information of the entity in the knowledge graph comprises the attribute information obtained through the entity identification model, obtaining a first comparison result, and otherwise, obtaining a second comparison result;
Optionally, the advisor answer text refers to the text, in the text data to be quality-checked, that corresponds to the insurance advisor's explanation of the insurance product, which is not limited in the present invention.
Optionally, multiple entities can be extracted using the entity identification model (e.g., product: Guardian No. 3; critical illness: first-stage thyroid cancer). These are then converted into a Cypher query statement that queries the critical illness list of "Guardian No. 3" in the insurance knowledge graph. Suppose the query returns the critical illness list of Guardian No. 3 as: [malignant tumor, myocardial infarction, loss of limbs]. In conversation with a client, an insurance advisor usually does not use the normative disease names (such as those in the above critical illness list). Therefore, alignment processing can be performed on the spoken descriptions of disease names, mapping common spoken descriptions to the normative disease names. If the spoken description in the text data to be quality-checked corresponds to a normative disease name in the knowledge graph, the first comparison result is obtained; otherwise, the second comparison result is obtained. The present invention is not limited in this respect.
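Step one can be sketched with an in-memory stand-in for the knowledge graph. The dictionary below replaces the graph query, and the alias table, the key names and the function name are all hypothetical; only the product name and the three listed illnesses come from the example above.

```python
# Hypothetical stand-in for the result of querying the insurance knowledge graph.
KNOWLEDGE_GRAPH = {
    "Guardian No. 3": {
        "critical_illnesses": {"malignant tumor", "myocardial infarction", "loss of limbs"},
    },
}

# Hypothetical alignment table from spoken descriptions to normative names.
ALIASES = {
    "heart attack": "myocardial infarction",
    "cancer": "malignant tumor",
}


def compare_with_graph(product, spoken_disease):
    """Return "first" if the aligned disease is listed for the product, else "second"."""
    normative = ALIASES.get(spoken_disease, spoken_disease)
    covered = KNOWLEDGE_GRAPH.get(product, {}).get("critical_illnesses", set())
    return "first" if normative in covered else "second"


print(compare_with_graph("Guardian No. 3", "heart attack"))    # first
print(compare_with_graph("Guardian No. 3", "cardiac arrest"))  # second
```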
Step two, determining the number of negative words included in the advisor answer text, and obtaining a negative word result; if the number of the negative words is an odd number, obtaining an odd result, and otherwise, obtaining an even result;
Optionally, step one only determines whether the attribute information described in the text data to be quality-checked exists in the knowledge graph. However, since the text data to be quality-checked comes from communication between the insurance advisor and the client, the semantics with which the attribute information is described also need to be determined. For example, "myocardial infarction" may be described in the negative form: "the critical illnesses of Guardian No. 3 do not include myocardial infarction", or in the affirmative form: "the critical illnesses of Guardian No. 3 include myocardial infarction". Therefore, to accurately determine whether the advisor's description of the attribute information is correct, the semantics of that description need to be determined from the number of negative words in the advisor answer text. The present invention is not limited in this respect.
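Counting negative words and taking the parity, as in step two, might look like the sketch below. The negative-word list and the English tokenization are assumptions made for illustration; the source works on Chinese text and does not list its negative words.

```python
import re

# Hypothetical negative-word list for English text.
NEGATION_WORDS = {"not", "no", "never", "without"}


def count_negations(answer_text):
    """Count tokens that are negative words or end with the contraction n't."""
    tokens = re.findall(r"[a-z']+", answer_text.lower())
    return sum(1 for t in tokens if t in NEGATION_WORDS or t.endswith("n't"))


def negation_result(answer_text):
    """Return "odd" for an odd number of negative words, otherwise "even"."""
    return "odd" if count_negations(answer_text) % 2 == 1 else "even"


print(negation_result("This product does not cover myocardial infarction"))  # odd
print(negation_result("This product covers myocardial infarction"))          # even
```

A plain word-list count is deliberately naive: it treats zero negative words as even (affirmative), and it would miscount any token that merely looks like a negative word, such as an abbreviation spelled "No.".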
And step three, determining whether the text data to be inspected is correct or not according to the comparison result and the negative word result.
Optionally, if the comparison result is that the attribute information of the entity in the knowledge graph does not include the attribute information obtained through the entity identification model, that is, no corresponding attribute information can be found in the knowledge graph for the insurance advisor's literal description of the attribute information of a specific entity, the text data to be quality-checked is incorrect. In other words, the insurance advisor's literal description of the attribute information of the entity is wrong, for example, describing "myocardial infarction" as "cardiac arrest". The present invention is not limited in this respect.
Optionally, if the comparison result is that the attribute information of the entity in the knowledge graph includes the attribute information obtained through the entity identification model, the insurance advisor's literal description of the attribute information of the entity is free of error, but the description may still be semantically wrong and requires further identification. For example, the insurance advisor names "myocardial infarction" correctly but states: "the critical illnesses of Guardian No. 3 do not include myocardial infarction", whereas the critical illnesses of Guardian No. 3 recorded in the knowledge graph do include myocardial infarction. It is therefore necessary to determine whether the insurance advisor's description of "myocardial infarction" is affirmative or negative, which may be determined from the number of negative words the advisor uses in that description. If the number of negative words is odd, the description is negative, the advisor's description of "myocardial infarction" is wrong, and the text data to be quality-checked is therefore incorrect; if the number of negative words is even, the description is affirmative, the advisor's description of "myocardial infarction" is free of error, and the text data to be quality-checked is therefore correct. The present invention is not limited in this respect.
Optionally, the knowledge graph is referred to in combination with the comparison result and the negative word result, so as to determine whether the description of the insurance advisor on the attribute information of a specific entity is correct, which is not limited in the present invention.
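The way step three combines the two results can be summarized in a small decision function. The string labels "first", "second", "odd" and "even" are chosen here to mirror the result names in the text, and the function name is an assumption.

```python
def is_text_correct(comparison_result, negation_result):
    """Combine the comparison result and the negative-word result.

    comparison_result: "first" if the knowledge graph includes the described
        attribute information, "second" otherwise.
    negation_result: "odd" or "even" number of negative words.
    """
    if comparison_result == "second":
        return False  # the described attribute is not in the knowledge graph
    # The attribute exists in the graph, so an odd number of negative words
    # denies a true fact and an even number affirms it.
    return negation_result == "even"


for comparison in ("first", "second"):
    for negation in ("odd", "even"):
        print(comparison, negation, is_text_correct(comparison, negation))
```

Only the combination of the first comparison result with an even negative-word result yields a correct text.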
Optionally, the first mode and the second mode of the present invention are quality inspection modes from two different angles, and the two modes do not interfere with each other. That is, the present invention may execute both the first mode and the second mode, or only one of them; the present invention is not limited in this respect, and the cases are described separately below.
As shown in fig. 5, in conjunction with the embodiment shown in fig. 1, in some alternative embodiments, the method further comprises:
S400, when only the first mode is executed and the second mode is not executed: if the text data to be quality-checked comprises the illegal word, determining that the text data to be quality-checked is not in compliance; otherwise, determining that the text data to be quality-checked is in compliance;
Optionally, if only the first mode is executed, the quality inspection result of the text data to be quality-checked is affected only by the result of the first mode, that is, it follows directly from the result of the first mode.
For example, if the result of mode one is that the text data to be quality-checked includes an illegal word, the explanation of the insurance product in the text data is wrong, that is, the text data to be quality-checked is not in compliance. Conversely, if the result of mode one is that the text data to be quality-checked does not include an illegal word, the text data to be quality-checked is in compliance. The present invention is not limited in this respect.
S410, when only the second mode is executed and the first mode is not executed: if the text data to be quality-checked is incorrect, determining that the text data to be quality-checked is not in compliance; otherwise, determining the compliance of the text data to be quality-checked;
Optionally, if only the second mode is executed, the quality inspection result of the text data to be quality-checked is affected only by the result of the second mode, that is, it follows directly from the result of the second mode.
For example, if the result of mode two is that the text data to be quality-checked is incorrect, the explanation of the insurance product in the text data is wrong, that is, the text data to be quality-checked is not in compliance. Conversely, if the result of mode two is that the text data to be quality-checked is correct, the text data to be quality-checked is in compliance. The present invention is not limited in this respect.
S420, under the condition that the first mode and the second mode are executed: if the text data to be subjected to quality inspection comprises violation words and/or the text data to be subjected to quality inspection is determined to be incorrect, determining that the text data to be subjected to quality inspection is not in compliance; and if the text data to be subjected to quality inspection does not comprise violation words and the text data to be subjected to quality inspection is determined to be correct, determining that the text data to be subjected to quality inspection is in compliance.
Optionally, the words used when describing text data to be quality checked in the present invention are: "compliant", "non-compliant", "correct", "incorrect", can be understood as: "compliance" and "non-compliance" describe the final result of the text data to be inspected, and "correct" and "incorrect" describe the result of the second way. The influence factor of the final result is determined according to the execution conditions of the first mode and the second mode, and if only the first mode is executed, the final result is only influenced by the result of the first mode, which is detailed in step S400; if only the second method is performed, the final result is only affected by the result of the second method, which is detailed in step S410; if the first and second modes are executed, the final result is affected by both the result of the first mode and the result of the second mode, which is detailed in step S420.
Optionally, when both the first mode and the second mode are executed, the final result for the text data to be quality-checked is one of two types: compliant or non-compliant. The text data is compliant only when the result of the first mode is that the text data does not include an illegal word and the result of the second mode is that the text data is correct. There are three non-compliant cases: (1) the first mode finds that the text data includes an illegal word and the second mode finds that the text data is correct; (2) the first mode finds that the text data does not include an illegal word and the second mode finds that the text data is incorrect; (3) the first mode finds that the text data includes an illegal word and the second mode finds that the text data is incorrect. The present invention is not limited in this respect.
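Steps S400, S410 and S420 can be folded into one decision function, sketched below. Representing a mode that was not executed by None is a convention chosen for this sketch, and the function and parameter names are assumptions.

```python
from typing import Optional


def is_compliant(has_violation_word: Optional[bool], is_correct: Optional[bool]) -> bool:
    """Decide compliance from the results of mode one and mode two.

    has_violation_word: result of mode one, or None if mode one was not executed.
    is_correct: result of mode two, or None if mode two was not executed.
    """
    if has_violation_word is None and is_correct is None:
        raise ValueError("at least one of the two modes must be executed")
    mode_one_ok = has_violation_word is None or not has_violation_word
    mode_two_ok = is_correct is None or is_correct
    return mode_one_ok and mode_two_ok


print(is_compliant(False, None))  # S400: only mode one executed, no illegal word
print(is_compliant(None, False))  # S410: only mode two executed, text incorrect
print(is_compliant(False, True))  # S420: both executed, both pass
```

When both modes are executed, the text is compliant in exactly one of the four combinations, which matches the one compliant case and three non-compliant cases enumerated above.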
As shown in fig. 6, the present invention provides a data quality inspection apparatus, including: a data obtaining unit 100, a trigger unit 200, a mode one unit 300 and a mode two unit 400;
the data obtaining unit 100 is configured to perform obtaining text data to be inspected;
the trigger unit 200 is configured to trigger at least one of the mode one unit 300 and the mode two unit 400;
the mode one unit 300 is configured to perform regular matching on the text data to be quality-checked, so as to determine whether the text data to be quality-checked includes an illegal word;
the second mode unit 400 is configured to perform recognition on the text data to be quality-tested through a pre-established entity recognition model, so as to obtain an entity in the text data to be quality-tested and attribute information of the entity; in a pre-established knowledge graph, inquiring attribute information of the entity in the knowledge graph; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained by the entity identification model, thereby determining whether the text data to be inspected is correct.
In some optional embodiments, the apparatus further comprises, in combination with the embodiment described in fig. 6: the device comprises a text obtaining unit, a word segmentation unit, a filtering unit, a combination unit, a coding matrix obtaining unit and an illegal expansion word obtaining unit;
the text obtaining unit is configured to obtain at least one piece of text data;
the word segmentation unit is configured to perform word segmentation on the at least one piece of text data to obtain a plurality of independent phrases;
the filtering unit is configured to perform filtering of deactivated independent phrases on the plurality of independent phrases so as to obtain a plurality of valid independent phrases, wherein the deactivated independent phrases at least include preset invalid independent phrases;
the combination unit is configured to perform pairwise combination on the plurality of effective independent phrases to obtain a plurality of effective combined phrases;
the coding matrix obtaining unit is configured to perform input of a plurality of preset violation basic words, a plurality of effective independent phrases and a plurality of effective combined phrases into a pre-trained bert model respectively, so as to obtain a coding matrix of each violation basic word, a coding matrix of each effective independent phrase and a coding matrix of each effective combined phrase respectively;
the illegal extended word obtaining unit is configured to execute the steps of obtaining a plurality of illegal extended words from the effective independent phrases and the effective combined phrases according to the coding matrix of each illegal basic word, the coding matrix of each effective independent phrase and the coding matrix of each effective combined phrase;
the mode one unit 300 includes: a mode one subunit;
the mode one subunit is configured to perform regular matching on the text data to be quality-checked, and determine whether the text data to be quality-checked includes an illegal word in an illegal word bank, where the illegal word in the illegal word bank includes at least the illegal base word and the illegal extension word.
With reference to the previous embodiment, in some optional embodiments, the violation extender obtaining unit includes: the encoding vector obtaining subunit, the first similarity subunit and the second similarity subunit;
the coding vector obtaining subunit is configured to perform, for any one of the coding matrices: calculating to obtain an average value of each vector value of each column of any one coding matrix, wherein each column corresponds to one average value; sorting the average value corresponding to each column according to the position of each column in any one coding matrix, so as to obtain a coding vector matched with any one coding matrix, wherein the coding vector is a row vector of 1 row and N columns, N is an integer greater than 1, and one coding vector corresponds to one coding matrix;
the first similarity subunit is configured to perform respective calculation of first similarities of each coding vector corresponding to each illegal basic word and each coding vector corresponding to each effective independent phrase, so as to obtain at least one illegal extended word from the effective independent phrases according to the first similarities;
the second similarity subunit is configured to perform respective calculation of second similarities of the respective coding vectors corresponding to the respective violation base words and the respective coding vectors corresponding to the respective valid combination phrases, so as to obtain at least one violation extension word from the valid combination phrases according to the second similarities.
In combination with the previous embodiment, in some optional embodiments, the first similarity subunit includes: a first determining subunit;
the first determining subunit is configured to determine, as the violation expansion words, the valid independent phrases whose first similarity is greater than a first preset threshold or whose first-similarity ranking is greater than a preset ordinal;
the second similarity subunit includes: a second determining subunit;
the second determining subunit is configured to determine, as the violation expansion words, the valid combined phrases whose second similarity is greater than a second preset threshold or whose second-similarity ranking is greater than a preset ordinal.
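The selection rule above might be sketched as follows. The function name and the sample scores are hypothetical, and reading "ranking greater than a preset ordinal" as "among the top `top_k` phrases by similarity" is an assumption about the intended meaning.

```python
def select_expansion_words(similarities, threshold=None, top_k=None):
    """Pick violation expansion words from candidate phrases.

    similarities maps each candidate phrase to its similarity score.
    A phrase is kept if its score exceeds `threshold`, or if it is among
    the `top_k` highest-scoring phrases (assumed reading of the ranking rule).
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    selected = set()
    if threshold is not None:
        selected.update(p for p in ranked if similarities[p] > threshold)
    if top_k is not None:
        selected.update(ranked[:top_k])
    # Return the selected phrases in descending order of similarity.
    return [p for p in ranked if p in selected]

scores = {"guaranteed payout": 0.91, "risk free": 0.84, "policy term": 0.35}
by_threshold = select_expansion_words(scores, threshold=0.8)
by_rank = select_expansion_words(scores, top_k=1)
```

The same function serves both the valid independent phrases and the valid combined phrases; only the threshold passed in differs.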
With reference to the embodiment shown in fig. 6, in some optional embodiments, when comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model to determine whether the text data to be quality-inspected is correct, the second mode unit 400 is specifically configured to perform:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model; if the two are consistent, determining that the text data to be quality-inspected is correct; otherwise, determining that the text data to be quality-inspected is incorrect.
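A minimal sketch of this consistency check is given below. The dictionary layout, the entity name "Product A" and its attributes are hypothetical, and "consistent" is assumed to mean that every attribute recognized in the text has the same value in the knowledge graph.

```python
# Hypothetical knowledge graph: entity name -> attribute name -> value.
knowledge_graph = {
    "Product A": {"waiting period": "90 days", "minimum age": "18"},
}

def is_text_correct(kg_attributes, recognized_attributes):
    """Return True when every attribute recognized in the text matches
    the value stored for the same attribute in the knowledge graph."""
    return all(
        kg_attributes.get(name) == value
        for name, value in recognized_attributes.items()
    )

# Attributes as an entity recognition model might return them (toy values).
matching = is_text_correct(knowledge_graph["Product A"],
                           {"waiting period": "90 days"})
mismatching = is_text_correct(knowledge_graph["Product A"],
                              {"waiting period": "30 days"})
```

An attribute that is recognized in the text but absent from the knowledge graph is treated as inconsistent in this sketch; a real system might instead skip it or flag it for review.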
With reference to the embodiment shown in fig. 6, in some optional embodiments, the text data to be quality-inspected includes at least advisor answer text;
when comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model to determine whether the text data to be quality-inspected is correct, the second mode unit 400 includes: a comparison subunit, a first result subunit, a second result subunit, a negative word result obtaining subunit, an odd result subunit, an even result subunit and a document data determining subunit;
the comparison subunit is configured to compare the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model and to obtain a comparison result; if the attribute information of the entity in the knowledge graph includes the attribute information obtained through the entity recognition model, the first result subunit is triggered; otherwise, the second result subunit is triggered;
the first result subunit is configured to obtain a first comparison result;
the second result subunit is configured to obtain a second comparison result;
the negative word result obtaining subunit is configured to determine the number of negative words included in the advisor answer text and to obtain a negative word result; if the number of negative words is odd, the odd result subunit is triggered; otherwise, the even result subunit is triggered;
the odd result subunit is configured to obtain an odd result;
the even result subunit is configured to obtain an even result;
the document data determining subunit is configured to determine whether the text data to be quality-inspected is correct according to the comparison result and the negative word result.
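One way the comparison result and the negative word result could be combined is sketched below. The combination rule is an assumption that this passage does not state: an odd number of negative words is taken to reverse the meaning of the answer, so the text is treated as correct when exactly one of "attribute found in the knowledge graph" and "odd number of negative words" holds. The negative word list and all names are hypothetical.

```python
import re

NEGATION_WORDS = ("not", "no", "never")  # hypothetical negative word list

def count_negations(text):
    """Count negative words in the advisor answer text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in NEGATION_WORDS for token in tokens)

def is_answer_correct(kg_values, recognized_value, answer_text):
    """Assumed rule: an odd number of negative words flips the comparison."""
    included = recognized_value in kg_values         # first vs. second comparison result
    odd = count_negations(answer_text) % 2 == 1      # odd vs. even result
    return included != odd

kg_values = {"90 days"}
plain = is_answer_correct(kg_values, "90 days", "The waiting period is 90 days.")
negated = is_answer_correct(kg_values, "90 days", "The waiting period is not 90 days.")
negated_wrong_value = is_answer_correct(kg_values, "30 days", "The waiting period is not 30 days.")
```

Counting parity rather than mere presence lets a double negation cancel itself out under the assumed rule.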
With reference to the embodiment shown in fig. 6, in some optional embodiments, the apparatus further includes: a first case subunit, a second case subunit and a third case subunit;
the first case subunit is configured to perform, when only the first mode is executed and the second mode is not executed: if the text data to be quality-inspected includes a violation word, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
the second case subunit is configured to perform, when only the second mode is executed and the first mode is not executed: if the text data to be quality-inspected is incorrect, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
the third case subunit is configured to perform, when both the first mode and the second mode are executed: if the text data to be quality-inspected includes a violation word and/or is determined to be incorrect, determining that the text data to be quality-inspected is non-compliant; and if the text data to be quality-inspected includes no violation word and is determined to be correct, determining that the text data to be quality-inspected is compliant.
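The three cases above reduce to a single decision function, sketched here with hypothetical names. `None` marks a mode that was not executed.

```python
def is_compliant(has_violation_word=None, is_correct=None):
    """Decide compliance from the results of the modes that were executed.

    has_violation_word: result of the first mode, or None if it was not run.
    is_correct:         result of the second mode, or None if it was not run.
    """
    if has_violation_word is None and is_correct is None:
        raise ValueError("at least one of the two modes must be executed")
    if has_violation_word:      # first mode ran and found a violation word
        return False
    if is_correct is False:     # second mode ran and found the text incorrect
        return False
    return True

only_first = is_compliant(has_violation_word=False)
only_second = is_compliant(is_correct=False)
both_clean = is_compliant(has_violation_word=False, is_correct=True)
both_violation = is_compliant(has_violation_word=True, is_correct=True)
```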
The present invention provides a storage medium having a program stored thereon, the program implementing any one of the above-described data quality inspection methods when executed by a processor.
As shown in fig. 7, the present invention provides an electronic device 70, wherein the electronic device 70 includes at least one processor 701, at least one memory 702 connected to the processor 701, and a bus 703; the processor 701 and the memory 702 communicate with each other through the bus 703; the processor 701 is configured to call the program instructions in the memory 702 to execute any one of the above-mentioned data quality inspection methods.
In this application, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described only briefly, and for the relevant points, reference may be made to the corresponding description of the method embodiment.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for data quality inspection, comprising:
acquiring text data to be quality-inspected;
executing at least one of the following first mode and second mode on the text data to be quality-inspected;
the first mode: performing regular-expression matching on the text data to be quality-inspected, so as to determine whether the text data to be quality-inspected includes a violation word;
the second mode: recognizing the text data to be quality-inspected through a pre-established entity recognition model, so as to obtain an entity in the text data to be quality-inspected and attribute information of the entity; querying, in a pre-established knowledge graph, the attribute information of the entity in the knowledge graph; and comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model, thereby determining whether the text data to be quality-inspected is correct.
2. The method of claim 1, further comprising:
obtaining at least one piece of text data;
performing word segmentation on the at least one piece of text data to obtain a plurality of independent phrases;
filtering the plurality of independent phrases to obtain a plurality of valid independent phrases, wherein the independent phrases that are filtered out include at least preset invalid independent phrases;
combining the valid independent phrases pairwise to obtain a plurality of valid combined phrases;
respectively inputting a plurality of preset violation base words, the plurality of valid independent phrases and the plurality of valid combined phrases into a pre-trained BERT model, so as to respectively obtain a coding matrix of each violation base word, a coding matrix of each valid independent phrase and a coding matrix of each valid combined phrase;
obtaining a plurality of violation expansion words from the valid independent phrases and the valid combined phrases according to the coding matrix of each violation base word, the coding matrix of each valid independent phrase and the coding matrix of each valid combined phrase;
wherein the performing regular-expression matching on the text data to be quality-inspected, so as to determine whether the text data to be quality-inspected includes a violation word, comprises:
performing regular-expression matching on the text data to be quality-inspected to determine whether the text data to be quality-inspected includes a violation word from a violation word lexicon, wherein the violation words in the violation word lexicon include at least the violation base words and the violation expansion words.
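The pairwise combination and the lexicon-based regular-expression matching recited in claim 2 can be illustrated with the sketch below. All names and sample words are hypothetical; joining two phrases with a space and combining them as unordered pairs are assumptions, and the sample lexicon is hand-written rather than derived from a BERT model.

```python
import re
from itertools import combinations

def combine_pairwise(phrases, joiner=" "):
    """Combine valid independent phrases two at a time (unordered pairs assumed)."""
    return [a + joiner + b for a, b in combinations(phrases, 2)]

def find_violation_words(text, lexicon):
    """Return every violation word from the lexicon that occurs in the text."""
    if not lexicon:
        return []
    # Longest words first so a longer word is preferred over its own prefix;
    # re.escape makes each word match literally.
    words = sorted(set(lexicon), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in words))
    return pattern.findall(text)

combined = combine_pairwise(["zero", "risk"])
lexicon = ["guaranteed return"] + combined   # base words plus expansion words
hits = find_violation_words(
    "This plan offers a guaranteed return with zero risk.", lexicon)
clean = find_violation_words("This plan has a 90 day waiting period.", lexicon)
```

A non-empty result corresponds to the text data including a violation word.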
3. The method according to claim 2, wherein the obtaining a plurality of violation expansion words from the valid independent phrases and the valid combined phrases according to the coding matrix of each violation base word, the coding matrix of each valid independent phrase and the coding matrix of each valid combined phrase comprises:
for each coding matrix, performing: calculating the average of the values in each column of the coding matrix, so that each column corresponds to one average value; and arranging the average values in the order of their columns in the coding matrix to obtain the coding vector matched with that coding matrix, wherein the coding vector is a row vector of 1 row and N columns, N is an integer greater than 1, and each coding vector corresponds to one coding matrix;
calculating a first similarity between the coding vector of each violation base word and the coding vector of each valid independent phrase, and obtaining at least one violation expansion word from the valid independent phrases according to the first similarities;
and calculating a second similarity between the coding vector of each violation base word and the coding vector of each valid combined phrase, and obtaining at least one violation expansion word from the valid combined phrases according to the second similarities.
4. The method according to claim 3, wherein the obtaining at least one violation expansion word from the valid independent phrases according to the first similarities comprises:
determining, as the violation expansion words, the valid independent phrases whose first similarity is greater than a first preset threshold or whose first-similarity ranking is greater than a preset ordinal;
and the obtaining at least one violation expansion word from the valid combined phrases according to the second similarities comprises:
determining, as the violation expansion words, the valid combined phrases whose second similarity is greater than a second preset threshold or whose second-similarity ranking is greater than a preset ordinal.
5. The method of claim 1, wherein the comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model to determine whether the text data to be quality-inspected is correct comprises:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model; if the two are consistent, determining that the text data to be quality-inspected is correct; otherwise, determining that the text data to be quality-inspected is incorrect.
6. The method according to claim 1, wherein the text data to be quality-inspected includes at least advisor answer text;
the comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model to determine whether the text data to be quality-inspected is correct comprises:
comparing the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model, and obtaining a comparison result; if the attribute information of the entity in the knowledge graph includes the attribute information obtained through the entity recognition model, obtaining a first comparison result; otherwise, obtaining a second comparison result;
determining the number of negative words included in the advisor answer text and obtaining a negative word result; if the number of negative words is odd, obtaining an odd result; otherwise, obtaining an even result;
and determining whether the text data to be quality-inspected is correct according to the comparison result and the negative word result.
7. The method of claim 1, further comprising:
when only the first mode is executed and the second mode is not executed: if the text data to be quality-inspected includes a violation word, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
when only the second mode is executed and the first mode is not executed: if the text data to be quality-inspected is incorrect, determining that the text data to be quality-inspected is non-compliant; otherwise, determining that the text data to be quality-inspected is compliant;
when both the first mode and the second mode are executed: if the text data to be quality-inspected includes a violation word and/or is determined to be incorrect, determining that the text data to be quality-inspected is non-compliant; and if the text data to be quality-inspected includes no violation word and is determined to be correct, determining that the text data to be quality-inspected is compliant.
8. A data quality inspection apparatus, comprising: a data obtaining unit, a trigger unit, a first mode unit and a second mode unit;
the data obtaining unit is configured to obtain text data to be quality-inspected;
the trigger unit is configured to trigger at least one of the first mode unit and the second mode unit;
the first mode unit is configured to perform regular-expression matching on the text data to be quality-inspected, so as to determine whether the text data to be quality-inspected includes a violation word;
the second mode unit is configured to recognize the text data to be quality-inspected through a pre-established entity recognition model, so as to obtain an entity in the text data to be quality-inspected and attribute information of the entity; to query, in a pre-established knowledge graph, the attribute information of the entity in the knowledge graph; and to compare the attribute information of the entity in the knowledge graph with the attribute information obtained through the entity recognition model, thereby determining whether the text data to be quality-inspected is correct.
9. A storage medium on which a program is stored, the program implementing the data quality inspection method according to any one of claims 1 to 7 when executed by a processor.
10. An electronic device comprising at least one processor, at least one memory connected to the processor, and a bus; wherein the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to perform the data quality inspection method of any one of claims 1 to 7.
CN202110448549.7A 2021-04-25 2021-04-25 Data quality inspection method and device, storage medium and electronic equipment Pending CN113128231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110448549.7A CN113128231A (en) 2021-04-25 2021-04-25 Data quality inspection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113128231A true CN113128231A (en) 2021-07-16

Family

ID=76779893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110448549.7A Pending CN113128231A (en) 2021-04-25 2021-04-25 Data quality inspection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113128231A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023088249A1 (en) * 2021-11-18 2023-05-25 华为技术有限公司 Method and apparatus for detecting compliance of data processing, and related device

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140280089A1 (en) * 2013-03-15 2014-09-18 Google Inc. Providing search results using augmented search queries
WO2014182852A1 (en) * 2013-05-07 2014-11-13 Magnet Systems, Inc. System for managing graph queries on relationships among entities using graph index
CN105574098A (en) * 2015-12-11 2016-05-11 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device and entity comparing method and device
GB201615373D0 (en) * 2015-11-11 2016-10-26 Adobe Systems Inc Structured knowledge modeling, extraction and localization from images
US9710544B1 (en) * 2016-05-19 2017-07-18 Quid, Inc. Pivoting from a graph of semantic similarity of documents to a derivative graph of relationships between entities mentioned in the documents
CN111368096A (en) * 2020-03-09 2020-07-03 中国平安人寿保险股份有限公司 Knowledge graph-based information analysis method, device, equipment and storage medium
CN111428054A (en) * 2020-04-14 2020-07-17 中国电子科技网络信息安全有限公司 Construction and storage method of knowledge graph in network space security field
CN111767410A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Construction method, device, equipment and storage medium of clinical medical knowledge map
CN111897970A (en) * 2020-07-27 2020-11-06 平安科技(深圳)有限公司 Text comparison method, device and equipment based on knowledge graph and storage medium
CN111916110A (en) * 2020-08-06 2020-11-10 龙马智芯(珠海横琴)科技有限公司 Voice quality inspection method and device
CN111984786A (en) * 2020-08-17 2020-11-24 深圳新闻网传媒股份有限公司 Intelligent whistle blowing early warning method based on news information and server
WO2021000676A1 (en) * 2019-07-03 2021-01-07 平安科技(深圳)有限公司 Q&a method, q&a device, computer equipment and storage medium
CN112464661A (en) * 2020-11-25 2021-03-09 马上消费金融股份有限公司 Model training method, voice conversation detection method and related equipment
CN113761213A (en) * 2020-06-01 2021-12-07 Tcl科技集团股份有限公司 Data query system and method based on knowledge graph and terminal equipment
CN114564571A (en) * 2022-04-21 2022-05-31 支付宝(杭州)信息技术有限公司 Graph data query method and system

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
CHENG ZHOU: "Intelligent bug fixing with software bug knowledge graph", ESEC/FSE 2018INTELLIGENT BUG FIXING WITH SOFTWARE BUG KNOWLEDGE GRAPH, 26 October 2018 (2018-10-26), pages 944, XP058699230, DOI: 10.1145/3236024.3275428 *
GAURAV MAHESHWAR 等: "Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs", INTERNATIONAL SEMANTIC WEB CONFERENCE, 17 October 2019 (2019-10-17), pages 487, XP047524278, DOI: 10.1007/978-3-030-30793-6_28 *
JINGLEI ZHANG等: "Exploiting Code Knowledge Graph for Bug Localization via Bi-directional Attention", ICSEPROCEEDINGSICPC \'20, 12 September 2020 (2020-09-12), pages 219 *
PREETI KATHIRIA 等: "Document Clustering based on Phrase and Single", INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY AND EXPLORING ENGINEERING, vol. 9, no. 3, 29 February 2020 (2020-02-29), pages 3188 - 3192 *
R WITA等: "Content-Based Filtering Recommendation in Abstract Search Using Neo4j", 2017 21ST INTERNATIONAL COMPUTER SCIENCE AND ENGINEERING CONFERENCE (ICSEC), 18 November 2017 (2017-11-18), pages 136 - 139 *
THASHEN PADAYACHY: "An Information Extraction Model Using a Graph Database to Recommend the Most Applied Case", 2018 INTERNATIONAL CONFERENCE ON COMPUTING, 7 March 2019 (2019-03-07), pages 89 - 94 *
奥渊博: "基于用户信息图谱的互联网金融虚假用户信息检测系统的设计与实现", 基于用户信息图谱的互联网金融虚假用户信息检测系统的设计与实现, no. 5, 15 May 2018 (2018-05-15), pages 1 - 70 *
李容: "基于知识图谱的手机质量检测方法研究与实现", 中国优秀硕士论文电子期刊, no. 2, 15 February 2021 (2021-02-15), pages 1 - 82 *
潘镇: "社交网络事件检测分析及应用的研究", 中国博士学位论文电子期刊, no. 1, 15 January 2021 (2021-01-15), pages 1 - 92 *
邵领: "基于知识图谱的搜索引擎技术研究与应用", 中国优秀硕士论文电子期刊, 15 February 2017 (2017-02-15), pages 1 - 76 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination