CN113901233A - Query data repairing method, system, computer equipment and storage medium

Query data repairing method, system, computer equipment and storage medium

Info

Publication number
CN113901233A
Authority
CN
China
Prior art keywords
data
missing
repaired
query
content
Prior art date
Legal status
Granted
Application number
CN202111189624.9A
Other languages
Chinese (zh)
Other versions
CN113901233B (en)
Inventor
沈玉军
李民权
徐小磊
刘建华
邢继风
Current Assignee
Zhilian Wuxi Information Technology Co ltd
Original Assignee
Zhilian Wuxi Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhilian Wuxi Information Technology Co ltd
Priority to CN202111189624.9A
Publication of CN113901233A
Application granted
Publication of CN113901233B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the technical field of data processing, and in particular relates to a query data repairing method, system, computer device and storage medium. The method comprises the following steps: acquiring the data content to be judged, and judging whether the data content is missing data; repairing the data content judged to be missing data to obtain repaired data; constructing knowledge graph data from the repaired data in real time, and eliminating repeated data stored in the database; and verifying on line whether the repair result of the repaired data meets the service application standard, and optimizing the recall and matching degree of retrieval queries with the repaired data that passes verification. By judging, repairing, constructing and verifying missing master data, the method forms a complete closed loop, repairs and supplements data in real time, aligns the relevance between the user and the matched content, recalls the most relevant data, and fundamentally achieves bidirectional improvement of query recall and matching degree.

Description

Query data repairing method, system, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a query data repairing method, a query data repairing system, computer equipment and a storage medium.
Background
With the rapid development of Internet technology, talent recruitment has gradually moved online. For online recruitment, the underlying database must accurately understand user requirements, build the user portrait, and automatically repair missing data.
Take a recruitment database as an example. In the search/recommendation scenarios of an online recruitment service, missing resume or position data limits search/recommendation recall and matching degree, making substantial improvement difficult. The reason is that users are generally accustomed to entering abbreviated content rather than standard content. For example, when searching for positions at Zhilian, a user rarely enters the company's full name "Beijing Wangpin Consulting Co., Ltd."; far more often the query is simply "Zhilian" or "Zhilian Zhaopin". Likewise, when filling in the education section of a resume, a user typically writes abbreviations such as "Beida" rather than the school's full name. This data-missing problem is ubiquitous in resumes, position descriptions, company profiles and other filled-in information. It is precisely this input habit that poses a significant challenge to the recall and matching of search/recommendation in online recruitment services.
In view of the above problems, it is desirable to provide a method, system, computer device and storage medium that repair and complete data in real time, so as to align the relevance between the user and the matched content, recall the most relevant data, and fundamentally achieve bidirectional improvement of query recall and matching degree.
Disclosure of Invention
In order to solve the prior-art problem that missing resume or position data limits search/recommendation recall and matching degree and makes substantial improvement difficult, the invention provides a query data repairing method, system, computer device and storage medium.
The invention is realized by adopting the following technical scheme:
a query data repair method, comprising:
acquiring the data content to be judged, and judging whether the data content is missing data;
repairing the data content judged to be missing data to obtain repaired data;
constructing knowledge graph data from the repaired data in real time, and eliminating repeated data stored in the database;
and verifying on line whether the repair result of the repaired data meets the service application standard, and optimizing the recall and matching degree of retrieval queries with the repaired data that passes verification.
Optionally, the method for judging whether the data content is missing data includes:
reading the data content input by the user side, wherein the data content includes query data entered into the search box and resume data filled in by the user side;
traversing the acquired data content based on user behavior data and domain knowledge graph data, and judging whether the data content is missing content;
for the missing content, judging whether it can be completed based on the constructed domain knowledge graph data;
and marking the data content that is missing and not yet completed to obtain the missing data.
Optionally, the method for repairing the data content judged to be missing data includes:
acquiring the missing data to be repaired;
repairing the missing data in real time according to user behavior data and domain knowledge graph data;
completing the repair of the missing data by mining the user behavior data related to the user's behavioral preferences;
and collecting Internet open-domain knowledge in real time through a crawler, acquiring tag data corresponding to the missing data, generating feature tag data after labeling confirmation, establishing domain knowledge graph data linking the user input data and the repaired data, and completing the repair of the missing data.
Optionally, the method for constructing knowledge graph data from the repaired data in real time and removing the repeated data stored in the database includes:
acquiring the repaired data, treating it as wide-table data, converting the wide-table data into triple data in a metadata table in triple (SPO) form, and establishing the basic SPO layer of the hierarchical architecture for storing knowledge graph data;
generating the construction pipeline according to the attributes of the triple data in the metadata table, deduplicating and normalizing the entity data of the basic layer used to construct the triples, removing invalid data, and establishing the entity-normalization layer of the hierarchical architecture for storing knowledge graph data;
and converting the triple data of the entity-normalization layer into wide-table data, mapping the attribute names and data types of the triple data to the wide table, and setting up the wide-table service application layer for storing knowledge graph data.
Further, the metadata table includes a generated entity category table, an entity attribute table, a constructed automatic-warehousing task metadata table, a record tracing table and auxiliary tables. The entity category table includes an entity category number, a category name, a level and a parent category number. The entity attribute table is used to constrain the attributes of entity data; the attributes include the basic attributes and relationship attributes of the entity data, and the table records the attribute name, the category to which the attribute belongs and whether it is multi-valued. The automatic-warehousing task metadata table describes which attributes corresponding to entity data are constructed automatically, and includes a task number, attribute name, data source, field mapping, relationship attribute constraints and whether a reverse relationship is constructed. The record tracing table records process information and detailed configuration information during data construction to facilitate data tracing, and includes a tracing id, entity type, construction time, type, data source and version number. The auxiliary tables include an attribute constraint table, a data source table, a customized wide-table conversion configuration table and the like.
Optionally, the method for verifying on line whether the repair result of the repaired data meets the service application standard includes: pushing the repaired data online, and verifying through small-traffic experiment analysis whether the data repair result meets the service application standard.
The invention also includes a query data repairing system, which adopts the above query data repairing method to repair missing data. The query data repairing system comprises a data missing judgment module, a data repair module, a data construction module and a data verification module.
The data missing judgment module is used to acquire the data content to be judged, and to judge, according to user behavior data and domain knowledge graph data, whether the data content is missing data and whether the missing data can be completed. The data repair module is used to repair the data content judged to be missing data according to user behavior data and/or Internet open-domain knowledge collected in real time by a crawler, to obtain repaired data. The data construction module is used to construct knowledge graph data storage from the repaired data in real time in a progressive, layered manner using a triple structure, and to automatically build the construction pipeline from the knowledge graph data so as to remove repeated data stored in the database. The data verification module is used to verify on line whether the repair result of the repaired data meets the service application standard, and to bidirectionally improve query recall and matching degree with the repaired data that passes verification.
The invention also includes a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the query data repairing method when executing the computer program.
The invention also comprises a storage medium storing a computer program which, when executed by a processor, carries out the steps of the query data repairing method.
The technical scheme provided by the invention has the following beneficial effects:
according to the method, the data is repaired and supplemented in real time through the data loss problem related to the search/recommendation scene in the online recruitment service, the relevance alignment of the user and the matched content is realized, the most relevant data is recalled, and the recall quantity and the matching degree of the query are improved in a two-way mode fundamentally. Through the defect judgment, repair, construction and verification of the main data, a complete closed loop is formed, all entity data of an entity data oil pipe can be checked during query, and the relationship network data of the entity data and the attributes and relationships of all the entity data can be explored.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a flowchart of a query data repairing method according to embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of resume data before and after repair in the query data repairing method according to embodiment 1 of the present invention.
Fig. 3 is a flowchart of missing-data judgment in the query data repairing method according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of missing-data repair in the query data repairing method according to embodiment 1 of the present invention.
Fig. 5 is a flowchart of constructing knowledge graph data in the query data repairing method according to embodiment 1 of the present invention.
Fig. 6 is a schematic diagram of the three branches of the construction task in the query data repairing method according to embodiment 1 of the present invention.
Fig. 7 is a schematic diagram of converting triple-structure data from SPO form into a wide table in the query data repairing method according to embodiment 1 of the present invention.
Fig. 8 is a system block diagram of a query data repairing system according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A knowledge graph is essentially a semantic network: a graph-based data structure composed of nodes (points) and edges. In a knowledge graph, each node represents an "entity" existing in the real world, and each edge is a "relationship" between entities. A knowledge graph is an efficient way of representing relationships; colloquially, it can be understood as a relational network that links together many different kinds of information, providing the ability to analyze problems from a "relationship" perspective. Suppose a knowledge graph is used to describe the fact that Zhang San is the father of Li Si. Here the entities are Zhang San and Li Si, and the relationship is "father". Of course, Zhang San and Li Si may also have relationships with other people, which are not considered here. When a telephone number is added to the knowledge graph as a node, the telephone number is also an entity, and a relationship has_phone can be defined between a person and a telephone, i.e. a certain telephone number belongs to a certain person. A time attribute can be added to the has_phone relationship to indicate when the phone number was activated; such attributes can be added not only to relationships but also to entities, and so on.
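As an illustration only (this sketch forms no part of the claimed invention; the entity names, relationship names and attribute values are invented for the example), the node/edge/attribute idea described above can be expressed roughly as follows:

```python
# Minimal sketch of a knowledge graph: entities are nodes, relationships are edges,
# and both can carry attributes (e.g. the activation time of a phone number).
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    subject: str      # S
    predicate: str    # P, e.g. "father_of" or "has_phone"
    obj: str          # O
    attributes: dict = field(default_factory=dict)

entities = {
    "Zhang San": Entity("Zhang San"),
    "Li Si": Entity("Li Si"),
    "13800000000": Entity("13800000000", {"type": "phone_number"}),
}

relations = [
    Relation("Zhang San", "father_of", "Li Si"),
    Relation("Zhang San", "has_phone", "13800000000", {"opened_at": "2020-01-01"}),
]

# Analyse the graph "from a relationship perspective": who/what is connected to Zhang San?
neighbours = [(r.predicate, r.obj) for r in relations if r.subject == "Zhang San"]
print(neighbours)   # [('father_of', 'Li Si'), ('has_phone', '13800000000')]
```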
A resume knowledge graph is a knowledge graph constructed from resume-related information. It can be a complete framework for knowledge representation and reasoning, comprising knowledge graph entities, relationships, a word forest (synonyms, hypernyms and hyponyms), vertical knowledge graphs (domain-specific graphs), a knowledge maintenance module, a machine-learning reasoning engine (hierarchical reasoning, inconsistency reasoning, knowledge-discovery reasoning and ontology-concept reasoning) and the like. On the one hand, the reasoning mechanism of the knowledge graph assists recognition during resume parsing; on the other hand, in information evaluation it supports entity location, matching-degree identification and similar functions, providing support for the final resume evaluation. In one embodiment, the resume knowledge graph may be generated using evaluated historical resumes. The evaluated historical resumes may include resumes of job seekers who successfully applied for a position as well as resumes of those who did not. An evaluated historical resume may be one that was scored as a whole, or one in which one or more pieces of resume information were scored. The resume knowledge graph at least includes relevance information of the resume information of historical resumes relative to the requirements of a position. The position requirements can be determined by the recruitment requirements and the job field and location, and may include, for example, skill requirements, academic requirements, years-of-experience requirements, industry-feature requirements and the like. The resume information may be information recorded in the resume, including, for example, a personal description, an education-experience description, a work-experience description and the like. The nodes in the resume knowledge graph and the relationships among them can be configured as required. For example, the nodes may include position nodes, resume nodes and the like: position nodes represent position requirements, and resume nodes represent information related to resumes. An edge between nodes in the resume knowledge graph indicates that the connected nodes have an association relationship. The relevance information may be information used to evaluate relevance, such as a degree of relevance, a score, or a degree of matching. For example, the attributes of the edge connecting a resume node and a position node may include a value attribute of the resume node relative to the position node; the value attribute may be embodied as a score or degree of association. In some examples, certain resume nodes also have value attributes; for example, a node indicating that a Nobel prize was awarded has a value attribute describing the value of that node. When determining the correlation between nodes, the correlation may be determined by the value attribute of the edge between the nodes, or by the value attribute of a node.
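By way of illustration only, the following sketch shows how such a value attribute on an edge (optionally combined with a node's own value attribute) might be used to rank resumes against a position; the node names, scores and combination rule are hypothetical and are not taken from the invention:

```python
# Hypothetical sketch: edges carry a value attribute (score/degree of association)
# between a resume node and a post node; some resume-related nodes carry their own value.
edges = [
    {"resume": "resume_001", "post": "algorithm_engineer", "value": 0.82},
    {"resume": "resume_002", "post": "algorithm_engineer", "value": 0.55},
]
node_value = {"resume_001": 0.10}   # e.g. an award node linked to this resume

def relevance(resume_id: str, post_id: str) -> float:
    # relevance = edge value attribute + value attribute of the resume's own nodes
    edge_value = next((e["value"] for e in edges
                       if e["resume"] == resume_id and e["post"] == post_id), 0.0)
    return edge_value + node_value.get(resume_id, 0.0)

ranked = sorted({e["resume"] for e in edges},
                key=lambda r: relevance(r, "algorithm_engineer"), reverse=True)
print(ranked)   # resumes ordered by relevance to the post
```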
There are many ways to construct the resume knowledge graph, which are not described herein. The method aims to solve the problem of how to repair data in real time when the query data is judged to be missing data and needs to be repaired.
According to the query data repairing method, system, computer device and storage medium, when the data-missing problem in search/recommendation scenarios of an online recruitment service is addressed, the master data of the recruitment service is subjected to missing-data judgment, repair, construction and verification to form a complete closed loop, and the data is repaired and supplemented in real time. The relevance between the user and the matched content is aligned, the most relevant data is recalled, and bidirectional improvement of query recall and matching degree is fundamentally achieved. The following description is given with reference to specific examples.
Example 1
As shown in fig. 1, an embodiment of the present invention provides a query data repairing method. The method is used to perform missing-data judgment and repair on data content input by a user at the user side, and comprises the following steps:
S1, acquiring the data content to be judged, and judging whether the data content is missing data.
In this embodiment, missing data mainly falls into two types. The first type is content the user actually filled in that a person can recognize but a machine has difficulty recognizing. Referring to fig. 2, schematic diagrams before and after resume data repair are shown; for example, "Beida" filled in as the school name in the original resume. With background knowledge, a person knows that "Beida" actually refers to Peking University, but the machine cannot establish the relationship between "Beida" and Peking University. The second type is content the user did not fill in but that is implied by what was filled in and requires background knowledge to read out; again taking "Beida" as an example, a person knows the candidate graduated from a "985/211" school, yet although the user did not explicitly fill this in, the machine cannot know it. Obviously, supplementing and repairing these two types of missing data is crucial to the recall and matching degree of position matching.
To solve the above two types of missing data problems, referring to fig. 3, the method for judging whether the data content is missing data includes:
S101, reading the data content input by the user side, wherein the data content includes query data entered into the search box and resume data filled in by the user side.
S102, traversing the acquired data content based on user behavior data and domain knowledge graph data, and judging whether the data content is missing content.
S103, for the missing content, judging whether it can be completed based on the constructed domain knowledge graph data.
S104, marking the data content that is missing and not yet completed to obtain the missing data.
In this embodiment, whether the content input by the user is missing data and whether it can be completed are computed in real time based on user behavior data and domain knowledge graph data. The behavior data mainly includes: on the job-seeker side, the query content entered into the search box and the job-click, job-view and job-application data; on the recruiter side, the query content entered by the HR responsible for recruitment, resume work experience, resume education experience and other data. For example, when a user enters the query "Zhilian" and the subsequent job clicks, job views and job applications all concentrate on positions of "Beijing Wangpin Consulting Co., Ltd.", the "Zhilian" entered by the user can be judged to be missing (abbreviated) content. Based on the constructed domain knowledge graph data, it can then be judged whether this missing content can be completed.
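Purely as an illustrative sketch (the function names, alias table and decision rule below are assumptions for the example, not the invention's implementation), the judgment described above might look roughly like this:

```python
# Hypothetical sketch: flag a query as "missing (abbreviated) content" when the user's
# follow-up behaviour (job clicks, views, applications) concentrates on a company whose
# standard name does not literally contain the query text, then check whether the
# domain knowledge graph can already complete it via an alias entry.

# alias -> standard entity name, as could be stored in the domain knowledge graph
KG_ALIASES = {"Zhilian": "Beijing Wangpin Consulting Co., Ltd."}

def is_missing_content(query: str, behavior_companies: list) -> bool:
    """Treat the query as abbreviated/missing if the companies the user subsequently
    interacted with do not literally contain the query text."""
    return any(query not in company for company in behavior_companies)

def can_complete(query: str):
    """Return the standard name if the knowledge graph already covers the alias."""
    return KG_ALIASES.get(query)

query = "Zhilian"
behavior = ["Beijing Wangpin Consulting Co., Ltd."] * 3   # clicks, views, applications

if is_missing_content(query, behavior):
    completed = can_complete(query)
    if completed is None:
        print(f"mark '{query}' as missing data to be repaired")
    else:
        print(f"'{query}' is already completable -> '{completed}'")
```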
And S2, repairing the data content judged as missing data to obtain repaired data.
In this embodiment, referring to fig. 4, a method for repairing data content determined as missing data includes:
S201, acquiring the missing data to be repaired;
S202, repairing the missing data in real time according to user behavior data and domain knowledge graph data;
S203, completing the repair of the missing data by mining the user behavior data related to the user's behavioral preferences;
S204, collecting Internet open-domain knowledge in real time through a crawler, acquiring tag data corresponding to the missing data, generating feature tag data after labeling confirmation, establishing domain knowledge graph data linking the user input data and the repaired data, and completing the repair of the missing data.
In an embodiment, data repair is performed in real time on data content that has been judged to be missing data and needs to be repaired. Data repair mainly provides two approaches. The first is repair based on user behavior data, which includes search queries, job clicks, job views, job-application data, resume work experience, resume education experience and other data. Correlation between the search term and the behavioral preference is calculated to complete the repair of the missing data. For example, a user who searches for "Zhilian Zhaopin" may directly view positions of "Beijing Wangpin Consulting Co., Ltd.", so a relationship is established between the entities "Zhilian Zhaopin" and "Beijing Wangpin Consulting Co., Ltd.".
The second approach uses a crawler to collect Internet open-domain knowledge in real time, obtaining, for example, the relationship data between "Beida" and "Peking University" and the tag data "Peking University" → "985, 211". After manual labeling confirmation, feature-tag production processing is carried out to generate the tag data, and a relationship is established between the user input data and the repaired data.
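A minimal sketch of these two repair paths follows; the helper names, support threshold and sample data are assumptions for illustration only and do not describe the invention's actual implementation:

```python
# Hypothetical sketch: (1) correlate a search term with the entities the same users
# preferred, and (2) attach only those crawled tags that passed manual confirmation.
from collections import Counter

def repair_from_behavior(query: str, behavior_log: list, min_support: int = 2):
    """behavior_log: (query, company clicked/viewed/applied-to) pairs."""
    companies = Counter(c for q, c in behavior_log if q == query)
    best, support = companies.most_common(1)[0] if companies else (None, 0)
    return (query, "alias_of", best) if support >= min_support else None

def repair_from_open_domain(entity: str, crawled_tags: dict, confirmed: set):
    """Keep only crawled tags that passed manual labeling confirmation."""
    return [(entity, "has_tag", t) for t in crawled_tags.get(entity, []) if t in confirmed]

log = [("Zhilian Zhaopin", "Beijing Wangpin Consulting Co., Ltd.")] * 3
print(repair_from_behavior("Zhilian Zhaopin", log))
# ('Zhilian Zhaopin', 'alias_of', 'Beijing Wangpin Consulting Co., Ltd.')

crawled = {"Peking University": ["985", "211", "unconfirmed_tag"]}
print(repair_from_open_domain("Peking University", crawled, confirmed={"985", "211"}))
# [('Peking University', 'has_tag', '985'), ('Peking University', 'has_tag', '211')]
```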
And S3, constructing knowledge graph data from the repaired data in real time, and removing the repeated data stored in the database.
In this embodiment, the repaired data is constructed and stored in the database in real time through the data construction module. To improve the missing-data repair and service capability, the invention constructs and warehouses the repaired data in triple (SPO) form. This structure preserves rich semantics among the data, is friendlier to graph databases, is particularly flexible for data-repair and processing tasks, and is better suited to multi-source, heterogeneous missing-data repair and to service application scenarios. To ensure the quality of data storage and the consistency of the data called by upper-layer services, the invention manages the attribute-storage process uniformly through metadata.
The adopted scheme includes: 1) metadata configuration; 2) a knowledge graph metadata management platform; 3) progressive layering to build the knowledge graph data storage; 4) an automatic construction pipeline for knowledge graph data; 5) standardized deduplication; and so on.
Referring to fig. 5, the method for constructing the knowledge-graph data from the repaired data in real time includes:
S301, acquiring the repaired data, treating it as wide-table data, converting the wide-table data into triple data in a metadata table in triple (SPO) form, and establishing the basic SPO layer of the hierarchical architecture for storing knowledge graph data (an illustrative sketch of this conversion follows these steps);
S302, generating the construction pipeline according to the attributes of the triple data in the metadata table, deduplicating and normalizing the entity data of the basic layer used to construct the triples, removing invalid data, and establishing the entity-normalization layer of the hierarchical architecture for storing knowledge graph data;
S303, converting the triple data of the entity-normalization layer into wide-table data, mapping the attribute names and data types of the triple data to the wide table, and setting up the wide-table service application layer for storing knowledge graph data.
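The following is a minimal sketch of the wide-table-to-SPO conversion of step S301, under assumed column names and a simple illustrative tracing-id scheme (neither is prescribed by the invention):

```python
# Hypothetical sketch of step S301: a wide-table row is flattened into (S, P, O) triples
# for the basic SPO layer, with a record-tracing id attached for lineage.
import uuid

def widerow_to_spo(row: dict, key_column: str, trace_prefix: str = "trk"):
    subject = row[key_column]
    track_id = f"{trace_prefix}-{uuid.uuid4().hex[:8]}"   # tracing id for the record tracing table
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue
        values = value if isinstance(value, list) else [value]   # multi-value attributes
        triples.extend((subject, column, v, track_id) for v in values)
    return triples

row = {
    "company_name": "Beijing Wangpin Consulting Co., Ltd.",
    "alias": ["Zhilian Zhaopin", "Zhilian"],
    "registered_city": "Beijing",
}
for t in widerow_to_spo(row, key_column="company_name"):
    print(t)
```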
In this embodiment, the metadata table includes a generated entity category table, an entity attribute table, a constructed automatic-warehousing task metadata table, a record tracing table and auxiliary tables. The entity category table includes an entity category number, a category name, a level and a parent category number. The entity attribute table is used to constrain the attributes of entity data; the attributes include the basic attributes and relationship attributes of the entity data, and the table records the attribute name, the category to which the attribute belongs and whether it is multi-valued. The automatic-warehousing task metadata table describes which attributes corresponding to entity data are constructed automatically, and includes a task number, attribute name, data source, field mapping, relationship attribute constraints and whether a reverse relationship is constructed. The record tracing table records process information and detailed configuration information during data construction to facilitate data tracing, and includes a tracing id, entity type, construction time, type, data source and version number. The auxiliary tables include an attribute constraint table, a data source table, a customized wide-table conversion configuration table and the like.
The invention builds a unified knowledge graph metadata management platform. A data modeler presets the various entity and attribute metadata in advance; the controllable content includes attribute name, Chinese meaning, attribute description, edge type, single/multiple value, owning category, data type, source identification, rule constraints and so on. Before data is warehoused, the metadata must be configured uniformly on the metadata management platform; during warehousing, the program reads the metadata for validation to ensure that the stored data meets the data standard. For example, the "985, 211" tag field of "Peking University" is a multi-value type; once it is configured as such, the system can accurately identify it and automatically construct the multi-value array format, which can then be applied to recall and matching.
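As an illustration of the metadata-driven validation described above (the attribute entries, field names and rules here are assumptions for the example, not the platform's actual schema):

```python
# Hypothetical sketch: attribute metadata (single/multi-value, data type, allowed source)
# is configured centrally, and every incoming value is checked against it before warehousing.
ATTRIBUTE_METADATA = {
    "tags":  {"multi_value": True,  "data_type": list, "sources": {"crawler", "manual"}},
    "alias": {"multi_value": True,  "data_type": list, "sources": {"behavior", "manual"}},
    "level": {"multi_value": False, "data_type": str,  "sources": {"manual"}},
}

def validate(attribute: str, value, source: str) -> list:
    meta = ATTRIBUTE_METADATA.get(attribute)
    if meta is None:
        return [f"attribute '{attribute}' is not configured on the metadata platform"]
    errors = []
    if not isinstance(value, meta["data_type"]):
        errors.append(f"'{attribute}' expects {meta['data_type'].__name__}")
    if meta["multi_value"] and not isinstance(value, list):
        errors.append(f"'{attribute}' is multi-valued and must be an array")
    if source not in meta["sources"]:
        errors.append(f"source '{source}' is not allowed for '{attribute}'")
    return errors

# e.g. the 985/211 tag field of "Peking University" must arrive as a multi-value array
print(validate("tags", ["985", "211"], source="crawler"))   # [] -> passes
print(validate("tags", "985", source="crawler"))             # rejected: not an array
```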
In the construction of knowledge graph data, because most entities, relationships, attributes and other data come from multiple source channels, repeated warehousing of the same data is unavoidable; for example, the alias attribute "Beida" of the "Peking University" entity may be written many times. Such repeated data not only occupies storage space but also degrades the practical effect of business applications. To avoid this problem while preserving both warehousing efficiency and high data availability, the invention makes full use of the layered nature of the warehousing pipeline to judge the reliability of repeated data and to deduplicate it at the upper layer, and also eliminates repeated data using the duplicate-value deduplication capability provided by the database.
In one embodiment of the invention, entity data is abstracted into metadata tables, and unified specification and constraint management is applied to the data. This specifically includes:
1.1) generating an entity category table, which mainly includes an entity category number, a category name, a level and a parent category number.
1.2) generating an entity attribute table, which constrains which attributes (basic attributes and relationship attributes) an entity has, and mainly includes the attribute name, the category to which the attribute belongs, whether it is multi-valued, and so on.
1.3) constructing an automatic-warehousing task metadata table, which describes which attributes of which entity are constructed automatically, and mainly includes a task number, attribute name, data source, field mapping, relationship attribute constraints and whether a reverse relationship is constructed.
1.4) constructing a data-construction record tracing table, which records process information and detailed configuration information during data construction to facilitate data tracing, and mainly includes a trackID (tracing id), entity type, construction time, type, data source, version number, and so on.
1.5) other auxiliary tables for improving data quality, such as an attribute constraint table, a kg_source data source table, a customized wide-table conversion configuration table, and so on.
In this embodiment, generating the pipeline according to the metadata attributes specifically includes entity category management, entity attribute management and data source management. When the repaired data is constructed into knowledge graph data in real time, as shown in fig. 6, the overall construction task is divided into three branches: triple (SPO) structure storage, data normalization and deduplication, and SPO-to-wide-table conversion. Specifically:
(1) The purpose of triple-structure storage is to support dynamic, diverse changes of attribute types and to support graph-computation queries. Dynamic addition of arbitrary attributes is supported. For example, the entity "Beijing Wangpin Consulting Co., Ltd." can add the alias attribute "Zhilian Zhaopin" or the former-name attribute "Beijing Zhilian Sanke Talent Service Co., Ltd.", and such additions can be made dynamically without modifying the table structure.
(2) The data normalization and deduplication task ensures the consistency and reliability of the data and avoids ambiguity when data services are provided to front-end business. For example, "Beijing Wangpin Consulting Co., Ltd." and "Beijing Zhilian Sanke Talent Service Co., Ltd." are in fact different names of the same company; when providing services externally, the unified, normalized data is used, and "Beijing Wangpin Consulting Co., Ltd." is adopted uniformly. Data normalization and deduplication are realized by judging the similarity of two entities through a data normalization model: if the similarity reaches a certain threshold, the two are taken to be the same entity and entity normalization is performed. In the company-entity similarity model, the selected features include company name, company registered address, company legal representative, company equity relationships and the like (an illustrative similarity sketch is given after item (3) below).
(3) The SPO-to-wide-table task addresses analysis and mining scenarios for triple-structure data by automatically converting the provided data format. Referring to fig. 7, the triple structure has the advantage of being convenient and flexible to construct, but the disadvantage of being unsuited to data analysis and mining in hive, mysql and other non-graph databases, because a large number of table joins would be required; development cost is high and execution efficiency is low. The invention therefore implements an SPO-to-wide-table service that, based on the metadata, automatically converts data tables in SPO structure into a wide-table structure and serves data analysis and mining.
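For item (2), a hypothetical sketch of the company-entity similarity and normalization idea follows; the features match those listed above, but the weights and threshold are illustrative assumptions rather than the values used by the invention:

```python
# Hypothetical sketch: score two company records on name/address/legal representative/
# shareholder overlap and merge them when the score passes a threshold.
def company_similarity(a: dict, b: dict) -> float:
    def jaccard(x: set, y: set) -> float:
        return len(x & y) / len(x | y) if (x or y) else 0.0
    score = 0.0
    score += 0.4 * jaccard(set(a["name"]), set(b["name"]))            # character overlap
    score += 0.2 * (a["address"] == b["address"])
    score += 0.2 * (a["legal_rep"] == b["legal_rep"])
    score += 0.2 * jaccard(set(a["shareholders"]), set(b["shareholders"]))
    return score

def normalize(a: dict, b: dict, threshold: float = 0.7):
    """If the two records are judged to be the same entity, keep one standard record."""
    if company_similarity(a, b) >= threshold:
        return {**b, **a}          # the standard (first) record wins on conflicts
    return None

rec1 = {"name": "Beijing Wangpin Consulting Co., Ltd.", "address": "Haidian, Beijing",
        "legal_rep": "X", "shareholders": {"A", "B"}}
rec2 = {"name": "Beijing Zhilian Sanke Talent Service Co., Ltd.", "address": "Haidian, Beijing",
        "legal_rep": "X", "shareholders": {"A", "B"}}
print(company_similarity(rec1, rec2), normalize(rec1, rec2) is not None)
```

For item (3), a minimal sketch of the SPO-to-wide-table conversion, assuming a small metadata entry that marks which predicates are multi-valued:

```python
# Hypothetical sketch: triples sharing a subject are pivoted into one wide-table row;
# the metadata decides whether a predicate maps to a scalar column or a multi-value array.
from collections import defaultdict

MULTI_VALUE = {"alias", "tags"}   # from the attribute metadata table (illustrative)

def spo_to_wide(triples: list) -> list:
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in MULTI_VALUE:
            rows[s].setdefault(p, []).append(o)
        else:
            rows[s][p] = o
    return [{"entity": s, **attrs} for s, attrs in rows.items()]

triples = [
    ("Beijing Wangpin Consulting Co., Ltd.", "alias", "Zhilian Zhaopin"),
    ("Beijing Wangpin Consulting Co., Ltd.", "alias", "Zhilian"),
    ("Beijing Wangpin Consulting Co., Ltd.", "registered_city", "Beijing"),
]
for row in spo_to_wide(triples):
    print(row)
```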
In one embodiment of the invention, the hierarchical architecture of the knowledge graph data storage comprises three layers: the first layer is the basic SPO layer, the second layer is the entity-normalization layer, and the third layer is the wide-table service application layer.
The basic SPO layer holds the basic data of the knowledge graph from all data sources, i.e. data that has not yet been normalized to entities. It is mainly responsible for converting wide-table data into triple data, configuring entity attribute relationships, generating a traceable trackID, automatically establishing reciprocal (inverse) relationships and controlling reliable data-source attributes; the data is stored in the dm_garph layer.
The entity-normalization layer deduplicates and normalizes the entity data of the basic layer. It mainly performs data normalization and data sorting, single-value/multi-value deduplication, cleaning of invalid data and control of data sources; the data is stored in the dmr_garph layer.
The wide-table service application layer provides a one-stop service for data consumers and is convenient for users who are not familiar with SPO. It mainly realizes the mapping of attribute names and data types to the wide table and configuration-driven construction; the data is stored in the dma_garph layer.
And S4, verifying whether the repair result of the repaired data meets the service application standard on line, and optimizing the recall and matching degree of the retrieval query by using the repaired data passing the verification.
In this embodiment, the method for verifying on line whether the repair result of the repaired data meets the service application standard includes: pushing the repaired data online, and verifying through small-traffic experiment analysis whether the data repair result meets the service application standard.
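Purely as an illustration of the small-traffic verification (the traffic split, metric, sample counts and significance level below are assumptions, not the invention's procedure), a two-proportion z-test over a control slice and an experiment slice might look like this:

```python
# Hypothetical sketch: compare the match/click rate of queries served with repaired data
# against a control slice and only roll out when the improvement is statistically significant.
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se > 0 else 0.0

# control slice (without repaired data) vs. experiment slice (with repaired data)
z = two_proportion_z(success_a=420, n_a=10_000, success_b=505, n_b=10_000)
print(f"z = {z:.2f}, significant at the 95% level: {z > 1.96}")
```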
Finally, retrieval carried out on the platform constructed by this method is optimized, and queries achieve bidirectional improvement of recall and matching degree. When retrieving on this platform, a user can click to view a company's detailed information; click to view an entity related to the company, enter that entity's detail page, and then further view all entities related to it so as to explore an entity's relationship-network data; and click into the entity detail page to see the entity's attributes and relationships.
Example 2
As shown in fig. 8, an embodiment of the present invention provides a query data repairing system, which includes a data missing judgment module 11, a data repair module 12, a data construction module 13 and a data verification module 14.
The data missing judgment module 11 is configured to acquire the data content to be judged, and to judge, according to user behavior data and domain knowledge graph data, whether the data content is missing data and whether the missing data can be completed. In this embodiment, the data missing judgment module 11 handles the two types of missing-data problems described above: content the user actually filled in that a person can recognize but a machine has difficulty recognizing; and content the user did not fill in but that is implied by what was filled in and requires background knowledge to read out.
The data missing judgment module 11 can read the data content input by the user side, including both types of missing data, traverse the acquired data content based on user behavior data and domain knowledge graph data, and judge whether the data content is missing content. For the missing content, it judges whether the content can be completed based on the constructed domain knowledge graph data, and marks the data content that is missing and not yet completed to obtain the missing data.
The data repair module 12 is configured to repair the data content judged to be missing data according to user behavior data and/or Internet open-domain knowledge collected in real time by a crawler, to obtain repaired data. In this embodiment, when the data repair module 12 performs repair, it acquires the missing data to be repaired and repairs it in real time according to user behavior data and domain knowledge graph data. Repair takes two forms: one is based on user behavior data and completes the repair of the missing data by mining user behavior data related to the user's behavioral preferences; the other is based on domain knowledge graph data, collecting Internet open-domain knowledge in real time through a crawler, acquiring tag data corresponding to the missing data, generating feature tag data after labeling confirmation, establishing domain knowledge graph data linking the user input data and the repaired data, and completing the repair of the missing data.
For example, the behavior data includes search queries, job clicks, job views, job-application data, resume work experience, resume education experience and other data. Correlation between the search term and the behavioral preference is calculated to complete the repair of the missing data. For example, a user who searches for "Zhilian Zhaopin" may directly view positions of "Beijing Wangpin Consulting Co., Ltd.", so a relationship is established between the entities "Zhilian Zhaopin" and "Beijing Wangpin Consulting Co., Ltd.".
When repair is based on domain knowledge graph data, for example, the relationship data between "Beida" and "Peking University" and the tag data "Peking University" → "985, 211" are acquired. After manual labeling confirmation, feature-tag production processing is carried out to generate the tag data, and a relationship is established between the user input data and the repaired data.
The data construction module 13 is configured to construct knowledge graph data storage from the repaired data in a progressive, layered manner using a triple structure, and to automatically build the construction pipeline from the knowledge graph data so as to remove repeated data stored in the database. During knowledge graph construction, because most entities, relationships, attributes and other data come from multiple source channels and are therefore warehoused repeatedly, the data construction module 13 uses the layered nature of the warehousing pipeline to judge the reliability of repeated data before eliminating it, and also eliminates repeated data using the duplicate-value deduplication capability provided by the database.
The data construction module 13 is further configured to abstract entity data into metadata tables and to apply unified specification and constraint management to the data. The metadata tables include a generated entity category table, an entity attribute table, a constructed automatic-warehousing task metadata table, a record tracing table and auxiliary tables. The pipeline is generated according to the metadata attributes to realize entity category management, entity attribute management and data source management. The overall construction task is carried out through triple (SPO) structure storage, data normalization and deduplication, and SPO-to-wide-table conversion.
The data verification module 14 is configured to verify on line whether the repair result of the repaired data meets the service application standard, and to bidirectionally improve query recall and matching degree with the repaired data that passes verification. In this embodiment, the data verification module 14 pushes the repaired data content online, and a small-traffic A/B experiment significance analysis is performed on line to verify the data repair result. Finally, retrieval through the platform constructed by this system is optimized, and queries achieve bidirectional improvement of recall and matching degree.
Example 3
In an embodiment of the present invention, a computer device is provided, which may be used to implement the query data repairing method provided in the above embodiments, and the computer device may be a smart phone, a computer, a tablet computer, or the like.
The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps in the above method embodiment when executing the computer program:
acquiring the data content to be judged, and judging whether the data content is missing data;
repairing the data content judged to be missing data to obtain repaired data;
constructing knowledge graph data from the repaired data in real time, and eliminating repeated data stored in the database;
and verifying on line whether the repair result of the repaired data meets the service application standard, and optimizing the recall and matching degree of retrieval queries with the repaired data that passes verification.
Example 4
An embodiment of the present invention further provides a storage medium having a computer program stored thereon, which when executed by a processor implements the steps in the above-mentioned method embodiments:
acquiring the data content to be judged, and judging whether the data content is missing data;
repairing the data content judged to be missing data to obtain repaired data;
constructing knowledge graph data from the repaired data in real time, and eliminating repeated data stored in the database;
and verifying on line whether the repair result of the repaired data meets the service application standard, and optimizing the recall and matching degree of retrieval queries with the repaired data that passes verification.
The method for restoring query data based on knowledge graph provided by this embodiment may be implemented by software, or by a combination of software and hardware, or by hardware, where the related hardware may be composed of two or more physical entities, or may be composed of one physical entity. The method of the embodiment can be applied to electronic equipment with processing capability. The electronic device may be a PC, a tablet computer, a notebook computer, a desktop computer, or the like.
It should be noted that, for the query data repairing method described in the present application, it can be understood by a person skilled in the art that all or part of the processes for implementing the query data repairing method described in the embodiments of the present application may be implemented by controlling related hardware through a computer program, where the computer program may be stored in a computer readable storage medium, such as a memory of a computer device, and executed by at least one processor in the computer device, and during the execution process, the processes of the embodiments of the query data repairing method may be included.
Accordingly, the present specification further provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed by a processor, the method for restoring query data based on a knowledge graph is implemented.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
For the query data recovery apparatus in the embodiment of the present application, each functional module may be integrated in one processing chip, or each module may exist alone physically, or two or more modules are integrated in one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
In conclusion, by addressing the data-missing problem in search/recommendation scenarios of online recruitment services, the data is repaired and supplemented in real time, the relevance between the user and the matched content is aligned, the most relevant data is recalled, and query recall and matching degree are fundamentally improved in both directions. Through the missing-data judgment, repair, construction and verification of master data, a complete closed loop is formed; at query time, all entity data related to a given entity can be viewed, and the entity's relationship-network data as well as the attributes and relationships of each related entity can be explored.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A query data repairing method, characterized by comprising the following steps:
S1, acquiring the data content to be judged, and judging whether the data content is missing data;
S2, repairing the data content judged to be missing data to obtain repaired data;
S3, constructing knowledge graph data from the repaired data in real time, and eliminating repeated data stored in the database;
and S4, verifying on line whether the repair result of the repaired data meets the service application standard, and optimizing the recall and matching degree of retrieval queries with the repaired data that passes verification.
2. The query data repairing method according to claim 1, wherein, in step S1, judging whether the data content is missing data includes:
S101: reading the data content input by the user side, wherein the data content includes query data entered into the search box and resume data filled in by the user side;
S102: traversing the acquired data content based on user behavior data and domain knowledge graph data, and judging whether the data content is missing content;
S103: for the missing content, judging whether it can be completed based on the constructed domain knowledge graph data;
S104: and marking the data content that is missing and not yet completed to obtain the missing data.
3. The query data repairing method according to claim 2, wherein, in step S2, repairing the data content judged to be missing data includes:
S201: acquiring the missing data to be repaired;
S202: repairing the missing data in real time according to user behavior data and domain knowledge graph data;
S203: completing the repair of the missing data by mining the user behavior data related to the user's behavioral preferences;
S204: and collecting Internet open-domain knowledge in real time through a crawler, acquiring tag data corresponding to the missing data, generating feature tag data after labeling confirmation, establishing domain knowledge graph data linking the user input data and the repaired data, and completing the repair of the missing data.
4. The query data repairing method according to claim 3, wherein the method for constructing knowledge graph data from the repaired data in real time and eliminating repeated data stored in the database comprises the following steps:
S301: acquiring the repaired data, treating it as wide-table data, converting the wide-table data into triple data in a metadata table in triple (SPO) form, and establishing the basic SPO layer of the hierarchical architecture for storing knowledge graph data;
S302: generating the construction pipeline according to the attributes of the triple data in the metadata table, deduplicating and normalizing the entity data of the basic layer used to construct the triples, removing invalid data, and establishing the entity-normalization layer of the hierarchical architecture for storing knowledge graph data;
S303: and converting the triple data of the entity-normalization layer into wide-table data, mapping the attribute names and data types of the triple data to the wide table, and setting up the wide-table service application layer for storing knowledge graph data.
5. The query data repairing method according to claim 4, wherein the metadata table includes a generated entity category table, an entity attribute table, a constructed automatic-warehousing task metadata table, a record tracing table and auxiliary tables.
6. The query data repairing method according to claim 5, wherein the entity category table includes an entity category number, a category name, a level and a parent category number;
the entity attribute table is used to constrain the attributes of entity data, the attributes including the basic attributes and relationship attributes of the entity data, and the entity attribute table includes the attribute name, the category to which the attribute belongs and whether it is multi-valued;
the automatic-warehousing task metadata table is used to describe which attributes corresponding to entity data are constructed automatically, and includes a task number, attribute name, data source, field mapping, relationship attribute constraints and whether a reverse relationship is constructed;
the record tracing table is used to record process information and detailed configuration information during data construction to facilitate data tracing, and includes a tracing id, entity type, construction time, type, data source and version number;
and the auxiliary tables include an attribute constraint table, a data source table and a customized wide-table conversion configuration table.
7. The query data repairing method according to claim 1, wherein the method for verifying on line whether the repair result of the repaired data meets the service application standard comprises: pushing the repaired data online, and verifying through small-traffic experiment analysis whether the data repair result meets the service application standard.
8. A query data repairing system, wherein the query data repairing system adopts the query data repairing method of any one of claims 1 to 7 to realize missing data repair; the query data repairing system comprises:
a data missing judgment module, configured to acquire the data content to be judged, and to judge, according to the user behavior data and the domain knowledge graph data, whether the content data is missing data and whether the missing data has been completed;
a data repairing module, configured to repair the data content judged to be missing data according to the user behavior data and/or the Internet open-domain knowledge collected by the crawler in real time, to obtain repaired data;
a data construction module, configured to construct knowledge graph data storage for the repaired data in real time in a triple structure with progressive layering, and to automatically construct links according to the knowledge graph data so as to eliminate the repeated data stored in the database; and
a data verification module, configured to verify online whether the repair result of the repaired data meets the service application standard, and to use the verified repaired data to improve both the recall amount and the matching degree of the query.
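Read as software components, the four modules of claim 8 form a simple pipeline. The class and method names below are invented for illustration and only hint at how the modules could be composed; the patent does not prescribe this structure.

```python
# Illustrative composition of the four claimed modules; all names are hypothetical.

class QueryDataRepairSystem:
    def __init__(self, missing_judge, repairer, graph_builder, verifier):
        self.missing_judge = missing_judge    # data missing judgment module
        self.repairer = repairer              # data repairing module
        self.graph_builder = graph_builder    # data construction module
        self.verifier = verifier              # data verification module

    def process(self, content):
        """Run one piece of query data content through the repair pipeline."""
        if not self.missing_judge.is_missing(content):
            return content
        repaired = self.repairer.repair(content)      # behavior data and/or crawled knowledge
        self.graph_builder.build(repaired)            # triples, layered storage, dedup links
        if self.verifier.verify_online(repaired):     # small-traffic verification
            return repaired
        return content
```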
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202111189624.9A 2021-10-13 2021-10-13 Query data restoration method, system, computer equipment and storage medium Active CN113901233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111189624.9A CN113901233B (en) 2021-10-13 2021-10-13 Query data restoration method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111189624.9A CN113901233B (en) 2021-10-13 2021-10-13 Query data restoration method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113901233A (en) 2022-01-07
CN113901233B (en) 2023-11-17

Family

ID=79191884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111189624.9A Active CN113901233B (en) 2021-10-13 2021-10-13 Query data restoration method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113901233B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392074A1 (en) * 2018-06-21 2019-12-26 LeapAnalysis Inc. Scalable capturing, modeling and reasoning over complex types of data for high level analysis applications
CN109271530A (en) * 2018-10-17 2019-01-25 长沙瀚云信息科技有限公司 A kind of disease knowledge map construction method and plateform system, equipment, storage medium
CN109657238A (en) * 2018-12-10 2019-04-19 宁波深擎信息科技有限公司 Context identification complementing method, system, terminal and the medium of knowledge based map
CN109766445A (en) * 2018-12-13 2019-05-17 平安科技(深圳)有限公司 A kind of knowledge mapping construction method and data processing equipment
CN110019150A (en) * 2019-04-11 2019-07-16 软通动力信息技术有限公司 A kind of data administering method, system and electronic equipment
CN111723215A (en) * 2020-06-19 2020-09-29 国家计算机网络与信息安全管理中心 Device and method for establishing biotechnological information knowledge graph based on text mining

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290561A (en) * 2023-11-27 2023-12-26 北京衡石科技有限公司 Service state information feedback method, device, equipment and computer readable medium
CN117290561B (en) * 2023-11-27 2024-03-29 北京衡石科技有限公司 Service state information feedback method, device, equipment and computer readable medium

Also Published As

Publication number Publication date
CN113901233B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
Evans et al. A holistic view of the knowledge life cycle: the knowledge management cycle (KMC) model
CN109446344B (en) Intelligent analysis report automatic generation system based on big data
Regli et al. Managing digital libraries for computer-aided design
WO2020160264A1 (en) Systems and methods for organizing and finding data
Grolinger et al. Knowledge as a service framework for disaster data management
US20150095303A1 (en) Knowledge Graph Generator Enabled by Diagonal Search
US10089390B2 (en) System and method to extract models from semi-structured documents
CN112463980A (en) Intelligent plan recommendation method based on knowledge graph
CN102930024A (en) A data quality solution architecture based on knowledge
CN102982097A (en) Domains for knowledge-based data quality solution
CA2659743A1 (en) Primenet data management system
Varga et al. Dimensional enrichment of statistical linked open data
Athanasiou et al. Big POI data integration with Linked Data technologies.
Ferri et al. KRC: KnowInG crowdsourcing platform supporting creativity and innovation
Moraitou et al. Semantic models and services for conservation and restoration of cultural heritage: A comprehensive survey
CN113901233A (en) Query data repairing method, system, computer equipment and storage medium
Baldo et al. A framework for selecting performance indicators for virtual organisation partners’ search and selection
Schwade et al. A semantic data lake for harmonizing data from cross-platform digital workspaces using ontology-based data access
CN112199488B (en) Incremental knowledge graph entity extraction method and system for power customer service question and answer
Siguenza Guzman et al. Design of an integrated decision support system for library holistic evaluation
Moalla et al. Integration of a multidimensional schema from different social media to analyze customers' opinions
Gujral et al. Knowledge Graphs: Connecting Information over the Semantic Web
Lee The combination of knowledge management and data mining with knowledge warehouse
Howard et al. Understanding and Characterizing Engineering Research Data for its Better Management
Ali Knowledge Graph-based Conceptual Models Search.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 214000 room 706, 7 / F, building 8 (Wuxi talent financial port), east of Hongxing Duhui, economic development zone, Wuxi City, Jiangsu Province

Applicant after: Zhilian Wangpin Information Technology Co.,Ltd.

Address before: 214000 room 706, 7 / F, building 8 (Wuxi talent financial port), east of Hongxing Duhui, Wuxi Economic Development Zone, Wuxi City, Jiangsu Province

Applicant before: Zhilian (Wuxi) Information Technology Co.,Ltd.

GR01 Patent grant