CN115391480A - Data processing method and device, computer equipment and storage medium

Info

Publication number: CN115391480A
Application number: CN202110563612.1A
Authority: CN
Inventors: 铁瑞雪
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2022-11-25

Abstract

An embodiment of the invention discloses a data processing method, a data processing apparatus, a computer device, and a storage medium. The method comprises: acquiring a target data set, wherein the target data set comprises target full-name data and M pieces of abbreviation data corresponding to the target full-name data, M being a positive integer; identifying the data types of the M pieces of abbreviation data, wherein the data types include a suspicious type, and abbreviation data of the suspicious type is abbreviation data that cannot represent the semantics of the target full-name data; if abbreviation data of the suspicious type exists among the M pieces of abbreviation data, performing data parsing on the target full-name data to generate new abbreviation data; and, in the target data set, replacing the abbreviation data of the suspicious type among the M pieces of abbreviation data with the new abbreviation data. The effectiveness of data cleaning can thereby be improved.

Description

Data processing method, data processing device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.

Background

With the continuous development of computer technology, the amount of data stored in computer devices keeps growing. For example, when a computer device stores full-name data together with the corresponding abbreviation data, a large amount of full-name and abbreviation data accumulates, and to make this data convenient for subsequent business processing, data cleaning often needs to be performed on it. The data cleaning methods currently in use are denoising, deduplication, and the like, but these methods do not effectively improve subsequent services, so how to improve the effectiveness of data cleaning on full-name and abbreviation data has become a current research focus.

Disclosure of Invention

The embodiment of the invention provides a data processing method and device, computer equipment and a storage medium, which can improve the effectiveness of data cleaning.

In one aspect, an embodiment of the present invention provides a data processing method, including:

acquiring a target data set, wherein the target data set comprises target full-name data and M pieces of abbreviation data corresponding to the target full-name data, and M is a positive integer;

identifying the data types of the M pieces of abbreviation data, wherein the data types include a suspicious type, and abbreviation data of the suspicious type is abbreviation data that cannot represent the semantics of the target full-name data;

if abbreviation data of the suspicious type exists among the M pieces of abbreviation data, performing data parsing on the target full-name data to generate new abbreviation data;

and, in the target data set, replacing the abbreviation data of the suspicious type among the M pieces of abbreviation data with the new abbreviation data.

In another aspect, an embodiment of the present invention provides a data processing apparatus, including:

an acquisition unit, configured to acquire a target data set, wherein the target data set comprises target full-name data and M pieces of abbreviation data corresponding to the target full-name data, and M is a positive integer;

an identification unit, configured to identify the data types of the M pieces of abbreviation data, wherein the data types include a suspicious type, and abbreviation data of the suspicious type is abbreviation data that cannot represent the semantics of the target full-name data;

a parsing unit, configured to perform data parsing on the target full-name data to generate new abbreviation data if abbreviation data of the suspicious type exists among the M pieces of abbreviation data;

and a replacing unit, configured to replace, in the target data set, the abbreviation data of the suspicious type among the M pieces of abbreviation data with the new abbreviation data.

In still another aspect, an embodiment of the present invention provides a computer device, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is used to store a computer program that supports the computer device to execute the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the following steps:

if abbreviation data of the suspicious type exists among the M pieces of abbreviation data, performing data parsing on the target full-name data to generate new abbreviation data;

In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, in which program instructions are stored, and when the program instructions are executed by a processor, the program instructions are used for executing the data processing method according to the first aspect.

In the embodiment of the present invention, after the computer device determines the target full-name data and the M pieces of abbreviation data corresponding to the target full-name data from the target data set, the computer device may identify the data type of each of the M pieces of abbreviation data, so as to screen out abbreviation data of the suspicious type, that is, abbreviation data that cannot express the semantics of the target full-name data. Screening out the suspicious-type abbreviation data reduces the subsequent data processing load on the computer device and improves its data processing efficiency. After screening out the suspicious-type abbreviation data, the computer device can parse the target full-name data to generate new abbreviation data and use the newly generated abbreviation data to replace the original suspicious-type abbreviation data in the target data set. The computer device can thus effectively improve how accurately each piece of abbreviation data in the target data set expresses the semantics of its corresponding full-name data, which in turn benefits the accuracy of downstream tasks that use the target data set.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art may obtain other drawings from them without creative effort.

FIG. 1 is a diagram of a data search system according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart diagram of a data processing method provided by an embodiment of the invention;

FIG. 3 is a schematic diagram of data parsing on full-name data according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart diagram of a data processing method provided by an embodiment of the invention;

FIG. 5 is a schematic diagram of a crawler verification method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of generating new abbreviation data according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a data processing method according to an embodiment of the present invention;

FIG. 8 is a schematic block diagram of a data processing apparatus provided by an embodiment of the present invention;

FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present invention.

Detailed Description

An embodiment of the invention provides a data processing method. When a computer device performs data cleaning on a target data set and determines that abbreviation data of a suspicious type exists among the M pieces of abbreviation data of target full-name data, the computer device can generate new abbreviation data based on the target full-name data and replace the suspicious-type abbreviation data with the newly generated abbreviation data, so that the abbreviation data obtained by the replacement can express the semantics of the target full-name data. In this way, when cleaning the target data set, the computer device can both delete abbreviation data that cannot express the semantics of the corresponding full-name data and supplement the abbreviation data in the target data set, which improves the effectiveness of data cleaning on the target data set and significantly increases the amount of abbreviation data in the target data set. In an embodiment, the target data set includes at least one piece of abbreviation data and at least one piece of full-name data; the abbreviation data and the full-name data in the target data set may be of any type, and any two pieces of full-name data (or abbreviation data) in the target data set may be of the same type or of different types. It can be understood that full-name data is the formal name of an entity such as an organization, and abbreviation data consists of some of the entity words extracted from the full-name data and serves as a summary of the full-name data. For example, the full-name data may be "XXX Limited Company", and the corresponding abbreviation data may be "XXX" or "XXX Company".

In one embodiment, after the computer device obtains the target data set, the computer device may perform data cleaning on the abbreviation data and the full-name data in the target data set, where the data cleaning includes denoising, deduplication, and the like. Specifically, the computer device may remove illegal characters, data that does not satisfy length constraints, and the like from the target data set, and may also delete noise characters in the target data set. After the data cleaning, the computer device may further perform a trusted evaluation on the full-name data and the abbreviation data in the target data set in combination with expert experience, so that a set of full-name and abbreviation data with low credibility is screened out of the target data set according to the tags obtained from the evaluation. The computer device may then perform a crawler check on the low-credibility data set and output the trusted abbreviations and the untrusted abbreviations. After determining the trusted and untrusted abbreviations in the target data set, the computer device may perform text parsing on the full-name data in the target data set and generate new abbreviation data according to the text parsing result; the untrusted abbreviation data may then be removed and replaced or supplemented with the corresponding trusted new abbreviation data to obtain a new target data set. This can effectively increase the amount of full-name and abbreviation data in the target data set and improve the accuracy of the abbreviation data in the target data set. In one embodiment, the computer device may use a Named Entity Recognition (NER) algorithm, or another entity word recognition algorithm, to perform the text parsing on the full-name data.

After the computer device updates the full-name data and the corresponding abbreviation data in the target data set to obtain a new target data set, the computer device may make recommendations for search information based on the new target data set. Specifically, as shown in FIG. 1, after the computer device 10 obtains the new target data set, it may obtain search information from the terminal device 11 and search the new target data set for full-name data and/or abbreviation data matching the search information. If the computer device 10 finds abbreviation data matching the search information in the new target data set, it may obtain the content data related to that abbreviation data and the content data related to the full-name data corresponding to that abbreviation data, and feed the obtained content data back to the terminal device 11 so that it can be displayed on the terminal device 11. If the computer device 10 finds full-name data matching the search information in the new target data set, it obtains the corresponding content data based on the full-name data and its corresponding abbreviation data, and feeds that content data back to the terminal device 11 for display. Because the content data is fed back based on the new target data set, the accuracy of the fed-back content data can be effectively improved, which improves user satisfaction during data search. In one embodiment, the computer device may parse the semantics of the search information and then search for full-name data (or abbreviation data) matching those semantics, or the computer device may extract a keyword from the search information and search for full-name data (or abbreviation data) matching the keyword.
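The keyword-matching variant of this lookup can be sketched as follows. This is a minimal illustration only, assuming the new target data set is held as a mapping from each full name to its set of abbreviations; the function names `build_index` and `search` and the exact-match strategy are hypothetical and are not specified by the embodiment.

```python
from typing import Dict, List, Set


def build_index(full_to_abbrs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map every name (full or abbreviated) to the full names it refers to."""
    index: Dict[str, Set[str]] = {}
    for full_name, abbrs in full_to_abbrs.items():
        # A full name always refers to itself.
        index.setdefault(full_name, set()).add(full_name)
        for abbr in abbrs:
            # One abbreviation may point at several full names.
            index.setdefault(abbr, set()).add(full_name)
    return index


def search(keyword: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return the full names matched by an extracted search keyword."""
    return sorted(index.get(keyword, set()))


dataset = {"XXX Limited Liability Company": {"XXX", "XXX Company"}}
index = build_index(dataset)
matches = search("XXX", index)
```

Content data would then be fetched for each matched full name and its abbreviations; that retrieval step is outside the scope of this sketch.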

For a detailed description of the process of generating new abbreviation data for full-name data in the target data set, refer to FIG. 2, which is a schematic flowchart of a data processing method according to an embodiment of the present invention. This embodiment mainly describes the process of generating new abbreviation data for target full-name data in the target data set, where the target full-name data is any full-name data in the target data set. As shown in FIG. 2, the method may include:

S201, acquiring a target data set, wherein the target data set comprises target full-name data and M pieces of abbreviation data corresponding to the target full-name data, and M is a positive integer.

S202, identifying the data types of the M pieces of abbreviation data, wherein the data types include a suspicious type, and abbreviation data of the suspicious type is abbreviation data that cannot represent the semantics of the target full-name data.

The target data set obtained by the computer device includes one or more pieces of full-name data, and each piece of full-name data corresponds to one or more pieces of abbreviation data. Similarly, each piece of abbreviation data in the target data set corresponds to one or more pieces of full-name data. For example, for the full-name data "XXX Limited Liability Company", the corresponding abbreviation data includes "XXX Company" or "XXX", while the full-name data corresponding to the abbreviation data "XXX" may include not only "XXX Limited Liability Company" but also "XXX Limited Liability Company, Shenzhen Branch". It can be understood that, although the correspondences between full-name data and abbreviation data are recorded in the target data set, a service error may occur when the computer device performs service processing with the target data set, because of the many-to-many relationship between the abbreviation data and the full-name data. The computer device therefore needs to adjust the abbreviation data in the target data set to obtain a new target data set, so as to ensure accuracy when service processing is performed based on the new target data set.

In one embodiment, when adjusting the abbreviation data in the target data set, the computer device may first perform a data cleaning operation on the full-name data and the abbreviation data in the target data set. Specifically, the computer device may perform deduplication, cleansing, denoising, and the like on the full-name data (or abbreviation data), where the cleansing includes removing redundant spliced data or redundant English characters, and the denoising may include removing illegal characters contained in the full-name data (or abbreviation data), removing data whose length does not satisfy a constraint, and the like. Data whose length does not satisfy the constraint, for example data that is 2 or 3 characters long, has no specific business semantics in most cases, so it may be deleted in the data cleaning phase. After performing data cleaning on the target data set, the computer device may select any full-name data from the target data set as the target full-name data, and determine the M pieces of abbreviation data corresponding to the target full-name data from the target data set.
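The cleaning operations described above might be sketched as follows. This is an illustrative sketch under stated assumptions: the minimum length of 4 (chosen so that the 2- and 3-character examples are dropped), the definition of "illegal characters" as anything other than word characters, whitespace, hyphens, and parentheses, and the function name `clean_names` are all hypothetical and not taken from the embodiment.

```python
import re
from typing import Iterable, List

MIN_LENGTH = 4  # assumption: names of 2 or 3 characters violate the constraint
# Assumption: anything outside word characters, whitespace, "-", "(" and ")"
# is treated as an illegal character.
ILLEGAL_CHARS = re.compile(r"[^\w\s\-()]")


def clean_names(names: Iterable[str], min_length: int = MIN_LENGTH) -> List[str]:
    """Denoise, length-filter, and deduplicate names, preserving input order."""
    seen = set()
    cleaned: List[str] = []
    for name in names:
        text = ILLEGAL_CHARS.sub("", name)  # denoise: strip illegal characters
        text = " ".join(text.split())       # collapse redundant whitespace
        if len(text) < min_length:          # drop data violating the length constraint
            continue
        if text in seen:                    # deduplicate
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


result = clean_names(["XXX Company", "XXX  Company!", "ab", "abc"])
```

Here the second entry collapses into a duplicate of the first once the stray character and extra space are removed, and the two short entries are dropped by the length filter.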

After the computer device obtains the target full-name data and the corresponding M pieces of abbreviation data, it may identify the data types of the M pieces of abbreviation data, so as to determine, from the M pieces of abbreviation data, the abbreviation data of the suspicious type that cannot express the semantics of the target full-name data. In a specific implementation, when identifying the data types of the M pieces of abbreviation data, the computer device may perform semantic analysis on each of the M pieces of abbreviation data to determine its semantics, and may perform semantic analysis on the target full-name data to determine the semantics of the target full-name data. The computer device may then match the semantics of each piece of abbreviation data against the semantics of the target full-name data and, according to the matching result, find among the M pieces of abbreviation data the abbreviation data that cannot express the semantics of the target full-name data (or cannot completely express those semantics, or is semantically ambiguous). The abbreviation data found in this way is the abbreviation data of the suspicious type.

In an embodiment, after the computer device obtains the target full-name data and the corresponding M pieces of abbreviation data, the computer device may further perform a trusted evaluation on the M pieces of abbreviation data to delete untrusted abbreviation data from them. When performing the trusted evaluation, the computer device may evaluate each piece of abbreviation data based on its data attributes, for example by determining its data length. After the trusted evaluation, the computer device may perform a correlation detection between each piece of abbreviation data that passed the trusted evaluation and the target full-name data, determine from the detection result whether each piece of detected abbreviation data can express the semantics of the target full-name data, and thereby determine the data type of each piece of abbreviation data. After determining the data type of each piece of abbreviation data corresponding to the target full-name data, the computer device may adjust the abbreviation data of the suspicious type in the target data set, so that the abbreviation data corresponding to the target full-name data in the target data set can express the semantics of the target full-name data.

S203, if abbreviation data of the suspicious type exists among the M pieces of abbreviation data, performing data parsing on the target full-name data to generate new abbreviation data.

In an embodiment, after determining the data type of each of the M pieces of abbreviation data corresponding to the target full-name data, the computer device may further perform data parsing on the target full-name data and generate new abbreviation data for the target full-name data based on the parsing result. In one implementation, the computer device may perform semantic analysis on the target full-name data and generate the new abbreviation data from the semantic analysis result. For example, suppose the target full-name data is "Shanghai XX E-commerce Co., Ltd." and the computer device determines that abbreviation data of the suspicious type exists among the M pieces of abbreviation data corresponding to it. The computer device may then perform semantic analysis on "Shanghai XX E-commerce Co., Ltd."; if the semantic analysis result is "Shanghai XX", then "Shanghai XX Company" may be used as new abbreviation data corresponding to the target full-name data.

In another implementation, when performing data parsing on the target full-name data to generate the corresponding new abbreviation data, the computer device may invoke a sequence labeling model to parse the target full-name data, so as to determine the semantics of each entity word in the target full-name data and the role that the entity word plays in the target full-name data. Based on this parsing, the computer device may extract one or more entity words from the target full-name data and combine them in multiple ways to obtain the new abbreviation data corresponding to the target full-name data. In one embodiment, the sequence labeling model may have, for example, a two-layer BiLSTM + CRF structure or a three-layer BERT + LSTM + CRF structure, and the extraction and combination of entity words may be based on prior knowledge or learned through model training. For example, if the target full-name data determined by the computer device is "Shanghai XX E-commerce Co., Ltd.", the role of each entity word may be determined by invoking the sequence labeling model. As shown in FIG. 3, the role of the entity word "Shanghai" is place name, the role of the entity word "XX" is keyword, the role of the entity word "E-commerce" is industry, and the role of the entity word "Co., Ltd." is general information. Based on this parsing of the target full-name data, the computer device may extract one or more entity words from the target full-name data to form the new abbreviation data, for example "Shanghai XX Company", "XX Co., Ltd.", and the like.
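The combination step can be illustrated with a short sketch. It assumes the sequence labeling model has already produced (entity word, role) pairs; the model itself is not shown. The role names, the combination templates in `TEMPLATES`, and the function `generate_abbreviations` are hypothetical stand-ins for the "prior knowledge" mentioned above, not the templates used by the embodiment.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: combination templates expressed as ordered role tuples.
TEMPLATES: List[Tuple[str, ...]] = [
    ("keyword",),
    ("place", "keyword"),
    ("keyword", "industry"),
    ("keyword", "general"),
    ("place", "keyword", "general"),
]


def generate_abbreviations(
    tagged: Sequence[Tuple[str, str]],
    templates: Sequence[Tuple[str, ...]] = TEMPLATES,
) -> List[str]:
    """Combine role-tagged entity words into candidate abbreviations."""
    by_role: Dict[str, str] = {}
    for word, role in tagged:
        by_role.setdefault(role, word)  # keep the first entity word per role
    full_name = " ".join(word for word, _ in tagged)

    candidates: List[str] = []
    for template in templates:
        if all(role in by_role for role in template):
            candidate = " ".join(by_role[role] for role in template)
            # Skip the full name itself and any repeated candidate.
            if candidate != full_name and candidate not in candidates:
                candidates.append(candidate)
    return candidates


# Hypothetical model output for "Shanghai XX E-commerce Co., Ltd."
tagged_name = [
    ("Shanghai", "place"),
    ("XX", "keyword"),
    ("E-commerce", "industry"),
    ("Co., Ltd.", "general"),
]
candidates = generate_abbreviations(tagged_name)
```

With these assumed templates, the candidates include "Shanghai XX" and "XX Co., Ltd."; a trained model could instead learn which combinations are acceptable.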

After the computer device generates the new abbreviation data by parsing the target full-name data, the computer device may update the target data set with the new abbreviation data, that is, proceed to step S204.

S204, in the target data set, replacing the abbreviation data of the suspicious type among the M pieces of abbreviation data with the new abbreviation data.

When the computer device updates the target data set with the new abbreviation data, it may replace the corresponding abbreviation data of the suspicious type with the new abbreviation data. In one embodiment, replacing the original suspicious-type abbreviation data with the new abbreviation data means: in the target data set, cancelling the data association between the suspicious-type abbreviation data and the target full-name data, adding the generated new abbreviation data to the target data set, and associating the added new abbreviation data with the target full-name data. It should be noted that the computer device may generate one or more pieces of new abbreviation data from the data parsing of the target full-name data, and the number of pieces of new abbreviation data may be greater than, equal to, or smaller than the number of pieces of suspicious-type abbreviation data, which is not limited in the embodiment of the present invention.
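The replacement semantics described above (cancel the old association, then add and associate the new data) can be sketched as follows, assuming the target data set is stored as a mapping from each full name to its set of abbreviations. The storage layout and the function name `replace_suspicious` are illustrative assumptions.

```python
from typing import Dict, Iterable, Set


def replace_suspicious(
    dataset: Dict[str, Set[str]],
    full_name: str,
    suspicious: Iterable[str],
    new_abbrs: Iterable[str],
) -> None:
    """Unlink suspicious abbreviations from a full name and link the new ones."""
    abbrs = dataset.setdefault(full_name, set())
    abbrs.difference_update(suspicious)  # cancel the old data associations
    abbrs.update(new_abbrs)              # add and associate the new abbreviations


name = "Shanghai XX E-commerce Co., Ltd."
dataset = {name: {"abc", "XX"}}
replace_suspicious(dataset, name, ["abc"], ["Shanghai XX", "XX E-commerce"])
```

Because the two collections are handled independently, the number of new abbreviations need not equal the number of suspicious ones, consistent with the note above.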

Referring to fig. 4, which is a schematic flow chart of a data processing method according to an embodiment of the present invention, as shown in fig. 4, the method may include:

S401, acquiring a target data set, wherein the target data set comprises target full-name data and M pieces of abbreviation data corresponding to the target full-name data, and M is a positive integer.

S402, identifying the data types of the M pieces of abbreviation data, wherein the data types include a suspicious type, and abbreviation data of the suspicious type is abbreviation data that cannot represent the semantics of the target full-name data.

In one embodiment, the target data set acquired by the computer device includes one or more pieces of full-name data and the abbreviation data associated with each piece of full-name data. In the target data set, each piece of full-name data may be associated with one or more pieces of abbreviation data, and each piece of abbreviation data may be associated with one or more pieces of full-name data. After acquiring the target full-name data and the M pieces of abbreviation data associated with it, the computer device may identify the data type of each of the M pieces of abbreviation data. To do so, the computer device may first acquire the data attributes of each of the M pieces of abbreviation data, the data attributes of the target full-name data, and the data association between the target full-name data and each piece of abbreviation data. The computer device may then perform a trusted evaluation on each piece of abbreviation data according to the data attributes of that abbreviation data, the data attributes of the target full-name data, and the data association, and determine the data types of the M pieces of abbreviation data according to the trusted evaluation result.

The data attributes of abbreviation data (or of the target full-name data) are data that describe the abbreviation data (or the target full-name data) qualitatively and/or quantitatively; for example, the data attributes may include the data length, the content contained in the data, and the like. When performing the trusted evaluation on each piece of abbreviation data based on its data attributes, the data attributes of the target full-name data, and the data association, the computer device may add an evaluation tag to the corresponding abbreviation data based on one or more evaluation modes, where the tag indicates whether the abbreviation data is considered trusted according to the trusted evaluation. In a specific implementation, when the computer device determines, according to the data attributes of any piece of abbreviation data, that its data length (for example, 2 or 3) is smaller than a length threshold, the computer device may add a suspicious tag to that abbreviation data. The suspicious tag is an evaluation tag added on the basis of the trusted evaluation, and it indicates that the abbreviation data is provisionally regarded as untrusted after the trusted evaluation. In addition, when performing the trusted evaluation, the computer device may determine, according to the data attributes of any piece of abbreviation data, the data attributes of the target full-name data, and the data association, whether that abbreviation data and the target full-name data have an intersection, and add a trusted tag to the abbreviation data when an intersection exists. It should be understood that the trusted tag indicates that the abbreviation data is provisionally regarded as trusted after the trusted evaluation.

In one embodiment, when performing the trusted evaluation and adding the corresponding evaluation tag, the computer device may further determine whether any piece of abbreviation data and the target full-name data are in one-to-one correspondence, add a trusted tag to the abbreviation data when they are, and add a suspicious tag to the abbreviation data when they are not. The embodiment of the present invention does not limit the evaluation modes, or the number of evaluation modes, that the computer device adopts for the trusted evaluation; that is, the computer device may use one or more of the three modes above (data length, intersection, and one-to-one correspondence). When the computer device has multiple modes for the trusted evaluation, it may order the evaluation modes and then apply them in sequence to obtain the evaluation result for the abbreviation data. For example, if the abbreviation data is "abc" and the target full-name data is "Shanghai XX E-commerce Co., Ltd.", there is no intersection between the abbreviation data "abc" and the target full-name data, so the abbreviation data "abc" is determined to be untrusted and a suspicious tag is added to it.
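A rule-based version of the trusted evaluation might look like the sketch below, which applies the three modes in a fixed order. Several points are assumptions made for illustration: the length threshold of 4, the use of whitespace-separated word overlap as the "intersection" test (for Chinese names a character-level overlap would be the more natural choice), the policy that the first failing rule yields a suspicious tag, and the name `evaluate`.

```python
from typing import Dict, Set

LENGTH_THRESHOLD = 4  # assumption: lengths of 2 or 3 fall below the threshold


def _tokens(text: str) -> Set[str]:
    """Lower-cased whitespace tokens, used here as the unit of intersection."""
    return set(text.lower().split())


def evaluate(
    abbr: str,
    full_name: str,
    abbr_to_fulls: Dict[str, Set[str]],
    length_threshold: int = LENGTH_THRESHOLD,
) -> str:
    """Apply the three evaluation modes in order and return an evaluation tag."""
    # Mode 1: data length below the threshold.
    if len(abbr) < length_threshold:
        return "suspicious"
    # Mode 2: no intersection between the abbreviation and the full name.
    if not (_tokens(abbr) & _tokens(full_name)):
        return "suspicious"
    # Mode 3: the abbreviation is not in one-to-one correspondence with the full name.
    if abbr_to_fulls.get(abbr, set()) != {full_name}:
        return "suspicious"
    return "trusted"


full = "Shanghai XX E-commerce Co., Ltd."
links = {
    "Shanghai XX": {full},
    "XX E-commerce": {full, "Beijing XX E-commerce Co., Ltd."},
}
tags = {abbr: evaluate(abbr, full, links) for abbr in ["abc", "Shanghai XX", "XX E-commerce"]}
```

Under these assumptions "abc" and "XX E-commerce" receive the suspicious tag and "Shanghai XX" receives the trusted tag; other orderings or combination policies are equally consistent with the description.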

After the trusted evaluation, the computer device may determine the data types of the M pieces of abbreviation data according to the trusted evaluation result. Specifically, the computer device may screen out, from the M pieces of abbreviation data, the abbreviation data to which a trusted tag has been added; the data type of such abbreviation data is the trusted type, and abbreviation data of the trusted type is abbreviation data that can represent the semantics of the target full-name data. The computer device may also determine, from the M pieces of abbreviation data, the abbreviation data to which a suspicious tag has been added, and perform semantic parsing on that abbreviation data and the target full-name data to obtain a semantic parsing result. According to the semantic parsing result, the computer device may determine the semantic relevance between the semantics of the abbreviation data with the suspicious tag and the semantics of the target full-name data, and determine the data type of that abbreviation data according to the semantic relevance. That is, after performing the trusted evaluation, the computer device may further analyze the abbreviation data evaluated as suspicious and determine its data type from the result of this secondary analysis, which effectively improves the reliability of the data types determined for the abbreviation data.

When the computer device performs semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data, it can use that abbreviation data as a first crawler keyword and the target full-name data as a second crawler keyword, perform a crawler search according to the two keywords to obtain a crawler search result, and use the crawler search result as the semantic analysis result. Then, when determining the semantic relevance according to the semantic analysis result, the computer device may determine that the semantics of the abbreviation data to which the suspicious tag has been added are associated with the semantics of the target full-name data if the crawler search result indicates that the first crawler keyword and the second crawler keyword appear together in the search result, and may determine that the two are irrelevant if the crawler search result indicates that the first crawler keyword and the second crawler keyword do not appear together in the search result.
For example, if the target full-name data determined by the computer device is "Shanghai XX E-commerce Limited Company", and the abbreviation data to which the suspicious tag was added by the above trusted evaluation includes "abc", the computer device may perform a crawler search using "abc" as the first crawler keyword and "Shanghai XX E-commerce Limited Company" as the second crawler keyword. If the search result obtained from the two crawler keywords is as shown in fig. 5, the computer device may determine that the data type of the abbreviation data "abc", which was marked as untrusted, is the suspicious type, because "abc" and "Shanghai XX E-commerce Limited Company" do not appear together in the search result.
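The co-occurrence test on the crawler search result might be sketched as follows. The crawler itself is outside the scope of this sketch, so it is represented by an injected `search` callable that returns text snippets; `co_occur`, `resolve_type`, `fake_search` and the sample corpus are all hypothetical, and plain substring matching is an assumption rather than something the source specifies.

```python
from typing import Callable, Iterable, List

# A search function takes a query string and returns result snippets.
Search = Callable[[str], Iterable[str]]


def co_occur(abbr: str, full_name: str, search: Search) -> bool:
    """Return True if any result snippet contains both crawler keywords."""
    first_kw, second_kw = abbr.lower(), full_name.lower()
    query = f"{abbr} {full_name}"
    return any(
        first_kw in snippet.lower() and second_kw in snippet.lower()
        for snippet in search(query)
    )


def resolve_type(abbr: str, full_name: str, search: Search) -> str:
    """Second-pass data type for an abbreviation tagged as suspicious."""
    return "trusted" if co_occur(abbr, full_name, search) else "suspicious"


# Stand-in for a real crawler: always returns the same canned snippets.
CORPUS: List[str] = [
    "Shanghai XX E-commerce Limited Company, known as XX E-commerce, "
    "opened a new warehouse.",
    "abc is a common placeholder string.",
]


def fake_search(query: str) -> List[str]:
    return CORPUS


FULL_NAME = "Shanghai XX E-commerce Limited Company"
print(resolve_type("abc", FULL_NAME, fake_search))
print(resolve_type("XX E-commerce", FULL_NAME, fake_search))
```

Injecting `search` keeps the decision rule separate from how results are fetched, so a real crawler or search API could be plugged in later.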

Determining the data type of each abbreviation data through the multiple analyses described above can effectively improve the accuracy of the determined data types. After determining the data type of each abbreviation data, if abbreviation data of the suspicious type exists for the target full-name data, the computer device can perform data analysis on the target full-name data so as to replace the suspicious-type abbreviation data, that is, proceed to step S403.

S403, if abbreviation data whose data type is the suspicious type exists in the M abbreviation data, performing data analysis on the target full-name data to generate new abbreviation data.

S404, in the target data set, replacing the abbreviation data with the data type of the suspicious type in the M abbreviation data with new abbreviation data.

In an embodiment, the target full-name data includes one or more entity words. When performing data analysis on the target full-name data to generate new abbreviation data, the computer device may analyze the target full-name data to determine the entity role of each entity word in it, combine any one or more entity words according to their entity roles, and use the combined entity words as the new abbreviation data. The computer device may analyze the target full-name data by using a sequence tagging model, where the sequence tagging model is a trained model for recognizing the roles of entity words; the process of analyzing the target full-name data with the sequence tagging model may be as shown in fig. 3. After determining the entity role of each entity word in the target full-name data, the computer device may combine any one or more entity words to obtain new abbreviation data, as shown in fig. 6; for example, the computer device may combine entity words according to the role of each entity word and the dependency relationship between the entity word roles.

In one embodiment, if the target full-name data is "Shanghai XX E-commerce Limited", new abbreviation data can be obtained by recombining the entity words "Shanghai", "XX", "E-commerce" and "Limited" based on the dependency relationship between their roles; the new abbreviation data may be "Shanghai XX", "XX E-commerce", and so on. After the computer device generates the new abbreviation data, the suspicious-type abbreviation data in the target data set can be replaced with the new abbreviation data, so that the reliability of each abbreviation data in the target data set is improved.
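The recombination of entity words by role can be sketched as below. The sketch assumes the role tagging has already been done (the sequence tagging model is not reproduced here); the role labels `LOC`, `NAME`, `IND`, `SUFFIX` and the combination patterns in `PATTERNS` are hypothetical stand-ins for the roles and dependency relationships shown in the figures.

```python
from typing import Dict, List, Tuple

# (entity word, role) pairs, as a sequence tagging model might output them.
Tagged = List[Tuple[str, str]]

# Hypothetical role combinations allowed to form an abbreviation.
PATTERNS: List[Tuple[str, ...]] = [("LOC", "NAME"), ("NAME", "IND")]


def generate_abbreviations(tagged: Tagged) -> List[str]:
    """Combine entity words according to their roles."""
    by_role: Dict[str, str] = {}
    for word, role in tagged:
        by_role.setdefault(role, word)  # keep the first word seen per role
    return [
        " ".join(by_role[role] for role in pattern)
        for pattern in PATTERNS
        if all(role in by_role for role in pattern)
    ]


tagged_name: Tagged = [
    ("Shanghai", "LOC"),
    ("XX", "NAME"),
    ("E-commerce", "IND"),
    ("Limited", "SUFFIX"),
]
print(generate_abbreviations(tagged_name))
# ['Shanghai XX', 'XX E-commerce']
```

A pattern is skipped when any of its roles is missing from the full name, so a name without a location role would yield only "XX E-commerce" under these patterns.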

S405, screening out trusted-type abbreviation data from the M abbreviation data according to the data type of each abbreviation data.

S406, if the trusted-type abbreviation data corresponds to at least two full-name data, screening out common full-name data and non-common full-name data from the at least two full-name data.

S407, setting a first weight value for the common full-name data, and setting a second weight value for the non-common full-name data, wherein the priority of the first weight value is higher than that of the second weight value.

In steps S405 to S407, based on the data type determined for each abbreviation data, the computer device may further screen out trusted-type abbreviation data from the M abbreviation data of the target full-name data. If one trusted-type abbreviation data corresponds to at least two full-name data, the computer device screens out common full-name data and non-common full-name data from those at least two full-name data. The common full-name data corresponding to trusted-type abbreviation data refers to the full-name data that the abbreviation data usually refers to in users' daily expression. For example, in daily expression the abbreviation data "PC" usually refers to a computer rather than a mobile phone; so if the full-name data corresponding to the abbreviation data "PC" in the target data set includes both the computer and the mobile phone, and the computer device determines that "PC" is trusted, the common full-name data determined for "PC" is the computer and the non-common full-name data is the mobile phone. After obtaining the common full-name data and the non-common full-name data through screening, the computer device can set a first weight value for the common full-name data and a second weight value for the non-common full-name data, where the priority of the first weight value is higher than that of the second weight value.
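A minimal sketch of the weighting in steps S406 and S407 follows, using the "PC" example. The function names and the numeric weights 1.0 and 0.5 are hypothetical; the source only requires that the first weight value have higher priority than the second.

```python
from typing import Dict, Iterable, List, Set


def assign_weights(
    full_names: Iterable[str],
    common: Set[str],
    first_weight: float = 1.0,   # hypothetical value for common full names
    second_weight: float = 0.5,  # hypothetical value for non-common full names
) -> Dict[str, float]:
    """Give common full names the first weight and the rest the second."""
    if first_weight <= second_weight:
        raise ValueError("first weight must have higher priority than second")
    return {
        name: first_weight if name in common else second_weight
        for name in full_names
    }


def rank_by_priority(weights: Dict[str, float]) -> List[str]:
    """Order full names from highest to lowest weight."""
    return sorted(weights, key=weights.__getitem__, reverse=True)


# Full-name data that the trusted abbreviation "PC" maps to in the data set.
pc_weights = assign_weights(["mobile phone", "computer"], common={"computer"})
print(pc_weights)
print(rank_by_priority(pc_weights))
```

A downstream task can then consult the ranked list and prefer the first entry when resolving the abbreviation.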

In one embodiment, the common data dictionary maintained by the computer device for the target data set may be maintained based on whether the full-name data corresponds to a leading enterprise in the related field. That is, if the full-name data corresponding to trusted-type abbreviation data includes the company name of company A and the company name of company B, and, in the internet industry field, company A is a leading enterprise and company B is not, then when screening the common and non-common full-name data corresponding to that abbreviation data, the computer device may use the company name of company A as the common full-name data and the company name of company B as the non-common full-name data.

By screening common full-name data and non-common full-name data, the computer device maintains a common data dictionary for the target data set, which can then be applied to downstream tasks. When the downstream task is a search task, the computer device can obtain target search information that includes trusted-type reference abbreviation data, and then obtain, as search result data, the data related to the reference abbreviation data and to the common full-name data corresponding to the reference abbreviation data, so that the search result data can be displayed in the terminal device. Because the search feedback is based on the common full-name data, the feedback result can better meet the requirements of users, the data processing pressure of the computer device can be effectively reduced, and the data processing efficiency of the computer device is improved.

In one embodiment, after determining the data types of the abbreviation data, the computer device may further screen out trusted-type abbreviation data from the M abbreviation data according to the data type of each of the M abbreviation data, and determine one or more other full-name data corresponding to the trusted-type abbreviation data. The computer device can then perform data analysis on the target full-name data and the one or more other full-name data to generate recommendation abbreviation data, and add the recommendation abbreviation data to the target full-name data and the one or more other full-name data. Adding recommendation abbreviation data to the full-name data can significantly increase the amount of full-name data and abbreviation data pairs in the target data set.

Based on the above process of handling any target full-name data in the target data set, the following describes, with reference to fig. 7, how the computer device processes every full-name data in the target data set; the specific process for any single full-name data may refer to the description of the above embodiments. When processing the target data set, the computer device mainly relies on a filtering module, a verification module and a full-name data analysis module. The computer device can first use the filtering module to perform data cleaning, denoising, full-name deduplication and the like on the target data set, and then perform the trusted evaluation on the processed data to obtain the corresponding evaluation tags; based on the trusted evaluation, the filtering module can send the abbreviation data to which a suspicious tag has been added to the verification module. After the verification module obtains the suspicious abbreviation data, it can perform web crawler processing based on that data to obtain a crawler verification result, and then determine the data type of the abbreviation data based on the crawler verification result.

After the verification module determines the trusted abbreviation data and the suspicious abbreviation data based on the crawler verification result and the trusted evaluation, for the suspicious abbreviation data, the verification module can input the full-name data corresponding to the suspicious abbreviation data into the full-name data analysis module, so that the full-name data analysis module generates new abbreviation data for that full-name data, and the new abbreviation data is used to replace the suspicious abbreviation data. For the trusted abbreviation data, the computer device can generate further recommendation abbreviation data for the corresponding full-name data and maintain the common data dictionary, thereby effectively adjusting the target data set.
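The flow through the three modules can be summarized in a short sketch. It is only an outline under stated assumptions: the data set is modeled as a mapping from full-name data to a list of abbreviation data, and the three callables `is_trusted`, `co_occurs` and `generate` are hypothetical placeholders for the trusted evaluation, the crawler verification and the full-name data analysis module respectively.

```python
from typing import Callable, Dict, List


def clean_dataset(
    dataset: Dict[str, List[str]],
    is_trusted: Callable[[str, str], bool],
    co_occurs: Callable[[str, str], bool],
    generate: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Replace suspicious abbreviations with newly generated ones."""
    cleaned: Dict[str, List[str]] = {}
    for full_name, abbrs in dataset.items():
        kept: List[str] = []
        # Filtering step: drop duplicate abbreviations, keeping order.
        for abbr in dict.fromkeys(abbrs):
            # Trusted evaluation first, crawler verification as second pass.
            if is_trusted(abbr, full_name) or co_occurs(abbr, full_name):
                replacements = [abbr]
            else:
                replacements = generate(full_name)
            for item in replacements:
                if item not in kept:
                    kept.append(item)
        cleaned[full_name] = kept
    return cleaned


# Toy stand-ins for the three modules, for illustration only.
def shares_token(abbr: str, full_name: str) -> bool:
    return bool(set(abbr.lower().split()) & set(full_name.lower().split()))


def never_co_occurs(abbr: str, full_name: str) -> bool:
    return False


def middle_words(full_name: str) -> List[str]:
    return [" ".join(full_name.split()[1:3])]


data = {"Shanghai XX E-commerce Limited": ["abc", "XX E-commerce", "abc"]}
print(clean_dataset(data, shares_token, never_co_occurs, middle_words))
```

In this toy run the suspicious "abc" is replaced by a generated abbreviation that coincides with the already trusted "XX E-commerce", so the cleaned entry contains that abbreviation once.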

In the embodiment of the present invention, after determining target full-name data from a target data set, the computer device may identify the data type of each abbreviation data corresponding to the target full-name data, and, when the abbreviation data corresponding to the target full-name data includes suspicious-type abbreviation data, perform data analysis on the target full-name data and generate new abbreviation data based on the analysis result, so as to replace the suspicious abbreviation data in the target data set with the new abbreviation data. In addition, the computer device can screen out trusted-type abbreviation data, screen out common full-name data and non-common full-name data from the full-name data corresponding to the trusted-type abbreviation data, and set different priorities for the common and non-common full-name data by setting weight values. Distinguishing and maintaining the common and non-common full-name data in this way effectively improves the reliability of subsequent tasks that the computer device performs with the adjusted target data set.

Based on the description of the foregoing data processing method embodiment, an embodiment of the present invention further provides a data processing apparatus, where the data processing apparatus may be a computer program (including program code) running in the foregoing computer device, and the computer device may be a terminal device or a server device. The data processing apparatus may be configured to execute the data processing method shown in fig. 2 and fig. 4. Referring to fig. 8, the data processing apparatus includes: an obtaining unit 801, an identifying unit 802, an analyzing unit 803, and a replacing unit 804.

An obtaining unit 801, configured to obtain a target data set, where the target data set includes target full-name data and M short-name data corresponding to the target full-name data, where M is a positive integer;

an identifying unit 802, configured to identify data types of the M abbreviated data, where the data types include a suspicious type, and the suspicious type abbreviated data is abbreviated data that cannot represent semantics of the target full-name data;

an analyzing unit 803, configured to perform data analysis on the target full name data to generate new abbreviation data if there is abbreviation data whose data type is a suspicious type in the M abbreviation data;

a replacing unit 804, configured to replace, in the target data set, the abbreviation data whose data type is a suspicious type in the M abbreviation data with the new abbreviation data.

In an embodiment, the identifying unit 802 is specifically configured to:

acquiring data attributes of each abbreviation data in the M abbreviation data, data attributes of the target full-name data and data relevance between the target full-name data and each abbreviation data respectively;

and performing credible evaluation on each abbreviation data according to the data attribute of each abbreviation data, the data attribute of the target full-name data and the data relevance, and determining the data types of the M abbreviation data according to the credible evaluation result.

In an embodiment, the identifying unit 802 is specifically configured to:

if it is determined, according to the data attribute corresponding to any abbreviation data, that the data length of the abbreviation data is smaller than a length threshold, adding a suspicious tag to the abbreviation data; or,

and if it is determined that intersection exists between any abbreviation data and the target full-name data according to the data attribute corresponding to any abbreviation data, the data attribute of the target full-name data and the data correlation, adding a trusted tag to any abbreviation data.

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing the semantics of the target full name data; the identifying unit 802 is specifically configured to:

screening out, from the M abbreviation data according to the trusted evaluation result, the abbreviation data to which a trusted tag has been added, wherein the data type of the abbreviation data to which the trusted tag has been added is the trusted type;

determining the abbreviation data to which a suspicious tag has been added from the M abbreviation data, and performing semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data to obtain a semantic analysis result;

and determining, according to the semantic analysis result, the semantic relevance between the semantics of the abbreviation data to which the suspicious tag has been added and the semantics of the target full-name data, and determining the data type of the abbreviation data to which the suspicious tag has been added according to the semantic relevance.

In an embodiment, the identifying unit 802 is specifically configured to:

taking the abbreviation data to which the suspicious tag has been added as a first crawler keyword, and taking the target full-name data as a second crawler keyword;

and performing a crawler search according to the first crawler keyword and the second crawler keyword to obtain a crawler search result, and taking the crawler search result as the result of performing semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data.

In an embodiment, the identifying unit 802 is specifically configured to:

if the crawler search result indicates that the first crawler keyword and the second crawler keyword appear together in the search result, determining that the semantics of the abbreviation data to which the suspicious tag has been added are associated with the semantics of the target full-name data;

and if the crawler search result indicates that the first crawler keyword and the second crawler keyword do not appear together in the search result, determining that the semantics of the abbreviation data to which the suspicious tag has been added are irrelevant to the semantics of the target full-name data.

In one embodiment, the target full-name data comprises one or more entity words; the analyzing unit 803 is specifically configured to:

performing named entity recognition processing on the target full-name data, and determining an entity role of each entity word in the target full-name data;

and combining any one or more entity words according to the entity roles of each entity word, and taking the entity words obtained by combination as new abbreviation data.

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing corresponding full name data semantics; the device further comprises: a screening unit 805 and a setting unit 806.

A screening unit 805, configured to screen, according to the data type of each abbreviation data, abbreviation data of a trusted type from the M abbreviation data;

the screening unit 805 is further configured to screen out common full-name data and non-common full-name data from the at least two full-name data if the short-name data of the trusted type corresponds to the at least two full-name data;

the setting unit 806 is configured to set a first weight value for the common full-name data, and set a second weight value for the non-common full-name data, where a priority of the first weight value is higher than a priority of the second weight value.

In one embodiment, the apparatus further comprises: a display unit 807.

The obtaining unit 801 is further configured to obtain target search information from a terminal device, where the target search information includes reference abbreviation data of a trusted type;

the obtaining unit 801 is further configured to obtain data related to the reference abbreviation data and common full-name data corresponding to the reference abbreviation data as search result data;

a display unit 807 for displaying the search result data in the terminal device.

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing corresponding full name data semantics; the device further comprises: an adding unit 808.

The screening unit 805 is further configured to screen, according to a data type of each abbreviation data in the M abbreviation data, a trusted type abbreviation data from the M abbreviation data, and determine one or more other full name data corresponding to the trusted type abbreviation data;

the analyzing unit 803 is further configured to perform data analysis on the target full-name data and the one or more other full-name data;

an adding unit 808, configured to generate recommendation abbreviation data, and add the recommendation abbreviation data to the target full name data and the one or more other full name data.

In this embodiment of the present invention, after the obtaining unit 801 determines, from the target data set, the target full-name data and the M abbreviation data corresponding to the target full-name data, the identifying unit 802 may identify the data type of each of the M abbreviation data, so as to screen out the suspicious-type abbreviation data that cannot express the semantics of the target full-name data; screening out the suspicious-type abbreviation data can reduce the subsequent data processing pressure of the computer device and improve data processing efficiency. After the suspicious-type abbreviation data is screened out, the analyzing unit 803 may analyze the target full-name data to generate new abbreviation data, so that the replacing unit 804 may replace the original suspicious-type abbreviation data in the target data set with the newly generated abbreviation data. This can effectively improve the accuracy with which each abbreviation data in the target data set expresses the semantics of its corresponding full-name data, and benefits the accuracy of downstream tasks executed with the target data set.

Fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device in the present embodiment as shown in fig. 9 may include: one or more processors 901; one or more input devices 902, one or more output devices 903, and memory 904. The processor 901, the input device 902, the output device 903, and the memory 904 are connected by a bus 905. The memory 904 is used to store a computer program comprising program instructions, and the processor 901 is used to execute the program instructions stored by the memory 904.

The memory 904 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 904 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 904 may also comprise a combination of the above-described types of memory.

The processor 901 may be a central processing unit (CPU). The processor 901 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a generic array logic (GAL), or the like. The processor 901 may also be a combination of the above structures.

In the embodiment of the present invention, the memory 904 is used for storing a computer program, the computer program includes program instructions, and the processor 901 is configured to execute the program instructions stored in the memory 904, so as to implement the steps of the corresponding methods in fig. 2 and fig. 4.

In one embodiment, the processor 901 is configured to call the program instructions to perform:

identifying data types of the M abbreviation data, wherein the data types comprise a suspicious type, and suspicious-type abbreviation data refers to abbreviation data which cannot represent the semantics of the target full-name data;

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing semantics of the target full name data; the processor 901 is configured to call the program instructions for performing:

and determining, according to the semantic analysis result, the semantic relevance between the semantics of the abbreviation data to which the suspicious tag has been added and the semantics of the target full-name data, and determining the data type of the abbreviation data to which the suspicious tag has been added according to the semantic relevance.

and performing a crawler search according to the first crawler keyword and the second crawler keyword to obtain a crawler search result, and taking the crawler search result as the result of performing semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data.

In one embodiment, the target full-name data comprises one or more entity words; the processor 901 is configured to call the program instructions for performing:

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing corresponding full name data semantics; the processor 901 is configured to call the program instructions for performing:

screening out trusted-type abbreviation data from the M abbreviation data according to the data type of each abbreviation data;

if the trusted-type abbreviation data corresponds to at least two full-name data, screening out common full-name data and non-common full-name data from the at least two full-name data;

and setting a first weight value for the common full-name data, and setting a second weight value for the non-common full-name data, wherein the priority of the first weight value is higher than that of the second weight value.

acquiring target search information from terminal equipment, wherein the target search information comprises reference abbreviation data of a trusted type;

acquiring data related to the reference abbreviation data and common full-name data corresponding to the reference abbreviation data as search result data;

and displaying the search result data in the terminal equipment.

In one embodiment, the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing corresponding full name data semantics; the processor 901 is configured to call the program instructions for executing:

screening out trusted-type abbreviation data from the M abbreviation data according to the data type of each abbreviation data in the M abbreviation data, and determining one or more other full-name data corresponding to the trusted-type abbreviation data;

and analyzing the target full-name data and the one or more other full-name data to generate recommended abbreviation data, and adding the recommended abbreviation data to the target full-name data and the one or more other full-name data.

Embodiments of the present invention provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method embodiments as shown in fig. 2 or fig. 4. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data processing method, comprising:

2. The method of claim 1, wherein said identifying the data type of the M abbreviation data comprises:

3. The method of claim 2, wherein the performing a trusted evaluation of each abbreviation data comprises:

4. The method of claim 2, wherein the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing semantics of the target full name data; the determining the data types of the M short form data according to the trusted evaluation result includes:

determining the abbreviation data to which a suspicious tag has been added from the M abbreviation data, and performing semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data to obtain a semantic analysis result;

5. The method according to claim 4, wherein the performing semantic analysis on the abbreviation data to which the suspicious tag has been added and the target full-name data to obtain a semantic analysis result comprises:

6. The method of claim 5, wherein the determining semantic relevance between the semantics of the abbreviation data added with the suspicious tag and the semantics of the target full-name data according to the semantic parsing result comprises:

if the crawler search result indicates that the first crawler keyword and the second crawler keyword appear together in the search result, determining that the semantics of the abbreviation data to which the suspicious tag has been added are associated with the semantics of the target full-name data;

7. The method of claim 1, wherein the target full-name data comprises one or more entity words; the performing data analysis on the target full-name data to generate new abbreviation data comprises:

8. The method of claim 1, wherein the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing semantics of corresponding full name data; the method further comprises the following steps:

screening out trusted-type abbreviation data from the M abbreviation data according to the data type of each abbreviation data;

if the trusted-type abbreviation data corresponds to at least two full-name data, screening out common full-name data and non-common full-name data from the at least two full-name data;

setting a first weight value for the common full-name data, and setting a second weight value for the non-common full-name data, wherein the priority of the first weight value is higher than that of the second weight value.

9. The method of claim 8, wherein the method further comprises:

acquiring target search information from terminal equipment, wherein the target search information comprises reference abbreviation data of a trusted type;

and displaying the search result data in the terminal equipment.

10. The method of claim 1, wherein the data types further include a trusted type, and the abbreviation data of the trusted type refers to abbreviation data capable of representing semantics of corresponding full name data; the method further comprises the following steps:

11. A data processing apparatus, characterized by comprising: an obtaining unit, configured to obtain a target data set, wherein the target data set comprises target full-name data and M abbreviation data corresponding to the target full-name data, and M is a positive integer;

12. A computer device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1 to 10.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 10.