CN111191689B - Sample data processing method and device


Info

Publication number
CN111191689B
CN111191689B (application CN201911293517.3A)
Authority
CN
China
Prior art keywords
information
word
similarity
training
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911293517.3A
Other languages
Chinese (zh)
Other versions
CN111191689A (en)
Inventor
张东
刘成鹏
卢亿雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd
Priority to CN201911293517.3A
Publication of CN111191689A
Application granted
Publication of CN111191689B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/29: Graphical models, e.g. Bayesian networks
    • G06F18/295: Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses a method and a device for processing sample data. The method comprises the following steps: acquiring feature information of a predicted word in sample data; determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word; selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2; obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word; and determining the category information of the predicted word according to the labeling results of the K candidate words.

Description

Sample data processing method and device
Technical Field
The embodiments of the application relate to the field of information processing, and in particular to a method and a device for processing sample data.
Background
The knowledge graph is an indispensable basic resource for artificial intelligence applications and plays an important role in Internet applications such as semantic search, question-answering systems and personalized recommendation. The construction of a knowledge graph is divided into three parts: information extraction, knowledge fusion and knowledge processing, where the key technologies involved in information extraction include entity extraction, attribute extraction and relationship extraction. When the data volume is sufficient, the currently popular information extraction technique is to use a deep learning neural network to extract the entities, attributes and relationships in the corpus and construct triples from them. However, when the data for a domain knowledge graph is relatively scarce and a deep learning model is therefore not applicable, the construction of the knowledge graph becomes a difficulty in that industry domain.
In the related art, a hidden Markov model (Hidden Markov Model, HMM) is used for information extraction: a training corpus is constructed manually or semi-automatically, the model is trained, and the trained model is then used to identify the entities, attributes and relationships in the corpus. The generation process is modeled by defining a joint probability over the observation sequence and the tag sequence. Because the HMM makes a strong independence assumption, the elements of the observation sequence are treated as isolated individuals, and the observation at any moment depends only on the state at that moment. The hidden Markov model can therefore use only limited context features; using more context causes a data sparseness problem and reduces recognition accuracy.
To address the limitation that the HMM imposes an independence assumption on the observation sequence output, information extraction can instead be performed with a maximum entropy Markov model (Maximum Entropy Markov Model, MEMM). The MEMM can incorporate various kinds of constraint information and supports arbitrarily complex cross features of the observation sequence elements during classification. In practical applications, however, the recognition accuracy of the MEMM still needs to be improved and its complexity needs to be reduced.
Disclosure of Invention
In order to solve the above technical problems, the embodiments of the application provide a method and a device for processing sample data.
In order to achieve the object of the embodiments of the present application, an embodiment of the present application provides a method for processing sample data, comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word includes:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word includes:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
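By way of illustration only, the length-based dispatch between the two similarity strategies can be sketched in Python as follows; the threshold value and the embedding function are assumptions for the example, not values fixed by the embodiment.

```python
# A minimal sketch of the length-based choice between cosine and Jaccard
# similarity described above. LENGTH_THRESHOLD and `embed` are
# illustrative assumptions.
import numpy as np

LENGTH_THRESHOLD = 2  # preset length threshold (assumed value)

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity of two dense vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the character sets of two short strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def feature_similarity(pred_content: str, train_content: str, embed) -> float:
    """Dispatch on the length difference of the two feature contents.

    `embed` maps a string to a vector (e.g. an averaged word embedding);
    it is an assumed dependency of this sketch.
    """
    if abs(len(pred_content) - len(train_content)) > LENGTH_THRESHOLD:
        return cosine_similarity(embed(pred_content), embed(train_content))
    return jaccard_similarity(pred_content, train_content)
```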
In an exemplary embodiment, the aggregating the similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word includes:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
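Purely as an illustration, the two normalization schemes mentioned above can be written as below; the function names are assumptions for the example.

```python
# Illustrative sketch of the two normalization schemes: Gaussian (z-score)
# normalization for cosine-based feature similarities, and max-min
# normalization for Jaccard-based ones.
import numpy as np

def gaussian_normalize(sims: np.ndarray) -> np.ndarray:
    """Normalize cosine similarities with their mean and standard deviation."""
    mean, std = sims.mean(), sims.std()
    return (sims - mean) / std if std > 0 else np.zeros_like(sims)

def min_max_normalize(sims: np.ndarray) -> np.ndarray:
    """Scale Jaccard similarities into [0, 1] with their min and max."""
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo) if hi > lo else np.zeros_like(sims)
```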
In an exemplary embodiment, the determining the category information of the predicted word according to the category information of the K candidate words includes:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
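For illustration, the highest-usage-rate judgment strategy can be realized as a simple majority vote over the candidate labels; `collections.Counter` is an assumed implementation choice, not part of the patent's disclosure.

```python
# Minimal sketch of majority voting over the K candidate labels.
from collections import Counter

def predict_category(candidate_labels: list[str]) -> str:
    """Return the category that occurs most often among the K candidates."""
    counts = Counter(candidate_labels)
    return counts.most_common(1)[0][0]

# Example: with K = 5 candidates labeled as below, the predicted word
# is assigned the category "entity".
print(predict_category(["entity", "attribute", "entity", "entity", "relation"]))
```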
A processing apparatus for sample data, comprising a processor and a memory, wherein the memory stores a computer program, the processor invoking the computer program in the memory to perform operations comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, including:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, including:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word, including:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, including:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
According to the scheme provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the scheme is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
Additional features and advantages of embodiments of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of embodiments of the application. The objectives and other advantages of embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technical solution of the embodiments of the present application, are incorporated in and constitute a part of this specification, and serve to illustrate and explain the technical solution of the embodiments of the present application rather than to limit it.
FIG. 1 is a flowchart of a method for processing sample data according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for processing sample data based on a K-nearest neighbor algorithm according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be arbitrarily combined with each other.
The steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
The inventors have found that the reason the recognition accuracy of MEMM-based information extraction needs improvement is that the MEMM marks each observed value individually and cannot consider the relationships between marks from a global point of view, so the resulting marking is usually only locally optimal. At the same time, this approach can cause the "label bias" problem, in which the current marking state becomes unrelated to the observed value, which lowers recognition accuracy.
In order to solve the problem that, in a domain knowledge graph, entities, attributes and relationships cannot be extracted accurately with a deep model because of the lack of data, the embodiments of the application provide a machine learning method that converts the sequence labeling problem into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves information extraction accuracy.
For the construction of a domain knowledge graph, entities, attributes and relationships in the corpus cannot be extracted accurately by a deep model due to the lack of domain data, while information extraction is the most critical part of the knowledge graph construction process.
If, among the k most similar (i.e., nearest neighbor) samples of a sample in the feature space, the majority belong to a certain class, then the sample also belongs to that class. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The advantages of this method are: (1) it is simple, easy to understand, easy to implement, and requires no parameter estimation; (2) it is particularly suited to multi-class problems (objects with multiple class labels).
The following describes the scheme provided by the embodiment of the application:
FIG. 1 is a flowchart of a method for processing sample data according to an embodiment of the present application. The method shown in FIG. 1 comprises the following steps:
Step 101: acquiring feature information of a predicted word in sample data.
In an exemplary embodiment, the interpretation information and/or description information of the predicted word is queried through a preset corpus.
Taking the predicted word "company" as an example for illustration:
1. Open classification: organization;
2. The basic information may include:
Attributes: a corporate legal person established for profit;
Old usage: a name for a government office in feudal China;
Types: limited liability company and joint stock limited company;
3. The description information may include: a company is a corporate legal person established in China according to law for the purpose of profit, and includes limited liability companies and joint stock limited companies. It is a form of enterprise organization that emerged to meet the requirements of the market economy and socialized mass production.
A company is an entity, and the open classification, the description information, and the key-value pairs in the basic information can all be used as features of the word.
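Purely as an illustration, such feature information can be held as a mapping from feature names to feature values; the field names below are assumptions for the example, not terms from the embodiment.

```python
# Hypothetical representation of the feature information of the word
# "company" as key-value pairs (feature name -> feature content).
predicted_word = "company"
features = {
    "open_classification": "organization",
    "attributes": "a corporate legal person established for profit",
    "types": "limited liability company and joint stock limited company",
    "description": ("a company is a corporate legal person established "
                    "for profit, including limited liability companies "
                    "and joint stock limited companies"),
}
```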
Step 102: determining similarity information between the predicted word and each training word in the pre-acquired training data according to the feature information of the predicted word.
In an exemplary embodiment, since the feature information is a specific description and explanation of a training word, calculating the similarity between the training words and the predicted word in units of feature information determines the relevance between words more accurately.
In an exemplary embodiment, the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word includes:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
When calculating the similarity of feature information, the similarity is calculated over the same feature, so that the similarity between the predicted word and the training word can be determined more reliably; after the similarity of each feature of the same training word has been obtained, the similarity information between the training word and the predicted word is determined by a weighted calculation.
In one exemplary embodiment, the similarity information of each feature is calculated as follows:
acquiring the feature name and the feature value of each piece of feature information, and taking the feature name and the feature value together as a set of feature information;
calculating the similarity of the feature names of the two sets of feature information under the same feature, and calculating the similarity of the feature values of the two sets of feature information under the same feature;
and determining the similarity information of the feature according to the similarity of the feature names and the similarity of the feature values.
Determining the similarity of a feature from both the feature-name similarity and the feature-value similarity improves the accuracy of the feature similarity calculation.
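A minimal sketch of this name-plus-value combination follows; the equal 0.5/0.5 weighting and the `sim` callable (e.g. the length-dispatched similarity sketched earlier) are illustrative assumptions.

```python
# Sketch: similarity of one feature shared by the predicted word and a
# training word, combining feature-name similarity and feature-value
# similarity. The 0.5/0.5 weighting is an illustrative assumption.
from typing import Callable

def kv_pair_similarity(pred_kv: tuple[str, str],
                       train_kv: tuple[str, str],
                       sim: Callable[[str, str], float]) -> float:
    pred_name, pred_value = pred_kv
    train_name, train_value = train_kv
    return 0.5 * sim(pred_name, train_name) + 0.5 * sim(pred_value, train_value)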
Step 103: selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2.
In an exemplary embodiment, the K training words with the largest similarity values may be selected as candidate words in descending order of similarity.
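For illustration, the top-K selection can be written as below; `heapq.nlargest` is one of several equivalent choices, and the scores are made up for the example.

```python
# Sketch: pick the K training words most similar to the predicted word.
import heapq

def top_k_candidates(similarities: dict[str, float], k: int = 5) -> list[str]:
    """`similarities` maps each training word to its similarity score."""
    return heapq.nlargest(k, similarities, key=similarities.get)

# Example with assumed scores:
scores = {"corporation": 0.91, "firm": 0.87, "school": 0.42, "river": 0.10}
print(top_k_candidates(scores, k=2))  # ['corporation', 'firm']
```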
Step 104: obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word.
In an exemplary embodiment, the labeling operation has already been completed for all K candidate words in the training data, i.e. their category information has been determined, so the labeling results of the K candidate words can simply be read.
Step 105: determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, since the K candidate words are similar to the predicted word, the labeling results of the K candidate words are also applicable to the predicted word, and the labeling operation of the predicted word is completed by means of the labeling results of the K candidate words.
In an exemplary embodiment, the determining the category information of the predicted word according to the category information of the K candidate words includes:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
The category containing the largest number of words among the K candidate words is determined, and its category information is used as that of the predicted word. This converts the sequence labeling problem into a multi-classification problem and completes the labeling operation by exploiting the property of the K-Nearest Neighbor (KNN) classification algorithm: if the majority of the K most similar samples in the feature space belong to a certain category, the sample in question also belongs to that category.
According to the method provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the method is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
In an exemplary embodiment, the calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word includes:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word includes:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
Selecting the corresponding calculation method according to the content length of the feature information effectively improves the accuracy of the calculation.
FIG. 2 is a flowchart of a method for processing sample data based on a K-nearest neighbor algorithm according to an embodiment of the present application. The method shown in FIG. 2 comprises the following steps:
step 201, constructing a training corpus corresponding to the predicted word, wherein the training corpus is marked with entities, attributes and relations.
Labeling defined entity category words and attribute words (attribute is also a noun relation) in the training corpus to serve as data of a training algorithm;
for example, beijing is the capital of China, where Beijing and the entity class of China are place names and capital are attribute words.
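As a purely illustrative sketch, such labeled training data can be held as word/category pairs; the tag names used here are assumptions, not tags defined by the embodiment.

```python
# Hypothetical labeled training sample for "Beijing is the capital of China".
# Tags are illustrative: "PLACE" for the place-name entity category,
# "ATTR" for attribute words, "O" for everything else.
labeled_sentence = [
    ("Beijing", "PLACE"),
    ("is", "O"),
    ("the", "O"),
    ("capital", "ATTR"),
    ("of", "O"),
    ("China", "PLACE"),
]
```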
Step 202: acquiring the feature information of the predicted word.
In an exemplary embodiment, the feature information of a predicted word is obtained by taking an entity word from the corpus and retrieving the description information of that entity word from a pre-stored corpus (e.g. an interactive encyclopedia).
Taking the entity word "company" as an example, a company is an entity, and the open classification, the description information and the key-value pairs in the basic information can all be used as features of the word.
Step 203: judging whether the difference between the lengths of the feature contents under the same feature is greater than a preset first number threshold.
In an exemplary embodiment, the number threshold may be set to 2.
If yes, step 204 is executed; otherwise, step 206 is executed.
In an exemplary embodiment, the feature information of an entity word can be represented as k-v key-value pairs, so the k and the v in the basic information of two words can both be used as features: the k of the two words are compared with each other, and likewise the v.
Step 204: fine-tuning pre-trained FastText word vectors with the existing corpus to obtain the final word vectors, calculating the inverse document frequency (Inverse Document Frequency, IDF) value of each feature word, calculating the cosine similarity with the FastText word vectors, and taking a weighted average using the IDF of the corresponding words; then step 205 is executed.
Step 205: for the features whose content length difference is greater than 2, i.e. the features whose similarity needs to be calculated with FastText vectors, calculating the mean, variance and standard deviation of the similarity of each feature between the predicted word and each word in the training data, wherein the mean, variance and standard deviation are used to Gaussian-normalize the similarities of the features of the words in the training data; then step 209 is executed.
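Purely as an illustration of the IDF-weighted cosine similarity of step 204, a minimal sketch follows. It assumes a FastText model already fine-tuned on the corpus is available as a word-to-vector mapping `wv`; `build_idf` and `weighted_cosine` are illustrative names, not functions from the embodiment.

```python
# Sketch of the IDF-weighted cosine similarity of step 204. `wv` (word ->
# vector) is an assumed, already fine-tuned FastText lookup; a real
# FastText model would also cover out-of-vocabulary words via subwords.
import math
import numpy as np

def build_idf(documents: list[list[str]]) -> dict[str, float]:
    """IDF(w) = log(N / (1 + number of documents containing w))."""
    n_docs = len(documents)
    doc_freq: dict[str, int] = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / (1 + df)) for w, df in doc_freq.items()}

def weighted_cosine(words_a: list[str], words_b: list[str], wv, idf) -> float:
    """Cosine similarity of IDF-weighted average FastText vectors."""
    def avg_vec(words):
        vecs = [idf.get(w, 1.0) * wv[w] for w in words]
        return np.mean(vecs, axis=0)
    va, vb = avg_vec(words_a), avg_vec(words_b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
```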
Step 206: calculating the similarity with Jaccard and averaging, then executing step 207.
Jaccard similarity mainly measures the common part of two words of similar length: the larger the common part, the more similar the two words. When the lengths of the words differ greatly, the similarity is calculated with vectors instead; for example, Jaccard cannot be used to calculate the similarity between "the People's Republic of China" and "China".
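The character-level Jaccard computation on the original Chinese strings can be sketched as follows; the wrapper name is an assumption for the example.

```python
# Character-level Jaccard similarity for two short strings.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Works well for similar-length strings (substantial character overlap):
print(jaccard("有限责任公司", "股份有限公司"))  # 4 shared chars of 8 -> 0.5
# Degenerates when lengths differ greatly, e.g. "中华人民共和国" vs "中国"
# (only 2/7), which is why the method switches to vector-based cosine
# similarity in that case.
print(jaccard("中华人民共和国", "中国"))
```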
Step 207: performing maximum-minimum normalization on the similarities of the features whose content length difference is less than or equal to 2, i.e. the features whose similarity is calculated with Jaccard, and then executing step 208.
Step 208: calculating a weighted sum of the similarities of the features of each word, sorting the words, and selecting the first K words as candidate words.
The weights used for the weighted sum may be obtained by grid search and cross-validation.
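As an illustration of how such weights might be chosen, a brute-force grid search over candidate weight vectors with a cross-validated score as the criterion could look like this; `evaluate` is an assumed dependency, not a function from the embodiment.

```python
# Sketch: choose feature weights for the weighted similarity sum by grid
# search. `evaluate(weights)` is an assumed function returning the
# cross-validated labeling accuracy obtained with the given weights.
import itertools
import numpy as np

def grid_search_weights(n_features: int, evaluate, steps: int = 5):
    """Try weight vectors on a coarse grid and keep the best one.

    The grid has steps**n_features points, i.e. exponential in the number
    of features; this is fine for the handful of features used here.
    """
    grid = np.linspace(0.0, 1.0, steps)
    best_weights, best_score = None, -1.0
    for combo in itertools.product(grid, repeat=n_features):
        total = sum(combo)
        if total == 0:
            continue
        weights = np.array(combo) / total  # normalize weights to sum to 1
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```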
Step 209: comparing the categories by the number of candidate words they contain, and taking the category with the maximum number of words of the same category (or, in the case of a tie, the at least two tied categories) as the category information of the predicted word.
According to the method provided by the embodiments of the application, when a deep model is not applicable because domain data is lacking, the sequence labeling problem is converted into a multi-classification problem, and a different classification algorithm is used for information extraction and for constructing the domain knowledge graph; by improving the algorithm while reproducing the KNN algorithm, the classification accuracy of the algorithm and the accuracy of information extraction are effectively improved.
The embodiment of the application provides a sample data processing device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor invokes the computer program in the memory to implement the following operations:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, including:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, including:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word, including:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, including:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
According to the device embodiment provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the device is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and functional modules/units in the apparatus and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.

Claims (6)

1. A method of processing sample data, comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words;
wherein the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word comprises:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, which comprises:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
2. The method according to claim 1, wherein calculating the similarity of each piece of feature information of the same training word to determine the similarity information between each training word and the predicted word comprises:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
3. The method of claim 1, wherein determining the category information of the predicted word from the category information of the K candidate words comprises:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
4. A sample data processing device comprising a processor and a memory, wherein the memory stores a computer program, the processor invoking the computer program in the memory to perform operations comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words;
wherein the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, comprising:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word;
and the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, comprising:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
5. The apparatus of claim 4, wherein the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word to determine the similarity information between each training word and the predicted word, comprising:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
6. The apparatus of claim 4, wherein the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, comprising:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
CN201911293517.3A 2019-12-16 2019-12-16 Sample data processing method and device Active CN111191689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293517.3A CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911293517.3A CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Publications (2)

Publication Number Publication Date
CN111191689A CN111191689A (en) 2020-05-22
CN111191689B (en) 2023-09-12

Family

ID=70711040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911293517.3A Active CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Country Status (1)

Country Link
CN (1) CN111191689B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095017A1 (en) * 2013-09-27 2015-04-02 Google Inc. System and method for learning word embeddings using neural language models
US10360507B2 (en) * 2016-09-22 2019-07-23 nference, inc. Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635273A (en) * 2018-10-25 2019-04-16 平安科技(深圳)有限公司 Text key word extracting method, device, equipment and storage medium
CN110378381A (en) * 2019-06-17 2019-10-25 华为技术有限公司 Object detecting method, device and computer storage medium
CN110555083A (en) * 2019-08-26 2019-12-10 北京工业大学 non-supervision entity relationship extraction method based on zero-shot
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"近十年国内移动学习研究现状与趋势——基于共词分析的知识图谱研究";付晓丽;《吕梁教育学院学报》;第35卷(第1期);30-33 *

Also Published As

Publication number Publication date
CN111191689A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN110188168B (en) Semantic relation recognition method and device
US20210382937A1 (en) Image processing method and apparatus, and storage medium
US20230039496A1 (en) Question-and-answer processing method, electronic device and computer readable medium
US20170262478A1 (en) Method and apparatus for image retrieval with feature learning
CN111538846A (en) Third-party library recommendation method based on mixed collaborative filtering
CN111177403B (en) Sample data processing method and device
CN114329244A (en) Map interest point query method, map interest point query device, map interest point query equipment, storage medium and program product
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN113076758B (en) Task-oriented dialog-oriented multi-domain request type intention identification method
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN111191689B (en) Sample data processing method and device
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN116681128A (en) Neural network model training method and device with noisy multi-label data
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium
CN114372148A (en) Data processing method based on knowledge graph technology and terminal equipment
CN114529191A (en) Method and apparatus for risk identification
CN113139382A (en) Named entity identification method and device
CN117235629B (en) Intention recognition method, system and computer equipment based on knowledge domain detection
US11934794B1 (en) Systems and methods for algorithmically orchestrating conversational dialogue transitions within an automated conversational system
CN113590747B (en) Method for intent recognition and corresponding system, computer device and medium
CN116578804A (en) Website security detection method, device and storage medium
CN116151246A (en) Method and device for generating document reading information
CN114357175A (en) Data mining system based on semantic network
CN115357723A (en) Entity relationship extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant