CN111191689B - Sample data processing method and device


Info

Publication number
CN111191689B
CN111191689B (application CN201911293517.3A)
Authority
CN
China
Prior art keywords
information
word
similarity
training
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911293517.3A
Other languages
Chinese (zh)
Other versions
CN111191689A (en)
Inventor
张东
刘成鹏
卢亿雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Enyike Beijing Data Technology Co ltd
Original Assignee
Enyike Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enyike Beijing Data Technology Co ltd
Priority to CN201911293517.3A
Publication of CN111191689A
Application granted
Publication of CN111191689B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/29: Graphical models, e.g. Bayesian networks
    • G06F18/295: Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses a method and a device for processing sample data. The method comprises the following steps: acquiring feature information of a predicted word in sample data; determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word; selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2; obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word; and determining the category information of the predicted word according to the labeling results of the K candidate words.

Description

Sample data processing method and device
Technical Field
The embodiments of the application relate to the field of information processing, and in particular to a method and a device for processing sample data.
Background
The knowledge graph is an indispensable basic resource for artificial intelligence applications and plays an important role in Internet applications such as semantic search, question-answering systems and personalized recommendation. The construction of a knowledge graph is divided into three parts: information extraction, knowledge fusion and knowledge processing, where the key technologies involved in information extraction include entity extraction, attribute extraction and relationship extraction. When the data volume is sufficient, the currently popular information extraction technique is to use a deep learning neural network to extract the entities, attributes and relationships in the corpus and construct triples from them. However, when the data for a domain knowledge graph is relatively scarce and a deep learning model is therefore not applicable, the construction of the knowledge graph becomes a difficulty in that industry domain.
In the related art, a hidden Markov model (Hidden Markov Model, HMM) is used for information extraction: a training corpus is constructed manually or semi-automatically, the model is trained, and the trained model is then used to identify the entities, attributes and relationships in the corpus. The generation process is modeled by defining a joint probability over the observation sequence and the tag sequence. Because the HMM makes a strong independence assumption, the elements of the observation sequence are treated as isolated individuals, and the observation at any moment depends only on the state at that moment. The hidden Markov model can therefore use only limited context features; using more context causes a data sparseness problem and reduces recognition accuracy.
To address the limitation that the HMM imposes an independence assumption on the observation sequence output, information extraction can instead be performed with a maximum entropy Markov model (Maximum Entropy Markov Model, MEMM). The MEMM can incorporate various kinds of constraint information and supports arbitrarily complex cross features of the observation sequence elements during classification. In practical applications, however, the recognition accuracy of the MEMM still needs to be improved and its complexity needs to be reduced.
Disclosure of Invention
In order to solve the above technical problems, the embodiments of the application provide a method and a device for processing sample data.
In order to achieve the object of the embodiments of the present application, an embodiment of the present application provides a method for processing sample data, comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word includes:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word includes:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
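By way of illustration only, the length-based dispatch between the two similarity strategies can be sketched in Python as follows; the threshold value and the embedding function are assumptions for the example, not values fixed by the embodiment.

```python
# A minimal sketch of the length-based choice between cosine and Jaccard
# similarity described above. LENGTH_THRESHOLD and `embed` are
# illustrative assumptions.
import numpy as np

LENGTH_THRESHOLD = 2  # preset length threshold (assumed value)

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity of two dense vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the character sets of two short strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def feature_similarity(pred_content: str, train_content: str, embed) -> float:
    """Dispatch on the length difference of the two feature contents.

    `embed` maps a string to a vector (e.g. an averaged word embedding);
    it is an assumed dependency of this sketch.
    """
    if abs(len(pred_content) - len(train_content)) > LENGTH_THRESHOLD:
        return cosine_similarity(embed(pred_content), embed(train_content))
    return jaccard_similarity(pred_content, train_content)
```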
In an exemplary embodiment, the aggregating the similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word includes:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
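Purely as an illustration, the two normalization schemes mentioned above can be written as below; the function names are assumptions for the example.

```python
# Illustrative sketch of the two normalization schemes: Gaussian (z-score)
# normalization for cosine-based feature similarities, and max-min
# normalization for Jaccard-based ones.
import numpy as np

def gaussian_normalize(sims: np.ndarray) -> np.ndarray:
    """Normalize cosine similarities with their mean and standard deviation."""
    mean, std = sims.mean(), sims.std()
    return (sims - mean) / std if std > 0 else np.zeros_like(sims)

def min_max_normalize(sims: np.ndarray) -> np.ndarray:
    """Scale Jaccard similarities into [0, 1] with their min and max."""
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo) if hi > lo else np.zeros_like(sims)
```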
In an exemplary embodiment, the determining the category information of the predicted word according to the category information of the K candidate words includes:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
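For illustration, the highest-usage-rate judgment strategy can be realized as a simple majority vote over the candidate labels; `collections.Counter` is an assumed implementation choice, not part of the patent's disclosure.

```python
# Minimal sketch of majority voting over the K candidate labels.
from collections import Counter

def predict_category(candidate_labels: list[str]) -> str:
    """Return the category that occurs most often among the K candidates."""
    counts = Counter(candidate_labels)
    return counts.most_common(1)[0][0]

# Example: with K = 5 candidates labeled as below, the predicted word
# is assigned the category "entity".
print(predict_category(["entity", "attribute", "entity", "entity", "relation"]))
```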
A processing apparatus for sample data, comprising a processor and a memory, wherein the memory stores a computer program, the processor invoking the computer program in the memory to perform operations comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, including:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, including:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word, including:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, including:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
According to the scheme provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the scheme is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
Additional features and advantages of embodiments of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of embodiments of the application. The objectives and other advantages of embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technical solution of the embodiments of the present application, are incorporated in and constitute a part of this specification, and serve to illustrate and explain the technical solution of the embodiments of the present application rather than to limit it.
FIG. 1 is a flowchart of a method for processing sample data according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for processing sample data based on a K-nearest neighbor algorithm according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be arbitrarily combined with each other.
The steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
The inventors have found that the reason the recognition accuracy of MEMM-based information extraction needs improvement is that the MEMM marks each observed value individually and cannot consider the relationships between marks from a global point of view, so the resulting marking is usually only locally optimal. At the same time, this approach can cause the "label bias" problem, in which the current marking state becomes unrelated to the observed value, which lowers recognition accuracy.
In order to solve the problem that, in a domain knowledge graph, entities, attributes and relationships cannot be extracted accurately with a deep model because of the lack of data, the embodiments of the application provide a machine learning method that converts the sequence labeling problem into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves information extraction accuracy.
For the construction of a domain knowledge graph, entities, attributes and relationships in the corpus cannot be extracted accurately by a deep model due to the lack of domain data, while information extraction is the most critical part of the knowledge graph construction process.
If, among the k most similar (i.e., nearest neighbor) samples of a sample in the feature space, the majority belong to a certain class, then the sample also belongs to that class. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The advantages of this method are: (1) it is simple, easy to understand, easy to implement, and requires no parameter estimation; (2) it is particularly suited to multi-class problems (objects with multiple class labels).
The following describes the scheme provided by the embodiment of the application:
FIG. 1 is a flowchart of a method for processing sample data according to an embodiment of the present application. The method shown in FIG. 1 comprises the following steps:
Step 101: acquiring feature information of a predicted word in sample data.
In an exemplary embodiment, the interpretation information and/or description information of the predicted word is queried through a preset corpus.
Taking the predicted word "company" as an example for illustration:
1. Open classification: organization;
2. The basic information may include:
Attributes: a corporate legal person established for profit;
Old usage: a name for a government office in feudal China;
Types: limited liability company and joint stock limited company;
3. The description information may include: a company is a corporate legal person established in China according to law for the purpose of profit, and includes limited liability companies and joint stock limited companies. It is a form of enterprise organization that emerged to meet the requirements of the market economy and socialized mass production.
A company is an entity, and the open classification, the description information, and the key-value pairs in the basic information can all be used as features of the word.
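Purely as an illustration, such feature information can be held as a mapping from feature names to feature values; the field names below are assumptions for the example, not terms from the embodiment.

```python
# Hypothetical representation of the feature information of the word
# "company" as key-value pairs (feature name -> feature content).
predicted_word = "company"
features = {
    "open_classification": "organization",
    "attributes": "a corporate legal person established for profit",
    "types": "limited liability company and joint stock limited company",
    "description": ("a company is a corporate legal person established "
                    "for profit, including limited liability companies "
                    "and joint stock limited companies"),
}
```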
Step 102: determining similarity information between the predicted word and each training word in the pre-acquired training data according to the feature information of the predicted word.
In an exemplary embodiment, since the feature information is a specific description and explanation of a training word, calculating the similarity between the training words and the predicted word in units of feature information determines the relevance between words more accurately.
In an exemplary embodiment, the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word includes:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
When calculating the similarity of feature information, the similarity is calculated over the same feature, so that the similarity between the predicted word and the training word can be determined more reliably; after the similarity of each feature of the same training word has been obtained, the similarity information between the training word and the predicted word is determined by a weighted calculation.
In one exemplary embodiment, the similarity information of each feature is calculated as follows:
acquiring the feature name and the feature value of each piece of feature information, and taking the feature name and the feature value together as a set of feature information;
calculating the similarity of the feature names of the two sets of feature information under the same feature, and calculating the similarity of the feature values of the two sets of feature information under the same feature;
and determining the similarity information of the feature according to the similarity of the feature names and the similarity of the feature values.
Determining the similarity of a feature from both the feature-name similarity and the feature-value similarity improves the accuracy of the feature similarity calculation.
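A minimal sketch of this name-plus-value combination follows; the equal 0.5/0.5 weighting and the `sim` callable (e.g. the length-dispatched similarity sketched earlier) are illustrative assumptions.

```python
# Sketch: similarity of one feature shared by the predicted word and a
# training word, combining feature-name similarity and feature-value
# similarity. The 0.5/0.5 weighting is an illustrative assumption.
from typing import Callable

def kv_pair_similarity(pred_kv: tuple[str, str],
                       train_kv: tuple[str, str],
                       sim: Callable[[str, str], float]) -> float:
    pred_name, pred_value = pred_kv
    train_name, train_value = train_kv
    return 0.5 * sim(pred_name, train_name) + 0.5 * sim(pred_value, train_value)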
Step 103: selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2.
In an exemplary embodiment, the K training words with the largest similarity values may be selected as candidate words in descending order of similarity.
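For illustration, the top-K selection can be written as below; `heapq.nlargest` is one of several equivalent choices, and the scores are made up for the example.

```python
# Sketch: pick the K training words most similar to the predicted word.
import heapq

def top_k_candidates(similarities: dict[str, float], k: int = 5) -> list[str]:
    """`similarities` maps each training word to its similarity score."""
    return heapq.nlargest(k, similarities, key=similarities.get)

# Example with assumed scores:
scores = {"corporation": 0.91, "firm": 0.87, "school": 0.42, "river": 0.10}
print(top_k_candidates(scores, k=2))  # ['corporation', 'firm']
```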
Step 104: obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word.
In an exemplary embodiment, the labeling operation has already been completed for all K candidate words in the training data, i.e. their category information has been determined, so the labeling results of the K candidate words can simply be read.
Step 105: determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, since the K candidate words are similar to the predicted word, the labeling results of the K candidate words are also applicable to the predicted word, and the labeling operation of the predicted word is completed by means of the labeling results of the K candidate words.
In an exemplary embodiment, the determining the category information of the predicted word according to the category information of the K candidate words includes:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
The category containing the largest number of words among the K candidate words is determined, and its category information is used as that of the predicted word. This converts the sequence labeling problem into a multi-classification problem and completes the labeling operation by exploiting the property of the K-Nearest Neighbor (KNN) classification algorithm: if the majority of the K most similar samples in the feature space belong to a certain category, the sample in question also belongs to that category.
According to the method provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the method is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
In an exemplary embodiment, the calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word includes:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word includes:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
Selecting the corresponding calculation method according to the content length of the feature information effectively improves the accuracy of the calculation.
FIG. 2 is a flowchart of a method for processing sample data based on a K-nearest neighbor algorithm according to an embodiment of the present application. The method shown in FIG. 2 comprises the following steps:
step 201, constructing a training corpus corresponding to the predicted word, wherein the training corpus is marked with entities, attributes and relations.
Labeling defined entity category words and attribute words (attribute is also a noun relation) in the training corpus to serve as data of a training algorithm;
for example, beijing is the capital of China, where Beijing and the entity class of China are place names and capital are attribute words.
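As a purely illustrative sketch, such labeled training data can be held as word/category pairs; the tag names used here are assumptions, not tags defined by the embodiment.

```python
# Hypothetical labeled training sample for "Beijing is the capital of China".
# Tags are illustrative: "PLACE" for the place-name entity category,
# "ATTR" for attribute words, "O" for everything else.
labeled_sentence = [
    ("Beijing", "PLACE"),
    ("is", "O"),
    ("the", "O"),
    ("capital", "ATTR"),
    ("of", "O"),
    ("China", "PLACE"),
]
```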
Step 202: acquiring the feature information of the predicted word.
In an exemplary embodiment, the feature information of a predicted word is obtained by taking an entity word from the corpus and retrieving the description information of that entity word from a pre-stored corpus (e.g. an interactive encyclopedia).
Taking the entity word "company" as an example, a company is an entity, and the open classification, the description information and the key-value pairs in the basic information can all be used as features of the word.
Step 203: judging whether the difference between the lengths of the feature contents under the same feature is greater than a preset first number threshold.
In an exemplary embodiment, the number threshold may be set to 2.
If yes, step 204 is executed; otherwise, step 206 is executed.
In an exemplary embodiment, the feature information of an entity word can be represented as k-v key-value pairs, so the k and the v in the basic information of two words can both be used as features: the k of the two words are compared with each other, and likewise the v.
Step 204: fine-tuning pre-trained FastText word vectors with the existing corpus to obtain the final word vectors, calculating the inverse document frequency (Inverse Document Frequency, IDF) value of each feature word, calculating the cosine similarity with the FastText word vectors, and taking a weighted average using the IDF of the corresponding words; then step 205 is executed.
Step 205: for the features whose content length difference is greater than 2, i.e. the features whose similarity needs to be calculated with FastText vectors, calculating the mean, variance and standard deviation of the similarity of each feature between the predicted word and each word in the training data, wherein the mean, variance and standard deviation are used to Gaussian-normalize the similarities of the features of the words in the training data; then step 209 is executed.
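Purely as an illustration of the IDF-weighted cosine similarity of step 204, a minimal sketch follows. It assumes a FastText model already fine-tuned on the corpus is available as a word-to-vector mapping `wv`; `build_idf` and `weighted_cosine` are illustrative names, not functions from the embodiment.

```python
# Sketch of the IDF-weighted cosine similarity of step 204. `wv` (word ->
# vector) is an assumed, already fine-tuned FastText lookup; a real
# FastText model would also cover out-of-vocabulary words via subwords.
import math
import numpy as np

def build_idf(documents: list[list[str]]) -> dict[str, float]:
    """IDF(w) = log(N / (1 + number of documents containing w))."""
    n_docs = len(documents)
    doc_freq: dict[str, int] = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / (1 + df)) for w, df in doc_freq.items()}

def weighted_cosine(words_a: list[str], words_b: list[str], wv, idf) -> float:
    """Cosine similarity of IDF-weighted average FastText vectors."""
    def avg_vec(words):
        vecs = [idf.get(w, 1.0) * wv[w] for w in words]
        return np.mean(vecs, axis=0)
    va, vb = avg_vec(words_a), avg_vec(words_b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
```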
Step 206: calculating the similarity with Jaccard and averaging, then executing step 207.
Jaccard similarity mainly measures the common part of two words of similar length: the larger the common part, the more similar the two words. When the lengths of the words differ greatly, the similarity is calculated with vectors instead; for example, Jaccard cannot be used to calculate the similarity between "the People's Republic of China" and "China".
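The character-level Jaccard computation on the original Chinese strings can be sketched as follows; the wrapper name is an assumption for the example.

```python
# Character-level Jaccard similarity for two short strings.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Works well for similar-length strings (substantial character overlap):
print(jaccard("有限责任公司", "股份有限公司"))  # 4 shared chars of 8 -> 0.5
# Degenerates when lengths differ greatly, e.g. "中华人民共和国" vs "中国"
# (only 2/7), which is why the method switches to vector-based cosine
# similarity in that case.
print(jaccard("中华人民共和国", "中国"))
```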
Step 207: performing maximum-minimum normalization on the similarities of the features whose content length difference is less than or equal to 2, i.e. the features whose similarity is calculated with Jaccard, and then executing step 208.
Step 208: calculating a weighted sum of the similarities of the features of each word, sorting the words, and selecting the first K words as candidate words.
The weights used for the weighted sum may be obtained by grid search and cross-validation.
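As an illustration of how such weights might be chosen, a brute-force grid search over candidate weight vectors with a cross-validated score as the criterion could look like this; `evaluate` is an assumed dependency, not a function from the embodiment.

```python
# Sketch: choose feature weights for the weighted similarity sum by grid
# search. `evaluate(weights)` is an assumed function returning the
# cross-validated labeling accuracy obtained with the given weights.
import itertools
import numpy as np

def grid_search_weights(n_features: int, evaluate, steps: int = 5):
    """Try weight vectors on a coarse grid and keep the best one.

    The grid has steps**n_features points, i.e. exponential in the number
    of features; this is fine for the handful of features used here.
    """
    grid = np.linspace(0.0, 1.0, steps)
    best_weights, best_score = None, -1.0
    for combo in itertools.product(grid, repeat=n_features):
        total = sum(combo)
        if total == 0:
            continue
        weights = np.array(combo) / total  # normalize weights to sum to 1
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```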
Step 209: comparing the categories by the number of candidate words they contain, and taking the category with the maximum number of words of the same category (or, in the case of a tie, the at least two tied categories) as the category information of the predicted word.
According to the method provided by the embodiments of the application, when a deep model is not applicable because domain data is lacking, the sequence labeling problem is converted into a multi-classification problem, and a different classification algorithm is used for information extraction and for constructing the domain knowledge graph; by improving the algorithm while reproducing the KNN algorithm, the classification accuracy of the algorithm and the accuracy of information extraction are effectively improved.
The embodiment of the application provides a sample data processing device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor invokes the computer program in the memory to implement the following operations:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, including:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, including:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word and determining the similarity information between each training word and the predicted word, including:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
In an exemplary embodiment, the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, including:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
According to the device embodiment provided by the embodiments of the application, the feature information of a predicted word in sample data is acquired; the similarity information between the predicted word and each training word in pre-acquired training data is determined according to the feature information of the predicted word; K training words are selected as candidate words of the predicted word according to the similarity information of each training word; the labeling results of the K candidate words, which comprise the category information corresponding to each candidate word, are obtained; and the category information of the predicted word is determined according to the labeling results of the K candidate words. The sequence labeling problem is thereby converted into a multi-classification problem, which effectively improves classification speed and precision, reduces error, and improves the accuracy of information extraction. In addition, the device is simple to implement, easy to understand, requires no parameter estimation, and is well suited to multi-class application scenarios.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and functional modules/units in the apparatus and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.

Claims (6)

1. A method of processing sample data, comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words;
wherein the determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word comprises:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, which comprises:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
2. The method according to claim 1, wherein calculating the similarity of each piece of feature information of the same training word to determine the similarity information between each training word and the predicted word comprises:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
3. The method of claim 1, wherein determining the category information of the predicted word from the category information of the K candidate words comprises:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
4. A sample data processing device comprising a processor and a memory, wherein the memory stores a computer program, the processor invoking the computer program in the memory to perform operations comprising:
acquiring feature information of a predicted word in sample data;
determining similarity information between the predicted word and each training word in pre-acquired training data according to the feature information of the predicted word;
selecting K training words as candidate words of the predicted word according to the similarity information of each training word, wherein K is an integer greater than or equal to 2;
obtaining labeling results of the K candidate words, wherein the labeling results comprise the category information corresponding to each candidate word;
and determining the category information of the predicted word according to the labeling results of the K candidate words;
wherein the processor invokes the computer program in the memory to implement the operation of determining similarity information between the predicted word and each word in the pre-acquired training data according to the feature information of the predicted word, comprising:
calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word;
and aggregating the calculated similarities of the pieces of feature information of the same training word to determine the similarity information between each training word and the predicted word;
and the processor invokes the computer program in the memory to implement the operation of calculating similarity information between each piece of feature information of the predicted word and the corresponding piece of feature information of the same training word, comprising:
judging whether the difference between the lengths of the contents of the two sets of feature information under the same feature is greater than a preset length threshold, and obtaining a judgment result;
if the difference is greater than the length threshold, calculating, with a pre-acquired cosine similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature;
and if the difference is less than or equal to the length threshold, calculating, with a pre-acquired Jaccard similarity calculation strategy, the similarity between the content corresponding to the predicted word and the content corresponding to the training word under the same feature, and determining the similarity information of the predicted word and the training word for that feature.
5. The apparatus of claim 4, wherein the processor invokes the computer program in the memory to implement the operation of calculating the similarity of each piece of feature information of the same training word to determine the similarity information between each training word and the predicted word, comprising:
when the similarity of a feature is determined with the cosine similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, wherein the similarity information comprises at least one of the mean, the variance and the standard deviation of the similarities of these features, and performing Gaussian normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word;
and when the similarity of a feature is determined with the Jaccard similarity calculation strategy, obtaining the similarity information of each such feature of the same training word, and performing maximum-minimum normalization on the similarity information of these features of the same training word to obtain the similarity information between the training word and the predicted word.
6. The apparatus of claim 4, wherein the processor invokes the computer program in the memory to implement the operation of determining the category information of the predicted word from the category information of the K candidate words, comprising:
grouping the K candidate words by category information, and determining the total number of candidate words sharing the same category information among the K candidate words;
and selecting, according to these totals, the category information that satisfies a preset highest-usage-rate judgment strategy from the category information of the K candidate words, and taking it as the category information of the predicted word.
CN201911293517.3A 2019-12-16 2019-12-16 Sample data processing method and device Active CN111191689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293517.3A CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911293517.3A CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Publications (2)

Publication Number Publication Date
CN111191689A CN111191689A (en) 2020-05-22
CN111191689B (en) 2023-09-12

Family

ID=70711040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911293517.3A Active CN111191689B (en) 2019-12-16 2019-12-16 Sample data processing method and device

Country Status (1)

Country Link
CN (1) CN111191689B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095017A1 (en) * 2013-09-27 2015-04-02 Google Inc. System and method for learning word embeddings using neural language models
US10360507B2 (en) * 2016-09-22 2019-07-23 nference, inc. Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635273A (en) * 2018-10-25 2019-04-16 平安科技(深圳)有限公司 Text key word extracting method, device, equipment and storage medium
CN110378381A (en) * 2019-06-17 2019-10-25 华为技术有限公司 Object detecting method, device and computer storage medium
CN110555083A (en) * 2019-08-26 2019-12-10 北京工业大学 non-supervision entity relationship extraction method based on zero-shot
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"近十年国内移动学习研究现状与趋势——基于共词分析的知识图谱研究";付晓丽;《吕梁教育学院学报》;第35卷(第1期);30-33 *

Also Published As

Publication number Publication date
CN111191689A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN110188168B (en) Semantic relation recognition method and device
US20210382937A1 (en) Image processing method and apparatus, and storage medium
US20230039496A1 (en) Question-and-answer processing method, electronic device and computer readable medium
US20170262478A1 (en) Method and apparatus for image retrieval with feature learning
CN111538846A (en) Third-party library recommendation method based on mixed collaborative filtering
CN111177403B (en) Sample data processing method and device
CN114329244A (en) Map interest point query method, map interest point query device, map interest point query equipment, storage medium and program product
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN113076758B (en) Task-oriented dialog-oriented multi-domain request type intention identification method
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN111191689B (en) Sample data processing method and device
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN116681128A (en) Neural network model training method and device with noisy multi-label data
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium
CN114372148A (en) Data processing method based on knowledge graph technology and terminal equipment
CN114529191A (en) Method and apparatus for risk identification
CN113139382A (en) Named entity identification method and device
CN117235629B (en) Intention recognition method, system and computer equipment based on knowledge domain detection
US11934794B1 (en) Systems and methods for algorithmically orchestrating conversational dialogue transitions within an automated conversational system
CN113590747B (en) Method for intent recognition and corresponding system, computer device and medium
CN116578804A (en) Website security detection method, device and storage medium
CN116151246A (en) Method and device for generating document reading information
CN114357175A (en) Data mining system based on semantic network
CN115357723A (en) Entity relationship extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant