CN113392294A - Sample labeling method and device - Google Patents

Sample labeling method and device

Info

Publication number
CN113392294A
Authority
CN
China
Prior art keywords
vector
dimension
attribute
space
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011105032.XA
Other languages
Chinese (zh)
Other versions
CN113392294B (en)
Inventor
马莘权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011105032.XA priority Critical patent/CN113392294B/en
Publication of CN113392294A publication Critical patent/CN113392294A/en
Application granted granted Critical
Publication of CN113392294B publication Critical patent/CN113392294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/04 Payment circuits
    • G06Q20/06 Private payment circuits, e.g. involving electronic currency used among participants of a common payment scheme
    • G06Q20/065 Private payment circuits, e.g. involving electronic currency used among participants of a common payment scheme using e-cash
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/382 Payment protocols; Details thereof insuring higher security of transaction
    • G06Q20/3829 Payment protocols; Details thereof insuring higher security of transaction involving key management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a sample labeling method and device. The sample labeling method comprises the following steps: acquiring a service sample set, wherein the service sample set comprises at least one sample; performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set; performing category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set; and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information. The embodiment of the invention can effectively improve the efficiency and accuracy of sample labeling and reduce the labeling cost.

Description

Sample labeling method and device
Technical Field
The invention relates to the technical field of computers, in particular to a sample labeling method and device.
Background
There are three main ways of labeling samples in the prior art: manual labeling; machine evaluation at the literal (surface text) level followed by manual labeling; and constructing training samples and models for labeling through transfer learning.
However, both manual labeling and machine evaluation followed by manual labeling are time-consuming and labor-intensive, and are easily affected by the personal factors of the labeling personnel, so that the labeling results are inconsistent. The approach that constructs training samples and models through transfer learning requires different models for different services or different service requirements. Even for similar services, an existing model cannot be used directly and transfer learning must be carried out again, so the sample labeling efficiency is low.
Disclosure of Invention
The invention provides a sample labeling method and device, which can effectively improve the efficiency and accuracy of sample labeling and reduce the labor cost.
In a first aspect, the present invention provides a sample labeling method, including:
acquiring a service sample set, wherein the service sample set comprises at least one sample;
performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set;
performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain multi-dimension category information of the service sample set;
and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimension category information.
Optionally, the performing attribute analysis on the service sample set to obtain multidimensional attribute information of the service sample set includes:
taking each dimension vector space in a pre-constructed multi-dimension vector space as a target dimension vector space, and determining an attribute vector in the target dimension vector space to which each sample in the service sample set is mapped, wherein the attribute vectors of all samples in the service sample set in the target dimension vector space form one-dimension attribute information in the multi-dimension attribute information; each dimension vector space comprises a plurality of attribute vectors with position relations, and at least one attribute vector forms a benchmark vector.
Optionally, the performing category analysis on each piece of dimensional attribute information in the multi-dimensional attribute information to obtain the multi-dimensional category information of the service sample set includes:
respectively taking each sample in the service sample set as a target sample, and acquiring a benchmark vector closest to an attribute vector corresponding to the target sample in the target dimension vector space;
and performing category analysis on the obtained benchmark vectors to obtain category information of the target samples in the target dimension vector space, wherein the category information of all the samples in the service sample set in the target dimension vector space forms one-dimensional category information in the multi-dimensional category information.
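The nearest-benchmark lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the source says "closest" without naming a metric, so Euclidean distance is assumed, and the vectors, category names and the helper names `nearest_benchmark` and `categorize` are hypothetical.

```python
import math

# Hypothetical benchmark vectors of one dimension vector space,
# keyed by the category each one stands for (assumption).
BENCHMARKS = {
    "sports": (1.0, 0.0),
    "finance": (0.0, 1.0),
}

def nearest_benchmark(attribute_vector, benchmarks):
    """Return the key of the benchmark vector closest to attribute_vector.

    Euclidean distance is an assumption; the source does not fix the metric.
    """
    return min(benchmarks, key=lambda name: math.dist(attribute_vector, benchmarks[name]))

def categorize(sample_vectors, benchmarks):
    """Map every sample to the category of its nearest benchmark vector."""
    return {sample: nearest_benchmark(vec, benchmarks)
            for sample, vec in sample_vectors.items()}

# Hypothetical attribute vectors of two samples in the same space.
samples = {"match report": (0.9, 0.2), "stock news": (0.1, 0.8)}
categories = categorize(samples, BENCHMARKS)
```

The resulting dictionary plays the role of one dimension of category information; repeating the lookup for every dimension vector space would yield the multi-dimensional category information.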
Optionally, the performing category analysis on each piece of dimensional attribute information in the multi-dimensional attribute information to obtain the multi-dimensional category information of the service sample set includes:
clustering attribute vectors of all samples in the service sample set in the target dimension vector space to divide the service sample set into at least one subset;
respectively taking each subset in the business sample set as a target subset, and determining a benchmark vector of the target subset in the target dimension vector space;
and performing category analysis on the determined benchmark vectors to obtain category information of the target subsets in the target dimension vector space, wherein the category information of all subsets in the service sample set in the target dimension vector space constitutes one-dimensional category information in the multi-dimensional category information.
Optionally, the determining the benchmark vector of the target subset in the target dimension vector space includes:
detecting whether a benchmark vector exists among the attribute vectors of all samples in the target subset in the target dimension vector space;
and if so, taking the detected benchmark vector as the benchmark vector of the target subset in the target dimension vector space.
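The clustering variant above can be illustrated with a small sketch. The source does not name a clustering algorithm, so a simple greedy radius-based grouping is swapped in here purely for illustration; the `radius` parameter, the sample data and the function names are assumptions. The case where no benchmark vector is found in a subset is not covered by the text above, so the sketch simply returns `None` for it.

```python
import math

def greedy_cluster(vectors, radius):
    """Group samples whose vectors lie within `radius` of a cluster's first member.

    A stand-in for whatever clustering algorithm is actually used (assumption).
    """
    clusters = []  # each cluster is a list of sample names
    for name, vec in vectors.items():
        for cluster in clusters:
            if math.dist(vec, vectors[cluster[0]]) <= radius:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def subset_benchmark(cluster, vectors, benchmarks):
    """Return the benchmark whose vector coincides with a member's attribute vector, else None."""
    for name in cluster:
        for bench_name, bench_vec in benchmarks.items():
            if math.isclose(math.dist(vectors[name], bench_vec), 0.0, abs_tol=1e-9):
                return bench_name
    return None

# Hypothetical attribute vectors and benchmark vectors in one dimension vector space.
vectors = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0), "d": (0.1, 0.9)}
benchmarks = {"sports": (1.0, 0.0), "finance": (0.0, 1.0)}

clusters = greedy_cluster(vectors, radius=0.5)
# Every member of a subset inherits the category of the subset's benchmark vector.
labels = {name: subset_benchmark(cluster, vectors, benchmarks)
          for cluster in clusters for name in cluster}
```

Compared with the per-sample lookup, this variant performs one benchmark determination per subset rather than one per sample.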
Optionally, the labeling, according to at least one dimension category information in the multi-dimension category information, each sample in the service sample set includes:
selecting at least one dimension category information from the multi-dimension category information according to a preset service strategy;
and labeling the category of each sample of the service sample set according to the at least one dimension category information.
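A minimal sketch of strategy-driven labeling is given below. The strategy table, the service names and the category values are all hypothetical; the source only states that a preset service strategy selects at least one dimension of category information.

```python
# Hypothetical multi-dimensional category information: dimension -> {sample: category}.
category_info = {
    "entity": {"s1": "Player A", "s2": "Company B"},
    "concept": {"s1": "football player", "s2": "technology company"},
    "topic": {"s1": "sports", "s2": "business"},
}

# Hypothetical preset service strategies: service name -> dimensions to label with.
STRATEGIES = {
    "news_feed": ["topic"],
    "knowledge_cards": ["entity", "concept"],
}

def label_samples(category_info, dimensions):
    """Attach to each sample the categories of the selected dimensions only."""
    labels = {}
    for dim in dimensions:
        for sample, category in category_info[dim].items():
            labels.setdefault(sample, {})[dim] = category
    return labels

labels = label_samples(category_info, STRATEGIES["knowledge_cards"])
```

Because the category information is computed once for all dimensions, a different service only needs a different entry in the strategy table rather than a new model.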
Optionally, the method further includes:
constructing the multi-dimensional vector space;
setting a plurality of attribute vectors with a position relation in each dimension vector space, and selecting at least one attribute vector from the attribute vectors as a benchmark vector;
and correlating the benchmark vectors in different dimension vector spaces.
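One possible in-memory representation of these construction steps is sketched below. The class names `DimensionSpace` and `MultiDimSpace`, the use of plain coordinates to express the position relation, and the pairwise link set are illustrative assumptions, not the structure used in the source.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionSpace:
    """One dimension vector space: attribute vectors plus the subset marked as benchmarks."""
    name: str
    vectors: dict = field(default_factory=dict)    # attribute key -> coordinates
    benchmarks: set = field(default_factory=set)   # keys selected as benchmark vectors

    def add_vector(self, key, coords, benchmark=False):
        self.vectors[key] = tuple(coords)
        if benchmark:
            self.benchmarks.add(key)

@dataclass
class MultiDimSpace:
    """A collection of dimension vector spaces with associations between their benchmarks."""
    spaces: dict = field(default_factory=dict)     # dimension name -> DimensionSpace
    links: set = field(default_factory=set)        # unordered pairs of (dimension, key)

    def add_space(self, space):
        self.spaces[space.name] = space

    def associate(self, dim_a, key_a, dim_b, key_b):
        """Associate two benchmark vectors that live in different dimension vector spaces."""
        for dim, key in ((dim_a, key_a), (dim_b, key_b)):
            if key not in self.spaces[dim].benchmarks:
                raise ValueError(f"{key!r} is not a benchmark vector in {dim!r}")
        self.links.add(frozenset({(dim_a, key_a), (dim_b, key_b)}))

    def associated(self, dim, key):
        """Return the benchmark vectors associated with (dim, key), sorted for stable output."""
        out = []
        for link in self.links:
            if (dim, key) in link:
                out.extend(item for item in link if item != (dim, key))
        return sorted(out)

# Hypothetical example: one entity benchmark linked to one concept benchmark.
entity = DimensionSpace("entity")
entity.add_vector("player_a", (0.9, 0.1), benchmark=True)
entity.add_vector("player_b", (0.8, 0.2))
concept = DimensionSpace("concept")
concept.add_vector("football player", (0.7, 0.3), benchmark=True)

mds = MultiDimSpace()
mds.add_space(entity)
mds.add_space(concept)
mds.associate("entity", "player_a", "concept", "football player")
```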
Optionally, the method further includes:
adding a new benchmark vector in the target dimension vector space, and setting the position relation of the newly added benchmark vector in the target dimension vector space;
determining a benchmark vector associated with the newly added benchmark vector in the other dimension vector spaces;
detecting, according to the position relation of the associated benchmark vector in the other dimension vector spaces, whether the position relation of the newly added benchmark vector in the target dimension vector space meets a preset index requirement;
and if so, updating the multi-dimensional vector space.
Optionally, the multi-dimensional vector space includes an entity space, a concept space, a tag space, a phrase space, a semantic space, a topic space, and a point-of-interest trend space.
In a second aspect, the present invention provides a sample labeling apparatus, comprising:
an acquisition module, configured to acquire a service sample set, wherein the service sample set comprises at least one sample;
an attribute analysis module, configured to perform attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set;
a category analysis module, configured to perform category analysis on each dimension attribute information in the multi-dimension attribute information to obtain multi-dimension category information of the service sample set; and
a labeling module, configured to label each sample in the service sample set according to at least one dimension category information in the multi-dimension category information.
In a third aspect, the present invention provides a server comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a service sample set, wherein the service sample set comprises at least one sample;
performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set;
performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain multi-dimension category information of the service sample set;
and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimension category information.
In a fourth aspect, the present invention provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to perform the steps of the sample labeling method according to any one of the first aspect.
The embodiment of the invention acquires a service sample set, performs attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set, performs category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set, and labels each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information. The embodiment can perform multi-dimensional category detection on different samples in a service sample set at the same time, which effectively improves the efficiency and accuracy of sample labeling and reduces the labeling cost.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a scenario of a sample labeling system according to an embodiment of the present invention;
FIG. 2 is an optional structural diagram of the distributed system applied to a blockchain system according to the embodiment of the present invention;
FIG. 3 is an optional schematic diagram of a block structure according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a sample labeling method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of relationships among entities, concepts and phrases in the sample labeling method according to the embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating an effect of a multi-dimensional vector space in the sample labeling method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a benchmark vector relationship in a multi-dimensional vector space in the sample labeling method according to the embodiment of the present invention;
FIG. 8 is a schematic vector diagram of a topic space in the sample labeling method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a sample labeling method provided by an embodiment of the present invention;
FIG. 10 is a schematic flow chart of a sample annotation method according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a sample labeling apparatus provided in an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a server provided in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description that follows, specific embodiments of the present invention are described with reference to steps and symbols of operations performed by one or more computers, unless otherwise indicated. These steps and operations are therefore referred to several times as being computer-executed, which includes manipulation by the computer's processing unit of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. However, while the principles of the invention are described in the foregoing terms, this is not meant to be limiting, and those skilled in the art will appreciate that various steps and operations described hereinafter may also be implemented in hardware.
The term "module" or "unit" as used herein may be considered a software object executing on the computing system. The various components, modules, engines, and services described herein may be viewed as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may also be implemented in hardware, and are within the scope of the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, the associated listed items.
The embodiment of the invention provides a sample labeling method, a sample labeling device, a server and a storage medium.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-disciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to give computers intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the invention may be a sample labeling method related to artificial intelligence. That is, the embodiment of the invention provides a sample labeling method based on artificial intelligence, which comprises the following steps: acquiring a service sample set, wherein the service sample set comprises at least one sample; performing attribute analysis on the service sample set by using a machine learning algorithm to obtain multi-dimensional attribute information of the service sample set; performing category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set; and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
Referring to FIG. 1, FIG. 1 is a schematic view of a scenario of a sample labeling system according to an embodiment of the present invention, where the sample labeling system may include a server 10, and a sample labeling device is integrated in the server 10. In the embodiment of the present invention, the server 10 is mainly configured to: acquire a service sample set, where the service sample set includes at least one sample; perform attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set; perform category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set; and label each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
In this embodiment of the present invention, the server 10 may be an independent server, or may be a server network or a server cluster composed of servers. For example, the server 10 described in this embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers. The cloud server is composed of a large number of computers or network servers based on cloud computing.
Those skilled in the art will appreciate that the application environment shown in FIG. 1 is only one application scenario related to the present application and does not limit the application scenarios of the present application. Other application environments may include more or fewer servers than shown in FIG. 1, or different network connection relationships between servers. For example, only one server is shown in FIG. 1, but it is understood that the sample labeling system may further include one or more other servers, and/or one or more clients connected to the servers over a network, which is not limited herein.
In addition, as shown in FIG. 1, the sample labeling system may further include a memory 20 for storing data. For example, the memory 20 may include a sample database in which various service samples, such as news items, articles and phrases, are stored. The memory 20 may further include an attribute information database in which all attribute vectors in the multi-dimensional vector space are stored, and a category information database in which the category information corresponding to the benchmark vectors in the multi-dimensional vector space is stored.
It should be noted that the scenario diagram of the sample annotation system shown in fig. 1 is only an example, and the sample annotation system and the scenario described in the embodiment of the present invention are for more clearly illustrating the technical solution of the embodiment of the present invention, and do not form a limitation on the technical solution provided in the embodiment of the present invention.
The sample labeling system related to the embodiment of the present invention may be a distributed system formed by connecting a plurality of nodes (computing devices of any form in the access network, such as the server 10) through network communication.
Taking a blockchain system as an example of the distributed system, referring to FIG. 2, FIG. 2 is an optional structural schematic diagram of the distributed system 100 applied to a blockchain system. The system is formed by a plurality of nodes 200 (computing devices of any form in the access network, such as servers) and clients 300. A peer-to-peer (P2P) network is formed between the nodes, and the P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP). In the distributed system, any machine, such as a server or a terminal, can join to become a node, and a node comprises a hardware layer, an intermediate layer, an operating system layer and an application layer. In the embodiment of the present invention, each server 10 is a node in the blockchain system.
Referring to the functions of each node in the blockchain system shown in fig. 2, the functions involved include:
1) Routing, a basic function of a node, used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) Application, deployed in the blockchain to implement specific services according to actual service requirements. The application records data related to the implemented functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
For example, the services implemented by the application include:
2.1) Wallet, for providing electronic money transaction functions, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as a response confirming that the transaction is valid). The wallet also supports querying the electronic money remaining at an electronic money address.
2.2) Shared ledger, for providing operations such as storage, query and modification of account data. The record data of an operation on the account data is sent to other nodes in the blockchain system; after the other nodes verify that the operation is valid, the record data is stored in a temporary block as a response acknowledging that the account data is valid, and a confirmation can be sent to the node that initiated the operation.
2.3) Smart contract, a computerized agreement that can enforce the terms of a contract. It is implemented by code deployed on the shared ledger and executed when certain conditions are met, and is used to complete automated transactions according to actual business requirements, such as querying the logistics status of goods purchased by a buyer and transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods. Smart contracts are not limited to contracts for transactions; they may also execute contracts that process received information.
3) Blockchain, comprising a series of blocks that are connected to one another in the chronological order in which they were generated. Once a new block is added to the blockchain it cannot be removed, and the blocks record the record data submitted by the nodes in the blockchain system.
Referring to FIG. 3, FIG. 3 is an optional schematic diagram of a block structure according to an embodiment of the present invention, where each block includes a hash value of the transaction records stored in the block (the hash value of the block) and the hash value of the previous block, and the blocks are connected by the hash values to form a blockchain. A block may also include information such as a timestamp of when the block was generated. A blockchain is essentially a decentralized database: a string of data blocks linked by cryptographic methods, where each data block contains information used to verify the validity of its contents (anti-counterfeiting) and to generate the next block.
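The hash-linked block structure described above can be illustrated with a short sketch. The field names, the JSON serialization, the choice of SHA-256 and the decision to hash the whole block body (records, timestamp and previous hash together) are assumptions made for illustration; the source does not specify them.

```python
import hashlib
import json
import time

def _digest(body):
    """SHA-256 over a canonical JSON form of the block body (serialization is an assumption)."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def make_block(records, prev_hash, timestamp=None):
    """Build a block holding its records, a timestamp, the previous hash and its own hash."""
    body = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "records": records,
        "prev_hash": prev_hash,
    }
    return {**body, "hash": _digest(body)}

def verify_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        body = {k: block[k] for k in ("timestamp", "records", "prev_hash")}
        if block["hash"] != _digest(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Hypothetical two-block chain; fixed timestamps keep the example deterministic.
genesis = make_block(["genesis"], prev_hash="0" * 64, timestamp=0)
second = make_block(["multi-dimensional vector space, version 1"],
                    prev_hash=genesis["hash"], timestamp=1)
chain = [genesis, second]
```

Changing the records of an earlier block changes its hash, so the stored hashes no longer match and `verify_chain` reports the chain as invalid.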
When the sample labeling system is a blockchain system in the embodiment of the present invention, and the server is a node in the blockchain system in the embodiment of the present invention, the multidimensional vector space can be stored in the blockchain. Specifically, in the embodiment of the present invention, the method further includes: and constructing a multi-dimensional vector space, and storing the multi-dimensional vector space in a block chain in a block form. For a specific way of adding blocks, reference may be made to the description of the above-mentioned blockchain system, which is not described herein again.
The following is a detailed description of specific embodiments.
In the present embodiment, description will be made from the viewpoint of a sample labeling apparatus, which may be specifically integrated in the server 10.
The invention provides a sample labeling method, which comprises the following steps: acquiring a service sample set, wherein the service sample set comprises at least one sample; performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set; performing category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set; and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
Referring to fig. 4, a schematic flow chart of a sample labeling method according to an embodiment of the present invention is shown, where the sample labeling method includes:
101. Acquiring a service sample set, wherein the service sample set comprises at least one sample.
The sample in the embodiment of the invention refers to text information, such as news and comments on a website, and a sample can be a word, a phrase, a sentence, an article and the like. The service sample set may comprise samples of different services, that is, the samples in the service sample set may be of any type and relate to any aspect.
102. Performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set.
In the embodiment of the invention, multi-dimensional attribute analysis is performed simultaneously on the different samples in the service sample set; the analysis in each dimension yields attribute information of the service sample set, and together these form the multi-dimensional attribute information of the service sample set. The dimensions can be divided according to entities, concepts, phrases and the like that are common in natural language; for example, the dimensions include entities, concepts, tags (tag), phrases (phrase), semantics, topics (topic), point of interest variation trends and the like. Entities, concepts and phrases are the three most basic dimensions, from which the other dimensions can be extended. As shown in fig. 5, an entity refers to a person, thing, etc. with a unique reference, and its scope is the smallest. A concept refers to a group with a unique reference, and its scope is larger than that of an entity. A phrase includes words, phrases and the like that modify and describe only a single core word or phrase, and its scope is larger than that of a concept.
The multi-dimensional attribute information of the service sample set can be determined through the constructed multi-dimensional vector space. The dimensions of the attribute information correspond one-to-one to the dimension vector spaces, that is, the attribute information of each dimension is determined through the corresponding dimension vector space. The multi-dimensional vector space may include an entity space, a concept space, a tag space, a phrase space, a semantic space, a topic space, a point of interest variation trend space, and the like.
Specifically, the method further comprises: constructing a multi-dimensional vector space; setting a plurality of attribute vectors with a position relation in each dimension vector space, and selecting at least one attribute vector from the attribute vectors as a benchmark vector; and correlating the benchmark vectors in different dimension vector spaces.
It should be noted that samples of different services can be used to construct a multi-dimensional vector space, the entity space, the concept space, and the phrase space are three most basic vector spaces, and other vector spaces can be extended based on the most basic vector spaces. As shown in fig. 6, based on the phrase space, a tag space, a semantic space, a topic space, a point of interest variation trend space, and the like may be extended.
Each dimension vector space is a relatively independent space and has a set of corresponding methods to construct. For example, for an entity space, a training sample constructed based on a random walk sampling method (random walk sampling) may be used to train to obtain an entity vector model based on a knowledge graph, and then, based on attribute vectors output by the entity vector model, a position relationship between the attribute vectors is set to form the entity space. Other vector spaces can also be obtained according to the training of the respective suitable vector models, and are not described in detail herein.
Each dimension vector space is composed of a plurality of attribute vectors, and the types of the attribute vectors differ between dimension vector spaces. For example, the attribute vectors in the entity space are entity words such as "singer Xiaohua" and "professor Xiaoming", the attribute vectors in the concept space are group words such as "singer" and "teacher", and the attribute vectors in the phrase space are phrases such as "singer Xiaohua spring concert". The attribute vectors in each dimension vector space have a positional relationship, that is, a distance exists between any two attribute vectors, and this distance is related to the attribute similarity between the two attribute vectors: the greater the attribute similarity, the closer the two attribute vectors; the smaller the attribute similarity, the farther apart they are. For example, in the concept space, as shown in fig. 7, among the three attribute vectors "singer", "professor" and "teacher", the similarity between "professor" and "teacher" is higher, so "professor" is closer to "teacher"; the similarity between "singer" and both "professor" and "teacher" is lower, so "singer" is farther from "professor" and "teacher". That is, the distance between "professor" and "teacher" should be smaller than the distance between "professor" and "singer". Since each dimension vector space is a three-dimensional space, the distances between "singer", "professor" and "teacher" illustrated in fig. 7 do not represent their actual distances in the concept space.
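The relationship between distance and attribute similarity can be sketched as follows. The three-dimensional coordinates, the Euclidean distance and the mapping from distance to similarity are assumptions chosen for illustration; the embodiment only requires that a greater similarity correspond to a smaller distance.

```python
import math

# Hypothetical 3-D coordinates of three attribute vectors in the concept space.
CONCEPT_SPACE = {
    "singer": (0.9, 0.1, 0.2),
    "professor": (0.1, 0.8, 0.7),
    "teacher": (0.2, 0.9, 0.6),
}


def distance(name_a, name_b):
    """Euclidean distance between two attribute vectors."""
    return math.dist(CONCEPT_SPACE[name_a], CONCEPT_SPACE[name_b])


def similarity(name_a, name_b):
    """Assumed monotone mapping: distance 0 gives 1.0, larger distances give less."""
    return 1.0 / (1.0 + distance(name_a, name_b))
```

With these assumed coordinates, "professor" lies closer to "teacher" than to "singer", so its similarity to "teacher" is the higher of the two.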
At least one attribute vector in each dimension vector space can be used as a benchmark vector. A benchmark vector is selected for being a representative attribute vector: it generally represents the similar vectors in a certain region and serves a positioning function in the vector space. Similarly, the closer an attribute vector is to a benchmark vector, the higher the attribute similarity between them; the farther an attribute vector is from a benchmark vector, the lower the attribute similarity between them.
Different dimension vector spaces can be associated with one another through their benchmark vectors. For example, as shown in fig. 7, the benchmark vector "singer Xiaohua" in the entity space is associated with the benchmark vector "singer" in the concept space, and the benchmark vector "singer" in the concept space is associated with the benchmark vector "spring concert" in the phrase space.
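One possible way to record such associations is a symmetric link table keyed by (space, vector) pairs. The sketch below is illustrative only: the function names and the data layout are assumptions, and the entity name is rendered here as "singer Xiaohua".

```python
from collections import defaultdict

# Each benchmark vector is identified by a (space name, vector name) pair.
links = defaultdict(set)


def associate(vector_a, vector_b):
    """Record a two-way association between benchmark vectors of two spaces."""
    links[vector_a].add(vector_b)
    links[vector_b].add(vector_a)


def associated_in(vector, space):
    """List the benchmark vectors of `space` that are associated with `vector`."""
    return sorted(name for (other_space, name) in links[vector] if other_space == space)


associate(("entity", "singer Xiaohua"), ("concept", "singer"))
associate(("concept", "singer"), ("phrase", "spring concert"))
```

Because the links are stored in both directions, the concept-space benchmark vector "singer" can be used to reach its counterparts in both the entity space and the phrase space.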
Samples of different services can be used to update and optimize the multi-dimensional vector space, so that the multi-dimensional vector space becomes a carrier of the knowledge accumulated from historical services. Because the benchmark vectors in the different dimension vector spaces are associated with one another, after one dimension vector space is updated, the updated space can be checked against the other dimension vector spaces on indexes from multiple angles, such as recognition accuracy and ranking accuracy, and whether to update the multi-dimensional vector space is determined according to the checked indexes.
Specifically, the method further comprises: adding a new benchmark vector to the target dimension vector space, and setting the positional relationship of the new benchmark vector in the target dimension vector space; determining the benchmark vectors in the other dimension vector spaces that are associated with the new benchmark vector; checking, according to the positional relationships of the associated benchmark vectors in the other dimension vector spaces, whether the positional relationship of the new benchmark vector in the target dimension vector space meets a preset index requirement; and if so, updating the multi-dimensional vector space.
When a new benchmark vector needs to be added to the target dimension vector space, the positional relationship of the new benchmark vector in the target dimension vector space is set, that is, the distances between the new benchmark vector and the other benchmark vectors in the target dimension vector space are set. Because distance is associated with attribute similarity, the other benchmark vectors in the target dimension vector space can be ranked by their attribute similarity to the new benchmark vector: the benchmark vector ranked first has the greatest attribute similarity to the new benchmark vector, and the benchmark vector ranked last has the smallest. Meanwhile, the benchmark vector in another dimension vector space that is associated with the new benchmark vector is taken as the first benchmark vector, and the remaining benchmark vectors in that dimension vector space are taken as second benchmark vectors. The second benchmark vectors are ranked by their attribute similarity to the first benchmark vector, as given by their distances from it: the second benchmark vector ranked first has the greatest attribute similarity to the first benchmark vector, and the one ranked last has the smallest.
The ranking in the target dimension vector space is then compared with the ranking in the other dimension vector space. The two rankings are consistent if the benchmark vectors at the same rank are associated with each other, that is, the benchmark vector ranked first in the target dimension vector space is associated with the benchmark vector ranked first in the other dimension vector space, the benchmark vector ranked second is associated with the one ranked second, and so on. If the rankings are consistent, the positional relationship of the new benchmark vector in the target dimension vector space is determined to meet the preset index requirement (the index is equal or better). If the rankings are inconsistent, whether the ranking in the target dimension vector space is more accurate is checked manually; if it is, the positional relationship of the new benchmark vector in the target dimension vector space is likewise determined to meet the preset index requirement (the index is equal or better). When the requirement is met, the positional relationship of the new benchmark vector in the target dimension vector space and its associations with the benchmark vectors of the other dimension vector spaces are stored, thereby updating the multi-dimensional vector space. Otherwise (that is, the positional relationship of the new benchmark vector in the target dimension vector space does not meet the preset index requirement), the new benchmark vector is not added and the multi-dimensional vector space is left unchanged.
For example, a benchmark vector "singer Xiaoming" is newly added to the entity space, which also contains another benchmark vector "professor Xiaoming", and the distance between "singer Xiaoming" and "professor Xiaoming" is set. "Singer Xiaoming" is associated with the benchmark vector "singer" in the concept space, while "professor Xiaoming" is associated with the benchmark vector "teacher" in the concept space. A first attribute similarity is determined from the distance between "singer" and "teacher" in the concept space, and a second attribute similarity is determined from the distance between "singer Xiaoming" and "professor Xiaoming" in the entity space. If the second attribute similarity is consistent with the first attribute similarity, for example if it lies within the error range of the first attribute similarity, the position of the new benchmark vector "singer Xiaoming" in the entity space is determined to be reasonable and to meet the preset index requirement, and the multi-dimensional vector space is updated; otherwise, the new benchmark vector "singer Xiaoming" is removed and the multi-dimensional vector space is not updated.
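The two consistency checks described above, ranking consistency and similarity within an error range, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the tolerance value, the numeric distances and the benchmark vector "actor A" with its concept "actor" are hypothetical and do not come from the embodiment.

```python
def rank_by_distance(distances):
    """Order names from nearest (most similar) to farthest (least similar)."""
    return sorted(distances, key=distances.get)


def ranks_consistent(target_distances, other_distances, association):
    """Check that benchmark vectors at the same rank are associated with each other.

    target_distances: distance from the new benchmark vector to each existing
        benchmark vector in the target dimension vector space.
    other_distances: distance from the associated first benchmark vector to each
        second benchmark vector in the other dimension vector space.
    association: maps each benchmark vector of the target space to its
        associated benchmark vector in the other space.
    """
    mapped = [association[name] for name in rank_by_distance(target_distances)]
    return mapped == rank_by_distance(other_distances)


def similarities_consistent(first_similarity, second_similarity, tolerance=0.05):
    """Check that the second similarity lies within an assumed error range."""
    return abs(first_similarity - second_similarity) <= tolerance


# Hypothetical distances from the new benchmark vector "singer Xiaoming".
entity_distances = {"professor Xiaoming": 0.85, "actor A": 0.25}
# Hypothetical distances from its associated concept "singer".
concept_distances = {"teacher": 0.90, "actor": 0.30}
association = {"professor Xiaoming": "teacher", "actor A": "actor"}
```

If either check fails, the sketch leaves the decision to the caller, mirroring the manual check and the rejection of the new benchmark vector described in the text.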
In addition, when an attribute vector is newly added to the target dimension vector space, whether the update meets the preset index requirement can also be checked from the positional relationships within the target dimension vector space itself. Specifically, after the distances between the newly added attribute vector and the other attribute vectors are set in the target dimension vector space, the benchmark vector closest to the newly added attribute vector is found, the category of the newly added attribute vector is identified from the category of that benchmark vector, and the accuracy of the identified category is checked.
Through samples of different services, the multi-dimensional vector space can accumulate, integrate and upgrade knowledge. Whichever dimension vector space is updated, the other dimension vector spaces can serve as a positive reference that promotes the updating and improvement of the whole multi-dimensional vector space. Once the multi-dimensional vector space has been improved, it can provide service samples with more precise and accurate basic capabilities such as classification and labeling.
After the multi-dimensional vector space is constructed, the multi-dimensional vector space can be adopted to perform attribute analysis on the service sample set. Specifically, the performing attribute analysis on the service sample set to obtain multidimensional attribute information of the service sample set includes: and taking each dimension vector space in a pre-constructed multi-dimension vector space as a target dimension vector space, determining an attribute vector in the target dimension vector space to which each sample in the service sample set is mapped, wherein the attribute vectors of all samples in the service sample set in the target dimension vector space form one-dimension attribute information in the multi-dimension attribute information.
After the service sample set is obtained, mapping each sample in the service sample set to a multi-dimensional vector space one by one for attribute analysis. Since each dimension vector space is constructed by a vector model, that is, the attribute vector in each dimension vector space is determined by the vector model, the sample can be input into the corresponding vector model of each dimension vector space to determine the attribute vector of the sample in the dimension vector space. For example, when determining an attribute vector of a sample in an entity space, the sample is input into an entity vector model, and an attribute vector output by the entity vector model is an attribute vector mapped to the entity space by the sample.
The same sample maps to different attribute vectors in different dimension vector spaces, and different samples may map to the same attribute vector in the same dimension vector space. For example, sample A is "Huazi holds a concert in spring", which maps to the attribute vector "singer Xiaohua" in the entity space and to the attribute vector "singer" in the concept space. Sample B is "singer Xiaoming appears at the airport", which maps to the attribute vector "singer Xiaoming" in the entity space and to the attribute vector "singer" in the concept space.
103. Performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain the multi-dimension category information of the service sample set.
Because the vector models used to construct the different dimension vector spaces differ, the attribute vectors to which a sample is mapped in different dimension vector spaces have different attributes, and the attribute category of the sample therefore differs between dimension vector spaces. The category of the sample is thus analyzed in each dimension vector space, and the categories of the sample across the dimension vector spaces are then combined to determine the final category of the sample.
The present embodiment can analyze the category information of each sample in each dimension vector space one by one. Specifically, the performing category analysis on each piece of dimensional attribute information in the multidimensional attribute information to obtain the multidimensional category information of the service sample set includes: respectively taking each sample in the service sample set as a target sample, and acquiring a benchmark vector closest to an attribute vector corresponding to the target sample in the target dimension vector space; and performing category analysis on the obtained benchmark vectors to obtain category information of the target samples in the target dimension vector space, wherein the category information of all the samples in the service sample set in the target dimension vector space forms one-dimensional category information in the multi-dimensional category information.
After the attribute vectors of the target samples in the target dimension vector space are determined, because distances for representing attribute similarity are set between different vectors in the target dimension vector space, the benchmark vector closest to the attribute vectors is obtained, namely the benchmark vector with the highest similarity to the attribute vectors is determined, the categories of the attribute vectors can be determined according to the categories of the benchmark vectors, and then the categories of the target samples are determined. Wherein the category of the benchmarking vector may be preset.
For example, as shown in fig. 8, in the topic space the attribute vectors in a certain region include "novel coronavirus", "novel coronavirus pneumonia", "live broadcast of the latest novel coronavirus situation", "symptoms of novel coronavirus pneumonia", "latest news on novel coronavirus pneumonia", "detection of asymptomatic novel coronavirus pneumonia infection", "asymptomatic novel coronavirus pneumonia infection", and the like, and the most representative attribute vector, "novel coronavirus pneumonia", is determined as the benchmark vector. When the attribute vector to which the target sample is mapped in the topic space is any attribute vector in this region, the benchmark vector closest to the attribute vector of the target sample is determined to be "novel coronavirus pneumonia". Since the category of the benchmark vector "novel coronavirus pneumonia" is preset as the "disease" category, the category information of the target sample in the topic space can be determined as "disease".
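The per-sample lookup of the nearest benchmark vector and its preset category can be sketched as follows. The two-dimensional coordinates, the second benchmark vector and its "entertainment" category are hypothetical values introduced only to make the sketch runnable.

```python
import math

# Hypothetical positions and preset categories of benchmark vectors in the topic space.
BENCHMARKS = {
    "novel coronavirus pneumonia": {"position": (1.0, 1.0), "category": "disease"},
    "spring concert": {"position": (8.0, 7.0), "category": "entertainment"},
}


def nearest_benchmark(attribute_position):
    """Return the name of the benchmark vector closest to the attribute vector."""
    return min(
        BENCHMARKS,
        key=lambda name: math.dist(attribute_position, BENCHMARKS[name]["position"]),
    )


def category_of(attribute_position):
    """Category of a sample is taken from its nearest benchmark vector."""
    return BENCHMARKS[nearest_benchmark(attribute_position)]["category"]


# Assumed position of an attribute vector in the same region as the pneumonia benchmark.
symptoms_position = (1.3, 0.8)
```

Any attribute vector that falls in the region around the pneumonia benchmark vector inherits its preset "disease" category under this sketch.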
The embodiment can also cluster the samples first and analyze the category information of each type of sample in each dimension vector space. Specifically, the performing category analysis on each piece of dimensional attribute information in the multidimensional attribute information to obtain the multidimensional category information of the service sample set includes: clustering attribute vectors of all samples in the service sample set in the target dimension vector space to divide the service sample set into at least one subset; respectively taking each subset in the business sample set as a target subset, and determining a benchmark vector of the target subset in the target dimension vector space; and performing category analysis on the determined benchmark vectors to obtain category information of the target subsets in the target dimension vector space, wherein the category information of all subsets in the service sample set in the target dimension vector space constitutes one-dimensional category information in the multi-dimensional category information.
After the attribute vector of each sample in the service sample set in the target dimension vector space is determined, the samples are clustered by comparing the pairwise similarity of the attribute vectors. For example, the service sample set includes sample C, sample D and sample E, which correspond to attribute vector C, attribute vector D and attribute vector E, respectively, in the target dimension vector space. According to the positional relationship among attribute vector C, attribute vector D and attribute vector E in the target dimension vector space, the similarity between attribute vector C and attribute vector D, the similarity between attribute vector C and attribute vector E, and the similarity between attribute vector E and attribute vector D can each be calculated. Attribute vectors whose similarity is greater than a preset similarity threshold are grouped into one class. For example, if the similarity between attribute vector C and attribute vector D is 0.9, the similarity between attribute vector C and attribute vector E is 0.3, and the similarity between attribute vector E and attribute vector D is 0.4, then attribute vector C and attribute vector D are grouped into one class and attribute vector E forms a class of its own. According to the clustering of the attribute vectors, the service sample set is divided into at least one subset; for example, sample C corresponding to attribute vector C and sample D corresponding to attribute vector D form one subset, and sample E corresponding to attribute vector E forms another subset.
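The threshold-based grouping in the example above can be sketched as follows. The threshold value of 0.8 and the choice to merge groups transitively (single-link grouping with a union-find structure) are assumptions of this sketch; the embodiment only states that attribute vectors whose similarity exceeds a preset threshold are grouped into one class.

```python
def cluster_by_threshold(items, pair_similarity, threshold):
    """Group items whose pairwise similarity exceeds the threshold."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links to the group representative, compressing the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (left, right), value in pair_similarity.items():
        if value > threshold:
            parent[find(left)] = find(right)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Similarities taken from the example in the text; the threshold is assumed.
similarities = {("C", "D"): 0.9, ("C", "E"): 0.3, ("E", "D"): 0.4}
subsets = cluster_by_threshold(["C", "D", "E"], similarities, threshold=0.8)
```

With the similarities from the example, samples C and D fall into one subset and sample E into another.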
Further, the benchmark vector corresponding to each subset is determined according to the attribute vectors of that subset in the target dimension vector space. Specifically, determining the benchmark vector of the target subset in the target dimension vector space includes: detecting whether the attribute vectors, in the target dimension vector space, of the samples in the target subset include a benchmark vector; and if so, taking the detected benchmark vector as the benchmark vector of the target subset in the target dimension vector space.
If the attribute vector of a sample in the target subset in the target dimension vector space is detected to be a benchmark vector, that benchmark vector is taken as the benchmark vector of the target subset in the target dimension vector space; if none of the attribute vectors of the samples in the target subset is a benchmark vector, the benchmark vector of the target subset can be set manually. After the benchmark vector of the target subset is determined, the category of the target subset can be determined from the category of the benchmark vector, which in turn determines the category of each sample in the target subset. The category of the benchmark vector can be preset.
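A minimal sketch of this subset-level lookup is given below. The function names and the `manual_choice` parameter, which stands in for the manually set benchmark vector, are assumptions of the sketch.

```python
def subset_benchmark(subset_vectors, benchmark_categories, manual_choice=None):
    """Return the first attribute vector of the subset that is a benchmark vector.

    Falls back to `manual_choice` (a manually set benchmark vector) when none of
    the subset's attribute vectors is a benchmark vector.
    """
    for name in subset_vectors:
        if name in benchmark_categories:
            return name
    return manual_choice


def subset_category(subset_vectors, benchmark_categories, manual_choice=None):
    """Category of the subset, taken from the preset category of its benchmark vector."""
    benchmark = subset_benchmark(subset_vectors, benchmark_categories, manual_choice)
    return benchmark_categories.get(benchmark)


# Preset benchmark categories; the entries follow the example used in the text.
benchmark_categories = {"novel coronavirus pneumonia": "disease"}
```

Every sample in the subset then receives the returned category, so the lookup is performed once per subset rather than once per sample.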
Because the dimension vector spaces are of different types, the attribute vectors to which the samples are mapped differ in their attributes from one dimension vector space to another, so the clustering results of the attribute vectors, and hence the subsets into which the service sample set is divided, also differ. That is, the subsets of the service sample set may differ between dimension vector spaces, and the category information of the samples may differ between dimension vector spaces.
For example, a service sample set includes "small rock airport", "small cross airport" and "small tomb airport". In the topic space, the similarity between the attribute vectors of "small rock airport" and "small cross airport" is 0.99, and the similarity between the attribute vectors of "small rock airport" and "small tomb airport" is 0.43, so "small rock airport" and "small cross airport" are divided into one subset, whose category information is determined to be "hot star", and "small tomb airport" is divided into another subset, whose category information is determined to be "cold star". In the phrase space, the similarity between the attribute vectors of "small rock airport" and "small cross airport" is 0.62, and the similarity between the attribute vectors of "small rock airport" and "small tomb airport" is 0.64, so "small rock airport", "small cross airport" and "small tomb airport" are divided into one subset, whose category information is determined to be "star".
In addition, vector spaces exhibit symmetry; for example, in the phrase space, phrases containing "exist" and "none", or "strong" and "weak", are symmetric vectors. When the category information of a sample is determined, the benchmark vector corresponding to the sample can be inverted using this symmetry (for example, "exist" in the benchmark vector is changed to "none"), and omissions in the category analysis of the sample can be detected from the category of the inverted benchmark vector.
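The inversion of a benchmark vector by symmetry can be sketched at the word level as follows. The word-swapping approach, the example phrases and the categories "severe" and "mild" are assumptions for illustration; the embodiment does not state how the inversion is computed.

```python
# Symmetric word pairs named in the text; the list is illustrative, not exhaustive.
SYMMETRIC_PAIRS = [("exist", "none"), ("strong", "weak")]

SWAP = {}
for left, right in SYMMETRIC_PAIRS:
    SWAP[left] = right
    SWAP[right] = left


def invert(benchmark_phrase):
    """Replace every symmetric word in the phrase with its counterpart."""
    return " ".join(SWAP.get(word, word) for word in benchmark_phrase.split())


def inverted_category(benchmark_phrase, benchmark_categories):
    """Look up the preset category of the inverted benchmark vector, if any."""
    return benchmark_categories.get(invert(benchmark_phrase))


# Hypothetical benchmark phrases and preset categories.
phrase_categories = {"strong symptoms": "severe", "weak symptoms": "mild"}
```

The category returned for the inverted benchmark vector can then be compared with the categories already considered for the sample, to see whether one was overlooked.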
104. Labeling each sample in the service sample set according to at least one dimension type information in the multi-dimension type information.
Because the category information may differ between dimensions, the category of each sample in the service sample set can be determined comprehensively from the multi-dimensional category information. For example, if the category information of the target sample is category A in four of the dimension vector spaces and category B in two of the dimension vector spaces, the category of the target sample is labeled as category A.
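The example above amounts to choosing the category that occurs in the most dimension vector spaces. A minimal sketch of such a vote is given below; the assignment of categories to particular spaces and the handling of ties (returning `None` so that the sample can be reviewed manually) are assumptions of the sketch.

```python
from collections import Counter


def majority_category(categories_by_space):
    """Return the category found in the most spaces, or None on a tie or no input."""
    counts = Counter(categories_by_space.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical per-space categories: four spaces give A, two spaces give B.
target_sample = {
    "entity": "A",
    "concept": "A",
    "tag": "A",
    "phrase": "A",
    "semantic": "B",
    "topic": "B",
}
```

For the target sample above the vote yields category A, matching the example in the text.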
After the multi-dimensional category information is obtained, at least one-dimensional category information can be selected according to actual requirements to determine the category of each sample in the service sample set. Specifically, the labeling each sample in the service sample set according to at least one dimension category information in the multi-dimension category information includes: selecting at least one dimension category information from the multi-dimension category information according to a preset service strategy; and labeling the category of each sample of the service sample set according to the at least one dimension category information.
The service strategy can be set according to actual service requirements, so that samples of different services can be labeled at the same time. For example, if the service strategy is service relevance, the category information of the service sample set in the semantic space and the topic space is selected from the multi-dimensional category information as the basis for sample labeling. If the service strategy is topic transfer and recommendation, the category information of the service sample set in the topic space and the point of interest variation trend space is selected from the multi-dimensional category information as the basis for sample labeling; in addition, the sample labels for service relevance can also serve as a basis for the sample labeling of topic transfer and recommendation. If the service strategy is professional vocabulary labeling, the category information of the service sample set in the entity space, the concept space, the tag space and the topic space is selected from the multi-dimensional category information as the basis for sample labeling. If the service strategy is multi-modal labeling, the category information of the service sample set in the topic space and the semantic space is selected from the multi-dimensional category information as the basis for sample labeling. If the service strategy is quality grading, the category information of the service sample set in the topic space is selected from the multi-dimensional category information as the basis for sample labeling. Furthermore, the sample labels for professional vocabulary can also serve as the basis for the sample labeling of user domain expertise, and the sample labels for topic transfer and recommendation together with those for user domain expertise can serve as the basis for the sample labeling of the user portrait, as shown in fig. 6.
If the category information of the target sample is the same in all the selected dimension vector spaces, the target sample is labeled with that category; if the category information differs between the selected dimension vector spaces, the sample can be labeled after a comprehensive manual evaluation.
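The selection of dimension vector spaces by service strategy, followed by the agreement check, can be sketched as follows. The table keys, the space names and the use of `None` to signal that manual evaluation is needed are naming assumptions of this sketch; the strategy-to-space assignments follow the description above.

```python
# Spaces consulted for each service strategy, as listed in the description.
STRATEGY_SPACES = {
    "service relevance": ("semantic", "topic"),
    "topic transfer and recommendation": ("topic", "point of interest variation trend"),
    "professional vocabulary labeling": ("entity", "concept", "tag", "topic"),
    "multi-modal labeling": ("topic", "semantic"),
    "quality grading": ("topic",),
}


def label_sample(categories_by_space, strategy):
    """Label a sample from the spaces selected by the strategy.

    Returns the shared category when all selected spaces agree, and None when
    they disagree, signalling that manual evaluation is needed.
    """
    selected = {categories_by_space[space] for space in STRATEGY_SPACES[strategy]}
    if len(selected) == 1:
        return selected.pop()
    return None


# Hypothetical per-space categories of one sample.
sample_categories = {
    "entity": "A",
    "concept": "A",
    "tag": "A",
    "phrase": "B",
    "semantic": "B",
    "topic": "A",
    "point of interest variation trend": "A",
}
```

Keeping the strategy table as data means that a new service strategy can be supported by adding one entry, without changing the labeling function.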
For example, the service strategy is star popularity grading, that is, the service wants to focus on hot stars and to respond faster to events involving them. In the prior art, hot stars are graded by the follower counts of accounts such as the stars' microblogs, but this approach is time-consuming and labor-intensive, has difficulty detecting faked follower counts, and has difficulty keeping the grading information up to date. The embodiment of the invention collects star-related texts on the network as samples and maps the samples into the multi-dimensional vector space to obtain the category information of each dimension. For example, the category information of the samples "small rock airport", "small cross airport" and "small tomb airport" differs between the topic space and the phrase space; the category information of the service sample set in the topic space is selected as the basis for sample labeling, so the samples "small rock airport" and "small cross airport" are labeled as the "hot star" category and the sample "small tomb airport" is labeled as the "cold star" category, thereby labeling the samples accurately and quickly.
When the one-dimensional class information in step 103 includes class information of each sample in the service sample set in the corresponding dimensional vector space, each sample in the service sample set is labeled one by one according to the class information of each sample. When the one-dimensional class information in step 103 includes class information of each subset in the service sample set in the corresponding dimensional vector space, the samples of each subset in the service sample set are integrally labeled according to the class information of each subset, so as to realize rapid labeling of the samples.
The embodiment carries out sample labeling based on the multi-dimensional vector space, can quickly obtain different dimensional category information, and has the advantages of more flexibility, lower expansion cost, stronger cross-service and cross-scene universal capability and higher reusability compared with the labeling method in the prior art.
After each sample in the service sample set has been labeled with a category, the labeled samples can be used directly for service processing, for example: detecting service relevance, for use in topic transfer and recommendation; detecting professional vocabulary, for use in detecting users' domain expertise and constructing user portraits; and detecting multi-modal classification results, for use in labeling services, as shown in fig. 9. The labeled samples can also be screened, the screened samples used to train a service customization model, and the service customization model then used for service processing. A service customization model is an algorithm model custom-developed according to service needs such as processing time and memory size. Training the service customization model with samples labeled by the embodiment of the invention as training samples is more automated, faster, more accurate and lower in cost than in the prior art.
The embodiment of the invention obtains the multi-dimensional attribute information of the service sample set by obtaining the service sample set and performing attribute analysis on the service sample set, obtains the multi-dimensional category information of the service sample set by performing category analysis on each dimension attribute information in the multi-dimensional attribute information, and labels each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information. The embodiment can simultaneously carry out multi-dimensional type detection on different samples in a service sample set, effectively improves the efficiency and accuracy of sample labeling, and reduces the labeling cost.
The sample labeling method in the embodiment of the present invention is described below with reference to a specific application scenario.
Referring to fig. 10, a schematic flow chart of another embodiment of a sample annotation method according to an embodiment of the present invention is shown, where the sample annotation method is applied to a server, and the sample annotation method includes:
201. Constructing an n-dimensional vector space, wherein n is greater than or equal to 2.
For example, the n-dimensional vector space includes an entity space, a concept space, a label space, a phrase space, a semantic space, a topic space, and a point of interest variation trend space, each of which constitutes the vector space of one dimension.
202. A plurality of attribute vectors with position relations are arranged in each dimension vector space, and at least one attribute vector is selected from the attribute vectors to serve as a benchmark vector.
The positional relationship among the attribute vectors represents their similarity: the closer two attribute vectors are, the higher their similarity, and the farther apart they are, the lower their similarity.
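As a non-limiting illustration of how position can encode similarity, the sketch below measures Euclidean distance between attribute vectors and maps it to a score that grows as vectors get closer. The source does not name a distance measure; Euclidean distance, the function names, and the example vectors are all assumptions made for this sketch.

```python
import math


def euclidean_distance(u, v):
    """Distance between two attribute vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """Map distance to a score in (0, 1]; closer vectors score higher.

    The 1 / (1 + d) mapping is an assumption chosen for illustration.
    """
    return 1.0 / (1.0 + euclidean_distance(u, v))


# Hypothetical attribute vectors in one dimension vector space.
basketball = [0.9, 0.1]
football = [0.8, 0.2]
cooking = [0.1, 0.9]

near = similarity(basketball, football)
far = similarity(basketball, cooking)
```

Here `near` is larger than `far`, matching the statement that closer attribute vectors are more similar.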
203. Acquiring a service sample set, wherein the service sample set comprises a sample X and a sample Y.
The samples in the service sample set may belong to different services, that is, sample X and sample Y may belong to different services. Sample X and sample Y are only an example; the service sample set may instead include a single sample or more than two samples.
204. Determining the attribute vectors Xi and Yi to which samples X and Y map in the i-th dimension vector space, where i = 1, 2, …, n.
Sample X maps to attribute vector X1 in the first dimension vector space, sample X maps to attribute vector X2 in the second dimension vector space, and so on, sample X maps to attribute vector Xn in the nth dimension vector space. Likewise, sample Y maps to attribute vector Y1 in the first dimension vector space, sample Y maps to attribute vector Y2 in the second dimension vector space, and so on, sample Y maps to attribute vector Yn in the nth dimension vector space.
205. Determining the benchmark vectors Xi1 and Yi1 closest to the attribute vectors Xi and Yi, respectively, in the i-th dimension vector space.
X11 is the benchmark vector closest to attribute vector X1 in the first dimension vector space, X21 is the benchmark vector closest to attribute vector X2 in the second dimension vector space, and so on, up to Xn1, the benchmark vector closest to attribute vector Xn in the n-th dimension vector space. Similarly, Y11 is the benchmark vector closest to attribute vector Y1 in the first dimension vector space, Y21 is the benchmark vector closest to attribute vector Y2 in the second dimension vector space, and so on, up to Yn1, the benchmark vector closest to attribute vector Yn in the n-th dimension vector space.
206. Determining the category information of samples X and Y in the i-th dimension vector space according to the category information of the benchmark vectors Xi1 and Yi1.
The category of sample X in the first dimension vector space is determined according to the category of benchmark vector X11, the category of sample X in the second dimension vector space is determined according to the category of benchmark vector X21, and so on, up to the category of sample X in the n-th dimension vector space, which is determined according to the category of benchmark vector Xn1. Similarly, the category of sample Y in the first dimension vector space is determined according to the category of benchmark vector Y11, the category of sample Y in the second dimension vector space is determined according to the category of benchmark vector Y21, and so on, up to the category of sample Y in the n-th dimension vector space, which is determined according to the category of benchmark vector Yn1.
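Steps 204 to 206 can be sketched roughly as follows, assuming that each dimension vector space stores one benchmark vector per category and that "closest" means smallest Euclidean distance. Neither assumption is stated in the source. The space names, categories, and vectors are hypothetical, and the mapping of a sample to its attribute vectors is taken as given.

```python
import math


def nearest_benchmark(attribute_vector, benchmarks):
    """Return the category whose benchmark vector is closest.

    benchmarks maps a category name to its benchmark vector.
    Ties go to the category that appears first in the mapping.
    """
    category, _ = min(
        benchmarks.items(),
        key=lambda item: math.dist(attribute_vector, item[1]),
    )
    return category


def categories_per_space(sample_vectors, spaces):
    """Return {space name: category} for one sample.

    sample_vectors maps a space name to the sample's attribute vector there.
    spaces maps a space name to that space's {category: benchmark vector}.
    """
    return {
        name: nearest_benchmark(vector, spaces[name])
        for name, vector in sample_vectors.items()
    }


# Two hypothetical dimension vector spaces with their benchmark vectors.
spaces = {
    "topic": {"sports": [1.0, 0.0], "food": [0.0, 1.0]},
    "entity": {"athlete": [1.0, 1.0], "dish": [-1.0, -1.0]},
}
# Attribute vectors that a hypothetical sample X maps to in each space.
sample_x = {"topic": [0.8, 0.1], "entity": [0.7, 0.9]}

categories_x = categories_per_space(sample_x, spaces)
```

A linear scan over the benchmark vectors is used for clarity; a large space would likely need an index structure, which is outside the scope of this sketch.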
207. Selecting the category information of samples X and Y in the second dimension vector space and the sixth dimension vector space according to a preset service strategy.
It should be noted that the second-dimension vector space and the sixth-dimension vector space are selected as an example, and in practical applications, the category information of the sample X, Y in other dimension vector spaces may also be selected according to other business strategies.
208. Labeling the categories of samples X and Y according to their category information in the second dimension vector space and the sixth dimension vector space.
If the category information of sample X in both the second dimension vector space and the sixth dimension vector space is category A, sample X is labeled as category A; if the category information of sample Y in both the second dimension vector space and the sixth dimension vector space is category B, sample Y is labeled as category B.
In summary, the embodiment of the invention departs from the traditional machine-learning practice of building a separate model for every scenario of every service. It accumulates and integrates the knowledge of each service into a multidimensional vector space and labels samples of different services based on that space, thereby greatly reducing cost and improving labeling efficiency and accuracy.
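Steps 207 and 208 can be illustrated with a short sketch. It assumes the preset service strategy is simply a tuple of space names and that a sample receives a label only when all selected spaces agree. The source gives only the agreeing case, so returning None on disagreement is an assumption, as are the function name and the space names used here.

```python
def label_sample(categories, selected_spaces):
    """Label a sample from its categories in the selected spaces.

    categories maps a space name to the sample's category in that space.
    Returns the shared category when every selected space agrees.
    Returns None otherwise (assumed behavior, not specified in the source).
    """
    chosen = {categories[name] for name in selected_spaces}
    if len(chosen) == 1:
        return next(iter(chosen))
    return None


# Hypothetical strategy: use the second and sixth spaces from the example
# list in step 201 (concept space and topic space).
STRATEGY = ("concept", "topic")

sample_x = {"entity": "C", "concept": "A", "topic": "A"}
sample_y = {"entity": "C", "concept": "B", "topic": "B"}

label_x = label_sample(sample_x, STRATEGY)
label_y = label_sample(sample_y, STRATEGY)
```

Categories in spaces outside the strategy, such as the entity space here, do not affect the result, so changing the strategy changes the label without recomputing any category information.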
In order to better implement the sample labeling method provided by the embodiment of the invention, the embodiment of the invention also provides a device based on the sample labeling method. The meanings of the terms are the same as those in the sample labeling method, and the specific implementation details can refer to the description in the method embodiment.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a sample annotation device according to an embodiment of the present invention, wherein the sample annotation device includes:
an obtaining module 31, configured to obtain a service sample set, where the service sample set includes at least one sample;
the attribute analysis module 32 is configured to perform attribute analysis on the service sample set to obtain multidimensional attribute information of the service sample set;
the category analysis module 33 is configured to perform category analysis on each piece of dimensional attribute information in the multidimensional attribute information to obtain multidimensional category information of the service sample set;
and a labeling module 34, configured to label each sample in the service sample set according to the category information of at least one dimension in the multi-dimensional category information.
Optionally, the attribute analysis module 32 is further configured to:
taking each dimension vector space in a pre-constructed multi-dimension vector space as a target dimension vector space, and determining an attribute vector in the target dimension vector space to which each sample in the service sample set is mapped, wherein the attribute vectors of all samples in the service sample set in the target dimension vector space form one-dimension attribute information in the multi-dimension attribute information; each dimension vector space comprises a plurality of attribute vectors with position relations, and at least one attribute vector forms a benchmark vector.
Optionally, the category analysis module 33 is further configured to:
respectively taking each sample in the service sample set as a target sample, and acquiring a benchmark vector closest to an attribute vector corresponding to the target sample in the target dimension vector space;
and performing category analysis on the obtained benchmark vectors to obtain category information of the target samples in the target dimension vector space, wherein the category information of all the samples in the service sample set in the target dimension vector space forms one-dimensional category information in the multi-dimensional category information.
Optionally, the category analysis module 33 is further configured to:
clustering attribute vectors of all samples in the service sample set in the target dimension vector space to divide the service sample set into at least one subset;
respectively taking each subset in the business sample set as a target subset, and determining a benchmark vector of the target subset in the target dimension vector space;
and performing category analysis on the determined benchmark vectors to obtain category information of the target subsets in the target dimension vector space, wherein the category information of all subsets in the service sample set in the target dimension vector space constitutes one-dimensional category information in the multi-dimensional category information.
Optionally, the category analysis module 33 is further configured to:
detecting whether a benchmark vector exists among the attribute vectors of all samples of the target subset in the target dimension vector space;
and if so, taking the detected benchmark vector as the benchmark vector of the target subset in the target dimension vector space.
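The subset-based variant can be sketched as follows. The source does not name a clustering algorithm, so a simple greedy distance-threshold grouping stands in for it. The radius, function names, and vectors are hypothetical, and a benchmark vector is treated as present in a subset only on exact coordinate equality.

```python
import math


def greedy_cluster(vectors, radius):
    """Group vector indices; each vector joins the first cluster whose
    seed (first member) lies within radius, else it starts a new cluster."""
    clusters = []
    for i, vector in enumerate(vectors):
        for cluster in clusters:
            if math.dist(vector, vectors[cluster[0]]) <= radius:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def subset_benchmark(subset_vectors, benchmarks):
    """Return the category of the first subset vector that is itself a
    benchmark vector, or None when the subset contains no benchmark."""
    for vector in subset_vectors:
        for category, benchmark in benchmarks.items():
            if tuple(vector) == tuple(benchmark):
                return category
    return None


# Hypothetical attribute vectors of four samples in one target space.
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
benchmarks = {"A": (0.0, 0.0), "B": (5.0, 5.0)}

clusters = greedy_cluster(vectors, radius=0.5)
labels = [None] * len(vectors)
for cluster in clusters:
    category = subset_benchmark([vectors[i] for i in cluster], benchmarks)
    for i in cluster:
        labels[i] = category  # the whole subset shares one label
```

Only one benchmark lookup is made per subset rather than per sample, which is where the speed-up of labeling a subset as a whole would come from.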
Optionally, the labeling module 34 is further configured to:
selecting at least one dimension category information from the multi-dimension category information according to a preset service strategy;
and labeling the category of each sample of the service sample set according to the at least one dimension category information.
Optionally, the apparatus further comprises a building module, wherein the building module is configured to:
constructing the multi-dimensional vector space;
setting a plurality of attribute vectors with a position relation in each dimension vector space, and selecting at least one attribute vector from the attribute vectors as a benchmark vector;
and correlating the benchmark vectors in different dimension vector spaces.
Optionally, the apparatus further includes an update module, and the update module is further configured to:
adding a new benchmark vector in the target dimension vector space, and setting the position relation of the new benchmark vector in the target dimension vector space;
determining the benchmark vectors associated with the newly added benchmark vector in the other dimension vector spaces;
detecting, according to the position relation of the associated benchmark vectors in the other dimension vector spaces, whether the position relation of the newly added benchmark vector in the target dimension vector space meets a preset index requirement;
and if so, updating the multi-dimensional vector space.
Optionally, the multidimensional vector space includes an entity space, a concept space, a tag space, a phrase space, a semantic space, a topic space, and a focus variation trend space.
The embodiment of the invention obtains the multi-dimensional attribute information of the service sample set by obtaining the service sample set and performing attribute analysis on the service sample set, obtains the multi-dimensional category information of the service sample set by performing category analysis on each dimension attribute information in the multi-dimensional attribute information, and labels each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information. According to the embodiment, multi-dimensional type detection can be simultaneously performed on each sample in the service sample set, the sample labeling efficiency and accuracy are effectively improved, and the labeling cost is reduced.
An embodiment of the present invention further provides a server, as shown in fig. 12, which shows a schematic structural diagram of the server according to the embodiment of the present invention, specifically:
the server may include components such as a processor 801 of one or more processing cores, memory 802 of one or more computer-readable storage media, a power supply 803, and an input unit 804. Those skilled in the art will appreciate that the server architecture shown in FIG. 12 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
The processor 801 is the control center of the server. It connects the various parts of the server through various interfaces and lines, and performs the functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the server as a whole. Optionally, the processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 801.
The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created through use of the server, and the like. Further, the memory 802 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 801 with access to the memory 802.
The server further comprises a power supply 803 that supplies power to each component. Preferably, the power supply 803 is logically connected to the processor 801 through a power management system, so that charging, discharging, power consumption management, and similar functions are handled by the power management system. The power supply 803 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The server may further include an input unit 804, and the input unit 804 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 801 in the server loads the executable file corresponding to the process of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802, thereby implementing various functions as follows:
obtaining a service sample set, performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set, performing category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set, and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium having stored therein a plurality of instructions that can be loaded by a processor to perform the steps of any of the sample labeling methods provided by embodiments of the present invention. For example, the instructions may perform the steps of:
obtaining a service sample set, performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set, performing category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set, and labeling each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any sample labeling method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any sample labeling method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The sample labeling method, device, server and storage medium provided by the embodiments of the present invention are described in detail above, and the principle and implementation of the present invention are explained in this document by applying specific examples, and the description of the above embodiments is only used to help understanding the method and core ideas of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for labeling a sample, comprising:
acquiring a service sample set, wherein the service sample set comprises at least one sample;
performing attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set;
performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain multi-dimension category information of the service sample set;
and labeling each sample in the service sample set according to at least one dimension type information in the multi-dimension type information.
2. The method for labeling samples according to claim 1, wherein the analyzing the attributes of the service sample set to obtain the multidimensional attribute information of the service sample set comprises:
taking each dimension vector space in a pre-constructed multi-dimension vector space as a target dimension vector space, and determining an attribute vector in the target dimension vector space to which each sample in the service sample set is mapped, wherein the attribute vectors of all samples in the service sample set in the target dimension vector space form one-dimension attribute information in the multi-dimension attribute information; each dimension vector space comprises a plurality of attribute vectors with position relations, and at least one attribute vector forms a benchmark vector.
3. The method for labeling samples according to claim 2, wherein the performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain the multi-dimension category information of the service sample set comprises:
respectively taking each sample in the service sample set as a target sample, and acquiring a benchmark vector closest to an attribute vector corresponding to the target sample in the target dimension vector space;
and performing category analysis on the obtained benchmark vectors to obtain category information of the target samples in the target dimension vector space, wherein the category information of all the samples in the service sample set in the target dimension vector space forms one-dimensional category information in the multi-dimensional category information.
4. The method for labeling samples according to claim 2, wherein the performing category analysis on each dimension attribute information in the multi-dimension attribute information to obtain the multi-dimension category information of the service sample set comprises:
clustering attribute vectors of all samples in the service sample set in the target dimension vector space to divide the service sample set into at least one subset;
respectively taking each subset in the business sample set as a target subset, and determining a benchmark vector of the target subset in the target dimension vector space;
and performing category analysis on the determined benchmark vectors to obtain category information of the target subsets in the target dimension vector space, wherein the category information of all subsets in the service sample set in the target dimension vector space constitutes one-dimensional category information in the multi-dimensional category information.
5. The method of claim 4, wherein the determining the target subset's benchmarking vector in the target dimension vector space comprises:
detecting whether a benchmark vector exists among the attribute vectors of all samples of the target subset in the target dimension vector space;
and if so, taking the detected benchmark vector as the benchmark vector of the target subset in the target dimension vector space.
6. The method for labeling samples according to claim 1, wherein the labeling each sample in the set of business samples according to at least one dimension category information in the multi-dimension category information comprises:
selecting at least one dimension category information from the multi-dimension category information according to a preset service strategy;
and labeling the category of each sample of the service sample set according to the at least one dimension category information.
7. The method of claim 2, further comprising:
constructing the multi-dimensional vector space;
setting a plurality of attribute vectors with a position relation in each dimension vector space, and selecting at least one attribute vector from the attribute vectors as a benchmark vector;
and correlating the benchmark vectors in different dimension vector spaces.
8. The method of claim 7, further comprising:
adding a new benchmark vector in the target dimension vector space, and setting the position relation of the new benchmark vector in the target dimension vector space;
determining the benchmark vectors associated with the newly added benchmark vector in the other dimension vector spaces;
detecting, according to the position relation of the associated benchmark vectors in the other dimension vector spaces, whether the position relation of the newly added benchmark vector in the target dimension vector space meets a preset index requirement;
and if so, updating the multi-dimensional vector space.
9. The sample labeling method of claim 2, wherein the multi-dimensional vector space comprises an entity space, a concept space, a label space, a phrase space, a semantic space, a topic space, and a point of interest trend space.
10. A sample annotation device, said device comprising:
an acquisition module, configured to acquire a service sample set, wherein the service sample set comprises at least one sample;
an attribute analysis module, configured to perform attribute analysis on the service sample set to obtain multi-dimensional attribute information of the service sample set;
a category analysis module, configured to perform category analysis on each dimension attribute information in the multi-dimensional attribute information to obtain multi-dimensional category information of the service sample set; and
a labeling module, configured to label each sample in the service sample set according to at least one dimension category information in the multi-dimensional category information.
CN202011105032.XA 2020-10-15 2020-10-15 Sample labeling method and device Active CN113392294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011105032.XA CN113392294B (en) 2020-10-15 2020-10-15 Sample labeling method and device

Publications (2)

Publication Number Publication Date
CN113392294A true CN113392294A (en) 2021-09-14
CN113392294B CN113392294B (en) 2023-11-10

Family

ID=77616524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011105032.XA Active CN113392294B (en) 2020-10-15 2020-10-15 Sample labeling method and device

Country Status (1)

Country Link
CN (1) CN113392294B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578394A (en) * 2022-12-09 2023-01-06 湖南省中医药研究院 Pneumonia image processing method based on asymmetric network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6990238B1 (en) * 1999-09-30 2006-01-24 Battelle Memorial Institute Data processing, analysis, and visualization system for use with disparate data types
CN109359301A (en) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 A kind of the various dimensions mask method and device of web page contents
CN111046275A (en) * 2019-11-19 2020-04-21 腾讯科技(深圳)有限公司 User label determining method and device based on artificial intelligence and storage medium
CN111461180A (en) * 2020-03-12 2020-07-28 平安科技(深圳)有限公司 Sample classification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113392294B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN112241481B (en) Cross-modal news event classification method and system based on graph neural network
CN110059198A (en) A kind of discrete Hash search method across modal data kept based on similitude
CN109885692A (en) Knowledge data storage method, device, computer equipment and storage medium
CN106407208B (en) A kind of construction method and system of city management ontology knowledge base
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN109886294A (en) Knowledge fusion method, apparatus, computer equipment and storage medium
CN105518658A (en) Apparatus, systems, and methods for grouping data records
CN110765301B (en) Picture processing method, device, equipment and storage medium
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
Ke et al. TabNN: A universal neural network solution for tabular data
Chen et al. Enhanced discrete multi-modal hashing: More constraints yet less time to learn
CN113723853A (en) Method and device for processing post competence demand data
CN112330510A (en) Volunteer recommendation method and device, server and computer-readable storage medium
Omurca et al. A document image classification system fusing deep and machine learning models
Shen et al. Clustering-driven deep adversarial hashing for scalable unsupervised cross-modal retrieval
CN114491071A (en) Food safety knowledge graph construction method and system based on cross-media data
CN114330476A (en) Model training method for media content recognition and media content recognition method
CN112711645B (en) Method and device for expanding position point information, storage medium and electronic equipment
CN114281984A (en) Risk detection method, device and equipment and computer readable storage medium
CN113392294B (en) Sample labeling method and device
CN113761291A (en) Processing method and device for label classification
CN108959664A (en) Distributed file system based on picture processor
CN116244497A (en) Cross-domain paper recommendation method based on heterogeneous data embedding
Du et al. A general fine-grained truth discovery approach for crowdsourced data aggregation
CN112749246A (en) Search phrase evaluation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40051772; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant