CN110472055B - Method and device for marking data - Google Patents

Method and device for marking data

Info

Publication number
CN110472055B
CN110472055B (application CN201910775144.7A)
Authority
CN
China
Prior art keywords
data
similarity
labeled
preset
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910775144.7A
Other languages
Chinese (zh)
Other versions
CN110472055A (en)
Inventor
李晓东
罗雪峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910775144.7A priority Critical patent/CN110472055B/en
Publication of CN110472055A publication Critical patent/CN110472055A/en
Application granted granted Critical
Publication of CN110472055B publication Critical patent/CN110472055B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure provide a method and apparatus for labeling data. One embodiment of the method comprises: in response to receiving data to be labeled, querying a predetermined number of records that belong to different clusters and have the highest first similarity; calculating a second similarity between each of the predetermined number of records and the data to be labeled; placing into a data set those records whose second similarity to the data to be labeled exceeds a predetermined clustering threshold; and, if the data set is not empty and contains no record whose second similarity to the data to be labeled exceeds a predetermined data-merging threshold, using the cluster of the record in the data set with the highest second similarity as the cluster of the data to be labeled and inserting the data to be labeled into a predetermined database, wherein the data-merging threshold is greater than the clustering threshold. This embodiment speeds up the cloud computation and improves the efficiency and effect of the labeling work.

Description

Method and device for marking data
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for labeling data.
Background
With the rapid development of artificial intelligence technology, intelligent customer service systems are being deployed at large scale and are gradually replacing traditional manual customer service. To improve the accuracy and recall of an intelligent customer service system, the conversation data it generates must be labeled promptly. However, the more requests the system handles, the more conversation data it produces, and annotators must repeatedly label large numbers of identical or similar questions, which lowers their working efficiency and delays the labeling work.
Existing labeling systems either label the detailed data of each conversation without clustering identical or similar records, which leads to repeated labeling of the same data, or run offline clustering over the labeled data, whose long computation time prevents the data from being labeled in time.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for labeling data.
In a first aspect, an embodiment of the present disclosure provides a method for annotating data, including: in response to receiving data to be labeled, querying a predetermined number of records that belong to different clusters and have the highest first similarity; calculating a second similarity between each of the predetermined number of records and the data to be labeled; placing into a data set those records, among the predetermined number, whose second similarity to the data to be labeled exceeds a predetermined clustering threshold; and, if the data set is not empty and contains no record whose second similarity to the data to be labeled exceeds a predetermined data-merging threshold, using the cluster of the record in the data set with the highest second similarity as the cluster of the data to be labeled and inserting the data to be labeled into a predetermined database, wherein the data-merging threshold is greater than the clustering threshold.
In some embodiments, the method further comprises: and if the data set is empty, generating a new cluster for the data to be labeled, and inserting the data to be labeled into a preset database.
In some embodiments, the method further comprises: if the data set contains a record whose second similarity to the data to be labeled is greater than the predetermined data-merging threshold, incrementing by 1 the count of the record in the data set most similar to the data to be labeled.
In some embodiments, after placing into the data set those records, among the predetermined number, whose second similarity to the data to be annotated exceeds the predetermined clustering threshold, the method further comprises: calculating the second similarity of every two records in the data set; and merging the records in the data set based on the second similarity.
In some embodiments, merging data in the data set based on the second similarity includes: and merging the data with the second similarity larger than a preset data merging threshold value.
In some embodiments, merging data in the data set based on the second similarity includes: and merging the clusters of the data with the second similarity between the preset clustering threshold and the preset data merging threshold.
In some embodiments, the method further comprises: and displaying the data in a descending order according to the total data amount corresponding to the cluster identification.
In a second aspect, an embodiment of the present disclosure provides an apparatus for annotating data, including: a query unit configured to, in response to receiving data to be labeled, query a predetermined number of records that belong to different clusters and have the highest first similarity; a calculation unit configured to calculate a second similarity between each of the predetermined number of records and the data to be labeled; an aggregation unit configured to place into a data set those records, among the predetermined number, whose second similarity to the data to be labeled exceeds a predetermined clustering threshold; and an inserting unit configured to, if the data set is not empty and contains no record whose second similarity to the data to be labeled exceeds a predetermined data-merging threshold, use the cluster of the record in the data set with the highest second similarity as the cluster of the data to be labeled and insert the data to be labeled into a predetermined database, wherein the data-merging threshold is greater than the clustering threshold.
In some embodiments, the insertion unit is further configured to: and if the data set is empty, generating a new cluster for the data to be labeled, and inserting the data to be labeled into a preset database.
In some embodiments, the insertion unit is further configured to: if the data set contains a record whose second similarity to the data to be labeled is greater than the predetermined data-merging threshold, increment by 1 the count of the record in the data set most similar to the data to be labeled.
In some embodiments, the aggregation unit is further configured to: after data with the second similarity exceeding a preset clustering threshold value with the data to be labeled in the preset number of data is put into a data set, calculating the second similarity of any two data in the data set; and merging the data in the data set based on the second similarity.
In some embodiments, the aggregation unit is further configured to: and merging the data with the second similarity larger than a preset data merging threshold value.
In some embodiments, the aggregation unit is further configured to: and merging the clusters of the data with the second similarity between the preset clustering threshold and the preset data merging threshold.
In some embodiments, the apparatus further comprises a presentation unit configured to: and displaying the data in a descending order according to the total data amount corresponding to the cluster identification.
In a third aspect, an embodiment of the present disclosure provides an electronic device for annotating data, including: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method as in any embodiment of the first aspect.
In a fourth aspect, embodiments of the disclosure provide a computer-readable medium having a computer program stored thereon, where the program, when executed by a processor, implements the method as in any embodiment of the first aspect.
The method and apparatus for labeling data provided by the embodiments of the present disclosure perform real-time clustering of data during the session-data collection stage. This solves the problem of repeatedly labeling identical or similar data, reduces the amount of data to be labeled, helps annotators label high-frequency questions first, and improves the efficiency and effect of the labeling work.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for annotating data in accordance with the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of the method for annotating data in accordance with the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for annotating data according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for annotating data according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the relevant invention and do not restrict it. It should be noted that, for ease of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the disclosed method for annotating data or apparatus for annotating data can be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices that have a display screen and support a session, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, implemented either as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background dialogue server that supports the human-machine dialogues on the terminal devices 101, 102, 103. The background dialogue server may analyze a received question and feed the processing result (e.g., an answer) back to the terminal device. The server may also collect conversations for labeling and screen out high-frequency questions to facilitate manual labeling.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for annotating data provided by the embodiments of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for annotating data is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for annotating data in accordance with the present disclosure is illustrated. The method for labeling data comprises the following steps:
Step 201, in response to receiving the data to be labeled, querying a predetermined number of records that belong to different clusters and have the highest first similarity.
In this embodiment, an execution subject of the method for annotating data (e.g., the server shown in FIG. 1) may receive data to be annotated (e.g., a question posed by a user) over a wired or wireless connection from a terminal with which the user performs human-computer interaction. When the human-computer interaction system finishes replying to a question, it sends the dialogue data asynchronously to a data channel, which guarantees real-time delivery of the data without affecting the performance of the dialogue system.
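The asynchronous hand-off described above, in which dialogue data is pushed into a channel and drained by worker threads, might be sketched as follows. The queue-based channel, the `None` sentinel convention, and the function names are illustrative assumptions, not the patent's implementation:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def consume(channel, handle, workers=4):
    """Drain dialogue items from `channel`, handing each to a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            item = channel.get()
            if item is None:           # sentinel marks the end of the stream
                break
            pool.submit(handle, item)  # process each item asynchronously
        # leaving the `with` block waits for all submitted tasks to finish
```

Because the dialogue system only enqueues and never waits on the labeling work, its own response latency is unaffected, which matches the design goal stated above.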
The server receives the conversation data from the data channel in streaming mode and hands the received data to a thread pool for processing. Using the question in the received dialogue data Q, it queries a DB (the storage database of dialogue-labeling data) that supports full-text retrieval for the N records that belong to different clusters and have the highest first similarity. The first similarity may measure text similarity via TF-IDF (term frequency-inverse document frequency) or a similar method. Computing the first similarity of two sentences comprises the following steps:
1. Segment each complete sentence into an independent set of words using a Chinese word-segmentation algorithm.
2. Form the union of the two word sets (the word bag).
3. Compute the term frequency of each word set and vectorize it.
4. Substitute the vectors into the calculation model to obtain the text similarity.
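A minimal sketch of the four steps above, with whitespace splitting standing in for a real Chinese word segmenter (such as jieba) and cosine similarity standing in for the vector calculation model; all names are illustrative:

```python
import math

def first_similarity(a: str, b: str) -> float:
    """Coarse text similarity following the four steps above."""
    # 1. segment each sentence into an independent list of words
    #    (whitespace split is a stand-in for Chinese word segmentation)
    wa, wb = a.split(), b.split()
    # 2. union of the two word sets: the "word bag"
    bag = sorted(set(wa) | set(wb))
    # 3. term-frequency vector of each sentence over the bag
    va = [wa.count(w) for w in bag]
    vb = [wb.count(w) for w in bag]
    # 4. cosine of the two vectors as the text similarity
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0
```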
The first similarity uses a faster calculation to coarsely screen similar records out of the DB; the similarity between the screened records and Q is then computed precisely with another similarity algorithm. This reduces the number of records for which similarity must be computed and speeds up the calculation. For example, the similarity between Q and each record in the DB is computed with TF-IDF, and the record with the highest similarity is selected from each cluster; a predetermined number of records with the highest similarity are then chosen from those. That is, at most one record per cluster is selected for the next calculation.
In this embodiment, the data to be annotated may be a question alone or a question-answer pair. Either the first or the second similarity may be computed over the "question" alone, without considering the "answer".
Step 202, calculating a second similarity between the predetermined number of data and the data to be labeled.
In this embodiment, the second similarity may be a semantic similarity. In many cases the similarity between words is not computed directly; generally the distance between words is computed first and then converted into a similarity. The semantic distance is usually computed in one of two ways: statistically, from a large corpus, or based on some ontology or classification relationship.
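As a hedged illustration of the distance-to-similarity conversion described above, the toy word vectors below stand in for statistics learned from a large corpus or an ontology; the vectors, the Euclidean distance, and the 1/(1+d) conversion are all assumptions chosen for clarity:

```python
import math

# Toy 2-d word vectors; a real system would learn these from a corpus
# or derive distances from an ontology/classification relationship.
VECS = {"refund": (1.0, 0.2), "return": (0.9, 0.3), "weather": (0.1, 1.0)}

def semantic_similarity(w1: str, w2: str) -> float:
    d = math.dist(VECS[w1], VECS[w2])  # distance between the word vectors
    return 1.0 / (1.0 + d)             # convert distance into a similarity
```

Words whose vectors lie close together (small distance) receive a similarity near 1, while distant words approach 0, which is the behavior the second-similarity stage relies on.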
Step 203, putting into a data set those records, among the predetermined number, whose second similarity to the data to be labeled exceeds a predetermined clustering threshold.
In this embodiment, after the second similarity between each of the predetermined number of records and the data to be labeled has been computed in turn, the records whose second similarity exceeds the predetermined clustering threshold S1 are placed in the data set U.
Step 204, if the data set is not empty and no record in it has a second similarity to the data to be labeled greater than the predetermined data-merging threshold, using the cluster of the record in the data set with the highest second similarity as the cluster of the data to be labeled, and inserting the data to be labeled into a predetermined database.
In this embodiment, if the data set is not empty, the predetermined database contains data identical or similar to the data to be labeled, and it must be further determined which of the two applies. If the second similarity is greater than the data-merging threshold, the records are considered identical; if it is only greater than the clustering threshold, they are considered similar and belong to the same cluster. The data-merging threshold is greater than the clustering threshold. If no identical record exists but one or more similar records do, the cluster of the most similar record is taken as the cluster of the data to be labeled.
Step 205, if the data set is empty, a new cluster is generated for the data to be labeled, and the data to be labeled is inserted into a predetermined database.
In this embodiment, if the data set is empty, it indicates that there is no data in the predetermined database that is the same as or similar to the data to be labeled, a new cluster is created for the data to be labeled, and then the data to be labeled is inserted into the predetermined database according to the new cluster.
In step 206, if the data set contains a record whose second similarity to the data to be labeled is greater than the predetermined data-merging threshold, the count of the record in the data set most similar to the data to be labeled is incremented by 1.
In this embodiment, if the data set contains a record whose second similarity to the data to be annotated is greater than the predetermined data-merging threshold, the data to be annotated is merged with that record. If several records exceed the data-merging threshold, only the count of the record most similar to the data to be annotated is incremented by 1.
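The decision logic of steps 203 to 206 can be sketched roughly as follows. S1 (clustering threshold), S2 (data-merging threshold, S2 > S1) and every function and variable name here are illustrative assumptions rather than the patent's implementation:

```python
def assign_cluster(item, candidates, clusters, counts, S1=0.6, S2=0.9):
    """candidates: {record: second similarity to `item`};
    clusters: {record: cluster id}; counts: {record: occurrence count}.
    Returns a label for the action taken."""
    # step 203: keep only candidates above the clustering threshold S1
    U = {r: s for r, s in candidates.items() if s > S1}
    if not U:
        # step 205: nothing similar exists, so open a new cluster
        clusters[item] = item      # use the item itself as a new cluster id
        counts[item] = 1
        return "new-cluster"
    best = max(U, key=U.get)
    if U[best] > S2:
        # step 206: effectively the same data, so just count it (no insert)
        counts[best] += 1
        return "counted"
    # step 204: similar but not identical, so inherit the best cluster
    clusters[item] = clusters[best]
    counts[item] = 1
    return "joined-cluster"
```

The ordering of the checks mirrors the text: an empty U means new data, a similarity above S2 means a duplicate, and anything in between joins the most similar existing cluster.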
In some optional implementations of this embodiment, the method further includes displaying the data in descending order of the total amount of data per cluster identifier. Each cluster has a corresponding cluster identifier. An aggregation query is performed in the storage DB using the cluster identifiers computed during data processing, and the data are listed in descending order of the total amount per identifier, so that high-frequency cluster data are displayed prominently. Annotators can then label the queried clusters in the DB from high frequency to low, resolving high-frequency questions efficiently at minimal time cost and optimizing the effect of the intelligent customer service system to the greatest extent.
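The descending-order display by cluster total might look like the following sketch, using a simple in-memory count in place of the aggregation query against the storage DB; the record shape is an assumption:

```python
from collections import Counter

def rank_clusters(records):
    """records: iterable of (record, cluster_id) pairs.
    Returns cluster ids sorted from most to least frequent, so that
    annotators see high-frequency clusters first."""
    totals = Counter(cid for _, cid in records)
    return [cid for cid, _ in totals.most_common()]  # descending by total
```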
With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for labeling data according to this embodiment. In the scenario of FIG. 3, the user sends a question Q to the server through the terminal. Using the question in the received dialogue data Q, the server searches the full-text-retrieval DB (the storage DB of dialogue-labeling data) for the N records that belong to different clusters and have the highest similarity, computes the semantic similarity between those N records and Q, and places the records whose similarity exceeds S1 (the clustering threshold) in the set U. If the resulting set U is empty, a new cluster is generated for the session data Q, the record is inserted into the DB storing the session-labeling data, and processing ends. If U is not empty and contains a record whose similarity to Q exceeds S2, no record is inserted; the count of the record in U most similar to Q is incremented by 1 and processing ends. If U contains no record whose similarity to Q exceeds S2, the cluster of the record in U with the highest similarity to Q becomes the cluster of the new record, a new labeled record is inserted, and processing ends.
In the method provided by the above embodiment of the present disclosure, texts whose semantic similarity exceeds the clustering threshold are aggregated into one cluster, and texts whose semantic similarity exceeds the data-merging threshold are merged. A cluster thus gathers a series of records with similarity greater than S1, records with similarity greater than S2 having been merged. Users can label the clusters from the highest data volume downward, spending the least time addressing the highest-frequency data online; meanwhile, merging based on S2 greatly reduces the labeling workload and improves labeling efficiency.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for annotating data is illustrated. The flow 400 of the method for annotating data comprises the steps of:
step 401, in response to receiving the data to be labeled, querying a predetermined number of data with different clusters and the highest first similarity.
Step 402, calculating a second similarity between the predetermined number of data and the data to be labeled.
And step 403, putting data, of which the second similarity with the data to be labeled exceeds a preset clustering threshold value, in the preset number of data into a data set.
Steps 401 to 403 are substantially the same as steps 201 to 203 and are therefore not described again.
Step 404, if the data set is empty, generating a new cluster for the data to be labeled, and inserting the data to be labeled into a predetermined database.
Step 404 is substantially the same as step 205, and therefore is not described in detail.
Step 405, calculating a second similarity of any two data in the data set, and merging the data in the data set based on the second similarity.
In this embodiment, records in the set U whose similarity exceeds S2 (the data-merging threshold, S2 > S1) are merged by semantic calculation, and the clusters of records whose similarity falls in the range (S1, S2) are merged. To improve the performance of the clustering calculation and keep the algorithm real-time, the method performs the clustering calculation with multiple threads. This greatly improves computational performance but may degrade the clustering effect: if several requests arrive simultaneously whose data do not yet exist in the historical labeled data, full-text retrieval in the DB finds no similar records, so the clustering calculation in different threads generates different clusters, and the same or similar data end up in different clusters. The semantic similarity therefore has to be computed once more over the records in U: records with similarity above S2 are merged, and the clusters of records with similarity in (S1, S2) are merged. Real-time cluster labeling is thus achieved through multi-threaded clustering with subsequent compensation. The DB is then updated according to the merged data set.
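The compensation pass described above might be sketched as follows, using a small union-find over cluster identifiers to track which clusters end up merged. The `sim` callable, the thresholds, and the return shape are illustrative assumptions:

```python
def compensate(U, cluster_of, sim, S1=0.6, S2=0.9):
    """U: records to re-check; cluster_of: {record: cluster id};
    sim: pairwise second-similarity function. Returns the records merged
    away as duplicates and a {cluster id: final cluster id} mapping."""
    parent = {c: c for c in set(cluster_of[r] for r in U)}

    def find(c):                         # union-find root with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    merged_records = set()
    items = list(U)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            s = sim(items[i], items[j])
            if s > S2:                   # duplicates: merge the records
                merged_records.add(items[j])
            if s > S1:                   # similar: merge their clusters
                parent[find(cluster_of[items[i]])] = find(cluster_of[items[j]])
    return merged_records, {c: find(c) for c in parent}
```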
Step 406, if the data set contains a record whose second similarity to the data to be labeled is greater than the predetermined data-merging threshold, incrementing by 1 the count of the record in the data set most similar to the data to be labeled.
Step 406 is substantially the same as step 206 except that step 406 employs a merged data set.
Step 407, if the data set is not empty and no record in it has a second similarity to the data to be labeled greater than the predetermined data-merging threshold, using the cluster of the record in the data set with the highest second similarity as the cluster of the data to be labeled, and inserting the data to be labeled into the predetermined database.
Step 407 is substantially the same as step 204, and therefore is not described in detail.
As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for labeling data in this embodiment adds a data-merging step. The scheme described in this embodiment can therefore achieve real-time cluster labeling through multi-threaded clustering with subsequent compensation.
With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for labeling data, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the apparatus 500 for labeling data of the present embodiment includes: a query unit 501, a calculation unit 502, an aggregation unit 503 and an insertion unit 504. The query unit 501 is configured to query, in response to receiving data to be labeled, a predetermined number of data with different clusters and the highest first similarity; a calculation unit 502 configured to calculate a second similarity between a predetermined number of data and data to be labeled; an aggregation unit 503 configured to put data, of the predetermined number of data, having a second similarity to the data to be labeled exceeding a predetermined clustering threshold into a data set; the inserting unit 504 is configured to, if the data set is not empty and there is no data in the data set with a second similarity to the data to be labeled that is greater than a predetermined data merging threshold, use a cluster corresponding to the data in the data set with the highest second similarity to the data to be labeled as a cluster of the data to be labeled, and insert the data to be labeled into a predetermined database, where the data merging threshold is greater than the cluster threshold.
In this embodiment, the specific processing of the querying unit 501, the calculating unit 502, the aggregating unit 503 and the inserting unit 504 of the apparatus 500 for labeling data may refer to step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the inserting unit 504 is further configured to: and if the data set is empty, generating a new cluster for the data to be labeled, and inserting the data to be labeled into a preset database.
In some optional implementations of the present embodiment, the inserting unit 504 is further configured to: and if the data with the second similarity to the data to be labeled is greater than the preset data merging threshold exists in the data set, counting the data which is most similar to the data to be labeled in the data set and adding 1.
In some optional implementations of this embodiment, the aggregation unit 503 is further configured to: after data with the second similarity exceeding a preset clustering threshold value with the data to be labeled in the preset number of data is put into a data set, calculating the second similarity of any two data in the data set; and merging the data in the data set based on the second similarity.
In some optional implementations of this embodiment, the aggregation unit 503 is further configured to: merge data items whose pairwise second similarity is greater than the predetermined data merging threshold.
In some optional implementations of this embodiment, the aggregation unit 503 is further configured to: merge the clusters of data items whose pairwise second similarity lies between the predetermined clustering threshold and the predetermined data merging threshold.
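The two in-set merging behaviors described above can be sketched as follows; this is an illustrative interpretation only, and the similarity function, threshold values and record layout are assumptions:

```python
# Hypothetical sketch of merging within the data set: items above the data
# merging threshold are merged into one record; items between the two
# thresholds have their clusters merged. Thresholds and layout are assumed.
from itertools import combinations

CLUSTER_THRESHOLD = 0.5   # predetermined clustering threshold (assumed value)
MERGE_THRESHOLD = 0.9     # predetermined data merging threshold (assumed value)

def merge_data_set(data_set, similarity):
    """data_set: list of dicts {"text": str, "cluster": int, "count": int}."""
    for a, b in combinations(list(data_set), 2):
        if a not in data_set or b not in data_set:
            continue  # one of the pair was already merged away
        s = similarity(a["text"], b["text"])
        if s > MERGE_THRESHOLD:
            # Near-duplicates: merge the data items themselves.
            a["count"] += b["count"]
            data_set.remove(b)
        elif s > CLUSTER_THRESHOLD:
            # Related but distinct: merge their clusters instead.
            old, new = b["cluster"], a["cluster"]
            for item in data_set:
                if item["cluster"] == old:
                    item["cluster"] = new
    return data_set
```

The key design point, consistent with the thresholds elsewhere in this embodiment, is that item-level merging requires a strictly higher similarity than cluster-level merging.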
In some optional implementations of this embodiment, the apparatus 500 further comprises a presentation unit (not shown in the drawings) configured to: display the data in descending order of the total amount of data corresponding to each cluster identifier.
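The descending display can be sketched as a simple aggregation over cluster identifiers; the record layout here is an assumption for illustration:

```python
# Hypothetical sketch: total the amount of data per cluster identifier and
# present the clusters in descending order of their totals.
from collections import Counter

def clusters_by_size(database):
    """database: list of dicts with a "cluster" id and an optional "count"."""
    totals = Counter()
    for item in database:
        totals[item["cluster"]] += item.get("count", 1)
    # most_common() already returns (cluster, total) pairs sorted descending.
    return totals.most_common()
```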
Referring now to FIG. 6, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to receiving data to be labeled, query a predetermined number of data items that belong to different clusters and have the highest first similarity; calculate a second similarity between each of the predetermined number of data items and the data to be labeled; put into a data set those of the predetermined number of data items whose second similarity to the data to be labeled exceeds a predetermined clustering threshold; and if the data set is not empty and contains no data item whose second similarity to the data to be labeled is greater than a predetermined data merging threshold, use the cluster of the data item in the data set with the highest second similarity to the data to be labeled as the cluster of the data to be labeled, and insert the data to be labeled into a predetermined database, the data merging threshold being greater than the clustering threshold.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, apparatuses, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a query unit, a calculation unit, an aggregation unit, and an insertion unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the query unit may also be described as "a unit that, in response to receiving data to be labeled, queries a predetermined number of data items that belong to different clusters and have the highest first similarity".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features disclosed in the present disclosure having similar functions.

Claims (16)

1. A method for annotating data, comprising:
in response to receiving data to be labeled, querying a predetermined number of data items that belong to different clusters and have the highest first similarity;
calculating a second similarity between each of the predetermined number of data items and the data to be labeled;
putting, into a data set, those of the predetermined number of data items whose second similarity to the data to be labeled exceeds a predetermined clustering threshold;
if the data set is not empty and contains no data item whose second similarity to the data to be labeled is greater than a predetermined data merging threshold, using the cluster of the data item in the data set with the highest second similarity to the data to be labeled as the cluster of the data to be labeled, and inserting the data to be labeled into a predetermined database, wherein the data merging threshold is greater than the clustering threshold.
2. The method of claim 1, wherein the method further comprises:
if the data set is empty, generating a new cluster for the data to be labeled, and inserting the data to be labeled into the predetermined database.
3. The method of claim 1, wherein the method further comprises:
if the data set contains a data item whose second similarity to the data to be labeled is greater than the predetermined data merging threshold, incrementing by 1 the count of the data item in the data set that is most similar to the data to be labeled.
4. The method of claim 1, wherein, after putting into the data set those of the predetermined number of data items whose second similarity to the data to be labeled exceeds the predetermined clustering threshold, the method further comprises:
calculating a second similarity between every two data items in the data set;
merging the data in the data set based on these second similarities.
5. The method of claim 4, wherein merging the data in the data set based on the second similarities comprises:
merging data items whose pairwise second similarity is greater than the predetermined data merging threshold.
6. The method of claim 4, wherein merging the data in the data set based on the second similarities comprises:
merging the clusters of data items whose pairwise second similarity lies between the predetermined clustering threshold and the predetermined data merging threshold.
7. The method according to any one of claims 1-6, wherein the method further comprises:
displaying the data in descending order of the total amount of data corresponding to each cluster identifier.
8. An apparatus for annotating data, comprising:
a query unit configured to, in response to receiving data to be labeled, query a predetermined number of data items that belong to different clusters and have the highest first similarity;
a calculation unit configured to calculate a second similarity between each of the predetermined number of data items and the data to be labeled;
an aggregation unit configured to put, into a data set, those of the predetermined number of data items whose second similarity to the data to be labeled exceeds a predetermined clustering threshold;
an insertion unit configured to, if the data set is not empty and contains no data item whose second similarity to the data to be labeled is greater than a predetermined data merging threshold, use the cluster of the data item in the data set with the highest second similarity to the data to be labeled as the cluster of the data to be labeled, and insert the data to be labeled into a predetermined database, wherein the data merging threshold is greater than the clustering threshold.
9. The apparatus of claim 8, wherein the insertion unit is further configured to:
if the data set is empty, generate a new cluster for the data to be labeled, and insert the data to be labeled into the predetermined database.
10. The apparatus of claim 8, wherein the insertion unit is further configured to:
if the data set contains a data item whose second similarity to the data to be labeled is greater than the predetermined data merging threshold, increment by 1 the count of the data item in the data set that is most similar to the data to be labeled.
11. The apparatus of claim 8, wherein the aggregation unit is further configured to:
after putting into the data set those of the predetermined number of data items whose second similarity to the data to be labeled exceeds the predetermined clustering threshold, calculate the second similarity between every two data items in the data set;
merge the data in the data set based on these second similarities.
12. The apparatus of claim 11, wherein the aggregation unit is further configured to:
merge data items whose pairwise second similarity is greater than the predetermined data merging threshold.
13. The apparatus of claim 11, wherein the aggregation unit is further configured to:
merge the clusters of data items whose pairwise second similarity lies between the predetermined clustering threshold and the predetermined data merging threshold.
14. The apparatus according to any one of claims 8-13, wherein the apparatus further comprises a presentation unit configured to:
display the data in descending order of the total amount of data corresponding to each cluster identifier.
15. An electronic device for annotating data, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN201910775144.7A 2019-08-21 2019-08-21 Method and device for marking data Active CN110472055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910775144.7A CN110472055B (en) 2019-08-21 2019-08-21 Method and device for marking data

Publications (2)

Publication Number Publication Date
CN110472055A CN110472055A (en) 2019-11-19
CN110472055B true CN110472055B (en) 2021-09-14

Family

ID=68512658


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522854B (en) * 2020-03-18 2023-08-01 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103136359A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Generation method of single document summaries
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN104778256A (en) * 2015-04-20 2015-07-15 江苏科技大学 Rapid incremental clustering method for domain question-answering system consultations
CN105095382A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Method and device for sample distributed clustering calculation

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9305082B2 (en) * 2011-09-30 2016-04-05 Thomson Reuters Global Resources Systems, methods, and interfaces for analyzing conceptually-related portions of text
US20180101540A1 (en) * 2016-10-10 2018-04-12 Facebook, Inc. Diversifying Media Search Results on Online Social Networks


Non-Patent Citations (1)

Title
Clustering by fast search of density peaks optimized by K-nearest neighbors; Xie Juanying et al.; Scientia Sinica Informationis; 2016-02-29; vol. 46, no. 2; full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant