CN116383724B - Single-domain label vector extraction method and device, electronic equipment and medium - Google Patents

Publication number: CN116383724B (application CN202310180959.7A; other version CN116383724A)
Authority: CN (China)
Inventor: 苑帅
Assignees: Shumei Tianxia Beijing Technology Co., Ltd.; Beijing Nextdata Times Technology Co., Ltd.
Legal status: Active (granted)
Classification: Y02D 10/00 (energy-efficient computing)

Abstract

The invention relates to a single-domain label vector extraction method and device, electronic equipment, and a medium. The method comprises the following steps: acquiring a plurality of texts corresponding to a single field and the label of each text; determining a text vector matrix corresponding to the plurality of texts; determining a label vector matrix corresponding to the labels according to the number of labels and a preset vector dimension; determining a total loss value according to the label vector matrix and the text vector matrix; and updating the label vector matrix according to the total loss value to obtain an updated label vector matrix. Because the preset vector dimension is smaller than the number of labels, the method reduces the dimension of the label vector matrix and thus the memory or disk space occupied by the data; the reduced dimension also cuts the amount of computation per matching operation, while the resulting dense vectors reflect the correlations among labels from multiple aspects.

Description

Single-domain label vector extraction method and device, electronic equipment and medium
Technical Field
The invention relates to the technical fields of computer applications and big data, and in particular to a single-domain label vector extraction method and device, electronic equipment, and a medium.
Background
Most existing deep-learning applications consider only how to obtain a good vector representation of the data and use one-hot vectors to characterize label information, neglecting further optimization of the label vectors. In general, if there are n labels, then under one-hot characterization the dimension of a single label is 1×n. When the number of labels is large, n is large, so the dimension of the label vector grows with the number of labels, which puts pressure on storage and computation; in addition, the relationship between labels expressed by one-hot characterization is trivial (all pairs are orthogonal).
Disclosure of Invention
The invention provides a single-domain label vector extraction method and device, electronic equipment, and a medium, aiming to solve at least one of the above technical problems.
In a first aspect, the present invention solves the above technical problems by providing the following technical solutions: a single domain label vector extraction method, the method comprising:
acquiring a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
Determining a text vector matrix corresponding to a plurality of texts, wherein each element in the text vector matrix characterizes a text vector corresponding to each text;
determining a label vector matrix corresponding to a plurality of labels according to the number of the labels and a preset vector dimension, wherein each element in the label vector matrix represents a value corresponding to each label, and the vector dimension is smaller than the number of the labels;
determining a total loss value according to the tag vector matrix and the text vector matrix, wherein the total loss value characterizes differences between each element in the tag vector matrix and each element in the corresponding text vector matrix;
and updating the label vector matrix according to the total loss value to obtain an updated label vector matrix.
The beneficial effects of the invention are as follows: because the tag vector matrix is determined from a preset vector dimension that is smaller than the number of tags, the dimension of the tag vector matrix is reduced, and with it the memory or disk space occupied by the data; the reduced dimension also cuts the amount of computation per matching operation. In addition, the optimized vectors are low-dimensional, dense vectors rather than one-hot vectors, so the tag vectors are not completely orthogonal to each other, and the correlations among tags are reflected from multiple aspects.
On the basis of the technical scheme, the invention can be improved as follows.
Further, the determining a text vector matrix corresponding to the plurality of texts includes:
and performing forward computation on a plurality of texts through a natural language model to obtain text vector matrixes corresponding to the texts.
A benefit of this further scheme is that a natural language model can convert a plurality of texts into a text vector matrix, which is convenient to implement.
Further, determining a tag vector matrix corresponding to the plurality of tags according to the number of the plurality of tags and a preset vector dimension includes:
constructing a tag vector matrix according to the number of the tags and a preset vector dimension, wherein each element in the tag vector matrix accords with the data distribution of multi-dimensional Gaussian noise, the number of the tags is the number of rows of the tag vector matrix, and the preset vector dimension is the number of columns of the tag vector matrix.
A benefit of this further scheme is that, because each element of the tag vector matrix follows a multi-dimensional Gaussian distribution, the data is more stable in the subsequent vector calculations.
Further, the determining the total loss value according to the tag vector matrix and the text vector matrix includes:
determining a similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, wherein the similarity loss characterizes the similarity between the text and the tag;
determining classification loss corresponding to each text according to the tag vector matrix and the text vector matrix, wherein the classification loss represents whether the text type of each text is the text type corresponding to each tag;
determining the total loss value based on the similarity loss and the classification loss.
A benefit of this further scheme is that the total loss value takes both the similarity loss and the classification loss into account, i.e., the differences between text and label along different dimensions, so the total loss value is determined more accurately.
Further, determining the classification loss corresponding to each piece of text according to the tag vector matrix and the text vector matrix includes:
according to the tag vector matrix and the text vector matrix, respectively determining accumulated positive loss and accumulated negative loss between elements at corresponding positions in the tag vector matrix and the text vector matrix, wherein the accumulated positive loss represents that the text type of the text corresponding to the elements at the corresponding positions is the text type corresponding to the corresponding tags, and the accumulated negative loss represents that the text type of the text corresponding to the elements at the corresponding positions is not the text type corresponding to the corresponding tags;
And determining the classification loss according to each accumulated positive loss and each accumulated negative loss.
A benefit of this further scheme is that determining the classification loss from both the accumulated positive loss and the accumulated negative loss reflects the classification loss more accurately.
Further, determining the classification loss corresponding to each piece of text according to the tag vector matrix and the text vector matrix includes:
performing matrix calculation on the label vector matrix and the transpose matrix of the text vector matrix to obtain a calculation result;
calculating the cross entropy between the calculation result and each label;
and determining the classification loss according to each cross entropy.
A benefit of this further scheme is that computing the classification loss through matrix operations is simpler and more convenient.
Further, updating the tag vector matrix according to the total loss value to obtain an updated tag vector matrix, including:
according to the total loss value, deriving the label vector matrix to obtain a gradient matrix corresponding to the label vector matrix;
And updating the label vector matrix according to the gradient matrix and a preset learning rate to obtain an updated label vector matrix.
A benefit of this further scheme, compared with the prior art, lies in the update mode: the label vector matrix is updated according to the total loss value and a learning rate, which better fits practical requirements.
In a second aspect, the present invention further provides a single domain label vector extraction device for solving the above technical problem, where the device includes:
the data acquisition module is used for acquiring a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
a text vector matrix determining module, configured to determine a text vector matrix corresponding to a plurality of texts, where each element in the text vector matrix represents a text vector corresponding to each text;
the label vector matrix determining module is used for determining label vector matrixes corresponding to the labels according to the number of the labels and preset vector dimensions, each element in the label vector matrixes represents a value corresponding to each label, and the vector dimensions are smaller than the number of the labels;
A total loss value determining module, configured to determine a total loss value according to the tag vector matrix and the text vector matrix, where the total loss value characterizes differences between each element in the tag vector matrix and each element in the corresponding text vector matrix;
and the updating module is used for updating the label vector matrix according to the total loss value to obtain an updated label vector matrix.
In a third aspect, the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, and the processor implements a single domain label vector extraction method according to the present application when executing the computer program.
In a fourth aspect, the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement a single domain label vector extraction method according to the present application.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments of the present invention will be briefly described below.
Fig. 1 is a flow chart of a single domain label vector extraction method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a single domain label vector extraction device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below in conjunction with examples, which are given for illustration only and are not intended to limit the scope of the invention.
The following describes the technical scheme of the present invention and how the technical scheme of the present invention solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
The scheme provided by the embodiment of the invention can be applied to any application scenario requiring extraction of label vectors for a single field. It can be executed by any electronic device, for example a user's terminal device, including at least one of the following: smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, smart television, smart in-vehicle device.
The embodiment of the invention provides one possible implementation. As shown in fig. 1, a flowchart of a single-domain label vector extraction method is provided; the method may be executed by any electronic device, for example a terminal device, or jointly by a terminal device and a server. For ease of description, the method provided by the embodiment of the present invention is described below with a terminal device as the execution body; as shown in the flowchart of fig. 1, the method may include the following steps:
step S110, obtaining a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
step S120, determining a text vector matrix corresponding to a plurality of texts, wherein each element in the text vector matrix represents a text vector corresponding to each text;
step S130, determining a label vector matrix corresponding to a plurality of labels according to the number of the labels and a preset vector dimension, wherein each element in the label vector matrix represents a value corresponding to each label, and the vector dimension is smaller than the number of the labels;
Step S140, determining a total loss value according to the tag vector matrix and the text vector matrix, where the total loss value characterizes a difference between each element in the tag vector matrix and each element in the corresponding text vector matrix;
and step S150, updating the label vector matrix according to the total loss value to obtain an updated label vector matrix.
According to the method, the label vector matrix is determined based on a preset vector dimension that is smaller than the number of labels, so the dimension of the label vector matrix is reduced, and with it the memory or disk space occupied by the data; the reduced dimension also cuts the amount of computation per matching operation. In addition, the optimized vectors are low-dimensional, dense vectors rather than one-hot vectors, so the label vectors are not completely orthogonal to each other, and the correlations among labels are reflected from multiple aspects.
The following describes the scheme of the present invention with reference to the following specific embodiments, in which a single domain label vector extraction method may include the following steps:
Step S110, obtaining a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
Here, a single domain means one and the same field, such as the medical field or the literature field. For each text, its label represents its text type. Taking positive-sentiment classification as an example, the text "fast meal delivery and good attitude" is positive content, its label is a positive-sentiment label, and the label may be assigned the value 1 during data annotation; correspondingly, "the dishes taste bad, delivery is too slow, and the food arrives cold" is negative content, its label is a negative-sentiment label, and the label may be assigned the value 0. A label may represent different text types by different numerical values; the values 0 and 1 are merely examples and do not limit how labels are represented in the scheme of the present application.
Step S120, determining a text vector matrix corresponding to a plurality of texts, wherein each element in the text vector matrix represents a text vector corresponding to each text;
Optionally, the determining a text vector matrix corresponding to the plurality of texts includes:
and performing forward computation on a plurality of texts through a natural language model to obtain text vector matrixes corresponding to the texts.
The natural language model may be an existing natural language model, or a model obtained by fine-tuning an existing natural language model with text data from the single field. From this model, a text vector for each text can be determined, i.e., the text is characterized by its text vector.
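Since the patent does not name a particular natural language model, the forward-computation step can only be sketched. The toy `embed_texts` function below is a hypothetical stand-in, not the claimed model; it shows the interface only: each text in, one m-dimensional row of the text vector matrix out.

```python
import hashlib

import numpy as np


def embed_texts(texts, dim=8, seed=0):
    """Toy stand-in for a language model's forward pass: maps each text
    to a deterministic dim-dimensional vector. In practice this row would
    be the pooled output of a (possibly fine-tuned) encoder."""
    rows = []
    for text in texts:
        # Derive a per-text seed from the content so the mapping is
        # deterministic, then draw a pseudo-embedding from it.
        h = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
        rng = np.random.default_rng(h ^ seed)
        rows.append(rng.standard_normal(dim))
    return np.stack(rows)  # shape (n_texts, dim): the text vector matrix


texts = ["delivery was fast, great attitude", "the food was cold and late"]
text_matrix = embed_texts(texts, dim=8)
print(text_matrix.shape)  # (2, 8)
```

Any real encoder with the same "texts in, n×m matrix out" contract can be dropped in for `embed_texts`.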
Step S130, determining a label vector matrix corresponding to a plurality of labels according to the number of the labels and a preset vector dimension, wherein each element in the label vector matrix represents a value corresponding to each label, and the vector dimension is smaller than the number of the labels;
The number of labels is the same as the number of texts: if there are n texts, there are n corresponding labels. The position of each text's vector in the text vector matrix corresponds one-to-one with the position of its label in the label vector matrix. The vector dimension is preset and does not grow with the number of labels. As an example, if the number of labels is n and the preset vector dimension is m, the size of the label vector matrix is n×m.
Optionally, determining the tag vector matrix corresponding to the plurality of tags according to the number of the plurality of tags and a preset vector dimension includes:
constructing a tag vector matrix according to the number of the tags and a preset vector dimension, wherein each element in the tag vector matrix accords with the data distribution of multi-dimensional Gaussian noise, the number of the tags is the number of rows of the tag vector matrix, and the preset vector dimension is the number of columns of the tag vector matrix.
Optionally, the constructing the tag vector matrix according to the number of the plurality of tags and a preset vector dimension includes:
constructing an initialization matrix according to the number of the labels and a preset vector dimension, wherein the number of the labels is the number of rows of the initialization matrix, and the preset vector dimension is the number of columns of the initialization matrix;
and initializing the initialization matrix by adopting multidimensional Gaussian noise to obtain a label vector matrix.
In any application scenario, assuming the number of labels is n and the vector-dimension hyper-parameter is m, the size of the label vector matrix is n×m; each element of the matrix is initialized to a random value within a range, and this initialization is a basic operation flow.
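The initialization step described above can be sketched in a few lines; the values of n and m below are illustrative, not from the patent.

```python
import numpy as np


def init_tag_matrix(n_tags, dim, seed=42):
    """Construct the n_tags x dim tag vector matrix with every element
    drawn from a standard Gaussian (the multi-dimensional Gaussian-noise
    initialization described in the text)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_tags, dim))


n, m = 100, 16          # m is a tunable hyper-parameter, kept smaller than n
tag_matrix = init_tag_matrix(n, m)
print(tag_matrix.shape)  # (100, 16)
```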
Step S140, determining a total loss value according to the tag vector matrix and the text vector matrix, where the total loss value characterizes a difference between each element in the tag vector matrix and each element in the corresponding text vector matrix;
optionally, determining the total loss value according to the tag vector matrix and the text vector matrix includes:
determining a similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, wherein the similarity loss characterizes the similarity between the text and the tag;
determining classification loss corresponding to each text according to the tag vector matrix and the text vector matrix, wherein the classification loss represents whether the text type of each text is the text type corresponding to each tag;
determining the total loss value based on the similarity loss and the classification loss.
The text vector matrix and the label vector matrix have the same size, for example, the text vector matrix has a size of n×m, and the label vector matrix has a size of n×m.
Optionally, one implementation manner of determining the similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix is:
And calculating the similarity (such as Euclidean distance) between elements at corresponding positions in the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, and accumulating the calculated similarity values to obtain the similarity loss.
As an example, when similarity is measured using the Euclidean distance, suppose the text vector matrix and the tag vector matrix are both of size 1×2, with the text vector matrix [-1, 2] and the tag vector matrix [2, 6]. The Euclidean distance between vectors x and y is

d(x, y) = sqrt(Σ_i (x_i − y_i)²),

so here the similarity value between the two matrices is d = sqrt((2 − (−1))² + (6 − 2)²) = sqrt(9 + 16) = 5.
Alternatively, the above-described similarity loss may be calculated based on the L2 loss or JS divergence, that is, the sum of the similarities between the respective elements is calculated based on the L2 loss or JS divergence as the similarity loss.
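Under the Euclidean-distance choice, the similarity loss is the sum of row-wise distances between the two matrices. A minimal sketch, which also reproduces the worked [-1, 2] vs [2, 6] example:

```python
import numpy as np


def similarity_loss(tag_matrix, text_matrix):
    """Sum of Euclidean distances between corresponding rows of the
    tag vector matrix and the text vector matrix (both n x m)."""
    diffs = tag_matrix - text_matrix
    return float(np.sqrt((diffs ** 2).sum(axis=1)).sum())


# The single-row example from the text: sqrt((2-(-1))^2 + (6-2)^2) = 5
print(similarity_loss(np.array([[2.0, 6.0]]), np.array([[-1.0, 2.0]])))  # 5.0
```

Swapping the row-wise distance for an L2 loss or JS divergence, as the text allows, changes only the inner expression.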
Optionally, determining the classification loss corresponding to each piece of text according to the tag vector matrix and the text vector matrix includes:
according to the tag vector matrix and the text vector matrix, respectively determining accumulated positive loss and accumulated negative loss between elements at corresponding positions in the tag vector matrix and the text vector matrix, wherein the accumulated positive loss represents that the text type of the text corresponding to the elements at the corresponding positions is the text type corresponding to the corresponding tags, and the accumulated negative loss represents that the text type of the text corresponding to the elements at the corresponding positions is not the text type corresponding to the corresponding tags;
And determining the classification loss according to each accumulated positive loss and each accumulated negative loss.
An element at a corresponding position is an element occupying the same position in the tag vector matrix and the text vector matrix. For such an element, a contribution to the accumulated positive loss indicates that the text type of the corresponding text is the text type of the corresponding tag: for example, if the element corresponds to text a and tag z, and the text type of text a is the type corresponding to tag z, the loss for this element is counted in the accumulated positive loss. Otherwise, if the text type of text a is not the type corresponding to tag z, the loss for this element is counted in the accumulated negative loss. Finally, the classification loss is determined from the accumulated positive and negative losses, i.e., by summing the positive-loss contributions and the negative-loss contributions of the elements in the matrix.
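The text does not give a closed form for these accumulated losses; the sketch below is one common way to realize the idea (a contrastive formulation; an assumption, not the claimed formula): matched text/tag rows contribute their distance as positive loss, mismatched rows contribute a margin hinge term as negative loss.

```python
import numpy as np


def contrastive_losses(tag_rows, text_rows, is_match, margin=1.0):
    """Hypothetical realization of accumulated positive/negative loss.
    is_match[i] is True when text i really has the tag in row i."""
    dists = np.sqrt(((tag_rows - text_rows) ** 2).sum(axis=1))
    pos = float(dists[is_match].sum())                             # pull matches together
    neg = float(np.maximum(0.0, margin - dists[~is_match]).sum())  # push mismatches apart
    return pos, neg


tags = np.array([[0.0, 0.0], [1.0, 1.0]])
texts = np.array([[0.3, 0.4], [1.0, 1.0]])
is_match = np.array([True, False])
pos, neg = contrastive_losses(tags, texts, is_match)
print(pos, neg)  # approximately: 0.5 1.0
```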
Optionally, another implementation manner of determining the classification loss corresponding to each piece of text according to the tag vector matrix and the text vector matrix is:
Performing matrix calculation on the label vector matrix and the transpose matrix of the text vector matrix to obtain a calculation result;
calculating the cross entropy between the calculation result and each label;
and determining the classification loss according to each cross entropy.
Optionally, one implementation manner of determining the classification loss according to each cross entropy is: obtaining the classification loss from the cross entropies through softmax processing.
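A minimal numpy sketch of this matrix route, under the usual reading that the product of the tag vector matrix and the transpose of the text vector matrix yields per-text scores over tags, to which softmax cross-entropy is applied against the integer labels (shapes and values below are illustrative):

```python
import numpy as np


def classification_loss(tag_matrix, text_matrix, labels):
    """Softmax cross-entropy over the scores obtained by multiplying the
    tag vector matrix with the transpose of the text vector matrix."""
    logits = tag_matrix @ text_matrix.T          # (n_tags, n_texts) score table
    logits = logits.T                            # per-text scores over all tags
    # Numerically stable log-softmax, then pick out each text's true tag.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


rng = np.random.default_rng(0)
tag_matrix = rng.standard_normal((3, 4))     # 3 tags, vector dimension 4
text_matrix = rng.standard_normal((5, 4))    # 5 texts, vector dimension 4
labels = np.array([0, 2, 1, 0, 2])           # tag index of each text
print(classification_loss(tag_matrix, text_matrix, labels))
```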
And step S150, updating the label vector matrix according to the total loss value to obtain an updated label vector matrix.
The tag vector matrix can be optimized (updated) by a stochastic gradient optimization method.
Optionally, updating the tag vector matrix according to the total loss value to obtain an updated tag vector matrix, including:
according to the total loss value, deriving the label vector matrix to obtain a gradient matrix corresponding to the label vector matrix;
and updating the label vector matrix according to the gradient matrix and a preset learning rate to obtain an updated label vector matrix.
Wherein the gradient matrix is the same size as the tag vector matrix. The updating process of the tag vector matrix involves each element in the tag vector matrix.
As an example, assume the original tag vector matrix is [1, 2] (i.e., the matrix before the update), the corresponding gradient matrix is [0.1, 0.4], the learning rate is 0.01, and the update method is stochastic gradient descent; the updated tag vector matrix is then [0.999, 1.996]. Element-wise, 0.999 = 1 − 0.1 × 0.01 and 1.996 = 2 − 0.4 × 0.01.
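The worked update above can be checked with a one-line implementation of vanilla gradient descent:

```python
import numpy as np


def sgd_update(tag_matrix, grad_matrix, lr=0.01):
    """One gradient-descent step on the tag vector matrix:
    new = old - lr * gradient, element-wise."""
    return tag_matrix - lr * grad_matrix


updated = sgd_update(np.array([1.0, 2.0]), np.array([0.1, 0.4]), lr=0.01)
print(updated)  # [0.999 1.996]
```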
The scheme of the application yields the following two beneficial effects:
1. The dimension of the tag characterization is reduced, which reduces the memory or disk space occupied by the tag vectors and thus the storage pressure; the lower dimension also reduces the amount of computation, improving the speed of matching tags against data.
2. The tag vectors are not exactly orthogonal to each other, so complex correlations between tags can be characterized with distance metrics such as cosine distance or Euclidean distance.
Both effects follow mainly from reducing the label dimension: a lower dimension means less memory or disk space occupied by the data, and fewer operations per matching computation.
In the prior art, if there are n labels, then under one-hot characterization the dimension of a single label is 1×n, and when the number of labels is large, n is large. With the scheme of the application, the dimension of a label can instead be initialized to 1×m, where m is a dynamically adjustable hyper-parameter set smaller than n. Compared with the usual one-hot characterization, the tag vectors here are not exactly orthogonal: 1) distinct one-hot vectors are pairwise exactly orthogonal; 2) the cosine similarity between any two distinct one-hot vectors is 0, and the Euclidean distance between any two of them is the same, so one-hot vectors cannot represent correlations between labels. The vectors optimized by the scheme of the application are low-dimensional, dense vectors rather than one-hot vectors, so the correlation between labels can be shown through Euclidean distance (similarity).
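The orthogonality point is easy to verify numerically: distinct one-hot vectors always have cosine similarity 0, while dense low-dimensional tag vectors (illustrative values below) can carry graded relatedness.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Any two distinct one-hot vectors are exactly orthogonal.
one_hot_a = np.array([1.0, 0.0, 0.0])
one_hot_b = np.array([0.0, 1.0, 0.0])
print(cosine(one_hot_a, one_hot_b))  # 0.0

# Dense low-dimensional tag vectors (made-up values) can instead
# express a graded correlation between related tags.
dense_a = np.array([0.9, 0.2])
dense_b = np.array([0.8, 0.4])
print(round(cosine(dense_a, dense_b), 3))  # 0.97
```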
Based on the same principle as the method shown in fig. 1, the embodiment of the present invention further provides a single domain label vector extraction apparatus 20, as shown in fig. 2, the single domain label vector extraction apparatus 20 may include a data acquisition module 210, a text vector matrix determination module 220, a label vector matrix determination module 230, a total loss value determination module 240, and an update module 250, wherein:
the data obtaining module 210 is configured to obtain a plurality of texts corresponding to a single field and a label of each text, where for each text, the label represents a text type of the text, the label is a numerical value, and different numerical values represent different text types;
a text vector matrix determining module 220, configured to determine a text vector matrix corresponding to a plurality of texts, where each element in the text vector matrix characterizes a text vector corresponding to each text;
a tag vector matrix determining module 230, configured to determine a tag vector matrix corresponding to a plurality of tags according to the number of the plurality of tags and a preset vector dimension, where each element in the tag vector matrix characterizes a value corresponding to each tag, and the vector dimension is smaller than the number of the plurality of tags;
A total loss value determining module 240, configured to determine a total loss value according to the tag vector matrix and the text vector matrix, where the total loss value characterizes a difference between each element in the tag vector matrix and each element in the corresponding text vector matrix;
and an updating module 250, configured to update the tag vector matrix according to the total loss value, to obtain an updated tag vector matrix.
Optionally, the text vector matrix determining module 220 is specifically configured to, when determining the text vector matrix corresponding to a plurality of the texts:
perform forward computation on a plurality of the texts through a natural language model to obtain the text vector matrix corresponding to the plurality of texts.
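As a minimal sketch of this step: the patent does not name a specific natural language model, so the encoder below is a hypothetical stand-in (a fixed random projection over bag-of-bytes counts) that merely produces one row per text; a real system would run a trained model's forward pass instead:

```python
import numpy as np

def encode_texts(texts, dim=8, seed=0):
    """Stand-in for the natural language model's forward pass (hypothetical:
    the patent does not specify the model). Maps each text to a fixed
    dim-dimensional row; rows are stacked into the text vector matrix."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(256, dim))   # fixed random projection
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        counts = np.zeros(256)
        for ch in t.encode("utf-8"):     # bag-of-bytes features
            counts[ch] += 1
        out[i] = counts @ proj
    return out                           # text vector matrix, one row per text

texts = ["good product", "terrible service", "good service"]
M = encode_texts(texts)                  # shape (3, 8)
```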
Optionally, the tag vector matrix determining module 230 is specifically configured to, when determining the tag vector matrices corresponding to the plurality of tags according to the number of the plurality of tags and a preset vector dimension:
constructing a tag vector matrix according to the number of the tags and a preset vector dimension, wherein each element in the tag vector matrix accords with the data distribution of multi-dimensional Gaussian noise, the number of the tags is the number of rows of the tag vector matrix, and the preset vector dimension is the number of columns of the tag vector matrix.
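A sketch of this construction, assuming a standard normal distribution for the Gaussian noise (the patent does not give the mean or variance):

```python
import numpy as np

def init_tag_matrix(num_tags, dim, seed=0):
    """Rows = number of tags, columns = preset vector dimension (dim < num_tags);
    entries drawn from a Gaussian, matching the multi-dimensional-noise
    initialization described above. Mean 0 and scale 1 are assumed values."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(num_tags, dim))

T = init_tag_matrix(num_tags=10, dim=4)  # 10 tags, each a dense 1 x 4 vector
```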
Optionally, the total loss value determining module 240 is specifically configured to, when determining the total loss value according to the tag vector matrix and the text vector matrix:
determining a similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, wherein the similarity loss characterizes the similarity between the text and the tag;
determining classification loss corresponding to each text according to the tag vector matrix and the text vector matrix, wherein the classification loss represents whether the text type of each text is the text type corresponding to each tag;
determining the total loss value based on the similarity loss and the classification loss.
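The combination of the two losses can be sketched as follows; the equal weighting alpha = 1.0, the pairing of each text with one tag row via integer labels, and the softmax form of the classification term are assumptions, since the patent only states that a similarity loss and a classification loss are combined:

```python
import numpy as np

def total_loss(tag_vecs, text_vecs, labels, alpha=1.0):
    """Sketch of the combined objective. labels[i] indexes the tag row
    paired with text i; alpha is an assumed weighting."""
    # similarity loss: accumulated Euclidean distance between each text
    # vector and its paired tag vector
    sim = np.linalg.norm(text_vecs - tag_vecs[labels], axis=1).sum()
    # classification loss: cross entropy over per-text tag scores
    logits = text_vecs @ tag_vecs.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cls = -log_probs[np.arange(len(labels)), labels].sum()
    return sim + alpha * cls
```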
Optionally, the total loss value determining module 240 is specifically configured to, when determining, according to the tag vector matrix and the text vector matrix, a classification loss corresponding to each piece of text:
according to the tag vector matrix and the text vector matrix, respectively determining accumulated positive loss and accumulated negative loss between elements at corresponding positions in the tag vector matrix and the text vector matrix, wherein the accumulated positive loss represents that the text type of the text corresponding to the elements at the corresponding positions is the text type corresponding to the corresponding tags, and the accumulated negative loss represents that the text type of the text corresponding to the elements at the corresponding positions is not the text type corresponding to the corresponding tags;
And determining the classification loss according to each accumulated positive loss and each accumulated negative loss.
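One hypothetical contrastive reading of the accumulated positive and negative losses: the positive term accumulates the distance from each text to its own tag, and the negative term accumulates a hinge penalty against every other tag (the margin value is an assumption, as the patent does not specify the exact form of either term):

```python
import numpy as np

def pos_neg_loss(tag_vecs, text_vecs, labels, margin=1.0):
    """Hypothetical sketch: pull each text toward its own tag (accumulated
    positive loss) and push it away from every other tag (accumulated
    negative loss, hinged at an assumed margin)."""
    pos = 0.0
    neg = 0.0
    for i, y in enumerate(labels):
        d = np.linalg.norm(tag_vecs - text_vecs[i], axis=1)  # distance to every tag
        pos += d[y]                                          # accumulated positive loss
        mask = np.ones(len(tag_vecs), dtype=bool)
        mask[y] = False
        neg += np.maximum(0.0, margin - d[mask]).sum()       # accumulated negative loss
    return pos + neg
```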
Optionally, the total loss value determining module 240 is specifically configured to, when determining, according to the tag vector matrix and the text vector matrix, a classification loss corresponding to each piece of text:
performing matrix calculation on the label vector matrix and the transpose matrix of the text vector matrix to obtain a calculation result;
calculating the cross entropy between the calculation result and each label;
and determining the classification loss according to each cross entropy.
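This step can be sketched directly; the softmax over tags before taking the cross entropy is an assumption, as the patent only specifies the transpose product and a cross entropy against each label:

```python
import numpy as np

def classification_loss(tag_vecs, text_vecs, labels):
    """Matrix product of the tag matrix with the transpose of the text matrix
    gives a (num_tags x num_texts) score table (the 'calculation result');
    cross entropy is then taken against each text's integer label."""
    scores = tag_vecs @ text_vecs.T                      # calculation result
    scores = scores - scores.max(axis=0, keepdims=True)  # stabilize softmax over tags
    log_probs = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    # cross entropy between the score column for text i and its label
    return -log_probs[labels, np.arange(len(labels))].mean()
```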
Optionally, the updating module 250 is specifically configured to, when updating the tag vector matrix according to the total loss value to obtain an updated tag vector matrix:
according to the total loss value, deriving the label vector matrix to obtain a gradient matrix corresponding to the label vector matrix;
and updating the label vector matrix according to the gradient matrix and a preset learning rate to obtain an updated label vector matrix.
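The update itself is a plain gradient-descent step; the learning-rate value below is an assumed placeholder:

```python
import numpy as np

def update_tag_matrix(tag_vecs, grad, lr=0.01):
    """One descent step: the gradient matrix of the total loss with respect
    to the tag vector matrix, scaled by a preset learning rate (the value
    0.01 is an assumption)."""
    return tag_vecs - lr * grad

# usage: one step with a toy gradient; each entry moves from 1.0 to 0.95
T = np.ones((3, 2))
T_new = update_tag_matrix(T, grad=np.full((3, 2), 0.5), lr=0.1)
```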
The single domain label vector extraction apparatus according to the embodiments of the present invention may perform the single domain label vector extraction method according to the embodiments of the present invention, and its implementation principle is similar. The actions performed by each module and unit in the apparatus correspond to the steps in the method according to each embodiment of the present invention; for a detailed functional description of each module, reference may be made to the description of the corresponding method shown above, which is not repeated here.
Wherein the single domain tag vector extraction apparatus may be a computer program (including program code) running in a computer device, for example, the single domain tag vector extraction apparatus is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the invention.
In some embodiments, the single domain tag vector extraction apparatus provided by the present invention may be implemented by a combination of software and hardware. By way of example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the single domain tag vector extraction method provided by the present invention; for example, such a processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
In other embodiments, a single domain tag vector extraction apparatus provided in the embodiments of the present invention may be implemented in software, and fig. 2 shows a single domain tag vector extraction apparatus stored in a memory, which may be software in the form of a program, a plug-in, or the like, and includes a series of modules including a data acquisition module 210, a text vector matrix determination module 220, a tag vector matrix determination module 230, a total loss value determination module 240, and an update module 250, for implementing a single domain tag vector extraction method provided in the embodiments of the present invention.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
Based on the same principles as the methods shown in the embodiments of the present invention, there is also provided in the embodiments of the present invention an electronic device, which may include, but is not limited to: a processor and a memory; a memory for storing a computer program; a processor for executing the method according to any of the embodiments of the invention by invoking a computer program.
In an alternative embodiment, an electronic device is provided, as shown in fig. 3; the electronic device 4000 shown in fig. 3 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as the transmission and/or reception of data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present invention.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 3, but this does not mean that there is only one bus or only one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used for storing application program codes (computer programs) for executing the present invention and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute application program codes stored in the memory 4003 to realize what is shown in the foregoing method embodiment.
The electronic device shown in fig. 3 is only an example, and should not impose any limitation on the functions and application scope of the embodiment of the present invention.
Embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon which, when run on a computer, causes the computer to perform the corresponding method embodiments described above.
According to another aspect of the present invention, there is also provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the various embodiments described above.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be appreciated that the flow charts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer readable storage medium according to embodiments of the present invention may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above-described embodiments.
The above description is merely illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, solutions in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present invention.

Claims (8)

1. A single-domain label vector extraction method, characterized by comprising the following steps:
acquiring a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
determining a text vector matrix corresponding to a plurality of texts, wherein each element in the text vector matrix characterizes a text vector corresponding to each text;
determining a label vector matrix corresponding to a plurality of labels according to the number of the labels and a preset vector dimension, wherein each element in the label vector matrix represents a value corresponding to each label, and the vector dimension is smaller than the number of the labels;
determining a total loss value according to the tag vector matrix and the text vector matrix, wherein the total loss value characterizes differences between each element in the tag vector matrix and each element in the corresponding text vector matrix;
updating the label vector matrix according to the total loss value to obtain an updated label vector matrix;
Determining a total loss value according to the tag vector matrix and the text vector matrix, including:
determining a similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, wherein the similarity loss characterizes the similarity between the text and the tag;
determining classification loss corresponding to each text according to the tag vector matrix and the text vector matrix, wherein the classification loss represents whether the text type of each text is the text type corresponding to each tag;
determining the total loss value based on the similarity loss and the classification loss;
according to the label vector matrix and the text vector matrix, calculating the similarity between elements at corresponding positions in the label vector matrix and the text vector matrix, accumulating the calculated similarity values, and obtaining the similarity loss by the following steps:
similarity is measured by using the Euclidean distance, the calculation formula of which is:

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

where x is a text vector, y is the tag vector at the corresponding position, and m is the vector dimension;
or, the similarity loss is calculated based on the L2 loss or the JS divergence, namely, the sum of the similarity among the elements is calculated based on the L2 loss or the JS divergence and is used as the similarity loss;
The determining a tag vector matrix corresponding to the plurality of tags according to the number of the plurality of tags and a preset vector dimension comprises the following steps:
constructing a tag vector matrix according to the number of the tags and a preset vector dimension, wherein each element in the tag vector matrix accords with the data distribution of multi-dimensional Gaussian noise, the number of the tags is the number of rows of the tag vector matrix, the preset vector dimension is the number of columns of the tag vector matrix, and the number of the tags is the same as the number of the texts.
2. The method of claim 1, wherein the determining a text vector matrix for a plurality of the texts comprises:
and performing forward computation on a plurality of texts through a natural language model to obtain text vector matrixes corresponding to the texts.
3. The method of claim 1, wherein said determining a classification penalty for each of said texts based on said tag vector matrix and said text vector matrix comprises:
according to the tag vector matrix and the text vector matrix, respectively determining accumulated positive loss and accumulated negative loss between elements at corresponding positions in the tag vector matrix and the text vector matrix, wherein the accumulated positive loss represents that the text type of the text corresponding to the elements at the corresponding positions is the text type corresponding to the corresponding tags, and the accumulated negative loss represents that the text type of the text corresponding to the elements at the corresponding positions is not the text type corresponding to the corresponding tags;
And determining the classification loss according to each accumulated positive loss and each accumulated negative loss.
4. The method of claim 1, wherein said determining a classification penalty for each of said texts based on said tag vector matrix and said text vector matrix, further comprises:
performing matrix calculation on the label vector matrix and the transpose matrix of the text vector matrix to obtain a calculation result;
calculating the cross entropy between the calculation result and each label;
and determining the classification loss according to each cross entropy.
5. The method according to claim 1 or 2, wherein updating the tag vector matrix according to the total loss value, to obtain an updated tag vector matrix, comprises:
according to the total loss value, deriving the label vector matrix to obtain a gradient matrix corresponding to the label vector matrix;
and updating the label vector matrix according to the gradient matrix and a preset learning rate to obtain an updated label vector matrix.
6. A single domain label vector extraction apparatus comprising:
the data acquisition module is used for acquiring a plurality of texts corresponding to a single field and labels of each text, wherein for each text, the labels represent the text types of the texts, the labels are numerical values, and different numerical values represent different text types;
A text vector matrix determining module, configured to determine a text vector matrix corresponding to a plurality of texts, where each element in the text vector matrix represents a text vector corresponding to each text;
the label vector matrix determining module is used for determining label vector matrixes corresponding to the labels according to the number of the labels and preset vector dimensions, each element in the label vector matrixes represents a value corresponding to each label, and the vector dimensions are smaller than the number of the labels;
a total loss value determining module, configured to determine a total loss value according to the tag vector matrix and the text vector matrix, where the total loss value characterizes differences between each element in the tag vector matrix and each element in the corresponding text vector matrix;
the updating module is used for updating the label vector matrix according to the total loss value to obtain an updated label vector matrix;
determining a total loss value according to the tag vector matrix and the text vector matrix, including:
determining a similarity loss between the tag vector matrix and the text vector matrix according to the tag vector matrix and the text vector matrix, wherein the similarity loss characterizes the similarity between the text and the tag;
Determining classification loss corresponding to each text according to the tag vector matrix and the text vector matrix, wherein the classification loss represents whether the text type of each text is the text type corresponding to each tag;
determining the total loss value based on the similarity loss and the classification loss;
according to the label vector matrix and the text vector matrix, calculating the similarity between elements at corresponding positions in the label vector matrix and the text vector matrix, accumulating the calculated similarity values, and obtaining the similarity loss by the following steps:
similarity is measured by using the Euclidean distance, the calculation formula of which is:

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

where x is a text vector, y is the tag vector at the corresponding position, and m is the vector dimension;
or, the similarity loss is calculated based on the L2 loss or the JS divergence, namely, the sum of the similarity among the elements is calculated based on the L2 loss or the JS divergence and is used as the similarity loss;
the determining a tag vector matrix corresponding to the plurality of tags according to the number of the plurality of tags and a preset vector dimension comprises the following steps:
constructing a tag vector matrix according to the number of the tags and a preset vector dimension, wherein each element in the tag vector matrix accords with the data distribution of multi-dimensional Gaussian noise, the number of the tags is the number of rows of the tag vector matrix, the preset vector dimension is the number of columns of the tag vector matrix, and the number of the tags is the same as the number of the texts.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-5 when the computer program is executed.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-5.
CN202310180959.7A 2023-02-16 2023-02-16 Single-domain label vector extraction method and device, electronic equipment and medium Active CN116383724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310180959.7A CN116383724B (en) 2023-02-16 2023-02-16 Single-domain label vector extraction method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN116383724A CN116383724A (en) 2023-07-04
CN116383724B true CN116383724B (en) 2023-12-05

Family

ID=86972129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310180959.7A Active CN116383724B (en) 2023-02-16 2023-02-16 Single-domain label vector extraction method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116383724B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910259B (en) * 2023-09-12 2024-04-16 深圳须弥云图空间科技有限公司 Knowledge diagnosis method and device for knowledge base

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2813990A1 (en) * 2013-06-12 2014-12-17 Netflix, Inc. Targeted promotion of original titles
US9749277B1 (en) * 2014-12-17 2017-08-29 Google Inc. Systems and methods for estimating sender similarity based on user labels
CN111125358A (en) * 2019-12-17 2020-05-08 北京工商大学 Text classification method based on hypergraph
CN112633419A (en) * 2021-03-09 2021-04-09 浙江宇视科技有限公司 Small sample learning method and device, electronic equipment and storage medium
CN112883190A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN113987174A (en) * 2021-10-22 2022-01-28 上海携旅信息技术有限公司 Core statement extraction method, system, equipment and storage medium for classification label
CN114048290A (en) * 2021-11-22 2022-02-15 鼎富智能科技有限公司 Text classification method and device
CN114661933A (en) * 2022-03-08 2022-06-24 重庆邮电大学 Cross-modal retrieval method based on fetal congenital heart disease ultrasonic image-diagnosis report

Also Published As

Publication number Publication date
CN116383724A (en) 2023-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant