WO2021072864A1 - Method and apparatus for acquiring text similarity, electronic device, and computer-readable storage medium - Google Patents

Method and apparatus for acquiring text similarity, electronic device, and computer-readable storage medium

Info

Publication number
WO2021072864A1
WO2021072864A1 (PCT/CN2019/117670; CN 2019117670 W)
Authority
WO
WIPO (PCT)
Prior art keywords
word
text
vector
spliced
similarity
Prior art date
Application number
PCT/CN2019/117670
Other languages
English (en)
French (fr)
Inventor
陈瑞清
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021072864A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Definitions

  • This application relates to the field of machine learning, and in particular to a method, device, electronic device, and computer-readable storage medium for acquiring text similarity.
  • the inventor of the present application realized that, in existing text similarity processing technology, the lack of sentence representation ability and the simple processing methods adopted usually make the text similarity processing result inaccurate, resulting in improper subsequent processing of the text.
  • one purpose of the present application is to provide a method, device, electronic device, and computer-readable storage medium for acquiring text similarity.
  • a text similarity acquisition method includes: splicing two texts to be compared for similarity to form a spliced text, the two texts respectively forming a first text segment and a second text segment in the spliced text; performing character segmentation and vectorization processing on the spliced text to obtain the word vector of each word in the spliced text; for each word in the spliced text, using the word vector of each word to calculate the feature vector of each word, the feature vector of each word representing the similar features of each word and the spliced text; and using the feature vector of each word in the first text segment and the feature vector of each word in the second text segment to calculate the similarity between the first text segment and the second text segment, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  • a text similarity acquisition device includes: a preprocessing module configured to splice two texts to be compared for similarity to form a spliced text, the two texts respectively forming a first text segment and a second text segment in the spliced text; a vectorization processing module configured to perform character segmentation and vectorization processing on the spliced text to obtain the word vector of each word in the spliced text; a first calculation module configured to use, for each word in the spliced text, the word vector of each word to calculate the feature vector of each word, the feature vector of each word representing the similar features of each word and the spliced text; and a second calculation module configured to use the feature vector of each word in the first text segment and the feature vector of each word in the second text segment to calculate the similarity between the first text segment and the second text segment, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  • an electronic device includes: a processing unit; and a storage unit.
  • the storage unit stores computer-readable instructions, and when the computer-readable instructions are executed by the processing unit, the above method is implemented.
  • a computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed by a computer, the computer executes the above method.
  • the feature vector obtained by calculation can characterize the similar features between each word in the spliced text and the spliced text, which makes it possible to use the feature vectors to calculate a highly accurate similarity value expressing the similarity between the first text segment and the second text segment in the spliced text. Therefore, the accuracy of the calculated similarity value representing the similarity between the first text segment and the second text segment is improved.
  • Fig. 1 is a schematic diagram showing a system architecture of a method for acquiring text similarity according to an exemplary embodiment
  • Fig. 2 is a flowchart showing a method for acquiring text similarity according to an exemplary embodiment
  • Fig. 3 is a detailed flowchart of step 220 in the embodiment corresponding to Fig. 2;
  • Fig. 4 is a detailed flowchart of step 230 in the embodiment corresponding to Fig. 2;
  • Fig. 5 is a detailed flowchart of step 231 in the embodiment corresponding to Fig. 4;
  • Fig. 6 is a detailed flowchart of step 240 in the embodiment corresponding to Fig. 2;
  • Fig. 7 is a flowchart of the steps after step 240 in the embodiment corresponding to Fig. 2;
  • Fig. 8 is a block diagram showing a device for acquiring text similarity according to an exemplary embodiment
  • Fig. 9 is a block diagram showing a device for acquiring text similarity according to an exemplary embodiment
  • Fig. 10 is a block diagram showing an example of an electronic device implementing the above method for acquiring text similarity according to an exemplary embodiment.
  • the text refers to a text segment composed of Chinese or foreign-language characters that can express meaning. Because text can be expressed in many ways, two texts may contain different characters while their meaning or content is similar or consistent. With the rapid development of Internet technology, a computer extracts the text content, assigns the same or similar data to characters or words with the same or similar features, digitizes the characters or words accordingly, calculates data representing the text's features from these character or word feature data, and then compares the data representing the two texts' features to obtain a similarity value that measures the similarity between the two texts.
  • the features here can be characters or words, and the meaning they are intended to express.
  • the implementation terminal of this application can be any device with computing and processing functions.
  • the device can also be connected to an external device for data transmission.
  • It can be a portable mobile device, such as a smartphone, a tablet computer, a notebook computer, or a PDA (Personal Digital Assistant); it can also be a fixed device, such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or it can be a collection of multiple devices, such as the physical infrastructure of cloud computing.
  • Fig. 1 is a schematic diagram showing a system architecture of a method for acquiring text similarity according to an exemplary embodiment.
  • the server 120 is the implementation terminal of the application.
  • the server 120 and the database 110 are connected through a communication link, so that the server 120 can perform access operations on the data stored in the database 110.
  • the database 110 stores the pre-placed text and the trained word segmentation model.
  • the user terminal 130 may send a network request to the server 120, and the server 120 will return a corresponding response to the user terminal 130 according to the received network request.
  • the server 120 processes the network request and obtains what is required by the network request.
  • the text and word segmentation model are then obtained from the database 110 and returned to the user terminal 130.
  • the user terminal 130 stores the program code.
  • the user terminal 130 includes a processing unit and a storage unit.
  • the storage unit stores computer-readable instructions which, when executed by the processing unit, realize the following steps: splicing two texts to be compared for similarity to form a spliced text, the two texts respectively forming a first text segment and a second text segment in the spliced text; performing character segmentation and vectorization processing on the spliced text to obtain the word vector of each word in the spliced text; for each word in the spliced text, using the word vector of each word to calculate the feature vector of each word, the feature vector of each word representing the similar features of each word and the spliced text; and using the feature vector of each word in the first text segment and the feature vector of each word in the second text segment to calculate the similarity between the first text segment and the second text segment, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  • Fig. 1 shows only one embodiment of the present application. Although in the embodiment shown in Fig. 1 the text and the word segmentation model are stored in a database connected to the implementation terminal, the implementation terminal is a server, and the user terminal is a desktop computer, the text and word segmentation model may be stored in various locations, such as a local storage space; the implementation terminal of this application can be any of the device types mentioned above; and the user terminal can also be any of a variety of terminal devices, for example, a smartphone. Therefore, this application places no limitation on this, and the protection scope of this application should not be restricted in any way.
  • Fig. 2 is a flowchart showing a method for acquiring text similarity according to an exemplary embodiment. As shown in Figure 2, it includes the following steps:
  • step 210 two texts to be compared for similarity are spliced to form spliced text, and the two texts respectively form a first text segment and a second text segment in the spliced text.
  • Text segment, text, and spliced text all refer to text fields that can express meaning.
  • the text usually consists of multiple words that can express meaning.
  • the similarity analysis of text is usually carried out between two texts, and a model that vectorizes text, such as the Bert model, usually accepts only a single text input.
  • the two texts need to be spliced to form spliced text.
  • to splice the two texts, the first text may be spliced in front of the second text, so that the first word of the second text follows the end of the first text; alternatively, the first text may be spliced after the second text. After splicing, the two texts respectively form the first text segment and the second text segment of the resulting spliced text.
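As a minimal sketch of this splicing step (the function name and sample strings are illustrative, not from the application):

```python
def splice_texts(text_a, text_b):
    """Splice two texts to be compared: text_a forms the first text
    segment and text_b the second text segment of the spliced text."""
    spliced = text_a + text_b
    first_segment = (0, len(text_a))               # index range of segment 1
    second_segment = (len(text_a), len(spliced))   # index range of segment 2
    return spliced, first_segment, second_segment

spliced, seg1, seg2 = splice_texts("今天天气很好", "今天天气如何")
```

Splicing the second text in front is equally valid; only the segment bookkeeping changes.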
  • Step 220 Perform character segmentation and vectorization processing on the spliced text, and obtain the word vector of each word in the spliced text.
  • Concatenated text is composed of words that can express meaning.
  • it is necessary to perform vectorization processing on each word in the spliced text to form a word vector that characterizes the character feature of each word in the spliced text.
  • the meaning that the spliced text can express is related to the meaning of each word in the spliced text and the word order position of each word in the spliced text.
  • the meaning of each word and the word order position of each word in the spliced text are first vectorized to form a character vector representing the character features of each word in the spliced text. Therefore, the character features of each word in the spliced text are related to the meaning of each word and to the word order position of each word in the spliced text; that is, the word vector of each word in the spliced text is related to the character features of each word, namely to its meaning and to its word order position in the spliced text.
  • vectorization of each word in the spliced text includes:
  • Step 221 For each word in the spliced text, use the Bert model to perform vectorization processing on each word to obtain the word meaning vector of each word.
  • since the Bert model accepts only a single text input, when using the Bert model to perform vectorization processing, each word obtained through the character segmentation processing is input into the Bert model to obtain the word meaning vector of each word.
  • Step 222 Input each word meaning vector into the LSTM network model, and obtain a word vector containing both the word meaning of each word and the word order position of each word in the spliced text where it is located.
  • the LSTM (Long Short-Term Memory) network is an improved recurrent neural network model. It uses a forget gate to decide which information should be filtered out, an input gate to determine the current input information and current state, and an output gate to determine the output.
  • the context information of the spliced text is learned through these gates, so that timing information is added to the obtained spliced text information.
  • the LSTM network model can re-encode the input word meaning vectors, thereby adding timing information to them and obtaining word vectors that simultaneously express the meaning of each word and each word's word order position in the spliced text.
  • the word vector of each character can represent the character feature of each character.
  • the character characteristics of each word are related to the meaning of each word and the word order of each word in the spliced text.
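A minimal NumPy sketch of this LSTM re-encoding idea; the single-layer unidirectional pass, the stacked gate-weight layout, and the random weights are assumptions for illustration, not the application's actual model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(meaning_vectors, W, U, b):
    """Re-encode per-word meaning vectors so each output also reflects
    word order: one LSTM step per word, with state carried forward."""
    d = meaning_vectors.shape[1]
    h = np.zeros(d)                      # hidden state
    c = np.zeros(d)                      # cell state
    outputs = []
    for x in meaning_vectors:            # one word at a time, in order
        z = W @ x + U @ h + b            # all four gate pre-activations
        f = sigmoid(z[:d])               # forget gate: what to filter out
        i = sigmoid(z[d:2 * d])          # input gate: what to write
        o = sigmoid(z[2 * d:3 * d])      # output gate: what to emit
        g = np.tanh(z[3 * d:])           # candidate cell content
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)             # word vectors with order info

rng = np.random.default_rng(0)
d = 4
vecs = rng.normal(size=(3, d))           # 3 word meaning vectors of size d
W, U = rng.normal(size=(4 * d, d)), rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
word_vectors = lstm_encode(vecs, W, U, b)
```

Because the state is carried word by word, the output for a word depends on the words before it, which is how the word order position enters the word vector.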
  • Step 230 For each word in the spliced text, use the word vector of each word to calculate the feature vector of each word; the feature vector of each word represents the similar features between each word and every word in the spliced text.
  • the word vector of each word in the spliced text can represent the characteristics of each word.
  • the feature vector of each word can be obtained respectively.
  • the feature vector of each word represents the similar characteristics of each word and each word in the spliced text.
  • the similar characteristics of each word and each word in the spliced text can represent the similar characteristics of each word and the spliced text, so the feature vector of each word in the spliced text can represent the similar characteristics of each word and the spliced text.
  • using the word vector of each word to calculate the feature vector of each word, where the feature vector of each word represents the similar features between each word and the spliced text, includes:
  • Step 231 For each word in the spliced text, use the word vector of each word to calculate a number of regular weights respectively representing the similar features of each word and each word in the spliced text; each regular weight represents the similar features of each word and one word in the spliced text.
  • the calculation steps of the regular weight include:
  • Step 2311 For each word in the spliced text, cross-multiply the word vector of each word with the transposed vector of the word vector of every word in the spliced text to obtain a number of regular values of each word; each regular value of each word represents the similar features of each word and one word in the spliced text.
  • for each word in the text, cross-multiplying the word vector of each word with the transposed vector of one word vector in the spliced text yields a regular value representing the similar features of the two words corresponding to the two word vectors under calculation; repeating this for every word vector in the spliced text yields several regular values for each word, each representing the similar features of that word and one word in the spliced text.
  • the word vectors of the words in the spliced text represent the features of those words. For each word in the spliced text, after the word vector of each word is cross-multiplied with the transposed vector of a word vector in the spliced text, the similar features of the two words are enhanced and the dissimilar features are weakened. Therefore, each calculated regular value of each word can represent the similar features of the two words under calculation.
  • Step 2312 Divide all the regular values of each word by a set value to obtain a number of regular weights; each regular weight represents the similar features of each word and one word in the spliced text, and the sum of the regular weights is 1.
  • the regular weights of each word, calculated from the regular values of each word, can likewise represent the similar features of each word and one word in the spliced text.
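Steps 2311 and 2312 can be sketched as follows; the softmax is one concrete choice for the division that makes each word's weights sum to 1 (the text does not fix the exact normalizer, so this is an assumption):

```python
import numpy as np

def regular_weights(word_vectors):
    """Regular weights: scores[i, j] is the word vector of word i
    cross-multiplied with the transpose of word j's word vector;
    each row is then normalized so its weights sum to 1."""
    scores = word_vectors @ word_vectors.T        # regular values
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # regular weights

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
weights = regular_weights(vecs)
```

Words with similar vectors receive larger mutual weights, which is how similar features are enhanced and dissimilar features weakened.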
  • Step 232 Cross-multiply the word vector of each word in the spliced text with the corresponding regular weight of each word to obtain a number of vectors representing the similar features of each word and each word in the spliced text; each of the vectors represents a similar feature between each word and one word in the spliced text.
  • each regular weight of each word corresponds to a word in the spliced text: the word for which that regular weight was calculated. Conversely, each word in the spliced text corresponds to one regular weight of each word.
  • the word vector of each word in the spliced text is cross-multiplied with the corresponding regular weight of each word, yielding several vectors that represent the similar features of each word and each word in the spliced text; each vector represents the similar features of each word and one word in the spliced text.
  • since the regular weight of each word can express the similar features of each word and each word in the spliced text, the vector calculated from a regular weight of each word and the word vector of one word in the text can likewise express the similar features of each word and that word.
  • Step 233 Add the obtained vectors to obtain the feature vector of each word; the feature vector of each word represents the similar features between each word and the spliced text.
  • since the feature vector of each word can express the similar features of each word and every word in the spliced text, and the combination of the features of all words in the spliced text constitutes the features of the spliced text, the feature vector of each word can represent the similar features of each word and the spliced text.
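Steps 232 and 233 reduce to a single matrix product; this sketch assumes a row-normalized weight matrix where `weights[i, j]` is word i's regular weight for word j (names are illustrative):

```python
import numpy as np

def feature_vectors(word_vectors, weights):
    """For word i: cross-multiply every word vector by word i's regular
    weight for that word, then add the scaled vectors; row i of the
    result is word i's feature vector."""
    return weights @ word_vectors

word_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.array([[0.75, 0.25], [0.25, 0.75]])   # rows sum to 1
feats = feature_vectors(word_vecs, weights)
```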
  • Step 240 Use the feature vector of each word in the first text segment and the feature vector of each word in the second text segment to calculate the similarity between the first text segment and the second text segment to obtain the first text The similarity value of the similarity between the segment and the second text segment.
  • the feature vector of each word represents the similar features of each word and the spliced text. Because each word's similarity to the spliced text differs, the feature vectors of the words differ. The feature vector of each word in the first text segment and the feature vector of each word in the second text segment can therefore be used to calculate the similarity between the first text segment and the second text segment.
  • using the feature vector of each word in the first text segment of the spliced text and the feature vector of each word in the second text segment of the spliced text to calculate the similarity between the first text segment and the second text segment includes:
  • Step 241 Take a specific value from the feature vector of each word in the first text segment to form a first similarity vector, and take a specific value from the feature vector of each word in the second text segment to form a second similarity vector.
  • the specific value includes the maximum value in the feature vector of each word. For each word in the first text segment and each word in the second text segment, since the purpose of calculating the feature vector of each word is to enhance the similar features of each word and the spliced text while attenuating the dissimilar features, the maximum value in the feature vector of each word best represents the similar features of each word and the spliced text.
  • alternatively, the average value of all data in the feature vector of each word is used as the specific value; or both the maximum value and the average value of all data in the feature vector of each word are used as the specific values.
  • taking a specific value from the feature vector of each word in the first text segment forms the first similarity vector; taking a specific value from the feature vector of each word in the second text segment forms the second similarity vector.
  • in the first similarity vector and the second similarity vector, the similar features of the first text segment and the second text segment are respectively strengthened and the dissimilar features are weakened, so both vectors can express the similarity between the first text and the second text.
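Step 241 as a sketch, using the maximum of each word's feature vector as the specific value (the mean, or both, are the alternatives the text mentions):

```python
import numpy as np

def similarity_vector(segment_features):
    """One entry per word in the segment: the maximum value of that
    word's feature vector, taken to best represent its similar features."""
    return segment_features.max(axis=1)

seg1_features = np.array([[0.2, 0.8], [0.5, 0.1]])  # 2 words, 2-dim features
v1 = similarity_vector(seg1_features)
```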
  • Step 242 Divide the Euclidean distance of the first similarity vector and the second similarity vector by the sum of the modulus of the first similarity vector and the modulus of the second similarity vector to obtain the first text The similarity value of the similarity between the segment and the second text segment.
  • the Euclidean distance of the first similarity vector and the second similarity vector is divided by the sum of the modulus of the first similarity vector and the modulus of the second similarity vector; the calculation formula is:
  • D_W = ||A − B|| / (||A|| + ||B||)
  • where A is the first similarity vector, B is the second similarity vector, ||A − B|| is the Euclidean distance between the first similarity vector A and the second similarity vector B, ||A|| is the modulus of the first similarity vector A, ||B|| is the modulus of the second similarity vector B, and D_W is the similarity value indicating the similarity between the first text segment and the second text segment.
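The D_W of step 242 can be computed directly; this sketch assumes equal-length similarity vectors:

```python
import numpy as np

def similarity_value(a, b):
    """D_W: Euclidean distance between the two similarity vectors,
    divided by the sum of their moduli. 0 means identical vectors;
    the triangle inequality bounds the result by 1."""
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

a = np.array([0.8, 0.5])
b = np.array([0.8, 0.5])
```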
  • the acquired similarity value representing the similarity between the first text segment and the second text segment has a high degree of accuracy.
  • the method further includes:
  • Step 250 Input the similarity value into the error model to obtain the difference between the similarity value and the true value representing the true similarity between the first text segment and the second text segment, so as to evaluate the accuracy of the similarity value.
  • the error model is:
  • E = Y · D_W² + (1 − Y) · [max(m − D_W, 0)]²
  • where Y is a set value: when the first text segment is similar to the second text segment, Y is 1; when the first text segment is not similar to the second text segment, Y is 0; m is 1; D_W is the similarity value representing the similarity between the first text segment and the second text segment, and D_W² is the square of D_W; E represents the difference between the similarity value and the true value representing the true similarity between the first text segment and the second text segment.
  • when the two text segments are not truly similar but the similarity value D_W is well below 1, the difference between the similarity value and the true value representing the true similarity between the first text segment and the second text segment is large.
  • a large difference between the calculated similarity value and the true value means that the similarity value obtained by the above calculation steps is inaccurate and cannot truly reflect the similarity between the first text segment and the second text segment.
  • In this case, the gradient descent method is used to retrain the word segmentation model and the vectorization processing model; the retrained models are used to segment words and obtain the word vector of each word, and steps 210 to 240 are repeated with these word vectors to recalculate a similarity value representing the similarity between the first text segment and the second text segment, which is again evaluated with the error model. In this way, several rounds of model training and similarity calculation are performed until the error obtained with the error model is less than a set value close to zero.
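The symbol definitions above match the standard contrastive-loss form; a sketch under that assumption (the function name is illustrative):

```python
def error_model(dw, y, m=1.0):
    """y = 1 when the two text segments are truly similar, 0 otherwise;
    dw is the computed similarity value D_W; m is the margin, set to 1.
    Returns the difference between dw and the true similarity: for a
    similar pair the error grows with dw, for a dissimilar pair it
    grows as dw falls below the margin."""
    return y * dw ** 2 + (1 - y) * max(m - dw, 0.0) ** 2
```

Training drives this error toward zero, which pushes D_W toward 0 for similar pairs and toward at least m for dissimilar pairs.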
  • An embodiment of the present application also provides a text similarity acquisition device, as shown in FIG. 8, including:
  • the preprocessing module 310 is configured to splice two texts to be compared for similarity to form a spliced text, and the two texts respectively form a first text segment and a second text segment in the spliced text.
  • the vectorization processing module 320 is configured to perform character segmentation and vectorization processing on the spliced text, and obtain the word vector of each word in the spliced text.
  • the first calculation module 330 is configured to use, for each word in the spliced text, the word vector of each word to calculate the feature vector of each word; the feature vector of each word represents the similar features between each word and the spliced text.
  • the second calculation module 340 is configured to use the feature vector of each character in the first text segment and the feature vector of each character in the second text segment to calculate the similarity between the first text segment and the second text segment, Obtain a similarity value representing the similarity between the first text segment and the second text segment.
  • the vectorization processing module 320 includes:
  • the segmentation sub-module 321 is configured to perform character segmentation on the spliced text to obtain each word contained in the spliced text; the processing sub-module 322 is configured to perform vectorization processing on each word in the spliced text to obtain a character vector that characterizes the features of each word.
  • processing sub-module 322 includes:
  • the word meaning vector unit 3221 is configured to perform vectorization processing on each word using the Bert model for each word in the spliced text to obtain the word meaning vector of each word; the position unit 3222 is configured to The word meaning vectors are input into the LSTM network model, and the word vectors that simultaneously express the word meaning of each word and the word order position of each word in the spliced text are obtained.
  • the first calculation module 330 includes:
  • the weight calculation sub-module 331 is configured to use, for each word in the spliced text, the word vector of each word to calculate a number of regular weights of each word representing the similar features of each word and each word in the spliced text; each regular weight of each word represents the similar features of each word and one word in the spliced text.
  • the cross multiplication sub-module 332 is configured to cross-multiply the word vector of each word in the spliced text with the corresponding regular weight of each word to obtain a number of vectors representing the similar features of each word and each word in the spliced text; each of the vectors represents the similar features of each word and one word in the spliced text.
  • the addition sub-module 333 is configured to add the obtained vectors to obtain the feature vector of each word; the feature vector of each word represents the similar features between each word and the spliced text.
  • the weight calculation sub-module 331 includes:
  • the first cross multiplication unit 3311 is configured to cross-multiply, for each word in the spliced text, the word vector of each word with the transposed vector of the word vector of every word in the spliced text to obtain several regular values of each word; each regular value of each word represents the similar features of each word and one word in the spliced text. The dividing unit 3312 is configured to divide all the regular values of each word by a set value to obtain a number of regular weights of each word; each regular weight of each word represents the similar features of each word and one word in the spliced text, and the sum of the regular weights is 1.
  • the second calculation module 340 includes:
  • the similarity vector calculation sub-module 341 is configured to take a specific value from the feature vector of each word in the first text segment to form a first similarity vector, and to take a specific value from the feature vector of each word in the second text segment to form a second similarity vector;
  • the division sub-module 342 is configured to divide the Euclidean distance of the first similarity vector and the second similarity vector by the sum of the modulus of the first similarity vector and the modulus of the second similarity vector to obtain a similarity value representing the similarity between the first text segment and the second text segment.
  • the apparatus for acquiring text similarity further includes:
  • the evaluation module 350 is configured to, after the calculation obtains the similarity value representing the similarity between the first text segment and the second text segment, use an error model to input the similarity value into the error model to obtain the similarity The difference between the value and the true value representing the true similarity between the first text segment and the second text segment to evaluate the accuracy of the similarity value.
  • the error model is:
  • L(Y, D_W) = Y · D_W² + (1 − Y) · max(0, m − D_W)²
  • Y is a set value: when the first spliced text is similar to the second spliced text, Y is 1; when the first spliced text is not similar to the second spliced text, Y is 0.
  • m is 1.
  • D_W is the similarity value representing the similarity between the first spliced text and the second spliced text, and D_W² is its square.
  • the electronic device 700 according to this embodiment of the present invention will be described below with reference to FIG. 10.
  • the electronic device 700 shown in FIG. 10 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the electronic device 700 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 700 may include, but are not limited to: the aforementioned at least one processing unit 710, the aforementioned at least one storage unit 720, and a bus 730 connecting different system components (including the storage unit 720 and the processing unit 710).
  • the storage unit stores program code executable by the processing unit 710, causing the processing unit 710 to perform the steps according to the various exemplary embodiments described in the "Methods of Embodiments" section of this specification.
  • the storage unit 720 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 721 and/or a cache storage unit 722, and may further include a read-only storage unit (ROM) 723.
  • the storage unit 720 may also include a program/utility tool 724 having a set of (at least one) program modules 725, such program modules 725 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
  • the bus 730 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 700 may also communicate with one or more external devices 900 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (such as a router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 750.
  • the electronic device 700 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 760.
  • the network adapter 760 communicates with other modules of the electronic device 700 through the bus 730.
  • although not shown, other hardware and/or software modules can be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the above-mentioned method of this specification; the computer-readable storage medium may be a computer non-volatile readable storage medium.
  • various aspects of the present invention may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification.
  • An embodiment of the present application provides a program product for implementing the above method, which can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can run on a terminal device, such as a personal computer.
  • the program product of the present invention is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code used to perform the operations of the present invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Abstract

A text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium, relating to the field of machine learning. The method includes: splicing two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text (210); performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text (220); for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text (230); and using the feature vectors of the characters in the first text segment and of the characters in the second text segment to compute the similarity between the two segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment (240). This method improves the accuracy of text similarity acquisition.

Description

Text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910980271.0, filed on October 15, 2019 and entitled "Text similarity acquisition method, apparatus, medium, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of machine learning, and in particular to a text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium.
Background
In big-data processing, in order to analyze the similarity between different pieces of written content, different texts need to be collected and subjected to similarity processing, grouping texts with similar content into one class, so that the situations presented by similar texts can be handled uniformly, improving the efficiency of handling emergencies.
The inventor of this application realized that existing text similarity processing techniques, owing to their weak sentence representation ability and the simple processing methods they adopt, usually produce inaccurate text similarity results, causing improper subsequent processing of the text.
Summary
To solve the above technical problems, one object of this application is to provide a text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium.
The technical solutions adopted by this application are as follows:
In one aspect, a text similarity acquisition method includes: splicing two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text; performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text; for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text; and using the feature vectors of the characters in the first text segment and in the second text segment to compute the similarity between the two segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
In another aspect, a text similarity acquisition apparatus includes: a preprocessing module, configured to splice two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text; a vectorization processing module, configured to perform character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text; a first calculation module, configured to, for each character in the spliced text, use the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text; and a second calculation module, configured to use the feature vectors of the characters in the first text segment and in the second text segment to compute the similarity between the two segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
In another aspect, an electronic device includes: a processing unit; and a storage unit on which computer-readable instructions are stored, the computer-readable instructions implementing the above method when executed by the processing unit.
In another aspect, a computer-readable storage medium stores computer program instructions which, when executed by a computer, cause the computer to execute the above method.
In the above technical solutions, the computed feature vectors represent the similar features between each character of the spliced text and the spliced text itself, which gives the similarity value computed from them, representing the similarity between the first text segment and the second text segment, a high accuracy.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit this application.
Brief Description of the Drawings
The drawings here are incorporated into and form part of this specification, show embodiments consistent with this application, and together with the specification serve to explain the principles of this application.
FIG. 1 is a schematic diagram of the system architecture of a text similarity acquisition method according to an exemplary embodiment;
FIG. 2 is a flowchart of a text similarity acquisition method according to an exemplary embodiment;
FIG. 3 is a detailed flowchart of step 220 of an embodiment according to the embodiment corresponding to FIG. 2;
FIG. 4 is a detailed flowchart of step 230 of an embodiment according to the embodiment corresponding to FIG. 2;
FIG. 5 is a detailed flowchart of step 231 of an embodiment according to the embodiment corresponding to FIG. 4;
FIG. 6 is a detailed flowchart of step 240 of an embodiment according to the embodiment corresponding to FIG. 2;
FIG. 7 is a flowchart of the steps following step 240 of an embodiment according to the embodiment corresponding to FIG. 2;
FIG. 8 is a block diagram of a text similarity acquisition apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram of a text similarity acquisition apparatus according to an exemplary embodiment;
FIG. 10 is an example block diagram of an electronic device implementing the above text similarity acquisition method according to an exemplary embodiment.
The above drawings show explicit embodiments of this application, which are described in more detail hereinafter. These drawings and textual descriptions are not intended to limit the scope of the concept of this application in any way, but to explain the concept of this application to those skilled in the art by reference to specific embodiments.
Detailed Description
Exemplary embodiments are described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
In addition, the drawings are merely schematic illustrations of this application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so repeated descriptions of them are omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.
This application first provides a text similarity acquisition method. A text refers to a passage of Chinese or foreign-language characters capable of expressing meaning. Because there are many ways of expressing things in writing, texts composed of different written content may differ in wording while the meanings they express may be similar or identical. With the rapid development of Internet technology, a computer can extract written content to obtain texts, assign data of identical or close magnitude to characters with identical or close features, perform digitized feature extraction on the characters of a text, compute data representing a text's features from the character feature data, and then compute over the data representing the features of two texts to obtain a similarity value measuring the similarity between them. A feature here can be the meaning the characters are intended to express. Through similarity computation, identical or similar texts can be grouped into one class, so that the same situation presented by similar texts can be handled uniformly.
The implementation terminal of this application may be any device with computing and processing functions, and the device may also be connected to external devices to transmit data. It may be a portable mobile device, such as a smartphone, tablet computer, laptop computer, or PDA (Personal Digital Assistant); a fixed device, such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or a collection of devices, such as the physical infrastructure of cloud computing.
FIG. 1 is a schematic diagram of the system architecture of a text similarity acquisition method according to an exemplary embodiment. As shown in FIG. 1, it includes a database 110, a server 120, and a user terminal 130. In this embodiment, the server 120 is the implementation terminal of this application. The server 120 is connected to the database 110 by a communication link, so that the server 120 can read and write the data stored in the database 110; the database 110 holds pre-stored texts and a trained character-segmentation model. A communication link also connects the server 120 and the user terminal 130: the user terminal 130 can send a network request to the server 120, and the server 120 returns a corresponding response according to the received request, specifically by processing the request, obtaining the text and character-segmentation model it requires from the database 110, and returning them to the user terminal 130. The user terminal 130 stores program code and includes a processing unit and a storage unit; the storage unit stores computer-readable instructions which, when executed by the processing unit, can implement the steps of: splicing two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text; performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text; for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text; and using the feature vectors of the characters in the first and second text segments to compute the similarity between the two segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
It is worth mentioning that FIG. 1 is only one embodiment of this application. Although in the embodiment shown in FIG. 1 the texts and the character-segmentation model are stored in a database connected to the implementation terminal of this application, the implementation terminal is a server, and the user terminal is a desktop computer, in practice the texts and the character-segmentation model may be stored in various locations, such as local storage; the implementation terminal of this application may be any of the various devices described above; and the user terminal may be any of various terminal devices, such as a smartphone. This application therefore imposes no limitation in this respect, and its scope of protection should not be limited thereby.
FIG. 2 is a flowchart of a text similarity acquisition method according to an exemplary embodiment. As shown in FIG. 2, the method includes the following steps:
Step 210: splice the two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text.
A text segment, a text, and a spliced text all refer to a passage of characters capable of expressing meaning. A text is usually composed of multiple characters capable of expressing meaning. Text similarity analysis is usually performed between two texts, whereas models that vectorize text, such as the Bert model, usually accept only a single text as input.
In order to feed the two texts to be compared into the vectorization model, the two texts need to be spliced into a spliced text. The two texts may be spliced by placing the first text in front of the second text, so that the first character of the second text follows the first text; alternatively, the first text may be appended after the second text. After splicing, the two texts respectively form the first text segment and the second text segment of the resulting spliced text.
Step 220: perform character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text.
The spliced text is composed of characters capable of expressing meaning. Before similarity processing is performed on the spliced text, each character in it must be vectorized into a word vector representing that character's features.
The meaning a spliced text can express depends on the meaning of each character and on each character's word-order position in the spliced text. Similarity processing therefore first vectorizes both the meaning of each character and its word-order position, producing a word vector representing each character's features. A character's features thus depend on the character's meaning and on its word-order position in the spliced text, and after vectorization, each character's word vector depends on those same features.
Before the characters of the spliced text are vectorized, the spliced text must be segmented into characters to obtain the individual characters it contains.
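As a minimal sketch of the splicing and character segmentation described in steps 210 and 220, the preprocessing might look as follows; the function name and the example texts are illustrative assumptions, not part of the patent:

```python
def splice_and_segment(text_a, text_b):
    """Splice two texts into one spliced text and segment it into characters.

    Returns the character list of the spliced text plus the index ranges of
    the first and second text segments, so later steps can pool per segment.
    """
    spliced = text_a + text_b          # first text in front, second appended
    chars = list(spliced)              # character-level segmentation
    seg1 = range(0, len(text_a))       # indices of the first text segment
    seg2 = range(len(text_a), len(spliced))  # indices of the second segment
    return chars, seg1, seg2

chars, seg1, seg2 = splice_and_segment("今天天气", "天气不错")
```

Keeping the two segments' index ranges alongside the spliced character list is one way to let the later per-segment pooling (step 241) know which feature vectors belong to which text.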
As shown in FIG. 3, vectorizing the characters in the spliced text includes:
Step 221: for each character in the spliced text, vectorize the character with the Bert model to obtain the character's meaning vector.
The Bert model accepts only a single text as input. When vectorizing with the Bert model, the characters obtained by character segmentation are fed into the Bert model to obtain each character's meaning vector.
Step 222: input each meaning vector into an LSTM network model to obtain a word vector that simultaneously contains each character's meaning and its word-order position in the spliced text.
An LSTM network (Long Short-Term Memory network) is an improved recurrent neural network model: a forget gate decides which information is to be filtered out, an input gate determines the current input and the current state, and an output gate determines the output. Through these gates it learns the context of the spliced text, thereby adding temporal information to the spliced-text information already obtained.
The meaning vectors obtained from the Bert model's vectorization are input into the LSTM network model, which can re-encode them, adding temporal information and yielding word vectors that simultaneously express each character's meaning and each character's word-order position in the spliced text.
Each character's word vector represents that character's features, which depend on the character's meaning and on its word-order position in the spliced text.
Step 230: for each character in the spliced text, use the character's word vector to compute the character's feature vector; the feature vector represents the similar features between the character and the characters of the spliced text.
Each character's word vector represents that character's features. Computing with the word vectors of the characters of the spliced text yields each character's feature vector. For each character, the feature vector represents the similar features between that character and the characters of the spliced text; since those in turn represent the similar features between the character and the spliced text, every character's feature vector represents the similar features between that character and the spliced text.
As shown in FIG. 4, for each character in the spliced text, computing the character's feature vector from the word vectors includes:
Step 231: for each character in the spliced text, use the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text.
As shown in FIG. 5, computing the regular weights includes:
Step 2311: for each character in the spliced text, cross-multiply the character's word vector with the transposed word vectors of the characters of the spliced text to obtain several regular values for that character; each regular value represents the similar features between the character and one character of the spliced text.
Cross-multiplying a character's word vector with the transposed word vector of one character of the spliced text yields a regular value representing the similar features between the two characters corresponding to the two word vectors involved. Doing this for every character of the spliced text yields several regular values per character, each representing the similar features between the character and one character of the spliced text.
The word vectors of the spliced text all represent their characters' features. After a character's word vector is cross-multiplied with the transposed word vector of a character of the spliced text, the features the two characters share are strengthened and the dissimilar features are weakened, so each computed regular value represents the similar features of the two characters involved.
Step 2312: divide all of a character's regular values by a set value to obtain several regular weights; each regular weight represents the similar features between the character and one character of the spliced text, and the regular weights sum to 1.
Dividing all of a character's regular values by a set value yields that character's regular weights, and the set value is chosen precisely so that the resulting regular weights sum to 1.
Since each regular value represents the similar features between a character and one character of the spliced text, each regular weight computed from a regular value likewise represents the similar features between the character and one character of the spliced text.
Step 232: cross-multiply the word vector of each character of the spliced text with the corresponding regular weight, obtaining several vectors, each representing the similar features between the character in question and one character of the spliced text.
For each character of the spliced text, since each of its regular weights was computed from its own word vector and the word vector of one character of the spliced text, each regular weight corresponds to a character of the spliced text, namely the character whose word vector was used to compute that weight; conversely, each character of the spliced text corresponds to one of the regular weights.
Cross-multiplying the word vector of each character of the spliced text with the corresponding regular weight yields several vectors, each representing the similar features between the character in question and one character of the spliced text: since the regular weight represents the similar features between the character and a character of the spliced text, the vector computed from that weight and a character's word vector also represents the similar features between the two characters.
Step 233: add the obtained vectors to obtain the character's feature vector; the feature vector represents the similar features between the character and the spliced text.
For each character of the spliced text, since the feature vector is the sum of vectors that each represent the similar features between the character and one character of the spliced text, the feature vector represents the similar features between the character and all characters of the spliced text; and since the combined features of all its characters constitute the features of the spliced text, the feature vector represents the similar features between the character and the spliced text.
For each character of the spliced text, in the computed feature vector, the features the character shares with the characters of the spliced text are strengthened, and the dissimilar features are weakened.
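The regular-value / regular-weight / weighted-sum computation of steps 231-233 is essentially an attention-style weighting over the characters of the spliced text. A minimal numpy sketch, under the assumptions that the "set value" of step 2312 is the row sum of the regular values (so each character's weights sum to 1) and that the tiny word vectors are purely illustrative:

```python
import numpy as np

def feature_vectors(word_vecs):
    """Compute each character's feature vector from the word vectors.

    word_vecs: (n, d) array, one row per character of the spliced text.
    The regular value of character i w.r.t. character j is the cross
    product v_i · v_j^T; dividing each row of regular values by its sum
    (the assumed 'set value') gives regular weights summing to 1; the
    feature vector is the weighted sum of all character word vectors.
    Note: this toy normalization assumes non-negative regular values.
    """
    scores = word_vecs @ word_vecs.T                      # regular values, (n, n)
    weights = scores / scores.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ word_vecs                            # (n, d) feature vectors

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
F = feature_vectors(V)
```

In this sketch a character whose word vector points the same way as many others ends up with a feature vector pulled toward those shared directions, which is the "similar features strengthened, dissimilar features weakened" effect described above.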
Step 240: use the feature vectors of the characters in the first text segment and of the characters in the second text segment to compute the similarity between the first and second text segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
Each character's feature vector represents the similar features between that character and the spliced text; the characters differ in how similar they are to the spliced text, so their feature vectors differ. Using the feature vectors of the characters in the spliced text's first text segment and of the characters in its second text segment, the similarity between the two segments can be computed.
As shown in FIG. 6, computing the similarity between the first and second text segments from the feature vectors of the characters in each segment includes:
Step 241: take specific values from the feature vectors of the characters of the first text segment to form a first similarity vector, and take specific values from the feature vectors of the characters of the second text segment to form a second similarity vector.
The specific value includes the maximum value in each character's feature vector. For each character in the first and second text segments, because the purpose of computing the character's enhanced (feature) vector is to strengthen the features the character shares with the spliced text and weaken the dissimilar ones, the maximum value in the character's enhanced vector best represents the similar features between the character and the spliced text.
In one embodiment, the average of all values in each character's enhanced vector is used as the specific value of that character's enhanced vector.
In one embodiment, both the maximum value in each character's enhanced vector and the average of all its values are used as the specific values.
Taking the specific values from the feature vectors of the characters of the first text segment forms the first similarity vector; taking the specific values from the feature vectors of the characters of the second text segment forms the second similarity vector. In the first and second similarity vectors, the features in which the first and second text segments are similar have been strengthened and the dissimilar features weakened, so both similarity vectors can represent the similarity of the first and second texts.
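The pooling of "specific values" in step 241 can be sketched as follows; the `mode` parameter is an assumption used here to select between the maximum, the mean, or both, matching the alternative embodiments described above:

```python
import numpy as np

def similarity_vector(seg_features, mode="max"):
    """Form a segment's similarity vector from its characters' feature vectors.

    seg_features: (n_chars, d) array of feature vectors for one text segment.
    Each character contributes one scalar (or two, in 'both' mode): the
    maximum and/or the mean of its feature vector, per the embodiments.
    """
    if mode == "max":
        return seg_features.max(axis=1)    # one scalar per character
    if mode == "mean":
        return seg_features.mean(axis=1)
    # any other mode: max values followed by mean values
    return np.concatenate([seg_features.max(axis=1), seg_features.mean(axis=1)])

seg1_features = np.array([[1.0, 3.0],
                          [2.0, 0.0]])
A = similarity_vector(seg1_features)
```

Pooling per character keeps the similarity vector's length tied to the segment length, so comparing two segments of equal length yields vectors of matching dimension.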
Step 242: divide the Euclidean distance between the first similarity vector and the second similarity vector by the sum of the modulus of the first similarity vector and the modulus of the second similarity vector, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
The calculation formula for dividing the Euclidean distance between the first and second similarity vectors by the sum of their moduli is:
D_W = ||A − B|| / (||A|| + ||B||)
In the formula, A is the first similarity vector, B is the second similarity vector, ||A − B|| is the Euclidean distance between the first similarity vector A and the second similarity vector B, ||A|| is the modulus of A, ||B|| is the modulus of B, and D_W is the similarity value representing the similarity between the first and second text segments.
Because the first and second similarity vectors respectively strengthen the features in which the first and second text segments are similar and weaken the dissimilar features, the similarity value computed from the first and second similarity vectors, representing the similarity of the first and second text segments, has high accuracy.
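Step 242's similarity value can be sketched directly; the numerator is written here as the Euclidean distance ||A − B||, matching the text's description of that term:

```python
import numpy as np

def similarity_value(a, b):
    """D_W = ||A - B|| / (||A|| + ||B||).

    The Euclidean distance between the two similarity vectors, scaled by
    the sum of their moduli; identical vectors give 0, and the triangle
    inequality keeps the value at most 1.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))
```

Under this convention a small D_W means the two text segments are similar, which is consistent with the error model that follows penalizing similar pairs in proportion to D_W².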
As shown in FIG. 7, after the similarity value representing the similarity between the first text segment and the second text segment is obtained, the method further includes:
Step 250: input the similarity value into an error model to obtain the difference between the similarity value and a true value representing the real similarity between the first text segment and the second text segment, so as to evaluate the accuracy of the similarity value.
The error model is:
L(Y, D_W) = Y · D_W² + (1 − Y) · max(0, m − D_W)²
where Y is a set value: when the first text segment is similar to the second text segment, Y is 1; when the first text segment is not similar to the second text segment, Y is 0. m is 1, D_W is the similarity value representing the similarity between the first and second text segments, and D_W² is the square of D_W.
L(Y, D_W) represents the difference between the similarity value and the true value representing the real similarity of the first and second text segments. The similarity value D_W is less than 1. With the error model, the accuracy of the computed similarity value representing the similarity of the first and second text segments can be evaluated. If the computed value of L(Y, D_W) is large, the difference between the similarity value and the true value representing the real similarity of the two segments is large, indicating that the similarity value obtained by the above computation steps is inaccurate and cannot truly reflect the similarity of the first and second text segments.
In that case, gradient descent is used to retrain the character-segmentation model and the vectorization model; the retrained models are then used to segment the characters and obtain each character's word vector, and steps 210-240 are run again with those word vectors to compute the similarity value representing the similarity of the first and second text segments, which is again evaluated with the error model. Several rounds of model training and similarity computation can be carried out in this way, until the value of L(Y, D_W) given by the error model is smaller than a set value close to zero.
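The error model's printed formula survives only as an image in the source; a sketch consistent with the surrounding description (Y = 1 for similar pairs, margin m = 1, similar pairs penalized by D_W² and dissimilar pairs by a margin term) is the standard contrastive loss, which is an assumption here rather than the patent's exact equation:

```python
def contrastive_loss(y, d_w, m=1.0):
    """Contrastive-style error model (assumed form).

    y: 1 if the pair of text segments is truly similar, 0 otherwise.
    d_w: the computed similarity value D_W (a distance, below 1).
    Similar pairs contribute d_w**2; dissimilar pairs contribute
    max(0, m - d_w)**2, pushing their distance out past the margin m.
    """
    return y * d_w ** 2 + (1 - y) * max(0.0, m - d_w) ** 2
```

The loss is zero exactly when a similar pair has D_W = 0 or a dissimilar pair has D_W ≥ m, which matches the retraining criterion above of driving the error below a value close to zero.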
An embodiment of this application further provides a text similarity acquisition apparatus, as shown in FIG. 8, including:
a preprocessing module 310, configured to splice the two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text; a vectorization processing module 320, configured to perform character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text; a first calculation module 330, configured to, for each character in the spliced text, use the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text; and a second calculation module 340, configured to use the feature vectors of the characters in the first text segment and in the second text segment to compute the similarity between the two segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
As shown in FIG. 9, in one embodiment, the vectorization processing module 320 includes:
a segmentation sub-module 321, configured to perform character segmentation on the spliced text to obtain the characters contained in it; and a processing sub-module 322, configured to vectorize each character in the spliced text to obtain a word vector representing the character's features.
In one embodiment, the processing sub-module 322 includes:
a meaning vector unit 3221, configured to vectorize each character in the spliced text with the Bert model to obtain the character's meaning vector; and a position unit 3222, configured to input each meaning vector into an LSTM network model to obtain a word vector that simultaneously expresses the character's meaning and the character's word-order position in the spliced text.
In one embodiment, the first calculation module 330 includes:
a weight calculation sub-module 331, configured to, for each character in the spliced text, use the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text; a cross-multiplication sub-module 332, configured to cross-multiply the word vector of each character of the spliced text with the corresponding regular weight, obtaining several vectors, each representing the similar features between the character in question and one character of the spliced text; and an addition sub-module 333, configured to add the obtained vectors to obtain the character's feature vector, the feature vector representing the similar features between the character and the spliced text.
In one embodiment, the weight calculation sub-module 331 includes:
a first cross-multiplication unit 3311, configured to, for each character in the spliced text, cross-multiply the character's word vector with the transposed word vectors of the characters of the spliced text to obtain several regular values, each regular value representing the similar features between the character and one character of the spliced text; and a division unit 3312, configured to divide all of a character's regular values by a set value to obtain several regular weights, each regular weight representing the similar features between the character and one character of the spliced text, the regular weights summing to 1.
In one embodiment, the second calculation module 340 includes:
a similarity vector calculation sub-module 341, configured to take specific values from the feature vectors of the characters of the first text segment to form a first similarity vector, and to take specific values from the feature vectors of the characters of the second text segment to form a second similarity vector; and a division sub-module 342, configured to divide the Euclidean distance between the first similarity vector and the second similarity vector by the sum of the moduli of the first and second similarity vectors, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
In one embodiment, the text similarity acquisition apparatus further includes:
an evaluation module 350, configured to, after the similarity value representing the similarity between the first text segment and the second text segment is obtained, input the similarity value into an error model to obtain the difference between the similarity value and the true value representing the real similarity between the first text segment and the second text segment, so as to evaluate the accuracy of the similarity value.
In one embodiment, in the evaluation module 350, the error model is:
L(Y, D_W) = Y · D_W² + (1 − Y) · max(0, m − D_W)²
where Y is a set value: when the first spliced text is similar to the second spliced text, Y is 1; when the first spliced text is not similar to the second spliced text, Y is 0. m is 1, D_W is the similarity value representing the similarity between the first and second spliced texts, and D_W² is the square of D_W.
The operations performed by the modules in the above embodiments are the same as those in the method embodiments above and are not repeated here.
The electronic device 700 according to this embodiment of the present invention is described below with reference to FIG. 10. The electronic device 700 shown in FIG. 10 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.
As shown in FIG. 10, the electronic device 700 takes the form of a general-purpose computing device. The components of the electronic device 700 may include, but are not limited to: the above at least one processing unit 710, the above at least one storage unit 720, and a bus 730 connecting different system components (including the storage unit 720 and the processing unit 710). The storage unit stores program code executable by the processing unit 710, causing the processing unit 710 to perform the steps according to the various exemplary embodiments of the present invention described in the "Methods of Embodiments" section of this specification.
The storage unit 720 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 721 and/or a cache storage unit 722, and may further include a read-only storage unit (ROM) 723.
The storage unit 720 may also include a program/utility tool 724 having a set of (at least one) program modules 725, such program modules 725 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 730 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures. The electronic device 700 may also communicate with one or more external devices 900 (such as a keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (such as a router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 750. The electronic device 700 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 760. As shown in the figure, the network adapter 760 communicates with the other modules of the electronic device 700 through the bus 730. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of this application may therefore be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, U disk, mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, server, terminal device, or network device, etc.) to execute the method according to the embodiments of this application.
According to a fourth aspect of this application, a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the above method of this specification; the computer-readable storage medium may be a computer non-volatile readable storage medium. In some possible implementations, aspects of the present invention may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section above.
An embodiment of this application provides a program product for implementing the above method, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. The program code contained on a readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above. Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar. The program code may execute entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider). In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to exemplary embodiments of the present invention and are not for the purpose of limitation. It is readily understood that the processing shown in the above drawings does not indicate or limit the temporal order of the processing; these processes may, for example, be executed synchronously or asynchronously in multiple modules. It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (20)

  1. A text similarity acquisition method, the method comprising:
    splicing two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text;
    performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text;
    for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text;
    using the feature vectors of the characters in the first text segment and of the characters in the second text segment to compute the similarity between the first text segment and the second text segment, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  2. The method according to claim 1, wherein performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text comprises:
    performing character segmentation on the spliced text to obtain the characters contained in it;
    for each character in the spliced text, vectorizing the character to obtain a word vector representing the character's features.
  3. The method according to claim 2, wherein, for each character in the spliced text, vectorizing the character to obtain a word vector representing the character's features comprises:
    for each character in the spliced text, vectorizing the character with the Bert model to obtain the character's meaning vector;
    inputting each meaning vector into an LSTM network model to obtain a word vector that simultaneously expresses the character's meaning and the character's word-order position in the spliced text.
  4. The method according to claim 1, wherein, for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the characters of the spliced text, comprises:
    for each character in the spliced text, using the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text;
    cross-multiplying the word vector of each character of the spliced text with the corresponding regular weight, obtaining several vectors, each of which represents the similar features between the character in question and one character of the spliced text;
    adding the obtained vectors to obtain the character's feature vector, the feature vector representing the similar features between the character and the spliced text.
  5. The method according to claim 4, wherein, for each character in the spliced text, using the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text, comprises:
    for each character in the spliced text, cross-multiplying the character's word vector with the transposed word vectors of the characters of the spliced text to obtain several regular values, each regular value representing the similar features between the character and one character of the spliced text;
    dividing all of the character's regular values by a set value to obtain several regular weights, each regular weight representing the similar features between the character and one character of the spliced text, the regular weights summing to 1.
  6. The method according to claim 1, wherein using the feature vectors of the characters in the first text segment and of the characters in the second text segment to compute the similarity between the first and second text segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment, comprises:
    taking specific values from the feature vectors of the characters of the first text segment to form a first similarity vector, and taking specific values from the feature vectors of the characters of the second text segment to form a second similarity vector, the specific values including values of the feature vectors that meet a predetermined criterion;
    dividing the Euclidean distance between the first similarity vector and the second similarity vector by the sum of the moduli of the first and second similarity vectors, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  7. The method according to claim 1, after computing the similarity between the first and second text segments and obtaining a similarity value representing the similarity between the first text segment and the second text segment, further comprising:
    inputting the similarity value into an error model to obtain the difference between the similarity value and a true value representing the real similarity between the first text segment and the second text segment, so as to evaluate the accuracy of the similarity value.
  8. A text similarity acquisition apparatus, the apparatus comprising:
    a preprocessing module, configured to splice two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text;
    a vectorization processing module, configured to perform character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text;
    a first calculation module, configured to, for each character in the spliced text, use the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text;
    a second calculation module, configured to use the feature vectors of the characters in the first text segment and in the second text segment to compute the similarity between the first and second text segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  9. The apparatus according to claim 8, the vectorization processing module comprising:
    a segmentation sub-module, configured to perform character segmentation on the spliced text to obtain the characters contained in it;
    a processing sub-module, configured to vectorize each character in the spliced text to obtain a word vector representing the character's features.
  10. The apparatus according to claim 9, the processing sub-module comprising:
    a meaning vector unit, configured to vectorize each character in the spliced text with the Bert model to obtain the character's meaning vector;
    a position unit, configured to input each meaning vector into an LSTM network model to obtain a word vector that simultaneously expresses the character's meaning and the character's word-order position in the spliced text.
  11. The apparatus according to claim 8, the first calculation module comprising:
    a weight calculation sub-module, configured to, for each character in the spliced text, use the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text;
    a cross-multiplication sub-module, configured to cross-multiply the word vector of each character of the spliced text with the corresponding regular weight, obtaining several vectors, each representing the similar features between the character in question and one character of the spliced text;
    an addition sub-module, configured to add the obtained vectors to obtain the character's feature vector, the feature vector representing the similar features between the character and the spliced text.
  12. The apparatus according to claim 11, the weight calculation sub-module comprising:
    a first cross-multiplication unit, configured to, for each character in the spliced text, cross-multiply the character's word vector with the transposed word vectors of the characters of the spliced text to obtain several regular values, each regular value representing the similar features between the character and one character of the spliced text;
    a division unit, configured to divide all of a character's regular values by a set value to obtain several regular weights, each regular weight representing the similar features between the character and one character of the spliced text, the regular weights summing to 1.
  13. The apparatus according to claim 8, the second calculation module comprising:
    a similarity vector calculation sub-module, configured to take specific values from the feature vectors of the characters of the first text segment to form a first similarity vector, and to take specific values from the feature vectors of the characters of the second text segment to form a second similarity vector;
    a division sub-module, configured to divide the Euclidean distance between the first similarity vector and the second similarity vector by the sum of the moduli of the first and second similarity vectors, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  14. The apparatus according to claim 8, further comprising: an evaluation module, configured to, after the similarity value representing the similarity between the first text segment and the second text segment is obtained, input the similarity value into an error model to obtain the difference between the similarity value and a true value representing the real similarity between the first text segment and the second text segment, so as to evaluate the accuracy of the similarity value.
  15. An electronic device, comprising: a processing unit;
    a storage unit, on which computer-readable instructions are stored, the computer-readable instructions, when executed by the processing unit, performing the following processing:
    splicing two texts to be compared for similarity into a spliced text, the two texts respectively forming a first text segment and a second text segment of the spliced text;
    performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text;
    for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the spliced text;
    using the feature vectors of the characters in the first text segment and of the characters in the second text segment to compute the similarity between the first and second text segments, obtaining a similarity value representing the similarity between the first text segment and the second text segment.
  16. The electronic device according to claim 15, wherein performing character segmentation and vectorization on the spliced text to obtain a word vector for each character in the spliced text comprises:
    performing character segmentation on the spliced text to obtain the characters contained in it;
    for each character in the spliced text, vectorizing the character to obtain a word vector representing the character's features.
  17. The electronic device according to claim 16, wherein, for each character in the spliced text, vectorizing the character to obtain a word vector representing the character's features comprises:
    for each character in the spliced text, vectorizing the character with the Bert model to obtain the character's meaning vector;
    inputting each meaning vector into an LSTM network model to obtain a word vector that simultaneously expresses the character's meaning and the character's word-order position in the spliced text.
  18. The electronic device according to claim 15, wherein, for each character in the spliced text, using the character's word vector to compute the character's feature vector, the feature vector representing the similar features between the character and the characters of the spliced text, comprises:
    for each character in the spliced text, using the word vectors of the characters to compute several regular weights, each regular weight representing the similar features between the character and one character of the spliced text;
    cross-multiplying the word vector of each character of the spliced text with the corresponding regular weight, obtaining several vectors, each representing the similar features between the character in question and one character of the spliced text;
    adding the obtained vectors to obtain the character's feature vector, the feature vector representing the similar features between the character and the spliced text.
  19. The electronic device according to claim 18, wherein computing the regular weights comprises:
    for each character in the spliced text, cross-multiplying the character's word vector with the transposed word vectors of the characters of the spliced text to obtain several regular values, each regular value representing the similar features between the character and one character of the spliced text;
    dividing all of the character's regular values by a set value to obtain several regular weights, each regular weight representing the similar features between the character and one character of the spliced text, the regular weights summing to 1.
  20. A computer-readable storage medium storing computer program instructions which, when executed by a computer, cause the computer to execute the method of any one of claims 1 to 7.
PCT/CN2019/117670 2019-10-15 2019-11-12 Text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium WO2021072864A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910980271.0 2019-10-15
CN201910980271.0A CN110929499B (zh) 2019-10-15 2019-10-15 Text similarity acquisition method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2021072864A1 true WO2021072864A1 (zh) 2021-04-22

Family

ID=69848997

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117670 WO2021072864A1 (zh) 2019-10-15 2019-11-12 文本相似度获取方法、装置、电子设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN110929499B (zh)
WO (1) WO2021072864A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689923A (zh) * 2020-05-19 2021-11-23 北京平安联想智慧医疗信息技术有限公司 Medical data processing device, system, and method
CN114969257A (zh) * 2022-05-26 2022-08-30 平安普惠企业管理有限公司 Construction method, apparatus, and computer device for a standard speech recognition database

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200967A (zh) * 2011-03-30 2011-09-28 中国人民解放军军事医学科学院放射与辐射医学研究所 A DNA-sequence-based text processing method and system
US9690849B2 (en) * 2011-09-30 2017-06-27 Thomson Reuters Global Resources Unlimited Company Systems and methods for determining atypical language
CN109165291A (zh) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A text matching method and electronic device
CN109214407A (zh) * 2018-07-06 2019-01-15 阿里巴巴集团控股有限公司 Event detection model, method, apparatus, computing device, and storage medium
CN109871540A (zh) * 2019-02-21 2019-06-11 武汉斗鱼鱼乐网络科技有限公司 A text similarity calculation method and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874258B (zh) * 2017-02-16 2020-04-07 西南石油大学 Text similarity calculation method and system based on Chinese-character attribute vector representation
CN107729300B (zh) * 2017-09-18 2021-12-24 百度在线网络技术(北京)有限公司 Text similarity processing method, apparatus, device, and computer storage medium
CN109493977B (zh) * 2018-11-09 2020-07-31 天津新开心生活科技有限公司 Text data processing method, apparatus, electronic device, and computer-readable medium
CN109658938B (zh) * 2018-12-07 2020-03-17 百度在线网络技术(北京)有限公司 Method, apparatus, device, and computer-readable medium for matching speech and text


Also Published As

Publication number Publication date
CN110929499B (zh) 2022-02-11
CN110929499A (zh) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2020182122A1 Method and apparatus for generating a text matching model
WO2021072863A1 Text similarity calculation method, apparatus, electronic device, and computer-readable storage medium
CN111199474B Risk prediction method, apparatus, and electronic device based on two-party network graph data
WO2022007438A1 Emotional speech data conversion method, apparatus, computer device, and storage medium
CN114676704B Sentence sentiment analysis method, apparatus, device, and storage medium
WO2020232898A1 Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium
EP4113357A1 Method and apparatus for recognizing entity, electronic device and storage medium
WO2021208727A1 Artificial-intelligence-based text error detection method, apparatus, and computer device
CN114861889B Training method for a deep learning model, and target object detection method and apparatus
CN111488742B Method and apparatus for translation
WO2022174496A1 Generative-model-based data annotation method, apparatus, device, and storage medium
CN111538837A Method and apparatus for analyzing enterprise business scope information
WO2023280106A1 Information acquisition method, apparatus, device, and medium
WO2021072864A1 Text similarity acquisition method, apparatus, electronic device, and computer-readable storage medium
JP2023036681A Task processing method, processing apparatus, electronic device, storage medium, and computer program
CN112671985A Deep-learning-based agent quality inspection method, apparatus, device, and storage medium
KR102608867B1 Method for incrementing industry text, related apparatus, and computer program stored on a medium
CN111797204A Text matching method, apparatus, computer device, and storage medium
WO2021196935A1 Data verification method, apparatus, electronic device, and storage medium
WO2021184547A1 Method, apparatus, medium, and electronic device for generating intent corpora for a dialogue robot
WO2023236588A1 User classification method and apparatus based on customer-group deviation smoothing optimization
WO2020252925A1 Method, apparatus, electronic device, and computer non-volatile readable storage medium for optimizing user features in a user feature group
CN116703659A Data processing method, apparatus, and electronic device applied to engineering consulting
CN113360672B Method, apparatus, device, medium, and product for generating a knowledge graph
CN115759292A Model training method and apparatus, semantic recognition method and apparatus, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19949047

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19949047

Country of ref document: EP

Kind code of ref document: A1