WO2023029356A1 - Sentence vector generation method and apparatus based on sentence vector model, and computer device - Google Patents

Sentence vector generation method and apparatus based on sentence vector model, and computer device

Info

Publication number
WO2023029356A1
Authority
WO
WIPO (PCT)
Prior art keywords
initial
text
sentence vector
model
similar
Prior art date
Application number
PCT/CN2022/071882
Other languages
English (en)
French (fr)
Inventor
陈浩
谯轶轩
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023029356A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a sentence vector generation method, device, computer equipment and storage medium based on a sentence vector model.
  • Sentence embedding is one of the hot research fields in natural language processing in recent years. Sentence vectors can be obtained by mapping the words, tokens, and semantic information between words in a sentence to a quantifiable space. The generated sentence vectors are usually provided to downstream tasks for further processing, such as similarity calculation, classification and clustering based on sentence vectors.
  • the purpose of the embodiment of the present application is to propose a sentence vector generation method, device, computer equipment and storage medium based on the sentence vector model, so as to quickly generate accurate and usable sentence vectors.
  • the embodiment of the present application provides a sentence vector generation method based on the sentence vector model, which adopts the following technical solution:
  • the embodiment of the present application also provides a sentence vector generation device based on the sentence vector model, which adopts the following technical solution:
  • a text set acquisition module is used to obtain an initial text set
  • An information acquisition module configured to acquire TF-IDF information of the initial text for each initial text in the initial text set
  • a text adjustment module configured to determine a target adjustment word in the initial text according to the TF-IDF information, to adjust the initial text based on the target adjustment word, and generate similar text to the initial text;
  • a text input module configured to input the initial text into the initial sentence vector model to obtain an initial sentence vector, and input the similar text into the initial sentence vector model to obtain a similar sentence vector;
  • a vector setting module configured to set the similar sentence vector as a positive sample of the current initial sentence vector, and to set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector;
  • a contrastive learning module configured to instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model;
  • a to-be-processed input module configured to input the acquired text to be processed into the sentence vector model to obtain the sentence vector of the text to be processed.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor implements the following steps when executing the computer-readable instructions:
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions are executed by a processor to implement the following steps:
  • the embodiments of the present application mainly have the following beneficial effects: the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts are generated while the text semantics remain close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, yielding a sentence vector model capable of generating sentence vectors.
  • this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of the sentence vector generation method based on the sentence vector model of the present application;
  • Fig. 3 is a schematic structural diagram of an embodiment of a sentence vector generation device based on a sentence vector model according to the present application
  • Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • Terminal devices 101, 102, 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101 , 102 , 103 .
  • the sentence vector generation method based on the sentence vector model provided in the embodiment of the present application is generally executed by the server, and correspondingly, the sentence vector generation device based on the sentence vector model is generally set in the server.
  • the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 2 shows a flowchart of an embodiment of a sentence vector generation method based on a sentence vector model according to the present application.
  • the described sentence vector generation method based on the sentence vector model comprises the following steps:
  • Step S201 acquiring an initial text set.
  • the electronic device on which the sentence vector generation method based on the sentence vector model runs can communicate with the terminal through a wired connection or a wireless connection.
  • the above wireless connection methods may include but are not limited to 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods now known or developed in the future.
  • the server first needs to obtain an initial text set, and there are several initial texts in the initial text set.
  • the initial text set can be determined by the usage scenario of the sentence vector model. For example, in the book recommendation scenario, sentence vectors need to be generated through the sentence vector model, and similar books are recommended based on the sentence vectors.
  • the initial text set can be the introduction of several books.
  • the above-mentioned initial text set can also be stored in a block chain node.
  • Blockchain, essentially a decentralized database, is a chain of data blocks generated in association with each other using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S202 for each initial text in the initial text set, obtain TF-IDF information of the initial text.
  • TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining.
  • TF is Term Frequency, which refers to the frequency with which a given word appears in a file.
  • IDF is the inverse document frequency (Inverse Document Frequency), which can be obtained by dividing the total number of documents by the number of documents containing a given word, and then taking the logarithm.
  • step S202 may include: for each initial text in the initial text set, perform word segmentation processing on the initial text to obtain several split words; calculate the TF-IDF value of each split word based on the initial text set to obtain the initial text TF-IDF information.
  • word segmentation is performed on the initial text (the segmentation can be implemented by calling a word segmentation tool) to obtain several split words, and the TF-IDF value of each split word is then calculated. When calculating the TF-IDF value, the IDF (inverse document frequency) of a split word is the total number of texts divided by the number of initial texts containing that split word, with the logarithm then taken; the initial text set can be used as the whole corpus on which the inverse document frequency of the split words is computed. The TF-IDF values of the split words constitute the TF-IDF information of the initial text.
  • the initial text is segmented first to obtain split words, and the TF-IDF value of each split word can be quickly calculated based on the initial text set.
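  • As an illustrative sketch (not part of the patent text; the tokenizer is a pluggable assumption, e.g. jieba.lcut for Chinese), the per-text TF-IDF computation described above could look like this:

```python
import math
from collections import Counter

def tfidf_info(texts, tokenize):
    # texts: the initial text set (list of strings).
    # tokenize: callable mapping a string to a list of split words.
    tokenized = [tokenize(t) for t in texts]
    n = len(texts)
    # Document frequency: number of initial texts containing each split word.
    df = Counter(w for words in tokenized for w in set(words))
    result = []
    for words in tokenized:
        tf = Counter(words)
        total = len(words)
        # TF-IDF of each split word, with the initial text set as the corpus.
        result.append({w: (c / total) * math.log(n / df[w])
                       for w, c in tf.items()})
    return result
```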
  • Step S203 determine the target adjustment word in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment word, and generate similar text to the initial text.
  • specifically, the split words are sorted in descending order of TF-IDF value; the number of target adjustment words is calculated from the number of split words and a preset adjustment ratio, and that many top-ranked split words are then selected as the target adjustment words.
  • TF-IDF is a statistical method for evaluating how important a split word is to the initial text. By selecting the split words with larger TF-IDF values and replacing them, a similar text is obtained while the semantics of the initial text change only slightly.
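  • A minimal sketch of this selection step (the adjustment ratio of 0.2 is an assumed placeholder, not a value from the patent):

```python
def pick_target_words(tfidf, ratio=0.2):
    # tfidf: {split word: TF-IDF value} for one initial text.
    # Sort split words in descending order of TF-IDF value and keep the
    # top share given by the preset adjustment ratio.
    ranked = sorted(tfidf, key=tfidf.get, reverse=True)
    n_targets = max(1, round(len(ranked) * ratio))
    return ranked[:n_targets]
```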
  • Step S204 input the initial text into the initial sentence vector model to obtain the initial sentence vector, and input the similar text into the initial sentence vector model to obtain the similar sentence vector.
  • the initial sentence vector model may be a sentence vector model that has not yet been trained; the sentence vector model takes text as input and outputs the sentence vector (sentence embedding) corresponding to the text.
  • the sentence vector model in this application can process the text as a whole sequence, and can learn the semantic information contained in the text.
  • the server inputs the initial text into the initial sentence vector model, which converts the initial text and outputs the corresponding initial sentence vector; the similar text of the initial text is likewise input into the initial sentence vector model to obtain its similar sentence vector.
  • it can be understood that a corresponding similar text is generated for each initial text in the initial text set, and each initial text and its corresponding similar text are input into the initial sentence vector model to obtain the corresponding initial sentence vector and similar sentence vector.
  • Step S205 set the similar sentence vector as the positive sample of the current initial sentence vector, and set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector.
  • specifically, this application requires positive samples and negative samples during model training. Suppose the initial text set contains initial texts S1, S2 and S3, whose corresponding similar texts are S1', S2' and S3'; the initial sentence vectors of S1, S2 and S3 are E1, E2 and E3, and the similar sentence vectors of S1', S2' and S3' are E1', E2' and E3'.
  • when the current initial text is S1, the current initial sentence vector is E1 and the similar text of S1 is S1'; the similar sentence vector E1' of the similar text S1' corresponding to the current initial text S1 is set as the positive sample, and the other initial sentence vectors E2 and E3, together with the other similar sentence vectors E2' and E3', are set as negative samples.
  • Step S206 instructing the initial sentence vector model to perform comparative learning based on the current initial sentence vector, positive samples and negative samples, to obtain a sentence vector model.
  • the server trains the initial sentence vector model using unsupervised contrastive learning (Contrastive Learning).
  • in contrastive learning, the model does not have to attend to every detail of a sample; it is enough that the learned features distinguish the current sample from the other samples. Therefore, during model training the server adjusts the initial sentence vector model so that the initial sentence vector output by the model moves ever closer to the positive sample, while the difference between the output initial sentence vector and the negative samples is made as large as possible. When model training ends, the sentence vector model is obtained.
  • Step S207 input the acquired text to be processed into the sentence vector model to obtain the sentence vector of the text to be processed.
  • the server obtains the text to be processed, inputs the text to be processed into the trained sentence vector model, and the sentence vector model converts the text to be processed into a sentence vector.
  • the generated sentence vectors can be provided to downstream tasks for further processing. For example, in a book recommendation scenario where books similar to a target book must be recommended, the introduction of each book can be obtained and the sentence vectors of all book introductions generated through the sentence vector model; the cosine similarity between the sentence vector of the target book's introduction and the sentence vectors of the other books' introductions is then calculated, that cosine similarity is taken as the similarity between books, and similar books are recommended to the user.
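  • A small sketch of this downstream use (the dictionary layout and top_k are illustrative assumptions):

```python
import numpy as np

def recommend_similar(target_vec, book_vecs, top_k=3):
    # book_vecs: {book title: sentence vector of its introduction}.
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    # Cosine similarity between sentence vectors serves as book similarity.
    scores = {title: cosine(target_vec, v) for title, v in book_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```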
  • in this embodiment, the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, thereby obtaining a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
  • the above step S203 may include: based on the TF-IDF information, determining the split words whose TF-IDF value in the initial text is greater than a preset first threshold as target adjustment words; obtaining several similar words of each target adjustment word through a Cilin; obtaining the pre-ranked words and post-ranked words of the target adjustment word from the initial text; for each similar word, combining the pre-ranked words, the similar word and the post-ranked words in order to obtain a candidate word sequence; obtaining the sequence frequency of each candidate word sequence from a preset corpus; and replacing the target adjustment word in the initial text with the similar word in the candidate word sequence having the highest sequence frequency, to obtain a similar text of the initial text.
  • Cilin is a kind of dictionary; based on the Cilin, the similar words of a word can be queried and the similarity between words obtained.
  • the pre-ranked word may be a word ranked before the target adjusted word in the initial text; the post-ranked word may be a word ranked behind the target adjusted word in the initial text.
  • the TF-IDF information records the TF-IDF value of each split word; the server can obtain a preset first threshold, which can be a relatively large value, and the split words whose TF-IDF value is greater than the first threshold are determined as target adjustment words.
  • the server obtains the similar words of the target adjustment word and the similarity between the target adjustment word and the similar words through Cilin, and can select several similar words with the highest similarity.
  • the Cilin may be the synonym Cilin Net, a Cilin released by Harbin Institute of Technology.
  • the pre-ranked words can consist of a preset number of split words preceding the target adjustment word, and the post-ranked words of a preset number of split words following it. For each similar word, a candidate word sequence is combined in the order "pre-ranked words, similar word, post-ranked words".
  • the server obtains the sequence frequency of the candidate word sequence through a preset corpus.
  • the corpus contains multiple texts, and the sequence frequency is obtained by counting the number of occurrences of the candidate word sequence across all texts of the corpus and dividing it by the total number of word sequences of the same length (containing the same number of words as the candidate word sequence) in the corpus.
  • the corpus may or may not be the same as the initial text set.
  • the higher the sequence frequency, the more often the candidate word sequence occurs, the more widely it is used, and the better its semantics; the server selects the candidate word sequence with the highest sequence frequency and lets the similar word in it replace the target adjustment word in the initial text, obtaining the similar text.
  • in this embodiment, the important split words are determined as target adjustment words according to the first threshold; similar words of the target adjustment words are obtained and candidate word sequences are generated for them; the sequence frequency of each candidate word sequence is obtained from the corpus; and the similar word in the candidate word sequence with the highest sequence frequency replaces the target adjustment word, so that a similar text semantically closest to the initial text is generated automatically.
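  • The replacement choice can be sketched as follows (a hedged illustration: the Cilin lookup and the corpus-frequency function are assumed interfaces, not APIs specified by the patent):

```python
def best_replacement(pre, target, post, similar_words, seq_freq):
    # pre/post: preset numbers of split words before/after the target
    # adjustment word; similar_words: candidates queried from the Cilin;
    # seq_freq: callable giving the corpus frequency of a word sequence.
    candidates = [tuple(pre) + (w,) + tuple(post) for w in similar_words]
    best = max(candidates, key=seq_freq)
    return best[len(pre)]  # the similar word inside the winning sequence
```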
  • after the above step of replacing the target adjustment word in the initial text with the similar word in the candidate word sequence having the highest sequence frequency to obtain the similar text of the initial text, the method may further include: based on the TF-IDF information, screening the split words in the similar text whose TF-IDF value is less than a preset second threshold; and deleting the screened split words from the similar text.
  • specifically, the server can also obtain a preset second threshold, which can be a relatively small value, and screen the split words whose TF-IDF value is smaller than the second threshold; these split words are of low importance and can be removed from the similar text to adjust the text further.
  • in this embodiment, the split words whose TF-IDF value is smaller than the second threshold are deleted, further adjusting the text and enlarging the difference between the similar text and the initial text while keeping the semantics similar.
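  • A one-line sketch of this pruning step (assuming the TF-IDF dict and threshold from above):

```python
def prune_similar_text(words, tfidf, second_threshold):
    # Drop split words whose TF-IDF value falls below the second threshold.
    return [w for w in words if tfidf.get(w, 0.0) >= second_threshold]
```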
  • the above step of inputting the initial text into the initial sentence vector model to obtain the initial sentence vector may include: inputting the initial text into the vector generation model in the initial sentence vector model to obtain an original vector; inputting the original vector into the standard processing network in the initial sentence vector model to obtain a standardized original vector; and inputting the standardized original vector into the fully connected network in the initial sentence vector model for dimensionality reduction to obtain the initial sentence vector.
  • the initial sentence vector model can be composed of a vector generation model, a standard processing network and a fully connected network.
  • the initial text is first input into the vector generation model, and the vector generation model converts the initial text into the original vector.
  • the information in the original vector is unevenly distributed, rather messy and high-dimensional, which is inconvenient for vector storage and subsequent computation. Therefore, the original vector can first be input into the standard processing network to be standardized; standardization maps the original vector into a smoother vector space, making the distribution of information in the vector more uniform. In one embodiment, the original vector can be input into a Layer Normalization layer for standardization.
  • the standardized original vector is then input into the fully connected network, where the vector is multiplied by a matrix, which reduces its dimensionality and yields the initial sentence vector.
  • in this embodiment, after the original vector output by the vector generation model is obtained, it is standardized and dimensionality-reduced, optimizing the original vector and yielding the initial sentence vector.
  • the above step of inputting the initial text into the vector generation model in the initial sentence vector model to obtain the original vector may include: inputting the initial text into the vector generation model to obtain model output information, the vector generation model being a Bert model; and extracting the original vector from the model output information according to a preset identifier.
  • the vector generation model may be a Bert model.
  • the Bert model is a pre-trained language model that inputs the text as a whole sequence, and can learn the semantic information contained in the text based on the word vector.
  • other models based on Bert such as Bert-wwt model, Bert-wwt-ext model, etc. can also be used.
  • the model output information may include various types of information, and various types of information may have identifiers to describe the information.
  • the server extracts the required information from the model output information according to the identifier to obtain the original vector. In one embodiment, the server extracts the vector corresponding to [CLS] in the last layer of the Bert model as the original vector.
  • the initial sentence vector model or the sentence vector model processes the input text in the same way.
  • the similar texts are also processed in the same way as above.
  • the Bert model is used as the vector generation model, which ensures that the obtained original vector can contain semantic information in the text, and improves the accuracy of the original vector.
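  • Putting the pieces together, a minimal PyTorch sketch of the pipeline described above (Bert encoder, Layer Normalization, fully connected dimensionality reduction, last-layer [CLS] extraction) might look as follows; the checkpoint name and output dimension are assumptions, not values given in the patent:

```python
import torch
from torch import nn
from transformers import BertModel

class SentenceVectorModel(nn.Module):
    def __init__(self, out_dim=128, name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)   # vector generation model
        hidden = self.bert.config.hidden_size
        self.norm = nn.LayerNorm(hidden)              # standard processing network
        self.fc = nn.Linear(hidden, out_dim)          # dimensionality reduction

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # last-layer [CLS] vector
        return self.fc(self.norm(cls))
```

The same forward pass serves both the initial texts and the similar texts, mirroring the statement above that the model processes all input text in the same way.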
  • step S206 may include: respectively calculating the similarity between the initial sentence vector and the positive sample and each negative sample; calculating the model loss according to the obtained similarities; and adjusting the model parameters of the initial sentence vector model according to the model loss until the model loss converges, obtaining the sentence vector model.
  • in contrastive learning, the loss function of the model is as follows:
  • sim(u, v) = (u · v) / (||u|| ||v||)   (1)
  • L_i = log( exp(sim(E_i, E_i')) / ( Σ_{k=1, k≠i}^{B} exp(sim(E_i, E_k)) + Σ_{k=1}^{B} exp(sim(E_i, E_k')) ) )   (2)
  • L = Σ_{i=1}^{B} L_i   (3)
  • where sim is the cosine similarity function, B is the number of vectors in a batch, E_i is the current initial sentence vector, E_i' is the positive sample of E_i, E_k and E_k' are respectively the initial sentence vectors in a batch and the similar sentence vectors associated with them, L_i is the model loss calculated for one initial sentence vector, and L is the total model loss in a batch.
  • the server needs to calculate the similarity between the current initial sentence vector and the positive sample and each negative sample; specifically, the similarity can be calculated through the cosine similarity function and then substituted into formula (2) to calculate the model loss.
  • after the model loss is obtained, the model parameters of the initial sentence vector model are adjusted according to it, specifically with the goal of maximizing the model loss; after each parameter adjustment the initial texts are re-input into the model for iterative training, until the model loss no longer changes and convergence is reached, thereby obtaining the final sentence vector model.
  • the model loss is calculated based on the similarity between the vectors, and the model parameters are adjusted according to the model loss to ensure the accuracy of the generated sentence vector model.
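  • As a hedged reconstruction of one batch computation (the patent's formula images are not reproduced in this text, so the log-ratio form of formulas (1)-(3) above is assumed), the batch objective could be computed as:

```python
import torch
import torch.nn.functional as F

def batch_objective(init_vecs, sim_vecs):
    # init_vecs: (B, d) initial sentence vectors E_1..E_B.
    # sim_vecs:  (B, d) similar sentence vectors E_1'..E_B'; row i is the
    # positive sample of row i, every other vector is a negative sample.
    B = init_vecs.size(0)
    others = torch.cat([init_vecs, sim_vecs], dim=0)            # (2B, d)
    sims = F.cosine_similarity(init_vecs.unsqueeze(1),
                               others.unsqueeze(0), dim=-1)     # (B, 2B)
    idx = torch.arange(B)
    pos = sims[idx, idx + B]                                    # sim(E_i, E_i')
    # Exclude sim(E_i, E_i) itself from the denominator of formula (2).
    self_mask = torch.cat([torch.eye(B, dtype=torch.bool),
                           torch.zeros(B, B, dtype=torch.bool)], dim=1)
    denom = sims.exp().masked_fill(self_mask, 0.0).sum(dim=1)
    L_i = pos - denom.log()                                     # formula (2)
    return L_i.sum()                                            # formula (3)
```

Maximizing this objective (for example by minimizing its negation with a standard optimizer) pulls each initial sentence vector toward its positive sample and away from the negative samples, as described above.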
  • the sentence vector generation method based on the sentence vector model in this application can be applied in the field of artificial intelligence, such as natural language processing in the field of artificial intelligence.
  • artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the aforementioned storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
  • the present application provides an embodiment of a sentence vector generation device based on a sentence vector model.
  • this device embodiment corresponds to the method embodiment shown in Fig. 2.
  • the device can be specifically applied to various electronic devices.
  • the sentence vector generation device 300 based on the sentence vector model described in this embodiment includes: a text set acquisition module 301, an information acquisition module 302, a text adjustment module 303, a text input module 304, a vector setting module 305, a contrastive learning module 306 and a to-be-processed input module 307, wherein:
  • Text set acquisition module 301 configured to acquire an initial text set.
  • the information acquiring module 302 is configured to acquire TF-IDF information of each initial text in the initial text set.
  • the text adjustment module 303 is configured to determine target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text.
  • the text input module 304 is configured to input the initial text into the initial sentence vector model to obtain the initial sentence vector, and input the similar text into the initial sentence vector model to obtain the similar sentence vector.
  • the vector setting module 305 is used to set the similar sentence vector as a positive sample of the current initial sentence vector, and to set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector.
  • the contrastive learning module 306 is used to instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model.
  • the to-be-processed input module 307 is configured to input the acquired text to be processed into the sentence vector model to obtain the sentence vector of the text to be processed.
  • in this embodiment, the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, thereby obtaining a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
  • the information acquisition module 302 may include: a word segmentation processing submodule and a calculation submodule, wherein:
  • the word segmentation processing sub-module is used for performing word segmentation processing on each initial text in the initial text set to obtain several split words.
  • the calculation sub-module is used to calculate the TF-IDF value of each split word based on the initial text set, and obtain the TF-IDF information of the initial text.
  • the initial text is segmented first to obtain split words, and the TF-IDF value of each split word can be quickly calculated based on the initial text set.
  • the text adjustment module 303 may include: a target determination submodule, a similarity acquisition submodule, an acquisition submodule, a combination submodule, a frequency acquisition submodule, and a target replacement submodule, wherein:
  • the target determination sub-module is configured to determine, based on the TF-IDF information, split words whose TF-IDF value is greater than a preset first threshold in the initial text as target adjustment words.
  • the similarity obtaining sub-module is used to obtain several similar words of the target adjustment word through Cilin.
  • the acquisition submodule is used to obtain the pre-order words and post-order words of the target adjustment word from the initial text.
  • the combination sub-module is used to sequentially combine the pre-ranked words, similar words and post-ranked words for each similar word to obtain a candidate word sequence.
  • the frequency obtaining sub-module is used to obtain the sequence frequency of the candidate word sequence according to the preset corpus.
  • the target replacement submodule is used to replace the target adjusted word in the initial text with the similar word in the candidate word sequence with the highest sequence frequency, so as to obtain the similar text of the initial text.
  • in this embodiment, the important split words are determined as target adjustment words according to the first threshold; similar words of the target adjustment words are obtained and candidate word sequences are generated for them; the sequence frequency of each candidate word sequence is obtained from the corpus; and the similar word in the candidate word sequence with the highest sequence frequency replaces the target adjustment word, so that a similar text semantically closest to the initial text is generated automatically.
  • the text adjustment module 303 may also include: a determination submodule and a deletion submodule, wherein:
  • the determination submodule is used to screen the split words whose TF-IDF value is smaller than the preset second threshold in similar texts based on the TF-IDF information.
  • the delete submodule is used to delete the filtered split words from similar texts.
  • the split words whose TF-IDF value is smaller than the second threshold are deleted, the text is further adjusted, and the difference between the similar text and the original text is enlarged while maintaining similar semantics.
  • the text input module 304 may include: a text input submodule, an original input submodule, and a dimensionality reduction processing submodule, wherein:
  • the text input submodule is used to input the initial text into the vector generation model in the initial sentence vector model to obtain the original vector.
  • the original input sub-module is used to input the original vector into the standard processing network in the initial sentence vector model to obtain a standardized original vector.
  • the dimensionality reduction processing sub-module is used to input the standardized original vector into the fully connected network in the initial sentence vector model for dimensionality reduction processing to obtain the initial sentence vector.
  • the original vectors output by the vector generation model are obtained, the original vectors are subjected to standardization processing and dimensionality reduction processing to realize optimization of the original vectors and obtain initial sentence vectors.
  • the text input submodule may include: a text input unit and a vector extraction unit, wherein:
  • the text input unit is used to input the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a Bert model.
  • the vector extraction unit is used to extract the original vector from the model output information according to the preset identifier.
  • the Bert model is used as the vector generation model, which ensures that the obtained original vector can contain semantic information in the text, and improves the accuracy of the original vector.
  • the comparative learning module 306 may include: a similarity calculation submodule, a loss calculation submodule, and a model adjustment submodule, wherein:
  • the similarity calculation sub-module is used to calculate the similarity between the initial sentence vector and the positive sample and the negative sample respectively.
  • the loss calculation sub-module is used to calculate the model loss according to the obtained similarity.
  • the model adjustment sub-module is used to adjust the model parameters of the initial sentence vector model according to the model loss until the model loss converges to obtain the sentence vector model.
  • the model loss is calculated based on the similarity between the vectors, and the model parameters are adjusted according to the model loss to ensure the accuracy of the generated sentence vector model.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, a processor 42 and a network interface 43 connected to each other through a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • the computer equipment may be computing equipment such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can perform human-computer interaction with the user through keyboard, mouse, remote controller, touch panel or voice control device.
  • the memory 41 includes at least one type of computer-readable storage medium, which can be non-volatile or volatile, and includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, etc.
  • the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or memory of the computer device 4 .
  • the memory 41 can also be an external storage device of the computer device 4, such as a plug-in hard disk equipped on the computer device 4, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof.
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4, such as computer-readable instructions of the sentence vector generation method based on the sentence vector model.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer-readable instructions stored in the memory 41 or to process data, for example to execute the computer-readable instructions of the sentence vector generation method based on the sentence vector model.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the computer device provided in this embodiment can execute the above sentence vector generation method based on the sentence vector model.
  • the sentence vector generation method based on the sentence vector model may be the sentence vector generation method based on the sentence vector model in the above embodiments.
  • in this embodiment, the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, thereby obtaining a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
  • the present application also provides another implementation, namely a computer-readable storage medium storing computer-readable instructions which can be executed by at least one processor, so that the at least one processor executes the steps of the sentence vector generation method based on the sentence vector model as described above.
  • in this embodiment, the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, thereby obtaining a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
  • the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A sentence vector generation method and apparatus based on a sentence vector model, a computer device and a storage medium. The method comprises: acquiring TF-IDF information of each initial text in an initial text set to determine target adjustment words, and adjusting the initial texts on the basis of the target adjustment words to generate similar texts; inputting the initial texts into an initial sentence vector model to obtain initial sentence vectors, and inputting the similar texts into the initial sentence vector model to obtain similar sentence vectors; taking a similar sentence vector as a positive sample of the current initial sentence vector, and taking the other initial sentence vectors and similar sentence vectors as negative samples of the current initial sentence vector, so as to perform contrastive learning on the initial sentence vector model and obtain a sentence vector model; and inputting text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed. The initial text set can be stored in a blockchain.

Description

Sentence vector generation method and apparatus based on sentence vector model, and computer device
This application claims priority to the Chinese patent application filed with the China Patent Office on August 31, 2021, with application number 202111013003.5 and the invention title "Sentence vector generation method and apparatus based on sentence vector model, and computer device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a sentence vector generation method and apparatus, computer device and storage medium based on a sentence vector model.
Background Art
Sentence embedding is one of the hot research fields of natural language processing in recent years. A sentence vector is obtained by mapping the characters and words (tokens) of a sentence, together with the semantic information between the words, into a quantifiable space. The generated sentence vectors are usually provided to downstream tasks for further processing, for example similarity calculation, classification and clustering based on the sentence vectors.
The inventors realized that among existing sentence vector generation techniques, some simply segment the words of a sentence in isolation, convert the words into word vectors and then average them to obtain the sentence vector, which loses the semantic information in the sentence and affects the accuracy of the sentence vector; others generate sentence vectors through supervised learning, but in practice it is difficult to obtain a large amount of annotated text corpora for supervised learning.
Summary of the Invention
The purpose of the embodiments of this application is to propose a sentence vector generation method and apparatus, computer device and storage medium based on a sentence vector model, so as to quickly generate accurate and usable sentence vectors.
In order to solve the above technical problem, an embodiment of this application provides a sentence vector generation method based on a sentence vector model, adopting the following technical solution:
acquiring an initial text set;
for each initial text in the initial text set, acquiring TF-IDF information of the initial text;
determining target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text;
inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector;
instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model;
inputting acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
In order to solve the above technical problem, an embodiment of this application further provides a sentence vector generation apparatus based on a sentence vector model, adopting the following technical solution:
a text set acquisition module, used to acquire an initial text set;
an information acquisition module, used to acquire, for each initial text in the initial text set, TF-IDF information of the initial text;
a text adjustment module, used to determine target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text;
a text input module, used to input the initial text into an initial sentence vector model to obtain an initial sentence vector, and to input the similar text into the initial sentence vector model to obtain a similar sentence vector;
a vector setting module, used to set the similar sentence vector as a positive sample of the current initial sentence vector, and to set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector;
a contrastive learning module, used to instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model;
a to-be-processed input module, used to input acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
In order to solve the above technical problem, an embodiment of this application further provides a computer device, including a memory and a processor, the memory storing computer-readable instructions, the processor implementing the following steps when executing the computer-readable instructions:
acquiring an initial text set;
for each initial text in the initial text set, acquiring TF-IDF information of the initial text;
determining target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text;
inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector;
instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model;
inputting acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
In order to solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
acquiring an initial text set;
for each initial text in the initial text set, acquiring TF-IDF information of the initial text;
determining target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text;
inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector;
instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model;
inputting acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects: the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, yielding a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
Brief Description of the Drawings
In order to explain the solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is an exemplary system architecture diagram to which this application can be applied;
Fig. 2 is a flowchart of an embodiment of the sentence vector generation method based on a sentence vector model according to this application;
Fig. 3 is a schematic structural diagram of an embodiment of the sentence vector generation apparatus based on a sentence vector model according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application; the terms used in the specification are only for describing specific embodiments and are not intended to limit this application; the terms "comprising" and "having" and any variations thereof in the specification, claims and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification, claims or the above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a specific feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102, 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers and so on.
The server 105 may be a server providing various services, for example a background server providing support for the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the sentence vector generation method based on the sentence vector model provided in the embodiments of this application is generally executed by the server; correspondingly, the sentence vector generation apparatus based on the sentence vector model is generally arranged in the server.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to Fig. 2, a flowchart of an embodiment of the sentence vector generation method based on a sentence vector model according to this application is shown. The sentence vector generation method based on the sentence vector model includes the following steps:
Step S201: acquire an initial text set.
In this embodiment, the electronic device on which the sentence vector generation method based on the sentence vector model runs (for example the server shown in Fig. 1) can communicate with the terminals through a wired or wireless connection. It should be pointed out that the above wireless connection methods may include but are not limited to 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods now known or developed in the future.
Specifically, the server first needs to acquire an initial text set, which contains several initial texts. The initial text set can be determined by the usage scenario of the sentence vector model; for example, in a book recommendation scenario where sentence vectors are generated through the sentence vector model and similar books are recommended based on the sentence vectors, the initial text set can be the introductions of several books.
It should be emphasized that, to further ensure the privacy and security of the above initial text set, the initial text set can also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with each other using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include a blockchain underlying platform, a platform product service layer, an application service layer and so on.
Step S202: for each initial text in the initial text set, acquire TF-IDF information of the initial text.
Specifically, for each initial text in the initial text set, computation is performed on the initial text to acquire its TF-IDF information. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF is the term frequency, the frequency with which a given word appears in a document. IDF is the inverse document frequency, obtained by dividing the total number of documents by the number of documents containing a given word and then taking the logarithm.
Further, the above step S202 may include: for each initial text in the initial text set, performing word segmentation on the initial text to obtain several split words; and calculating the TF-IDF value of each split word based on the initial text set to obtain the TF-IDF information of the initial text.
Specifically, for each initial text in the initial text set, word segmentation is performed on the initial text (the segmentation can be implemented by calling a word segmentation tool) to obtain several split words, and the TF-IDF value of each split word is then calculated. When calculating the TF-IDF value, the IDF (inverse document frequency) of a split word is the total number of texts divided by the number of initial texts containing the split word, with the logarithm then taken; the initial text set can be used as the whole on which the inverse document frequency of the split words is computed. The TF-IDF values of the split words constitute the TF-IDF information of the initial text.
In this embodiment, the initial text is first segmented into split words, and with the initial text set as the basis, the TF-IDF value of each split word can be quickly calculated.
Step S203: determine target adjustment words in the initial text according to the TF-IDF information, so as to adjust the initial text based on the target adjustment words and generate a similar text of the initial text.
Specifically, the split words are sorted in descending order of TF-IDF value; the number of target adjustment words is calculated from the number of split words and a preset adjustment ratio, and that many top-ranked split words are then selected as the target adjustment words.
When adjusting the initial text, synonyms or near-synonyms of a target adjustment word can be obtained and substituted for the target adjustment word in the initial text to obtain a similar text of the initial text. TF-IDF is a statistical method for evaluating how important a split word is to the initial text; by selecting the split words with larger TF-IDF values and replacing them, a similar text is obtained while the semantics of the initial text change only slightly.
Step S204: input the initial text into the initial sentence vector model to obtain an initial sentence vector, and input the similar text into the initial sentence vector model to obtain a similar sentence vector.
The initial sentence vector model may be a sentence vector model that has not yet been trained; the sentence vector model takes text as input and outputs the sentence vector (sentence embedding) corresponding to the text. The sentence vector model in this application can process the text as one whole sequence and can learn the semantic information contained in the text.
Specifically, the server inputs the initial text into the initial sentence vector model, which converts the initial text and outputs the corresponding initial sentence vector; the similar text of the initial text is input into the initial sentence vector model to obtain the similar sentence vector of that similar text.
It can be understood that a corresponding similar text is generated for each initial text in the initial text set, and each initial text and its corresponding similar text are input into the initial sentence vector model to obtain the corresponding initial sentence vector and similar sentence vector.
Step S205: set the similar sentence vector as the positive sample of the current initial sentence vector, and set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as negative samples of the current initial sentence vector.
Specifically, this application requires positive samples and negative samples during model training. Suppose the initial text set contains initial texts S1, S2 and S3, whose corresponding similar texts are S1', S2' and S3'; the initial sentence vectors of S1, S2 and S3 are E1, E2 and E3, and the similar sentence vectors of S1', S2' and S3' are E1', E2' and E3'. When the current initial text is S1, the current initial sentence vector is E1 and the similar text of S1 is S1'; the similar sentence vector E1' of the similar text S1' corresponding to the current initial text S1 is set as the positive sample, and the other initial sentence vectors E2 and E3, together with the other similar sentence vectors E2' and E3', are set as negative samples.
Step S206: instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model.
Specifically, the server trains the initial sentence vector model using unsupervised contrastive learning (Contrastive Learning). In contrastive learning, the model does not necessarily have to attend to every detail of a sample; it is enough that the learned features distinguish the current sample from the other samples. Therefore, during model training the server adjusts the initial sentence vector model so that the initial sentence vector output by the model moves ever closer to the positive sample, while the difference between the output initial sentence vector and the negative samples is enlarged as much as possible. When model training ends, the sentence vector model is obtained.
Step S207: input acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
Specifically, when the model is applied, the server acquires the text to be processed and inputs it into the trained sentence vector model, which converts the text to be processed into a sentence vector.
The generated sentence vector can be provided to downstream tasks for further processing. For example, in a book recommendation scenario where books similar to a target book must be recommended, the introduction of each book can be acquired and the sentence vectors of all book introductions generated through the sentence vector model; the cosine similarity between the sentence vector of the target book's introduction and the sentence vectors of the other books' introductions is calculated, that cosine similarity is taken as the similarity between books, and books similar to the target book are recommended to the user.
In this embodiment, the TF-IDF information of each initial text in the initial text set is obtained, and this information reflects the importance of each word in the initial text; target adjustment words are determined according to the TF-IDF information and adjusted, so that similar texts can be generated while keeping the text semantics close; the initial sentence vector model generates the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by segmenting the words of a text in isolation; unsupervised contrastive learning is adopted, in which only the similar sentence vectors of similar texts serve as positive samples and all remaining vectors serve as negative samples, so that during training the model can fully distinguish the initial sentence vectors from the negative samples, yielding a sentence vector model capable of generating sentence vectors; this application reduces semantic loss when generating the sentence vector model and performs unsupervised training, so that accurate and usable sentence vectors can be generated efficiently.
进一步的,上述步骤S203可以包括:基于TF-IDF信息,将初始文本中TF-IDF值大于预设的第一阈值的拆分词确定为目标调整词;通过词林获取目标调整词的若干个相似词;从初始文本中获取目标调整词的前排序词和后排序词;对于每个相似词,将前排序词、相似词和后排序词进行顺序组合,得到候选词序列;根据预设的语料库获取候选词序列的序列频率;将初始文本中的目标调整词替换为序列频率最高的候选词序列中的相似词,得到初始文本的相似文本。
其中,词林是一种词典,基于词林可以查询一个词的相似词,并获取到词之间的相似度。前排序词可以是初始文本中,排在目标调整词前边的词;后排序词可以是初始文本中,排在目标调整词后边的词。
具体地,TF-IDF信息记录了各拆分词的TF-IDF值,服务器可以获取预设的第一阈值,第一阈值可以是一个较大的数值,将TF-IDF值大于第一阈值的拆分词确定为目标调整词。
对于每个目标调整词,服务器通过词林获取目标调整词的相似词,以及目标调整词与相似词之间的相似度,可以选取相似度最高的若干个相似词。在一个实施例中,词林可以是同义词词林Net,它是由哈尔滨工业大学发布的一款词林。
再从初始文本中获取目标调整词的前排序词和后排序词。其中,前排序词可以由预设 数量的拆分词组成,后排序词可以由预设数量的拆分词组成。对于每个相似词,按照“前排序词、相似词、后排序词”的顺序组合成候选词序列。
服务器通过预设的语料库获取候选词序列的序列频率。语料库中包含多个文本,序列频率是统计候选词序列在语料库全部文本中的出现次数,除以语料库中词序列(所包含的词数量与候选词序列中词数量一致)总数的比值得到的。语料库与初始文本集可以相同,也可以不同。
序列频率越高,代表候选词序列出现频率越高,应用越多,语义越优。服务器选取序列频率最高的候选词序列,令其中的相似词替换初始文本中的候选词序列,得到相似文本。
本实施例中,根据第一阈值将重要的拆分词确定为目标调整词,获取目标调整词的相似词,并生成相似词的候选词序列,根据语料库获取候选词序列的序列频率,选取序列频率最高的候选词序列中的相似词去替换目标调整词,从而自动生成与初始文本语义最为相似的相似文本。
Further, after the step of replacing the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency to obtain the similar text of the initial text, the method may further include: selecting, based on the TF-IDF information, the segmented words in the similar text whose TF-IDF value is less than a preset second threshold; and deleting the selected segmented words from the similar text.
Specifically, the server may also obtain a preset second threshold, which may be a relatively small value, and select the segmented words whose TF-IDF value is less than the second threshold. These segmented words are of low importance and may be deleted from the similar text, so as to adjust the text further.
In this embodiment, the segmented words whose TF-IDF value is less than the second threshold are deleted, adjusting the text further and enlarging the difference between the similar text and the initial text while keeping their semantics close.
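A one-function sketch of this further adjustment (the threshold value is an arbitrary assumption):

```python
def drop_unimportant(words, scores, second_threshold=0.01):
    """Delete segmented words whose TF-IDF value falls below the second threshold."""
    return [w for w in words if scores.get(w, 0.0) >= second_threshold]
```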
Further, the above step of inputting the initial text into the initial sentence vector model to obtain the initial sentence vector may include: inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector; inputting the raw vector into a standardization network in the initial sentence vector model to obtain a standardized raw vector; and inputting the standardized raw vector into a fully connected network in the initial sentence vector model for dimensionality reduction, to obtain the initial sentence vector.
Specifically, the initial sentence vector model may be composed of a vector generation model, a standardization network and a fully connected network. The initial text is first input into the vector generation model, which converts the initial text into a raw vector.
The information in the raw vector is unevenly distributed and rather disordered, and the vector is of high dimensionality, which is inconvenient for storage and subsequent computation. The raw vector may therefore first be input into the standardization network to be standardized; standardization maps the raw vector into a smoother vector space, so that information is distributed more evenly within the vector. In one embodiment, the raw vector may be input into a Layer Normalization layer for standardization.
The standardized raw vector is then input into the fully connected network, where it is multiplied by a matrix, which reduces the dimensionality of the vector and yields the initial sentence vector.
In this embodiment, after the raw vector output by the vector generation model is obtained, the raw vector is standardized and dimensionality-reduced, optimizing the raw vector and producing the initial sentence vector.
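A possible PyTorch sketch of the standardization network followed by the fully connected dimensionality-reduction layer is given below; the hidden and output sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SentenceVectorHead(nn.Module):
    """Standardization network followed by fully connected dimension reduction."""
    def __init__(self, hidden_size=768, out_size=128):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)        # standardization network
        self.fc = nn.Linear(hidden_size, out_size)   # matrix multiplication reduces dimension

    def forward(self, raw_vector):
        return self.fc(self.norm(raw_vector))

head = SentenceVectorHead()
sentence_vecs = head(torch.randn(2, 768))  # two raw vectors -> two 128-d sentence vectors
```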
Further, the above step of inputting the initial text into the vector generation model in the initial sentence vector model to obtain the raw vector may include: inputting the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a BERT model; and extracting the raw vector from the model output information according to a preset identifier.
Specifically, the initial text is input into the vector generation model in the initial sentence vector model so that the text is converted into a vector, yielding model output information; the vector generation model may be a BERT model. The BERT model is a pre-trained language model that takes a text as a whole sequence for input and, on the basis of word vectors, can learn the semantic information contained in the text. In one embodiment, other BERT-based models may also be used, such as the BERT-wwm model or the BERT-wwm-ext model.
The model output information may include several kinds of information, each of which may carry an identifier describing it. The server extracts the required information from the model output information according to the identifier, obtaining the raw vector. In one embodiment, the server extracts the vector corresponding to [CLS] in the last layer of the BERT model as the raw vector.
It will be understood that the initial sentence vector model, and likewise the sentence vector model, processes any input text in the same way: when the similar text of an initial text is input into the model, the similar text undergoes the same processing as described above.
In this embodiment, using the BERT model as the vector generation model ensures that the resulting raw vector can carry the semantic information of the text, improving the accuracy of the raw vector.
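For illustration, extracting the last-layer [CLS] vector with the Hugging Face transformers library might look like the following; the checkpoint name is an assumption, and any Chinese BERT checkpoint could be substituted:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("待处理的初始文本", return_tensors="pt")
outputs = model(**inputs)

# [CLS] occupies the first position of the last hidden layer;
# its vector is taken as the raw vector of the whole text.
raw_vector = outputs.last_hidden_state[:, 0, :]   # shape: (1, hidden_size)
```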
Further, the above step S206 may include: computing the similarity between the initial sentence vector and the positive sample and between the initial sentence vector and each negative sample; computing the model loss from the resulting similarities; and adjusting the model parameters of the initial sentence vector model according to the model loss until the model loss converges, to obtain the sentence vector model.
Specifically, in contrastive learning, the loss function of the model is as follows:

    sim(u, v) = (u · v) / (‖u‖ ‖v‖)    (1)

    L_i = -log { exp(sim(E_i, E_i')) / [ exp(sim(E_i, E_i')) + Σ_{k≠i} ( exp(sim(E_i, E_k)) + exp(sim(E_i, E_k')) ) ] }    (2)

    L = Σ_{i=1}^{B} L_i    (3)

where sim is the cosine similarity function; B is the number of vectors in a batch; E_i is the current initial sentence vector; E_i' is the positive sample of E_i; E_k and E_k' are, respectively, the initial sentence vectors in a batch and the similar sentence vectors associated with those initial sentence vectors; L_i is the model loss computed for a single initial sentence vector; and L is the total model loss over a batch.
The server needs to compute the similarity between the current initial sentence vector and the positive sample, and between it and each negative sample; specifically, the similarities may be computed with the cosine similarity function, and the resulting similarities are then substituted into formula (2) to compute the model loss.
After the model loss is obtained, the model parameters of the initial sentence vector model are adjusted according to the model loss; specifically, the model parameters may be adjusted with the objective of minimizing the model loss. After the parameters are adjusted, the initial texts are fed into the model again so that the model is trained iteratively, until the model loss no longer changes and convergence is reached, whereupon the final sentence vector model is obtained.
In this embodiment, in order to maximize the difference between the initial sentence vector and each negative sample, the model loss is computed on the basis of the similarities between vectors, and the model parameters are adjusted according to the model loss, which ensures the accuracy of the resulting sentence vector model.
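A compact PyTorch sketch of this objective is given below. It assumes the reconstructed formulas (1) to (3) above and uses the cosine similarities directly as logits (no temperature term appears in the surrounding definitions); it is an editorial illustration, not the application's reference implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(initial_vecs, similar_vecs):
    """Total batch loss: each E_i's positive is its own E_i'; all other initial
    and similar vectors in the batch serve as negatives."""
    B = initial_vecs.size(0)
    all_vecs = torch.cat([initial_vecs, similar_vecs], dim=0)          # (2B, d)
    sims = F.cosine_similarity(initial_vecs.unsqueeze(1),
                               all_vecs.unsqueeze(0), dim=-1)          # (B, 2B)
    idx = torch.arange(B)
    mask = torch.zeros_like(sims)
    mask[idx, idx] = float("-inf")          # drop sim(E_i, E_i) from the denominator
    targets = idx + B                       # column holding each positive sample E_i'
    return F.cross_entropy(sims + mask, targets, reduction="sum")  # L = sum of L_i
```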
The sentence vector generation method based on a sentence vector model in the present application may be applied in the field of artificial intelligence, for example to natural language processing within the artificial intelligence field.
The embodiments of the present application may acquire and process the relevant data on the basis of artificial intelligence technology. Artificial intelligence (AI) comprises the theories, methods, technologies and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Basic artificial intelligence technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. AI software technologies mainly comprise several major directions: computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a computer-readable storage medium, and when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), among others.
It should be understood that, although the steps in the flowcharts of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may comprise multiple sub-steps or multiple stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; rather, they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a sentence vector generation apparatus based on a sentence vector model; this apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied in various electronic devices.
As shown in FIG. 3, the sentence vector generation apparatus 300 based on a sentence vector model of this embodiment comprises: a text set acquisition module 301, an information acquisition module 302, a text adjustment module 303, a text input module 304, a vector setting module 305, a contrastive learning module 306 and a to-be-processed input module 307, wherein:
the text set acquisition module 301 is configured to acquire an initial text set;
the information acquisition module 302 is configured to, for each initial text in the initial text set, obtain TF-IDF information of the initial text;
the text adjustment module 303 is configured to determine target adjustment words in the initial text according to the TF-IDF information, and adjust the initial text based on the target adjustment words to generate a similar text of the initial text;
the text input module 304 is configured to input the initial text into an initial sentence vector model to obtain an initial sentence vector, and input the similar text into the initial sentence vector model to obtain a similar sentence vector;
the vector setting module 305 is configured to set the similar sentence vector as the positive sample of the current initial sentence vector, and set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of those other initial texts, as the negative samples of the current initial sentence vector;
the contrastive learning module 306 is configured to instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model; and
the to-be-processed input module 307 is configured to input an acquired text to be processed into the sentence vector model to obtain the sentence vector of the text to be processed.
In this embodiment, the TF-IDF information of each initial text in the initial text set is obtained; the TF-IDF information reflects the importance of each word in the initial text, so target adjustment words are determined from the TF-IDF information and adjusted, which makes it possible to generate similar texts while keeping the text semantics close. The initial sentence vector model is used to generate the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by splitting the words of a text individually. Unsupervised contrastive learning is adopted, with only the similar sentence vector of the similar text taken as the positive sample and all remaining vectors taken as negative samples, so that during training the model learns to fully distinguish the initial sentence vector from the negative samples, yielding a sentence vector model capable of generating sentence vectors. The present application reduces semantic loss when generating the sentence vector model and performs unsupervised training, and can therefore generate accurate and usable sentence vectors efficiently.
In some optional implementations of this embodiment, the information acquisition module 302 may include a word segmentation sub-module and a computation sub-module, wherein:
the word segmentation sub-module is configured to, for each initial text in the initial text set, perform word segmentation on the initial text to obtain a number of segmented words; and
the computation sub-module is configured to compute the TF-IDF value of each segmented word based on the initial text set, to obtain the TF-IDF information of the initial text.
In this embodiment, the initial text is first segmented to obtain the segmented words, and with the initial text set as the reference corpus, the TF-IDF value of each segmented word can be computed quickly.
In some optional implementations of this embodiment, the text adjustment module 303 may include a target determination sub-module, a similar word acquisition sub-module, an acquisition sub-module, a combination sub-module, a frequency acquisition sub-module and a target replacement sub-module, wherein:
the target determination sub-module is configured to determine, based on the TF-IDF information, the segmented words in the initial text whose TF-IDF value is greater than a preset first threshold as target adjustment words;
the similar word acquisition sub-module is configured to obtain a number of similar words of the target adjustment word by means of the thesaurus;
the acquisition sub-module is configured to obtain, from the initial text, the preceding words and following words of the target adjustment word;
the combination sub-module is configured to, for each similar word, combine the preceding words, the similar word and the following words in order, to obtain a candidate word sequence;
the frequency acquisition sub-module is configured to obtain the sequence frequency of the candidate word sequence according to a preset corpus; and
the target replacement sub-module is configured to replace the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency, to obtain the similar text of the initial text.
In this embodiment, the important segmented words are determined as target adjustment words according to the first threshold; similar words of the target adjustment words are obtained and candidate word sequences of the similar words are generated; the sequence frequency of each candidate word sequence is obtained from the corpus; and the similar word from the candidate word sequence with the highest sequence frequency is chosen to replace the target adjustment word, so that a similar text semantically closest to the initial text is generated automatically.
In some optional implementations of this embodiment, the text adjustment module 303 may further include a determination sub-module and a deletion sub-module, wherein:
the determination sub-module is configured to select, based on the TF-IDF information, the segmented words in the similar text whose TF-IDF value is less than a preset second threshold; and
the deletion sub-module is configured to delete the selected segmented words from the similar text.
In this embodiment, the segmented words whose TF-IDF value is less than the second threshold are deleted, adjusting the text further and enlarging the difference between the similar text and the initial text while keeping their semantics close.
In some optional implementations of this embodiment, the text input module 304 may include a text input sub-module, a raw-vector input sub-module and a dimensionality reduction sub-module, wherein:
the text input sub-module is configured to input the initial text into the vector generation model in the initial sentence vector model to obtain a raw vector;
the raw-vector input sub-module is configured to input the raw vector into the standardization network in the initial sentence vector model to obtain a standardized raw vector; and
the dimensionality reduction sub-module is configured to input the standardized raw vector into the fully connected network in the initial sentence vector model for dimensionality reduction, to obtain the initial sentence vector.
In this embodiment, after the raw vector output by the vector generation model is obtained, the raw vector is standardized and dimensionality-reduced, optimizing the raw vector and producing the initial sentence vector.
In some optional implementations of this embodiment, the text input sub-module may include a text input unit and a vector extraction unit, wherein:
the text input unit is configured to input the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a BERT model; and
the vector extraction unit is configured to extract the raw vector from the model output information according to a preset identifier.
In this embodiment, using the BERT model as the vector generation model ensures that the resulting raw vector can carry the semantic information of the text, improving the accuracy of the raw vector.
In some optional implementations of this embodiment, the contrastive learning module 306 may include a similarity computation sub-module, a loss computation sub-module and a model adjustment sub-module, wherein:
the similarity computation sub-module is configured to compute the similarity between the initial sentence vector and the positive sample and between the initial sentence vector and each negative sample;
the loss computation sub-module is configured to compute the model loss from the resulting similarities; and
the model adjustment sub-module is configured to adjust the model parameters of the initial sentence vector model according to the model loss until the model loss converges, to obtain the sentence vector model.
In this embodiment, in order to maximize the difference between the initial sentence vector and each negative sample, the model loss is computed on the basis of the similarities between vectors, and the model parameters are adjusted according to the model loss, which ensures the accuracy of the resulting sentence vector model.
To solve the above technical problem, an embodiment of the present application further provides a computer device. Reference is made specifically to FIG. 4, which is a basic structural block diagram of the computer device of this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to one another via a system bus. It should be noted that the figure shows only a computer device 4 having the components 41-43, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead. A person skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
The memory 41 includes at least one type of computer-readable storage medium, which may be non-volatile or volatile and includes flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, for example a hard disk or an internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof. In this embodiment, the memory 41 is generally used to store the operating system and the various kinds of application software installed on the computer device 4, such as computer-readable instructions of the sentence vector generation method based on a sentence vector model. In addition, the memory 41 may also be used to temporarily store various kinds of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the sentence vector generation method based on a sentence vector model.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish communication connections between the computer device 4 and other electronic devices.
The computer device provided in this embodiment can execute the above sentence vector generation method based on a sentence vector model. The sentence vector generation method based on a sentence vector model here may be the sentence vector generation method based on a sentence vector model of any of the above embodiments.
In this embodiment, the TF-IDF information of each initial text in the initial text set is obtained; the TF-IDF information reflects the importance of each word in the initial text, so target adjustment words are determined from the TF-IDF information and adjusted, which makes it possible to generate similar texts while keeping the text semantics close. The initial sentence vector model is used to generate the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by splitting the words of a text individually. Unsupervised contrastive learning is adopted, with only the similar sentence vector of the similar text taken as the positive sample and all remaining vectors taken as negative samples, so that during training the model learns to fully distinguish the initial sentence vector from the negative samples, yielding a sentence vector model capable of generating sentence vectors. The present application reduces semantic loss when generating the sentence vector model and performs unsupervised training, and can therefore generate accurate and usable sentence vectors efficiently.
The present application further provides another implementation, namely a computer-readable storage medium storing computer-readable instructions, the computer-readable instructions being executable by at least one processor to cause the at least one processor to execute the steps of the sentence vector generation method based on a sentence vector model as described above.
In this embodiment, the TF-IDF information of each initial text in the initial text set is obtained; the TF-IDF information reflects the importance of each word in the initial text, so target adjustment words are determined from the TF-IDF information and adjusted, which makes it possible to generate similar texts while keeping the text semantics close. The initial sentence vector model is used to generate the initial sentence vector of the initial text and the similar sentence vector of the similar text, avoiding the semantic loss caused by splitting the words of a text individually. Unsupervised contrastive learning is adopted, with only the similar sentence vector of the similar text taken as the positive sample and all remaining vectors taken as negative samples, so that during training the model learns to fully distinguish the initial sentence vector from the negative samples, yielding a sentence vector model capable of generating sentence vectors. The present application reduces semantic loss when generating the sentence vector model and performs unsupervised training, and can therefore generate accurate and usable sentence vectors efficiently.
From the description of the above implementations, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some of the embodiments of the present application rather than all of them; the preferred embodiments of the present application are given in the drawings, but this does not limit the patent scope of the present application. The present application can be implemented in many different forms; rather, these embodiments are provided so that the disclosure of the present application will be understood more thoroughly and comprehensively. Although the present application has been described in detail with reference to the foregoing embodiments, a person skilled in the art can still modify the technical solutions recorded in the foregoing specific implementations, or substitute equivalents for some of the technical features therein. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. A sentence vector generation method based on a sentence vector model, comprising the following steps:
    acquiring an initial text set;
    for each initial text in the initial text set, obtaining TF-IDF information of the initial text;
    determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text;
    inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
    setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of the other initial texts, as negative samples of the current initial sentence vector;
    instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model; and
    inputting an acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
  2. The sentence vector generation method based on a sentence vector model according to claim 1, wherein the step of, for each initial text in the initial text set, obtaining TF-IDF information of the initial text comprises:
    for each initial text in the initial text set, performing word segmentation on the initial text to obtain a number of segmented words; and
    computing a TF-IDF value of each segmented word based on the initial text set, to obtain the TF-IDF information of the initial text.
  3. The sentence vector generation method based on a sentence vector model according to claim 2, wherein the step of determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text, comprises:
    determining, based on the TF-IDF information, the segmented words in the initial text whose TF-IDF value is greater than a preset first threshold as target adjustment words;
    obtaining a number of similar words of the target adjustment word by means of a thesaurus;
    obtaining, from the initial text, preceding words and following words of the target adjustment word;
    for each similar word, combining the preceding words, the similar word and the following words in order, to obtain a candidate word sequence;
    obtaining a sequence frequency of the candidate word sequence according to a preset corpus; and
    replacing the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency, to obtain the similar text of the initial text.
  4. The sentence vector generation method based on a sentence vector model according to claim 3, wherein after the step of replacing the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency to obtain the similar text of the initial text, the method further comprises:
    selecting, based on the TF-IDF information, the segmented words in the similar text whose TF-IDF value is less than a preset second threshold; and
    deleting the selected segmented words from the similar text.
  5. The sentence vector generation method based on a sentence vector model according to claim 1, wherein the step of inputting the initial text into an initial sentence vector model to obtain an initial sentence vector comprises:
    inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector;
    inputting the raw vector into a standardization network in the initial sentence vector model to obtain a standardized raw vector; and
    inputting the standardized raw vector into a fully connected network in the initial sentence vector model for dimensionality reduction, to obtain the initial sentence vector.
  6. The sentence vector generation method based on a sentence vector model according to claim 5, wherein the step of inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector comprises:
    inputting the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a BERT model; and
    extracting the raw vector from the model output information according to a preset identifier.
  7. The sentence vector generation method based on a sentence vector model according to claim 1, wherein the step of instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model, comprises:
    computing the similarity between the initial sentence vector and the positive sample and between the initial sentence vector and each negative sample;
    computing a model loss from the resulting similarities; and
    adjusting model parameters of the initial sentence vector model according to the model loss until the model loss converges, to obtain the sentence vector model.
  8. A sentence vector generation apparatus based on a sentence vector model, comprising:
    a text set acquisition module, configured to acquire an initial text set;
    an information acquisition module, configured to, for each initial text in the initial text set, obtain TF-IDF information of the initial text;
    a text adjustment module, configured to determine target adjustment words in the initial text according to the TF-IDF information, and adjust the initial text based on the target adjustment words to generate a similar text of the initial text;
    a text input module, configured to input the initial text into an initial sentence vector model to obtain an initial sentence vector, and input the similar text into the initial sentence vector model to obtain a similar sentence vector;
    a vector setting module, configured to set the similar sentence vector as a positive sample of the current initial sentence vector, and set the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of the other initial texts, as negative samples of the current initial sentence vector;
    a contrastive learning module, configured to instruct the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model; and
    a to-be-processed input module, configured to input an acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor implements the following steps when executing the computer-readable instructions:
    acquiring an initial text set;
    for each initial text in the initial text set, obtaining TF-IDF information of the initial text;
    determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text;
    inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
    setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of the other initial texts, as negative samples of the current initial sentence vector;
    instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model; and
    inputting an acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
  10. The computer device according to claim 9, wherein the step of, for each initial text in the initial text set, obtaining TF-IDF information of the initial text comprises:
    for each initial text in the initial text set, performing word segmentation on the initial text to obtain a number of segmented words; and
    computing a TF-IDF value of each segmented word based on the initial text set, to obtain the TF-IDF information of the initial text.
  11. The computer device according to claim 10, wherein the step of determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text, comprises:
    determining, based on the TF-IDF information, the segmented words in the initial text whose TF-IDF value is greater than a preset first threshold as target adjustment words;
    obtaining a number of similar words of the target adjustment word by means of a thesaurus;
    obtaining, from the initial text, preceding words and following words of the target adjustment word;
    for each similar word, combining the preceding words, the similar word and the following words in order, to obtain a candidate word sequence;
    obtaining a sequence frequency of the candidate word sequence according to a preset corpus; and
    replacing the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency, to obtain the similar text of the initial text.
  12. The computer device according to claim 9, wherein the step of inputting the initial text into an initial sentence vector model to obtain an initial sentence vector comprises:
    inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector;
    inputting the raw vector into a standardization network in the initial sentence vector model to obtain a standardized raw vector; and
    inputting the standardized raw vector into a fully connected network in the initial sentence vector model for dimensionality reduction, to obtain the initial sentence vector.
  13. The computer device according to claim 12, wherein the step of inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector comprises:
    inputting the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a BERT model; and
    extracting the raw vector from the model output information according to a preset identifier.
  14. The computer device according to claim 9, wherein the step of instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model, comprises:
    computing the similarity between the initial sentence vector and the positive sample and between the initial sentence vector and each negative sample;
    computing a model loss from the resulting similarities; and
    adjusting model parameters of the initial sentence vector model according to the model loss until the model loss converges, to obtain the sentence vector model.
  15. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring an initial text set;
    for each initial text in the initial text set, obtaining TF-IDF information of the initial text;
    determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text;
    inputting the initial text into an initial sentence vector model to obtain an initial sentence vector, and inputting the similar text into the initial sentence vector model to obtain a similar sentence vector;
    setting the similar sentence vector as a positive sample of the current initial sentence vector, and setting the initial sentence vectors of the other initial texts in the initial text set, as well as the similar sentence vectors corresponding to the similar texts of the other initial texts, as negative samples of the current initial sentence vector;
    instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model; and
    inputting an acquired text to be processed into the sentence vector model to obtain a sentence vector of the text to be processed.
  16. The computer-readable storage medium according to claim 15, wherein the step of, for each initial text in the initial text set, obtaining TF-IDF information of the initial text comprises:
    for each initial text in the initial text set, performing word segmentation on the initial text to obtain a number of segmented words; and
    computing a TF-IDF value of each segmented word based on the initial text set, to obtain the TF-IDF information of the initial text.
  17. The computer-readable storage medium according to claim 16, wherein the step of determining target adjustment words in the initial text according to the TF-IDF information, and adjusting the initial text based on the target adjustment words to generate a similar text of the initial text, comprises:
    determining, based on the TF-IDF information, the segmented words in the initial text whose TF-IDF value is greater than a preset first threshold as target adjustment words;
    obtaining a number of similar words of the target adjustment word by means of a thesaurus;
    obtaining, from the initial text, preceding words and following words of the target adjustment word;
    for each similar word, combining the preceding words, the similar word and the following words in order, to obtain a candidate word sequence;
    obtaining a sequence frequency of the candidate word sequence according to a preset corpus; and
    replacing the target adjustment word in the initial text with the similar word from the candidate word sequence having the highest sequence frequency, to obtain the similar text of the initial text.
  18. The computer-readable storage medium according to claim 15, wherein the step of inputting the initial text into an initial sentence vector model to obtain an initial sentence vector comprises:
    inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector;
    inputting the raw vector into a standardization network in the initial sentence vector model to obtain a standardized raw vector; and
    inputting the standardized raw vector into a fully connected network in the initial sentence vector model for dimensionality reduction, to obtain the initial sentence vector.
  19. The computer-readable storage medium according to claim 18, wherein the step of inputting the initial text into a vector generation model in the initial sentence vector model to obtain a raw vector comprises:
    inputting the initial text into the vector generation model in the initial sentence vector model to obtain model output information, the vector generation model being a BERT model; and
    extracting the raw vector from the model output information according to a preset identifier.
  20. The computer-readable storage medium according to claim 15, wherein the step of instructing the initial sentence vector model to perform contrastive learning according to the current initial sentence vector, the positive sample and the negative samples, to obtain a sentence vector model, comprises:
    computing the similarity between the initial sentence vector and the positive sample and between the initial sentence vector and each negative sample;
    computing a model loss from the resulting similarities; and
    adjusting model parameters of the initial sentence vector model according to the model loss until the model loss converges, to obtain the sentence vector model.