WO2022134759A1 - Keyword generation method and apparatus, electronic device and computer storage medium - Google Patents

Keyword generation method and apparatus, electronic device and computer storage medium

Info

Publication number
WO2022134759A1
WO2022134759A1 PCT/CN2021/123901 CN2021123901W WO2022134759A1 WO 2022134759 A1 WO2022134759 A1 WO 2022134759A1 CN 2021123901 W CN2021123901 W CN 2021123901W WO 2022134759 A1 WO2022134759 A1 WO 2022134759A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
vector
keyword generation
generation model
semantic
Prior art date
Application number
PCT/CN2021/123901
Other languages
English (en)
Chinese (zh)
Inventor
蒋宏达
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134759A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a keyword generation method, apparatus, electronic device, and computer-readable storage medium.
  • the inventor realizes that current keyword generation methods mainly use unsupervised or extractive methods to extract keywords or topics from articles, but these methods have the following shortcomings: specific word segmentation tools are required; the generated keywords are highly repetitive and incoherent; and the generated keywords are not highly relevant to the gist of the article content, that is, their accuracy is low.
  • a keyword generation method comprising:
  • receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • a keyword generating device comprising:
  • a data acquisition module, used for acquiring text data, and identifying the text data with a preset identifier to obtain a training data set;
  • a model training module, used for training the pre-built original keyword generation model with the training data set based on an orthogonal normalization loss function and a noise contrastive estimation loss function, to obtain the keyword generation model;
  • a semantic extraction module, configured to receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • a keyword generation module, configured to use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • An electronic device comprising:
  • the processor executes the computer program stored in the memory to realize the following steps:
  • receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • A computer-readable storage medium comprising a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; wherein the computer program, when executed by a processor, implements the following steps:
  • receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • the present application can improve the accuracy of keyword generation, reduce the repetition of generated keywords, and enhance the coherence between generated keywords.
  • FIG. 1 is a schematic flowchart of a keyword generation method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for extracting keywords by a model provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a keyword generating apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing the keyword generation method provided by an embodiment of the present application.
  • the embodiments of the present application provide a method for generating keywords.
  • the execution body of the keyword generation method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the keyword generation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the embodiments of the present application may acquire and process related data based on artificial intelligence technology.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the keyword generation method includes:
  • the text data includes articles, paragraphs, sentences, etc.
  • the identifiers include start identifiers, separators, end identifiers, and the like.
  • the text data can also be stored in a node of a blockchain.
  • the S1 includes:
  • An identifier is added to each input sequence in the text data for identification to obtain a training data set.
  • the input sequence may be a sentence in the text data or a paragraph in the text data.
  • an identifier is pre-added to the text data before being input as a training data set. For example, for each input sequence in the text data, a start identifier is added at the starting position and an end identifier is added at the end.
  • the end identifier is used to identify the boundary between each input sequence, and can also be used as a stop symbol in the natural language generation task.
  • a separator may also be added to each output sequence in the training data set, and the keyword corresponding to each input sequence may be added after the separator, so that it serves, together with the input sequence, as the input to the original keyword generation model.
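As a concrete illustration of this data-preparation step, here is a minimal Python sketch. The marker strings [BOS], [EOS] and [SEP] and the function name are assumptions for illustration; the embodiment only requires that start identifiers, end identifiers and separators be added:

```python
# Minimal sketch of the identifier-based training-data construction.
# The marker strings below are hypothetical; the embodiment only
# specifies start identifiers, end identifiers and separators.
BOS, EOS, SEP = "[BOS]", "[EOS]", "[SEP]"

def build_training_example(input_sequence, keywords):
    """Wrap one input sequence with start/end identifiers and append
    its keywords after separators to form one training example."""
    source = f"{BOS} {input_sequence} {EOS}"
    target = "".join(f" {SEP} {kw}" for kw in keywords) + f" {EOS}"
    return source + target

example = build_training_example(
    "Neural networks learn hierarchical features from raw data.",
    ["neural networks", "feature learning"],
)
print(example)
# [BOS] Neural networks learn hierarchical features from raw data. [EOS] [SEP] neural networks [SEP] feature learning [EOS]
```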
  • the original keyword generation model is a generative pre-training model. Three training methods can be used during training: Bidirectional LM (bidirectional), Left-to-Right LM (unidirectional, left to right) and Seq-to-Seq LM (sequence-to-sequence), so that the original keyword generation model can handle both NLG (Natural Language Generation) and NLU (Natural Language Understanding) tasks.
  • the Seq-to-Seq LM method is mainly used to train the original keyword generation model.
  • the sequence-to-sequence method means that the input to the original keyword generation model is a continuous word sequence, such as a complete sentence, and the output is also a continuous word sequence; each word in the output sequence is generated based on the input sequence L_i and the sequence L_{i-1} preceding the input sequence L_i, and the lengths of the input and output are not necessarily equal.
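To make the Seq-to-Seq LM objective concrete, the sketch below builds the self-attention mask commonly used to train a single generative pre-training model in sequence-to-sequence mode: source positions attend bidirectionally within the source, while each target position attends to the source and to earlier target positions. This mask construction is a standard technique assumed here for illustration, not a formula quoted from the application:

```python
import numpy as np

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Mask M[i, j] = 1 means position i may attend to position j.
    Source positions attend bidirectionally within the source; target
    positions attend to all source positions and to target positions
    up to and including themselves."""
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=np.int8)
    mask[:, :src_len] = 1                 # every position sees the source
    for i in range(src_len, n):           # causal part over the target
        mask[i, src_len:i + 1] = 1
    mask[:src_len, src_len:] = 0          # the source never sees the target
    return mask

print(seq2seq_attention_mask(3, 2))
```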
  • the S2 includes:
  • satisfying the termination condition means that the loss value is less than or equal to a preset threshold.
  • the orthogonal loss value L_OR of the training result set is obtained by using an orthogonal normalization loss function of the form L_OR = ‖(1 − I_n) ⊙ (H·H^T)‖_2, where L_OR is the orthogonal loss value, H is the encoding matrix formed from the encoder outputs of the original keyword generation model at the separator positions in the training data set, H^T is the transpose of H, I_n is the identity matrix, and (1 − I_n) is the orthogonality coefficient, a constant.
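A sketch of this orthogonal regularization in PyTorch, consistent with the definitions above: the off-diagonal entries of the Gram matrix H·H^T are penalized, pushing the separator encodings toward mutual orthogonality so that different keywords cover different semantics. The choice of the Frobenius norm is an assumption:

```python
import torch

def orthogonal_loss(H: torch.Tensor) -> torch.Tensor:
    """H: (n, d) matrix of encoder outputs at the n separator positions.
    Returns ||(1 - I_n) * (H H^T)||, the norm of the off-diagonal part
    of the Gram matrix, which is zero when the rows of H are orthogonal."""
    n = H.size(0)
    gram = H @ H.t()                               # (n, n) pairwise inner products
    mask = 1.0 - torch.eye(n, device=H.device)     # the (1 - I_n) coefficient
    return torch.linalg.norm(mask * gram)          # Frobenius norm

H = torch.randn(4, 256, requires_grad=True)        # 4 separators, d = 256
loss_or = orthogonal_loss(H)
loss_or.backward()
```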
  • the noise loss value L_SC of the training result set is calculated by using a noise contrastive estimation loss function, where L_SC is the noise loss value and N is the total number of input sequences in the training data set.
  • a comprehensive loss value of the training result set is calculated according to the orthogonal loss value and the noise loss value.
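A sketch of how the comprehensive loss might combine these terms, reusing orthogonal_loss from the sketch above. The inclusion of a cross-entropy generation term, the noise contrastive term being precomputed elsewhere, and the weighting coefficients are all assumptions; the application does not publish the combination weights:

```python
import torch
import torch.nn.functional as F

def comprehensive_loss(logits, targets, H, noise_loss,
                       lambda_or=0.1, lambda_sc=0.1):
    """logits: (batch, seq, vocab) decoder outputs; targets: (batch, seq)
    gold token ids; H: separator encodings; noise_loss: a precomputed
    noise contrastive estimation loss L_SC. The lambda weights are
    hypothetical placeholders, not values from the application."""
    generation_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return (generation_loss
            + lambda_or * orthogonal_loss(H)
            + lambda_sc * noise_loss)
```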
  • the orthogonal normalization loss function introduced in the model training stage in the embodiment of the present application can ensure the diversity of generated keywords, and the noise contrastive estimation loss function can ensure that the generated keywords better capture the topic information of the text.
  • the keyword generation model in the embodiment of the present application includes an encoder and a decoder; the encoder is used to extract semantic information of the input text to generate a semantic vector, and the decoder is used to generate a plurality of keywords according to the semantic vector.
  • the decoder includes a fully connected layer and an activation function, used for calculating the probabilities of the multiple word vectors derived from the semantic vector, and generating and outputting keywords according to these probabilities.
  • the semantic vector in the embodiment of the present application is obtained according to the semantic information of the input sequence, contains the semantic features of the input sequence, and is the result of a natural language understanding (NLU) task.
  • the first semantic feature and the second semantic feature are fused and converted into a vector to obtain the semantic vector of the text to be processed.
  • the keyword generation model processes each word in the to-be-processed text in sequence.
  • the fusion of the first semantic feature and the second semantic feature uses the similarity as the weight of the second semantic feature and combines it with the first semantic feature according to that weight, so that the semantic features of important words account for a larger proportion of the final semantic features.
  • the attention mechanism described in the embodiments of the present application is used to distinguish the influence of different parts in the input sequence on the output.
  • the adjacent words of a word help to enhance the semantic representation of the word.
  • the mechanism can enhance the semantic vector of the input sequence, better extract the semantic information of the input sequence, and improve the accuracy of the keywords generated by the model.
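A sketch of the similarity-weighted fusion described above: the similarity between a word's own semantic feature (the first feature) and the features of its surrounding words (the second features) serves as the weight, so that important neighboring words contribute more to the final semantic vector. Dot-product similarity with softmax normalization and additive combination are assumptions:

```python
import torch
import torch.nn.functional as F

def fuse_semantic_features(first: torch.Tensor,
                           second: torch.Tensor) -> torch.Tensor:
    """first:  (d,)   the word's own semantic feature.
    second: (m, d) semantic features of the m surrounding words.
    Similarity scores act as weights on the second features, so words
    that matter more contribute more to the fused semantic vector."""
    scores = second @ first              # (m,) dot-product similarities
    weights = F.softmax(scores, dim=0)   # normalized attention weights
    context = weights @ second           # (d,) weighted context feature
    return first + context               # fused semantic vector

word = torch.randn(64)
neighbors = torch.randn(5, 64)
semantic_vector = fuse_semantic_features(word, neighbors)
```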
  • the S4 includes:
  • the third word vector set is updated multiple times.
  • For example, suppose the size of the dictionary is 3, containing [A, B, C], and k is 2.
  • At the first step, keep the two sequences with the highest probability, assume A and C. At the next step, select 2 words from the dictionary, assume A and B; combine the current two sequences A and C with the 2 selected words to obtain 4 new sequences AA, AB, CA and CB; calculate the probability of each sequence and keep the two sequences with the highest probability, assume AA and CB. Repeat this process until the preset end identifier is encountered, obtain the final 2 sequences, and output the sequence with the highest probability.
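A compact beam-search sketch matching this walkthrough, with beam width k = 2. The step_log_probs callable stands in for the decoder's per-step output distribution and is an assumption of this sketch:

```python
import math

def beam_search(step_log_probs, k=2, eos="<eos>", max_len=10):
    """step_log_probs(seq) -> {token: log-probability of the next token
    given the sequence so far}. Keeps the k highest-probability
    sequences at every step, as in the A/C -> AA, AB, CA, CB example."""
    beams = [([], 0.0)]                           # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:            # finished beams pass through
                candidates.append((seq, score))
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return max(beams, key=lambda c: c[1])[0]      # highest-probability sequence

# Toy usage: a fixed next-token distribution over the dictionary [A, B, C].
vocab = {"A": math.log(0.5), "B": math.log(0.3), "C": math.log(0.2)}
print(beam_search(lambda seq: vocab if len(seq) < 2 else {"<eos>": 0.0}))
```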
  • converting the semantic vector into a plurality of word vectors by using the decoder in the keyword generation model includes: performing a linear transformation on the semantic vector through the multi-layer network of the decoder to obtain a transformation vector; and selecting, from a preset dictionary, the vectors whose distance to the transformation vector is less than a preset distance threshold to obtain the plurality of word vectors.
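A sketch of this conversion step: a hypothetical linear transform of the decoder maps the semantic vector into the word-vector space, and the preset dictionary is filtered by a distance threshold. Euclidean distance and the threshold value are assumptions:

```python
import torch

def semantic_to_word_vectors(semantic, weight, dictionary, threshold=1.0):
    """semantic: (d,) semantic vector; weight: (d, d) linear layer of the
    decoder's multi-layer network; dictionary: (V, d) preset dictionary
    of word vectors. Returns the dictionary vectors whose distance to
    the transformation vector is below the preset threshold."""
    transformation = weight @ semantic                     # transformation vector
    distances = torch.norm(dictionary - transformation, dim=1)
    return dictionary[distances < threshold]               # candidate word vectors

d, V = 64, 1000
candidates = semantic_to_word_vectors(
    torch.randn(d), torch.randn(d, d), torch.randn(V, d), threshold=10.0)
```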
  • the output probability values of the multiple word vectors are calculated through the fully connected layer and the activation function of the keyword generation model, including:
  • after calculating the probability value of a repeated word, this embodiment of the present application multiplies that probability value by a preset penalty factor, such as 0.1, so that the probability of the repeated word being output is reduced, thereby reducing the repetition of generated keywords and improving their quality.
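A sketch of this repetition penalty: after the probabilities are computed, each word that has already been output has its probability multiplied by the preset penalty factor (0.1 in the example above):

```python
def apply_repetition_penalty(probs: dict, generated: list, factor=0.1):
    """probs: word -> probability from the fully connected layer and
    activation function. Words already generated are down-weighted by
    the preset penalty factor so repeated keywords become unlikely."""
    return {w: p * factor if w in generated else p for w, p in probs.items()}

probs = {"network": 0.6, "keyword": 0.3, "model": 0.1}
print(apply_repetition_penalty(probs, generated=["network"]))
# {'network': 0.06, 'keyword': 0.3, 'model': 0.1}
```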
  • keywords are generated from the text to be processed by using the keyword generation model; the generated keywords are closer to the topic of the text to be processed, the repetition between words is lower, and the coherence is better.
  • the original keyword generation model is trained by using an orthogonal normalization loss function and a noise contrastive estimation loss function to obtain the keyword generation model.
  • the orthogonal normalization loss function ensures the diversity of generated keywords.
  • the noise contrastive estimation loss function ensures that the generated keywords better capture the topic information of the text, improves the accuracy of the keywords generated by the keyword generation model, and enhances the coherence between the generated keywords.
  • the preset penalty factor is used when extracting keywords from the semantic vector in a beam search manner, and the penalty factor can reduce the output probability of repeated keywords, thereby reducing the repetition of generated keywords. Therefore, the keyword generation method, apparatus and computer-readable storage medium proposed in this application can improve the accuracy of keyword generation, reduce the repetition of generated keywords, and enhance the coherence between generated keywords.
  • FIG. 4 is a schematic block diagram of the keyword generating apparatus of the present application.
  • the keyword generating apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the keyword generating apparatus may include a data acquisition module 101 , a model training module 102 , a semantic extraction module 103 and a keyword generation module 104 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module of the keyword generation device is as follows:
  • the data acquisition module 101 is configured to acquire text data, identify the text data with a preset identifier, and obtain a training data set.
  • the text data includes articles, paragraphs, sentences, etc.
  • the identifiers include start identifiers, separators, end identifiers, and the like.
  • the text data can also be stored in a node of a blockchain
  • the data acquisition module 101 is specifically used for:
  • An identifier is added to each input sequence in the text data for identification to obtain a training data set.
  • the input sequence may be a sentence in the text data or a paragraph in the text data.
  • an identifier is pre-added to the text data before being input as a training data set. For example, for each input sequence in the text data, a start identifier is added at the starting position and an end identifier is added at the end.
  • the end identifier is used to identify the boundary between each input sequence, and can also be used as a stop symbol in the natural language generation task.
  • a separator may also be added to each output sequence in the training data set, and the keyword corresponding to each input sequence may be added after the separator, so that it serves, together with the input sequence, as the input to the original keyword generation model.
  • the model training module 102 is configured to train the pre-built original keyword generation model with the training data set based on an orthogonal normalization loss function and a noise contrastive estimation loss function, to obtain the keyword generation model.
  • the original keyword generation model is a generative pre-training model. Three training methods can be used during training: Bidirectional LM (bidirectional), Left-to-Right LM (unidirectional, left to right) and Seq-to-Seq LM (sequence-to-sequence), so that the original keyword generation model can handle both NLG (Natural Language Generation) and NLU (Natural Language Understanding) tasks.
  • the Seq-to-Seq LM method is mainly used to train the original keyword generation model.
  • the sequence-to-sequence method means that the input to the original keyword generation model is a continuous word sequence, such as a complete sentence, and the output is also a continuous word sequence; each word in the output sequence is generated based on the input sequence L_i and the sequence L_{i-1} preceding the input sequence L_i, and the lengths of the input and output are not necessarily equal.
  • model training module 102 is specifically used for:
  • training is stopped to obtain the keyword generation model.
  • satisfying the termination condition means that the loss value is less than or equal to a preset threshold.
  • the orthogonal loss value L_OR of the training result set is obtained by using an orthogonal normalization loss function of the form L_OR = ‖(1 − I_n) ⊙ (H·H^T)‖_2, where L_OR is the orthogonal loss value, H is the encoding matrix formed from the encoder outputs of the original keyword generation model at the separator positions in the training data set, H^T is the transpose of H, I_n is the identity matrix, and (1 − I_n) is the orthogonality coefficient, a constant.
  • the noise loss value L_SC of the training result set is calculated by using a noise contrastive estimation loss function, where L_SC is the noise loss value and N is the total number of input sequences in the training data set.
  • a comprehensive loss value of the training result set is calculated according to the orthogonal loss value and the noise loss value.
  • the orthogonal normalization loss function introduced in the model training stage in the embodiment of the present application can ensure the diversity of generated keywords, and the noise contrastive estimation loss function can ensure that the generated keywords capture the topic information of the text.
  • the semantic extraction module 103 is configured to receive the text to be processed, use the encoder in the keyword generation model to extract the semantic information of the text to be processed, and use the attention mechanism to process the semantic information to generate a semantic vector.
  • the keyword generation model in the embodiment of the present application includes an encoder and a decoder; the encoder is used to extract semantic information of the input text to generate a semantic vector, and the decoder is used to generate a plurality of keywords according to the semantic vector.
  • the decoder includes a fully connected layer and an activation function, used for calculating the probabilities of the multiple word vectors derived from the semantic vector, and generating and outputting keywords according to these probabilities.
  • the semantic vector in the embodiment of the present application is obtained according to the semantic information of the input sequence, contains the semantic features of the input sequence, and is the result of a natural language understanding (NLU) task.
  • the semantic extraction module 103 specifically performs the following operations:
  • the first semantic feature and the second semantic feature are fused and converted into a vector to obtain a semantic vector of the text to be processed.
  • the keyword generation model processes each word in the to-be-processed text in sequence.
  • the fusion of the first semantic feature and the second semantic feature uses the similarity as the weight of the second semantic feature and combines it with the first semantic feature according to that weight, so that the semantic features of important words account for a larger proportion of the final semantic features.
  • the attention mechanism described in the embodiments of the present application is used to distinguish the influence of different parts in the input sequence on the output.
  • the adjacent words of a word help to enhance the semantic representation of the word.
  • the mechanism can enhance the semantic vector of the input sequence, better extract the semantic information of the input sequence, and improve the accuracy of the keywords generated by the model.
  • the keyword generation module 104 is configured to use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • the keyword generation module 104 is specifically used for:
  • the probability values of the multiple word vectors are calculated through the fully connected layer and the activation function of the keyword generation model;
  • the decoder is used to recalculate the probabilities of the plurality of word vectors, select the k word vectors with the highest probability values as the second word vector set, and combine each word vector in the first word vector set pairwise with each word vector in the second word vector set to obtain a third word vector set;
  • the third word vector set is updated multiple times.
  • For example, suppose the size of the dictionary is 3, containing [A, B, C], and k is 2.
  • At the first step, keep the two sequences with the highest probability, assume A and C. At the next step, select 2 words from the dictionary, assume A and B; combine the current two sequences A and C with the 2 selected words to obtain 4 new sequences AA, AB, CA and CB; calculate the probability of each sequence and keep the two sequences with the highest probability, assume AA and CB. Repeat this process until the preset end identifier is encountered, obtain the final 2 sequences, and output the sequence with the highest probability.
  • converting the semantic vector into a plurality of word vectors by using the decoder in the keyword generation model includes: performing a linear transformation on the semantic vector through the multi-layer network of the decoder to obtain a transformation vector; and selecting, from a preset dictionary, the vectors whose distance to the transformation vector is less than a preset distance threshold to obtain the plurality of word vectors.
  • the output probability values of the multiple word vectors are calculated through the fully connected layer and the activation function of the keyword generation model, including:
  • after calculating the probability value of a repeated word, this embodiment of the present application multiplies that probability value by a preset penalty factor, such as 0.1, so that the probability of the repeated word being output is reduced, thereby reducing the repetition of generated keywords and improving their quality.
  • keywords are generated from the text to be processed by using the keyword generation model; the generated keywords are closer to the topic of the text to be processed, the repetition between words is lower, and the coherence is better.
  • FIG. 5 is a schematic structural diagram of an electronic device implementing the keyword generation method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a keyword generation program 12.
  • the memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk of the electronic device 1.
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash memory card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as codes of the keyword generation program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control unit (Control Unit) of the electronic device; it connects the various components of the entire electronic device by using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the keyword generation program) and calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 5 only shows an electronic device with certain components. Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the electronic device 1; the device may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the keyword generation program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs, and when run in the processor 10, it can realize:
  • receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM, Read-Only Memory).
  • the present application also provides a computer-readable storage medium, where the readable storage medium stores a computer program, and when executed by a processor of an electronic device, the computer program can realize:
  • receive a text to be processed, extract the semantic information of the text to be processed by using the encoder in the keyword generation model, and process the semantic information by using an attention mechanism to generate a semantic vector;
  • use the decoder of the keyword generation model to extract keywords from the semantic vector in a beam search manner based on a preset penalty factor, and output the extracted keywords.
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created according to use, and the like.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with each other by cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a keyword generation method and apparatus, an electronic device, and a computer-readable storage medium, which relate to artificial intelligence technology. The method comprises: acquiring text data, and identifying the text data by using a preset identifier to obtain a training data set; performing training by using the training data set to obtain a keyword generation model; receiving a text to be processed, extracting semantic information of said text by using the keyword generation model, and generating a semantic vector by using an attention mechanism; and, on the basis of a preset penalty factor, performing keyword extraction on the semantic vector by using the keyword generation model and a beam search mode, and outputting an extracted keyword. The method further relates to blockchain technology, and the text data may be stored in a blockchain node. By means of the method, the accuracy of keyword generation can be improved, the repetition of generated keywords is reduced, and the coherence between generated keywords is enhanced.
PCT/CN2021/123901 2020-12-21 2021-10-14 Keyword generation method and apparatus, electronic device and computer storage medium WO2022134759A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011519819.0A CN112667800A (zh) 2020-12-21 2020-12-21 关键词生成方法、装置、电子设备及计算机存储介质
CN202011519819.0 2020-12-21

Publications (1)

Publication Number Publication Date
WO2022134759A1 true WO2022134759A1 (fr) 2022-06-30

Family

ID=75406932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123901 WO2022134759A1 (fr) 2020-12-21 2021-10-14 Keyword generation method and apparatus, electronic device and computer storage medium

Country Status (2)

Country Link
CN (1) CN112667800A (fr)
WO (1) WO2022134759A1 (fr)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329751A (zh) * 2022-10-17 2022-11-11 广州数说故事信息科技有限公司 针对网络平台发文的关键词提取方法、装置、介质及设备
CN115470322A (zh) * 2022-10-21 2022-12-13 深圳市快云科技有限公司 一种基于人工智能的关键词生成系统及方法
CN115794999A (zh) * 2023-02-01 2023-03-14 北京知呱呱科技服务有限公司 一种基于扩散模型的专利文档查询方法及计算机设备
CN115809665A (zh) * 2022-12-13 2023-03-17 杭州电子科技大学 一种基于双向多粒度注意力机制的无监督关键词抽取方法
CN115859964A (zh) * 2022-11-24 2023-03-28 安徽冠成教育科技有限公司 基于教育云平台的教育资源共享方法及系统
CN115880036A (zh) * 2023-02-23 2023-03-31 山东金潮交通设施有限公司 一种车位级动态共享智能管控交易平台
CN115910047A (zh) * 2023-01-06 2023-04-04 阿里巴巴达摩院(杭州)科技有限公司 数据处理方法、模型训练方法、关键词检测方法及设备
CN116070641A (zh) * 2023-03-13 2023-05-05 北京点聚信息技术有限公司 一种电子合同的在线解读方法
CN116796754A (zh) * 2023-04-20 2023-09-22 浙江浙里信征信有限公司 基于时变上下文语义序列成对比较的可视分析方法及系统
CN116866054A (zh) * 2023-07-25 2023-10-10 安徽百方云科技有限公司 公共信息安全监测系统及其方法
CN117011435A (zh) * 2023-09-28 2023-11-07 世优(北京)科技有限公司 数字人形象ai生成方法及装置
CN117235121A (zh) * 2023-11-15 2023-12-15 华北电力大学 一种能源大数据查询方法和系统
CN117558392A (zh) * 2024-01-12 2024-02-13 富纳德科技(北京)有限公司 一种电子病历共享协作方法与系统
CN117743869A (zh) * 2024-02-18 2024-03-22 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种内容发现方法、系统、终端及存储介质
CN117891531A (zh) * 2024-03-14 2024-04-16 蒲惠智造科技股份有限公司 用于saas软件的系统参数配置方法、系统、介质及电子设备

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667800A (zh) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 关键词生成方法、装置、电子设备及计算机存储介质
CN113204965A (zh) * 2021-05-31 2021-08-03 平安科技(深圳)有限公司 关键词提取方法、装置、计算机设备及可读存储介质
CN113591917B (zh) * 2021-06-29 2024-04-09 深圳市捷顺科技实业股份有限公司 一种数据增强的方法及装置
CN113723102B (zh) * 2021-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 命名实体识别方法、装置、电子设备及存储介质
CN113723058B (zh) * 2021-11-02 2022-03-08 深圳市北科瑞讯信息技术有限公司 文本摘要与关键词抽取方法、装置、设备及介质
CN114399775A (zh) * 2022-01-21 2022-04-26 平安科技(深圳)有限公司 文档标题生成方法、装置、设备及存储介质
CN114492669A (zh) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 关键词推荐模型训练方法、推荐方法和装置、设备、介质
CN114547266B (zh) * 2022-02-21 2023-06-30 北京百度网讯科技有限公司 信息生成模型的训练方法、生成信息的方法、装置和设备
CN114818685B (zh) * 2022-04-21 2023-06-20 平安科技(深圳)有限公司 关键词提取方法、装置、电子设备及存储介质
CN114757154B (zh) * 2022-06-13 2022-09-30 深圳市承儒科技有限公司 基于深度学习的作业生成方法、装置、设备及存储介质
CN116029291B (zh) * 2023-03-29 2023-07-11 摩尔线程智能科技(北京)有限责任公司 关键词识别方法、装置、电子设备和存储介质
CN116189193B (zh) * 2023-04-25 2023-11-10 杭州镭湖科技有限公司 一种基于样本信息的数据存储可视化方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110330A (zh) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 基于文本的关键词提取方法和计算机设备
CN110119765A (zh) * 2019-04-18 2019-08-13 浙江工业大学 一种基于Seq2seq框架的关键词提取方法
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111539211A (zh) * 2020-04-17 2020-08-14 中移(杭州)信息技术有限公司 实体及语义关系识别方法、装置、电子设备及存储介质
CN112667800A (zh) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 关键词生成方法、装置、电子设备及计算机存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119765A (zh) * 2019-04-18 2019-08-13 浙江工业大学 一种基于Seq2seq框架的关键词提取方法
CN110110330A (zh) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 基于文本的关键词提取方法和计算机设备
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
CN111539211A (zh) * 2020-04-17 2020-08-14 中移(杭州)信息技术有限公司 实体及语义关系识别方法、装置、电子设备及存储介质
CN112667800A (zh) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 关键词生成方法、装置、电子设备及计算机存储介质

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329751A (zh) * 2022-10-17 2022-11-11 广州数说故事信息科技有限公司 针对网络平台发文的关键词提取方法、装置、介质及设备
CN115470322A (zh) * 2022-10-21 2022-12-13 深圳市快云科技有限公司 一种基于人工智能的关键词生成系统及方法
CN115470322B (zh) * 2022-10-21 2023-05-05 深圳市快云科技有限公司 一种基于人工智能的关键词生成系统及方法
CN115859964A (zh) * 2022-11-24 2023-03-28 安徽冠成教育科技有限公司 基于教育云平台的教育资源共享方法及系统
CN115809665A (zh) * 2022-12-13 2023-03-17 杭州电子科技大学 一种基于双向多粒度注意力机制的无监督关键词抽取方法
CN115809665B (zh) * 2022-12-13 2023-07-11 杭州电子科技大学 一种基于双向多粒度注意力机制的无监督关键词抽取方法
CN115910047A (zh) * 2023-01-06 2023-04-04 阿里巴巴达摩院(杭州)科技有限公司 数据处理方法、模型训练方法、关键词检测方法及设备
CN115910047B (zh) * 2023-01-06 2023-05-19 阿里巴巴达摩院(杭州)科技有限公司 数据处理方法、模型训练方法、关键词检测方法及设备
CN115794999A (zh) * 2023-02-01 2023-03-14 北京知呱呱科技服务有限公司 一种基于扩散模型的专利文档查询方法及计算机设备
CN115880036A (zh) * 2023-02-23 2023-03-31 山东金潮交通设施有限公司 一种车位级动态共享智能管控交易平台
CN116070641B (zh) * 2023-03-13 2023-06-06 北京点聚信息技术有限公司 一种电子合同的在线解读方法
CN116070641A (zh) * 2023-03-13 2023-05-05 北京点聚信息技术有限公司 一种电子合同的在线解读方法
CN116796754A (zh) * 2023-04-20 2023-09-22 浙江浙里信征信有限公司 基于时变上下文语义序列成对比较的可视分析方法及系统
CN116866054A (zh) * 2023-07-25 2023-10-10 安徽百方云科技有限公司 公共信息安全监测系统及其方法
CN117011435A (zh) * 2023-09-28 2023-11-07 世优(北京)科技有限公司 数字人形象ai生成方法及装置
CN117011435B (zh) * 2023-09-28 2024-01-09 世优(北京)科技有限公司 数字人形象ai生成方法及装置
CN117235121A (zh) * 2023-11-15 2023-12-15 华北电力大学 一种能源大数据查询方法和系统
CN117235121B (zh) * 2023-11-15 2024-02-20 华北电力大学 一种能源大数据查询方法和系统
CN117558392A (zh) * 2024-01-12 2024-02-13 富纳德科技(北京)有限公司 一种电子病历共享协作方法与系统
CN117558392B (zh) * 2024-01-12 2024-04-05 富纳德科技(北京)有限公司 一种电子病历共享协作方法与系统
CN117743869A (zh) * 2024-02-18 2024-03-22 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种内容发现方法、系统、终端及存储介质
CN117743869B (zh) * 2024-02-18 2024-05-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 一种内容发现方法、系统、终端及存储介质
CN117891531A (zh) * 2024-03-14 2024-04-16 蒲惠智造科技股份有限公司 用于saas软件的系统参数配置方法、系统、介质及电子设备

Also Published As

Publication number Publication date
CN112667800A (zh) 2021-04-16

Similar Documents

Publication Publication Date Title
WO2022134759A1 (fr) Keyword generation method and apparatus, electronic device and computer storage medium
WO2022142593A1 (fr) Text classification method and apparatus, electronic device and readable storage medium
WO2022141861A1 (fr) Emotion classification method and apparatus, electronic device and storage medium
WO2022121171A1 (fr) Similar text matching method and apparatus, electronic device and computer storage medium
WO2021135469A1 (fr) Machine learning-based information extraction method and apparatus, computer device and medium
CN111460797B (zh) 关键字抽取方法、装置、电子设备及可读存储介质
CN111858843B (zh) 一种文本分类方法及装置
CN112380343A (zh) 问题解析方法、装置、电子设备及存储介质
CN113157927B (zh) 文本分类方法、装置、电子设备及可读存储介质
CN113378970B (zh) 语句相似性检测方法、装置、电子设备及存储介质
WO2022134355A1 (fr) Keyword-prompt-based search method and apparatus, electronic device and storage medium
CN114077841A (zh) 基于人工智能的语义提取方法、装置、电子设备及介质
CN113986950A (zh) 一种sql语句处理方法、装置、设备及存储介质
CN112464642A (zh) 文本添加标点的方法、装置、介质及电子设备
CN113360654B (zh) 文本分类方法、装置、电子设备及可读存储介质
CN113344125B (zh) 长文本匹配识别方法、装置、电子设备及存储介质
CN115409041B (zh) 一种非结构化数据提取方法、装置、设备及存储介质
CN116468025A (zh) 电子病历结构化方法、装置、电子设备及存储介质
CN116450829A (zh) 医疗文本分类方法、装置、设备及介质
CN116821373A (zh) 基于图谱的prompt推荐方法、装置、设备及介质
CN116542246A (zh) 基于关键词质检文本的方法、装置和电子设备
WO2022141867A1 (fr) Speech recognition method and apparatus, electronic device and readable storage medium
WO2022142019A1 (fr) Intelligent robot-based question distribution method and apparatus, electronic device and storage medium
CN114220505A (zh) 病历数据的信息抽取方法、终端设备及可读存储介质
CN111680513B (zh) 特征信息的识别方法、装置及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908760

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 280923)

122 Ep: pct application non-entry in european phase

Ref document number: 21908760

Country of ref document: EP

Kind code of ref document: A1