CN117669561A - Unsupervised keyword extraction method, system, equipment and medium - Google Patents


Info

Publication number
CN117669561A
CN117669561A
Authority
CN
China
Prior art keywords
mask, document, cls, vector, original document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311628915.2A
Other languages
Chinese (zh)
Inventor
李嘉豪
戴宪华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202311628915.2A
Publication of CN117669561A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present invention relates to the field of keyword extraction technology, and in particular to an unsupervised keyword extraction method, system, device, and medium. An original document is preprocessed to obtain a plurality of candidate keywords; a masking operation is performed on the original document for each candidate keyword, yielding a mask document corresponding to each candidate keyword; the original document and each mask document are input into a pre-trained language characterization model to obtain an original-document cls vector and a mask-document cls vector for each candidate keyword; a first cosine similarity between the original-document cls vector and each candidate keyword's mask-document cls vector, and a second cosine similarity between each candidate keyword's mask-document cls vector and the remaining mask-document cls vectors, are calculated; the first and second cosine similarities are weighted to obtain a total similarity, according to which target candidate keywords are screened. The accuracy and diversity of keyword extraction can thereby be improved.

Description

Unsupervised keyword extraction method, system, equipment and medium
Technical Field
The present invention relates to the field of keyword extraction technology, and in particular, to an unsupervised keyword extraction method, system, device, and medium.
Background
Keyword extraction currently falls into two categories: supervised and unsupervised. In practical engineering, document data is easy to obtain while labeled data is difficult to obtain, so unsupervised keyword extraction is more widely used. Traditional unsupervised keyword extraction focuses only on low-level features such as word frequency, position, and part of speech, without using semantics; yet semantics is the decisive factor in keyword extraction, so the accuracy of traditional methods is low. Word-embedding extraction techniques based on pre-trained models greatly improve accuracy over traditional methods, but words with similar semantics may be extracted repeatedly, so these techniques lack diversity. Most embedding methods compute similarity between word embeddings and document embeddings, but a word is usually much shorter than a document, and a single word can hardly represent a whole document, so such a calculation loses much information. Moreover, when a pre-trained language model is used to obtain embeddings, only the output of the last layer is used; the information in the intermediate layers is not utilized and is therefore lost.
Disclosure of Invention
The invention aims to solve the problems of low keyword extraction accuracy and lack of diversity in the prior art.
In order to achieve the above object, the present invention provides an unsupervised keyword extraction method, which is characterized in that the method includes:
preprocessing an original document to obtain a plurality of candidate keywords;
masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained;
inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and screening target candidate keywords according to the total similarity.
Further, the preprocessing the original document to obtain a plurality of candidate keywords includes:
and performing word segmentation, part-of-speech tagging and stop word removal on the original document through a jieba tool.
Further, the pre-trained language characterization model is an ALBERT model.
Further, the original document cls vector is represented by the following formula:

cls = Σ_{i=1}^{n} h_i · cls_i

where h_i is a trainable parameter representing the weight of the output of the i-th layer, and cls_i represents the [cls] embedding of the i-th layer;

the cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:

cls'_j = Σ_{i=1}^{n} h_i · cls'_{j,i}

where cls'_{j,i} represents the [cls] embedding of the i-th layer for the j-th mask document;

the first cosine similarity is calculated using the following formula:

sim_i = (cls · cls'_i) / (‖cls‖ ‖cls'_i‖)

where sim_i represents the cosine similarity between the i-th mask document vector and the original document vector;

the second cosine similarity is calculated using the following formula:

sim'_{i,k} = (cls'_i · cls'_k) / (‖cls'_i‖ ‖cls'_k‖), k ≠ i

where sim'_{i,k} represents the cosine similarity between the i-th and k-th mask document vectors.
Further, the weighting the first cosine similarity and the second cosine similarity to obtain a total similarity includes:
summing cosine similarities of all mask document vectors;
setting a weighting coefficient of the first cosine similarity and the second cosine similarity;
the overall similarity is calculated according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
Further, the screening the target candidate keywords according to the total similarity includes:
and sequentially screening a predetermined number of target candidate keywords according to the sequence from small to large of the total similarity.
The invention provides an unsupervised keyword extraction system, which is characterized by comprising:
the preprocessing module is used for preprocessing the original document to obtain a plurality of candidate keywords;
the mask operation module is used for performing mask operation on the original document according to the plurality of candidate keywords to obtain mask documents corresponding to each candidate keyword;
the vector acquisition module is used for inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
the computing module is used for respectively computing the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
the weighting module is used for weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and the screening module is used for screening target candidate keywords according to the total similarity.
Another embodiment of the present invention also proposes a computer-readable storage medium including a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium resides to perform the unsupervised keyword extraction method as described above.
Another embodiment of the present invention also proposes a terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the unsupervised keyword extraction method as described above when executing the computer program.
According to the method, the system, the equipment and the medium for extracting the unsupervised keywords, which are disclosed by the embodiment of the invention, the original document is preprocessed to obtain a plurality of candidate keywords; masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained; inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword; respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents; weighting the first cosine similarity and the second cosine similarity to obtain total similarity; the accuracy and the diversity of keyword extraction can be improved.
Drawings
FIG. 1 is a flowchart of an unsupervised keyword extraction method provided by an embodiment of the present invention;
FIG. 2 is a block diagram of an unsupervised keyword extraction system according to an embodiment of the present invention;
fig. 3 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
It should be noted that, the step numbers herein are only for convenience of explanation of the specific embodiments, and are not used as limiting the order of execution of the steps. The method provided in this embodiment may be executed by a relevant server, and the following description will take the server as an execution body as an example.
As shown in fig. 1, an unsupervised keyword extraction method according to a preferred embodiment of the present invention includes steps S1 to S6:
step S1, preprocessing an original document to obtain a plurality of candidate keywords;
according to the embodiment of the invention, the jieba tool is utilized to segment the original document, label the part of speech and remove the stop word, so that the candidate word is obtained. The stopping word set uses a Chinese stopping word set built in jieba, words marked as n, nr, ns, nt, nw, nz, vn are extracted, and the words respectively represent common nouns, person names, place names, organization names, work names, other proper nouns and proper nouns to form a candidate word set A. The word segmentation, part-of-speech tagging and stop word removal operations performed on the original document in this embodiment are not limited to the use of jieba tools, which are used to explain the process of obtaining candidate keywords in this embodiment, so that other word segmentation tools may be selected to perform the above preprocessing operations on the original document.
S2, masking operation is carried out on the original document according to the candidate keywords, and masking documents corresponding to the candidate keywords are obtained;
for the candidate word set A obtained in the above step, the embodiment of the invention performs the following steps on each candidate word A in the candidate word set A n And (n candidate words) performing masking operation, namely mask operation, shielding one candidate keyword at a time, and sequentially obtaining mask documents corresponding to each candidate keyword. Through the step, the embodiment of the invention can keep the lengths of the document after mask and the original document consistent, so that the information quantity difference only exists whether the document has a shielded screen or notThe candidate keywords are masked.
If the similarity between a mask document and the original document is high, the candidate keyword removed by the mask has little influence on the document, i.e., its keyword degree is low; if the similarity is low, the masked-out candidate keyword has a great influence on the document, i.e., its keyword degree is high. Therefore, after obtaining the mask documents, the embodiment of the present invention proceeds to compare the similarity between the original document and the mask documents, i.e., steps S3 and S4.
S3, inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
in this embodiment, an ALBERT model (a lightweight BERT model) is selected to perform training and inference on the original document and the mask documents corresponding to each candidate keyword. The original document and the mask documents are input into the ALBERT model to obtain the final document vector representations, denoted cls and cls' respectively,
where cls = (cls_1, cls_2, ..., cls_n) and cls' = (cls'_1, cls'_2, ..., cls'_n).
By training the documents with the ALBERT model, this embodiment can make full use of the information in the intermediate layers: the outputs of the intermediate layers are utilized, and the final embedded representation of the document is obtained by a weighted summation of each layer's output. Compared with the existing approach of using only the last layer of the pre-trained model as output, no intermediate-layer information is lost, and the accuracy of training and inference is greatly improved.
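The weighted summation over layer outputs described above can be sketched as follows; the random layer activations stand in for real ALBERT [cls] outputs, and the uniform initialization of the trainable weights h_i is an assumption made for illustration:

```python
# Sketch of step S3's embedding: the final cls vector is a trainable weighted
# sum of the [cls] output of every layer, not just the last one.
import numpy as np

def weighted_cls(layer_cls, h):
    """Combine per-layer [cls] embeddings with layer weights h (h_i trainable)."""
    layer_cls = np.asarray(layer_cls)      # shape: (n_layers, hidden_dim)
    h = np.asarray(h).reshape(-1, 1)       # shape: (n_layers, 1)
    return (h * layer_cls).sum(axis=0)     # shape: (hidden_dim,)

rng = np.random.default_rng(0)
layers = rng.normal(size=(12, 4))          # 12 layers, toy hidden size 4
weights = np.full(12, 1 / 12)              # uniform init; learned in practice
print(weighted_cls(layers, weights))
```

With uniform weights the result equals the mean over layers; training would sharpen the weights toward the most informative layers.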
Step S4, respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vector of the rest mask documents;
the original document cls vector is expressed by the following formula:
wherein h is i As a trainable parameter, representing the weight output by the ith layer; cls i Cls representing the i-th layer]Embedding the representation; n is the number of layers of the Albert model, and n=12 is preferred in this embodiment.
The cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:
the first cosine similarity is calculated using the following formula:
wherein sim is i Representing cosine similarity of the ith mask document vector and the original document vector;
the second cosine similarity is calculated using the following formula:
wherein,representing the cosine similarity of the ith and kth mask-document vectors.
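The two cosine similarities of step S4 may be sketched as follows; the toy vectors are illustrative stand-ins for the cls vectors the model would produce:

```python
# Sketch of step S4: first similarity (original vs. each mask document)
# and second similarity (each mask document vs. the remaining mask documents).
import numpy as np

def cosine(a, b):
    """Standard cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cls = np.array([1.0, 0.0, 1.0])                               # original-document vector (toy)
mask_cls = [np.array([1.0, 0.0, 0.9]), np.array([0.0, 1.0, 0.1])]

first = [cosine(cls, m) for m in mask_cls]                    # sim_i
second = [[cosine(mask_cls[i], mask_cls[k])
           for k in range(len(mask_cls)) if k != i]
          for i in range(len(mask_cls))]                      # sim'_{i,k}, k != i
print(first, second)
```

Here the first candidate barely changes the document when masked (high sim), so it would rank as less of a keyword than the second.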
This calculation is intended to make the selected keywords as diverse as possible, i.e., to have larger semantic differences between them.
Step S5, weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
specifically, this embodiment sums the cosine similarities of all mask document vectors, sets a weighting coefficient for the first cosine similarity and the second cosine similarity, and calculates the total similarity according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
In this embodiment, the weighting coefficient λ is an adjustable parameter. When λ is smaller than 0.5, the similarity calculation focuses more on the diversity of the extracted candidate keywords: the extracted vocabulary may be more varied, but its relevance to the original document may be weakened. When λ is larger than 0.5, the calculation focuses more on the degree of association between the extracted candidate keywords and the original document: that association is strengthened, but diversity may be impaired. λ is adjusted according to the actual situation: set λ below 0.5 when diversity matters more, and above 0.5 when relevance to the original document matters more.
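The weighting of step S5 can be sketched as below. The exact combination formula is not reproduced in the extracted text, so the form used here, a λ-weighted sum of the first similarity and the summed second similarities, is an assumption consistent with the description:

```python
# Sketch of step S5: combine relevance (first similarity) and diversity
# (summed second similarities) with weighting coefficient lambda in [0, 1].
# The combination form is an assumption reconstructed from the description.
import numpy as np

def total_similarity(first_sims, second_sims, lam=0.5):
    """SIM_i = lam * sim_i + (1 - lam) * sum_k sim'_{i,k} (assumed form)."""
    first = np.asarray(first_sims)
    diversity = np.array([sum(row) for row in second_sims])
    return lam * first + (1 - lam) * diversity

first = [0.95, 0.40, 0.90]                      # sim_i for three candidates
second = [[0.8, 0.7], [0.8, 0.6], [0.7, 0.6]]   # sim' rows, entries with k != i
print(total_similarity(first, second, lam=0.7))
```

With lam = 1 only relevance to the original document counts; lowering lam lets the pairwise mask-document similarities (the diversity term) dominate.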
And S6, screening target candidate keywords according to the total similarity.
The smaller the SIM value, the more important the keyword; a preset number of target candidate keywords are screened in ascending order of the total similarity.
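Step S6 (ascending-order screening) may be sketched as follows, with illustrative candidate words and SIM values:

```python
# Sketch of step S6: keywords with the smallest total similarity are the
# most important, so take the first `top_k` candidates in ascending SIM order.
import numpy as np

def screen_keywords(candidates, total_sims, top_k):
    order = np.argsort(total_sims)             # ascending: smallest SIM first
    return [candidates[i] for i in order[:top_k]]

cands = ["模型", "方法", "关键词"]
sims = [0.92, 0.35, 0.60]
print(screen_keywords(cands, sims, top_k=2))   # → ['方法', '关键词']
```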
In summary, this embodiment uses the jieba tool to process the original document and obtain candidate keywords, then performs the mask operation on the document and inputs the mask document corresponding to each candidate keyword, together with the original document, into the ALBERT model, using the outputs of the ALBERT intermediate layers to obtain the final vector representations and thereby improve accuracy. The final similarity is obtained from the cosine similarity between the original document vector and each candidate keyword's mask document vector, and the cosine similarity between each candidate keyword's mask document vector and the remaining mask document vectors; keywords are screened according to this final similarity, so that the extracted keywords are more diverse and of high importance.
As shown in fig. 2, the embodiment of the present invention further provides an unsupervised keyword extraction system, configured to perform an unsupervised keyword extraction method as described above, where the system includes:
a preprocessing module 21, configured to preprocess an original document to obtain a plurality of candidate keywords;
the mask operation module 22 is configured to perform a mask operation on the original document according to the plurality of candidate keywords, so as to obtain a mask document corresponding to each candidate keyword;
the vector obtaining module 23 is configured to input the original document and each of the mask documents into a pre-trained language representation model, so as to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
a calculating module 24, configured to calculate a first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword, and a second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the remaining mask documents, respectively;
a weighting module 25, configured to weight the first cosine similarity and the second cosine similarity to obtain a total similarity;
and a screening module 26, configured to screen the target candidate keywords according to the total similarity.
The technical features and technical effects of the unsupervised keyword extraction system provided by the embodiment of the present invention are the same as those of the unsupervised keyword extraction method provided by the embodiment of the present invention, and are not repeated here. The modules in the above-described unsupervised keyword extraction system may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium resides to perform an unsupervised keyword extraction method as described above.
As shown in fig. 3, the embodiment of the present invention further provides a computer device, and fig. 3 is a block diagram of a preferred embodiment of the computer device provided by the present invention, where the computer device includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements an unsupervised keyword extraction method as described above when executing the computer program.
Preferably, the computer program may be divided into one or more modules/units (e.g. computer program 1, computer program 2, … …) stored in the memory and executed by the processor to complete the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device.
The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.; the general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and connects the various parts of the terminal device using various interfaces and lines.
The memory mainly includes a program storage area, which may store an operating system, an application program required for at least one function, and the like, and a data storage area, which may store related data and the like. In addition, the memory may be a high-speed random access memory, a nonvolatile memory such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), or the like, or may be other volatile solid-state memory devices.
It should be noted that the above-mentioned terminal device may include, but is not limited to, a processor, a memory, and those skilled in the art will understand that the structural block diagram of fig. 3 is merely an example of the terminal device, and does not constitute limitation of the terminal device, and may include more or less components than those illustrated, or may combine some components, or different components.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and substitutions will now occur to those skilled in the art without departing from the spirit of the present invention, and these modifications and substitutions should also be considered to be within the scope of the present invention.

Claims (9)

1. An unsupervised keyword extraction method, comprising:
preprocessing an original document to obtain a plurality of candidate keywords;
masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained;
inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and screening target candidate keywords according to the total similarity.
2. The method for extracting unsupervised keywords of claim 1, wherein the preprocessing the original document to obtain a plurality of candidate keywords comprises:
and performing word segmentation, part-of-speech tagging and stop word removal on the original document through a jieba tool.
3. The method for extracting an unsupervised keyword according to claim 1, wherein
the pre-trained language characterization model is an ALBERT model.
4. The method for extracting unsupervised keywords according to claim 1, wherein the cls vector of the original document is represented by the following formula:

cls = Σ_{i=1}^{n} h_i · cls_i

where h_i is a trainable parameter representing the weight of the output of the i-th layer, and cls_i represents the [cls] embedding of the i-th layer;

the cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:

cls'_j = Σ_{i=1}^{n} h_i · cls'_{j,i}

where cls'_{j,i} represents the [cls] embedding of the i-th layer for the j-th mask document;

the first cosine similarity is calculated using the following formula:

sim_i = (cls · cls'_i) / (‖cls‖ ‖cls'_i‖)

where sim_i represents the cosine similarity between the i-th mask document vector and the original document vector;

the second cosine similarity is calculated using the following formula:

sim'_{i,k} = (cls'_i · cls'_k) / (‖cls'_i‖ ‖cls'_k‖), k ≠ i

where sim'_{i,k} represents the cosine similarity between the i-th and k-th mask document vectors.
5. The method for extracting an unsupervised keyword according to claim 4, wherein weighting the first cosine similarity and the second cosine similarity to obtain a total similarity comprises:
summing cosine similarities of all mask document vectors;
setting a weighting coefficient of the first cosine similarity and the second cosine similarity;
the overall similarity is calculated according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
6. The method for extracting an unsupervised keyword according to claim 1, wherein the screening the target candidate keywords according to the total similarity comprises:
and sequentially screening a predetermined number of target candidate keywords according to the sequence from small to large of the total similarity.
7. An unsupervised keyword extraction system, the system comprising:
the preprocessing module is used for preprocessing the original document to obtain a plurality of candidate keywords;
the mask operation module is used for performing mask operation on the original document according to the plurality of candidate keywords to obtain mask documents corresponding to each candidate keyword;
the vector acquisition module is used for inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
the computing module is used for respectively computing the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
the weighting module is used for weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and the screening module is used for screening target candidate keywords according to the total similarity.
8. A computer device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the unsupervised keyword extraction method of any one of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, wherein the computer readable storage medium comprises a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium is located to perform the unsupervised keyword extraction method according to any one of claims 1 to 6.
CN202311628915.2A 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium Pending CN117669561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311628915.2A CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311628915.2A CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN117669561A true CN117669561A (en) 2024-03-08

Family

ID=90070770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311628915.2A Pending CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN117669561A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination