CN116450829A - Medical text classification method, device, equipment and medium - Google Patents

Info

Publication number
CN116450829A
Authority
CN
China
Prior art keywords
medical text
classified
word
vector
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310500108.6A
Other languages
Chinese (zh)
Inventor
刘羲
田巍
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310500108.6A
Publication of CN116450829A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Abstract

The invention relates to an artificial intelligence technology, and discloses a medical text classification method, which comprises the following steps: acquiring a medical text to be classified, and performing word segmentation on the medical text to be classified to obtain a medical text character sequence to be classified; encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector; performing vector embedding on the medical text character sequence to be classified by utilizing the pre-trained word embedding table to obtain a second word embedding vector; splicing the first word embedding vector and the second word embedding vector to obtain a fusion word embedding vector; extracting medical text semantic features of the fusion word embedded vector by using a feature extraction model; and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified. The invention also provides a medical text classification device, electronic equipment and a storage medium. The invention can improve the accuracy of medical text classification.

Description

Medical text classification method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a medical text classification method, apparatus, electronic device, and computer readable storage medium.
Background
Text classification is an important branch of natural language processing that aims to assign unstructured text to preset categories. It is widely applied in fields such as medical classification, emotion recognition and spam filtering.
Current medical text classification mainly relies on traditional machine learning algorithms and deep learning algorithms. Traditional machine learning algorithms include K-Nearest Neighbor (KNN), Naive Bayes (NB) and Support Vector Machine (SVM); they cannot produce good semantic representations of medical professional terms, so their medical text classification accuracy is low. Text classification methods based on deep learning can extract more text features through a neural network and classify automatically, but they neglect the fusion of external related information, so the semantic representation of medical professional terms remains insufficient and the classification accuracy remains low.
Disclosure of Invention
The invention provides a medical text classification method, a medical text classification device and a computer readable storage medium, which mainly aim to solve the problem of low accuracy of medical text classification.
In order to achieve the above object, the present invention provides a medical text classification method, including:
acquiring a medical text to be classified, and performing word segmentation on the medical text to be classified to obtain a medical text character sequence to be classified;
encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector;
performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector;
splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
Optionally, the performing vector embedding on the medical text character sequence to be classified by using a pre-trained word embedding table to obtain a second word embedding vector, including:
traversing the medical text character sequence to be classified by using a preset window to obtain the times of each vocabulary appearing in the same window, and constructing a co-occurrence matrix according to the times of each vocabulary appearing in the same window;
according to the co-occurrence matrix, calculating the probability that any character appears in the context of another character to obtain the co-occurrence probability;
and according to the co-occurrence probability, mapping the medical text character sequence to be classified into a vector space by utilizing a pre-trained word embedding table to obtain a second word embedding vector.
Optionally, according to the co-occurrence probability, mapping the medical text character sequence to be classified to a vector space by using a pre-trained word embedding table to obtain a second word embedding vector, including:
taking the co-occurrence probability as the character weight of the corresponding character;
matching the medical text character sequence to be classified with a pre-trained word embedding table to obtain a target embedding vector;
and weighting the target embedded vector according to the character weight to obtain a second word embedded vector.
Optionally, the encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector includes:
inserting a separator in front of the medical text character sequence to be classified and adding a segmenter behind the last character of each sentence corresponding to the medical text character sequence to be classified to obtain a target medical text character sequence to be classified;
coding the target medical text character sequence to be classified by using a preset coding model to obtain a word embedding vector;
coding the position information of the target medical text character sequence to be classified by using the coding model to obtain a position embedded vector;
and integrating the word embedding vector and the position embedding vector to obtain a first word embedding vector.
Optionally, the extracting the medical text semantic feature of the fusion word embedded vector by using a preset feature extraction model includes:
extracting the features of the fusion word embedded vector by using a forward network in a preset feature extraction model to obtain forward medical text semantic features;
performing reverse sequence processing on the fusion word embedded vector to obtain a reverse fusion word embedded vector;
extracting the characteristics of the reverse fusion word embedded vector by utilizing a backward network in the characteristic extraction model to obtain backward medical text semantic characteristics;
extracting the context information characteristics of the fusion word embedded vector by using a self-attention mechanism module in the characteristic extraction model;
and splicing the forward medical text semantic features, the backward medical text semantic features and the contextual information features to obtain the medical text semantic features of the fusion word embedded vector.
Optionally, the classifying the semantic features of the medical text by using a preset classifier to obtain the category of the medical text to be classified includes:
calculating the score value of each category of the medical text semantic features;
mapping the score value into a score probability value by using a sigmoid function;
and taking the category with the largest scoring probability value as the category of the medical text to be classified.
Optionally, the word segmentation processing is performed on the medical text to be classified to obtain a character sequence of the medical text to be classified, including:
performing word segmentation on the medical texts to be classified by using N preset word segmenters to obtain N initial medical text character sequences to be classified, wherein N is a natural number greater than or equal to 2;
and selecting the medical text character sequence to be classified from the N initial medical text character sequences to be classified by utilizing a bidding mechanism.
In order to solve the above problems, the present invention also provides a medical text classification apparatus, the apparatus comprising:
the word segmentation module is used for obtaining medical texts to be classified, and carrying out word segmentation on the medical texts to be classified to obtain a medical text character sequence to be classified;
the vector conversion module is used for coding the medical text character sequence to be classified by utilizing a preset coding model to obtain a first word embedded vector; performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector; splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
the feature extraction module is used for extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and the classification module is used for classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the medical text classification method described above.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the above-mentioned medical text classification method.
According to the embodiment of the invention, the medical text to be classified is obtained, word segmentation is carried out on the medical text to be classified to obtain the medical text character sequence to be classified, the preset coding model is utilized to code the medical text character sequence to be classified to obtain the first word embedded vector, and the global information of the text is extracted, so that more accurate acquisition of semantic information is facilitated; further, the pre-trained word embedding table is utilized to carry out vector embedding on the medical text character sequence to be classified, so that a second word embedding vector is obtained, the medical text to be classified is fused with external information, the semantics of medical professional terms can be better expressed, and the accuracy rate of medical text classification can be improved; the first word embedded vector and the second word embedded vector are spliced to obtain a fusion word embedded vector, global information and external information are fully fused, and therefore semantic expression of medical professional terms is more accurate; finally, extracting the medical text semantic features of the fusion word embedded vector by using a preset feature extraction model; and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified, fully considering external information and semantic information of the texts, and improving the accuracy of medical text classification. Therefore, the medical text classification method, the medical text classification device, the electronic equipment and the computer readable storage medium can solve the problem of low accuracy of medical text classification.
Drawings
FIG. 1 is a flow chart of a medical text classification method according to an embodiment of the invention;
FIG. 2 is a detailed flow chart of one of the steps in the medical text classification method shown in FIG. 1;
FIG. 3 is a detailed flow chart of another step in the medical text classification method shown in FIG. 1;
FIG. 4 is a functional block diagram of a medical text classification apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device for implementing the medical text classification method according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a medical text classification method. The execution subject of the medical text classification method includes, but is not limited to, at least one of a server, a terminal, and other electronic devices that can be configured to execute the method provided by the embodiment of the application. In other words, the medical text classification method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Referring to FIG. 1, a flow chart of a medical text classification method according to an embodiment of the invention is shown. In this embodiment, the medical text classification method includes:
S1, acquiring medical texts to be classified, and performing word segmentation on the medical texts to be classified to obtain a medical text character sequence to be classified.
In the embodiment of the present invention, the medical text to be classified is a medical text containing professional terms, for example, doctor-patient inquiry data of a hospital.
In detail, performing word segmentation on the medical text to be classified in S1 to obtain the medical text character sequence to be classified includes:
performing word segmentation on the medical texts to be classified by using N preset word segmenters to obtain N initial medical text character sequences to be classified, wherein N is a natural number greater than or equal to 2;
and selecting the medical text character sequence to be classified from the N initial medical text character sequences to be classified by utilizing a bidding mechanism.
In the embodiment of the invention, commonly used word segmentation tools include the jieba word segmenter, the THU Lexical Analyzer for Chinese (THULAC), the multi-domain Chinese word segmenter (pkuseg), the ZPar word segmenter and the like.
Specifically, in one embodiment of the invention, the medical text to be classified is cleaned according to a preset text restriction rule to obtain a clean text; three word segmenters are selected to segment the clean text respectively, obtaining a first, a second and a third initial medical text character sequence to be classified; and a voting mechanism is used to select one of the three initial sequences as the medical text character sequence to be classified.
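The selection among the N candidate segmentations can be sketched as follows. The source does not define the voting rule, so the rule used here is an assumption: every token span earns one vote per segmenter that produced it, and the candidate with the highest average votes per token is chosen. The candidate lists are hypothetical segmenter outputs, not results from jieba, THULAC or pkuseg.

```python
from collections import Counter


def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def select_by_vote(candidates):
    """Pick the candidate whose token spans are most widely shared.

    Assumed rule (not from the source): each span gets one vote per
    segmenter that produced it; the candidate with the highest average
    votes per token wins, and ties go to the earliest candidate.
    Candidates are assumed to be non-empty token lists.
    """
    if len(candidates) < 2:
        raise ValueError("the method requires N >= 2 segmenters")
    span_sets = [spans(c) for c in candidates]
    votes = Counter(s for ss in span_sets for s in ss)

    def score(ss):
        return sum(votes[s] for s in ss) / len(ss)

    best = max(range(len(candidates)), key=lambda i: score(span_sets[i]))
    return candidates[best]


candidates = [
    ["上", "呼吸道", "感染"],  # hypothetical output of segmenter A
    ["上呼吸道", "感染"],      # hypothetical output of segmenter B
    ["上", "呼吸道", "感染"],  # hypothetical output of segmenter C
]
chosen = select_by_vote(candidates)
```

With three segmenters, two of which agree, the shared segmentation is selected; a production system might instead vote on boundaries and rebuild a new sequence, which the source leaves open.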
S2, encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector.
In the embodiment of the invention, the preset coding model may be a BERT (Bidirectional Encoder Representations from Transformers) model, a large-scale pre-trained language model based on the bidirectional Transformer. It has strong language representation and feature extraction capabilities and can extract the features of each word in the text.
In detail, the S2 includes:
inserting a separator in front of the medical text character sequence to be classified and adding a segmenter behind the last character of each sentence corresponding to the medical text character sequence to be classified to obtain a target medical text character sequence to be classified;
coding the target medical text character sequence to be classified by using a preset coding model to obtain a word embedding vector;
coding the position information of the target medical text character sequence to be classified by using the coding model to obtain a position embedded vector;
and integrating the word embedding vector and the position embedding vector to obtain a first word embedding vector.
In the embodiment of the invention, a separator [CLS] is inserted before the medical text character sequence to be classified, and a segmenter [SEP] is added after the last character of each sentence corresponding to the medical text character sequence to be classified, so as to obtain the target medical text character sequence to be classified.
Specifically, the medical text character sequence to be classified is $C = \{c_1, c_2, \dots, c_n\}$, where $c_n$ is the n-th character of the medical text character sequence to be classified. A separator [CLS] is added at the first position of the sequence, and a segmenter [SEP] is added after the last character of each sentence, so as to obtain the target medical text character sequence to be classified. For example, if the medical text character sequence to be classified is "common cold generally refers to upper respiratory tract infection", the target medical text character sequence obtained after adding the special characters is "[CLS] common cold generally refers to upper respiratory tract infection [SEP]". After encoding by the BERT model, the character encoding sequence is $E^c = \{e_1^c, e_2^c, \dots, e_N^c\}$, where $e_N^c$ denotes the word embedding vector corresponding to the N-th character.
In the embodiment of the invention, the separator [CLS] and the segmenter [SEP] are added to the medical text character sequence to be classified, and position encoding is performed on the sequence, so that semantic information can be acquired more accurately.
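A minimal sketch of the input preparation in S2 is given below. It only illustrates adding [CLS] and [SEP] and integrating a word embedding with a position embedding; it does not run a BERT encoder. Integration by element-wise addition is an assumption (the source only says "integrating"), and the embedding tables, their dimension and the helper names are hypothetical.

```python
import random


def add_special_tokens(sentences):
    """Prefix [CLS] once and append [SEP] after the last token of each sentence."""
    tokens = ["[CLS]"]
    for sentence in sentences:
        tokens.extend(sentence)
        tokens.append("[SEP]")
    return tokens


def make_table(keys, dim, seed):
    """Hypothetical embedding table filled with small random values."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


def first_word_embedding(tokens, token_table, position_table):
    """Integrate word and position embeddings by element-wise addition (assumed)."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[i])]
        for i, tok in enumerate(tokens)
    ]


DIM = 4  # hypothetical embedding size
sentences = [list("感冒"), list("发热")]
tokens = add_special_tokens(sentences)
token_table = make_table(sorted(set(tokens)), DIM, seed=0)
position_table = make_table(range(len(tokens)), DIM, seed=1)
embedded = first_word_embedding(tokens, token_table, position_table)
```

Each position in `embedded` holds one vector of size `DIM`; in a real encoder these input vectors would then pass through the Transformer layers to produce the first word embedding vector.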
S3, performing vector embedding on the medical text character sequence to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector.
In the embodiment of the present invention, the pre-trained word embedding table may be a GloVe (Global Vectors for Word Representation) word embedding table obtained by training on a corpus of medical professional terms, medical scenes and the like.
In detail, referring to FIG. 2, the step S3 includes:
S31, traversing the medical text character sequence to be classified by using a preset window to obtain the number of times each pair of words appears in the same window, and constructing a co-occurrence matrix from these counts;
S32, calculating, according to the co-occurrence matrix, the probability that any character appears in the context of another character to obtain the co-occurrence probability;
S33, mapping the medical text character sequence to be classified into a vector space by utilizing a pre-trained word embedding table according to the co-occurrence probability to obtain a second word embedding vector.
Further, the step S33 includes:
taking the co-occurrence probability as the character weight of the corresponding character;
matching the medical text character sequence to be classified with a pre-trained word embedding table to obtain a target embedding vector;
and weighting the target embedded vector according to the character weight to obtain a second word embedded vector.
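The windowed counting, the co-occurrence probability and the weighted table lookup can be sketched as follows. The ratio P(b | a) = X_ab / Σ_k X_ak is the usual GloVe-style definition. How a per-character weight is derived from the pairwise probabilities is not stated in the source; taking the mean of P(neighbour | character) over the window is an assumption made only for illustration, and the embedding table is hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window):
    """Count how often each ordered pair of tokens lies within `window` positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][tokens[j]] += 1
    return counts


def cooccurrence_prob(counts, a, b):
    """P(b | a) = X_ab / sum_k X_ak."""
    total = sum(counts[a].values())
    return counts[a][b] / total if total else 0.0


def token_weights(tokens, window):
    """Assumed weighting: mean P(neighbour | token) over each token's window."""
    counts = cooccurrence_counts(tokens, window)
    weights = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        probs = [cooccurrence_prob(counts, center, tokens[j])
                 for j in range(lo, hi) if j != i]
        weights.append(sum(probs) / len(probs) if probs else 0.0)
    return weights


def second_word_embedding(tokens, table, window=1):
    """Look up each token in the table and scale its vector by its weight."""
    weights = token_weights(tokens, window)
    return [[w * x for x in table[tok]] for tok, w in zip(tokens, weights)]


tokens = ["a", "b", "a", "c"]
table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # hypothetical table
vectors = second_word_embedding(tokens, table)
```

Tokens missing from the table raise a `KeyError` in this sketch; a real system would need an out-of-vocabulary policy, which the source does not describe.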
In the embodiment of the invention, the pre-trained word embedding table is utilized to carry out vector embedding on the medical text character sequence to be classified to obtain the second word embedding vector, so that the medical text to be classified is fused with external information, the semantics of medical professional terms can be better expressed, and the accuracy rate of medical text classification can be improved.
S4, splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector.
In the embodiment of the invention, the first word embedded vector and the second word embedded vector may be spliced in either order.
In the embodiment of the invention, the first word embedded vector and the second word embedded vector are spliced to obtain the fusion word embedded vector, and the global information and the external information of the medical text to be classified are fully fused, so that the fusion word embedded vector information is expressed more fully, and the accuracy of classifying the medical text to be classified is improved.
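Splicing here amounts to concatenating, for every token, the two vectors along the feature dimension. The short sketch below assumes both sequences are already aligned token by token; the vector values and dimensions are hypothetical.

```python
def fuse(first, second):
    """Concatenate two per-token embedding sequences along the feature axis."""
    if len(first) != len(second):
        raise ValueError("both sequences must cover the same tokens")
    return [a + b for a, b in zip(first, second)]


first = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # e.g. encoder output, dimension 3
second = [[1.0, 2.0], [3.0, 4.0]]           # e.g. table lookup, dimension 2
fused = fuse(first, second)
```

Because the encoder input carries [CLS] and [SEP] while the table lookup may not, an implementation has to align the two sequences before concatenation, for example by assigning the special characters a zero vector in the second sequence; the source does not say how this is handled.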
S5, extracting the medical text semantic features of the fusion word embedded vector by using a preset feature extraction model.
In the embodiment of the present invention, the preset feature extraction model may be constructed from a BiLSTM network (Bidirectional Long Short-Term Memory network) and a self-attention mechanism module. BiLSTM is a deep learning model for sequence data; it comprises a forward LSTM network and a backward LSTM network, which process the sequence in the forward and backward directions respectively, while the self-attention mechanism module extracts context information.
In detail, referring to FIG. 3, the step S5 includes:
S51, extracting the features of the fusion word embedded vector by utilizing a forward network in a preset feature extraction model to obtain forward medical text semantic features;
S52, performing reverse sequence processing on the fusion word embedded vector to obtain a reverse fusion word embedded vector;
S53, extracting the features of the reverse fusion word embedded vector by utilizing a backward network in the feature extraction model to obtain backward medical text semantic features;
S54, extracting the context information features of the fusion word embedded vector by using a self-attention mechanism module in the feature extraction model;
S55, splicing the forward medical text semantic features, the backward medical text semantic features and the context information features to obtain the medical text semantic features of the fusion word embedded vector.
In the embodiment of the invention, the context information is acquired by using an attention mechanism, the forward and reverse characteristic information in the medical text is acquired by using the BiLSTM network, the context information and the forward and reverse information are fully utilized, partial semantic information is prevented from being lost in the characteristic extraction process, and the accuracy of subsequent classification is improved.
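The data flow of the five sub-steps can be sketched in plain Python. To stay short, each LSTM direction is replaced by a simple tanh recurrence with hypothetical scalar weights, so this is not an LSTM; the self-attention uses scaled dot-product attention with identity query, key and value projections, and mean pooling of the attention output is an assumption. Only the order of operations (forward pass, sequence reversal, backward pass, attention, splicing) follows the text.

```python
import math


def recurrent_pass(seq, w_in=0.5, w_rec=0.5):
    """Stand-in for one LSTM direction: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    A real LSTM has gates and learned weight matrices; the scalar weights
    here are hypothetical and only keep the sketch short.
    """
    h = [0.0] * len(seq[0])
    for x in seq:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
    return h  # final hidden state summarises the sequence


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def mean_pool(seq):
    """Average the per-token vectors into one fixed-size vector (assumed pooling)."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def extract_features(fused):
    forward = recurrent_pass(fused)                   # forward features
    backward = recurrent_pass(list(reversed(fused)))  # reversal + backward features
    context = mean_pool(self_attention(fused))        # context information features
    return forward + backward + context               # splice the three parts


fused = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
features = extract_features(fused)
```

The resulting feature vector has three times the input dimension, one third per branch, which is what the classifier in the next step would consume.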
S6, classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
In an embodiment of the present invention, the preset classifier may be a softmax classifier.
In detail, the S6 includes:
calculating the score value of each category of the medical text semantic features;
mapping the score value into a score probability value by using a sigmoid function;
and taking the category with the largest scoring probability value as the category of the medical text to be classified.
In the embodiment of the invention, the cross entropy function can be used as a loss function to calculate the loss value between the real category and the predicted category in the training process.
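The scoring, probability mapping and category selection can be sketched as follows, together with one possible cross-entropy loss. The linear scoring layer, its parameters and the category names are hypothetical. The text mentions both a softmax classifier and a sigmoid mapping; the sketch follows the sigmoid wording, and since both mappings are monotonic in the score, the selected category is the same either way. Pairing sigmoid outputs with a binary cross-entropy summed over categories is an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def class_scores(features, weights, biases):
    """One linear score per category: w_k . features + b_k (assumed scoring layer)."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def classify(features, weights, biases, labels):
    """Map scores to probabilities and return the most probable category."""
    probs = [sigmoid(s) for s in class_scores(features, weights, biases)]
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


def cross_entropy(probs, true_index, eps=1e-12):
    """Binary cross-entropy summed over categories (one-vs-rest targets)."""
    loss = 0.0
    for k, p in enumerate(probs):
        y = 1.0 if k == true_index else 0.0
        loss -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return loss


labels = ["respiratory", "cardiology"]  # hypothetical categories
weights = [[1.0, -1.0], [-1.0, 1.0]]    # hypothetical parameters
biases = [0.0, 0.0]
label, probs = classify([2.0, 0.5], weights, biases, labels)
```

During training, `cross_entropy(probs, true_index)` would be minimised with respect to the model parameters; the loss is smaller when the true category receives the higher probability.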
According to the embodiment of the invention, the medical text to be classified is obtained, word segmentation is carried out on the medical text to be classified to obtain the medical text character sequence to be classified, the preset coding model is utilized to code the medical text character sequence to be classified to obtain the first word embedded vector, and the global information of the text is extracted, so that more accurate acquisition of semantic information is facilitated; further, the pre-trained word embedding table is utilized to carry out vector embedding on the medical text character sequence to be classified, so that a second word embedding vector is obtained, the medical text to be classified is fused with external information, the semantics of medical professional terms can be better expressed, and the accuracy rate of medical text classification can be improved; the first word embedded vector and the second word embedded vector are spliced to obtain a fusion word embedded vector, global information and external information are fully fused, and therefore semantic expression of medical professional terms is more accurate; finally, extracting the medical text semantic features of the fusion word embedded vector by using a preset feature extraction model; and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified, fully considering external information and semantic information of the texts, and improving the accuracy of medical text classification. Therefore, the medical text classification method provided by the invention can solve the problem of lower accuracy of medical text classification.
Fig. 4 is a functional block diagram of a medical text classification apparatus according to an embodiment of the present invention.
The medical text classification apparatus 100 of the present invention may be installed in an electronic device. Depending on the implemented functions, the medical text classification apparatus 100 may include a word segmentation module 101, a vector conversion module 102, a feature extraction module 103, and a classification module 104. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the word segmentation module 101 is configured to obtain a medical text to be classified, and perform word segmentation on the medical text to be classified to obtain a character sequence of the medical text to be classified;
the vector conversion module 102 is configured to encode the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector; performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector; splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
the feature extraction module 103 is configured to extract the medical text semantic feature of the fusion word embedded vector by using a preset feature extraction model;
the classifying module 104 is configured to classify the semantic features of the medical text by using a preset classifier, so as to obtain the category of the medical text to be classified.
In detail, each module in the medical text classification apparatus 100 in the embodiment of the present invention adopts the same technical means as the medical text classification method described in fig. 1 to 3 and can produce the same technical effects, which are not repeated here.
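As an illustrative sketch only, the four modules can be viewed as callables chained in order. The class name `MedicalTextClassifier`, the field names, and the toy stand-in functions below are hypothetical; a real apparatus would plug in a word segmenter, the coding model plus embedding table, the feature extraction model, and the classifier.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class MedicalTextClassifier:
    segment: Callable[[str], List[str]]            # word segmentation module 101
    to_vectors: Callable[[List[str]], List[Vector]]  # vector conversion module 102
    extract: Callable[[List[Vector]], Vector]      # feature extraction module 103
    classify: Callable[[Vector], str]              # classification module 104

    def __call__(self, text: str) -> str:
        tokens = self.segment(text)
        fused = self.to_vectors(tokens)
        features = self.extract(fused)
        return self.classify(features)


def mean_pool(vectors: List[Vector]) -> Vector:
    """Toy feature extractor: average each dimension over all tokens."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy stand-ins, used only so that the sketch runs end to end.
pipeline = MedicalTextClassifier(
    segment=str.split,
    to_vectors=lambda tokens: [[float(len(t)), 1.0] for t in tokens],
    extract=mean_pool,
    classify=lambda f: "long_terms" if f[0] > 5 else "short_terms",
)

print(pipeline("acute myocardial infarction"))
```

Keeping each module behind a plain callable mirrors the logical division of modules: any one of them can be replaced without touching the others.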
Fig. 5 is a schematic structural diagram of an electronic device for implementing a medical text classification method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a medical text classification program, stored in the memory 11 and executable on the processor 10.
The processor 10 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing Unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and so on. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules (for example, executing medical text classification programs, etc.) stored in the memory 11, and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 11 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only for storing application software installed in an electronic device and various types of data, such as codes of medical text classification programs, etc., but also for temporarily storing data that has been output or is to be output.
The communication bus 12 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11, the at least one processor 10, and other components.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), and is typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard; optionally, it may also be a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device and for displaying a visual user interface.
Fig. 5 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in fig. 5 does not limit the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components. Preferably, the power source is logically connected to the at least one processor 10 through a power management device, so that charge management, discharge management, power consumption management, and similar functions are implemented through the power management device. The power source may also include any one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described here.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
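The medical text classification program stored in the memory 11 of the electronic device 1 is a combination of instructions which, when executed by the processor 10, can implement:
acquiring a medical text to be classified, and performing word segmentation on the medical text to be classified to obtain a medical text character sequence to be classified;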
encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector;
performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector;
splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
Specifically, for how the processor 10 implements the above instructions, reference may be made to the description of the relevant steps in the embodiments corresponding to the drawings, which is not repeated here.
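By way of a hedged example, the final classification instruction can be sketched as follows, using the sigmoid mapping and the largest-probability selection recited in claim 6. The category names and score values are invented for illustration, and the function names `sigmoid` and `classify` are hypothetical.

```python
import math
from typing import Dict, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classify(scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Map each category score to a probability and pick the largest one."""
    probs = {category: sigmoid(score) for category, score in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Invented per-category scores, standing in for the classifier's raw output.
scores = {"cardiology": 2.0, "neurology": -1.0, "oncology": 0.5}
best, probs = classify(scores)
print(best, probs)
```

Because the sigmoid is applied to each score independently, the probabilities need not sum to one; the largest score always yields the largest probability, so the selected category is the same as an argmax over the raw scores.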
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (Read-Only Memory, ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
acquiring a medical text to be classified, and performing word segmentation on the medical text to be classified to obtain a medical text character sequence to be classified;
encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector;
performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector;
splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
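As a non-limiting illustration of the second word embedding step listed above, the following sketch follows the co-occurrence weighting recited in claims 2 and 3: count co-occurrences within a window, turn the counts into conditional probabilities, derive one weight per token, and scale the table vectors by that weight. The claims do not state how the probabilities are reduced to a single weight per character, so the averaging in `token_weights` is an assumption, as are all function names, the window size, and the toy table.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


def cooccurrence_counts(tokens: List[str], window: int = 1) -> Dict[str, Dict[str, int]]:
    """Count how often each context token appears within `window` of a center token."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][tokens[j]] += 1
    return counts


def cooccurrence_probs(counts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """P(context | center) = count(center, context) / sum of counts for center."""
    probs = {}
    for center, row in counts.items():
        total = sum(row.values())
        probs[center] = {ctx: n / total for ctx, n in row.items()}
    return probs


def token_weights(probs: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Assumed reduction: average P(token | center) over all centers it co-occurs with."""
    received = defaultdict(list)
    for center, row in probs.items():
        for ctx, p in row.items():
            received[ctx].append(p)
    return {tok: sum(ps) / len(ps) for tok, ps in received.items()}


def weighted_embeddings(tokens, table, weights, dim):
    """Look up each token and scale its vector; unseen tokens get zeros."""
    return [[weights.get(t, 0.0) * x for x in table.get(t, [0.0] * dim)] for t in tokens]


tokens = ["chest", "pain", "chest", "tightness"]
table = {"chest": [1.0, 2.0], "pain": [3.0, 0.0], "tightness": [0.0, 3.0]}  # toy table
probs = cooccurrence_probs(cooccurrence_counts(tokens, window=1))
weights = token_weights(probs)
second = weighted_embeddings(tokens, table, weights, dim=2)
print(weights)
print(second)
```

In this toy sequence, "chest" is the only neighbour of both "pain" and "tightness", so it receives the largest weight, while "tightness" appears only once next to "chest" and receives the smallest.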
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; the division of the modules is merely a logical function division, and other manners of division are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by a single unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A medical text classification method, the method comprising:
acquiring a medical text to be classified, and performing word segmentation on the medical text to be classified to obtain a medical text character sequence to be classified;
encoding the medical text character sequence to be classified by using a preset encoding model to obtain a first word embedded vector;
performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector;
splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
2. The medical text classification method of claim 1, wherein the vector embedding the sequence of medical text characters to be classified using a pre-trained word embedding table to obtain a second word embedding vector, comprises:
traversing the medical text character sequence to be classified by using a preset window to obtain the number of times each vocabulary item appears in the same window, and constructing a co-occurrence frequency matrix according to the number of times each vocabulary item appears in the same window;
calculating, according to the co-occurrence frequency matrix, the probability that any one character co-occurs with another character to obtain the co-occurrence probability;
and according to the co-occurrence probability, mapping the medical text character sequence to be classified into a vector space by utilizing a pre-trained word embedding table to obtain a second word embedding vector.
3. The medical text classification method of claim 2, wherein mapping the sequence of medical text characters to be classified to a vector space using a pre-trained word embedding table according to the co-occurrence probability to obtain a second word embedding vector, comprises:
taking the co-occurrence probability as the character weight of the corresponding character;
matching the medical text character sequence to be classified with a pre-trained word embedding table to obtain a target embedding vector;
and weighting the target embedded vector according to the character weight to obtain a second word embedded vector.
4. The medical text classification method according to claim 1, wherein the encoding the sequence of medical text characters to be classified using a preset encoding model to obtain a first word embedding vector includes:
inserting a separator in front of the medical text character sequence to be classified and adding a segmenter behind the last character of each sentence corresponding to the medical text character sequence to be classified to obtain a target medical text character sequence to be classified;
coding the target medical text character sequence to be classified by using a preset coding model to obtain a word embedding vector;
coding the position information of the target medical text character sequence to be classified by using the coding model to obtain a position embedded vector;
and integrating the word embedding vector and the position embedding vector to obtain a first word embedding vector.
5. The medical text classification method of claim 1, wherein extracting the medical text semantic features of the fusion word embedded vector using a preset feature extraction model comprises:
extracting the features of the fusion word embedded vector by using a forward network in a preset feature extraction model to obtain forward medical text semantic features;
performing reverse sequence processing on the fusion word embedded vector to obtain a reverse fusion word embedded vector;
extracting the characteristics of the reverse fusion word embedded vector by utilizing a backward network in the characteristic extraction model to obtain backward medical text semantic characteristics;
extracting the context information characteristics of the fusion word embedded vector by using a self-attention mechanism module in the characteristic extraction model;
and splicing the forward medical text semantic features, the backward medical text semantic features and the contextual information features to obtain the medical text semantic features of the fusion word embedded vector.
6. The medical text classification method according to claim 1, wherein classifying the semantic features of the medical text by using a preset classifier to obtain the category of the medical text to be classified comprises:
calculating the score value of each category of the medical text semantic features;
mapping the score value into a score probability value by using a sigmoid function;
and taking the category with the largest scoring probability value as the category of the medical text to be classified.
7. The medical text classification method according to claim 1, wherein the word segmentation processing is performed on the medical text to be classified to obtain a character sequence of the medical text to be classified, and the method comprises the following steps:
performing word segmentation on the medical texts to be classified by using N preset word segmenters to obtain N initial medical text character sequences to be classified, wherein N is a natural number greater than or equal to 2;
and selecting the medical text character sequence to be classified from the N initial medical text character sequences to be classified by utilizing a bidding mechanism.
8. A medical text classification apparatus, the apparatus comprising:
the word segmentation module is used for obtaining medical texts to be classified, and carrying out word segmentation on the medical texts to be classified to obtain a medical text character sequence to be classified;
the vector conversion module is used for coding the medical text character sequence to be classified by utilizing a preset coding model to obtain a first word embedded vector; performing vector embedding on the medical text character sequences to be classified by utilizing a pre-trained word embedding table to obtain a second word embedding vector; splicing the first word embedded vector and the second word embedded vector to obtain a fusion word embedded vector;
the feature extraction module is used for extracting medical text semantic features of the fusion word embedded vector by using a preset feature extraction model;
and the classification module is used for classifying the semantic features of the medical texts by using a preset classifier to obtain the category of the medical texts to be classified.
9. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the medical text classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the medical text classification method according to any one of claims 1 to 7.
CN202310500108.6A 2023-05-06 2023-05-06 Medical text classification method, device, equipment and medium Pending CN116450829A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310500108.6A CN116450829A (en) 2023-05-06 2023-05-06 Medical text classification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116450829A true CN116450829A (en) 2023-07-18

Family

ID=87133701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310500108.6A Pending CN116450829A (en) 2023-05-06 2023-05-06 Medical text classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116450829A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117254819A (en) * 2023-11-20 2023-12-19 深圳市瑞健医信科技有限公司 Medical waste intelligent supervision system
CN117254819B (en) * 2023-11-20 2024-02-27 深圳市瑞健医信科技有限公司 Medical waste intelligent supervision system

Similar Documents

Publication Publication Date Title
CN111695354A (en) Text question-answering method and device based on named entity and readable storage medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN114822812A (en) Character dialogue simulation method, device, equipment and storage medium
CN115983271B (en) Named entity recognition method and named entity recognition model training method
CN113722483A (en) Topic classification method, device, equipment and storage medium
CN113821622B (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113344125B (en) Long text matching recognition method and device, electronic equipment and storage medium
CN115221276A (en) Chinese image-text retrieval model training method, device, equipment and medium based on CLIP
CN116450829A (en) Medical text classification method, device, equipment and medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN116821373A (en) Map-based prompt recommendation method, device, equipment and medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN116628162A (en) Semantic question-answering method, device, equipment and storage medium
CN114548114B (en) Text emotion recognition method, device, equipment and storage medium
CN113705692B (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN114462411B (en) Named entity recognition method, device, equipment and storage medium
CN113656703B (en) Intelligent recommendation method, device, equipment and storage medium based on new online courses
CN111680513B (en) Feature information identification method and device and computer readable storage medium
CN115169330B (en) Chinese text error correction and verification method, device, equipment and storage medium
CN113792539B (en) Entity relationship classification method and device based on artificial intelligence, electronic equipment and medium
CN116341646A (en) Pretraining method and device of Bert model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination