CN109840280B - Text classification method and device and computer readable storage medium

Info

Publication number: CN109840280B
Application number: CN201910165382.6A
Authority: CN
Inventors: 施振辉; 陈俊; 夏源; 陆超; 黄海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2023-07-18
Anticipated expiration: 2039-03-05
Also published as: CN109840280A

Abstract

The embodiment of the invention provides a text classification method, a text classification device and a computer readable storage medium, wherein the method comprises the following steps: identifying texts to be classified to obtain information of at least one dimension; wherein the information of at least one dimension at least comprises text elements and medical characteristic information; encoding the information of at least one dimension to obtain an initial vector corresponding to each dimension; and determining the medical label of the text to be classified based on the initial vector corresponding to each dimension.

Description

Text classification method and device and computer readable storage medium

Technical Field

The present invention relates to text recognition technology in the medical field, and in particular, to a text classification method, apparatus and computer readable storage medium.

Background

Text classification refers to a technology for automatically classifying and labeling natural language text according to a given classification system or standard. In many internet services today, understanding the text input by users is very important for providing better medical-related services. Currently, general-purpose text recognition schemes are adopted; these achieve high recognition accuracy in non-medical fields, but cannot ensure accurate results for text recognition or classification in the medical field.

Disclosure of Invention

Embodiments of the present invention provide a text classification method, apparatus, and computer readable storage medium, to solve one or more technical problems in the prior art.

In a first aspect, an embodiment of the present invention provides a text classification method, including:

identifying texts to be classified to obtain information of at least one dimension; wherein the information of at least one dimension at least comprises text elements and medical characteristic information;

encoding the information of at least one dimension to obtain an initial vector corresponding to each dimension;

and determining the medical label of the text to be classified based on the initial vector corresponding to each dimension.

In one embodiment, the determining the medical label of the text to be classified based on the initial vector corresponding to each dimension includes:

splicing the initial vectors corresponding to each dimension to obtain a spliced first vector to be processed;

inputting the first vector to be processed into a first network to obtain a first type output vector output by the first network, and determining a first type medical label of the text to be classified based on the first type output vector;

wherein the first type output vector comprises two code values, which respectively represent a medical intention and a non-medical intention of the text to be classified;

the first type medical label is used for representing whether the text to be classified has a medical intention.

In one embodiment, the determining the medical label of the text to be classified based on the initial vector corresponding to each dimension includes:

splicing the initial vectors corresponding to each dimension to obtain a spliced second vector to be processed;

inputting the second vector to be processed into a second network to obtain a second class output vector output by the second network; wherein the second class output vector comprises at least one code value corresponding to at least one medical department;

and determining a second type of medical label for representing the medical department corresponding to the text to be classified based on the second type of output vector.

In one embodiment, before the identifying the text to be classified and obtaining the information of at least one dimension, the method further includes:

and judging whether the first type of medical label of the text to be classified is medical intention.

In one embodiment, the method further comprises:

and carrying out smoothing treatment on the second class output vector to obtain a second class output vector corresponding to at least one medical department after the smoothing treatment.

In one embodiment, the text element includes at least one character and/or at least one word;

the medical characteristic information includes at least one of: at least one type of key information, at least one intent feature, at least one medical statistical feature.

In a second aspect, an embodiment of the present invention provides a text classification apparatus, including:

the identifying unit is used for identifying the text to be classified to obtain information of at least one dimension; wherein the information of at least one dimension at least comprises text elements and medical characteristic information;

the first processing unit is used for encoding the information of at least one dimension to obtain an initial vector corresponding to each dimension;

and the second processing unit is used for determining the medical label of the text to be classified based on the initial vector corresponding to each dimension.

In one embodiment, the second processing unit is configured to splice the initial vectors corresponding to each dimension to obtain a spliced first vector to be processed; input the first vector to be processed into a first network to obtain a first type output vector output by the first network; and determine a first type medical label of the text to be classified based on the first type output vector.

In one embodiment, the second processing unit is configured to splice the initial vectors corresponding to each dimension to obtain a spliced second vector to be processed; inputting the second vector to be processed into a second network to obtain a second class output vector output by the second network; wherein the second class output vector comprises at least one code value corresponding to at least one medical department; and determining a second type of medical label for representing the medical department corresponding to the text to be classified based on the second type of output vector.

In one embodiment, the second processing unit is configured to determine whether the first type of medical tag of the text to be classified is a medical intention.

In one embodiment, the second processing unit is configured to perform smoothing processing on the second class output vector, to obtain a second class output vector corresponding to at least one medical department after the smoothing processing.

In a third aspect, an embodiment of the present invention provides a text classification apparatus, including:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the methods described above.

In one possible design, the apparatus includes a processor and a memory in a structure thereof, the memory storing a program for supporting the apparatus to perform the above method, the processor being configured to execute the program stored in the memory. The apparatus may also include a communication interface for communicating with other devices or communication networks.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer software instructions for use with a text classification apparatus, including a program for executing the above-described text classification method.

One of the above technical solutions has the following advantages or beneficial effects:

Information of various granularities can be extracted from the text to be classified, in particular the medical characteristic information it contains; the extracted medical characteristic information is then encoded to obtain an initial vector, and the medical label of the text to be classified is determined based on the initial vector. Medical characteristic information is thereby added to the input when the text to be classified is processed, which improves the accuracy of the recognition result for the text to be classified.

The foregoing summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will become apparent by reference to the drawings and the following detailed description.

Drawings

In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope.

FIG. 1 is a schematic flow chart of a text classification method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a network architecture according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a text classification method according to an embodiment of the present invention;

FIG. 4 shows a flow chart of medical department identification according to an embodiment of the present invention;

FIG. 5 shows a third flow chart of a text classification method according to an embodiment of the invention;

FIG. 6 is a block diagram showing a text classification apparatus according to an embodiment of the present invention;

FIG. 7 shows a block diagram of a text classification apparatus according to an embodiment of the invention.

Detailed Description

Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.

In one implementation, fig. 1 shows a flowchart of a text classification method according to an embodiment of the present invention, the method including:

step S11: identifying texts to be classified to obtain information of at least one dimension; wherein the information of at least one dimension at least comprises text elements and medical characteristic information;

step S12: encoding the information of at least one dimension to obtain an initial vector corresponding to each dimension;

step S13: and determining the medical label of the text to be classified based on the initial vector corresponding to each dimension.

Here, the solution provided in this embodiment may be applied to a device having a processing function, for example, may be a terminal device, or may be applied to a network device.

When the scheme is applied to a terminal device, the text to be classified can be acquired by the terminal device itself or through other functions, and steps S11 to S13 are then executed to obtain the final result.

When the scheme is applied to a network device, the network device can receive the text to be classified sent by a terminal device and then execute steps S11 to S13; further, when the scheme is applied on the network side, the network device may send the result to the terminal device after step S13 is completed.

The text to be classified in this embodiment may be directly input text, or may be text obtained by converting collected voice information. Another acquisition mode is to scan a target object and then recognize the characters in a certain region of the target object to obtain the text to be classified; further acquisition modes are not exhaustively listed in this embodiment.

In the medical field, the chief complaints in medical records and patients' brief descriptions of their own conditions are medical short texts; the text length ranges from a few characters to hundreds of characters, and the average length is a few dozen characters. As a result, the information contained in the text is very limited, and the key information is sometimes concentrated in a single word. For example, for the text "pregnancy 23 weeks", if department classification is performed, the key information is the single word "pregnancy"; the model must learn "pregnancy" from the text and understand it in order to give the department result "obstetrics". For the text "how much money is done for appendicitis surgery", if medical intent recognition is performed, the key information is "how much money", rather than a question about which department to visit for appendicitis surgery.

In this embodiment, before step S11 is executed, the text to be classified is first filtered to obtain the filtered text to be classified. Step S11 is then executed to identify the text to be classified and obtain information of at least one dimension.

The filtering process can be understood as deleting punctuation marks and preset special characters from the text to be classified. For example, if the sentence "today, some stomach discomfort" is input, the "," can be deleted. The preset special characters may be general words that occur with high probability, such as "day" and "have", which are not exhaustively listed in this embodiment. After filtering, "stomach discomfort" remains from the input sentence.
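
The filtering step described above can be sketched in Python as follows. This is a minimal illustration only: the preset word list `PRESET_WORDS`, the punctuation set and the function name `filter_text` are assumptions introduced here, and whitespace tokenization is a simplification that suits the translated English example rather than the original language of the text.

```python
import string

# Hypothetical preset list of general high-frequency words to delete.
# The source gives only examples, so this list is an assumption.
PRESET_WORDS = frozenset({"today", "some"})

# ASCII punctuation plus a few common full-width marks.
PUNCTUATION = frozenset(string.punctuation) | frozenset("，。？！；：、")


def filter_text(text, preset_words=PRESET_WORDS):
    """Delete punctuation marks and preset words from the input text."""
    # Replace punctuation with a space so neighbouring tokens stay separate.
    cleaned = "".join(" " if ch in PUNCTUATION else ch for ch in text)
    # Whitespace tokenization is a simplification for English examples.
    kept = [tok for tok in cleaned.split() if tok.lower() not in preset_words]
    return " ".join(kept)


print(filter_text("today, some stomach discomfort"))  # prints "stomach discomfort"
```

A production system would presumably use a segmenter suited to the input language in place of `str.split`.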

The information of at least one dimension at least comprises text elements and medical characteristic information; wherein the text element comprises at least one character and/or at least one word; the medical characteristic information includes at least one of: at least one type of key information, at least one intent feature, at least one medical statistical feature.

That is, the text to be classified is filtered to obtain a filtered text to be classified from which redundant information has been deleted, and at least one character and/or at least one word is extracted from the filtered text to be classified as a text element.

When at least one character and/or at least one word is extracted from the filtered text to be classified, each character in the filtered text may be extracted, and a word may be extracted wherever at least two consecutive characters can form a word; for example, words such as "discomfort" can be extracted from "stomach discomfort".
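
As a rough illustration of character and word extraction, the sketch below takes every non-space character as a character-level element and uses forward maximum matching against a lexicon to find words of at least two consecutive characters. The matching strategy, the lexicon contents, the sample string and the function names are all assumptions made for illustration; the source does not specify how words are segmented.

```python
def extract_characters(text):
    """Return every non-whitespace character as a character-level element."""
    return [ch for ch in text if not ch.isspace()]


def extract_words(text, lexicon, max_len=4):
    """Forward maximum matching: collect lexicon words of 2..max_len characters."""
    words = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first; stop at length 2 (a word needs
        # at least two consecutive characters in this sketch).
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            words.append(match)
            i += len(match)
        else:
            i += 1
    return words


# Hypothetical sample string and lexicon, for illustration only.
sample = "胃不舒服"
print(extract_characters(sample))
print(extract_words(sample, {"不舒服"}))
```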

In order to reduce labor and time costs and to improve the accuracy of classifying medical short texts, the text to be classified is classified in combination with medical characteristic information, namely medical knowledge.

The at least one type of key information is obtained by identifying the text to be classified: key information of at least one of the types symptom, disease, examination and body part is extracted from the text to be classified by using NLU (natural language understanding). For example, taking "stomach discomfort" again, the symptom key information "discomfort" and the body-part key information "stomach" can be obtained. It should be understood that not every sentence yields all types of key information; only some types may be obtained, which is not exhaustively listed here.

This embodiment is described with respect to intent recognition, so the medical characteristic information may include only: at least one type of key information and at least one intent feature.

The at least one intent feature may be obtained by keyword matching. Specifically, the at least one character and/or at least one word in the text element may be used for matching to obtain the corresponding intent feature; for example, a certain character and/or word is used as a keyword, and the corresponding intent feature is obtained by keyword matching. It can be understood that the intent feature may also be obtained by matching on at least one type of key information; for example, if the symptom key information is "pain" and the body-part key information is "head", the matched intent feature is "seeking medical care". As another example, the phrase "how much money" can be matched to the intent feature "cost", and feature phrases such as "go to hospital" can be matched to the intent feature "hospital-related".
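
Keyword matching of intent features can be sketched as a simple table lookup. The table below contains only the two phrase examples given in the text; the table itself, its name `INTENT_KEYWORDS` and the function name are assumptions, and matching on key information (for example symptom plus body part) is not covered by this sketch.

```python
# Hypothetical keyword-to-intent table built from the examples in the text.
INTENT_KEYWORDS = {
    "how much money": "cost",
    "go to hospital": "hospital-related",
}


def match_intent_features(text, table=INTENT_KEYWORDS):
    """Return the sorted intent features whose keyword occurs in the text."""
    lowered = text.lower()
    return sorted({intent for keyword, intent in table.items() if keyword in lowered})


print(match_intent_features("how much money is done for appendicitis surgery"))
```

Returning a sorted list keeps the output deterministic when several keywords match the same text.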

In the foregoing step S12, encoding the information of at least one dimension to obtain an initial vector corresponding to each dimension may be: encoding the at least one character to obtain an initial vector corresponding to the characters, encoding the at least one word to obtain an initial vector corresponding to the words, and encoding the key information and the intent feature in the medical characteristic information together to obtain an initial vector corresponding to the medical characteristic information.

The at least one character and the at least one word may be encoded by a first character encoding network and a first word encoding network, respectively; one of LSTM, CNN and RNN may be employed as the encoding network.

The key information and the intent feature in the medical characteristic information are encoded together, to obtain the initial vector corresponding to the medical characteristic information, by a first feature encoding network, which may be implemented as a DNN network.

Referring to fig. 2, the input in the figure can be understood as the text to be classified, from which at least one character, at least one word and at least one piece of medical characteristic information are obtained respectively; the at least one piece of medical characteristic information may comprise key information and intent features. The characters and the words may each be processed by an RNN network to obtain codes, yielding the initial vector of the characters and the initial vector of the words respectively. It should be understood that the initial vector corresponding to the at least one character may be a multidimensional vector; that is, processing a plurality of characters through the RNN network may yield a vector formed by 3 code values, or a vector with more dimensions, which is not exhaustively listed here. The at least one word is processed in a corresponding manner, which is not described again here.

The key information and the intent features in the at least one piece of medical characteristic information may be processed by a DNN network to obtain a multidimensional vector, for example a three-dimensional or four-dimensional vector; that is, 3, 4 or more code values may be obtained to form the corresponding multidimensional vector.

The determining the medical label of the text to be classified based on the initial vector corresponding to each dimension comprises the following steps: splicing the initial vectors corresponding to each dimension to obtain a spliced first vector to be processed; inputting the first vector to be processed into a first network to obtain a first type output vector output by the first network; and determining a first type medical label of the text to be classified based on the first type output vector.

The first network may be a DNN network.

As shown in fig. 2, the initial vector of the characters, the initial vector of the words and the initial vector of the medical characteristic information are spliced to obtain the spliced first vector to be processed.

Further, the first vector to be processed is input into the first network, i.e. the DNN network in the figure, to obtain the first type output vector. In this embodiment the first type output vector may be a two-dimensional vector, that is, two code values may be output, corresponding to a medical intention and a non-medical intention respectively. Assuming that the first code value corresponds to the medical intention and the second code value corresponds to the non-medical intention: if the two output code values are (0, 1), it can be confirmed that the input text to be classified has a non-medical intention; if the two output code values are (1, 0), it can be confirmed that the input text to be classified has a medical intention.
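
The splicing and first-network steps can be sketched in plain Python as below. This is a toy under stated assumptions, not the patented network: a single dense layer with softmax stands in for the DNN, the three initial vectors and the weights are made-up numbers, and the function names are hypothetical. Only the ordering of the two code values (first = medical intention, second = non-medical intention) follows the text.

```python
import math


def splice(*vectors):
    """Concatenate the per-dimension initial vectors into one vector."""
    spliced = []
    for vector in vectors:
        spliced.extend(vector)
    return spliced


def dense(vector, weights, bias):
    """One fully connected layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax so the code values sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_type_label(output_vector):
    """Index 0 is the medical-intention value, index 1 the non-medical one."""
    if output_vector[0] >= output_vector[1]:
        return "medical intent"
    return "non-medical intent"


# Made-up initial vectors for characters, words and medical features.
char_vec = [0.2, 0.1, 0.4]
word_vec = [0.3, 0.5, 0.1]
feat_vec = [1.0, 0.0, 0.0]
first_vector = splice(char_vec, word_vec, feat_vec)

# Made-up weights for a 9-input, 2-output layer (assumption, not trained).
weights = [[0.5] * 9, [-0.5] * 9]
bias = [0.0, 0.0]
output = softmax(dense(first_vector, weights, bias))
label = first_type_label(output)
print(label)
```

In a trained system the weights would come from training on labeled samples, and the encoders producing the three initial vectors would be the networks described above.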

The embodiment can be applied to an intelligent triage system. To improve user experience, texts to be classified that are unrelated to medical treatment, and texts that clearly do not call for triage, need to be filtered out, such as "today weather is good" and "pictures of certain skin diseases", so that the input to the system is relevant to the triage intention.

In medical intent recognition, texts to be classified with the same intent may be similar or dissimilar, and texts to be classified with different intents may also be similar or dissimilar. For example, "seeing a doctor for a cold" and "seeing a doctor for an upper respiratory tract infection" are similar in intent but dissimilar in text (a cold is an upper respiratory tract infection), whereas "what medicine to take for a cold" and "how much money to spend for a cold" are similar in text but different in intent. In medical intent recognition, a neural network alone, although simple and efficient, cannot obtain ideal results: first, a neural network can learn statistical features, but it is difficult for it to learn the logical reasoning features of the medical profession; second, texts to be classified with different intents may share most of their statistical features, which makes classification by the neural network more difficult. Therefore, this embodiment uses medical features, namely intent features and key information related to medical knowledge, as additional inputs, so that the accuracy of intent recognition can be improved.

In order to better identify and classify the text, a neural network method is combined with specific intent key information and specific medical knowledge. This is further described below based on fig. 3 in combination with the foregoing embodiment, and mainly comprises the following three steps:

data cleaning, including construction of training samples, filtering of special characters and the like; the specific process has been described above and is not repeated here;

feature extraction: characters and words are extracted from the text to be classified, and key information, which may comprise symptoms, diseases, examinations, body parts and the like, is extracted based on NLU; the corresponding intent features can be obtained by matching on at least one of the characters, the words and the key information;

network construction: the word features and the character features are each encoded by a bidirectional LSTM network (or another network such as an RNN or CNN); the key information extracted by NLU (symptoms, diseases, examinations, body parts and the like) and the intent features are encoded directly by a DNN network; the three output initial vectors are then spliced to obtain the first vector to be processed, which is input into the DNN network to obtain the finally output first type medical label.

By adopting the scheme, the text to be classified can be subjected to information extraction with various granularities, particularly the extraction of medical characteristic information contained in the text to be classified, and then the extracted medical characteristic information is encoded to obtain an initial vector, and the medical label of the text to be classified is determined based on the initial vector. Therefore, medical characteristic information can be added to the input information when the text to be classified is processed, and the accuracy of the recognition result of the text to be classified is improved.

In another embodiment, further explanation is made based on the steps described in fig. 1. The difference from the previous embodiment is that the medical label obtained in this embodiment is not directed at medical intention, but identifies the classification result of at least one medical department corresponding to the text to be classified. Specifically:

The filtering of the text to be classified and the extraction of at least one character and/or at least one word are the same as in the previous embodiment, and are not repeated here.

Unlike the previous embodiment, this embodiment identifies the classification result of the medical department, so the medical characteristic information may include: at least one type of key information and at least one medical statistical feature.

The manner of acquiring the key information is the same as in the above embodiment and is not described again. In this embodiment, at least one medical statistical feature also needs to be obtained; this can be understood as determining statistics corresponding to different departments or different department classifications based on at least one of the at least one character, the at least one word and the at least one type of key information. For example, the word "cold" may have a medical statistical feature of 80% for "respiratory department" and 0% for "gynecology". It should be understood that this is only an example; in practice there may be more medical statistical features corresponding to different departments, which are not exhaustively listed here.

Still further, the obtaining of at least one medical statistical feature may be determining medical statistical features corresponding to different words, characters and key information according to a preset analysis model, or determining medical statistical features according to a preset table, which is not exhaustive here.
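
Determining medical statistical features from a preset table can be sketched as a lookup. The table below holds only the "cold" example from the text; the table name, the department tuple and the choice to average the rows of all matched tokens are assumptions made for illustration, since the source does not say how statistics from several tokens are combined.

```python
# Departments in a fixed order so the feature vector has stable positions.
DEPARTMENTS = ("respiratory department", "gynecology")

# Hypothetical preset table holding only the example given in the text.
STATS_TABLE = {
    "cold": {"respiratory department": 0.8, "gynecology": 0.0},
}


def medical_statistical_feature(tokens, table=STATS_TABLE, departments=DEPARTMENTS):
    """Average the per-department statistics of all tokens found in the table."""
    rows = [table[token] for token in tokens if token in table]
    if not rows:
        # No known token: return an all-zero feature vector.
        return [0.0] * len(departments)
    return [sum(row.get(dep, 0.0) for row in rows) / len(rows)
            for dep in departments]


print(medical_statistical_feature(["cold"]))
```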

It should be noted that, in this embodiment, step S12 is also executed to encode the information of at least one dimension to obtain an initial vector corresponding to each dimension, except that at least one parameter in the network employed for each dimension may differ from the parameters in the previous embodiment. For example, the at least one character and the at least one word may be encoded by a second character encoding network and a second word encoding network, respectively. Likewise, one of LSTM, CNN and RNN may be used as the encoding network; the parameters of the second character encoding network and the second word encoding network are at least partially different from those of the first character encoding network and the first word encoding network.

The key information and the medical statistical features in the medical feature information are coded together to obtain an initial vector corresponding to the medical feature information, which can be a second feature coding network, and can be realized by adopting a DNN network. Likewise, there are at least partially different parameters in the second feature encoding network than in the first feature encoding network.

In S13 of this embodiment, the determining, based on the initial vector corresponding to each dimension, the medical label of the text to be classified includes, as shown in fig. 4:

step S131, splicing the initial vectors corresponding to each dimension to obtain a spliced second vector to be processed;

step S132, inputting the second vector to be processed into a second network to obtain a second class output vector output by the second network; wherein the second class output vector comprises at least one code value corresponding to at least one medical department;

step S133, determining a second type medical label for representing the medical department corresponding to the text to be classified based on the second class output vector.

The at least one code value corresponding to the at least one medical department may be one output code value per medical department, and these code values form the second class output vector. For example, there may be three medical departments, namely a first medical department, a second medical department and a third medical department, which correspond to different code positions in the output multidimensional vector: the first medical department corresponds to the first code value, the second medical department to the second code value, and the third medical department to the third code value. If the output vector is (1, 0, 0), it may be determined that the second type medical label indicates the first of the three medical departments.
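
Mapping a second class output vector to a department can be sketched as follows. Taking the position of the largest code value (argmax) is an assumption of this sketch; the text only states that the label is determined based on the output vector. The function name and department names are placeholders.

```python
def second_type_label(output_vector, departments):
    """Return the department whose code position holds the largest value."""
    if len(output_vector) != len(departments):
        raise ValueError("one code value is expected per medical department")
    best = max(range(len(output_vector)), key=output_vector.__getitem__)
    return departments[best]


departments = ("first medical department",
               "second medical department",
               "third medical department")
print(second_type_label([1, 0, 0], departments))  # prints "first medical department"
```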

The method further comprises: smoothing the second class output vector to obtain a smoothed second class output vector corresponding to the at least one medical department. This is explained as follows:

The embodiment can be applied to an intelligent medical guidance system, which can automatically identify which department the user should visit according to the user's text to be classified. For example: "Which department should I register with for an upper respiratory tract infection?" The answer is "respiratory medicine". "Which department should I register with for a swollen leg?" The answer is "cardiovascular medicine or vascular surgery". However, the department corresponding to some texts to be classified is unique, while for others it is not. For the above example of the swollen leg, medical professionals concluded after consultation (considering the severity of the symptoms) that cardiovascular medicine and vascular surgery are both correct, and that cardiovascular medicine is preferable to vascular surgery in the absence of other information.

That is, in order to avoid the finally output second type medical label being too absolute, the second class output vector may be smoothed. The smoothing may adjust at least one code value of the second class output vector based on a preset weight value; for example, the maximum code value is reduced based on a preset first weight value, and the remaining code values are increased based on a second weight value, such that the sum of the code values equals 1 both before and after the adjustment. For example, (1, 0, 0) may be adjusted to (0.8, 0.1, 0.1).
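
The smoothing described above can be sketched as below for a one-hot vector over three departments such as (1, 0, 0). This version uses a single hypothetical parameter `reduction`: the largest code value is lowered by that amount and the same amount is shared equally among the other positions, so the sum is preserved. Deriving the second weight from the first in this way is an assumption; the text allows two separately preset weights.

```python
def smooth_output(vector, reduction=0.2):
    """Lower the largest code value and share the amount among the others."""
    if len(vector) < 2:
        # Nothing to redistribute to; return a copy unchanged.
        return list(vector)
    top = max(range(len(vector)), key=vector.__getitem__)
    share = reduction / (len(vector) - 1)
    return [value - reduction if index == top else value + share
            for index, value in enumerate(vector)]


smoothed = smooth_output([1.0, 0.0, 0.0])
print(smoothed)  # approximately [0.8, 0.1, 0.1]
```

Because the amount removed from the maximum equals the total added elsewhere, the code values sum to the same total before and after smoothing.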

It should also be noted that an arbitrarily input sentence may not have a medical intention, in which case identifying a medical department may not give a correct result. Therefore, this embodiment may further determine, before executing step S11, whether the first type medical label of the text to be classified is a medical intention. That is, it is first determined by the previous embodiment whether the input sentence has a medical intention; if so, this embodiment is executed to identify the second type medical label corresponding to the medical department; if not, this embodiment is not executed.
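
The gating between the two embodiments can be sketched as a small wrapper. The two callables passed in stand for the intent network and the department network; the stub functions below are purely hypothetical placeholders that let the control flow run, and returning `None` for non-medical input is a choice made for this sketch.

```python
def classify_text(text, is_medical_intent, identify_department):
    """Run department identification only when the text has medical intent."""
    if not is_medical_intent(text):
        # Non-medical input: the department step is skipped entirely.
        return None
    return identify_department(text)


def demo_is_medical_intent(text):
    """Hypothetical stand-in for the first-type (intent) classifier."""
    return "infection" in text or "swollen" in text


def demo_identify_department(text):
    """Hypothetical stand-in for the second-type (department) classifier."""
    return "respiratory medicine" if "respiratory" in text else "cardiovascular medicine"


print(classify_text("upper respiratory tract infection",
                    demo_is_medical_intent, demo_identify_department))
print(classify_text("today weather is good",
                    demo_is_medical_intent, demo_identify_department))
```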

In order to better classify the texts in departments, the scheme provided by the embodiment is combined with the method of smoothing the labels in addition to the medical characteristic information. The overall flow is as shown in fig. 5, comprising:

the data cleaning is the same as the flow provided in fig. 3, and will not be described again;

feature extraction: unlike fig. 3, the statistical features of the words under each department are also obtained; that is, in addition to the key features, the statistical features of the words under each category are extracted;

network construction: the word and character features are encoded with a bidirectional LSTM network; the key information extracted by NLU (symptoms, diseases, examinations, body parts and the like) and the word statistical features are encoded directly with a DNN network; the outputs for the three kinds of features are spliced and input into a DNN network to obtain the second-class output vector;

label smoothing: a single text to be classified may belong to multiple departments, for example "I feel unwell, which department should I go to?"; whichever department is chosen, it should not be counted as an error. A label smoothing method is therefore added to the processing of the second-class output vector, which weakens excessive dependence of the second-class output vector on a single department. After the label smoothing of the second-class output vector is complete, the second-class medical label is determined.
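As an illustration of the feature extraction step in the flow above, the following Python sketch computes one possible word statistical feature per department. The source does not define the statistic, so the relative frequency used here, the function names, and the toy corpus are assumptions.

```python
from collections import Counter


def department_word_counts(labeled_texts):
    """Count how often each word occurs under each department.

    labeled_texts: iterable of (tokens, department) pairs.
    """
    counts = {}
    for tokens, department in labeled_texts:
        counts.setdefault(department, Counter()).update(tokens)
    return counts


def word_statistical_feature(word, counts, departments):
    """Share of the word's occurrences that fall under each department.

    Assumed definition for illustration; returns all zeros for unseen words.
    """
    raw = [counts.get(d, Counter())[word] for d in departments]
    total = sum(raw)
    return [c / total if total else 0.0 for c in raw]


# Hypothetical training data: tokenized texts with their department labels.
corpus = [
    (["leg", "swelling"], "vascular surgery"),
    (["leg", "pain"], "orthopedics"),
    (["leg", "swelling"], "cardiovascular medicine"),
]
departments = ["cardiovascular medicine", "orthopedics", "vascular surgery"]
counts = department_word_counts(corpus)
feature = word_statistical_feature("swelling", counts, departments)
```

In this toy corpus "swelling" is split evenly between cardiovascular medicine and vascular surgery, which is the kind of ambiguity that label smoothing is meant to accommodate.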

Still another embodiment of the present invention provides a text classification apparatus, as shown in fig. 6, including:

a recognition unit 61, configured to recognize a text to be classified to obtain information of at least one dimension; wherein the information of at least one dimension at least comprises text elements and medical characteristic information;

a first processing unit 62, configured to encode the information of the at least one dimension to obtain an initial vector corresponding to each dimension;

the second processing unit 63 is configured to determine the medical label of the text to be classified based on the initial vector corresponding to each dimension.

In one embodiment, the second processing unit 63 is configured to splice the initial vectors corresponding to each dimension to obtain a spliced first vector to be processed; input the first vector to be processed into a first network to obtain a first-type output vector output by the first network; and determine a first-type medical label of the text to be classified based on the first-type output vector.

In one embodiment, the second processing unit 63 is configured to splice the initial vectors corresponding to each dimension to obtain a spliced second vector to be processed; inputting the second vector to be processed into a second network to obtain a second class output vector output by the second network; wherein the second class output vector comprises at least one code value corresponding to at least one medical department; and determining a second type of medical label for representing the medical department corresponding to the text to be classified based on the second type of output vector.

In one embodiment, the second processing unit 63 is configured to determine whether the first type of medical tag of the text to be classified is a medical intention.

In one embodiment, the second processing unit 63 is configured to perform smoothing processing on the second class output vector, to obtain a second class output vector corresponding to at least one medical department after the smoothing processing.

It should be noted that, the functions of each unit in the apparatus of the embodiment of the present invention may be referred to the corresponding descriptions in the above method, and are not repeated herein.

Fig. 7 shows a block diagram of a text classification apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes a memory 910 and a processor 920, and the memory 910 stores a computer program executable on the processor 920. The processor 920 implements the method in the above-described embodiments when executing the computer program. There may be one or more memories 910 and one or more processors 920.

The apparatus further comprises:

a communication interface 930, configured to communicate with external devices and exchange data.

The memory 910 may include high-speed RAM and may further include non-volatile memory, such as at least one disk memory.

If the memory 910, the processor 920, and the communication interface 930 are implemented independently, they may be connected to one another and communicate with one another through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus.

Alternatively, in a specific implementation, if the memory 910, the processor 920, and the communication interface 930 are integrated on a chip, the memory 910, the processor 920, and the communication interface 930 may communicate with each other through internal interfaces.

An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method as in any of the above embodiments.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict one another.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more, unless explicitly defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of text classification, the method comprising:

determining the medical label of the text to be classified based on the initial vector corresponding to each dimension;

wherein the medical characteristic information comprises at least one type of key information and comprises at least one medical statistical feature; the at least one medical statistics feature is statistics under corresponding different departments or different categories of departments determined based on the text element and/or the at least one type of key information;

the identifying the text to be classified to obtain information of at least one dimension comprises the following steps: extracting key information of at least one type of symptoms, diseases, examination and parts from the text to be classified by natural language understanding;

the determining the medical label of the text to be classified based on the initial vector corresponding to each dimension includes:

determining a second type of medical label for representing the medical department corresponding to the text to be classified based on the second type of output vector;

the at least one medical statistical feature is a medical statistical feature corresponding to different words, characters and key information determined according to a preset analysis model or a preset table.

2. The method according to claim 1, wherein the method further comprises:

3. The method according to any of claims 1-2, wherein the text element comprises at least one character and/or at least one word.

4. A text classification device, comprising:

the second processing unit is used for determining the medical label of the text to be classified based on the initial vector corresponding to each dimension;

the identification unit is specifically used for extracting key information of at least one type of symptoms, diseases, examination and parts from the text to be classified by using natural language understanding;

the second processing unit is further used for splicing the initial vectors corresponding to each dimension to obtain a spliced second vector to be processed; inputting the second vector to be processed into a second network to obtain a second class output vector output by the second network; wherein the second class output vector comprises at least one code value corresponding to at least one medical department; determining a second type of medical label for representing the medical department corresponding to the text to be classified based on the second type of output vector; the at least one medical statistical feature is a medical statistical feature corresponding to different words, characters and key information determined according to a preset analysis model or a preset table.

5. The apparatus according to claim 4, wherein the second processing unit is configured to perform smoothing processing on the second class output vector to obtain a smoothed second class output vector corresponding to the at least one medical department.

6. A text classification device, comprising:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.

7. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1 to 3.