CN113590783B

CN113590783B - NLP natural language processing-based traditional Chinese medicine health preserving intelligent question-answering system

Info

Publication number: CN113590783B
Application number: CN202110858167.1A
Authority: CN
Inventors: 周峰; 吕智慧; 严绍根; 陈宇; 徐杨川; 林榕健
Original assignee: Zhejiang Zhishu Network Technology Co ltd; Fudan University
Current assignee: Zhejiang Zhishu Network Technology Co ltd; Fudan University
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2023-10-03
Anticipated expiration: 2041-07-28
Also published as: CN113590783A

Abstract

The invention provides a traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing. The system comprises a user terminal and a management server, and the management server comprises a traditional Chinese medicine health-preserving word segmentation module, a part-of-speech labeling module, a named entity recognition and classification module, a dependency relationship analysis module, a semantic similarity calculation module and an answer retrieval module. The traditional Chinese medicine health-preserving word segmentation module performs word segmentation on the clinical presentation text according to a preset word segmentation word stock; the named entity recognition and classification module recognizes and classifies named entities among the segmented words; the dependency relationship analysis module analyzes the relationships among the components of the clinical presentation text to obtain the clinical presentation text semantics; the semantic similarity calculation module calculates, based on the clinical presentation text semantics, semantic similarity values between the user question and the warehouse-in questions (the questions stored in the question-answer library); and the answer retrieval module uses a preset retrieval mechanism to retrieve from the question-answer library, according to the semantic similarity values, the answer corresponding to the user question as the health-preserving suggestion.

Description

NLP natural language processing-based traditional Chinese medicine health preserving intelligent question-answering system

Technical Field

The invention belongs to the field of data identification, and particularly relates to a traditional Chinese medicine health maintenance intelligent question-answering system based on NLP natural language processing.

Background

With the development of medical informatization, people increasingly want to obtain more accurate disease diagnosis and treatment information from the Internet. In general, people obtain relevant web pages by entering keywords into a search engine, but the user must still judge the retrieved information, which places high demands on the user's ability to assess it.

A question-answering system is a high-level form of information retrieval system that answers questions posed by a user in natural language with accurate, concise natural language. Existing medical question-answering websites either have doctors reply online or return search results; none yet provides an intelligent knowledge question-answering service.

Traditional Chinese medicine is an important part of China's medical industry and an important means of curing disease, protecting health and preserving health, so research and development of intelligent knowledge question-answering systems for this field is urgently needed. However, existing intelligent question-answering systems in the traditional Chinese medicine field are not actually built on a traditional Chinese medicine domain ontology, so they cannot achieve high accuracy.

Disclosure of Invention

In order to solve the above problems, the invention provides a technical scheme capable of answering users' traditional Chinese medicine health-preserving questions accurately and rapidly, and adopts the following technical scheme:

The invention provides a traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, which performs natural language processing on a clinical presentation text input by a user so as to obtain a corresponding health preserving suggestion, and is characterized by comprising: a plurality of user terminals held by users; and a management server in communication connection with the user terminals, wherein the management server comprises a traditional Chinese medicine health-preserving word segmentation module, a part-of-speech labeling module, a named entity recognition and classification module, a dependency relation analysis module, a semantic similarity calculation module and an answer retrieval module; the traditional Chinese medicine health-preserving word segmentation module performs word segmentation on the clinical presentation text according to a preset word segmentation word stock so as to obtain a plurality of traditional Chinese medicine health-preserving words; the part-of-speech labeling module performs part-of-speech labeling on each traditional Chinese medicine health-preserving word so that each word has a corresponding part of speech; the named entity recognition and classification module performs named entity recognition on the traditional Chinese medicine health-preserving words, according to the words and their corresponding parts of speech, using a pre-trained named entity recognition model so as to obtain named entity recognition results, and classifies the named entity recognition results according to preset entity categories so that each result has a corresponding entity category; the dependency relation analysis module analyzes the relations among the components of the clinical presentation text on the basis of the traditional Chinese medicine health-preserving words, the corresponding parts of speech and the entity categories so as to obtain the clinical presentation text semantics; the semantic similarity calculation module calculates, on the basis of the clinical presentation text semantics, the semantic similarity between the user question in the clinical presentation text and the different warehouse-in questions in a preset question-answer library so as to obtain a semantic similarity value between the user question and each warehouse-in question; and the answer retrieval module uses a preset retrieval mechanism to retrieve, according to the semantic similarity values, the answer corresponding to the user question from the question-answer library as the health preserving suggestion.

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, the invention can also have the technical characteristics that the word segmentation word stock consists of a high-frequency word stock and a segmentation word stock, and is obtained through the following steps: s1-1, extracting relevant Chinese character information from different Chinese medicine books; s1-2, performing text segmentation on the clinical presentation text according to punctuation characters in the clinical presentation text, so as to obtain each text; step S1-3, counting the frequency quantity of the adjacent characters at the same time; s1-4, judging whether the frequency quantity is larger than a preset frequency threshold T; and step S1-5, when the judgment in the step S1-4 is yes, generating a high-frequency word stock by using adjacent characters as high-frequency words, and when the judgment in the step S1-4 is no, generating a split word stock by using the characters as split words.

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, which is provided by the invention, the technical characteristics can be further provided, wherein the word segmentation processing of the clinical presentation text by the traditional Chinese medicine health preserving word segmentation module comprises the following steps: step S2-1, traversing all adjacent character strings mn in the clinical presentation text, and judging whether the adjacent character strings mn are elements in a high-frequency word stock or not; step S2-2, when the step S2-1 judges yes, the adjacent character strings mn form words, and when the step S2-1 judges no, whether the adjacent character strings mn are elements of the segmentation word stock is judged; step S2-3, when the step S2-2 judges yes, the adjacent character strings mn do not form words, and when the step S2-2 judges no, the forward probability between the adjacent character strings mn is calculated; s2-4, judging whether the forward probability is larger than 0.1; step S2-5, when the step S2-4 judges yes, the adjacent character strings mn form words, and when the step S2-4 judges no, whether the forward probability is smaller than 0.001 is judged; step S2-6, when the step S2-5 judges yes, the adjacent character strings mn do not form words, and when the step S2-5 judges no, the reverse probability of the adjacent character strings mn is calculated; s2-7, judging whether the reverse probability is larger than 0.1; step S2-8, when the step S2-7 judges yes, the adjacent character strings mn form words, and when the step S2-7 judges no, whether the reverse probability is smaller than 0.001 is judged; step S2-9, when the step S2-8 judges yes, the adjacent character strings mn are not formed into words, and when the step S2-8 judges no, the relative distance between the adjacent character strings mn is calculated; s2-10, judging whether the 
relative distance is within a preset threshold range; and step S2-11, when the step S2-10 judges yes, forming words by the adjacent character strings mn, and when the step S2-10 judges no, not forming words by the adjacent character strings mn, and completing word segmentation aiming at the clinical presentation text.

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, the system can also have the technical characteristics that the training steps of the named entity recognition model are as follows: s3-1, obtaining training texts related to the health maintenance questions and answers of the traditional Chinese medicine, and segmenting the training texts; s3-2, taking ontology concepts in different Chinese medicine books as an initial source dictionary, and acquiring entity categories in the initial source dictionary by using a reverse maximum matching algorithm; s3-3, carrying out entity naming identification on the entity according to the entity category; s3-4, labeling the multi-attribute characteristics of the training text by using a conditional random field algorithm, so as to obtain a labeled training sample; s3-5, formulating a corresponding characteristic template according to the characteristics of the training sample; and step S3-6, training by using the marked training sample and the characteristic template, so as to obtain a trained named entity recognition model, wherein the entity category comprises main complaints, clinical manifestations, western medicine diagnosis, chinese medicine symptoms, chinese medicine prescription, prescription composition, treatment advice, medicated diet advice, sports advice, life advice and acupoint advice.

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, the system can also have the technical characteristics that the dependency relation analysis module analyzes the clinical presentation text through the following steps: step S4-1, extracting features from all the traditional Chinese medicine health-preserving words, all the parts of speech and all the entity categories, respectively, so as to obtain corresponding word features, part-of-speech features and word label features; step S4-2, mapping the word features, the part-of-speech features and the word label features into corresponding word vectors, part-of-speech vectors and word label vectors, respectively, and splicing all the word vectors, all the part-of-speech vectors and all the word label vectors, respectively, according to a preset dimension W to obtain a W-dimension word vector, a W-dimension part-of-speech vector and a W-dimension word label vector;

step S4-3, combining the W-dimension word vector, the W-dimension part-of-speech vector and the W-dimension word label vector into a new feature vector by means of a cube function;

and step S4-4, predicting the transition step and the type of the edge (dependency arc) based on the new feature vector, so as to determine the relations among all components of the clinical presentation text and obtain the clinical presentation text semantics.
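The formulas for steps S4-2 and S4-3 are not reproduced in this text. As a minimal sketch, assuming the cube function is the element-wise cube activation used in neural transition-based dependency parsers, the splicing and combination could look as follows. The function names, weight matrices and bias are hypothetical and are not taken from the patent.

```python
def concatenate(vectors):
    """Splice a list of embedding vectors into one flat vector (step S4-2)."""
    return [value for vector in vectors for value in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cube_hidden_layer(x_word, x_pos, x_label, w_word, w_pos, w_label, bias):
    """Assumed form of step S4-3: h = (W_w x_w + W_t x_t + W_l x_l + b)^3, element-wise."""
    sums = zip(matvec(w_word, x_word), matvec(w_pos, x_pos),
               matvec(w_label, x_label), bias)
    return [(a + b + c + d) ** 3 for a, b, c, d in sums]


# Toy values only; real embeddings and weights would be learned during training.
x_word = concatenate([[1.0], [2.0]])   # two 1-dimensional word vectors
x_pos = concatenate([[1.0]])
x_label = concatenate([[0.0]])
hidden = cube_hidden_layer(
    x_word, x_pos, x_label,
    w_word=[[1.0, 0.0], [0.0, 1.0]],
    w_pos=[[1.0], [1.0]],
    w_label=[[5.0], [5.0]],
    bias=[0.0, -1.0],
)
```

In such a parser the resulting hidden vector would feed a classifier that scores the possible transitions and arc labels of step S4-4; that classifier is omitted here.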

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, the invention can also have the technical characteristics that when the semantic similarity calculation module calculates semantic similarity between user questions in clinical presentation text and different warehouse-in questions in a preset question-answering library based on clinical presentation text semantics, the invention comprises the following steps: step S5-1, word embedding is respectively carried out on the user questions and each warehouse-in question in the question-answer library based on Google BERT, so that corresponding user question word vectors and warehouse-in question word vectors are obtained; s5-2, mapping the user question word vector and the warehouse-in question word vector by using a mirror image network based on the LSTM-CNN, so as to obtain a corresponding user question final vector and a corresponding warehouse-in question final vector; and S5-3, calculating the distance between the final vector of the user problem and the final vector of each warehouse-in problem by using a Manhattan distance function, and taking the distance as a semantic similarity value.

According to the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing, the invention can also have the technical characteristics that a retrieval mechanism is a retrieval mechanism based on combination of an inverted mechanism and text feature classification, and the method comprises the following steps: s6-1, extracting keywords by analyzing character strings of the problems input by the user, and sequencing the word frequency similarity of a sentence of the problem text; s6-2, sorting the warehouse-in problems based on the semantic similarity value, so as to obtain a warehouse-in problem sequence; s6-3, acquiring corresponding warehouse-in questions from the warehouse-in question sequence according to a preset similarity value range to serve as a target question set; and S6-4, sorting all the warehouse-in questions in the target question set to obtain a sorting result, selecting the warehouse-in question with the highest similarity value as the best matching question according to the sorting result, and taking the answer corresponding to the best matching question as the health care suggestion.
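Steps S6-1 to S6-4 can be sketched as an inverted-index lookup followed by similarity ranking. This is only an illustration under stated assumptions: keyword extraction is taken as already done, `similarity` is any callable that returns a score where higher means more similar, and the names `build_inverted_index`, `retrieve`, `low` and `high` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(questions):
    """Map each keyword to the ids of the warehouse-in questions that contain it."""
    index = defaultdict(set)
    for qid, keywords in questions.items():
        for keyword in keywords:
            index[keyword].add(qid)
    return index


def retrieve(user_keywords, index, answers, similarity, low=0.5, high=1.0):
    """Return the answer of the best-matching warehouse-in question, or None."""
    # S6-1: collect candidate questions that share a keyword with the user question.
    candidates = set()
    for keyword in user_keywords:
        candidates |= index.get(keyword, set())
    # S6-2: order the candidates by semantic similarity, highest first.
    ranked = sorted(((similarity(qid), qid) for qid in candidates), reverse=True)
    # S6-3: keep only candidates inside the preset similarity range.
    targets = [(score, qid) for score, qid in ranked if low <= score <= high]
    if not targets:
        return None
    # S6-4: the top-ranked question is the best match; return its answer.
    return answers[targets[0][1]]


# Toy data; the keywords, answers and scores are invented for illustration.
questions = {1: ["stomach", "pain"], 2: ["insomnia"], 3: ["stomach", "cold"]}
answers = {1: "A1", 2: "A2", 3: "A3"}
scores = {1: 0.9, 2: 0.99, 3: 0.6}
index = build_inverted_index(questions)
best = retrieve(["stomach"], index, answers, scores.get)
```

Because the index narrows the candidates before any similarity is computed, only questions that share a keyword with the user question need to be scored.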

The actions and effects of the invention

The invention relates to a traditional Chinese medicine health maintenance intelligent question-answering system based on NLP natural language processing, which consists of a user terminal and a management server, wherein the management server comprises a traditional Chinese medicine health maintenance text word segmentation module, a part-of-speech tagging module, a named entity recognition and classification module, a dependency relationship analysis module, a semantic similarity calculation module and an answer retrieval module. After the user inputs clinical presentation information, the word segmentation module and the part-of-speech tagging module first process the input text in sequence; this processing greatly reduces the ambiguity in word-level semantic understanding found in traditional question answering and improves word segmentation accuracy. The named entity recognition and classification module then recognizes and classifies the processed text to determine which entity categories the input clinical presentation information belongs to. Next, the dependency relationship analysis module and the semantic similarity calculation module calculate the semantic similarity between the clinical presentation input by the user and the clinical presentations stored in the system. Finally, the answer retrieval module extracts from the system database the traditional Chinese medicine health care suggestion corresponding to the stored clinical presentation with the highest semantic similarity to the user's input and displays it to the user.

Each module of the system processes quickly and efficiently. The semantic similarity calculation module, together with the answer retrieval module based on a combination of an inverted (index) mechanism and text feature classification, meets the design requirements of high-concurrency application scenarios. Questions and answers match closely and responses are fast, so the system realistically simulates a traditional Chinese medicine health maintenance question-answering scenario and greatly improves the user experience.

Drawings

Fig. 1 is a structural block diagram of a traditional Chinese medicine health maintenance intelligent question-answering system based on NLP natural language processing in an embodiment of the invention;

FIG. 2 is a flowchart of word segmentation preprocessing in an embodiment of the present invention;

FIG. 3 is a flowchart of word segmentation processing performed on clinical presentation text by the Chinese medical science health care word segmentation module in the embodiment of the invention;

FIG. 4 is a schematic diagram illustrating the operation of a part-of-speech tagging module according to an embodiment of the present invention;

FIG. 5 is a diagram showing a matching structure of a similarity calculation module according to an embodiment of the present invention;

fig. 6 is a schematic diagram of an answer retrieval process of the answer retrieval module in the embodiment of the invention.

Detailed Description

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing is described in detail below with reference to the embodiment and the accompanying drawings.

< example >

Fig. 1 is a schematic structural diagram of a traditional Chinese medicine health preserving intelligent question-answering system based on NLP natural language processing according to an embodiment of the invention.

As shown in fig. 1, a system for intelligent question-answering of health care of traditional Chinese medicine based on NLP natural language processing includes a user terminal 11 and a management server 12.

Wherein, a plurality of user terminals are respectively held by a plurality of users.

Through the user terminal 11, the user inputs name, age, occupation, ethnicity, area of residence, marital status, menstrual, marital and childbearing history (for female users), allergy history (present or absent, with details), chronic medical history (diabetes, hypertension, viral hepatitis, etc.) and family medical history, and selects the corresponding chief complaint (the problem most in need of resolution at this visit, including the symptoms and the time of onset or duration), clinical manifestations (accompanying symptoms) and related examination and laboratory data. Together these form the clinical presentation text.

The management server 12 is in communication with the user terminal 11.

The management server 12 includes a traditional Chinese medicine health-preserving word segmentation module 121, a part-of-speech tagging module 122, a named entity recognition and classification module 123, a dependency analysis module 124, a semantic similarity calculation module 125, and an answer retrieval module 126.

The traditional Chinese medicine health-preserving word segmentation module 121 performs word segmentation on the clinical presentation text according to a predetermined word segmentation word stock, thereby obtaining a plurality of traditional Chinese medicine health-preserving words.

FIG. 2 is a flowchart of word segmentation preprocessing in an embodiment of the present invention.

The word segmentation word stock consists of a high-frequency word stock and a segmentation word stock, as shown in fig. 2, and is obtained through the following steps:

Step S1-1, extracting traditional Chinese medicine related text information from different traditional Chinese medicine books.

In this embodiment, the traditional Chinese medicine books include Basic Theory of Traditional Chinese Medicine, Diagnostics of Traditional Chinese Medicine, Prescriptions, and Internal Medicine of Traditional Chinese Medicine, and the related text information covers complaints, clinical manifestations, traditional Chinese medicine diagnosis, traditional Chinese medicine syndromes, prescription composition, treatment advice, diet advice, medicated diet advice, exercise advice, life advice, and acupoint advice.

Step S1-2, performing text segmentation on the clinical presentation text according to the punctuation characters in the clinical presentation text, so as to obtain the individual text fragments.

In this embodiment, the punctuation characters comprise 67 Chinese and English punctuation marks, such as parentheses, braces, square brackets, semicolons, periods, commas, and enumeration commas.

Step S1-3, counting the frequency Count(mn) with which the adjacent characters m and n occur together.

Step S1-4, judging whether the frequency Count(mn) is larger than a preset frequency threshold T.

Step S1-5, when the judgment in step S1-4 is yes, adding the adjacent characters to the high-frequency word stock as a high-frequency word, and when the judgment in step S1-4 is no, adding the characters to the segmentation word stock as a segmentation word.

In this embodiment, all the high-frequency words are then checked, and character strings mn that are not actually words are moved into the segmentation word stock, which improves the accuracy with which segmentation words and high-frequency words are classified.
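Steps S1-2 to S1-5 can be sketched as a bigram count followed by a threshold split. This is a minimal illustration, not the patented implementation: the `PUNCTUATION` set below is a small hypothetical stand-in for the 67 marks mentioned in the embodiment, and the function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical, abbreviated punctuation set used only for this sketch.
PUNCTUATION = "，。；、：？！（）()[]{},.;:?!"


def split_on_punctuation(text):
    """Step S1-2: cut the text into punctuation-free fragments."""
    pattern = "[" + re.escape(PUNCTUATION) + r"\s]+"
    return [fragment for fragment in re.split(pattern, text) if fragment]


def build_word_stocks(corpus, threshold):
    """Steps S1-3 to S1-5: count adjacent character pairs and split them by threshold T."""
    counts = Counter()
    for fragment in split_on_punctuation(corpus):
        for m, n in zip(fragment, fragment[1:]):
            counts[m + n] += 1
    high_frequency = {pair for pair, count in counts.items() if count > threshold}
    segmentation = {pair for pair, count in counts.items() if count <= threshold}
    return high_frequency, segmentation, counts


high_frequency, segmentation, counts = build_word_stocks("胃痛，胃痛。胃寒", threshold=1)
```

The manual check described in the embodiment, which moves non-word strings from the high-frequency word stock to the segmentation word stock, would follow as a separate review pass and is not modeled here.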

Fig. 3 is a flowchart of word segmentation processing performed on a clinical presentation text by the Chinese medical health care word segmentation module.

As shown in fig. 3, the word segmentation performed on the clinical presentation text by the traditional Chinese medicine health-preserving word segmentation module 121 includes the following steps:

Step S2-1, traversing all adjacent character strings mn in the clinical presentation text, and judging whether the adjacent character string mn is an element of the high-frequency word stock;

Step S2-2, when step S2-1 judges yes, the adjacent character string mn forms a word, and when step S2-1 judges no, judging whether the adjacent character string mn is an element of the segmentation word stock;

Step S2-3, when step S2-2 judges yes, the adjacent character string mn does not form a word, and when step S2-2 judges no, calculating the forward probability of the adjacent character string mn;

Step S2-4, judging whether the forward probability is larger than 0.1;

Step S2-5, when step S2-4 judges yes, the adjacent character string mn forms a word, and when step S2-4 judges no, judging whether the forward probability is smaller than 0.001;

Step S2-6, when step S2-5 judges yes, the adjacent character string mn does not form a word, and when step S2-5 judges no, calculating the reverse probability of the adjacent character string mn;

Step S2-7, judging whether the reverse probability is larger than 0.1;

Step S2-8, when step S2-7 judges yes, the adjacent character string mn forms a word, and when step S2-7 judges no, judging whether the reverse probability is smaller than 0.001;

Step S2-9, when step S2-8 judges yes, the adjacent character string mn does not form a word, and when step S2-8 judges no, calculating the relative distance between the characters of the adjacent character string mn;

Step S2-10, judging whether the relative distance is within a preset threshold range;

Step S2-11, when step S2-10 judges yes, the adjacent character string mn forms a word, and when step S2-10 judges no, the adjacent character string mn does not form a word, which completes word segmentation of the clinical presentation text.

In the present embodiment, the threshold range for the relative distance is dis(n, m) < 2, dis(m, n) < 2, or dis(n, m) < 5 and dis(m, n) < 5.
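The decision cascade of steps S2-1 to S2-11 can be sketched as below. The text does not define how the forward probability, the reverse probability or the relative distance dis are computed, so they are passed in as callables; the function names are hypothetical, and the final condition is one possible reading of the threshold range stated above.

```python
def forms_word(m, n, high_freq, seg_stock, forward_prob, reverse_prob, distance):
    """Decide whether adjacent characters m and n belong to the same word."""
    pair = m + n
    if pair in high_freq:                  # S2-1, S2-2: known high-frequency word
        return True
    if pair in seg_stock:                  # S2-2, S2-3: known segmentation point
        return False
    p_forward = forward_prob(m, n)         # S2-3
    if p_forward > 0.1:                    # S2-4, S2-5
        return True
    if p_forward < 0.001:                  # S2-5, S2-6
        return False
    p_reverse = reverse_prob(m, n)         # S2-6
    if p_reverse > 0.1:                    # S2-7, S2-8
        return True
    if p_reverse < 0.001:                  # S2-8, S2-9
        return False
    d_nm, d_mn = distance(n, m), distance(m, n)               # S2-9
    # S2-10, S2-11: assumed reading of the threshold range in the embodiment.
    return d_nm < 2 or d_mn < 2 or (d_nm < 5 and d_mn < 5)


def segment(text, **stocks_and_scorers):
    """Join adjacent characters into words wherever forms_word says they belong together."""
    if not text:
        return []
    words = [text[0]]
    for m, n in zip(text, text[1:]):
        if forms_word(m, n, **stocks_and_scorers):
            words[-1] += n
        else:
            words.append(n)
    return words


# Toy word stocks and scorers, invented for illustration.
settings = dict(
    high_freq={"胃痛"},
    seg_stock={"痛头"},
    forward_prob=lambda m, n: {"头痛": 0.5}.get(m + n, 0.0005),
    reverse_prob=lambda m, n: 0.0,
    distance=lambda a, b: 10,
)
words = segment("胃痛头痛", **settings)
```

Ordering the checks from cheapest (set membership) to most expensive (distance) means most character pairs are resolved by a lookup alone.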

The part-of-speech tagging module 122 tags each traditional Chinese medicine health-preserving word with its part of speech, so that each word has a corresponding part of speech.

FIG. 4 is a schematic diagram illustrating the operation of the part-of-speech tagging module according to an embodiment of the present invention.

In this embodiment, part-of-speech tagging proceeds as follows:

Given a part-of-speech sequence $X = x_1, x_2, \ldots, x_n$ and a vocabulary sequence $Y = y_1, y_2, \ldots, y_n$, tagging seeks the part-of-speech sequence with the maximum probability $P(X \mid Y)$. Specifically:

$$P(X \mid Y) = \frac{P(X) \cdot P(Y \mid X)}{P(Y)}$$

$$P(Y) = 1$$

As shown in fig. 4, the part-of-speech sequence X corresponds to the hidden (invisible) part-of-speech sequence in fig. 4, and the vocabulary sequence Y corresponds to the visible word sequence in fig. 4. When the traditional Chinese medicine health preserving intelligent question-answering system receives the visible word sequence "epigastric pain disease", the words naming the affected site are tagged as symptom locations, and the words for "pain" and "disease" are tagged as symptom manifestations.
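A hidden part-of-speech sequence behind a visible word sequence, scored by P(X)·P(Y|X), is the setting of a hidden Markov model, for which the Viterbi algorithm finds the most probable tag sequence. The sketch below assumes a first-order model; the tag names and all probabilities are toy values invented for illustration and do not come from the patent.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the tag sequence X that maximises P(X) * P(Y | X) for the observed words Y."""
    def log(p):
        return math.log(max(p, floor))   # floor avoids log(0) for unseen events

    # best[i][t] = (log-probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log(trans_p[p].get(t, 0)), p)
                              for p in tags)
            column[t] = (score + log(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag to recover the whole sequence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = ["location", "manifestation"]
start_p = {"location": 0.8, "manifestation": 0.2}
trans_p = {
    "location": {"location": 0.3, "manifestation": 0.7},
    "manifestation": {"location": 0.2, "manifestation": 0.8},
}
emit_p = {
    "location": {"epigastric": 0.9, "pain": 0.05, "disease": 0.05},
    "manifestation": {"epigastric": 0.05, "pain": 0.6, "disease": 0.35},
}
tagged = viterbi(["epigastric", "pain", "disease"], tags, start_p, trans_p, emit_p)
```

Since P(Y) is the same for every candidate tag sequence, it drops out of the maximisation, which is consistent with the text setting P(Y) = 1.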

The named entity recognition and classification module 123 performs named entity recognition on the traditional Chinese medicine health care word by using a pre-trained named entity recognition model according to the traditional Chinese medicine health care word and the corresponding part of speech, so as to obtain a named entity recognition result, and classifies the named entity recognition result according to a predetermined entity category, so that each named entity recognition result has a corresponding entity category.

The training steps of the trained named entity recognition model are as follows:


Step S3-1, obtaining training texts related to traditional Chinese medicine health preserving questions and answers, and segmenting the training texts.

The training text is a dox file containing the contents of gender, age, occupation, ethnicity, marital, region, heart rate, blood pressure, respiration, body temperature, brief medical history, gastroscope report, clinical diagnosis, prescription medicine, nursing scheme, exercise training, life habit, massage and the like.

In this embodiment, after the training text is segmented, the training text is divided into a training sample and a test sample according to a ratio of 3:1, the training sample is used for training the named entity recognition model, and the test sample is used for testing the performance of the named entity recognition model.

Step S3-2, taking ontology concepts in different traditional Chinese medicine books as an initial source dictionary, and acquiring entity categories in the initial source dictionary by using a reverse maximum matching algorithm.

In this embodiment, the ontology concepts of the traditional Chinese medicine books Basic Theory of Traditional Chinese Medicine, Diagnostics of Traditional Chinese Medicine, Prescriptions, and Internal Medicine of Traditional Chinese Medicine serve as the initial source dictionary.

The entity categories comprise twelve entity categories: complaints, clinical manifestations, Western medicine diagnosis, traditional Chinese medicine symptoms, traditional Chinese medicine prescription, prescription composition, treatment advice, medicated diet advice, exercise advice, life advice and acupoint advice.

Step S3-3, performing named entity recognition on the entities according to the entity categories.

In this embodiment, named entities are recognized according to the text frame and the context segments of the named entities of the thirteen entity categories.

Step S3-4, labeling the multi-attribute features of the training samples by using a conditional random field algorithm, so as to obtain labeled training samples.

In this embodiment, based on the conditional random fields (CRFs) algorithm and a multi-feature fusion approach, the training samples are labeled with multi-attribute features through manual annotation combined with dictionary lookup, and meaningless words (such as the modal particle "ba") are discarded.

Step S3-5, formulating a corresponding feature template according to the characteristics of the training samples.

Step S3-6, training with the labeled training samples and the feature template, thereby obtaining a trained named entity recognition model.

In this embodiment, a model file is trained based on a designed named entity recognition model flow by using a labeled training sample and a corresponding feature template, and entity term recognition is performed on a test sample according to the trained model file. The training model file is a training corpus sample.
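Two of the building blocks named in steps S3-2 and S3-5 can be sketched as follows: reverse maximum matching against a dictionary that maps terms to entity categories, and a per-token feature function of the kind a CRF feature template describes. The dictionary entries, feature names and function names are hypothetical; the patent does not disclose its actual template.

```python
def reverse_maximum_match(text, dictionary):
    """Scan from the end of the text, always taking the longest dictionary term that fits."""
    max_len = max((len(term) for term in dictionary), default=1)
    matches = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted so the scan cannot stall.
            if candidate in dictionary or size == 1:
                matches.append((candidate, dictionary.get(candidate)))
                end -= size
                break
    return matches[::-1]


def token_features(tokens, i):
    """Features for token i; tokens is a list of (word, part_of_speech) pairs."""
    word, pos = tokens[i]
    features = {"word": word, "pos": pos, "length": len(word)}
    if i > 0:
        features["prev_word"], features["prev_pos"] = tokens[i - 1]
    else:
        features["BOS"] = True   # beginning of sentence
    if i < len(tokens) - 1:
        features["next_word"], features["next_pos"] = tokens[i + 1]
    else:
        features["EOS"] = True   # end of sentence
    return features


# Toy dictionary; terms and categories are illustrative only.
dictionary = {"胃脘痛": "clinical manifestation", "针灸": "treatment advice"}
matches = reverse_maximum_match("胃脘痛宜针灸", dictionary)
features = [token_features([("胃脘", "n"), ("疼痛", "v")], i) for i in range(2)]
```

Feature dictionaries of this shape are a common input format for CRF toolkits, which then learn a weight for each feature and label combination during step S3-6.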

The dependency analysis module 124 analyzes the relationships between the components of the clinical presentation text based on the traditional Chinese medicine health care word, the corresponding part of speech and the entity class, thereby obtaining the clinical presentation text semantics.

Wherein the step of analyzing the clinical presentation text by the dependency analysis module 124 is as follows:

Step S4-1, extracting features from all the traditional Chinese medicine health-preserving words, all the parts of speech and all the entity categories, respectively, so as to obtain corresponding word features, part-of-speech features and word label features.

Step S4-2, mapping the word features, the part-of-speech features and the word label features into corresponding word vectors, part-of-speech vectors and word label vectors, respectively, and splicing all the word vectors, all the part-of-speech vectors and all the word label vectors, respectively, according to a preset dimension W to obtain a W-dimension word vector, a W-dimension part-of-speech vector and a W-dimension word label vector.

The semantic similarity calculation module 125 calculates semantic similarity between the user question in the clinical presentation text and different warehousing questions in the predetermined question-answer library based on the clinical presentation text semantic, thereby obtaining a semantic similarity value between the user question and each warehousing question.

Fig. 5 is a matching structure diagram of a semantic similarity calculation module in the embodiment of the present invention.

The semantic similarity calculation module 125 calculates this semantic similarity in the following steps:

Step S5-1, performing word embedding on the user question and on each warehoused question in the question-answer library based on Google BERT, thereby obtaining the corresponding user question word vectors and warehoused question word vectors.

Step S5-2, mapping the user question word vectors and the warehoused question word vectors with a mirrored (Siamese) network based on LSTM-CNN, thereby obtaining the corresponding user question final vector and warehoused question final vectors.

In this embodiment, during LSTM-CNN training, the end user's questions and the system's warehoused questions are embedded as 200-dimensional word vectors, random initialization uses a Gaussian distribution, and cross-entropy is used as the loss function.

The optimization algorithm is RMSprop, and gradients whose norm exceeds a set value are clipped, so that gradient explosion is avoided.

Step S5-3, calculating the distance between the user question final vector and each warehoused question final vector by using a Manhattan distance function, and taking the distance as the semantic similarity value.
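Step S5-3 can be illustrated with a minimal sketch of the Manhattan (L1) distance between final vectors. The vectors and question ids below are made up for illustration, and the helper names are hypothetical. The text takes the distance itself as the similarity value, so the sketch assumes that a smaller distance means a closer match and sorts in ascending order.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

def rank_by_distance(user_vec, stored_vecs):
    """Return (question_id, distance) pairs, closest stored question first."""
    scored = [(qid, manhattan_distance(user_vec, vec))
              for qid, vec in stored_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Toy final vectors; real ones would come from the network in step S5-2.
user_vec = [0.5, 1.0, 0.0]
stored = {
    "q1": [0.5, 1.0, 1.0],
    "q2": [0.0, 0.0, 0.0],
    "q3": [0.5, 0.75, 0.0],
}
ranking = rank_by_distance(user_vec, stored)
```

Here `ranking[0]` identifies the stored question whose final vector lies closest to the user's question.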

According to the semantic similarity values, the answer retrieval module 126 uses a predetermined retrieval mechanism to extract, from the questions stored in the system of this embodiment, the question with the highest semantic similarity to the text of the user's question together with its corresponding answer, and uses the extracted answer as the health care suggestion.

In a traditional question-answering system, machine learning and deep learning methods are generally used to improve the semantic match between the question asked by the user and the questions stored in the system: answer ranking and question extraction are treated as text classification tasks, and a classification model is trained. This approach must compute the text semantic similarity against every stored question, and must recompute and re-extract it for each query. It is therefore unsuitable for a question-answering system with high concurrency and real-time response requirements. Every extraction searches the system database once, so the more extraction tasks there are, the longer the response delay of the question-answering system; because server resources are limited, the heavy resource consumption becomes a bottleneck for improving the performance of the question-answering system.

The goal of a query is to extract, from the questions stored in the system, the one closest in text semantics to the query question. In the conventional process, similarity is computed against the stored questions one by one and the closest one is extracted; because the database is queried once for every question retrieved, this reduces system performance.

As shown in fig. 6, to solve these problems, the retrieval mechanism designed in this embodiment first builds an index library of the system's warehoused questions and then converts the question retrieval into a retrieval string. Based on an inverted mechanism and text feature classification, it obtains the set of questions whose text semantic similarity is closest within a given range, and then reorders that set. During reordering, it extracts surface features, text semantic space features, bag-of-words model features and topic model features between the question texts, and uses these dimensions to extract the warehoused question in the set that is semantically closest. Specifically:

In this embodiment, the predetermined retrieval mechanism is a retrieval mechanism based on a combination of an inverted mechanism and text feature classification, and its workflow includes the following steps:

Step S6-1, parsing the character string of the question input by the user to extract keywords, and ranking by the word-frequency similarity of the question text sentence;

Step S6-2, sorting the warehoused questions based on the semantic similarity values, so as to obtain a warehoused question sequence;

Step S6-3, acquiring the corresponding warehoused questions from the warehoused question sequence according to a preset similarity value range to serve as a target question set;

Step S6-4, sorting all the warehoused questions in the target question set to obtain a sorting result, selecting the warehoused question with the highest similarity value as the best matching question according to the sorting result, and taking the answer corresponding to the best matching question as the health care suggestion.
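The workflow of steps S6-1 to S6-4 can be sketched roughly as follows. This is not the embodiment's implementation: it assumes that the "inverted mechanism" refers to a keyword inverted index, that the similarity values are already available with higher meaning closer, and that keywords have already been extracted. The feature-based reordering (surface, bag-of-words and topic model features) is omitted, and all names and data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(questions):
    """Map each keyword to the ids of the stored questions that contain it."""
    index = defaultdict(set)
    for qid, keywords in questions.items():
        for kw in keywords:
            index[kw].add(qid)
    return index

def retrieve(user_keywords, index, similarity, low, high):
    """Recall candidates through the index, keep those whose similarity value
    lies in [low, high], and return the best match (None if there is none)."""
    candidates = set()
    for kw in user_keywords:
        candidates |= index.get(kw, set())
    # Sorting the ids makes the result deterministic if two values tie.
    in_range = sorted(qid for qid in candidates if low <= similarity[qid] <= high)
    if not in_range:
        return None
    return max(in_range, key=lambda qid: similarity[qid])

# Toy question library: id -> keywords (assumed to be pre-extracted).
questions = {
    "q1": ["insomnia", "dreams"],
    "q2": ["cough", "phlegm"],
    "q3": ["insomnia", "fatigue"],
}
# Assumed similarity of each stored question to the user's question.
similarity = {"q1": 0.9, "q2": 0.95, "q3": 0.6}

index = build_inverted_index(questions)
best = retrieve(["insomnia"], index, similarity, 0.5, 1.0)
```

Because only the questions recalled through the index are scored, `q2` is never considered for the keyword "insomnia" even though its similarity value is the largest in the toy data; this is the saving the mechanism aims for compared with scoring every stored question.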

In this embodiment, the system stores the computed semantic vectors of the question features in a NoSQL database, and when the feature matching degree is calculated, the system queries the NoSQL database first, which reduces the time cost and server resource overhead of recomputing semantic vectors through text feature extraction.
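The caching idea can be shown with a minimal sketch in which an in-memory dictionary stands in for the NoSQL database. The class, its interface and the placeholder vector function are hypothetical; a real deployment would call the chosen NoSQL client and the actual semantic model instead.

```python
class VectorCache:
    """In-memory stand-in for the NoSQL store (hypothetical interface)."""

    def __init__(self, compute_fn):
        self._store = {}            # question text -> semantic vector
        self._compute = compute_fn  # expensive vector computation
        self.misses = 0             # how often the vector had to be computed

    def get(self, question):
        """Return the cached vector, computing and storing it on first use."""
        if question not in self._store:
            self.misses += 1
            self._store[question] = self._compute(question)
        return self._store[question]

def toy_vector(text):
    # Placeholder for the real semantic-vector computation (assumption).
    return [len(text), text.count(" ")]

cache = VectorCache(toy_vector)
first = cache.get("how to relieve insomnia")
second = cache.get("how to relieve insomnia")  # served from the cache
```

The second lookup does not invoke `toy_vector` again, which mirrors the intended saving in computation time and server resources.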

Operation and Effects of the Embodiment

The traditional Chinese medicine health-preserving intelligent question-answering system based on NLP natural language processing according to this embodiment consists of a user terminal and a management server, wherein the management server comprises a traditional Chinese medicine health-preserving text word segmentation module, a part-of-speech tagging module, a named entity recognition and classification module, a dependency relationship analysis module, a semantic similarity calculation module and an answer retrieval module. After the user inputs clinical manifestation information, the word segmentation module and the part-of-speech tagging module first process the input text in sequence. The named entity recognition and classification module then recognizes and classifies the processed text to determine which entity category the input clinical manifestation information belongs to. Next, the dependency relationship analysis module and the semantic similarity calculation module calculate the semantic similarity between the clinical manifestation input by the user and the clinical manifestations stored in the system. Finally, the answer retrieval module extracts from the system database the traditional Chinese medicine health-preserving suggestion corresponding to the stored clinical manifestation with the highest semantic similarity to the user's input, and displays it to the user.

In this embodiment, traditional Chinese medicine books such as Basic Theory of Traditional Chinese Medicine, Diagnostics of Traditional Chinese Medicine, Science of Prescriptions and Science of Chinese Medicines are used as the dictionary, and the final Chinese word segmentation result is obtained by combining the forward probability, the reverse probability and the relative distance. Labeling the part of speech of the current text based on the parts of speech and the words that precede it greatly reduces ambiguity in understanding word semantics, so the word segmentation accuracy in the traditional Chinese medicine health-preserving field is higher than that of conventional Chinese word segmentation tools.

The traditional Chinese medicine health-preserving intelligent question-answering system based on NLP natural language processing in this embodiment can also recognize twelve types of entities, including chief complaints, clinical manifestations, western medicine diagnosis, traditional Chinese medicine symptoms, traditional Chinese medicine prescriptions, prescription composition, treatment advice, medicated diet advice, exercise advice, life advice and acupoint advice, so the entity types it covers are richer than those of conventional question-answering systems.

In this embodiment, a neural-network-based context dependency analysis of the text is designed; the model used in this process is simple and fast and gives good results.

The semantic similarity calculation model designed in this embodiment for the text of traditional Chinese medicine health-preserving clinical complaints calculates the semantic similarity between the end user's question and the questions stored in the system, extracts the stored question that is semantically closest to the user's question, and finally extracts the answer corresponding to that question. This model fits traditional Chinese medicine health-preserving question-answering scenarios better.

In addition, the retrieval model in the answer retrieval module designed in this embodiment stores the computed semantic vectors of the question features in a NoSQL database. When the feature matching degree is calculated, the NoSQL database is queried first, which reduces the time cost and server resource overhead of recomputing semantic vectors through text feature extraction and greatly improves the timeliness of the model's response.

The above examples are only for illustrating the specific embodiments of the present invention, and the present invention is not limited to the description scope of the above examples.

Claims

1. An intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing is used for carrying out natural language processing on clinical presentation texts input by users so as to obtain corresponding health maintenance suggestions, and is characterized by comprising the following components:

a plurality of user terminals held by the users; and

a management server in communication with the user terminal,

wherein the management server comprises a Chinese medicine health-preserving character word segmentation module, a part-of-speech labeling module, a named entity recognition and classification module, a dependency relationship analysis module, a semantic similarity calculation module and an answer retrieval module,

the Chinese medical health care word segmentation module carries out word segmentation processing on the clinical presentation text according to a preset word segmentation word stock so as to obtain a plurality of Chinese medical health care word segments,

the part of speech tagging module tags each Chinese medical science health care word with a corresponding part of speech,

the named entity recognition and classification module performs named entity recognition on the traditional Chinese medical science health care word by utilizing a pre-trained named entity recognition model according to the traditional Chinese medical science health care word and the corresponding part of speech, thereby obtaining a named entity recognition result, and classifies the named entity recognition result according to a preset entity category, thereby enabling each named entity recognition result to have a corresponding entity category,

the dependency relationship analysis module analyzes the relationship among the components of the clinical presentation text based on the traditional Chinese medicine health preserving word, the corresponding part of speech and the entity category, thereby obtaining the clinical presentation text semantic,

the semantic similarity calculation module calculates semantic similarity between user questions in the clinical presentation text and different warehousing questions in a preset question-answer library based on the clinical presentation text semantic, thereby obtaining semantic similarity values between the user questions and the warehousing questions,

and the answer retrieval module retrieves answers corresponding to the user questions from the question-answer library by utilizing a preset retrieval mechanism according to the semantic similarity value to serve as the health care suggestions.

2. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

the word segmentation word stock consists of a high-frequency word stock and a segmentation word stock, and is obtained through the following steps:

Step S1-1, extracting relevant Chinese character information from different Chinese medicine books;

Step S1-2, performing text segmentation on the clinical presentation text according to punctuation characters in the clinical presentation text, so as to obtain various texts;

Step S1-3, counting the frequency with which adjacent characters appear together;

Step S1-4, judging whether the frequency is larger than a preset frequency threshold T;

Step S1-5, when step S1-4 judges yes, generating the high-frequency word stock by using the adjacent characters as high-frequency words, and when step S1-4 judges no, generating the segmentation word stock by using the characters as segmentation words.

3. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

the word segmentation processing of the clinical presentation text by the Chinese medical health care word segmentation module comprises the following steps:

Step S2-2, when the step S2-1 judges yes, the adjacent character string mn forms words, and when the step S2-1 judges no, whether the adjacent character string mn is an element of a segmentation word stock is judged;

Step S2-3, when the step S2-2 judges yes, the adjacent character strings mn do not form words, and when the step S2-2 judges no, the forward probabilities among the adjacent character strings mn are calculated;

Step S2-4, judging whether the forward probability is larger than 0.1;

Step S2-6, when the step S2-5 judges yes, the adjacent character string mn is not word-forming, and when the step S2-5 judges no, the reverse probability of the adjacent character string mn is calculated;

Step S2-7, judging whether the reverse probability is larger than 0.1;

Step S2-9, when the step S2-8 judges yes, the adjacent character strings mn are not word-formed, and when the step S2-8 judges no, the relative distance of the adjacent character strings mn is calculated;

Step S2-10, judging whether the relative distance is within a preset threshold range or not;

4. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

the training steps of the named entity recognition model are as follows:

Step S3-1, obtaining training texts related to health preserving questions and answers of the traditional Chinese medicine, and segmenting the training texts;

Step S3-2, taking ontology concepts in different Chinese medicine books as an initial source dictionary, and acquiring entity categories in the initial source dictionary by using a reverse maximum matching algorithm;

Step S3-3, carrying out entity naming identification on the entity according to the entity category;

Step S3-4, labeling the multi-attribute characteristics of the training text by using a conditional random field algorithm, so as to obtain a labeled training sample;

Step S3-5, formulating a corresponding characteristic template according to the characteristics of the training sample;

Step S3-6, training by using the marked training sample and the characteristic template, thereby obtaining a trained named entity recognition model,

the entity category comprises complaints, clinical manifestations, western diagnosis, chinese medicine symptoms, chinese medicine prescription, prescription composition, treatment advice, medicated diet advice, exercise advice, life advice and acupoint advice.

5. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

wherein, the step of the dependency relation analysis module analyzing the clinical presentation text is as follows:

Step S4-1, respectively extracting features of all Chinese medical health preserving word segmentation, all parts of speech and all entity categories, so as to obtain corresponding word features, part of speech features and word label features;

Step S4-2, mapping the word features, the part-of-speech features and the word label features into corresponding word vectors, part-of-speech vectors and word label vectors, respectively, and concatenating all the word vectors, all the part-of-speech vectors and all the word label vectors, respectively, according to a preset dimension W to obtain a W-dimensional word vector, a W-dimensional part-of-speech vector and a W-dimensional word label vector;

Step S4-4, predicting the transfer step and the type of the edge based on the new feature vector, so as to determine the relation among all the components of the clinical presentation text and obtain the clinical presentation text semantics.

6. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

the semantic similarity calculation module calculates semantic similarity between user questions in the clinical presentation text and different warehouse-in questions in a preset question-answering library based on the clinical presentation text semantic, and comprises the following steps:

Step S5-1, word embedding is carried out on the user questions and the warehousing questions in the question-answering library based on Google BERT, so that corresponding user question word vectors and warehousing question word vectors are obtained;

Step S5-2, mapping the user question word vector and the warehouse-in question word vector by using a mirror image network based on LSTM-CNN, so as to obtain a corresponding user question final vector and a corresponding warehouse-in question final vector;

Step S5-3, calculating the distance between the final vector of the user problem and the final vector of each warehousing problem by using a Manhattan distance function, and taking the distance as the semantic similarity value.

7. The intelligent question-answering system of traditional Chinese medicine health maintenance based on NLP natural language processing of claim 1, wherein the system is characterized in that:

the retrieval mechanism is a retrieval mechanism based on combination of an inverted mechanism and text feature classification, and comprises the following steps:

Step S6-2, sorting the warehousing problems based on the semantic similarity values, so as to obtain a warehousing problem sequence;

Step S6-3, acquiring corresponding warehousing problems from the warehousing problem sequence according to a preset similarity value range to serve as a target problem set;