CN116401543A - Text data label generation method based on artificial intelligence and related equipment - Google Patents
- Publication number
- CN116401543A (application CN202310295186.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- text data
- similarity
- inquiry
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/60—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
- G16H40/67—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for remote operation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H80/00—ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to the fields of artificial intelligence and digital healthcare, and provides an artificial-intelligence-based text data label generation method, together with a corresponding device, electronic equipment and storage medium. The method comprises the following steps: acquiring doctors' online consultation data to obtain a consultation text data set; clustering the consultation text data set to obtain a plurality of consultation text data clusters; designing task labels and generating a keyword list for each; calculating a first text similarity between each consultation text data cluster and the keyword list according to a cosine similarity algorithm; calculating a second text similarity between the cluster and the keyword list according to a text similarity algorithm; and matching each consultation text data cluster to its corresponding task label based on the first and second text similarities. The method and device can automatically match the collected consultation data with the corresponding labels, thereby improving the efficiency of text data label acquisition.
Description
Technical Field
The application relates to the technical field of artificial intelligence and digital medical treatment, in particular to a text data label generation method, a device, electronic equipment and a storage medium based on artificial intelligence.
Background
In assisted diagnosis and treatment scenarios, massive online consultations generate large amounts of text data. This data is the basis for producing data labels in many supervised learning tasks, such as dialogue-script library construction, intention recognition, search and recommendation, and dialogue systems. These supervised learning tasks in turn greatly assist online consultation, improving consultation efficiency and conversion rate.
At present, most data labels are generated manually. Methods that rely on manual definition cannot be produced at scale, so the resulting label system is neither very extensible nor very rich. Moreover, as the volume of text data grows exponentially, manually generating data labels becomes increasingly time-consuming and labor-intensive, so the label generation efficiency for text data is low.
Disclosure of Invention
In view of the foregoing, it is necessary to propose a text data tag generation method, apparatus, electronic device and storage medium based on artificial intelligence, so as to solve the technical problem of how to improve the tag generation efficiency of text data. The related equipment comprises a text data label generating device based on artificial intelligence, electronic equipment and a storage medium.
The application provides a text data label generation method based on artificial intelligence, which comprises the following steps:
acquiring on-line consultation data of a doctor to acquire a consultation text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
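The steps above can be sketched end to end as follows. All step functions (`cluster`, `cosine_sim`, `text_sim`) are assumed stand-ins injected by the caller; nothing here reflects the patent's actual implementation.

```python
# End-to-end sketch of the claimed method: cluster the consultation
# texts, score each cluster against every label's keyword list with
# two similarity measures combined by product, and pick the best label.
def generate_labels(raw_consultations, label_keywords,
                    cluster, cosine_sim, text_sim):
    clusters = cluster(raw_consultations)  # clustering step
    labels = {}
    for cid, texts in clusters.items():
        # first and second text similarities, combined by multiplication
        scores = {lab: cosine_sim(texts, kws) * text_sim(texts, kws)
                  for lab, kws in label_keywords.items()}
        labels[cid] = max(scores, key=scores.get)  # best-scoring label
    return labels
```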
In some embodiments, the acquiring the on-line consultation data of the doctor to obtain the consultation text data set comprises:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
And storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In some embodiments, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In some embodiments, the designing the task tag according to the preset manner to generate the keyword list includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In some embodiments, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain a first text similarity includes:
Respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In some embodiments, the calculating the similarity between the query text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity includes:
calculating an average value of text vector similarity between a target inquiry text and all keyword vectors in the keyword list as a target average similarity of the target inquiry text, wherein the target inquiry text is any one of the inquiry text data clusters;
Traversing the inquiry text data cluster to obtain the average similarity of the targets corresponding to each inquiry text;
and calculating the average value of all the target average similarities as the first text similarity of the inquiry text data cluster.
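The two-level averaging described above (mean over keywords per text, then mean over texts in the cluster) can be sketched as follows, assuming the pairwise text-vector similarities are already available as a matrix:

```python
# First-text-similarity sketch: sim_matrix[t][k] holds the similarity
# between query text t of a cluster and keyword k of a keyword list.
def first_text_similarity(sim_matrix):
    # target average similarity of each text: mean over all keywords
    per_text = [sum(row) / len(row) for row in sim_matrix]
    # first text similarity of the cluster: mean over all texts
    return sum(per_text) / len(per_text)
```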
In some embodiments, the matching the task labels for the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity and the second text similarity to obtain inquiry text similarity;
sequencing the similarity of the inquiry texts according to the sequence from big to small to obtain a sequencing result;
and determining task labels corresponding to the inquiry text data clusters based on the sorting results.
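A minimal sketch of this matching step, assuming the two similarity scores are supplied as dictionaries keyed by (cluster, label) pairs (an illustrative data layout, not the patent's):

```python
# Multiply the two similarities per (cluster, label) pair, sort the
# combined scores in descending order, and assign each cluster the
# first (highest-scoring) label encountered.
def match_labels(first_sim, second_sim):
    combined = {k: first_sim[k] * second_sim[k] for k in first_sim}
    best = {}
    for (cluster, label), score in sorted(combined.items(),
                                          key=lambda kv: -kv[1]):
        best.setdefault(cluster, (label, score))
    return {c: lab for c, (lab, _) in best.items()}
```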
The embodiment of the application also provides a text data label generating device based on artificial intelligence, which comprises an acquisition module, a clustering module, a generating module, a first calculating module, a second calculating module and a matching module:
the acquisition module is used for acquiring on-line consultation data of a doctor to acquire a consultation text data set;
the clustering module is used for clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
The generating module is used for designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
the first calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm so as to obtain a first text similarity;
the second calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm so as to obtain second text similarity;
the matching module is configured to match the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity.
The embodiment of the application also provides electronic equipment, which comprises:
a memory storing at least one instruction;
and the processor executes the instructions stored in the memory to realize the text data label generating method based on artificial intelligence.
Embodiments of the present application also provide a computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executed by a processor in an electronic device to implement the artificial intelligence-based text data tag generation method.
According to the method and the device, the collected inquiry data are clustered, the keyword list corresponding to the task label is created to perform different text similarity calculation with the clustered inquiry data, so that the inquiry data of each category are automatically matched with more accurate and proper task labels, and the acquisition efficiency of the text data labels is effectively improved.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of an artificial intelligence based text data tag generation method in accordance with the present application.
Fig. 2 is a functional block diagram of a preferred embodiment of an artificial intelligence based text data tag generating apparatus according to the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the text data tag generation method based on artificial intelligence according to the present application.
Detailed Description
In order that the objects, features and advantages of the present application may be more clearly understood, the application is described in detail below with reference to specific embodiments illustrated in the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with each other. In the following description, numerous specific details are set forth to provide a thorough understanding of the present application; the described embodiments are merely some, rather than all, of the embodiments of the present application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The embodiment of the application provides a text data label generation method based on artificial intelligence, which can be applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product that can interact with a customer in a human-machine manner, such as a personal computer, tablet, smart phone, personal digital assistant (Personal Digital Assistant, PDA), gaming machine, interactive web television (Internet Protocol Television, IPTV), smart wearable device, etc.
The electronic device may also include a network device and/or a client device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud computing environment composed of a large number of hosts or network servers.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
As shown in FIG. 1, a flow chart of a preferred embodiment of the artificial intelligence based text data tag generation method of the present application is shown. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs.
In an alternative embodiment, the large amount of text data generated by online consultation can provide data support for many supervised learning tasks, such as dialogue-script library construction, intention recognition, and disease diagnosis. Corresponding text data can therefore be collected for different supervised learning tasks, improving the efficiency of the data acquisition process.
S10, acquiring on-line consultation data of a doctor to acquire a consultation text data set.
In an alternative embodiment, the acquiring the on-line consultation data of the doctor to acquire the consultation text data set includes:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In this optional embodiment, web crawler technology may be used to collect the message data sent by multiple doctors during each online consultation. Each text message sent by a doctor is taken as one piece of consultation text data, and all acquired consultation text data form the initial consultation text data set.
In this alternative embodiment, the consultation text data set may contain unnecessary entries such as "good", "OK", "yes", "hello" and "?", i.e. text messages that carry little semantic information. Such redundant consultation text data can be screened out and filtered by writing a script command that pre-specifies which consultation text data are not needed.
In this optional embodiment, the data in the screened initial consultation text data set may be stored in a pre-constructed consultation text database, which may be a common text database such as txtSQL, TXTDB or openSSL. In this embodiment, the screened initial consultation text data set stored in the consultation text database is used as the consultation text data set.
Therefore, by collecting message data of doctors in the on-line consultation process and screening redundant message data, a consultation text data set which can provide accurate data support for the follow-up process can be quickly obtained.
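A minimal sketch of the screening step; the stop-list entries and the `min_len` cutoff below are illustrative assumptions, not values from the patent:

```python
# Drop messages that carry little semantic content before they enter
# the consultation text database.
STOP_MESSAGES = {"good", "ok", "yes", "hello", "?"}  # assumed examples

def screen(messages, min_len=2):
    # keep only messages that are long enough and not on the stop list
    return [m for m in messages
            if m.strip().lower() not in STOP_MESSAGES
            and len(m.strip()) >= min_len]
```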
S11, clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters.
In an optional embodiment, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In this alternative embodiment, each piece of query text data in the query text data set may be represented as a binary vector through one-hot encoding, and the binary vector of each piece of query text data is used as its query text vector. For example, a patient's three symptoms "fever", "nasal obstruction" and "cough" can be represented after one-hot encoding as the binary vectors "100", "010" and "001", respectively.
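The one-hot conversion in the symptom example can be sketched as follows (the vocabulary is taken from the example above):

```python
# One-hot sketch: each symptom maps to a binary vector over a fixed
# symptom vocabulary, with a 1 at the symptom's own position.
def one_hot(symptom, vocab):
    return [1 if s == symptom else 0 for s in vocab]

vocab = ["fever", "nasal obstruction", "cough"]
```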
In this alternative embodiment, the preset text clustering algorithm may be the single-pass algorithm, which does not require the number of clusters to be specified in advance and completes clustering by sequentially calculating the similarity between each incoming query text vector and the existing query text vectors. The single-pass algorithm usually measures the similarity between data with cosine similarity or Euclidean distance; since this method mainly targets text data, the edit distance, which is better suited to text, can be substituted instead, yielding a more accurate similarity between query text vectors. The single-pass algorithm optimized with edit distance is used as the text clustering algorithm in this method.
In this alternative embodiment, all query text vectors may be clustered by the text clustering algorithm to obtain query text data clusters of multiple categories. For example, suppose there are 10 query text vectors numbered 1 to 10; the main clustering process is as follows:
a) Randomly selecting one inquiry text vector as a clustering seed, for example, taking the inquiry text vector with the number of 1 as the clustering seed;
b) Calculating the similarity between the clustering seeds and other 9 inquiry text vectors through editing distances;
c) If the similarity is greater than a preset similarity threshold, classifying the corresponding inquiry text vector and the clustering seeds into a category, for example, the similarity between the inquiry text vectors 2, 3 and 4 and the inquiry text vector 1 is greater than the preset similarity threshold, so that the inquiry text vectors 1, 2, 3 and 4 are used as a category;
d) If the similarity is not greater than the preset similarity threshold, one of the query text vectors below the threshold is selected as another clustering seed, and the similarity between the remaining query text vectors and each clustering seed continues to be calculated, and so on, until all query text vectors have been traversed and assigned to their categories. For example, if the similarity between query text vector 5 and the clustering seed is smaller than the preset similarity threshold, query text vector 5 becomes a new clustering seed, and its similarity to query text vectors 6, 7, 8, 9 and 10 is calculated in turn. If the similarity between query text vector 5 and query text vectors 6 and 7 is greater than the preset similarity threshold, query text vectors 5, 6 and 7 are classified into one category. Finally, one of the query text vectors 8, 9 or 10 is taken as another clustering seed; for example, with query text vector 8 as the latest seed, if its similarity to query text vectors 9 and 10 is greater than the preset similarity threshold, query text vectors 8, 9 and 10 form another category. Three query text data clusters are thus finally obtained.
In the scheme, the finally obtained inquiry text data corresponding to the inquiry text vectors of all the categories are respectively used as inquiry text data clusters, so that the inquiry text data with similar texts are gathered together.
Therefore, through optimizing a text clustering algorithm, the text data of each inquiry with similar texts can be more accurately classified.
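The optimized single-pass procedure described in steps a) through d) can be sketched as follows. The similarity threshold and the normalization of edit distance into a [0, 1] similarity are illustrative choices, not taken from the patent:

```python
# Classic dynamic-programming Levenshtein (edit) distance.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (or match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# Turn edit distance into a similarity in [0, 1] (assumed scaling).
def similarity(a, b):
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

# Single-pass clustering: each text joins the first cluster whose seed
# it is similar enough to, otherwise it becomes a new cluster seed.
def single_pass(texts, threshold=0.5):
    clusters = []  # list of lists; the first member of each is its seed
    for t in texts:
        for c in clusters:
            if similarity(t, c[0]) > threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters
```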
S12, designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one.
In an optional embodiment, the designing the task tag according to the preset manner to generate a keyword list, where the tag corresponds to the keyword list one to one, includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In this optional embodiment, a plurality of task labels corresponding to the task category may be designed in advance according to different supervised learning tasks. In this embodiment, recognizing the doctor's intention during online consultation is taken as the example supervised learning task, and the task labels may be "cause", "life advice", "patient comfort", "rehabilitation expectation", "medicine efficacy" and "medication inquiry", each representing a doctor's intention. Label classification of the query text data set is achieved by matching each query text data cluster to its corresponding task label, thereby identifying the doctor's intention behind each piece of text data in the query text data set. For example, if one piece of text data in the query text data set is "your illness is not serious", the supervised learning task automatically identifies its task label as "patient comfort", i.e. the doctor's intention at that moment is to comfort the patient.
In this alternative embodiment, a plurality of keywords corresponding to each task tag may be configured in advance by a medical expert, and the plurality of keywords corresponding to each task tag may be used as the keyword list of the task tag.
As an example, Table 1 shows the keyword list configured for each task label in this scheme.
TABLE 1
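Since the contents of Table 1 are not reproduced in this text, the following sketch shows only the label-to-keyword-list structure it describes; every keyword entry below is a placeholder, not taken from the patent:

```python
# Illustrative label -> keyword-list mapping; one keyword list per
# task label, as the text specifies (one-to-one correspondence).
KEYWORD_LISTS = {
    "cause":           ["cause", "reason", "trigger"],
    "life advice":     ["rest", "diet", "exercise"],
    "patient comfort": ["don't worry", "not serious", "relax"],
}

def keywords_for(label):
    # empty list for labels with no configured keywords
    return KEYWORD_LISTS.get(label, [])
```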
Therefore, corresponding task labels can be designed according to different task categories, and corresponding keywords are configured for each task label, so that the text data can be conveniently matched with the corresponding task labels according to the keywords in the subsequent process.
And S13, calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity.
In an alternative embodiment, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain the first text similarity includes:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
Carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In this alternative embodiment, the word vector of each query text in the query text data cluster and of each keyword in the keyword list may be calculated through a word2vec model, giving the text word vectors and the keyword vectors. word2vec is an open-source word vector tool that converts the words in the query texts and the keywords into word vectors, where a word vector consists of numbers in multiple dimensions. For example, for the query text "attention rest", the words "attention" and "rest" can be converted by word2vec into the two text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361].
In this optional embodiment, the average value of the elements in each text word vector may be calculated as the text word value corresponding to that text word vector, that is, text word vectors and text word values are in one-to-one correspondence, and all text word values corresponding to a query text are used as elements to generate the query text vector of that query text. Illustratively, the element averages of the text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361] are 0.217 and -0.042 respectively, so the query text "attention rest" corresponds to the query text vector [0.217, -0.042].
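As a minimal sketch of this averaging step, using the example vectors above (a real system would obtain the word vectors from a trained word2vec model, e.g. via gensim; the function names here are illustrative):

```python
# Hypothetical word2vec outputs for the query text "attention rest";
# a trained word2vec model would supply the real vectors.
WORD_VECTORS = {
    "attention": [0.682, 0.388, -0.466, 0.262],
    "rest": [0.823, -0.588, -0.765, 0.361],
}

def text_word_value(vec):
    # Average of the elements of one text word vector -> one scalar per word.
    return sum(vec) / len(vec)

def query_text_vector(words):
    # One text word value per word, in word order.
    return [text_word_value(WORD_VECTORS[w]) for w in words]

vec = query_text_vector(["attention", "rest"])
print([round(v, 3) for v in vec])  # ≈ [0.217, -0.042]
```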
Let S_pq denote the query text vector of the q-th text in the p-th query text data cluster, S_pq = [S_pq,0, S_pq,1, ..., S_pq,s, ..., S_pq,l-1], where S_pq,s denotes the s-th text word value of the q-th text in the p-th query text data cluster, and l is the number of text word values of the q-th text in the p-th query text data cluster.
In this optional embodiment, in order to make the distribution of the query text vectors within each query text data cluster uniform in each dimension, the query text vector corresponding to each query text may be normalized to obtain a standard text vector, where the standard text vector satisfies the relation:

S'_pq = (S_pq - u_p) / α_p

where S'_pq denotes the standard text vector of the q-th text in the p-th query text data cluster, and u_p and α_p are respectively the mean and the standard deviation of the query text vectors in the p-th query text data cluster.
For example, query text data cluster a includes 3 texts whose query text vectors are S_p1 = [0.2, 100], S_p2 = [0.9, 120], S_p3 = [0.8, 130]. The mean and standard deviation of the query text vectors in cluster a can be calculated as u_p = [0.633, 116.67] and α_p = [0.309, 12.47], so the standard text vectors corresponding to the three texts are approximately S'_p1 = [-1.40, -1.34], S'_p2 = [0.86, 0.27], S'_p3 = [0.54, 1.07].
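The normalization above can be sketched with NumPy, using the population standard deviation so the numbers match the worked example:

```python
import numpy as np

# Query text vectors of the 3 texts in cluster a, from the example above.
S = np.array([[0.2, 100.0],
              [0.9, 120.0],
              [0.8, 130.0]])

u_p = S.mean(axis=0)         # per-dimension mean: ≈ [0.633, 116.67]
alpha_p = S.std(axis=0)      # per-dimension population std: ≈ [0.309, 12.47]

S_std = (S - u_p) / alpha_p  # standard text vectors

print(np.round(S_std, 2))
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is what the example figures imply.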
In this alternative embodiment, a cosine similarity algorithm may be used to calculate the cosine similarity between the standard text vectors and the keyword vectors as the text vector similarity. Let Sim_pq,ij denote the cosine similarity between the q-th standard text vector in the p-th query text data cluster and the j-th keyword vector of the i-th task label. The cosine similarity Sim_pq,i between the q-th standard text vector in the p-th query text data cluster and the keyword vectors of all keywords in the keyword list corresponding to the i-th task label is then:

Sim_pq,i = (1 / len_i) × Σ_j Sim_pq,ij

where len_i denotes the number of keywords in the keyword list corresponding to the i-th task label. Taking any query text in the query text data cluster (such as the q-th query text) as a target query text, the obtained Sim_pq,i is used as the target average similarity corresponding to that target query text, so that a target average similarity is obtained for every query text in the query text data cluster.
The cosine similarity Sim_i,p between the p-th query text data cluster and the keyword list corresponding to the i-th task label can then be calculated as:

Sim_i,p = (1 / len_p) × Σ_q Sim_pq,i

where len_p denotes the total number of query text data within the p-th query text data cluster. In this solution, the resulting cosine similarity Sim_i,p between the p-th query text data cluster and the keyword list corresponding to the i-th task label is used as the first text similarity corresponding to that query text data cluster.
In this way, the first text similarity between each query text data cluster and the keyword lists can be calculated preliminarily through the cosine similarity algorithm, which facilitates subsequently matching a corresponding task label to each query text data cluster.
And S14, calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity.
In this alternative embodiment, the TF-IDF algorithm may be selected as the text similarity algorithm for calculating the similarity between the query text data cluster and the keyword list as the second text similarity. TF-IDF is an index measuring the importance of a word in a document: TF is the frequency of a keyword within a query text data cluster, and IDF is the inverse document frequency of the keyword across all query text data clusters.
In this alternative embodiment, the TF value of the j-th keyword of the i-th task label in the p-th query text data cluster satisfies the relation:

TF_ij,p = n_ij,p / N_p

where n_ij,p denotes the number of times the j-th keyword of the i-th task label appears in the p-th query text data cluster, and N_p is the number of all words occurring within the p-th query text data cluster.
In this alternative embodiment, the IDF value of the j-th keyword of the i-th task label satisfies the relation:

IDF_ij = log(m / m_j)

where m is the total number of query text data clusters and the denominator m_j is the number of query text data clusters containing the j-th keyword.
From the above two relations, the TF-IDF value of the j-th keyword of the i-th task label in the p-th query text data cluster is:

TFIDF_ij,p = TF_ij,p × IDF_ij

where TFIDF_ij,p denotes the TF-IDF value of the j-th keyword of the i-th task label in the p-th query text data cluster. In this solution, the maximum of the TF-IDF values of the keywords of the i-th task label in the p-th query text data cluster is taken as the second text similarity TFIDF_i,p between the i-th task label and the p-th query text data cluster, i.e., TFIDF_i,p = max_j(TFIDF_ij,p).
Therefore, the second text similarity between the inquiry text data cluster and the keyword list can be calculated through a text similarity algorithm, so that more accurate task labels can be matched for the inquiry text data cluster in the follow-up process according to the obtained second text similarity.
S15, matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
In an optional embodiment, the matching the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity by the second text similarity to obtain a query text similarity;
sorting the query text similarities in descending order to obtain a sorting result;
and determining the task label corresponding to the query text data cluster based on the sorting result.
In this alternative embodiment, the first text similarity can effectively distinguish task labels that are clearly distinct, but has difficulty distinguishing task labels that are relatively similar.
Illustratively, consider query text data cluster 1 and query text data cluster 2, each containing two pieces of query text data. The two texts of cluster 1 are "this disease is very common and is not serious" and "this disease is quite common and generally not serious"; from keywords such as "common", "not" and "serious", the first text similarity easily identifies the corresponding task label as "illness comfort". The two texts of cluster 2 are "pay attention to diet at ordinary times, avoid repeated attacks and serious consequences" and "pay attention to diet in daily life, avoid recurrence and serious symptoms"; with keywords such as "attention" and "avoid", the first text similarity has difficulty judging whether the corresponding task label is "life advice" or "cause".
In this alternative embodiment, the query text similarity may be obtained by multiplying the first text similarity by the second text similarity, thereby solving the problem that the first text similarity alone has difficulty distinguishing relatively similar task labels. The query text similarity score_i,p satisfies the relation:

score_i,p = abs(Sim_i,p × TFIDF_i,p)

where score_i,p denotes the query text similarity between the i-th task label and the p-th query text data cluster, and abs is the absolute value function.
In this optional embodiment, the query text similarity between the query text data cluster and each task label may be calculated, the query text similarities are sorted in descending order, the task label corresponding to the top-ranked query text similarity is selected as the task label matching the query text data, and finally the corresponding task label is set for the data in the query text data cluster.
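Step S15 can be sketched as follows, with hypothetical similarity values for two task labels (the label names and numbers are illustrative only):

```python
def match_task_label(sim, tfidf):
    # sim: first text similarities Sim_i,p per task label, for one cluster.
    # tfidf: second text similarities TFIDF_i,p per task label.
    scores = {label: abs(sim[label] * tfidf[label]) for label in sim}
    # Sort in descending order and take the top-ranked task label.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked

best, ranked = match_task_label(
    {"life advice": 0.62, "cause": 0.58},  # hypothetical Sim_i,p
    {"life advice": 0.31, "cause": 0.12},  # hypothetical TFIDF_i,p
)
print(best)  # → life advice
```

Here "life advice" wins because 0.62 × 0.31 exceeds 0.58 × 0.12, even though the two first text similarities alone are close.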
In this way, by combining the first text similarity and the second text similarity, the task labels can be more accurately matched with the inquiry text data clusters.
Referring to fig. 2, fig. 2 is a functional block diagram of a preferred embodiment of the text data tag generating apparatus based on artificial intelligence of the present application. The text data tag generating device 11 based on artificial intelligence comprises an acquisition module 110, a clustering module 111, a generating module 112, a first calculating module 113, a second calculating module 114 and a matching module 115. The units/modules referred to herein are series of computer readable instructions stored in the memory 12 that can be executed by the processor 13 to perform fixed functions. In the present embodiment, the functions of the respective units/modules will be described in detail in the following embodiments.
In an alternative embodiment, the acquisition module 110 is configured to acquire on-line interview data of a doctor to obtain an interview text dataset.
In an alternative embodiment, the acquiring the on-line consultation data of the doctor to acquire the consultation text data set includes:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In an alternative embodiment, the clustering module 111 is configured to cluster the query text data in the query text data set to obtain a plurality of query text data clusters.
In an optional embodiment, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In an alternative embodiment, the generating module 112 is configured to design task labels according to a preset manner to generate keyword lists, where the task labels correspond one-to-one with the keyword lists.
In an optional embodiment, the designing of the task labels according to the preset manner to generate keyword lists, where the task labels correspond one-to-one with the keyword lists, includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In this optional embodiment, a plurality of task labels corresponding to the task categories may be designed in advance according to different supervised learning tasks. In this embodiment, identifying the doctor's intention during on-line inquiry is taken as the example supervised learning task, where the task labels may be "cause", "life advice", "illness comfort", "rehabilitation expectation", "medicine curative effect" and "inquiry medication", used to represent the doctor's intention. Label classification of the inquiry text data set is implemented by matching each inquiry text data cluster to its corresponding task label, so as to identify the doctor's intention corresponding to each text data in the inquiry text data set. For example, for the text data "your illness is not serious" in the inquiry text data set, the corresponding task label is automatically identified as "illness comfort" through the supervised learning task, i.e., the doctor's intention at that moment is "illness comfort".
In this alternative embodiment, a plurality of keywords corresponding to each task tag may be configured in advance by a medical expert, and the plurality of keywords corresponding to each task tag may be used as the keyword list of the task tag.
Illustratively, Table 1 shows the keyword list corresponding to each task label set in the present solution.
TABLE 1
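Since the contents of Table 1 are not reproduced in this text, the configuration can only be illustrated with hypothetical keyword lists keyed by task labels named in this embodiment:

```python
# Hypothetical keyword lists per task label; the actual expert-curated
# keywords of Table 1 are not available here, so these entries are
# illustrative only.
KEYWORD_LISTS = {
    "cause": ["cause", "trigger", "induce"],
    "life advice": ["diet", "attention", "rest", "avoid"],
    "illness comfort": ["common", "not", "serious"],
}

def keywords_for(label):
    # One keyword list per task label (one-to-one correspondence).
    return KEYWORD_LISTS[label]

print(keywords_for("illness comfort"))  # → ['common', 'not', 'serious']
```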
In an alternative embodiment, the first calculating module 113 is configured to calculate the similarity between the query text data cluster and the keyword list according to a cosine similarity algorithm to obtain the first text similarity.
In an alternative embodiment, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain the first text similarity includes:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In this alternative embodiment, word vectors of each query text in the query text data cluster and of each keyword in the keyword list may be calculated through a word2vec word vector model and used as the text word vectors and the keyword vectors. word2vec is an open-source word vector calculation tool that converts the words in the query texts and the keywords into word vectors. A word vector consists of numbers in multiple dimensions; illustratively, for the query text "attention rest", the words "attention" and "rest" can be converted by word2vec into the two text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361].
In this optional embodiment, the average value of the elements in each text word vector may be calculated as the text word value corresponding to that text word vector, that is, text word vectors and text word values are in one-to-one correspondence, and all text word values corresponding to a query text are used as elements to generate the query text vector of that query text. Illustratively, the element averages of the text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361] are 0.217 and -0.042 respectively, so the query text "attention rest" corresponds to the query text vector [0.217, -0.042].
Let S_pq denote the query text vector of the q-th text in the p-th query text data cluster, S_pq = [S_pq,0, S_pq,1, ..., S_pq,s, ..., S_pq,l-1], where S_pq,s denotes the s-th text word value of the q-th text in the p-th query text data cluster, and l is the number of text word values of the q-th text in the p-th query text data cluster.
In this optional embodiment, in order to make the distribution of the query text vectors within each query text data cluster uniform in each dimension, the query text vector corresponding to each query text may be normalized to obtain a standard text vector, where the standard text vector satisfies the relation:

S'_pq = (S_pq - u_p) / α_p

where S'_pq denotes the standard text vector of the q-th text in the p-th query text data cluster, and u_p and α_p are respectively the mean and the standard deviation of the query text vectors in the p-th query text data cluster.
For example, query text data cluster a includes 3 texts whose query text vectors are S_p1 = [0.2, 100], S_p2 = [0.9, 120], S_p3 = [0.8, 130]. The mean and standard deviation of the query text vectors in cluster a can be calculated as u_p = [0.633, 116.67] and α_p = [0.309, 12.47], so the standard text vectors corresponding to the three texts are approximately S'_p1 = [-1.40, -1.34], S'_p2 = [0.86, 0.27], S'_p3 = [0.54, 1.07].
In this alternative embodiment, a cosine similarity algorithm may be used to calculate the cosine similarity between the standard text vectors and the keyword vectors as the text vector similarity. Let Sim_pq,ij denote the cosine similarity between the q-th standard text vector in the p-th query text data cluster and the j-th keyword vector of the i-th task label. The cosine similarity Sim_pq,i between the q-th standard text vector in the p-th query text data cluster and the keyword vectors of all keywords in the keyword list corresponding to the i-th task label is then:

Sim_pq,i = (1 / len_i) × Σ_j Sim_pq,ij

where len_i denotes the number of keywords in the keyword list corresponding to the i-th task label. Taking any query text in the query text data cluster (such as the q-th query text) as a target query text, the obtained Sim_pq,i is used as the target average similarity corresponding to that target query text, so that a target average similarity is obtained for every query text in the query text data cluster.
The cosine similarity Sim_i,p between the p-th query text data cluster and the keyword list corresponding to the i-th task label can then be calculated as:

Sim_i,p = (1 / len_p) × Σ_q Sim_pq,i

where len_p denotes the total number of query text data within the p-th query text data cluster. In this solution, the resulting cosine similarity Sim_i,p between the p-th query text data cluster and the keyword list corresponding to the i-th task label is used as the first text similarity corresponding to that query text data cluster.
In an alternative embodiment, the second calculating module 114 is configured to calculate the similarity between the query text data cluster and the keyword list according to a text similarity algorithm to obtain the second text similarity.
In this alternative embodiment, the TF-IDF algorithm may be selected as the text similarity algorithm for calculating the similarity between the query text data cluster and the keyword list as the second text similarity. TF-IDF is an index measuring the importance of a word in a document: TF is the frequency of a keyword within a query text data cluster, and IDF is the inverse document frequency of the keyword across all query text data clusters.
In this alternative embodiment, the TF value of the j-th keyword of the i-th task label in the p-th query text data cluster satisfies the relation:

TF_ij,p = n_ij,p / N_p

where n_ij,p denotes the number of times the j-th keyword of the i-th task label appears in the p-th query text data cluster, and N_p is the number of all words occurring within the p-th query text data cluster.
In this alternative embodiment, the IDF value of the j-th keyword of the i-th task label satisfies the relation:

IDF_ij = log(m / m_j)

where m is the total number of query text data clusters and the denominator m_j is the number of query text data clusters containing the j-th keyword.
From the above two relations, the TF-IDF value of the j-th keyword of the i-th task label in the p-th query text data cluster is:

TFIDF_ij,p = TF_ij,p × IDF_ij

where TFIDF_ij,p denotes the TF-IDF value of the j-th keyword of the i-th task label in the p-th query text data cluster. In this solution, the maximum of the TF-IDF values of the keywords of the i-th task label in the p-th query text data cluster is taken as the second text similarity TFIDF_i,p between the i-th task label and the p-th query text data cluster, i.e., TFIDF_i,p = max_j(TFIDF_ij,p).
In an alternative embodiment, the matching module 115 is configured to match the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity.
In an optional embodiment, the matching the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity by the second text similarity to obtain a query text similarity;
sorting the query text similarities in descending order to obtain a sorting result;
and determining the task label corresponding to the query text data cluster based on the sorting result.
In this alternative embodiment, the first text similarity can effectively distinguish task labels that are clearly distinct, but has difficulty distinguishing task labels that are relatively similar.
Illustratively, consider query text data cluster 1 and query text data cluster 2, each containing two pieces of query text data. The two texts of cluster 1 are "this disease is very common and is not serious" and "this disease is quite common and generally not serious"; from keywords such as "common", "not" and "serious", the first text similarity easily identifies the corresponding task label as "illness comfort". The two texts of cluster 2 are "pay attention to diet at ordinary times, avoid repeated attacks and serious consequences" and "pay attention to diet in daily life, avoid recurrence and serious symptoms"; with keywords such as "attention" and "avoid", the first text similarity has difficulty judging whether the corresponding task label is "life advice" or "cause".
In this alternative embodiment, the query text similarity may be obtained by multiplying the first text similarity by the second text similarity, thereby solving the problem that the first text similarity alone has difficulty distinguishing relatively similar task labels. The query text similarity score_i,p satisfies the relation:

score_i,p = abs(Sim_i,p × TFIDF_i,p)

where score_i,p denotes the query text similarity between the i-th task label and the p-th query text data cluster, and abs is the absolute value function.
In this optional embodiment, the query text similarity between the query text data cluster and each task label may be calculated, the query text similarities are sorted in descending order, the task label corresponding to the top-ranked query text similarity is selected as the task label matching the query text data, and finally the corresponding task label is set for the data in the query text data cluster.
According to the technical scheme, the acquired inquiry data can be clustered, and a keyword list corresponding to each task label is created to perform different text similarity calculations with the clustered inquiry data, so that a more accurate and appropriate task label can be automatically matched to each category of inquiry data, effectively improving the efficiency of obtaining text data labels.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1 comprises a memory 12 and a processor 13. The memory 12 is configured to store computer readable instructions and the processor 13 is configured to execute the computer readable instructions stored in the memory to implement the artificial intelligence based text data tag generating method according to any of the above embodiments.
In an alternative embodiment, the electronic device 1 further comprises a bus, a computer program stored in said memory 12 and executable on said processor 13, such as an artificial intelligence based text data tag generation program.
Fig. 3 shows only an electronic device 1 with a memory 12 and a processor 13, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
In connection with fig. 1, the memory 12 in the electronic device 1 stores a plurality of computer readable instructions to implement an artificial intelligence based text data tag generating method, the processor 13 being executable to implement:
acquiring on-line consultation data of a doctor to acquire a consultation text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
Calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
Specifically, the specific implementation method of the above instructions by the processor 13 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of the electronic device 1. The electronic device 1 may have a bus-type or star-type structure, may further comprise more or less hardware or software than illustrated, or a different arrangement of components; for example, the electronic device 1 may further comprise input-output devices, network access devices, etc.
It should be noted that the electronic device 1 is only used as an example, and other electronic products that may be present in the present application or may be present in the future are also included in the scope of the present application and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. The memory 12 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. The memory 12 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of text data tag generation programs based on artificial intelligence, but also for temporarily storing data that has been output or is to be output.
The processor 13 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, a combination of various control chips, and the like. The processor 13 is a Control Unit (Control Unit) of the electronic device 1, connects the respective components of the entire electronic device 1 using various interfaces and lines, executes or executes programs or modules stored in the memory 12 (for example, executes an artificial intelligence-based text data tag generation program or the like), and invokes data stored in the memory 12 to perform various functions of the electronic device 1 and process data.
The processor 13 executes the operating system of the electronic device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps of the various embodiments of the artificial intelligence based text data tag generation method described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be split into one or more units/modules, which are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more units/modules may be a series of computer readable instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the electronic device 1. For example, the computer program may be partitioned into an acquisition module 110, a clustering module 111, a generation module 112, a first calculation module 113, a second calculation module 114, a matching module 115.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or processor (processor) to perform portions of the artificial intelligence-based text data tag generation methods described in various embodiments of the present application.
The integrated units/modules of the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand alone product. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by instructing the relevant hardware device by a computer program, where the computer program may be stored in a computer readable storage medium, and the computer program may implement the steps of each method embodiment described above when executed by a processor.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory, other memories, and the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain referred to in the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The bus may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in FIG. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable communication between the memory 12, the at least one processor 13, and other components.
The embodiment of the application further provides a computer-readable storage medium (not shown), in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the artificial intelligence based text data label generation method according to any one of the embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Furthermore, it is evident that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. The various modules or means set forth in the specification may also be implemented by a single module or means in software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solution of the present application. Although the present application has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present application may be modified or equivalently substituted without departing from its spirit and scope.
Claims (10)
1. A method for generating text data labels based on artificial intelligence, the method comprising:
acquiring online inquiry data of doctors to obtain an inquiry text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate keyword lists, wherein the task labels correspond to the keyword lists one to one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
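Read as an algorithm, the steps of claim 1 form a labeling pipeline. The following is a minimal illustrative sketch of that pipeline, not the patented implementation: the label names, keyword lists, bag-of-words "word vectors", and the lexical-overlap stand-in for the unspecified text similarity algorithm are all assumptions introduced for the example.

```python
import math
from collections import Counter

# Hypothetical label -> keyword-list design (the patent does not fix these).
KEYWORD_LISTS = {
    "diet":     ["diet", "food", "eat", "nutrition"],
    "medicine": ["medicine", "dose", "drug", "tablet"],
}

def bag_of_words(text):
    """Toy stand-in for a word-vector model: a term-frequency dict."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_label(cluster_texts):
    """Score each task label against one inquiry text cluster; keep the best."""
    scores = {}
    for label, keywords in KEYWORD_LISTS.items():
        kw_vec = bag_of_words(" ".join(keywords))
        # first text similarity: mean cosine between each text and the keyword list
        first = sum(cosine(bag_of_words(t), kw_vec)
                    for t in cluster_texts) / len(cluster_texts)
        # second text similarity: simple lexical-overlap stand-in (assumption)
        second = sum(1 for t in cluster_texts
                     for k in keywords if k in t.lower()) / len(cluster_texts)
        scores[label] = first * second      # multiply the two similarities
    return max(scores, key=scores.get)      # best-ranked label wins

cluster = ["please watch your diet and eat less salt",
           "a balanced diet helps recovery"]
print(match_label(cluster))  # -> diet
```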
2. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the acquiring online inquiry data of doctors to obtain an inquiry text data set comprises:
acquiring each text message sent by a doctor during an online inquiry as a piece of inquiry text data to obtain an initial inquiry text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial inquiry text data set into an inquiry text database to obtain the inquiry text data set.
3. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters comprises:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
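Assuming the preset text clustering algorithm is a k-means-style method over the inquiry text vectors (claim 3 fixes neither the algorithm nor the optimization applied to it), the clustering step can be sketched as:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means over dense vectors; a stand-in for the
    text clustering optimization algorithm of claim 3."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        # assign each inquiry text vector to its nearest center
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centers[c])))
            clusters[i].append(v)
        # recompute each center as the mean of its cluster members
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members)
                              for col in zip(*members)]
    return clusters

# two well-separated groups of toy "inquiry text vectors"
vecs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = kmeans(vecs, k=2)
print(sorted(len(c) for c in clusters))  # -> [2, 2]
```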
4. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the designing task labels according to a preset mode to generate a keyword list comprises:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
5. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity comprises:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vectors as a text word value, and constructing an inquiry text vector of each inquiry text based on the text word values;
standardizing the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
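One plausible reading of claim 5 — average the word vectors of a text into an inquiry text vector, L2-normalize it into the standard text vector, then take its cosine similarity against each keyword vector — can be sketched as follows; the two-dimensional word vectors are invented for illustration and stand in for a trained word vector model:

```python
import math

# toy word-vector table standing in for the word vector model (assumption)
WORD_VECS = {
    "fever": [0.9, 0.1],
    "cough": [0.8, 0.2],
    "diet":  [0.1, 0.9],
}

def text_vector(words):
    """Element-wise average of the word vectors -> the inquiry text vector."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def normalize(v):
    """L2-normalize to obtain the 'standard text vector'."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

std = normalize(text_vector(["fever", "cough"]))
sim_symptom = cos(std, WORD_VECS["fever"])
sim_diet = cos(std, WORD_VECS["diet"])
print(sim_symptom > sim_diet)  # -> True
```

As expected, a symptom-heavy text scores higher against a symptom keyword than against a diet keyword.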
6. The method for generating text data labels based on artificial intelligence according to claim 5, wherein the calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity comprises:
calculating an average value of the text vector similarities between a target inquiry text and all keyword vectors in the keyword list as a target average similarity of the target inquiry text, wherein the target inquiry text is any inquiry text in the inquiry text data cluster;
traversing the inquiry text data cluster to obtain the target average similarity corresponding to each inquiry text;
and calculating the average value of all the target average similarities as the first text similarity of the inquiry text data cluster.
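Claim 6's two-level aggregation — a per-text mean over all keyword vectors, then a cluster-level mean over all texts — reduces to a double average; the similarity values below are invented for illustration:

```python
# sims[i][j]: cosine similarity between inquiry text i in the cluster and
# keyword j in the keyword list (values invented for illustration)
sims = [
    [0.8, 0.6, 0.7],   # text 0 vs the three keywords
    [0.5, 0.9, 0.4],   # text 1 vs the three keywords
]

# target average similarity of each inquiry text (mean over all keywords)
target_avgs = [sum(row) / len(row) for row in sims]

# first text similarity of the cluster (mean over all target averages)
first_similarity = sum(target_avgs) / len(target_avgs)
print(round(first_similarity, 3))  # -> 0.65
```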
7. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity comprises:
multiplying the first text similarity by the second text similarity to obtain an inquiry text similarity;
sorting the inquiry text similarities in descending order to obtain a sorting result;
and determining the task labels corresponding to the inquiry text data clusters based on the sorting result.
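The ranking step of claim 7 can be sketched as follows; the candidate labels and similarity scores are invented for illustration:

```python
# invented (first_similarity, second_similarity) pairs per candidate label
candidates = {
    "symptom":  (0.70, 0.60),
    "diet":     (0.40, 0.90),
    "medicine": (0.20, 0.30),
}

# multiply the two similarities, then sort in descending order
ranked = sorted(candidates.items(),
                key=lambda kv: kv[1][0] * kv[1][1],
                reverse=True)
best_label = ranked[0][0]   # the top-ranked label is matched to the cluster
print(best_label)  # -> symptom
```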
8. A text data label generating device based on artificial intelligence, characterized by comprising an acquisition module, a clustering module, a generating module, a first calculation module, a second calculation module, and a matching module, wherein:
the acquisition module is used for acquiring on-line consultation data of a doctor to acquire a consultation text data set;
the clustering module is used for clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
the generating module is used for designing task labels according to a preset mode to generate keyword lists, wherein the task labels correspond to the keyword lists one to one;
the first calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm so as to obtain a first text similarity;
the second calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm so as to obtain second text similarity;
the matching module is used for matching the task labels corresponding to the inquiry text data clusters based on the first text similarity and the second text similarity.
9. An electronic device, the electronic device comprising:
a memory storing computer readable instructions; and
a processor executing the computer readable instructions stored in the memory to implement the artificial intelligence based text data label generation method of any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, implement the artificial intelligence based text data label generation method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310295186.7A CN116401543A (en) | 2023-03-22 | 2023-03-22 | Text data label generation method based on artificial intelligence and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116401543A true CN116401543A (en) | 2023-07-07 |
Family
ID=87015192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310295186.7A Pending CN116401543A (en) | 2023-03-22 | 2023-03-22 | Text data label generation method based on artificial intelligence and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116401543A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yuvaraj et al. | Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster | |
CN113822494B (en) | Risk prediction method, device, equipment and storage medium | |
CN111695033B (en) | Enterprise public opinion analysis method, enterprise public opinion analysis device, electronic equipment and medium | |
US11232365B2 (en) | Digital assistant platform | |
CN113761218B (en) | Method, device, equipment and storage medium for entity linking | |
CN110929125B (en) | Search recall method, device, equipment and storage medium thereof | |
US20200074242A1 (en) | System and method for monitoring online retail platform using artificial intelligence | |
Vidhya et al. | Modified adaptive neuro-fuzzy inference system (M-ANFIS) based multi-disease analysis of healthcare Big Data | |
CN113435202A (en) | Product recommendation method and device based on user portrait, electronic equipment and medium | |
CN112214515B (en) | Automatic data matching method and device, electronic equipment and storage medium | |
WO2022160442A1 (en) | Answer generation method and apparatus, electronic device, and readable storage medium | |
Wanyan et al. | Deep learning with heterogeneous graph embeddings for mortality prediction from electronic health records | |
CN113706253A (en) | Real-time product recommendation method and device, electronic equipment and readable storage medium | |
Johnson et al. | Encoding high-dimensional procedure codes for healthcare fraud detection | |
CN113705698B (en) | Information pushing method and device based on click behavior prediction | |
Sharma et al. | A novel approach of ensemble methods using the stacked generalization for high-dimensional datasets | |
Belwal et al. | Extractive text summarization using clustering-based topic modeling | |
CN114706985A (en) | Text classification method and device, electronic equipment and storage medium | |
WO2021174923A1 (en) | Concept word sequence generation method, apparatus, computer device, and storage medium | |
WO2021009375A1 (en) | A method for extracting information from semi-structured documents, a related system and a processing device | |
CN114581177B (en) | Product recommendation method, device, equipment and storage medium | |
CN116150185A (en) | Data standard extraction method, device, equipment and medium based on artificial intelligence | |
CN115169360A (en) | User intention identification method based on artificial intelligence and related equipment | |
CN113627186B (en) | Entity relation detection method based on artificial intelligence and related equipment | |
CN116401543A (en) | Text data label generation method based on artificial intelligence and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||