CN113254658B - Text information processing method, system, medium, and apparatus - Google Patents


Info

Publication number
CN113254658B
CN113254658B (application CN202110765335.2A)
Authority
CN
China
Prior art keywords
data
preprocessing
word string
term
string distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110765335.2A
Other languages
Chinese (zh)
Other versions
CN113254658A (en)
Inventor
姚娟娟 (Yao Juanjuan)
钟南山 (Zhong Nanshan)
樊代明 (Fan Daiming)
Current Assignee (the listed assignees may be inaccurate)
Shanghai Mingping Medical Data Technology Co ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110765335.2A
Publication of CN113254658A
Application granted
Publication of CN113254658B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention provides a text information processing method, system, medium, and device. In the text information processing method, after the text information is preprocessed to obtain a data preprocessing set, the data preprocessing set undergoes a first screening based on keyword matching against a data reference set and a second screening based on deep learning, and the processed text information is generated by combining the two screened data sets. This effectively prevents erroneous screening of the text information and improves its processing accuracy and efficiency. Each data set comprises a term set, a term description set, and a parameter set that have a mapping relationship with each other; on the basis of comparing and screening each subset individually, the screening results of the other subsets in the mapping relationship provide auxiliary verification, further improving the screening efficiency and accuracy of the text information.

Description

Text information processing method, system, medium, and apparatus
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text information processing method, system, medium, and device.
Background
Natural language processing usually involves texts from many input channels and for many purposes. For medical data, the old paper records of archive rooms and the electronic medical record information of various hospitals or platforms are increasingly complicated; different hospitals or platforms differ in medical data definitions, recording methods, and the like, and the corresponding diagnostic texts are inconsistent because of specific terms, synonymous expressions, abbreviations, spelling and typing errors, and so on.
Therefore, how to effectively summarize complicated medical text information and improve the processing efficiency and accuracy of the medical text information is a problem which needs to be solved urgently at present.
Disclosure of Invention
In view of the above problems in the prior art, the present invention provides a technical solution for processing text information, which is used to solve the above technical problems.
In order to achieve the above and other objects, the present invention adopts the following technical solutions.
A text information processing method comprising:
acquiring text information to be processed;
preprocessing the text information to generate a plurality of words and parameters;
classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
acquiring a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
according to the data reference set, based on keyword matching, carrying out first screening on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other;
according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other;
and outputting the processed text information according to the first data set and the second data set.
Optionally, when the text information is preprocessed, at least data cleaning, punctuation removal, word segmentation, stop word removal, and repeated word removal are sequentially performed on the text information.
Optionally, the step of classifying and extracting the plurality of words and the parameters includes:
performing part-of-speech tagging on the words;
and classifying and extracting the plurality of words and the parameters according to the part of speech and the context of the words to obtain the data preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set includes:
sequentially calculating a first word string distance S1(a, b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, the a-th element is retained and added to the first term set, the corresponding element in the term description preprocessing set is added to the first term description set, and the corresponding element in the parameter preprocessing set is added to the first parameter set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether S1(a) contains elements whose values are smaller than a first threshold;
if at least one element of the first word string distance set S1(a) has a value smaller than the first threshold, sorting the elements of S1(a) smaller than the first threshold in ascending order to obtain a first word string distance screening set S10(a);
for the elements of the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i, j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of zero-valued elements in S2(i) is greater than a second threshold;
if the number of zero-valued elements in the second word string distance set S2(i) is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it to the first term description set, adding the corresponding element in the term preprocessing set to the first term set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the element values of the first word string distance set S1(a) do not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, together with the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the first word string distance S1(a, b) and the second word string distance S2(i, j) are calculated respectively as:
S1(a, b) = [M];
M = [S2(a, b) + S3(a, b)] / 2;
S2(a, b) = |G2(a)| + |G2(b)| − 2*|G2(a) ∩ G2(b)|;
S3(a, b) = |G3(a)| + |G3(b)| − 2*|G3(a) ∩ G3(b)|;
S2(i, j) = [N];
N = [S2'(i, j) + S3(i, j)] / 2;
S2'(i, j) = |G2(i)| + |G2(j)| − 2*|G2(i) ∩ G2(j)|;
S3(i, j) = |G3(i)| + |G3(j)| − 2*|G3(i) ∩ G3(j)|;
wherein the first word string distance S1(a, b) is the value of M rounded, S2(a, b) denotes the first 2-Gram word string distance, and S3(a, b) denotes the first 3-Gram word string distance; the second word string distance S2(i, j) is the value of N rounded, S2'(i, j) denotes the second 2-Gram word string distance, and S3(i, j) denotes the second 3-Gram word string distance. G2(a) and G2(b) respectively denote the sets of 2-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; G2(i) and G2(j) respectively denote the sets of 2-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the sets of 3-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the sets of 3-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set.
Optionally, the step of performing a second screening on the data preprocessing set based on deep learning according to the data reference set includes:
constructing a convolution cyclic neural network model, and training the convolution cyclic neural network model based on the first data set and the data reference set;
and screening and identifying the data preprocessing set by using the trained convolution cyclic neural network model to obtain the second data set.
Optionally, the step of outputting the processed text information according to the first data set and the second data set includes:
analyzing the first data set and the second data set to obtain an intersection and a union of the first data set and the second data set;
and outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection.
Optionally, the step of outputting the processed text information according to the first data set and the second data set further includes:
and outputting second text information according to the intersection and the union, wherein the second text information comprises all elements that remain in the union after the elements repeated with the intersection are removed.
A text information processing system comprising:
the receiving unit is used for receiving text information to be processed and receiving a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
the preprocessing unit is used for preprocessing the text information to generate a plurality of words and parameters;
the classification extraction unit is used for performing classification extraction on the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
the screening unit is used for screening the data preprocessing set twice to obtain a first data set and a second data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have mapping relations with each other, and the second data set comprises a second term set, a second term description set and a second parameter set which have mapping relations with each other;
and the output unit is used for outputting the processed text information according to the first data set and the second data set.
Optionally, the screening unit includes a keyword matching module and a deep learning module, the keyword matching module performs a first screening on the data preprocessing set to obtain the first data set, and the deep learning module performs a second screening on the data preprocessing set to obtain the second data set.
A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform any of the above-described text information processing methods.
An electronic device, comprising:
a processor;
a computer readable storage medium having instructions stored thereon, which when executed by the processor, implement the text information processing method of any one of the above.
As described above, the text information processing method, system, medium, and apparatus provided by the present invention have at least the following beneficial effects:
on the basis of preprocessing the text information to obtain a data preprocessing set, the data preprocessing set undergoes a first screening based on keyword matching against a data reference set and a second screening based on deep learning, and the processed text information is generated by combining the two screened data sets; this effectively prevents erroneous screening of the text information and improves its processing efficiency and accuracy. Each data set comprises a term set, a term description set, and a parameter set that have a mapping relationship with each other; on the basis of comparing and screening each subset individually, the auxiliary verification provided by the screening results of the other subsets in the mapping relationship further improves the screening efficiency and accuracy.
Drawings
Fig. 1 is a schematic step diagram of a text information processing method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a text message processing system according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a user terminal according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides a text information processing method, including the following steps:
and S1, acquiring the text information to be processed. For example, a large amount of medical text information is acquired from a paper document or a medical database through scanning recognition or text transmission and other acquisition modes.
And S2, preprocessing the text information to generate a plurality of words and parameters.
In an optional embodiment of the present invention, when the text information is preprocessed, at least data cleaning, punctuation removal, word segmentation, stop word removal, and repeated word removal are performed on the text information in sequence.
The detailed steps of data cleaning, word segmentation and stop word removal can refer to the prior art, and are not described herein again.
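As an illustrative sketch rather than the patent's actual implementation, the preprocessing chain of step S2 can be outlined in Python; the stop-word list and the whitespace-based segmentation below are stand-ins for the real resources that the patent leaves to the prior art:

```python
import re

# Illustrative stop-word list; a real system would use a full list and a
# trained word segmenter for Chinese text.
STOP_WORDS = {"的", "了", "和", "a", "an", "the", "of"}

def preprocess(text: str) -> list[str]:
    """Sketch of step S2: cleaning, punctuation removal, segmentation,
    stop-word removal, and repeated-word removal, in sequence."""
    text = text.lower().strip()              # data cleaning: normalize case
    text = re.sub(r"[^\w\s]|_", " ", text)   # punctuation removal
    tokens = text.split()                    # segmentation (stand-in)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    return list(dict.fromkeys(tokens))       # de-duplicate, keep order

print(preprocess("The patient, the patient of hypertension!  BP 150/95"))
# → ['patient', 'hypertension', 'bp', '150', '95']
```

Note that numeric parameters such as "150/95" survive the pipeline as tokens, which is what allows the later classification step to route them into a parameter preprocessing set.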
S3, classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a term preprocessing set, a term description preprocessing set and a parameter preprocessing set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S3 of classifying and extracting the words and the parameters further includes:
s31, performing part-of-speech tagging on the words;
and S32, classifying and extracting the words and the parameters according to the parts of speech and the context of the words to obtain a data preprocessing set.
The data preprocessing set comprises a term preprocessing set, a term description preprocessing set and a parameter preprocessing set, mapping relations exist among the term preprocessing set, the term description preprocessing set and the parameter preprocessing set, and the same object is described, so that association judgment during subsequent identification and screening is facilitated, and the accuracy of identification and screening is improved.
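The mapping relationship among the three preprocessing subsets can be pictured as parallel records describing the same object; the `Entry` type, field names, and sample values below are hypothetical illustrations, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """One object: the i-th entries of the three subsets belong together."""
    term: str         # element of the term preprocessing set
    description: str  # corresponding element of the term description set
    parameter: str    # corresponding element of the parameter set

entries = [
    Entry("blood pressure", "systolic/diastolic reading", "150/95 mmHg"),
    Entry("heart rate", "beats per minute", "88 bpm"),
]

# Projecting out the three subsets keeps them index-aligned, so keeping or
# discarding a term automatically keeps or discards its description and
# parameter during screening.
term_set        = [e.term for e in entries]
description_set = [e.description for e in entries]
parameter_set   = [e.parameter for e in entries]
assert len(term_set) == len(description_set) == len(parameter_set)
```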
S4, acquiring a data reference set of the related field, wherein the data reference set comprises a term reference set, a term description reference set and a parameter reference set which have a mapping relation with each other.
In an optional embodiment of the present invention, the data reference set of the related field is obtained, through channels such as the Internet or a blockchain, from professionally or authoritatively recognized medical dictionaries, medical databases, and the like, and serves as the comparison standard in the subsequent identification and screening; it includes a term reference set, a term description reference set, and a parameter reference set that have a mapping relationship with each other.
S5, according to the data reference set, based on keyword matching, a first screening is carried out on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S5 of performing a first filtering on the data preprocessing set based on the keyword matching according to the data reference set further includes:
S51, for the a-th element in the term preprocessing set, sequentially calculating a first word string distance S1(a, b) between the a-th element and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
S52, if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it to the first term set, adding the corresponding element in the term description preprocessing set to the first term description set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
S53, if the element values of the first word string distance set S1(a) do not contain zero, further judging whether S1(a) contains elements whose values are smaller than a first threshold;
S54, if at least one element of the first word string distance set S1(a) has a value smaller than the first threshold, sorting the elements of S1(a) smaller than the first threshold in ascending order to obtain a first word string distance screening set S10(a);
S55, for the elements of the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i, j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
S56, if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of zero-valued elements in S2(i) is greater than a second threshold;
S57, if the number of zero-valued elements in the second word string distance set S2(i) is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it to the first term description set, adding the corresponding element in the term preprocessing set to the first term set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
S58, if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
In addition, the step S5 of performing a first filtering on the data preprocessing set based on the keyword matching according to the data reference set further includes:
S59, if the element values of the first word string distance set S1(a) do not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, together with the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set;
S510, if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
The first threshold is 1 to 2 and can be flexibly adjusted according to the word string length of the a-th element in the term preprocessing set; the second threshold is 2/3 of the number of elements contained in the corresponding i-th element of the term description preprocessing set.
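The keep / re-check / discard branching of steps S52 through S59 can be sketched as follows; the function name and the default threshold of 2 (within the stated range of 1 to 2) are illustrative assumptions:

```python
def first_screen_decision(s1_a: list[int], first_threshold: int = 2) -> str:
    """Decision logic sketched from steps S52-S59: an exact match keeps the
    term outright, a near match triggers the second check against the term
    description set, and everything else is discarded."""
    if 0 in s1_a:
        return "keep"                # S52: some reference term matches exactly
    if any(d < first_threshold for d in s1_a):
        return "check-description"   # S54-S58: verify via the description set
    return "discard"                 # S59: no sufficiently close reference term

print(first_screen_decision([3, 0, 5]))  # → keep
print(first_screen_decision([3, 1, 5]))  # → check-description
print(first_screen_decision([3, 4, 5]))  # → discard
```

Because the subsets are index-aligned, whichever branch is taken applies simultaneously to the term, its description, and its parameter.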
In detail, the first screening of the data preprocessing set in step S5 is based on the N-Gram model: word segmentation and word string distance calculation are performed with the N-Gram model, the data preprocessing set is compared against the data reference set, and the elements of the data preprocessing set that match elements of the data reference set are retained to obtain the first data set.
In an alternative embodiment of the present invention, the first word string distance S1(a, b) and the second word string distance S2(i, j) are calculated respectively as:
S1(a, b) = [M];
M = [S2(a, b) + S3(a, b)] / 2;
S2(a, b) = |G2(a)| + |G2(b)| − 2*|G2(a) ∩ G2(b)|;
S3(a, b) = |G3(a)| + |G3(b)| − 2*|G3(a) ∩ G3(b)|;
S2(i, j) = [N];
N = [S2'(i, j) + S3(i, j)] / 2;
S2'(i, j) = |G2(i)| + |G2(j)| − 2*|G2(i) ∩ G2(j)|;
S3(i, j) = |G3(i)| + |G3(j)| − 2*|G3(i) ∩ G3(j)|;
wherein the first word string distance S1(a, b) is the value of M rounded, S2(a, b) denotes the first 2-Gram word string distance, and S3(a, b) denotes the first 3-Gram word string distance; the second word string distance S2(i, j) is the value of N rounded, S2'(i, j) denotes the second 2-Gram word string distance, and S3(i, j) denotes the second 3-Gram word string distance. G2(a) and G2(b) respectively denote the sets of 2-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; G2(i) and G2(j) respectively denote the sets of 2-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the sets of 3-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the sets of 3-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set.
The first word string distance S1(a, b) and the second word string distance S2(i, j) take the rounded averages of the corresponding 2-Gram and 3-Gram word string distances; this appropriately increases the fault tolerance when recognizing longer character strings and reduces the probability of screening errors.
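Assuming that the bracket notation [M] denotes ordinary rounding, the distance above can be sketched in Python as follows (the function names are illustrative, not from the patent):

```python
def ngrams(s: str, n: int) -> set[str]:
    """Set of character n-grams of s, i.e. Gn(s) in the patent's notation."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def gram_distance(x: str, y: str, n: int) -> int:
    """|Gn(x)| + |Gn(y)| - 2*|Gn(x) ∩ Gn(y)|, matching the formulas above."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx) + len(gy) - 2 * len(gx & gy)

def word_string_distance(x: str, y: str) -> int:
    """Rounded average of the 2-Gram and 3-Gram word string distances."""
    return round((gram_distance(x, y, 2) + gram_distance(x, y, 3)) / 2)

# Identical strings are at distance 0; a one-character typo gives a small
# nonzero distance, which is what the first threshold tolerates.
print(word_string_distance("hypertension", "hypertension"))  # → 0
print(word_string_distance("hypertension", "hypertenzion"))  # → 5
```

Averaging the 2-Gram and 3-Gram distances, then rounding, is what gives longer strings some slack against isolated typing errors.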
S6, according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S6 of performing the second filtering on the data preprocessing set based on the deep learning according to the data reference set further includes:
s61, constructing a convolution cyclic neural network model, and training the convolution cyclic neural network model based on the first data set and the data reference set;
and S62, screening and identifying the data preprocessing set by using the trained convolution cyclic neural network model to obtain a second data set.
In an optional embodiment of the present invention, the convolutional recurrent neural network model at least comprises:
a CNN (convolutional layers), which uses a deep CNN to extract features from the input and obtain feature maps;
an RNN (recurrent layers), which uses a bidirectional RNN (BLSTM) to predict the feature sequence, learning each feature vector in the sequence and outputting a distribution over predicted labels;
a CTC loss (transcription layer), which uses the CTC loss to convert the series of label distributions obtained from the recurrent layers into the final label sequence.
The specific structure of the convolutional recurrent neural network model can be referred to in the prior art, and is not described in detail herein.
When the trained convolution cyclic neural network model is used for screening and identifying, relevant elements (namely elements with mapping relations) in the term preprocessing set, the term description preprocessing set and the parameter preprocessing set are identified and screened in sequence, and a second data set is obtained.
And S7, outputting the processed text information according to the first data set and the second data set.
In detail, the step S7 of outputting the processed text information according to the first data set and the second data set further includes:
s71, analyzing the first data set and the second data set to obtain the intersection and the union of the first data set and the second data set;
s72, outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection;
and S73, outputting second text information according to the intersection and the union, wherein the second text information comprises all elements that remain in the union after the elements repeated with the intersection are removed.
The first text information is output according to the intersection of the first data set and the second data set, namely the data which passes through the two screening processes forms the first text information, and the first text information is the default screening error-free information, so that the screening accuracy is improved; and outputting second text information according to the elements which are removed from the union of the first data set and the second data set and are repeated with the intersection, namely, the second text information is formed by data which is selected in the two previous and next screening processes and is only selected once, and the second text information is suspected information, so that the probability of wrong screening can be effectively reduced, and the screening accuracy is further improved.
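Steps S71 to S73 reduce to two set operations: the intersection yields the first text information, and the union minus the intersection yields the second (suspect) text information. A minimal sketch over element identifiers (the sample terms are invented for illustration):

```python
def split_outputs(first_data_set, second_data_set):
    """Return (first_text, second_text): elements passing both screenings,
    and elements passing exactly one screening (union minus intersection)."""
    a, b = set(first_data_set), set(second_data_set)
    inter = a & b            # passed both screenings -> first text information
    union = a | b
    suspect = union - inter  # passed only one screening -> second (suspect) text
    return inter, suspect

first, second = split_outputs({"fever", "cough", "38.5C"},
                              {"fever", "38.5C", "headache"})
# first  -> {"fever", "38.5C"}; second -> {"cough", "headache"}
```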
Referring to Fig. 2, the present invention further provides a text information processing system for executing the text information processing method of the foregoing method embodiment; since the technical principle of the system embodiment is similar to that of the foregoing method embodiment, repeated description of the same technical details is omitted.
As shown in fig. 2, in an alternative embodiment of the present invention, a text information processing system includes:
a receiving unit 10, configured to receive text information to be processed, and further configured to receive a data reference set of a related field, where the data reference set comprises a term reference set, a term description reference set, and a parameter reference set that have mapping relationships with each other;
the preprocessing unit 11 is configured to preprocess the text information to generate a plurality of words and parameters;
the classification extraction unit 12 is configured to perform classification extraction on the multiple words and parameters to obtain a corresponding data preprocessing set, where the data preprocessing set includes a term preprocessing set, a term description preprocessing set, and a parameter preprocessing set, which have a mapping relationship therebetween;
the screening unit 13 is configured to screen the data preprocessing set twice to obtain a first data set and a second data set, where the first data set comprises a first term set, a first term description set, and a first parameter set that have mapping relationships with each other, and the second data set comprises a second term set, a second term description set, and a second parameter set that have mapping relationships with each other;
and the output unit 14 is used for outputting the processed text information according to the first data set and the second data set.
The receiving unit 10 is configured to assist in performing the steps S1 and S4 described in the foregoing method embodiment, the preprocessing unit 11 is configured to perform the step S2 described in the foregoing method embodiment, the classification extracting unit 12 is configured to perform the step S3 described in the foregoing method embodiment, the screening unit 13 is configured to perform the steps S5 to S6 described in the foregoing method embodiment, and the output unit 14 is configured to perform the step S7 described in the foregoing method embodiment.
Further, the screening unit 13 includes a keyword matching module 131 and a deep learning module 132, where the keyword matching module 131 performs a first screening on the data preprocessing set to obtain a first data set, and the deep learning module 132 performs a second screening on the data preprocessing set to obtain a second data set.
Based on the same inventive concept of the foregoing embodiment, the present invention further provides a computer-readable storage medium, on which a plurality of instructions are stored, where the instructions are suitable for being loaded by a processor to execute the text information processing method. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Based on the same inventive concept of the foregoing embodiment, the present invention also provides an electronic device, which may include: a processor; a computer readable storage medium having stored thereon instructions, which when executed by a processor, cause an electronic device to execute the text information processing method described in fig. 1.
In practical applications, the electronic device may be used as a user terminal or a server; examples of the user terminal may include: a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.
Fig. 3 is a schematic diagram of a hardware structure of a user terminal according to an alternative embodiment of the present invention. As shown, the user terminal may include: an input device 200, a processor 201, an output device 202, a memory 203, and at least one communication bus 204. The communication bus 204 is used to implement communication connections between the elements. The memory 203 may comprise a high-speed RAM memory and may also include a non-volatile memory (NVM), such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the processor 201 may be implemented by, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 201 is coupled to the input device 200 and the output device 202 through a wired or wireless connection.
Optionally, the input device 200 may include a variety of input devices, for example, may include at least one of a user-oriented user interface, a device-oriented device interface, a software-programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output device 202 may include a display, a sound, or other output device.
In this embodiment, the processor of the user terminal includes functions for executing the modules of the text information processing system described above; for specific functions and technical effects, refer to the foregoing embodiments, which are not repeated here.
In summary, in the text information processing method, system, medium, and device provided by the present invention, the text information is preprocessed to obtain a data preprocessing set; according to a data reference set, a first screening based on keyword matching and a second screening based on deep learning are performed on the data preprocessing set, and the processed text information is generated by combining the two resulting data sets, which effectively prevents erroneous screening of text information and improves the accuracy and processing efficiency of text information processing. Each data set comprises a proper noun set, a proper noun description set, and a parameter set that have mapping relationships with each other; on the basis of comparing and screening each subset individually, the screening results of the other subsets with mapping relationships provide auxiliary verification, which further improves screening accuracy and efficiency. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (13)

1. A text information processing method, comprising:
acquiring text information to be processed;
preprocessing the text information to generate a plurality of words and parameters;
classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
acquiring a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
according to the data reference set, based on keyword matching, carrying out first screening on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other;
according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other;
outputting the processed text information according to the first data set and the second data set;
the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set comprises:
sequentially calculating a first word string distance S1(a,b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it into the first term set, adding the corresponding element in the term description preprocessing set into the first term description set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether the first word string distance set S1(a) contains elements with values smaller than a first threshold;
if at least one element of the first word string distance set S1(a) is smaller than the first threshold, statistically sorting the elements of S1(a) that are smaller than the first threshold in ascending order, obtaining a first word string distance screening set S10(a);
for the elements in the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i,j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of elements of S2(i) with value zero is greater than a second threshold;
if the number of elements of the second word string distance set S2(i) with value zero is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it into the first term description set, adding the corresponding element in the term preprocessing set into the first term set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
2. The method according to claim 1, wherein when preprocessing the text information, at least data cleaning processing, punctuation removal processing, word segmentation processing, stop word removal processing, and repeat word removal processing are sequentially performed on the text information.
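The sequential preprocessing of claim 2 (data cleaning, punctuation removal, word segmentation, stop word removal, repeated word removal) can be sketched as follows; the regular expressions, the whitespace tokenizer, and the tiny stop word list are illustrative assumptions — a real system would use domain-specific word segmentation and stop word lists:

```python
import re

STOP_WORDS = {"the", "a", "of", "is"}  # illustrative; real lists are much larger

def preprocess(text):
    """Sequentially: clean -> strip punctuation -> segment -> drop stop words -> dedupe."""
    text = re.sub(r"\s+", " ", text).strip()        # data cleaning (normalize whitespace)
    text = re.sub(r"[^\w\s.]", " ", text)           # punctuation removal (keep decimal points)
    tokens = text.split()                           # word segmentation (whitespace tokenizer)
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop word removal
    seen, out = set(), []
    for t in tokens:                                # repeated word removal, order-preserving
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(preprocess("The body temperature  is 38.5, the temperature is high!"))
# -> ['body', 'temperature', '38.5', 'high']
```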
3. The text information processing method according to claim 1 or 2, wherein the step of classifying and extracting the plurality of words and the parameters includes:
performing part-of-speech tagging on the words;
and classifying and extracting the plurality of words and the parameters according to the part of speech and the context of the words to obtain the data preprocessing set.
4. The method of claim 3, wherein the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set further comprises:
if the first word string distance set S1(a) does not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, and abandoning the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set.
5. The method of claim 4, wherein the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set further comprises:
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
6. The text information processing method according to claim 5, wherein the first word string distance S1(a,b) and the second word string distance S2(i,j) are respectively calculated as:
S1(a,b) = [M];
M = [S2(a,b) + S3(a,b)]/2;
S2(a,b) = |G2(a)| + |G2(b)| − 2*|G2(a)∩G2(b)|;
S3(a,b) = |G3(a)| + |G3(b)| − 2*|G3(a)∩G3(b)|;
S2(i,j) = [N];
N = [S2’(i,j) + S3(i,j)]/2;
S2’(i,j) = |G2(i)| + |G2(j)| − 2*|G2(i)∩G2(j)|;
S3(i,j) = |G3(i)| + |G3(j)| − 2*|G3(i)∩G3(j)|;
wherein the first word string distance S1(a,b) is the value of M rounded, S2(a,b) denotes the first 2-Gram word string distance, and S3(a,b) denotes the first 3-Gram word string distance; the second word string distance S2(i,j) is the value of N rounded, S2’(i,j) denotes the second 2-Gram word string distance, and S3(i,j) denotes the second 3-Gram word string distance; G2(a) and G2(b) respectively denote the set of 2-Grams in the a-th element of the term preprocessing set and in the b-th element of the term reference set; G2(i) and G2(j) respectively denote the set of 2-Grams in the i-th element of the term description preprocessing set and in the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the set of 3-Grams in the a-th element of the term preprocessing set and in the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the set of 3-Grams in the i-th element of the term description preprocessing set and in the corresponding j-th element of the term description reference set.
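The formulas of claim 6 combine a 2-Gram and a 3-Gram set-difference distance and round their mean; the distance is zero exactly when the two strings have identical n-gram sets. A minimal sketch (character-level n-grams over the raw strings are an assumption here — the claim does not fix whether grams are taken over characters or words):

```python
def ngrams(s, n):
    """Set of character n-grams G_n(s) of string s."""
    return {s[k:k + n] for k in range(len(s) - n + 1)}

def gram_distance(x, y, n):
    """|G_n(x)| + |G_n(y)| - 2*|G_n(x) ∩ G_n(y)|, per the claim's formulas."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx) + len(gy) - 2 * len(gx & gy)

def word_string_distance(x, y):
    """Rounded mean of the 2-Gram and 3-Gram distances, per claim 6."""
    return round((gram_distance(x, y, 2) + gram_distance(x, y, 3)) / 2)

# Identical strings -> distance 0; the keyword-matching screen keeps such elements.
print(word_string_distance("hypertension", "hypertension"))  # -> 0
print(word_string_distance("hypertension", "hypotension"))   # -> 6
```

Elements whose distance set contains zero are exact keyword matches; the first and second thresholds of claims 1, 4, and 5 then govern how near-matches are kept or discarded.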
7. The method of claim 6, wherein the step of performing the second screening on the data preprocessing set based on deep learning according to the data reference set comprises:
constructing a convolutional recurrent neural network model, and training the convolutional recurrent neural network model based on the first data set and the data reference set;
and screening and identifying the data preprocessing set by using the trained convolutional recurrent neural network model to obtain the second data set.
8. The method of claim 7, wherein the step of outputting the processed text information according to the first data set and the second data set comprises:
analyzing the first data set and the second data set to obtain an intersection and a union of the first data set and the second data set;
and outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection.
9. The method of claim 8, wherein the step of outputting the processed text information according to the first data set and the second data set further comprises:
and outputting second text information according to the intersection and the union, wherein the second text information comprises all elements which are removed from the union after the elements are repeated with the intersection.
10. A text information processing system, comprising:
the receiving unit is used for receiving text information to be processed and receiving a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
the preprocessing unit is used for preprocessing the text information to generate a plurality of words and parameters;
the classification extraction unit is used for performing classification extraction on the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
the screening unit is used for screening the data preprocessing set twice to obtain a first data set and a second data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have mapping relations with each other, and the second data set comprises a second term set, a second term description set and a second parameter set which have mapping relations with each other;
the output unit is used for outputting the processed text information according to the first data set and the second data set;
the step of the screening unit performing the first screening to obtain the first data set includes:
sequentially calculating a first word string distance S1(a,b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it into the first term set, adding the corresponding element in the term description preprocessing set into the first term description set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether the first word string distance set S1(a) contains elements with values smaller than a first threshold;
if at least one element of the first word string distance set S1(a) is smaller than the first threshold, statistically sorting the elements of S1(a) that are smaller than the first threshold in ascending order, obtaining a first word string distance screening set S10(a);
for the elements in the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i,j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of elements of S2(i) with value zero is greater than a second threshold;
if the number of elements of the second word string distance set S2(i) with value zero is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it into the first term description set, adding the corresponding element in the term preprocessing set into the first term set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
11. The system of claim 10, wherein the filtering unit comprises a keyword matching module and a deep learning module, wherein the keyword matching module filters the pre-processed data set for a first time to obtain the first data set, and the deep learning module filters the pre-processed data set for a second time to obtain the second data set.
12. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the method of processing textual information according to any of claims 1-9.
13. An electronic device, comprising:
a processor;
a computer-readable storage medium having stored thereon instructions which, when executed by the processor, implement the text information processing method according to any one of claims 1 to 9.
CN202110765335.2A 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus Active CN113254658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110765335.2A CN113254658B (en) 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus


Publications (2)

Publication Number Publication Date
CN113254658A CN113254658A (en) 2021-08-13
CN113254658B true CN113254658B (en) 2021-12-21

Family

ID=77190886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110765335.2A Active CN113254658B (en) 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus

Country Status (1)

Country Link
CN (1) CN113254658B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018517963A (en) * 2015-04-24 2018-07-05 日本電気株式会社 Information processing apparatus, information processing method, and program
AU2019240633A1 (en) * 2014-08-14 2019-10-24 Accenture Global Services Limited System for automated analysis of clinical text for pharmacovigilance
CN111046660A (en) * 2019-11-21 2020-04-21 深圳无域科技技术有限公司 Method and device for recognizing text professional terms
CN111931477A (en) * 2020-09-29 2020-11-13 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium
CN111949759A (en) * 2019-05-16 2020-11-17 北大医疗信息技术有限公司 Method and system for retrieving medical record text similarity and computer equipment
CN112364625A (en) * 2020-11-19 2021-02-12 深圳壹账通智能科技有限公司 Text screening method, device, equipment and storage medium
CN112632980A (en) * 2020-12-30 2021-04-09 广州友圈科技有限公司 Enterprise classification method and system based on big data deep learning and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8700589B2 (en) * 2011-09-12 2014-04-15 Siemens Corporation System for linking medical terms for a medical knowledge base
CN107480135B (en) * 2017-07-31 2022-01-07 京东方科技集团股份有限公司 Data processing method, medical phrase processing system and medical diagnosis and treatment system
CN107562721B (en) * 2017-08-09 2020-11-03 刘聪 Noun classification method based on topology
CN109783811B (en) * 2018-12-26 2023-10-31 东软集团股份有限公司 Method, device, equipment and storage medium for identifying text editing errors
CN110442677A (en) * 2019-07-04 2019-11-12 平安科技(深圳)有限公司 Text matches degree detection method, device, computer equipment and readable storage medium storing program for executing
CN110491519B (en) * 2019-07-17 2024-01-02 上海明品医学数据科技有限公司 Medical data checking method
CN110442869B (en) * 2019-08-01 2021-02-23 腾讯科技(深圳)有限公司 Medical text processing method and device, equipment and storage medium thereof
CN111652299A (en) * 2020-05-26 2020-09-11 泰康保险集团股份有限公司 Method and equipment for automatically matching service data
CN112800758A (en) * 2021-04-08 2021-05-14 明品云(北京)数据科技有限公司 Method, system, equipment and medium for distinguishing similar meaning words in text



Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN106847288B (en) Error correction method and device for voice recognition text
CN109471945B (en) Deep learning-based medical text classification method and device and storage medium
WO2017127296A1 (en) Analyzing textual data
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN112417855A (en) Text intention recognition method and device and related equipment
CN112101010A (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN113076749A (en) Text recognition method and system
CN112632956A (en) Text matching method, device, terminal and storage medium
CN113254658B (en) Text information processing method, system, medium, and apparatus
CN110619119A (en) Intelligent text editing method and device and computer readable storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115048927A (en) Method, device and equipment for identifying disease symptoms based on text classification
CN113096667A (en) Wrongly-written character recognition detection method and system
CN113453065A (en) Video segmentation method, system, terminal and medium based on deep learning
CN114120425A (en) Emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220711

Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai

Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.

Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing

Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.
