CN113254658B - Text information processing method, system, medium, and apparatus - Google Patents


Info

Publication number
CN113254658B
CN113254658B (application CN202110765335.2A)
Authority
CN
China
Prior art keywords
data
preprocessing
word string
term
string distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110765335.2A
Other languages
Chinese (zh)
Other versions
CN113254658A (en)
Inventor
姚娟娟 (Yao Juanjuan)
钟南山 (Zhong Nanshan)
樊代明 (Fan Daiming)
Current Assignee (the listed assignees may be inaccurate)
Shanghai Mingping Medical Data Technology Co ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110765335.2A
Publication of CN113254658A
Application granted
Publication of CN113254658B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention provides a text information processing method, system, medium, and device. In the text information processing method, after the text information is preprocessed to obtain a data preprocessing set, the data preprocessing set undergoes a first screening based on keyword matching against a data reference set and a second screening based on deep learning, and the processed text information is generated by combining the two screened data sets. This effectively prevents erroneous screening of the text information and improves its processing accuracy and efficiency. Each data set comprises a term set, a term description set, and a parameter set that have a mapping relationship with each other; on the basis of comparing and screening each subset individually, the screening results of the other subsets in the mapping relationship provide auxiliary verification, further improving the screening efficiency and accuracy of the text information.

Description

Text information processing method, system, medium, and apparatus
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text information processing method, system, medium, and device.
Background
Natural language processing usually involves texts from many input channels and for many purposes. For medical data, the old paper records of archive rooms and the electronic medical record information of various hospitals or platforms are increasingly complicated; different hospitals or platforms differ in medical data definitions, recording methods, and the like, and the corresponding diagnostic texts are inconsistent because of specific terms, synonymous expressions, abbreviations, spelling and typing errors, and so on.
Therefore, how to effectively summarize complicated medical text information and improve the processing efficiency and accuracy of the medical text information is a problem which needs to be solved urgently at present.
Disclosure of Invention
In view of the above problems in the prior art, the present invention provides a technical solution for processing text information, which is used to solve the above technical problems.
In order to achieve the above and other objects, the present invention adopts the following technical solutions.
A text information processing method comprising:
acquiring text information to be processed;
preprocessing the text information to generate a plurality of words and parameters;
classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
acquiring a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
according to the data reference set, based on keyword matching, carrying out first screening on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other;
according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other;
and outputting the processed text information according to the first data set and the second data set.
Optionally, when the text information is preprocessed, at least data cleaning, punctuation removal, word segmentation, stop word removal, and repeated word removal are sequentially performed on the text information.
Optionally, the step of classifying and extracting the plurality of words and the parameters includes:
performing part-of-speech tagging on the words;
and classifying and extracting the plurality of words and the parameters according to the part of speech and the context of the words to obtain the data preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set includes:
sequentially calculating a first word string distance S1(a, b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, the a-th element is retained and added to the first term set, the corresponding element in the term description preprocessing set is added to the first term description set, and the corresponding element in the parameter preprocessing set is added to the first parameter set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether S1(a) contains elements whose values are smaller than a first threshold;
if at least one element of the first word string distance set S1(a) has a value smaller than the first threshold, sorting the elements of S1(a) smaller than the first threshold in ascending order to obtain a first word string distance screening set S10(a);
for the elements of the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i, j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of zero-valued elements in S2(i) is greater than a second threshold;
if the number of zero-valued elements in the second word string distance set S2(i) is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it to the first term description set, adding the corresponding element in the term preprocessing set to the first term set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the element values of the first word string distance set S1(a) do not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, together with the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the step of performing a first screening on the data preprocessing set based on keyword matching according to the data reference set further includes:
if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
Optionally, the first word string distance S1(a, b) and the second word string distance S2(i, j) are calculated respectively as:
S1(a, b) = [M];
M = [S2(a, b) + S3(a, b)] / 2;
S2(a, b) = |G2(a)| + |G2(b)| − 2*|G2(a) ∩ G2(b)|;
S3(a, b) = |G3(a)| + |G3(b)| − 2*|G3(a) ∩ G3(b)|;
S2(i, j) = [N];
N = [S2'(i, j) + S3(i, j)] / 2;
S2'(i, j) = |G2(i)| + |G2(j)| − 2*|G2(i) ∩ G2(j)|;
S3(i, j) = |G3(i)| + |G3(j)| − 2*|G3(i) ∩ G3(j)|;
wherein the first word string distance S1(a, b) is the value of M rounded, S2(a, b) denotes the first 2-Gram word string distance, and S3(a, b) denotes the first 3-Gram word string distance; the second word string distance S2(i, j) is the value of N rounded, S2'(i, j) denotes the second 2-Gram word string distance, and S3(i, j) denotes the second 3-Gram word string distance. G2(a) and G2(b) respectively denote the sets of 2-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; G2(i) and G2(j) respectively denote the sets of 2-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the sets of 3-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the sets of 3-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set.
Optionally, the step of performing a second screening on the data preprocessing set based on deep learning according to the data reference set includes:
constructing a convolution cyclic neural network model, and training the convolution cyclic neural network model based on the first data set and the data reference set;
and screening and identifying the data preprocessing set by using the trained convolution cyclic neural network model to obtain the second data set.
Optionally, the step of outputting the processed text information according to the first data set and the second data set includes:
analyzing the first data set and the second data set to obtain an intersection and a union of the first data set and the second data set;
and outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection.
Optionally, the step of outputting the processed text information according to the first data set and the second data set further includes:
and outputting second text information according to the intersection and the union, wherein the second text information comprises all elements that remain in the union after the elements repeated with the intersection are removed.
A text information processing system comprising:
the receiving unit is used for receiving text information to be processed and receiving a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
the preprocessing unit is used for preprocessing the text information to generate a plurality of words and parameters;
the classification extraction unit is used for performing classification extraction on the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
the screening unit is used for screening the data preprocessing set twice to obtain a first data set and a second data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have mapping relations with each other, and the second data set comprises a second term set, a second term description set and a second parameter set which have mapping relations with each other;
and the output unit is used for outputting the processed text information according to the first data set and the second data set.
Optionally, the screening unit includes a keyword matching module and a deep learning module, the keyword matching module performs a first screening on the data preprocessing set to obtain the first data set, and the deep learning module performs a second screening on the data preprocessing set to obtain the second data set.
A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform any of the above-described text information processing methods.
An electronic device, comprising:
a processor;
a computer readable storage medium having instructions stored thereon, which when executed by the processor, implement the text information processing method of any one of the above.
As described above, the text information processing method, system, medium, and apparatus provided by the present invention have at least the following beneficial effects:
on the basis of preprocessing the text information to obtain a data preprocessing set, the data preprocessing set undergoes a first screening based on keyword matching against a data reference set and a second screening based on deep learning, and the processed text information is generated by combining the two screened data sets; this effectively prevents erroneous screening of the text information and improves its processing efficiency and accuracy. Each data set comprises a term set, a term description set, and a parameter set that have a mapping relationship with each other; on the basis of comparing and screening each subset individually, the auxiliary verification provided by the screening results of the other subsets in the mapping relationship further improves the screening efficiency and accuracy.
Drawings
Fig. 1 is a schematic step diagram of a text information processing method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a text message processing system according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a user terminal according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides a text information processing method, including the following steps:
and S1, acquiring the text information to be processed. For example, a large amount of medical text information is acquired from a paper document or a medical database through scanning recognition or text transmission and other acquisition modes.
And S2, preprocessing the text information to generate a plurality of words and parameters.
In an optional embodiment of the present invention, when the text information is preprocessed, at least data cleaning, punctuation removal, word segmentation, stop word removal, and repeated word removal are performed on the text information in sequence.
The detailed steps of data cleaning, word segmentation and stop word removal can refer to the prior art, and are not described herein again.
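As an illustrative sketch rather than the patent's actual implementation, the preprocessing chain of step S2 can be outlined in Python; the stop-word list and the whitespace-based segmentation below are stand-ins for the real resources that the patent leaves to the prior art:

```python
import re

# Illustrative stop-word list; a real system would use a full list and a
# trained word segmenter for Chinese text.
STOP_WORDS = {"的", "了", "和", "a", "an", "the", "of"}

def preprocess(text: str) -> list[str]:
    """Sketch of step S2: cleaning, punctuation removal, segmentation,
    stop-word removal, and repeated-word removal, in sequence."""
    text = text.lower().strip()              # data cleaning: normalize case
    text = re.sub(r"[^\w\s]|_", " ", text)   # punctuation removal
    tokens = text.split()                    # segmentation (stand-in)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    return list(dict.fromkeys(tokens))       # de-duplicate, keep order

print(preprocess("The patient, the patient of hypertension!  BP 150/95"))
# → ['patient', 'hypertension', 'bp', '150', '95']
```

Note that numeric parameters such as "150/95" survive the pipeline as tokens, which is what allows the later classification step to route them into a parameter preprocessing set.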
S3, classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a term preprocessing set, a term description preprocessing set and a parameter preprocessing set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S3 of classifying and extracting the words and the parameters further includes:
s31, performing part-of-speech tagging on the words;
and S32, classifying and extracting the words and the parameters according to the parts of speech and the context of the words to obtain a data preprocessing set.
The data preprocessing set comprises a term preprocessing set, a term description preprocessing set and a parameter preprocessing set, mapping relations exist among the term preprocessing set, the term description preprocessing set and the parameter preprocessing set, and the same object is described, so that association judgment during subsequent identification and screening is facilitated, and the accuracy of identification and screening is improved.
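The mapping relationship among the three preprocessing subsets can be pictured as parallel records describing the same object; the `Entry` type, field names, and sample values below are hypothetical illustrations, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """One object: the i-th entries of the three subsets belong together."""
    term: str         # element of the term preprocessing set
    description: str  # corresponding element of the term description set
    parameter: str    # corresponding element of the parameter set

entries = [
    Entry("blood pressure", "systolic/diastolic reading", "150/95 mmHg"),
    Entry("heart rate", "beats per minute", "88 bpm"),
]

# Projecting out the three subsets keeps them index-aligned, so keeping or
# discarding a term automatically keeps or discards its description and
# parameter during screening.
term_set        = [e.term for e in entries]
description_set = [e.description for e in entries]
parameter_set   = [e.parameter for e in entries]
assert len(term_set) == len(description_set) == len(parameter_set)
```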
S4, acquiring a data reference set of the related field, wherein the data reference set comprises a term reference set, a term description reference set and a parameter reference set which have a mapping relation with each other.
In an optional embodiment of the present invention, the data reference set of the related field is obtained, through channels such as the Internet or a blockchain, from professionally or authoritatively recognized medical dictionaries, medical databases, and the like, and serves as the comparison standard in the subsequent identification and screening; it includes a term reference set, a term description reference set, and a parameter reference set that have a mapping relationship with each other.
S5, according to the data reference set, based on keyword matching, a first screening is carried out on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S5 of performing a first filtering on the data preprocessing set based on the keyword matching according to the data reference set further includes:
S51, for the a-th element in the term preprocessing set, sequentially calculating a first word string distance S1(a, b) between the a-th element and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
S52, if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it to the first term set, adding the corresponding element in the term description preprocessing set to the first term description set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
S53, if the element values of the first word string distance set S1(a) do not contain zero, further judging whether S1(a) contains elements whose values are smaller than a first threshold;
S54, if at least one element of the first word string distance set S1(a) has a value smaller than the first threshold, sorting the elements of S1(a) smaller than the first threshold in ascending order to obtain a first word string distance screening set S10(a);
S55, for the elements of the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i, j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
S56, if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of zero-valued elements in S2(i) is greater than a second threshold;
S57, if the number of zero-valued elements in the second word string distance set S2(i) is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it to the first term description set, adding the corresponding element in the term preprocessing set to the first term set, and adding the corresponding element in the parameter preprocessing set to the first parameter set;
S58, if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
In addition, the step S5 of performing a first filtering on the data preprocessing set based on the keyword matching according to the data reference set further includes:
S59, if the element values of the first word string distance set S1(a) do not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, together with the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set;
S510, if the number of zero-valued elements in the second word string distance set S2(i) is smaller than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, together with the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
The first threshold is 1 to 2 and can be flexibly adjusted according to the word string length of the a-th element in the term preprocessing set; the second threshold is 2/3 of the number of elements contained in the corresponding i-th element of the term description preprocessing set.
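The keep / re-check / discard branching of steps S52 through S59 can be sketched as follows; the function name and the default threshold of 2 (within the stated range of 1 to 2) are illustrative assumptions:

```python
def first_screen_decision(s1_a: list[int], first_threshold: int = 2) -> str:
    """Decision logic sketched from steps S52-S59: an exact match keeps the
    term outright, a near match triggers the second check against the term
    description set, and everything else is discarded."""
    if 0 in s1_a:
        return "keep"                # S52: some reference term matches exactly
    if any(d < first_threshold for d in s1_a):
        return "check-description"   # S54-S58: verify via the description set
    return "discard"                 # S59: no sufficiently close reference term

print(first_screen_decision([3, 0, 5]))  # → keep
print(first_screen_decision([3, 1, 5]))  # → check-description
print(first_screen_decision([3, 4, 5]))  # → discard
```

Because the subsets are index-aligned, whichever branch is taken applies simultaneously to the term, its description, and its parameter.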
In detail, the first screening of the data preprocessing set in step S5 is based on the N-Gram model: word segmentation and word string distance calculation are performed with the N-Gram model, the data preprocessing set is compared against the data reference set, and the elements of the data preprocessing set that match elements of the data reference set are retained to obtain the first data set.
In an alternative embodiment of the present invention, the first word string distance S1(a, b) and the second word string distance S2(i, j) are calculated respectively as:
S1(a, b) = [M];
M = [S2(a, b) + S3(a, b)] / 2;
S2(a, b) = |G2(a)| + |G2(b)| − 2*|G2(a) ∩ G2(b)|;
S3(a, b) = |G3(a)| + |G3(b)| − 2*|G3(a) ∩ G3(b)|;
S2(i, j) = [N];
N = [S2'(i, j) + S3(i, j)] / 2;
S2'(i, j) = |G2(i)| + |G2(j)| − 2*|G2(i) ∩ G2(j)|;
S3(i, j) = |G3(i)| + |G3(j)| − 2*|G3(i) ∩ G3(j)|;
wherein the first word string distance S1(a, b) is the value of M rounded, S2(a, b) denotes the first 2-Gram word string distance, and S3(a, b) denotes the first 3-Gram word string distance; the second word string distance S2(i, j) is the value of N rounded, S2'(i, j) denotes the second 2-Gram word string distance, and S3(i, j) denotes the second 3-Gram word string distance. G2(a) and G2(b) respectively denote the sets of 2-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; G2(i) and G2(j) respectively denote the sets of 2-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the sets of 3-Grams in the a-th element of the term preprocessing set and the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the sets of 3-Grams in the i-th element of the term description preprocessing set and the corresponding j-th element of the term description reference set.
The first word string distance S1(a, b) and the second word string distance S2(i, j) take the rounded averages of the corresponding 2-Gram and 3-Gram word string distances; this appropriately increases the fault tolerance when recognizing longer character strings and reduces the probability of screening errors.
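Assuming that the bracket notation [M] denotes ordinary rounding, the distance above can be sketched in Python as follows (the function names are illustrative, not from the patent):

```python
def ngrams(s: str, n: int) -> set[str]:
    """Set of character n-grams of s, i.e. Gn(s) in the patent's notation."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def gram_distance(x: str, y: str, n: int) -> int:
    """|Gn(x)| + |Gn(y)| - 2*|Gn(x) ∩ Gn(y)|, matching the formulas above."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx) + len(gy) - 2 * len(gx & gy)

def word_string_distance(x: str, y: str) -> int:
    """Rounded average of the 2-Gram and 3-Gram word string distances."""
    return round((gram_distance(x, y, 2) + gram_distance(x, y, 3)) / 2)

# Identical strings are at distance 0; a one-character typo gives a small
# nonzero distance, which is what the first threshold tolerates.
print(word_string_distance("hypertension", "hypertension"))  # → 0
print(word_string_distance("hypertension", "hypertenzion"))  # → 5
```

Averaging the 2-Gram and 3-Gram distances, then rounding, is what gives longer strings some slack against isolated typing errors.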
S6, according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other.
In an optional embodiment of the present invention, the step S6 of performing the second filtering on the data preprocessing set based on the deep learning according to the data reference set further includes:
s61, constructing a convolution cyclic neural network model, and training the convolution cyclic neural network model based on the first data set and the data reference set;
and S62, screening and identifying the data preprocessing set by using the trained convolution cyclic neural network model to obtain a second data set.
In an optional embodiment of the present invention, the convolutional recurrent neural network model at least comprises:
a CNN (convolutional layers), which uses a deep CNN to extract features from the input and obtain feature maps;
an RNN (recurrent layers), which uses a bidirectional RNN (BLSTM) to predict the feature sequence, learning each feature vector in the sequence and outputting a distribution over predicted labels;
a CTC loss (transcription layer), which uses the CTC loss to convert the series of label distributions obtained from the recurrent layers into the final label sequence.
The specific structure of the convolutional recurrent neural network model can be referred to in the prior art, and is not described in detail herein.
When the trained convolution cyclic neural network model is used for screening and identifying, relevant elements (namely elements with mapping relations) in the term preprocessing set, the term description preprocessing set and the parameter preprocessing set are identified and screened in sequence, and a second data set is obtained.
And S7, outputting the processed text information according to the first data set and the second data set.
In detail, the step S7 of outputting the processed text information according to the first data set and the second data set further includes:
s71, analyzing the first data set and the second data set to obtain the intersection and the union of the first data set and the second data set;
s72, outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection;
and S73, outputting second text information according to the intersection and the union, wherein the second text information comprises all elements that remain in the union after the elements repeated with the intersection are removed.
The first text information is output according to the intersection of the first data set and the second data set, namely the data which passes through the two screening processes forms the first text information, and the first text information is the default screening error-free information, so that the screening accuracy is improved; and outputting second text information according to the elements which are removed from the union of the first data set and the second data set and are repeated with the intersection, namely, the second text information is formed by data which is selected in the two previous and next screening processes and is only selected once, and the second text information is suspected information, so that the probability of wrong screening can be effectively reduced, and the screening accuracy is further improved.
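Steps S71 to S73 reduce to two set operations: the intersection yields the first text information, and the union minus the intersection yields the second (suspect) text information. A minimal sketch over element identifiers (the sample terms are invented for illustration):

```python
def split_outputs(first_data_set, second_data_set):
    """Return (first_text, second_text): elements passing both screenings,
    and elements passing exactly one screening (union minus intersection)."""
    a, b = set(first_data_set), set(second_data_set)
    inter = a & b            # passed both screenings -> first text information
    union = a | b
    suspect = union - inter  # passed only one screening -> second (suspect) text
    return inter, suspect

first, second = split_outputs({"fever", "cough", "38.5C"},
                              {"fever", "38.5C", "headache"})
# first  -> {"fever", "38.5C"}; second -> {"cough", "headache"}
```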
Referring to Fig. 2, the present invention further provides a text information processing system for executing the text information processing method of the foregoing method embodiment; since the technical principle of the system embodiment is similar to that of the foregoing method embodiment, repeated description of the same technical details is omitted.
As shown in fig. 2, in an alternative embodiment of the present invention, a text information processing system includes:
a receiving unit 10, configured to receive text information to be processed, and further configured to receive a data reference set of a related field, where the data reference set comprises a term reference set, a term description reference set, and a parameter reference set that have mapping relationships with each other;
the preprocessing unit 11 is configured to preprocess the text information to generate a plurality of words and parameters;
the classification extraction unit 12 is configured to perform classification extraction on the multiple words and parameters to obtain a corresponding data preprocessing set, where the data preprocessing set includes a term preprocessing set, a term description preprocessing set, and a parameter preprocessing set, which have a mapping relationship therebetween;
the screening unit 13 is configured to screen the data preprocessing set twice to obtain a first data set and a second data set, where the first data set comprises a first term set, a first term description set, and a first parameter set that have mapping relationships with each other, and the second data set comprises a second term set, a second term description set, and a second parameter set that have mapping relationships with each other;
and the output unit 14 is used for outputting the processed text information according to the first data set and the second data set.
The receiving unit 10 is configured to assist in performing the steps S1 and S4 described in the foregoing method embodiment, the preprocessing unit 11 is configured to perform the step S2 described in the foregoing method embodiment, the classification extracting unit 12 is configured to perform the step S3 described in the foregoing method embodiment, the screening unit 13 is configured to perform the steps S5 to S6 described in the foregoing method embodiment, and the output unit 14 is configured to perform the step S7 described in the foregoing method embodiment.
Further, the screening unit 13 includes a keyword matching module 131 and a deep learning module 132, where the keyword matching module 131 performs a first screening on the data preprocessing set to obtain a first data set, and the deep learning module 132 performs a second screening on the data preprocessing set to obtain a second data set.
Based on the same inventive concept of the foregoing embodiment, the present invention further provides a computer-readable storage medium, on which a plurality of instructions are stored, where the instructions are suitable for being loaded by a processor to execute the text information processing method. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Based on the same inventive concept of the foregoing embodiment, the present invention also provides an electronic device, which may include: a processor; a computer readable storage medium having stored thereon instructions, which when executed by a processor, cause an electronic device to execute the text information processing method described in fig. 1.
In practical applications, the electronic device may be used as a user terminal or a server; examples of the user terminal may include: a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.
Fig. 3 is a schematic diagram of a hardware structure of a user terminal according to an alternative embodiment of the present invention. As shown, the user terminal may include: an input device 200, a processor 201, an output device 202, a memory 203, and at least one communication bus 204. The communication bus 204 is used to implement communication connections between the elements. The memory 203 may comprise a high-speed RAM memory and may also include a non-volatile memory (NVM), such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the processor 201 may be implemented by, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 201 is coupled to the input device 200 and the output device 202 through a wired or wireless connection.
Optionally, the input device 200 may include a variety of input devices, for example, may include at least one of a user-oriented user interface, a device-oriented device interface, a software-programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output device 202 may include a display, a sound, or other output device.
In this embodiment, the processor of the user terminal includes functions for executing the modules of the text information processing system described above; for specific functions and technical effects, refer to the foregoing embodiments, which are not repeated here.
In summary, in the text information processing method, system, medium, and device provided by the present invention, the text information is preprocessed to obtain a data preprocessing set; according to a data reference set, a first screening based on keyword matching and a second screening based on deep learning are performed on the data preprocessing set, and the processed text information is generated by combining the two resulting data sets, which effectively prevents erroneous screening of text information and improves the accuracy and processing efficiency of text information processing. Each data set comprises a proper noun set, a proper noun description set, and a parameter set that have mapping relationships with each other; on the basis of comparing and screening each subset individually, the screening results of the other subsets with mapping relationships provide auxiliary verification, which further improves screening accuracy and efficiency. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (13)

1. A text information processing method, comprising:
acquiring text information to be processed;
preprocessing the text information to generate a plurality of words and parameters;
classifying and extracting the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
acquiring a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
according to the data reference set, based on keyword matching, carrying out first screening on the data preprocessing set to obtain a first data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have a mapping relation with each other;
according to the data reference set, based on deep learning, carrying out secondary screening on the data preprocessing set to obtain a second data set, wherein the second data set comprises a second term set, a second term description set and a second parameter set which have a mapping relation with each other;
outputting the processed text information according to the first data set and the second data set;
the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set comprises:
sequentially calculating a first word string distance S1(a,b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it into the first term set, adding the corresponding element in the term description preprocessing set into the first term description set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether the first word string distance set S1(a) contains elements with values smaller than a first threshold;
if at least one element of the first word string distance set S1(a) is smaller than the first threshold, statistically sorting the elements of S1(a) that are smaller than the first threshold in ascending order, obtaining a first word string distance screening set S10(a);
for the elements in the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i,j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of elements of S2(i) with value zero is greater than a second threshold;
if the number of elements of the second word string distance set S2(i) with value zero is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it into the first term description set, adding the corresponding element in the term preprocessing set into the first term set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
2. The method according to claim 1, wherein when preprocessing the text information, at least data cleaning processing, punctuation removal processing, word segmentation processing, stop word removal processing, and repeat word removal processing are sequentially performed on the text information.
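The sequential preprocessing of claim 2 (data cleaning, punctuation removal, word segmentation, stop word removal, repeated word removal) can be sketched as follows; the regular expressions, the whitespace tokenizer, and the tiny stop word list are illustrative assumptions — a real system would use domain-specific word segmentation and stop word lists:

```python
import re

STOP_WORDS = {"the", "a", "of", "is"}  # illustrative; real lists are much larger

def preprocess(text):
    """Sequentially: clean -> strip punctuation -> segment -> drop stop words -> dedupe."""
    text = re.sub(r"\s+", " ", text).strip()        # data cleaning (normalize whitespace)
    text = re.sub(r"[^\w\s.]", " ", text)           # punctuation removal (keep decimal points)
    tokens = text.split()                           # word segmentation (whitespace tokenizer)
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop word removal
    seen, out = set(), []
    for t in tokens:                                # repeated word removal, order-preserving
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(preprocess("The body temperature  is 38.5, the temperature is high!"))
# -> ['body', 'temperature', '38.5', 'high']
```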
3. The text information processing method according to claim 1 or 2, wherein the step of classifying and extracting the plurality of words and the parameters includes:
performing part-of-speech tagging on the words;
and classifying and extracting the plurality of words and the parameters according to the part of speech and the context of the words to obtain the data preprocessing set.
4. The method of claim 3, wherein the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set further comprises:
if the first word string distance set S1(a) does not contain zero and no element of S1(a) is smaller than the first threshold, abandoning the a-th element in the term preprocessing set, and abandoning the corresponding element in the term description preprocessing set and the corresponding element in the parameter preprocessing set.
5. The method of claim 4, wherein the step of performing the first screening on the data preprocessing set based on keyword matching according to the data reference set further comprises:
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
6. The text information processing method according to claim 5, wherein the first word string distance S1(a,b) and the second word string distance S2(i,j) are respectively calculated as:
S1(a,b) = [M];
M = [S2(a,b) + S3(a,b)]/2;
S2(a,b) = |G2(a)| + |G2(b)| − 2*|G2(a)∩G2(b)|;
S3(a,b) = |G3(a)| + |G3(b)| − 2*|G3(a)∩G3(b)|;
S2(i,j) = [N];
N = [S2’(i,j) + S3(i,j)]/2;
S2’(i,j) = |G2(i)| + |G2(j)| − 2*|G2(i)∩G2(j)|;
S3(i,j) = |G3(i)| + |G3(j)| − 2*|G3(i)∩G3(j)|;
wherein the first word string distance S1(a,b) is the value of M rounded, S2(a,b) denotes the first 2-Gram word string distance, and S3(a,b) denotes the first 3-Gram word string distance; the second word string distance S2(i,j) is the value of N rounded, S2’(i,j) denotes the second 2-Gram word string distance, and S3(i,j) denotes the second 3-Gram word string distance; G2(a) and G2(b) respectively denote the set of 2-Grams in the a-th element of the term preprocessing set and in the b-th element of the term reference set; G2(i) and G2(j) respectively denote the set of 2-Grams in the i-th element of the term description preprocessing set and in the corresponding j-th element of the term description reference set; G3(a) and G3(b) respectively denote the set of 3-Grams in the a-th element of the term preprocessing set and in the b-th element of the term reference set; and G3(i) and G3(j) respectively denote the set of 3-Grams in the i-th element of the term description preprocessing set and in the corresponding j-th element of the term description reference set.
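The formulas of claim 6 combine a 2-Gram and a 3-Gram set-difference distance and round their mean; the distance is zero exactly when the two strings have identical n-gram sets. A minimal sketch (character-level n-grams over the raw strings are an assumption here — the claim does not fix whether grams are taken over characters or words):

```python
def ngrams(s, n):
    """Set of character n-grams G_n(s) of string s."""
    return {s[k:k + n] for k in range(len(s) - n + 1)}

def gram_distance(x, y, n):
    """|G_n(x)| + |G_n(y)| - 2*|G_n(x) ∩ G_n(y)|, per the claim's formulas."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx) + len(gy) - 2 * len(gx & gy)

def word_string_distance(x, y):
    """Rounded mean of the 2-Gram and 3-Gram distances, per claim 6."""
    return round((gram_distance(x, y, 2) + gram_distance(x, y, 3)) / 2)

# Identical strings -> distance 0; the keyword-matching screen keeps such elements.
print(word_string_distance("hypertension", "hypertension"))  # -> 0
print(word_string_distance("hypertension", "hypotension"))   # -> 6
```

Elements whose distance set contains zero are exact keyword matches; the first and second thresholds of claims 1, 4, and 5 then govern how near-matches are kept or discarded.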
7. The method of claim 6, wherein the step of performing the second screening on the data preprocessing set based on deep learning according to the data reference set comprises:
constructing a convolutional recurrent neural network model, and training the convolutional recurrent neural network model based on the first data set and the data reference set;
and screening and identifying the data preprocessing set by using the trained convolutional recurrent neural network model to obtain the second data set.
8. The method of claim 7, wherein the step of outputting the processed text information according to the first data set and the second data set comprises:
analyzing the first data set and the second data set to obtain an intersection and a union of the first data set and the second data set;
and outputting first text information according to the intersection, wherein the first text information comprises all elements of the intersection.
9. The method of claim 8, wherein the step of outputting the processed text information according to the first data set and the second data set further comprises:
and outputting second text information according to the intersection and the union, wherein the second text information comprises all elements which are removed from the union after the elements are repeated with the intersection.
10. A text information processing system, comprising:
the receiving unit is used for receiving text information to be processed and receiving a data reference set of a related field, wherein the data reference set comprises a proper noun reference set, a proper noun description reference set and a parameter reference set which have a mapping relation with each other;
the preprocessing unit is used for preprocessing the text information to generate a plurality of words and parameters;
the classification extraction unit is used for performing classification extraction on the words and the parameters to obtain a corresponding data preprocessing set, wherein the data preprocessing set comprises a proper noun preprocessing set, a proper noun description preprocessing set and a parameter preprocessing set which have a mapping relation with each other;
the screening unit is used for screening the data preprocessing set twice to obtain a first data set and a second data set, wherein the first data set comprises a first term set, a first term description set and a first parameter set which have mapping relations with each other, and the second data set comprises a second term set, a second term description set and a second parameter set which have mapping relations with each other;
the output unit is used for outputting the processed text information according to the first data set and the second data set;
the step of the screening unit performing the first screening to obtain the first data set includes:
sequentially calculating a first word string distance S1(a,b) between the a-th element in the term preprocessing set and the b-th element in the term reference set, obtaining a first word string distance set S1(a);
if the element values of the first word string distance set S1(a) contain zero, retaining the a-th element and adding it into the first term set, adding the corresponding element in the term description preprocessing set into the first term description set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the element values of the first word string distance set S1(a) do not contain zero, further judging whether the first word string distance set S1(a) contains elements with values smaller than a first threshold;
if at least one element of the first word string distance set S1(a) is smaller than the first threshold, statistically sorting the elements of S1(a) that are smaller than the first threshold in ascending order, obtaining a first word string distance screening set S10(a);
for the elements in the first word string distance screening set S10(a), starting from the first element, sequentially calculating a second word string distance S2(i,j) between the corresponding i-th element in the term description preprocessing set and the corresponding j-th element in the term description reference set, obtaining a second word string distance set S2(i);
if the element values of the second word string distance set S2(i) contain zero, further judging whether the number of elements of S2(i) with value zero is greater than a second threshold;
if the number of elements of the second word string distance set S2(i) with value zero is greater than or equal to the second threshold, retaining the corresponding element in the term description preprocessing set and adding it into the first term description set, adding the corresponding element in the term preprocessing set into the first term set, and adding the corresponding element in the parameter preprocessing set into the first parameter set;
if the number of elements of the second word string distance set S2(i) with value zero is less than the second threshold, abandoning the corresponding i-th element in the term description preprocessing set, and abandoning the corresponding element in the term preprocessing set and the corresponding element in the parameter preprocessing set.
11. The system of claim 10, wherein the filtering unit comprises a keyword matching module and a deep learning module, wherein the keyword matching module filters the pre-processed data set for a first time to obtain the first data set, and the deep learning module filters the pre-processed data set for a second time to obtain the second data set.
12. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the method of processing textual information according to any of claims 1-9.
13. An electronic device, comprising:
a processor;
a computer-readable storage medium having stored thereon instructions which, when executed by the processor, implement the text information processing method according to any one of claims 1 to 9.
CN202110765335.2A 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus Active CN113254658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110765335.2A CN113254658B (en) 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus


Publications (2)

Publication Number Publication Date
CN113254658A CN113254658A (en) 2021-08-13
CN113254658B true CN113254658B (en) 2021-12-21

Family

ID=77190886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110765335.2A Active CN113254658B (en) 2021-07-07 2021-07-07 Text information processing method, system, medium, and apparatus

Country Status (1)

Country Link
CN (1) CN113254658B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018517963A (en) * 2015-04-24 2018-07-05 日本電気株式会社 Information processing apparatus, information processing method, and program
AU2019240633A1 (en) * 2014-08-14 2019-10-24 Accenture Global Services Limited System for automated analysis of clinical text for pharmacovigilance
CN111046660A (en) * 2019-11-21 2020-04-21 深圳无域科技技术有限公司 Method and device for recognizing text professional terms
CN111931477A (en) * 2020-09-29 2020-11-13 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium
CN111949759A (en) * 2019-05-16 2020-11-17 北大医疗信息技术有限公司 Method and system for retrieving medical record text similarity and computer equipment
CN112364625A (en) * 2020-11-19 2021-02-12 深圳壹账通智能科技有限公司 Text screening method, device, equipment and storage medium
CN112632980A (en) * 2020-12-30 2021-04-09 广州友圈科技有限公司 Enterprise classification method and system based on big data deep learning and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8700589B2 (en) * 2011-09-12 2014-04-15 Siemens Corporation System for linking medical terms for a medical knowledge base
CN107480135B (en) * 2017-07-31 2022-01-07 京东方科技集团股份有限公司 Data processing method, medical phrase processing system and medical diagnosis and treatment system
CN107562721B (en) * 2017-08-09 2020-11-03 刘聪 Noun classification method based on topology
CN109783811B (en) * 2018-12-26 2023-10-31 东软集团股份有限公司 Method, device, equipment and storage medium for identifying text editing errors
CN110442677A (en) * 2019-07-04 2019-11-12 平安科技(深圳)有限公司 Text matches degree detection method, device, computer equipment and readable storage medium storing program for executing
CN110491519B (en) * 2019-07-17 2024-01-02 上海明品医学数据科技有限公司 Medical data checking method
CN110442869B (en) * 2019-08-01 2021-02-23 腾讯科技(深圳)有限公司 Medical text processing method and device, equipment and storage medium thereof
CN111652299A (en) * 2020-05-26 2020-09-11 泰康保险集团股份有限公司 Method and equipment for automatically matching service data
CN112800758A (en) * 2021-04-08 2021-05-14 明品云(北京)数据科技有限公司 Method, system, equipment and medium for distinguishing similar meaning words in text



Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN106847288B (en) Error correction method and device for voice recognition text
CN109471945B (en) Deep learning-based medical text classification method and device and storage medium
WO2017127296A1 (en) Analyzing textual data
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN112417855A (en) Text intention recognition method and device and related equipment
CN112101010A (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN113076749A (en) Text recognition method and system
CN112632956A (en) Text matching method, device, terminal and storage medium
CN113254658B (en) Text information processing method, system, medium, and apparatus
CN110619119A (en) Intelligent text editing method and device and computer readable storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115048927A (en) Method, device and equipment for identifying disease symptoms based on text classification
CN113096667A (en) Wrongly-written character recognition detection method and system
CN113453065A (en) Video segmentation method, system, terminal and medium based on deep learning
CN114120425A (en) Emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220711

Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai

Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.

Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing

Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.
