CN108052508B

CN108052508B - Information extraction method and device

Info

Publication number: CN108052508B
Application number: CN201711476786.4A
Authority: CN
Inventors: 李重勋; 王利叶; 胡可云; 陈联忠
Original assignee: Beijing Jiahesen Health Technology Co., Ltd.
Current assignee: Beijing Jiahesen Health Technology Co., Ltd.
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2021-11-09
Anticipated expiration: 2037-12-29
Also published as: CN108052508A

Abstract

The embodiments of the present application disclose an information extraction method. The method comprises: segmenting a preset text according to a preset word bank to obtain a first segmentation result; extracting a plurality of undetermined words from the first segmentation result; and determining, from the plurality of undetermined words, the undetermined words that have no inclusion relationship as the information extraction result of the first segmentation result. With this two-pass word segmentation, the longer first segmentation result is extracted first, and shorter information extraction results that have no inclusion relationship are then extracted from it; for example, words representing a body part, a disease, and the like are extracted from a complete word representing an operation name. On the one hand, this increases the amount of information extracted; on the other hand, arranging the first segmentation result and its information extraction results in a hierarchy enhances the data structuring effect and facilitates data query and positioning. The embodiments of the present application also disclose an information extraction device.

Description

Information extraction method and device

Technical Field

The present application relates to the field of text processing, and in particular, to an information extraction method and apparatus.

Background

Electronic Medical Records (EMRs) are also called computerized medical record systems or computer-based patient records. An EMR is a digital record of medical services, comprising text, symbols, charts, data, images, and the like, that the medical staff of a medical institution generate with an information system in the course of the clinical diagnosis, treatment, and guided intervention of outpatients and inpatients. The development of EMRs makes it convenient for doctors to obtain patient information in real time and to conduct clinical research. However, EMRs currently contain both structured and unstructured data, and much important information exists in the unstructured data, such as the chief complaint, the history of present illness, and the past medical history. Therefore, to make effective use of EMRs and discover the useful information in them, the unstructured data needs to be converted into structured data; this process is information extraction.

In the information extraction process, it is often necessary to segment the text based on a preset word bank to obtain useful information, such as words representing diseases, symptoms, operations, and the like. In the prior art, word segmentation is performed based on the longest matching principle, that is, the text is segmented according to the longest word matched in the word bank. In many cases, however, the longest word itself contains other, shorter words that are also very useful information. These shorter words cannot be extracted under the longest matching principle, so less information is extracted and the data structuring effect suffers.

For example, assume that the segmentation result "diverticulectomy in bladder" is obtained based on the longest matching principle. As a whole, this word is an operation name, but it contains a site name ("in bladder"), a disease name ("diverticulum"), and an operation name ("resection"). Because "diverticulectomy in bladder" exists in the word bank, the three words "in bladder", "diverticulum", and "resection" cannot be extracted under the longest matching principle even if they also exist in the word bank.
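The limitation described above can be illustrated with a minimal Python sketch of forward longest matching. It is not taken from the application: the function name `longest_match_segment`, the toy word bank, and the English phrase "bladder diverticulum resection" (a hypothetical rendering chosen so that the shorter entries are literal substrings of the longer one) are all assumptions made for illustration.

```python
def longest_match_segment(text, word_bank):
    """Greedy forward longest matching (illustrative sketch only)."""
    max_len = max(len(w) for w in word_bank)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in word_bank:
                tokens.append(piece)
                i += size
                break
        else:
            # No word-bank entry starts here: emit one character as-is.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical word bank: the long operation name and its three parts.
word_bank = {"bladder diverticulum resection", "bladder", "diverticulum", "resection"}
result = longest_match_segment("bladder diverticulum resection", word_bank)
print(result)
```

In this sketch only the full operation name is returned; "bladder", "diverticulum", and "resection" are in the word bank but are never emitted, which is the loss of information the background describes.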

Disclosure of Invention

In order to solve the technical problem that a large number of words in a word segmentation result obtained according to a longest matching principle cannot be extracted in the prior art, the embodiment of the application provides an information extraction method and device.

In a first aspect, an embodiment of the present application provides an information extraction method, where the method includes:

segmenting words of a preset text according to a preset word bank to obtain a first segmentation result;

extracting a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not comprise the first segmentation result;

and determining, from the plurality of undetermined words, the undetermined words that have no inclusion relationship as the information extraction result of the first segmentation result.

Optionally, the determining, from the plurality of undetermined words, of the undetermined words that have no inclusion relationship as the information extraction result of the first segmentation result includes:

sorting the plurality of undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;

if the current undetermined word has an adjacent next undetermined word, judging whether the current undetermined word has a containing relation with the adjacent next undetermined word, if not, taking the current undetermined word and/or the adjacent next undetermined word as the information extraction result, wherein the current undetermined word is one of the plurality of undetermined words.

Optionally, if the currently pending word and the next adjacent pending word have a containment relationship, the method further includes:

if the currently undetermined word contains the adjacent next undetermined word, covering the currently undetermined word with the adjacent next undetermined word, enabling the adjacent next undetermined word to be the currently undetermined word, and executing the step of judging whether the currently undetermined word contains the adjacent next undetermined word;

if the currently undetermined word is contained by the adjacent next undetermined word, the adjacent next undetermined word is made to be the currently undetermined word, and the step of judging whether the currently undetermined word contains the adjacent next undetermined word is executed.

Optionally, if the information extraction result includes a first pending word and a second pending word, the method further includes:

judging whether the first undetermined word and the second undetermined word comprise cross words or not based on the preset word bank, wherein the cross words are a part of the first undetermined word and a part of the second undetermined word;

if yes, judging whether the attribute of the cross word is the same as the attribute of a first undetermined sub-word, wherein the first undetermined sub-word is the word adjacent to the cross word in the first undetermined word; if the attributes are the same, removing the cross word from the second undetermined word to obtain a modified second undetermined word; and if not, removing the cross word from the first undetermined word to obtain a modified first undetermined word.

Optionally, if the attribute of the cross word is the same as the attribute of the first pending subword, the method further includes:

storing the mapping relation among the first word segmentation result, the first pending word and the modified second pending word;

if the attribute of the cross word is different from the attribute of the first pending subword, the method further comprises the following steps:

and storing the mapping relation among the first word segmentation result, the modified first word to be determined and the second word to be determined.

In a second aspect, an embodiment of the present application provides an information extraction apparatus, including:

the word segmentation unit is used for segmenting words of a preset text according to a preset word bank to obtain a first word segmentation result;

the extracting unit is used for extracting a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not comprise the first segmentation result;

and the determining unit is used for determining the undetermined words without containing relations from the undetermined words as the information extraction result of the first segmentation result.

Optionally, the determining unit includes:

a sorting subunit, configured to sort the multiple undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;

a judging subunit, configured to judge, if the current undetermined word has an adjacent next undetermined word, whether the current undetermined word has an inclusion relationship with the adjacent next undetermined word, where the current undetermined word is one of the multiple undetermined words;

and a determining subunit, configured to, if the judgment result is negative, use the current undetermined word and/or the adjacent next undetermined word as the information extraction result.

Optionally, the determining unit further includes an executing subunit, configured to:

when the currently undetermined word and the adjacent next undetermined word have an inclusion relationship, if the currently undetermined word contains the adjacent next undetermined word, covering the currently undetermined word with the adjacent next undetermined word, enabling the adjacent next undetermined word to be the currently undetermined word, and executing the step of judging whether the currently undetermined word contains the adjacent next undetermined word;

Optionally, the apparatus further comprises:

a cross word judging unit, configured to judge, based on the preset word library, whether the first to-be-determined word and the second to-be-determined word include a cross word if the information extraction result includes the first to-be-determined word and the second to-be-determined word, where the cross word is a part of the first to-be-determined word and is a part of the second to-be-determined word;

an attribute judging unit, configured to judge, if a cross word is included, whether the attribute of the cross word is the same as the attribute of a first undetermined sub-word;

a modifying unit, configured to remove the cross word from the second undetermined word if the attributes are the same, to obtain a modified second undetermined word, and to remove the cross word from the first undetermined word if the attributes are different, to obtain a modified first undetermined word, wherein the first undetermined sub-word is a word adjacent to the cross word in the first undetermined word.

Optionally, the apparatus further includes a storage unit, specifically configured to:

if the attribute of the cross word is the same as that of the first undetermined subword, storing the mapping relation among the first word segmentation result, the first undetermined word and the modified second undetermined word;

and if the attribute of the cross word is different from the attribute of the first undetermined subword, storing the mapping relation among the first word segmentation result, the modified first undetermined word and the second undetermined word.

As can be seen from the above, in the information extraction method provided in the embodiment of the present application, first, a preset text is segmented according to a preset word bank to obtain a first segmentation result, then, based on the preset word bank, a plurality of undetermined words included in the first segmentation result are extracted from the first segmentation result, the plurality of undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the plurality of undetermined words as an information extraction result of the first segmentation result.

By adopting the two-time word segmentation, the longer first word segmentation result can be extracted, and the shorter information extraction result of the first word segmentation result without inclusion relation can be further extracted from the longer first word segmentation result, for example, words representing information such as parts, diseases and the like are extracted from the complete words representing the operation names, so that on one hand, the extracted information amount is increased, and on the other hand, the data structuring effect is enhanced through the structural hierarchy arrangement of the first word segmentation result and the information extraction result of the first word segmentation result, and the data searching and positioning are facilitated.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of an information extraction method provided in the present application;

Fig. 2 is a flowchart of a method, provided in the present application, for determining undetermined words without an inclusion relationship from a plurality of undetermined words as the information extraction result of a first segmentation result;

Fig. 3 is a flowchart of an information extraction method provided in the present application;

Fig. 4 is a schematic structural diagram of an information extraction apparatus provided in the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In a conventional information extraction method, a text is usually segmented based on a preset word bank to obtain useful information, such as words representing diseases, symptoms, operations, and the like. The segmentation is performed based on the longest matching principle, that is, according to the longest word matched in the word bank, and the resulting segmentation result is used to extract information from the electronic medical record.

However, in many cases, the longest word also includes other shorter words, which are also very useful information, and these shorter words cannot be extracted based on the longest matching principle, so that the extracted information is less, and the effect of data structuring is affected.

In view of this, an embodiment of the present application provides an information extraction method, where a preset text is segmented according to a preset word bank to obtain a first segmentation result, and then multiple undetermined words included in the first segmentation result are extracted based on the preset word bank, where the multiple undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the multiple undetermined words to serve as an information extraction result of the first segmentation result.

By adopting the two-time word segmentation, the longer first word segmentation result can be extracted, and the shorter information extraction result of the first word segmentation result without inclusion relation can be further extracted from the longer first word segmentation result, for example, words representing information such as parts, diseases and the like are extracted from the complete words representing the operation names, so that on one hand, the extracted information amount is increased, and on the other hand, the data structuring effect is enhanced through the structural hierarchy arrangement of the first word segmentation result and the information extraction result of the first word segmentation result, and the data query and the positioning are facilitated.

In order to make the information extraction method provided by the embodiment of the present application clearer, a specific implementation manner of the information extraction method provided by the embodiment of the present application is described below with reference to the accompanying drawings.

Fig. 1 is a flowchart illustrating an information extraction method according to an embodiment of the present application, and please refer to fig. 1, where the method includes:

S101: Segment the preset text according to the preset word bank to obtain a first segmentation result.

The preset word bank can be understood as a preset dictionary bank. The preset lexicon in the embodiment of the present application may be a lexicon in the medical field, including a symptom (symptom) lexicon, a time (time) lexicon, an organ part (organ) lexicon, a disease (disease) lexicon, an operation (operation) lexicon, and the like. The preset lexicon may comprise at least one of the lexicons mentioned above. In order to improve the accuracy of the word segmentation result, the preset lexicon may include more lexicons, for example, the lexicons may be all used as the preset lexicon.

The preset text is a text designated in advance. In the embodiment of the present application, it may be a text from which information is to be extracted; as an example, it may be an electronic medical record. In other possible implementations of the embodiment of the present application, the preset text may be any other text from which information is to be extracted, which is not limited in the embodiment of the present application.

The preset text may include at least one sentence therein. For each sentence in the preset text, the sentence can be segmented according to the preset word stock, wherein the sentence segmentation can be realized in various ways, including matching with the preset word stock, matching according to a regular rule and the like, and at least one segmentation result can be obtained after the sentence is segmented. The word segmentation result may include a word segmentation result in a preset word bank, or may include a result not in the preset word bank, and the word segmentation result in the preset word bank may be used as the first word segmentation result.

For ease of understanding, this is illustrated.

For example, the preset text is a user's electronic medical record that includes the sentence "knee joint swelling and pain for 6 months, aggravated for 3 weeks." Segmenting the sentence according to the preset word bank yields the following segmentation result:

"knee joint swelling and pain/symptom - 6 months/time - ,/O - aggravated/O - 3 weeks/time - ./O"

Here, symptom denotes a symptom, time denotes a time, and O stands for "other". The segmentation results labeled symptom or time are in the preset word bank, and those labeled O are not. The segmentation results in the preset word bank, "knee joint swelling and pain", "6 months", and "3 weeks", can be determined as first segmentation results, while the segmentation results not in the preset word bank, such as ",", "aggravated", and ".", are not taken as first segmentation results.
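A rough Python sketch of this labeling step is given below. It is an illustration under stated assumptions, not the implementation of the application: the symptom word bank `SYMPTOMS`, the regular expression `TIME_RULE`, the simplified English sentence, and the function name `label_tokens` are all hypothetical. Unlike the example above, the sketch merges contiguous out-of-word-bank text into a single O token.

```python
import re

SYMPTOMS = {"knee joint swelling and pain"}           # hypothetical symptom word bank
TIME_RULE = re.compile(r"\d+ (?:days|weeks|months)")  # hypothetical regular rule for time


def label_tokens(sentence):
    """Return (token, label) pairs; label 'O' marks text outside the word bank."""
    tokens, i, other = [], 0, ""
    while i < len(sentence):
        hit = None
        # Word-bank matching, longest entry first.
        for word in sorted(SYMPTOMS, key=len, reverse=True):
            if sentence.startswith(word, i):
                hit = (word, "symptom")
                break
        # Rule-based matching anchored at position i.
        if hit is None:
            m = TIME_RULE.match(sentence, i)
            if m:
                hit = (m.group(), "time")
        if hit is None:
            other += sentence[i]
            i += 1
            continue
        if other.strip():
            tokens.append((other.strip(), "O"))
        other = ""
        tokens.append(hit)
        i += len(hit[0])
    if other.strip():
        tokens.append((other.strip(), "O"))
    return tokens


tokens = label_tokens("knee joint swelling and pain 6 months, aggravated 3 weeks.")
# Only tokens found via the word bank or rules count as first segmentation results.
first_results = [tok for tok, label in tokens if label != "O"]
print(first_results)
```

Filtering on the label keeps "knee joint swelling and pain", "6 months", and "3 weeks" and discards the O tokens, mirroring the selection described above.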

S102: Extract a plurality of undetermined words from the first segmentation result based on the preset word bank.

An undetermined word is a candidate word, extracted from the first segmentation result, whose status as an information extraction result is yet to be determined. The first segmentation result may be a longer word and an undetermined word a shorter word extracted from it, so the plurality of undetermined words does not include the first segmentation result itself.

The first segmentation result is obtained by segmenting the preset text based on the preset word bank, for example according to the longest matching principle, and is itself located in the preset word bank. The undetermined words included in the first segmentation result can therefore also be extracted from it based on the preset word bank.

In a possible implementation, the first segmentation result may be backtracked: the first segmentation result is scanned against the preset word bank, all possible word formations within it are recorded, the first segmentation result itself is deleted from the record, and the remaining words are used as the undetermined words included in the first segmentation result.

Taking "knee joint swelling and pain" as an example, scanning it against the preset word bank records all possible word formations, including:

knee joint, knee joint swelling and pain, joint swelling and pain, etc.

After the first segmentation result "knee joint swelling and pain" is deleted from the record, the remaining words, "knee joint", "joint swelling and pain", etc., are determined to be the undetermined words included in the first segmentation result "knee joint swelling and pain".
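The scan described in S102 can be sketched in Python as follows. This is only one possible reading: the function name `undetermined_words`, the span representation, and the toy word bank are assumptions, and the scan is character-level, which suits Chinese text (for English, a token-level scan may be preferable).

```python
def undetermined_words(first_result, word_bank):
    """Record every word-bank entry found inside first_result, except itself.

    Returns (start, end, word) spans so that character positions remain
    available for later sorting.
    """
    spans = []
    n = len(first_result)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = first_result[start:end]
            # Delete the first segmentation result itself from the record.
            if piece in word_bank and piece != first_result:
                spans.append((start, end, piece))
    return spans


# Hypothetical word bank for illustration.
word_bank = {
    "knee joint swelling and pain",
    "knee joint",
    "joint",
    "joint swelling and pain",
    "swelling and pain",
    "fever",
}
spans = undetermined_words("knee joint swelling and pain", word_bank)
print([word for _, _, word in spans])
```

The first segmentation result is excluded from its own list of undetermined words, and entries such as "fever" that do not occur in it are never recorded.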

S103: Determine, from the plurality of undetermined words, the undetermined words that have no inclusion relationship as the information extraction result of the first segmentation result.

The purpose of the embodiments of the present application is to extract shorter words containing useful information from longer words, for example, to extract a plurality of shorter words such as diseases and parts from a longer word such as a complete operation name. It is to be understood that the extracted plurality of shorter words may be understood as words that are independent of each other and have no inclusion relationship.

Therefore, the undetermined words included in the first segmentation result can be judged, and the undetermined words without inclusion relation are used as the information extraction result of the first segmentation result. The information extraction result may be regarded as a shorter segmentation result including useful information extracted from a longer first segmentation result. By extracting the shorter information extraction result from the first segmentation result, the extracted information can be increased, and the data structuring effect can be enhanced.

Because the plurality of undetermined words are extracted from the first segmentation result based on the preset word bank, and the plurality of undetermined words corresponding to the same first segmentation result may have an inclusion relationship, at least one of the undetermined words having the inclusion relationship can be removed to obtain undetermined words having no inclusion relationship, and then the undetermined words having no inclusion relationship are used as an information extraction result.

For ease of understanding, this is illustrated.

For example, the undetermined words included in the first segmentation result "knee joint swelling and pain" are "knee joint", "joint", "joint swelling and pain", "swelling and pain", and so on. Among them, "joint" is contained in "knee joint", and "joint" and "swelling and pain" are both contained in "joint swelling and pain". Removing the contained undetermined words leaves "knee joint" and "joint swelling and pain", which have no inclusion relationship with each other and can be used as the information extraction result of the first segmentation result "knee joint swelling and pain".
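A minimal Python sketch of this filtering step follows. The text only states that at least one word of each pair with an inclusion relationship is removed; the sketch assumes that the contained (shorter) word is the one dropped. The function name `drop_contained` and the sample list are hypothetical.

```python
def drop_contained(words):
    """Keep only the words that are not contained in another word of the list.

    Assumption (not stated in the text): of two words with an inclusion
    relationship, the contained one is removed.
    """
    return [
        w for w in words
        if not any(w != other and w in other for other in words)
    ]


candidates = ["knee joint", "joint", "joint swelling and pain", "swelling and pain"]
kept = drop_contained(candidates)
print(kept)
```

This pairwise check compares every word with every other word; the sorting-based procedure described later is intended to reduce that amount of comparison.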

The above specific implementation manner of the information extraction method provided in the embodiment of the present application is to perform word segmentation on a preset text according to a preset word bank to obtain a first word segmentation result, and then extract a plurality of undetermined words from the first word segmentation result based on the preset word bank, where the plurality of undetermined words do not include the first word segmentation result, and determine, from the plurality of undetermined words, an undetermined word without an inclusion relation as an information extraction result of the first word segmentation result.

By adopting the two-time word segmentation, the longer first word segmentation result can be extracted, and the shorter information extraction result which does not have the inclusion relation and is used for the first word segmentation result can be further extracted from the longer first word segmentation result, for example, words which represent information such as parts, diseases and the like are extracted from the complete words which represent the names of operations, so that on one hand, the extracted information amount is increased, and on the other hand, the data structuring effect is enhanced through the structural hierarchy arrangement of the first word segmentation result and the information extraction result, and the data searching and the positioning are facilitated.

To extract more information from the preset text and enhance the data structuring effect, the key is to determine, from the undetermined words included in the first segmentation result, those that have no inclusion relationship and use them as the information extraction result of the first segmentation result. This can be implemented in various ways; for example, the undetermined words can be matched against one another one by one to determine whether an inclusion relationship exists between them.

To determine more quickly which undetermined words have no inclusion relationship, the undetermined words can first be sorted. Compared with directly matching them one by one, this reduces the amount of computation and improves processing efficiency.

A specific implementation manner of determining an undetermined word without inclusion relation from a plurality of undetermined words as an information extraction result according to the embodiment of the present application is described below with reference to the accompanying drawings.

Fig. 2 is a flowchart illustrating a method for determining an undetermined word without inclusion relationship from a plurality of undetermined words as an information extraction result, please refer to fig. 2, where the method includes:

S201: Sort the plurality of undetermined words according to the position of the first character of each undetermined word in the first segmentation result.

The position of the first character of an undetermined word in the first segmentation result is the position at which that character occurs in the first segmentation result. When sorting by this position, the undetermined words whose first character occurs earliest in the first segmentation result are placed first, those whose first character occurs next are placed after them, and so on until all undetermined words of the first segmentation result are sorted, at which point the sorting is complete.

In the following, with reference to a specific example, a plurality of words to be determined are ordered according to the position of the first character of the word to be determined in the first segmentation result.

The first segmentation result "knee joint swelling and pain" includes 5 undetermined words, among them "knee joint", "joint", "joint swelling and pain", and "swelling and pain". The undetermined words whose first character is in the first position of the first segmentation result, that is, those beginning with "knee", such as "knee joint", are ranked first. Next, the undetermined words whose first character is in the second position of the first segmentation result, namely "joint" and "joint swelling and pain", are ranked after those beginning with "knee". No undetermined word has a first character in the third position of the first segmentation result, so the undetermined word whose first character is in the fourth position, namely "swelling and pain", which begins with "swelling", is ranked next. This completes the ranking of the undetermined words corresponding to the first segmentation result, and the ranking result is as follows:

knee joint, joint swelling and pain, and swelling and pain.

When sorting by first character only, the order among undetermined words with the same first character may be arbitrary. For example, either of the two undetermined words beginning with "knee" may be ranked first.

As an extension of the embodiment of the present application, the undetermined words may also be sorted according to the position of the last character of each undetermined word in the first segmentation result. For example, the undetermined words of the first segmentation result "knee joint swelling and pain" have three different last characters, "joint", "node", and "pain". According to the positions of these three characters in the first segmentation result, the undetermined words ending in "joint" may be placed first, those ending in "node" in the middle, and those ending in "pain" last. The sorting result may be:

knee joint, joint swelling and pain, and swelling and pain.

When sorting by last character only, the order among undetermined words with the same last character may be arbitrary. For example, "knee joint" may also be placed after "joint". In that case, sorting according to the position of the last character in the first segmentation result may give:

knee joint, knee joint, swelling and pain, joint swelling and pain.

In other possible implementations of the embodiment of the present application, the undetermined words may also be sorted according to the positions of both the first character and the last character of each undetermined word in the first segmentation result. The undetermined words can first be sorted according to the position of the first character in the first segmentation result to obtain a temporary sorting result, and the undetermined words with the same first character can then be sorted according to the last character to obtain the final sorting result "knee joint, joint swelling and pain, and swelling and pain".

The above are only some examples of sorting the pending words. In other possible implementations of this embodiment, the pending words may be sorted in other ways, or not sorted at all, which is not limited in this embodiment.

S202: If the current pending word has an adjacent next pending word, determine whether the current pending word and the adjacent next pending word have an inclusion relationship.

The current pending word is one of the plurality of pending words. In one possible manner, the first of the sorted pending words is used as the current pending word. Two pending words have an inclusion relationship when one of them contains the other. Specifically, in this step, if the current pending word has an adjacent next pending word, it may be determined whether the current pending word contains the adjacent next pending word, or whether the current pending word is contained by the adjacent next pending word.

To determine whether the current pending word contains the adjacent next pending word, the adjacent next pending word is matched against the current pending word character by character. If every character of the next pending word can be matched consecutively in the current pending word, the match succeeds and the current pending word contains the adjacent next pending word. Determining whether the current pending word is contained by the adjacent next pending word is similar: the current pending word is matched character by character against the adjacent next pending word, and the matching process is not repeated here.

For ease of understanding, an example is given below.

Suppose the current pending word is 膝关 and the adjacent next pending word is 膝关节 ("knee joint"). It is first determined whether the current pending word 膝关 contains the adjacent next pending word 膝关节. Specifically, the first character 膝 of the next pending word is matched in the current pending word. When at least one match is found, then for each match the character following it in the current pending word is compared with the second character 关 of the next pending word. In this example the second character also matches, so the third character 节 of the next pending word is compared next. This character cannot be matched in the current pending word, so the match fails, and the current pending word 膝关 does not contain the next pending word 膝关节.

It is then determined whether the current pending word 膝关 is contained by the adjacent next pending word 膝关节. Specifically, the first character 膝 of the current pending word is matched in the next pending word 膝关节. After it is matched, the second character 关 of the current pending word is matched on the basis of the first match, that is, it is determined whether the character following the matched 膝 in the next pending word is 关. The determination result is yes, so the match succeeds, and the current pending word is contained in the adjacent next pending word.
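The character-by-character matching described above can be sketched as follows. This is a minimal sketch rather than the implementation of the embodiment: the function names `contains` and `has_inclusion` are hypothetical, and the Chinese strings are an assumed reconstruction of the example. In Python the same check could also be written with the built-in substring test `inner in outer`; the explicit loop is shown only to mirror the described procedure.

```python
def contains(outer, inner):
    """Return True if all characters of `inner` match consecutively in `outer`."""
    if not inner or len(inner) > len(outer):
        return False
    # Try every position in `outer` where the first character could match.
    for start in range(len(outer) - len(inner) + 1):
        if all(outer[start + i] == ch for i, ch in enumerate(inner)):
            return True
    return False


def has_inclusion(current, nxt):
    """Two pending words have an inclusion relationship if either contains the other."""
    return contains(current, nxt) or contains(nxt, current)


print(contains("膝关", "膝关节"))   # False: the third character 节 cannot be matched
print(contains("膝关节", "膝关"))   # True: 膝 and 关 match consecutively
print(has_inclusion("膝关节", "关节肿痛"))  # False: the words only overlap
```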

S203: If not, take the current pending word and/or the adjacent next pending word as an information extraction result.

If the determination result is no, that is, the current pending word and the adjacent next pending word have no inclusion relationship, the current pending word may be used as an information extraction result. The adjacent next pending word may then be taken as the current pending word, and the step of determining whether the current pending word has an inclusion relationship with its adjacent next pending word is performed again, so as to decide, according to the inclusion relationship, whether that pending word is used as an information extraction result.

In some possible implementations of this embodiment, when the current pending word and the adjacent next pending word have no inclusion relationship, the next pending word may instead be used as an information extraction result, or both the current pending word and the adjacent next pending word may be used as information extraction results.

In some cases, the determination result is that the current pending word and the adjacent next pending word have an inclusion relationship. Because the pending words in the information extraction result must not have an inclusion relationship with one another, the contained word of the two is not used as an information extraction result, and it may be covered (overwritten) by the other of the two pending words. After the contained pending word is covered, the adjacent next pending word is taken as the current pending word, and the step of determining whether the current pending word has an inclusion relationship with its adjacent next pending word is performed again.

In one possible implementation, if the current pending word contains the adjacent next pending word, the adjacent next pending word is covered by the current pending word, the adjacent next pending word is then taken as the current pending word, and the step of determining whether the current pending word has an inclusion relationship with its adjacent next pending word is performed; if the current pending word is contained by the adjacent next pending word, the adjacent next pending word is taken as the current pending word, and the same determining step is performed.

It should be noted that, when the current pending word is the penultimate pending word in the sequence and the adjacent next pending word is the last pending word, the last pending word may be used as an information extraction result after the above steps are performed. When the adjacent next pending word is the last pending word and the current pending word contains it, the last pending word is covered by the current pending word, that is, by the penultimate pending word, and the last pending word after covering is used as an information extraction result. When the last pending word has no inclusion relationship with the current pending word, the last pending word itself can be used as an information extraction result.

For ease of understanding, the present step is described below by way of example.

For the pending word sequence "膝关, 膝关节, 关节, 关节肿痛, 肿痛", 膝关 is first taken as the current pending word and 膝关节 ("knee joint") as the adjacent next pending word. Because 膝关节 contains 膝关, 膝关节 covers 膝关, and the sequence becomes "膝关节, 膝关节, 关节, 关节肿痛, 肿痛". The second 膝关节 is then taken as the current pending word and 关节 ("joint") as the adjacent next pending word. The current pending word 膝关节 contains the next pending word 关节, so 膝关节 covers 关节, and the sequence becomes "膝关节, 膝关节, 膝关节, 关节肿痛, 肿痛". Next, the third 膝关节 is taken as the current pending word and 关节肿痛 ("joint swelling and pain") as the adjacent next pending word. The two have no inclusion relationship, so 膝关节 is taken as an information extraction result. Then 关节肿痛 is taken as the current pending word and 肿痛 ("swelling and pain") as the adjacent next pending word. 关节肿痛 contains 肿痛, so 关节肿痛 covers 肿痛, and the sequence becomes "膝关节, 膝关节, 膝关节, 关节肿痛, 关节肿痛". Because this 关节肿痛 is the last pending word, it is used as an information extraction result. That is, 膝关节 ("knee joint") and 关节肿痛 ("joint swelling and pain") are determined as the information extraction results.
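The walk over the sorted sequence in steps S202 and S203 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `extract_results` is hypothetical, the Chinese strings are an assumed reconstruction of the example, "covering" is modelled as overwriting a list element, and Python's substring operator `in` stands in for the character-by-character containment check.

```python
def extract_results(sorted_words):
    """Return the pending words that remain after adjacent inclusion checks."""
    words = list(sorted_words)      # work on a copy; elements get overwritten
    results = []
    for i in range(len(words) - 1):
        current, nxt = words[i], words[i + 1]
        if nxt in current:
            # Current contains next: current covers next.
            words[i + 1] = current
        elif current in nxt:
            # Current is contained by next: next covers current.
            words[i] = nxt
        else:
            # No inclusion relationship: current is an extraction result.
            results.append(current)
    if words:
        # The last pending word (after any covering) is always a result.
        results.append(words[-1])
    return results


sequence = ["膝关", "膝关节", "关节", "关节肿痛", "肿痛"]
print(extract_results(sequence))
# ['膝关节', '关节肿痛']
```

Because only adjacent words are compared, the loop makes one comparison per pending word, which matches the stated motivation for sorting the pending words first.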

The above is a specific implementation, provided by this embodiment, of determining pending words without an inclusion relationship from the plurality of pending words as the information extraction result. The pending words are sorted according to their first character and/or last character, the current pending word is compared with the adjacent next pending word, and if the two have no inclusion relationship, the current pending word and/or the adjacent next pending word is used as an information extraction result. Because the pending words are sorted, the number of comparisons can be reduced, the pending words without an inclusion relationship can be determined faster, and the information extraction result is obtained more efficiently.

In the above example, the information extraction result includes two pending words, "knee joint" and "joint swelling and pain", both of which contain the cross word "joint". The cross word in these pending words can be processed to obtain a simplified information extraction result.

Based on this, an embodiment of the present application provides an information extraction method that, when an information extraction result includes at least two pending words, further processes the pending words in the information extraction result so that the result is simplified.

Fig. 3 is a flowchart illustrating an information extraction method according to an embodiment of the present application. Referring to Fig. 3, if the information extraction result includes a first pending word and a second pending word, the method further includes:

S301: Determine, based on the preset word bank, whether the first pending word and the second pending word include a cross word.

The cross word is a part of the first pending word and also a part of the second pending word. When the first pending word includes the cross word, it may also include a first pending subword adjacent to the cross word. It should be noted that the first pending word may include at least one first pending subword. For ease of understanding, the embodiments of the present application are described using the case of one first pending subword as an example.

The first pending word and the second pending word can each be regarded as a character string (here, a string of Chinese characters), and whether they include a cross word can be determined by character matching in combination with the preset word bank. In one possible implementation, character matching is first performed on the first pending word and the second pending word to obtain their overlapping portion; the overlapping portion is then looked up in the preset word bank to obtain the words it may form, and the cross word is determined from those words. In other possible implementations of this embodiment, the first pending word and the second pending word are each looked up in the preset word bank to obtain all words that can be formed from the first pending word and all words that can be formed from the second pending word. The first pending word and the second pending word themselves are removed from these two sets, and the remaining words of the first set are matched against the remaining words of the second set. A word that matches successfully may be regarded as a cross word of the first pending word and the second pending word.

The process of determining cross-words is described below with reference to specific examples.

In this example, the information extraction result includes the first pending word 膝关节 ("knee joint") and the second pending word 关节肿痛 ("joint swelling and pain"). Looking up the first pending word in the preset word bank gives three possible words, 膝关, 膝关节 and 关节; looking up the second pending word gives three possible words, 关节, 关节肿痛 and 肿痛. After the first pending word and the second pending word themselves are removed, the possible words of the first pending word are 膝关 and 关节, and those of the second pending word are 关节 and 肿痛. Matching the two sets determines that the cross word is 关节 ("joint").
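The second implementation of S301 (enumerate the lexicon words inside each pending word, drop the pending word itself, and intersect the two sets) can be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, the preset word bank is modelled as a plain Python set, and the lexicon contents are an assumed reconstruction of the example.

```python
def sub_words(word, lexicon):
    """All lexicon entries found inside `word`, excluding `word` itself."""
    found = set()
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece != word and piece in lexicon:
                found.add(piece)
    return found


def cross_words(first, second, lexicon):
    """Words that can be formed from both pending words."""
    return sub_words(first, lexicon) & sub_words(second, lexicon)


# Assumed contents of the preset word bank for this example.
lexicon = {"膝关", "膝关节", "关节", "关节肿痛", "肿痛"}

print(sorted(sub_words("膝关节", lexicon)))    # ['关节', '膝关']
print(sorted(sub_words("关节肿痛", lexicon)))  # ['关节', '肿痛']
print(cross_words("膝关节", "关节肿痛", lexicon))  # {'关节'}
```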

S302: If so, determine whether the attribute of the cross word is the same as the attribute of the first pending subword.

An attribute may be understood as the type of a word. The attribute of the cross word and the attribute of the first pending subword can be obtained from the labeling result based on the preset word bank. For example, if a word is labeled using a symptom dictionary in the preset word bank, its attribute is symptom; if it is labeled using an organ part dictionary in the preset word bank, its attribute is organ part.

If the first pending word and the second pending word include a cross word, it can be determined whether the attribute of the cross word is the same as the attribute of the first pending subword. Owing to the writing characteristics of the preset text, such as an electronic medical record, when several words with the same attribute are juxtaposed, for example a first pending subword and a cross word that both denote organ parts, they can be combined into a coarser-grained word denoting an organ part. Therefore, when there is a cross word, it may be determined whether the attributes of the first pending subword and the cross word are the same, so that the first pending word and the second pending word can be processed accordingly.

It should be noted that, when the first pending word and the second pending word include the cross word and the first pending word includes a plurality of first pending subwords in addition to the cross word, it may be determined whether the attribute of the first pending subword adjacent to the cross word is the same as the attribute of the cross word. For the specific determination process, reference may be made to the case where the first pending word includes one first pending subword, which is not described here again.

S303: If they are the same, remove the cross word from the second pending word to obtain a modified second pending word.

In this step, if the attribute of the first pending subword is the same as the attribute of the cross word, the first pending subword in the first pending word can stand side by side with the cross word. To avoid repetition, the cross word is removed from the second pending word, giving the modified second pending word.

For example, the first pending word "knee joint" and the second pending word "joint swelling and pain" share the cross word "joint", and the first pending word also contains the first pending subword "knee". Both "knee" and "joint" denote body parts, that is, they have the same attribute, so they can be merged as "knee joint". To avoid repeating the meaning, the cross word "joint" is removed from the second pending word "joint swelling and pain", giving the modified second pending word "swelling and pain".

S304: If not, remove the cross word from the first pending word to obtain a modified first pending word.

In this step, if the attribute of the first pending subword is different from the attribute of the cross word, the first pending subword in the first pending word is considered unsuitable to stand side by side with the cross word, and the cross word is removed from the first pending word to obtain the modified first pending word. So that the information carried by the cross word is not lost, the cross word in the second pending word is retained.
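Steps S302 to S304 can be sketched as follows. This is a minimal sketch under assumptions: the function name `simplify_pair` and the attribute table are hypothetical, the first pending word is assumed to consist of exactly one first pending subword plus the cross word, and the Chinese strings are an assumed reconstruction of the example (膝 "knee", 关节 "joint", 肿痛 "swelling and pain").

```python
def simplify_pair(first, second, cross, attributes):
    """Remove the cross word from one of the two pending words.

    Returns the (possibly modified) first and second pending words.
    """
    # The first pending subword is what remains of `first` without the cross word.
    subword = first.replace(cross, "", 1)
    same = (
        subword in attributes
        and cross in attributes
        and attributes[subword] == attributes[cross]
    )
    if same:
        # S303: same attribute, so keep `first` whole and trim `second`.
        return first, second.replace(cross, "", 1)
    # S304: different attribute, so trim `first` and keep `second` whole.
    return subword, second


# Hypothetical attribute labels derived from the preset word bank.
attributes = {"膝": "organ", "关节": "organ", "肿痛": "symptom"}

print(simplify_pair("膝关节", "关节肿痛", "关节", attributes))
# ('膝关节', '肿痛')
```

In either branch the cross word survives in exactly one of the two pending words, so no information carried by it is lost.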

In order to facilitate information search and information positioning, the mapping relation among the first word segmentation result, the first pending word and the second pending word can be stored. It should be noted that, when the first pending word and/or the second pending word is modified, the mapping relationship between the first segmentation result and the modified first pending word and/or the modified second pending word is stored. In a possible implementation manner, if the attribute of the cross word is the same as that of the first undetermined subword, storing a mapping relation among the first word segmentation result, the first undetermined word and the modified second undetermined word; and if the attribute of the cross word is different from the attribute of the first undetermined subword, storing the mapping relation among the first word segmentation result, the modified first undetermined word and the second undetermined word.

In addition to the first segmentation result, other segmentation results that are not in the preset word bank may also be stored. To distinguish them, the first segmentation result and the other segmentation results may be stored in different formats. In one possible implementation, the first segmentation result and its corresponding information extraction result are stored in the format {"signal": "", "word": "", "value": []}, and other segmentation results not in the preset word bank are stored in the format ["", "O"]. Here, word represents the first segmentation result, signal represents the attribute of the first segmentation result, and value represents the information extraction result corresponding to the first segmentation result. When the longer first segmentation result is needed, it can be extracted from word; when the shorter information extraction result is needed, it can be extracted from value. This structured storage allows segmentation results to be located and extracted accurately.

For the convenience of understanding, the storage process of the first segmentation result and other segmentation results not in the preset lexicon is described below with reference to specific examples.

For example, consider the sentence "Knee joint swelling and pain for 6 months, aggravation for 3 weeks." in an electronic medical record. Word segmentation yields the first segmentation results "knee joint swelling and pain", "6 months" and "3 weeks", and the other segmentation results that are not in the preset word bank: ",", "aggravation" and ".".

According to the storage format provided by the implementation manner, the word segmentation result can be stored as follows:

[{"signal": "symptom", "word": "knee joint swelling and pain", "value": [["knee joint", "organ"], ["swelling and pain", "symptom"]]}, {"signal": "time", "word": "6 months", "value": []}, [",", "O"], ["aggravation", "O"], {"signal": "time", "word": "3 weeks", "value": []}, [".", "O"]]

In this example, when the user wants to extract "knee joint" but not "knee joint swelling and pain", "knee joint" can be retrieved from value alone, so that only "knee joint" is extracted. When the user wants to extract "knee joint swelling and pain", it can be retrieved from word.
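The storage format and the two retrieval paths can be sketched as follows. This is only an illustrative sketch: the helper names `long_words` and `short_words` are hypothetical, and the records use English renderings of the example tokens. First segmentation results are stored as dictionaries with the keys signal, word and value, and other segmentation results as two-element lists tagged "O".

```python
import json

records = [
    {"signal": "symptom", "word": "knee joint swelling and pain",
     "value": [["knee joint", "organ"], ["swelling and pain", "symptom"]]},
    {"signal": "time", "word": "6 months", "value": []},
    [",", "O"],
    ["aggravation", "O"],
    {"signal": "time", "word": "3 weeks", "value": []},
    [".", "O"],
]


def long_words(records, signal=None):
    """Longer first segmentation results, optionally filtered by attribute."""
    return [r["word"] for r in records
            if isinstance(r, dict) and (signal is None or r["signal"] == signal)]


def short_words(records, attribute=None):
    """Shorter information extraction results, optionally filtered by attribute."""
    return [word for r in records if isinstance(r, dict)
            for word, attr in r["value"]
            if attribute is None or attr == attribute]


print(long_words(records, "symptom"))   # ['knee joint swelling and pain']
print(short_words(records, "organ"))    # ['knee joint']
print(json.dumps(records[1]))           # the structure serializes directly as JSON
```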

In the information extraction method provided in this embodiment, when the information extraction result of the first segmentation result includes a first pending word and a second pending word, it may be determined, based on the preset word bank, whether the first pending word and the second pending word have a cross word. When they have a cross word, it is determined whether the attribute of the first pending subword adjacent to the cross word in the first pending word is the same as the attribute of the cross word; if so, the cross word in the second pending word is removed, and if not, the cross word in the first pending word is removed. By removing the cross word from one of the first pending word and the second pending word according to the attribute of the cross word and the attribute of the first pending subword, a simplified information extraction result is obtained.

Based on the specific implementation manner of the information extraction method provided by the foregoing embodiment, an embodiment of the present application further provides an information extraction device, and the information extraction device provided by the embodiment of the present application is introduced from the perspective of a functional module with reference to the accompanying drawings.

Fig. 4 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present application, please refer to fig. 4, where the apparatus includes:

a word segmentation unit 401, configured to perform word segmentation on a preset text according to a preset lexicon to obtain a first word segmentation result;

an extracting unit 402, configured to extract multiple undetermined words from the first segmentation result based on the preset lexicon, where the multiple undetermined words do not include the first segmentation result;

a determining unit 403, configured to determine, from the multiple pending words, pending words that have no inclusion relationship, as the information extraction result of the first segmentation result.

Optionally, the determining unit 403 includes:

Optionally, the determining unit 403 further includes an executing subunit, configured to:

Optionally, the apparatus further comprises:

The above is a specific implementation manner of the information extraction device provided in this embodiment of the present application, where words are segmented according to a preset word bank to obtain a first segmentation result, and then a plurality of undetermined words included in the first segmentation result are extracted based on the preset word bank, where the plurality of undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the plurality of undetermined words as an information extraction result of the first segmentation result.

When introducing elements of various embodiments of the present application, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.

It should be noted that, as one of ordinary skill in the art would understand, all or part of the processes of the above method embodiments may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when executed, the computer program may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims

1. An information extraction method, the method comprising:

determining, from the plurality of pending words, pending words that have no inclusion relationship as the information extraction result of the first segmentation result;

wherein, if the information extraction result comprises a first pending word and a second pending word, the method further comprises:

if yes, judging whether the attribute of the cross word is the same as the attribute of the first undetermined sub-word, if so, removing the cross word from the second undetermined word to obtain a modified second undetermined word; if not, removing the cross word from the first to-be-determined word to obtain a modified first to-be-determined word, wherein the first to-be-determined subword is a word adjacent to the cross word in the first to-be-determined word;

wherein segmenting the preset text according to the preset word bank to obtain the first segmentation result comprises: segmenting the preset text according to the preset word bank to obtain segmentation results representing symptoms, time, and other words, wherein the segmentation results representing other words are segmentation results that are not in the preset word bank.

2. The method of claim 1, wherein the determining the pending word without the inclusion relation from the plurality of pending words as the information extraction result of the first segmentation result comprises:

3. The method of claim 2, wherein if the currently pending word has a containment relationship with the next adjacent pending word, the method further comprises:

4. The method of claim 1, wherein if the attributes of the cross word and the first pending subword are the same, the method further comprises:

5. An information extraction apparatus, characterized in that the apparatus comprises:

a determining unit, configured to determine, from the multiple pending words, pending words that have no inclusion relationship as the information extraction result of the first segmentation result;

wherein the apparatus further comprises:

a modifying unit, configured to: if the attributes are the same, remove the cross word from the second pending word to obtain a modified second pending word; and if not, remove the cross word from the first pending word to obtain a modified first pending word, wherein the first pending subword is a word adjacent to the cross word in the first pending word;

6. The apparatus of claim 5, wherein the determining unit comprises:

7. The apparatus of claim 6, wherein the determining unit further comprises an executing subunit configured to:

8. The apparatus according to claim 5, further comprising a storage unit, in particular for: