CN112765318A

CN112765318A - Natural language processing method and system for infertility clinical phenotype information

Info

Publication number: CN112765318A
Application number: CN202110072754.8A
Authority: CN
Inventors: 张晶; 罗俊峰
Original assignee: Carrier Gene Technology Suzhou Co ltd
Current assignee: Carrier Gene Technology Suzhou Co ltd
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2021-05-07

Abstract

The invention provides a natural language processing method and system for infertility clinical phenotype information. An original Chinese clinical phenotype character string is converted into Chinese and English clinical phenotype initial character strings, independent character strings and split character strings by natural language preprocessing, punctuation splitting and field splitting. Based on pre-established Chinese and English ontology dictionaries, exact matching and fuzzy matching are performed on the initial, independent and split character strings, and one or more ontologies matched in the Chinese and English ontology dictionaries are finally output according to a weighting rule. The fuzzy matching is computed from the degree of semantic similarity. The invention also provides a natural language processing system and a medium, comprising a reading module, a converting module, a splitting module, a matching module and an output module. The invention solves the problem of quickly matching Chinese clinical phenotype information with an ontology dictionary and facilitates whole-exome sequencing analysis for diseases such as infertility.

Description

Natural language processing method and system for infertility clinical phenotype information

Technical Field

The invention belongs to the field of computer processing of clinical phenotype information, and particularly relates to a natural language processing method and system for clinical phenotype information of infertility.

Background

China currently has more than 40 million infertility patients, making infertility the third most prevalent disease after tumors and cardiovascular disease. With sharply rising social pressure and worsening air and food pollution, the incidence of infertility rose from 3.5% twenty years ago to 12.5% in 2016, and exceeds 15% in some regions, which means that roughly one in every eight couples is affected. Research shows that, in addition to physical, chemical, microbial and other environmental factors, an individual's genetic factors also have an important and profound effect on the occurrence of infertility.

With the wide clinical application of high-throughput sequencing and the refinement of guidelines and databases for interpreting genetic variation, the clinical significance for infertility of small-scale variants, namely single-nucleotide variants (SNVs), short insertions and deletions (InDels) and copy number variations (CNVs), as well as sex chromosome number abnormalities and Y-chromosome microdeletions, is gradually being discovered and recognized. Genetic testing can establish a clear etiological diagnosis, so that a more effective treatment targeted at the cause can be offered in the clinic and trial-and-error or excessive treatment is avoided. In addition, once a pathogenic variant is identified, its transmission can be blocked through preimplantation genetic diagnosis for monogenic disease, so that the offspring do not inherit infertility or other health problems.

At present, clinical methods for diagnosing the cause of infertility include single-gene testing, gene-panel testing and whole-exome sequencing. Single-gene and gene-panel tests generally target infertility caused by variants in a known gene or a known class of genes. Whole-exome sequencing is suitable not only for detecting infertility caused by variants in known genes or gene panels but also, in combination with the clinically diagnosed phenotype, for screening previously unknown candidate pathogenic variant sites related to infertility, thereby providing stronger evidence for clinical research.

Whole-exome sequencing outputs a massive amount of site information, and manually screening the hundreds of candidate sites related to the clinical phenotype information is impractical. Quickly matching infertility clinical phenotype information to the Human Phenotype Ontology (HPO), for use with tools such as Exomiser, Phenomizer and Phenolyzer, helps to screen potential pathogenic loci associated with infertility more efficiently. However, the infertility clinical phenotype information that healthcare practitioners enter into medical information platforms is mostly written in non-standardized language: formats are complex and varied, multiple languages are often mixed, grammar is irregular, abbreviations or colloquial names replace standard terms, erroneous information is entered, symbols are mixed into the text, and there is no unified standard. This makes it difficult to match infertility clinical phenotype information to a phenotype ontology quickly.

To solve the above technical problem, non-standardized clinical phenotypes must undergo natural language processing, be converted into standardized clinical phenotypes that a computer can recognize, and be matched with an ontology dictionary. Automated natural language processing of clinical phenotypes and ontology dictionary matching will facilitate whole-exome sequencing analysis for infertility.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a natural language processing method and system for infertility clinical phenotype information that solve the problem of quickly matching Chinese clinical phenotype information with an ontology dictionary and facilitate whole-exome sequencing analysis for diseases such as infertility.

The technical solution for realizing the purpose of the invention is as follows:

A natural language processing method for infertility clinical phenotype information comprises: reading a Chinese clinical phenotype character string or a document related to Chinese clinical phenotypes; converting the original Chinese clinical phenotype character string into Chinese and English clinical phenotype initial character strings, independent character strings and split character strings by natural language preprocessing, punctuation splitting and field splitting; and performing exact matching and fuzzy matching on the initial, independent and split character strings based on pre-established Chinese and English ontology dictionaries, and finally outputting, according to a weighting rule, one or more ontologies matched in the Chinese and English ontology dictionaries. The fuzzy matching is computed from the degree of semantic similarity. The invention also provides a corresponding natural language processing system and a medium, comprising a reading module, a converting module, a splitting module, a matching module and an output module.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. The method calculates semantic similarity through a dedicated function, which is innovative, so that a computer can quickly identify standard-conforming clinical phenotype terms and match them with the ontology dictionary;

2. The method overcomes the limitations that a conventional Chinese database, through untimely updates or incomplete content, imposes on string-to-ontology matching, and improves retrieval accuracy and efficiency by ranking the results of Chinese and English exact matching and fuzzy matching with a weighting method;

3. The invention solves the problems that manual matching is time-consuming, that matching results vary from person to person, and that manual matching cannot be plugged into an automated whole-exome analysis pipeline, and thus greatly improves the efficiency of whole-exome sequencing analysis;

4. The invention is not limited to matching infertility clinical phenotype data with the Human Phenotype Ontology and can also be applied to other fields, such as matching clinical phenotypes and ontologies for cardiovascular diseases.

Drawings

FIG. 1 shows the Chinese segmentation and matching method for infertility clinical phenotype information according to Example 1 of the present invention.

FIG. 2 shows the English segmentation and matching method for infertility clinical phenotype information according to Example 2 of the present invention.

FIG. 3 is a flowchart of the natural language processing method for infertility clinical phenotype according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

Example 1

As shown in FIG. 1, the present invention provides a Chinese segmentation and matching method for infertility clinical phenotype information.

Step 101, performing natural language preprocessing on the Chinese clinical phenotype character string to obtain a preprocessed Chinese clinical phenotype initial character string.

Most infertility clinical phenotype information entered by medical practitioners is written in non-standardized language, including complex formats (e.g., "OR 35 MII, 18 MII"), mixed languages (e.g., "all external detection + Microdeletion/Microduplication", "day 3-7C 2 transplantation non-pregnancy"), non-standard grammar (e.g., "prostate ca"), abbreviations or colloquial names in place of standard terms (e.g., "RSA", "IVF", "PCOS"), erroneous input (e.g., "CVAVD", "CUAVD"), and text mixed with symbols (e.g., "no essence", "FSH 55 ═", "<1mL"). This increases the difficulty of matching the original Chinese clinical phenotype character string with the ontology dictionary. To improve the accuracy of exact and fuzzy recognition, the inventors first perform natural language preprocessing on the original Chinese clinical phenotype character string to generate a Chinese clinical phenotype initial character string that a computer can recognize.

The natural language preprocessing that turns the original Chinese clinical phenotype character string into the preprocessed Chinese clinical phenotype initial character string can be implemented as follows: uniformly converting the encoding of the original Chinese clinical phenotype character string to UTF-8; converting all half-width symbols into full-width symbols; converting Arabic numerals into Chinese numerals; removing meaningless character strings, such as "focus on", "none", "not examined", "normal", "past medical history", "specifically", "requires examination", "see attachment for details", and the like; replacing irregular clinical descriptions with standard Chinese descriptions, for example replacing "<" with "less than", "%" with "percent", arrow symbols such as "↓" with the corresponding word, "Φ" with "diameter", "cm" with "centimeter", and "mL" or "ml" with "milliliter"; expanding abbreviations, including abbreviations mistyped during manual entry, into their full Chinese names, for example: "CUAVD", "CVAVD", "CBAVD" and "CBVAD" are replaced with "congenital unilateral absence of the vas deferens", "AsAbt" with "antisperm antibody", "ICSI" with "intracytoplasmic sperm injection", "SCOS" with "Sertoli cell-only syndrome", "MMAF" with "multiple morphological abnormalities of the sperm flagella", "PCOS" with "polycystic ovary syndrome", "IVF" with "in vitro fertilization", "RSA" with "recurrent spontaneous abortion", "ca" with "cancer", and so on; replacing English names with Chinese names, for example replacing "Microduplication" and "Microdeletion" with their Chinese equivalents; and converting uppercase letters to lowercase. The result is the preprocessed Chinese clinical phenotype initial character string.

The meaningless character strings are provided by a pre-established dictionary of Chinese non-clinical terms (Table 1); the irregular clinical descriptions, the abbreviations (including mistyped ones) and the corresponding standardized Chinese descriptions are provided by a pre-established dictionary of standard Chinese clinical phenotype terms (Table 2).

TABLE 1

Meaningless character string
Self-reported diagnosis from another hospital
Focus on
Not examined
Past medical history
Other examination results
Requiring full external scanning

TABLE 2

Abbreviation/irregular clinical description	Standardized Chinese description
CPPS	Chronic pelvic pain syndrome
CUAVD, CVAVD, CBAVD, CBVAD	Congenital unilateral absence of the vas deferens
IHH	Idiopathic hypogonadotropic hypogonadism
FSH	Follicle-stimulating hormone
ICSI	Intracytoplasmic sperm injection
FET	Frozen embryo transfer
AID	Artificial insemination by donor
Ca, ca, can	Cancer
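The dictionary-driven preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `MEANINGLESS`, `SYMBOLS`, `ABBREVIATIONS` and `preprocess` are hypothetical, the entries are small English-gloss excerpts standing in for the Chinese dictionaries of Tables 1 and 2, and the width and numeral conversions are omitted.

```python
import re

# Hypothetical excerpts; English glosses stand in for the Chinese dictionary entries.
MEANINGLESS = ["past medical history", "not examined", "focus on"]
SYMBOLS = {"<": " less than ", "%": " percent "}
ABBREVIATIONS = {
    "CUAVD": "congenital unilateral absence of the vas deferens",
    "CVAVD": "congenital unilateral absence of the vas deferens",
    "ICSI": "intracytoplasmic sperm injection",
    "PCOS": "polycystic ovary syndrome",
    "IVF": "in vitro fertilization",
}


def preprocess(raw: str) -> str:
    """Turn an original phenotype string into a normalized initial string."""
    text = raw
    # 1. Remove strings that carry no clinical meaning.
    for junk in MEANINGLESS:
        text = re.sub(re.escape(junk), " ", text, flags=re.IGNORECASE)
    # 2. Spell out symbols.
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    # 3. Expand abbreviations (including known misspellings). The lookarounds
    #    stop a match inside a longer Latin word but still allow an
    #    abbreviation that directly touches Chinese characters.
    for abbr, full in ABBREVIATIONS.items():
        pattern = rf"(?<![A-Za-z]){re.escape(abbr)}(?![A-Za-z])"
        text = re.sub(pattern, full, text, flags=re.IGNORECASE)
    # 4. Lowercase and collapse the whitespace introduced above.
    return re.sub(r"\s+", " ", text.lower()).strip()


print(preprocess("PCOS; IVF"))
```

Keeping the replacement rules in plain dictionaries mirrors the document's approach of maintaining the term lists separately from the matching logic, so new abbreviations or misspellings can be added without touching the code.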

Step 102, judging whether the preprocessed Chinese clinical phenotype initial character string contains punctuation marks. If it contains no punctuation marks, it is not split and is output directly as a Chinese independent character string.

Step 103, if punctuation marks are present, splitting the preprocessed Chinese clinical phenotype initial character string at the punctuation marks to obtain Chinese independent character strings.

Punctuation marks are often mixed into the Chinese clinical phenotype initial character string, and a long sentence often contains one or more infertility clinical phenotypes, so the string must be split at punctuation marks to obtain character strings with independent semantics. The punctuation marks here include, but are not limited to, periods, question marks, exclamation marks, commas, enumeration commas, semicolons, quotation marks, parentheses, dashes, ellipses, repetition marks, book title marks, interpuncts, hyphens, proper name marks, number marks, annotation marks, omission marks, blank marks, slashes, identification marks, substitution marks, dot leaders and/or arrows. For example, the string "congenital unilateral absence of the vas deferens, obstructive azoospermia: male infertility." yields three Chinese independent character strings after punctuation splitting: "congenital unilateral absence of the vas deferens", "obstructive azoospermia" and "male infertility".

A Chinese clinical phenotype initial character string that contains no punctuation marks is not split and is output directly as a Chinese independent character string. For example, "no sperm" contains no punctuation marks, so it is not split and enters the next step directly as a Chinese independent character string.
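Steps 102 and 103 can be illustrated with a short sketch. The delimiter set below is a hypothetical subset of the punctuation marks listed above, `split_on_punctuation` is an illustrative name, and the Chinese sample strings are assumed back-translations of the examples in the text.

```python
import re

# Hypothetical subset of delimiters (full-width and half-width forms).
# \u2026 is the ellipsis and \u2014 the dash; "." is left out so that
# decimal values such as "1.5" are not broken apart.
PUNCTUATION = r"[。？！，、；：“”‘’（）《》\u2026\u2014,;:?!()/]+"


def split_on_punctuation(initial: str) -> list[str]:
    """Split an initial string into independent strings at punctuation marks.

    A string without punctuation comes back unchanged as a single item,
    which corresponds to step 102.
    """
    return [part for part in re.split(PUNCTUATION, initial) if part]


print(split_on_punctuation("先天性单侧输精管缺如，梗阻性无精症：男性不育。"))
```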

Step 104, judging whether the Chinese independent character string exactly matches the Chinese ontology dictionary.

Step 105, if the Chinese independent character string exactly matches the Chinese ontology dictionary, outputting the exactly matched Chinese independent character string.

The Chinese ontology dictionary includes, but is not limited to, the Human Phenotype Ontology (HPO) dictionary. The HPO dictionary comprises several sub-ontologies: phenotypic abnormality, e.g., abnormality of the skeletal system, abnormality of blood or blood-forming tissues, etc.; mode of inheritance, e.g., autosomal dominant inheritance, autosomal recessive inheritance, etc.; clinical modifier, e.g., rate of progression, triggering factor, location or severity, etc.; and clinical course and frequency, e.g., frequent, occasional, etc.

Each Chinese ontology entry comprises an ontology name, an ontology descriptor, ontology synonyms or alternative names, and an ontology abbreviation. The language of the Chinese ontology dictionary is Chinese.
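One way to support the exact matching of steps 104 and 105 is a lookup table keyed by every label of an ontology entry. The sketch below assumes a hypothetical `OntologyEntry` record and made-up identifiers such as `DEMO:0001`; it indexes names, synonyms and abbreviations and leaves the free-text descriptor out, which is a simplification rather than something the text specifies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyEntry:
    term_id: str                       # placeholder identifier, not a real HPO ID
    name: str
    synonyms: list[str] = field(default_factory=list)
    abbreviations: list[str] = field(default_factory=list)


def build_exact_index(entries: list[OntologyEntry]) -> dict[str, OntologyEntry]:
    """Map every normalized label (name, synonym, abbreviation) to its entry."""
    index: dict[str, OntologyEntry] = {}
    for entry in entries:
        for label in [entry.name, *entry.synonyms, *entry.abbreviations]:
            index.setdefault(label.strip().lower(), entry)
    return index


def exact_match(string: str, index: dict[str, OntologyEntry]) -> Optional[OntologyEntry]:
    """Return the entry whose label equals the string, or None."""
    return index.get(string.strip().lower())


entries = [
    OntologyEntry("DEMO:0001", "obstructive azoospermia", synonyms=["OA"]),
    OntologyEntry("DEMO:0002", "absent vas deferens"),
]
index = build_exact_index(entries)
print(exact_match("Obstructive azoospermia", index))
```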

Step 106, splitting the Chinese independent character string to obtain Chinese split character strings.

Step 107, judging whether a Chinese split character string exactly matches the Chinese ontology dictionary.

Step 108, if a Chinese split character string exactly matches the Chinese ontology dictionary, outputting the exactly matched Chinese split character string.

The Chinese independent character string can be split with a word segmentation tool. Common Chinese word segmentation tools include jieba, SnowNLP, THULAC, NLPIR and the like. The splitting methods include maximum matching and search-engine-style segmentation based on an inverted index.

The maximum-matching splitting of the Chinese independent character string comprises: loading Chinese ontology training data, which comprise the ontology names, ontology descriptors, ontology synonyms or alternative names, and ontology abbreviations of the Chinese ontology dictionary; and splitting the Chinese independent character string such that any ontology name, descriptor, synonym or alternative name, or abbreviation that appears in the Chinese ontology training data is split off as a single whole unit.

For example, when "congenital unilateral absence of the vas deferens" is split, "absence of the vas deferens" exists in the Chinese ontology dictionary and is therefore split off as a whole, so the final result is three Chinese split character strings: "congenital", "unilateral" and "absence of the vas deferens". Checking for exact matches against the Chinese ontology dictionary then outputs the matched string "absence of the vas deferens".

For example, when "obstructive azoospermia" is split, it exists in the Chinese ontology dictionary as a whole, so the final result is the single Chinese split character string "obstructive azoospermia". Checking for exact matches against the Chinese ontology dictionary then outputs the matched string "obstructive azoospermia".

For example, when "all sperm abnormal" is split, the final result is three Chinese split character strings, "all", "sperm" and "abnormal", none of which exactly matches the Chinese ontology dictionary.
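The maximum matching idea can be sketched as a plain forward maximum matching loop. This is a textbook variant, assumed here for illustration; a tool such as jieba uses its own dictionary and algorithm. The vocabulary mixes assumed ontology labels with assumed general words, and the Chinese strings are back-translations of the examples above.

```python
def forward_maximum_matching(text: str, vocabulary: set[str]) -> list[str]:
    """Greedily take the longest vocabulary entry starting at each position.

    Characters that start no vocabulary entry are emitted one at a time.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Assumed ontology labels plus assumed general-purpose words.
ontology_labels = {"输精管缺如", "梗阻性无精症"}
general_words = {"先天性", "单侧", "全部", "精子", "异常"}
vocabulary = ontology_labels | general_words

for phrase in ["先天性单侧输精管缺如", "梗阻性无精症", "全部精子异常"]:
    pieces = forward_maximum_matching(phrase, vocabulary)
    print(pieces, [p for p in pieces if p in ontology_labels])
```

Because ontology labels are part of the vocabulary, a label such as 输精管缺如 stays in one piece and can be found by the exact match that follows, while a phrase with no ontology label yields no exact hit and falls through to fuzzy matching.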

Optionally, the Chinese independent character string may also be split by search-engine-style segmentation based on an inverted index.

For example, splitting "congenital unilateral absence of the vas deferens" yields eight Chinese split character strings, among them "congenital", "unilateral", "insemination", "absent" and "absence of the vas deferens". An exact search finds that "absence of the vas deferens" exactly matches the Chinese ontology dictionary, and "absence of the vas deferens" is output.

For example, splitting "obstructive azoospermia" yields four Chinese split character strings, among them "obstruction", "azoospermia" and "non-obstructive azoospermia". An exact search finds that "azoospermia" and "non-obstructive azoospermia" exactly match the Chinese ontology dictionary, and both are output.

For example, splitting "all sperm abnormal" yields three Chinese split character strings, "all", "sperm" and "abnormal", none of which exactly matches the Chinese ontology dictionary.
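Search-engine-style segmentation differs from maximum matching in that it also emits the shorter words contained in a longer one. The sketch below is a simplified stand-in that returns every vocabulary entry occurring as a substring; it does not reproduce the inverted index of a real search engine or the output of any particular tool, and the vocabulary is an assumed example.

```python
def overlapping_dictionary_split(text: str, vocabulary: set[str]) -> list[str]:
    """Return every vocabulary entry found in the text, in order of first occurrence."""
    found: list[str] = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocabulary and piece not in found:
                found.append(piece)
    return found


# Assumed vocabulary; the entries are illustrative, not the actual dictionary.
vocabulary = {"先天", "先天性", "单侧", "输精", "输精管", "缺如", "输精管缺如"}
print(overlapping_dictionary_split("先天性单侧输精管缺如", vocabulary))
```

The overlapping pieces raise the chance that at least one of them equals an ontology label, at the cost of more candidates to check in the exact search.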

Step 109, performing semantic-similarity matching on the Chinese split character strings.

Step 110, outputting the one or more ontologies of the Chinese ontology dictionary with the highest semantic similarity.

For example, the Chinese independent character string "all sperm abnormal" cannot be exactly matched with the Chinese ontology dictionary, and neither can its Chinese split character strings "all", "sperm" and "abnormal". After step 109 is executed, the string is fuzzy-matched against each ontology of the Chinese ontology dictionary by semantic similarity, and the eight best-matching ontologies are output in order: "sperm immotility", "sperm head abnormality", "sperm neck abnormality", "sperm motility abnormality", "large-headed sperm", "sperm tail abnormality", "sperm morphology abnormality" and "reduced sperm motility".

The semantic similarity between the Chinese independent character string "all sperm abnormal" and each ontology of the Chinese ontology dictionary is calculated according to the functions described in claims 7 to 9. Table 3 gives an example calculation for matching the Chinese independent character string "all sperm abnormal" against the Chinese ontology "sperm head abnormality". The frequency with which each split character string appears in the independent character string or in the ontology, and the number of ontologies containing each split character string, are shown in Table 3:

TABLE 3

The TF-IDF of the Chinese split character string "all" is calculated as follows:

The cosine distance between "all sperm abnormal" and "sperm head abnormality" is calculated as follows:
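A standard TF-IDF weighting combined with cosine similarity can be sketched as follows. This is a generic textbook formulation under stated assumptions, not necessarily the function defined in claims 7 to 9: the smoothed inverse document frequency, the helper names and the three tokenized ontologies are all hypothetical, and English glosses stand in for the Chinese split strings.

```python
import math
from collections import Counter


def tfidf_vector(tokens: list[str], doc_freq: Counter, n_docs: int) -> dict[str, float]:
    """Term frequency times a smoothed inverse document frequency (assumed form)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        token: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
        for token, count in counts.items()
    }


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical tokenized ontologies (the "documents").
ontologies = {
    "sperm head abnormality": ["sperm", "head", "abnormal"],
    "sperm tail abnormality": ["sperm", "tail", "abnormal"],
    "absence of the vas deferens": ["vas", "deferens", "absent"],
}
# Number of ontologies that contain each split string.
doc_freq = Counter(token for tokens in ontologies.values() for token in set(tokens))
n_docs = len(ontologies)

query = tfidf_vector(["all", "sperm", "abnormal"], doc_freq, n_docs)
scores = {
    name: cosine_similarity(query, tfidf_vector(tokens, doc_freq, n_docs))
    for name, tokens in ontologies.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Cosine distance, where needed, is one minus the cosine similarity, so ranking by highest similarity and ranking by smallest distance give the same order.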

Example 2

As shown in FIG. 2, the present invention provides an English segmentation and matching method for clinical phenotype information of infertility.

Step 201, performing natural language preprocessing on the Chinese clinical phenotype character string to obtain a preprocessed English clinical phenotype initial character string.

The natural language preprocessing that turns the original Chinese clinical phenotype character string into the preprocessed English clinical phenotype initial character string can be implemented as follows: uniformly converting the encoding of the original Chinese clinical phenotype character string to UTF-8; converting all full-width symbols into half-width symbols; converting Arabic numerals into English number words; removing meaningless character strings, such as "focus on", "none", "not examined", "normal", "past medical history", "specifically", "requires examination", "see attachment for details", and the like; replacing irregular clinical descriptions with standard English descriptions, for example replacing "<" with "less than", "%" with "percent", arrow symbols such as "↓" with the corresponding word, "Φ" with "diameter", "cm" with "centimeter", and "mL" or "ml" with "milliliter"; expanding abbreviations, including abbreviations mistyped during manual entry, into their full English names, for example: "CUAVD", "CVAVD", "CBAVD" and "CBVAD" are replaced with "congenital unilateral absence of the vas deferens", "AsAbt" with "antisperm antibody", "TESA" with "testicular sperm aspiration", "ICSI" with "intracytoplasmic sperm injection", "SCOS" with "Sertoli cell-only syndrome", "MMAF" with "multiple morphological abnormalities of the sperm flagella", "PCOS" with "polycystic ovary syndrome", and so on; converting uppercase letters to lowercase, for example "Microduplication" to "microduplication" and "Microdeletion" to "microdeletion"; and automatically translating the Chinese clinical phenotype character string into English. The result is the preprocessed English clinical phenotype initial character string.

The meaningless character strings are provided by a pre-established dictionary of Chinese non-clinical terms (Table 1); the irregular clinical descriptions, the abbreviations (including mistyped ones) and the corresponding standardized English descriptions are provided by a pre-established dictionary of standard English clinical phenotype terms (Table 4).

TABLE 4

Step 202, judging whether the preprocessed English clinical phenotype initial character string contains punctuation marks. If it contains no punctuation marks, it is not split.

Step 203, if punctuation marks are present, splitting the preprocessed English clinical phenotype initial character string at the punctuation marks to obtain English independent character strings.

Step 204, judging whether the English independent character string exactly matches the English ontology dictionary.

Step 205, if the English independent character string exactly matches the English ontology dictionary, outputting the exactly matched English independent character string.

The English ontology dictionary includes, but is not limited to, the Human Phenotype Ontology (HPO) dictionary. As with the Chinese ontology dictionary, the HPO dictionary comprises several sub-ontologies.

Each English ontology entry comprises an ontology name, an ontology descriptor, ontology synonyms or alternative names, an ontology abbreviation, the corresponding entries in other ontology dictionaries (cross-references), and a general description of all of the above. The cross-referenced ontology dictionaries include, but are not limited to, SNOMED CT (international standard clinical terminology), human disease ontologies, combinatorial phenotype ontologies, human skin disease ontologies, infectious disease ontologies, pathogenic disease ontologies, mammalian phenotype ontologies, and the like. The language of the English ontology dictionary is English. The English ontology dictionary also contains the hierarchical structure of the ontology tree, a directed acyclic graph (DAG), that is, the parent-child relationships between HPO terms.

Step 206, splitting the English independent character string to obtain English split character strings.

Step 207, judging whether an English split character string exactly matches the English ontology dictionary.

Step 208, if an English split character string exactly matches the English ontology dictionary, outputting the exactly matched English split character string.

The English independent character string can be split with a word segmentation tool. Common English word segmentation tools include NLTK, Keras, Sklearn, SpaCy, Gensim, etc.

The splitting methods for the English independent character string include an N-gram splitting rule (with N ranging from 1 to 30), that is, splitting into split character strings of any length. The splitting method also includes saving the tokenizer, and the like.

For example, splitting "congenital unilateral absent vas deferens" yields fifteen English split character strings: "congenital", "congenital unilateral", "congenital unilateral absent", "congenital unilateral absent vas", "congenital unilateral absent vas deferens", "unilateral", "unilateral absent", "unilateral absent vas", "unilateral absent vas deferens", "absent", "absent vas", "absent vas deferens", "vas", "vas deferens" and "deferens". Checking for exact matches against the English ontology dictionary outputs the matched string "absent vas deferens".

For example, splitting "obstructive azoospermia" yields three English split character strings: "obstructive", "azoospermia" and "obstructive azoospermia". An exact search finds that "azoospermia" and "obstructive azoospermia" exactly match the English ontology dictionary, and both are output.

For example, splitting "all sperm abnormalities" yields six English split character strings: "all", "all sperm", "all sperm abnormalities", "sperm", "sperm abnormalities" and "abnormalities", none of which exactly matches the English ontology dictionary.

Optionally, the splitting of the English independent character string may further include natural language processing methods such as stop-word removal, lemmatization and stemming, which are not described in detail here.
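The N-gram splitting rule can be illustrated with a few lines of code. The function name `word_ngrams` is hypothetical, whitespace tokenization is an assumption, and the first sample string assumes the English independent string reads "congenital unilateral absent vas deferens".

```python
def word_ngrams(text: str, max_n: int = 30) -> list[str]:
    """Return every contiguous word sequence of length 1..max_n."""
    words = text.split()
    grams: list[str] = []
    for start in range(len(words)):
        longest = min(max_n, len(words) - start)
        for n in range(1, longest + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


grams = word_ngrams("congenital unilateral absent vas deferens")
print(len(grams))                      # a string of k words yields k * (k + 1) / 2 n-grams
print("absent vas deferens" in grams)
```

A string of five words therefore produces 5 + 4 + 3 + 2 + 1 = 15 split strings and a string of three words produces six, which agrees with the counts given in the examples.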

Step 209, performing semantic-similarity matching on the English split character strings.

Step 210, outputting the one or more ontologies of the English ontology dictionary with the highest semantic similarity.

For example, neither the English independent character string "all sperm abnormalities" nor its English split character strings exactly match any English ontology. Fuzzy matching by semantic similarity, using the functions described in claims 7 to 9, then finds the one or more best-matching ontologies; the specific calculation is not repeated here.

Example 3

As shown in FIG. 3, the embodiment of the present invention provides an overall flow and weighting rules for natural language processing of infertility clinical phenotypes.

As shown in the overall flow of FIG. 3, natural language processing, splitting, exact matching and fuzzy matching of the original Chinese clinical phenotype character string output the following: a Chinese independent character string that exactly matches the Chinese ontology dictionary (step 304), an English independent character string that exactly matches the English ontology dictionary (step 304), a Chinese split character string that exactly matches the Chinese ontology dictionary (step 306), an English split character string that exactly matches the English ontology dictionary (step 306), the one or more ontologies of the Chinese ontology dictionary that best match the Chinese independent character string (step 307), and the one or more ontologies of the English ontology dictionary that best match the English independent character string (step 307).
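The three matching stages of this flow can be combined in a small sketch for the English branch. Several things are assumptions made for illustration: all three stages are run and their outputs collected side by side, a plain token-overlap (Jaccard) score stands in for the semantic-similarity function of the claims, and the function names and the three sample labels are hypothetical.

```python
def word_ngrams(text: str) -> list[str]:
    """All contiguous word sequences of the string (step 305, English branch)."""
    words = text.split()
    return [" ".join(words[i:j]) for i in range(len(words)) for j in range(i + 1, len(words) + 1)]


def token_overlap_rank(query: str, labels: list[str]) -> list[str]:
    """Rank labels by Jaccard token overlap; a stand-in for semantic similarity."""
    query_tokens = set(query.split())
    scored = []
    for label in labels:
        label_tokens = set(label.split())
        score = len(query_tokens & label_tokens) / len(query_tokens | label_tokens)
        if score > 0:
            scored.append((score, label))
    return [label for score, label in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


def match_cascade(independent: str, ontology_labels: list[str], top_k: int = 3) -> dict:
    """Collect the outputs of steps 304, 306 and 307 for one independent string."""
    labels = set(ontology_labels)
    return {
        "independent_exact": independent if independent in labels else None,          # step 304
        "split_exact": [p for p in word_ngrams(independent) if p in labels],          # step 306
        "fuzzy": token_overlap_rank(independent, ontology_labels)[:top_k],            # step 307
    }


labels = ["absent vas deferens", "vas deferens atresia", "infertility"]
result = match_cascade("congenital unilateral absent vas deferens", labels)
print(result)
```

Keeping the three outputs separate, rather than stopping at the first hit, leaves room for a later weighting step to combine exact and fuzzy evidence for the same ontology.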

For example, in steps 302 and 303, the original Chinese clinical phenotype character string "congenital unilateral absence of the vas deferens?" undergoes Chinese natural language preprocessing and punctuation splitting to give the Chinese independent character string "congenital unilateral absence of the vas deferens". In step 304, an exact search finds no Chinese independent character string that exactly matches the Chinese ontology dictionary. In step 305, "congenital unilateral absence of the vas deferens" is split by the maximum matching method into three Chinese split character strings: "congenital", "unilateral" and "absence of the vas deferens". In step 306, an exact search outputs the Chinese split character string "absence of the vas deferens", which exactly matches the Chinese ontology dictionary. In step 307, the string is fuzzy-matched against each ontology of the Chinese ontology dictionary by semantic similarity, and the three best-matching ontologies are output in order: "absence of the vas deferens", "vas deferens atresia" and "infertility".

Meanwhile, in steps 302 and 303, the same original character string "congenital unilateral absence of vas deferens?" undergoes English natural language preprocessing and punctuation splitting to give the English independent string "congenital unilateral absent vas deferens". In step 304, an exact search is performed for an English independent string that exactly matches the English ontology dictionary. In step 305, the N-gram splitting method splits "congenital unilateral absent vas deferens" into fifteen English split strings: "congenital", "congenital unilateral", "congenital unilateral absent", "congenital unilateral absent vas", "congenital unilateral absent vas deferens", "unilateral", "unilateral absent", "unilateral absent vas", "unilateral absent vas deferens", "absent", "absent vas", "absent vas deferens", "vas", "vas deferens", and "deferens". In step 306, exact search outputs the English split string "absent vas deferens", which exactly matches the English ontology dictionary. In step 307, the English independent string is fuzzy-matched against each ontology of the English ontology dictionary by semantic approximation, and the three best-matching ontologies are output in order: "absent vas deferens", "vas deferens atresia", and "abnormal vas deferens morphology".
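The N-gram splitting of step 305 can be illustrated with a short sketch. The function name `ngram_split` is hypothetical, and the example string is written here as "congenital unilateral absent vas deferens", an assumed normalized reading of the five-word English independent string; a five-word string yields 5 + 4 + 3 + 2 + 1 = 15 contiguous word sequences.

```python
def ngram_split(text, max_n=30):
    """Return every contiguous word sequence of length 1..max_n."""
    tokens = text.split()
    limit = min(max_n, len(tokens))
    return [" ".join(tokens[i:i + n])
            for i in range(len(tokens))
            for n in range(1, limit + 1)
            if i + n <= len(tokens)]


splits = ngram_split("congenital unilateral absent vas deferens")
print(len(splits))
```

The default `max_n=30` mirrors the 1 to 30 length range stated for the N-gram model in the claims.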

Therefore, after natural language preprocessing, splitting, exact matching, and fuzzy matching of the Chinese independent string "congenital unilateral absence of vas deferens" and the English independent string "congenital unilateral absent vas deferens", the output strings of steps 304, 306, and 307 are as shown in Table 5:

TABLE 5

In step 308, the one or more ontologies that finally match the Chinese independent string and the English independent string are calculated according to the weighting function described in claim 10. The weights are computed as follows:

The overall weight of "absence of vas deferens" after weighting is:

The overall weight of "vas deferens atresia" after weighting is:

The overall weight of "infertility" after weighting is:

The overall weight of "abnormal vas deferens morphology" after weighting is:

Ranked by weighted value from largest to smallest, the final matching results output for "congenital unilateral absence of vas deferens" and "congenital unilateral absent vas deferens" are: absence of vas deferens, vas deferens atresia, abnormal vas deferens morphology, and infertility.

For ease of understanding, the example above was chosen so that natural language preprocessing and punctuation splitting of the Chinese clinical phenotype original character string yield only a single Chinese independent string and a single corresponding English independent string. In practice, original strings often take the form of long sentences, from which preprocessing and punctuation splitting yield multiple Chinese independent strings and corresponding English independent strings. In that case, the overall flow and weighting rules of steps 304 to 308 are applied to each pair of Chinese and English independent strings.

The foregoing describes embodiments of the present invention.

Claims

1. A natural language processing method for infertility clinical phenotype information, characterized by comprising:

step 1: reading a Chinese clinical phenotype character string or a document related to Chinese clinical phenotypes, and storing it as a Chinese clinical phenotype original character string;

step 2: performing natural language preprocessing on the Chinese clinical phenotype original character string to generate a preprocessed Chinese clinical phenotype initial character string and an English clinical phenotype initial character string;

step 3: if the Chinese clinical phenotype initial character string contains punctuation marks, splitting the preprocessed Chinese clinical phenotype initial character string at the punctuation marks to obtain the Chinese independent character strings between the corresponding punctuation marks; if it contains no punctuation marks, outputting it directly as a Chinese independent character string;

if the English clinical phenotype initial character string contains punctuation marks, splitting the preprocessed English clinical phenotype initial character string at the punctuation marks to obtain the English independent character strings between the corresponding punctuation marks; if it contains no punctuation marks, outputting it directly as an English independent character string;

step 4: based on the Chinese ontology dictionary, searching for an ontology in the Chinese ontology dictionary that exactly matches the Chinese independent character string, and outputting the Chinese independent character string that exactly matches the Chinese ontology dictionary; if there is no match, proceeding directly to step 5;

based on the English ontology dictionary, searching for an ontology in the English ontology dictionary that exactly matches the English independent character string, and outputting the English independent character string that exactly matches the English ontology dictionary; if there is no match, proceeding directly to step 5;

step 5: splitting the Chinese independent character string by a splitting method to obtain Chinese split character strings;

splitting the English independent character string by a splitting method to obtain English split character strings;

step 6: based on the Chinese ontology dictionary, searching for ontologies in the Chinese ontology dictionary that exactly match the Chinese split character strings, outputting the Chinese split character strings that exactly match the Chinese ontology dictionary, and passing the Chinese split character strings without an exact match directly to step 7;

based on the English ontology dictionary, searching for ontologies in the English ontology dictionary that exactly match the English split character strings, outputting the English split character strings that exactly match the English ontology dictionary, and passing the English split character strings without an exact match directly to step 7;

step 7: calculating the semantic approximation between each ontology of the Chinese ontology dictionary and the Chinese independent character string corresponding to the Chinese split character strings without an exact match, and outputting, according to the semantic approximation, the one or more ontologies of the Chinese ontology dictionary that best match that Chinese independent character string, wherein the closer the semantic approximation is to zero, the better the Chinese independent character string matches the ontology of the Chinese ontology dictionary;

calculating the semantic approximation between each ontology of the English ontology dictionary and the English independent character string corresponding to the English split character strings without an exact match, and outputting, according to the semantic approximation, the one or more ontologies of the English ontology dictionary that best match that English independent character string, wherein the closer the semantic approximation is to zero, the better the English independent character string matches the ontology of the English ontology dictionary;

step 8: performing a weighted calculation on the Chinese independent character string exactly matching the Chinese ontology dictionary output in step 4, the English independent character string exactly matching the English ontology dictionary output in step 4, the Chinese split character strings exactly matching the Chinese ontology dictionary output in step 6, the English split character strings exactly matching the English ontology dictionary output in step 6, the one or more ontologies of the Chinese ontology dictionary best matching the Chinese independent character string output in step 7, and the one or more ontologies of the English ontology dictionary best matching the English independent character string output in step 7, to obtain the one or more ontologies that finally match the Chinese independent character string and the English independent character string, and outputting them in descending order of weight.

2. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein the natural language preprocessing in step 2 comprises Chinese natural language preprocessing and English natural language preprocessing, wherein:

the Chinese natural language preprocessing comprises: unifying the encoding format of the Chinese clinical phenotype original character string, conversion between half-width and full-width symbols, conversion between Arabic numerals and Chinese capital numerals, removal of meaningless terms, Chinese standardization of non-standard clinical descriptions, conversion between abbreviations and standardized Chinese full names, conversion between English names and Chinese names, and conversion between upper-case and lower-case letters, wherein the meaningless terms are provided by a pre-established Chinese non-clinical term dictionary, and the non-standard clinical descriptions and abbreviations with their corresponding standardized Chinese descriptions, and the English names with their corresponding Chinese names, are provided by a pre-established Chinese clinical phenotype standard term dictionary;

the English natural language preprocessing comprises: unifying the encoding format of the Chinese clinical phenotype original character string, conversion between half-width and full-width symbols, conversion between Arabic numerals and English number words, removal of meaningless terms, English standardization of non-standard clinical descriptions, conversion between abbreviations and standardized English full names, conversion between upper-case and lower-case letters, and translation of Chinese clinical phenotype information into English clinical phenotype information, wherein the meaningless terms are provided by a pre-established Chinese non-clinical term dictionary, and the non-standard clinical descriptions and abbreviations with their corresponding standardized English descriptions are provided by a pre-established English clinical phenotype standard term dictionary.
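A few of the preprocessing operations listed in claim 2 can be sketched as below. This is an illustrative sketch only: the function `preprocess` and both lookup tables are hypothetical stand-ins for the non-clinical term dictionary and the standard term dictionary, and Unicode NFKC normalization is an assumed way of converting full-width letters and symbols to half-width ones.

```python
import unicodedata

# Hypothetical stand-ins for the pre-established dictionaries.
NON_CLINICAL_TERMS = ["please note", "the patient reports"]
ABBREVIATIONS = {"cbavd": "congenital bilateral absence of vas deferens"}


def preprocess(raw):
    text = unicodedata.normalize("NFKC", raw)   # full-width -> half-width
    text = text.lower()                         # unify letter case
    for term in NON_CLINICAL_TERMS:             # drop meaningless terms
        text = text.replace(term, "")
    # Expand abbreviations to their standardized full names.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(preprocess("Please note ＣＢＡＶＤ"))
```

Number conversion, standardization of non-standard descriptions, and Chinese-to-English translation are omitted here because they depend on dictionaries and services the text does not specify.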

3. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein the rule for splitting at punctuation marks in step 3 is:

splitting, at full-width and half-width punctuation marks, the preprocessed Chinese clinical phenotype initial character string into the Chinese independent character strings between the corresponding punctuation marks, or the preprocessed English clinical phenotype initial character string into the English independent character strings between the corresponding punctuation marks, wherein an independent character string is a character string with independent semantics.
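The punctuation splitting rule of claim 3 can be sketched with a regular expression. The separator set below is an assumption chosen for illustration (the text does not enumerate the punctuation marks), and `split_on_punctuation` is a hypothetical name.

```python
import re

# Half-width and full-width separators; the exact set is an assumption.
PUNCTUATION = r"[,.;:!?，。；：！？、]+"


def split_on_punctuation(text):
    """Return the non-empty strings found between punctuation marks."""
    return [part.strip() for part in re.split(PUNCTUATION, text) if part.strip()]


parts = split_on_punctuation("azoospermia，absent vas deferens？")
print(parts)
```

A string without punctuation is returned unchanged as a single independent string, which corresponds to the direct-output branch of step 3.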

4. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein the ontology dictionary in step 4 comprises a Chinese ontology dictionary and an English ontology dictionary, wherein:

the Chinese ontology dictionary comprises phenotypic abnormality ontologies, mode-of-inheritance ontologies, clinical modifier ontologies, clinical course ontologies and frequency ontologies, wherein each ontology comprises an ontology name, an ontology description, ontology synonyms or aliases, an ontology abbreviation, the corresponding entries in other ontology dictionaries across databases, and an overall description of all contents; the other ontology dictionaries across databases comprise international clinical medicine standard terms, human disease ontologies, combined phenotype ontologies, human skin disease ontologies, infectious disease ontologies, pathogenic disease ontologies, mammalian phenotype ontologies and the like; and the language of the Chinese ontology dictionary is Chinese;

the English ontology dictionary comprises phenotypic abnormality ontologies, mode-of-inheritance ontologies, clinical modifier ontologies, clinical course ontologies and frequency ontologies, wherein each ontology comprises an ontology name, an ontology description, ontology synonyms or aliases, an ontology abbreviation, the corresponding entries in other ontology dictionaries across databases, and an overall description of all contents; the other ontology dictionaries across databases comprise international clinical medicine standard terms, human disease ontologies, combined phenotype ontologies, human skin disease ontologies, infectious disease ontologies, pathogenic disease ontologies, mammalian phenotype ontologies and the like; and the language of the English ontology dictionary is English.
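One possible in-memory shape for an ontology entry, together with an exact-match lookup, is sketched below. The class `OntologyEntry`, its field names, and the decision to index synonyms and abbreviations alongside the name are all assumptions made for illustration; the text lists the fields of an ontology but does not say which of them take part in exact matching.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    name: str
    description: str = ""
    synonyms: list = field(default_factory=list)
    abbreviations: list = field(default_factory=list)
    cross_references: list = field(default_factory=list)


def build_exact_index(entries):
    """Map every lower-cased label of every entry to that entry."""
    index = {}
    for entry in entries:
        for label in [entry.name, *entry.synonyms, *entry.abbreviations]:
            index[label.lower()] = entry
    return index


entries = [OntologyEntry("absent vas deferens",
                         synonyms=["absence of vas deferens"])]
index = build_exact_index(entries)
match = index.get("absence of vas deferens")
```

A dictionary lookup makes each exact search a constant-time operation, which matters when every split string of every independent string is searched.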

5. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein in step 5, the Chinese splitting method for the Chinese independent character string comprises maximum matching segmentation and segmentation by constructing an inverted index with a search engine;

the English splitting method for the English independent character string comprises N-gram splitting, lemmatization and stemming, wherein N is any length from 1 to 30.

6. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein the calculation of the semantic approximation in step 7 comprises the steps of:

calculating a TF-IDF matrix, wherein TF-IDF denotes term frequency-inverse document frequency, and the TF-IDF matrix represents the matching frequency between the Chinese split character strings and each ontology of the Chinese ontology dictionary, or between the English split character strings and each ontology of the English ontology dictionary;

reducing the dimension of the TF-IDF matrix by singular value decomposition (SVD), converting a high-dimensional semantic space into a low-dimensional semantic space;

calculating the cosine similarity and converting it into a cosine distance.

7. The natural language processing method for infertility clinical phenotype information according to claim 6, wherein the calculation formula of the TF-IDF matrix is as follows:

wherein t represents a Chinese split character string obtained by splitting a Chinese independent character string or an ontology of the Chinese ontology dictionary, or an English split character string obtained by splitting an English independent character string or an ontology of the English ontology dictionary;

d represents the Chinese independent character string corresponding to the Chinese split character string or an ontology of the Chinese ontology dictionary, or the English independent character string corresponding to the English split character string or an ontology of the English ontology dictionary;

D represents all ontologies in the Chinese ontology dictionary together with the Chinese independent character strings, or all ontologies in the English ontology dictionary together with the English independent character strings;

TF_td represents the weighted frequency with which the Chinese split character string appears in its corresponding Chinese independent character string or in an ontology of the Chinese ontology dictionary, or with which the English split character string appears in its corresponding English independent character string or in an ontology of the English ontology dictionary; that is, the number of times the Chinese split character string t appears in d divided by the number of all Chinese split character strings in d, or the number of times the English split character string t appears in d divided by the number of all English split character strings in d;

IDF_tD represents the normalized term frequency with which the Chinese split character string appears in the Chinese ontology dictionary, or with which the English split character string appears in the English ontology dictionary; that is, the number of all ontologies and Chinese independent character strings in D divided by the number of ontologies and Chinese independent character strings that contain the Chinese split character string t, or the number of all ontologies and English independent character strings in D divided by the number of ontologies and English independent character strings that contain the English split character string t;

TF-IDF_td represents the matching frequency between the Chinese split character string t and each ontology d of the Chinese ontology dictionary, or between the English split character string t and each ontology d of the English ontology dictionary.
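The definitions of TF_td and IDF_tD above can be sketched in a few lines. Two points are assumptions: the product form TF × IDF (the usual combination, since the formula itself is not reproduced in this text), and the use of the plain ratio for IDF as the definitions describe it, whereas many implementations take a logarithm of that ratio. The function name `tf_idf` and the toy documents are hypothetical.

```python
def tf_idf(documents):
    """documents: one list of split strings per ontology or independent string.
    Returns the sorted vocabulary and one TF-IDF row per document."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    doc_freq = {t: sum(t in doc for doc in documents) for t in vocabulary}
    matrix = []
    for doc in documents:
        row = []
        for t in vocabulary:
            tf = doc.count(t) / len(doc)    # occurrences of t in d / size of d
            idf = n_docs / doc_freq[t]      # |D| / number of d containing t
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


# Toy documents (assumed for illustration): two ontologies and one more entry.
docs = [["absent", "vas", "deferens"],
        ["vas", "deferens", "atresia"],
        ["infertility"]]
vocab, H = tf_idf(docs)
```

Each row of `H` corresponds to one ontology or independent string and each column to one split string, matching the n by m layout described for the matrix H in claim 8.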

8. The natural language processing method for infertility clinical phenotype information according to claim 6, wherein the computational formula of dimension reduction is as follows:

H＝USV^T (2)

in formula 2, H represents the TF-IDF matrix, which is an n×m matrix, where n is the sum of the number of all ontologies in the Chinese ontology dictionary and the number of Chinese independent character strings, and m is the sum of the number of Chinese split character strings corresponding to the Chinese independent character strings and the number of split character strings of all ontologies in the Chinese ontology dictionary; or n is the sum of the number of all ontologies in the English ontology dictionary and the number of English independent character strings, and m is the sum of the number of English split character strings corresponding to the English independent character strings and the number of split character strings of all ontologies in the English ontology dictionary; U is an n×n matrix whose columns are the left singular vectors of H; V is an m×m matrix whose columns are the right singular vectors of H; S is an n×m matrix whose diagonal elements are the singular values of H; T denotes matrix transposition;

in formula 3, ii denotes the matrix coordinate of an element on the diagonal of S; k ∈ [0.1λ_max, 0.8λ_max] or k ∈ [λ_20%Sii, λ_90%Sii], where λ denotes the value of each S_ii and λ_max denotes the maximum λ over all elements on the diagonal of S;

in formula 4, the new TF-IDF matrix serves as the basis for calculating the cosine similarity between the Chinese independent character string and the ontologies of the Chinese ontology dictionary, or between the English independent character string and the ontologies of the English ontology dictionary, wherein the elements on the diagonal of S are replaced with the S_ii of formula 3.
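The dimension reduction of claim 8 can be sketched with NumPy (a third-party library). Since formulas 3 and 4 are not reproduced in this text, the thresholding rule used here, keeping a singular value when it is at least k and setting it to zero otherwise, is an assumption, as are the function name `reduce_rank` and the choice k = 0.4 λ_max from within the stated range [0.1 λ_max, 0.8 λ_max].

```python
import numpy as np


def reduce_rank(H, fraction=0.4):
    """Decompose H = U S V^T, zero the singular values below k, rebuild H."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    k = fraction * s.max()                  # k as a fraction of lambda_max
    s_kept = np.where(s >= k, s, 0.0)       # assumed form of formula 3
    return (U * s_kept) @ Vt                # assumed form of formula 4


# Toy TF-IDF matrix (rows: documents, columns: split strings).
H = np.array([[1.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 1.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 3.0, 0.0]])
H_new = reduce_rank(H)
```

In this toy case the singular values are 3, about 1.41 and 1, so the smallest one is dropped; the first two rows, which differ only in one split string, collapse onto the same vector in the reduced space.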

9. The natural language processing method for infertility clinical phenotype information according to claim 6, wherein the cosine distance is calculated as follows:

wherein m is the sum of the number of Chinese split character strings corresponding to the Chinese independent character strings and the number of split character strings of all ontologies of the Chinese ontology dictionary, or the sum of the number of English split character strings corresponding to the English independent character strings and the number of split character strings of all ontologies of the English ontology dictionary; i is an index running from 1 to m;

r is the TF-IDF matrix corresponding to the Chinese independent character string or the TF-IDF matrix corresponding to the English independent character string;

d is the TF-IDF matrix corresponding to each ontology of the Chinese ontology dictionary or the TF-IDF matrix corresponding to each ontology of the English ontology dictionary;

L represents the degree of matching between the Chinese independent character string and an ontology of the Chinese ontology dictionary, or between the English independent character string and an ontology of the English ontology dictionary; the closer the value of L is to 0, the higher the degree of matching.
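A cosine distance consistent with the description above, where values closer to 0 mean a better match, can be sketched as one minus the cosine similarity. That form is an assumption, since the formula itself is not reproduced in this text; the function name `cosine_distance` is hypothetical, and `r` and `d` are treated as length-m rows of the TF-IDF matrix.

```python
import math


def cosine_distance(r, d):
    """Return 1 - cos(r, d); identical directions give 0."""
    dot = sum(a * b for a, b in zip(r, d))
    norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in d))
    if norm == 0:
        return 1.0          # treat an all-zero vector as matching nothing
    return 1.0 - dot / norm


same = cosine_distance([0.5, 0.5, 0.0], [1.0, 1.0, 0.0])
unrelated = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction give a distance of 0 regardless of their length, so a long ontology description and a short independent string can still match closely.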

10. The natural language processing method for infertility clinical phenotype information according to claim 1, wherein the weighted rule calculation formula of step 8 is as follows:

wherein x₁, x₂, x₃, x₄, x₅ and x₆ denote, in order, the Chinese independent character string exactly matching the Chinese ontology dictionary, the English independent character string exactly matching the English ontology dictionary, the Chinese split character string exactly matching the Chinese ontology dictionary, the English split character string exactly matching the English ontology dictionary, the ontology of the Chinese ontology dictionary matched with the Chinese independent character string, and the ontology of the English ontology dictionary matched with the English independent character string;

when the character string corresponding to any of x₁ to x₄ exists, the value of that variable is 1; otherwise its value is 0; when the character string corresponding to x₅ or x₆ exists, the value of that variable is 1 divided by the sum of the cosine distance corresponding to the character string and 0.01; when the character string corresponding to x₅ or x₆ does not exist, its value is 0;

the one or more ontologies finally matched with the Chinese independent character string and the English independent character string are obtained through the weighting equation and output in descending order of weight.
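The values of x₁ to x₆ described in claim 10 can be sketched as follows. The rule for each x comes from the text; the way the six values are combined is not reproduced here, so the weighted sum with all coefficients equal to 1 is purely an assumption, as are the names `fuzzy_value` and `overall_weight`.

```python
def fuzzy_value(distance):
    """x5 or x6: 1 / (cosine distance + 0.01), or 0 when there is no match."""
    return 0.0 if distance is None else 1.0 / (distance + 0.01)


def overall_weight(exact_flags, zh_distance, en_distance,
                   coefficients=(1.0,) * 6):
    """exact_flags: four booleans for x1..x4 (exact matches present or not).
    coefficients are hypothetical; the source formula is not available."""
    x = [1.0 if flag else 0.0 for flag in exact_flags]
    x += [fuzzy_value(zh_distance), fuzzy_value(en_distance)]
    return sum(w * value for w, value in zip(coefficients, x))


# An ontology hit by both exact split strings and both fuzzy matches
# at cosine distance 0 (illustrative inputs, not values from the patent).
best = overall_weight((False, False, True, True), 0.0, 0.0)
none = overall_weight((False, False, False, False), None, None)
```

The 0.01 term keeps the fuzzy value finite when the cosine distance is exactly 0, capping x₅ and x₆ at 100.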

11. A natural language processing system for clinical phenotypic information for infertility, comprising:

the reading module is used for reading the Chinese clinical phenotype character string or the Chinese clinical phenotype related document and storing the Chinese clinical phenotype character string or the Chinese clinical phenotype related document as a Chinese clinical phenotype original character string;

the conversion module is used for carrying out natural language preprocessing on the Chinese clinical phenotype original character string to generate a preprocessed Chinese clinical phenotype initial character string and an English clinical phenotype initial character string;

the segmentation module is used for splitting the preprocessed Chinese clinical phenotype initial character string and English clinical phenotype initial character string according to the punctuation marks to obtain a Chinese independent character string and an English independent character string between the corresponding punctuation marks;

the exact matching module is used for searching, based on the Chinese ontology dictionary, for exact search results for the Chinese independent character strings or the Chinese split character strings, and for searching, based on the English ontology dictionary, for exact search results for the English independent character strings or the English split character strings;

the fuzzy matching module is used for searching, based on the semantic approximation, for the one or more ontologies of the Chinese ontology dictionary that best match the Chinese independent character strings, and for the one or more ontologies of the English ontology dictionary that best match the English independent character strings;

the output module outputs the one or more Chinese independent character strings corresponding to the Chinese clinical phenotype original character string; obtains, through a weighting rule applied to the exactly matched character strings and the fuzzy matched character strings, the one or more ontologies finally matched with each Chinese independent character string and each English independent character string; and outputs them to the front end in descending order of weight; the exactly matched character strings comprise the Chinese independent character strings and Chinese split character strings that exactly match the Chinese ontology dictionary, and the English independent character strings and English split character strings that exactly match the English ontology dictionary; the fuzzy matched character strings comprise the one or more ontologies of the Chinese ontology dictionary that best match the Chinese independent character strings and the one or more ontologies of the English ontology dictionary that best match the English independent character strings.