CN114692620A - Text processing method and device - Google Patents


Info

Publication number
CN114692620A
Authority
CN
China
Prior art keywords
term
analyzed
terms
associated content
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011589620.5A
Other languages
Chinese (zh)
Inventor
葛鑫
骆卫华
赵宇
施杨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202011589620.5A
Publication of CN114692620A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the application provide a text processing method and device. The method comprises: obtaining terms of at least one domain category; acquiring the associated content of the terms; and determining the matching degree between each term and the domain category to which it belongs according to the associated content of the term and the identification data set of that domain category. The associated content of a term greatly expands the dimensions of the term, and the identification data set likewise greatly expands the dimensions of the corresponding domain category. When the matching degree is determined from the associated content and the identification data set, both provide richer semantic information, which improves the accuracy of quality management of term data. In addition, the whole quality management strategy can be implemented through automatic information mining, information analysis and information comparison, which greatly reduces manual involvement, improves production efficiency and lowers production cost.

Description

Text processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text processing method and apparatus.
Background
Domain-category term data has broad application prospects and can be widely used in technical fields such as e-commerce, social networking and machine translation, where it meets the requirements for professional and standardized text data.
At present, the term data of each domain category can be mined through a manual or automatic term mining pipeline. As the technical fields that use term data continue to develop, the requirements on term data quality keep rising, yet the term data produced by these mining approaches is of uneven quality.
Moreover, in the conventional scheme, quality management of term data is performed manually, which leads to high labor cost, low efficiency and low accuracy.
Disclosure of Invention
The embodiment of the application provides a text processing method, so that under an automatic term data quality verification strategy, term data quality management is achieved in a low-cost mode, and management efficiency and accuracy are improved.
Correspondingly, the embodiment of the application also provides a text processing device, electronic equipment and a storage medium, which are used for ensuring the implementation and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a text processing method, including:
obtaining terms of at least one domain category;
acquiring the associated content of the term;
and determining the matching degree between the term and the domain class to which the term belongs according to the associated content of the term and the identification data set of the domain class to which the term belongs.
The embodiment of the application also discloses a text processing device, which comprises:
the first acquisition module is used for acquiring terms of at least one field category;
the second acquisition module is used for acquiring the associated content of the terms;
and the verification module is used for determining the matching degree between the term and the domain class to which the term belongs according to the associated content of the term and the identification data set of the domain class to which the term belongs.
The embodiment of the application also discloses an electronic device, which comprises: a processor; and a memory having executable code stored thereon that, when executed, causes the processor to perform a method as described in one or more of the embodiments of the application.
Embodiments of the present application also disclose one or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform a method as described in one or more of the embodiments of the present application.
Compared with the prior art, the embodiment of the application has the following advantages:
In the embodiment of the application, the associated content of a term greatly expands the dimensions of the term, so that the definition of the term is no longer limited to a single word but is broadened by associated content of much greater scope. Similarly, the identification data set greatly expands the dimensions of the corresponding domain category. When the matching degree is determined from the associated content of a term and the identification data set of the domain category to which the term belongs, both provide richer semantic information. This larger amount of semantic information improves the accuracy of the matching degree calculation and therefore the accuracy of quality management of term data. In addition, the whole quality management strategy can be implemented through automatic information mining, information analysis and information comparison, which greatly reduces manual involvement, improves production efficiency and lowers production cost.
Drawings
FIG. 1 is a system architecture diagram of a text processing method according to an embodiment of the present application;
FIG. 2 is a system architecture diagram of another method of text processing according to an embodiment of the present application;
FIG. 3 is a text processing system architecture diagram of a machine translation scenario of the present application;
FIG. 4 is a text processing system architecture diagram of another machine translation scenario of the present application;
FIG. 5 is a text processing system architecture diagram of another machine translation scenario of the present application;
FIG. 6 is a medical context text processing system architecture diagram of the present application;
FIG. 7 is a flow chart of steps of a text processing method of the present application;
FIG. 8 is a flowchart illustrating steps of a text processing method according to the present application;
FIG. 9 is a flow chart illustrating a matching degree solution process of the present application;
FIG. 10 is a block diagram of a text processing apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to FIG. 1, a system architecture diagram of a text processing method provided in an embodiment of the present application is shown, including a term data mining server and a term data verification server.
The term data mining server can mine term data from data sources such as lexicons, websites and databases to obtain terms of multiple domain categories. A term belonging to a domain category serves as a specialized designation in that domain: an agreed-upon symbol, carried by written words, that expresses or defines a professional concept. Other carriers, such as symbols, can also serve as terms.
Specifically, the data source from which terms are mined may take several forms. In one implementation, the data source may be an encyclopedic knowledge website: the domain category is used as a search term, the entry content corresponding to the domain category is retrieved, and term mining is performed on the entry content to obtain terms under that domain category. For example, from an entry describing the medical field, term mining can extract "treating", "disease", "health" and "treatment" as terms in the medical field. In another implementation, the data source may be a search engine: web page data associated with a domain category is retrieved from the search engine, for example with a general-purpose crawler, and terms are mined from the associated web page data.
Further, the data source may be a lexicon or a database. The lexicon may include dictionary lexicons, such as an industrial dictionary or a medical dictionary, as well as network lexicons, input method lexicons and the like. The database may include a translation software database, a commodity database and the like. Because lexicons and databases already record attributes of words such as special effects and categories, terms of various domain categories can be mined from them through simple screening.
Based on the raw data acquired from the data source, the term data mining server can use various mining techniques to obtain terms of various domain categories. For example, when mining terms from raw text data, term mining can be based on the word frequency, the word degree and the context of words in the raw text, combined with methods such as a vector space model, term frequency-inverse document frequency (TF-IDF) and Bayesian inference.
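As a rough illustration of the TF-IDF approach mentioned above, the following sketch ranks candidate words from a set of domain documents against a set of background documents. The function name `tfidf_candidates`, the whitespace tokenization and the use of background documents are assumptions made for this example; the patent does not specify how TF-IDF is applied.

```python
import math
from collections import Counter


def tfidf_candidates(domain_docs, background_docs, top_k=3):
    """Rank words found in domain_docs by a simple TF-IDF score.

    Hypothetical helper: term frequency is counted over the domain
    documents, document frequency over domain plus background documents.
    """
    all_docs = domain_docs + background_docs
    n_docs = len(all_docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.lower().split()))

    # Term frequency over the domain documents only.
    tf = Counter()
    for doc in domain_docs:
        tf.update(doc.lower().split())
    total = sum(tf.values())

    scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Words that are frequent in the domain documents but rare elsewhere rise to the top, while common function words score low because they occur in most documents.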
The term data verification server can obtain from the term data mining server the terms it has mined for each domain category, and then verify the matching degree between each term and its domain category. The greater the matching degree, the higher the quality of the term. The matching degree quantifies term quality, so that quality can be controlled, for example by removing terms with a low matching degree from the domain category.
Next, an application scenario for the term verification results generated by the term data verification server is described. Referring to FIG. 2, a system architecture diagram of another text processing method provided in an embodiment of the present application is shown, including a term data mining server, a term data verification server and a term data application server.
The term data application server puts term data to practical use and establishes a local term library to provide term support for the services it offers.
Referring to fig. 3, a diagram of a text processing system architecture of a machine translation scenario provided by an embodiment of the present application is shown.
In the embodiment of the present application, the machine translation scenario relies on a large number of high-quality professional terms, and machine translation serves a very wide range of industries, such as medical care, education and automobiles. Different industries emphasize different customization requirements for machine translation. Building a domain-specific machine translation model requires not only high-quality bilingual domain data but also domain-category term data, which plays an indispensable role in building a high-quality domain machine translation model. Therefore, in a machine translation scenario, the term data application server may be the end that implements the translation service. It may be connected to a term library for each domain, and each term library may be located locally at the term data application server or in the cloud. The term data application server may upload the terms in each domain term library to the term data verification server, which verifies them and sends the term matching degrees back. The term data application server may then correct, filter and reclassify the terms in each domain term library according to the term matching degrees.
The term data application server can then receive a text to be translated uploaded by a client and, through semantic recognition of that text, determine one or more target domain categories associated with it. It either selects terms in the target domain categories for use during translation or, once a preliminary translation is complete, selects terms in the target domain categories to replace non-professional words in the preliminarily translated text, so as to produce a higher-quality translated text. Finally, the term data application server can send the translated text to the client of the corresponding user, providing the user with a high-quality translation service.
The user may be an individual user; an agency such as a patent agency; a regional patent administration department; an enterprise user such as a law firm, a translation company or a foreign trade company; or another institution or related organization. The embodiments of the present application do not limit this.
In the embodiment of the present application, for an individual user, there may be three interaction schemes as follows:
(1) the personal user can upload the text to be translated to the term data application service end through the personal client, and the term data application service end translates the text to be translated based on high-quality terms, so that high-quality translated text is generated and provided for the personal client, and the personal user can learn the translated text on the personal client.
(2) The personal user can upload the text to be translated to the term data application server through the personal client, the term data application server translates the text to be translated by using an original translation algorithm, then selects high-quality terms to replace non-professional words in the translation result according to the translation result, so that the high-quality translation text is generated and provided for the personal client and the translation company client, so that the personal user can learn the translation text on the personal client, and the enterprise user can obtain the translated text through the translation company client, so that the translation cost of an enterprise is reduced. The text to be translated and the translation result can be in two different languages.
(3) The personal client uploads the translated text to the term data application server, the term data application server calibrates the translated text based on high-quality terms, and selects the high-quality terms to replace non-professional words in the translated text in the calibration process, so that the high-quality translated text is generated and provided for the personal client, and an individual user can learn the translated text on the personal client.
It should be noted that the individual user may be replaced with the various enterprise users, such as the law firm, the translation company, the foreign trade company, and the like, so that the client of the enterprise user and the term data application server perform the similar interaction, which may reduce the translation cost of the enterprise user and improve the translation efficiency.
In addition, in scheme (1), the term data application server can establish a correspondence between the professional words in the original language of the text to be translated and the high-quality terms used for translation, and send it to the personal client for display. During display, the translated text and the text to be translated can be shown side by side together with the association between each high-quality term and the corresponding professional word of the original language. For example, a connecting line may join the high-quality term in the translated text to the professional word in the text to be translated. Alternatively, brackets may be added after each high-quality term in the translated text, with the professional word of the original language shown inside the brackets. The embodiments of the present application are not limited thereto.
In schemes (2) and (3), the term data application server may further establish a correspondence between each non-professional word and the high-quality term used to replace it, and send the correspondence to the personal client for display. During display, the non-professional word and the high-quality term may be shown together and both highlighted, for example in bold or with a highlight color, so that the individual user can perceive the conversion between the original language and the translated language and thereby learn a new language.
In practical application, the client may display an input interface. After the user enters a word or a sentence, the client uploads it to the term data application server, which returns the processing result. The client may then show the result in a result display interface, achieving the effect of learning while translating.
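The replacement step in schemes (2) and (3), together with the recorded correspondence between non-professional words and high-quality terms, might be sketched as follows. The function name `replace_with_terms` and the shape of `term_map` (non-professional word mapped to high-quality term) are hypothetical; the patent does not describe how candidate words are located in the text.

```python
import re


def replace_with_terms(text, term_map):
    """Replace non-professional words with high-quality terms.

    Returns the rewritten text and the list of (original, term) pairs,
    which a client could use to highlight both sides of each replacement.
    """
    if not term_map:
        return text, []

    lookup = {plain.lower(): term for plain, term in term_map.items()}
    # Try longer phrases first so that they win over their own substrings.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )

    pairs = []

    def swap(match):
        plain = match.group(0)
        term = lookup[plain.lower()]
        pairs.append((plain, term))
        return term

    return pattern.sub(swap, text), pairs
```

A single pass with one combined pattern is used so that a term inserted by one replacement is never rewritten by another.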
Further, referring to FIG. 4, a text processing system architecture diagram of another machine translation scenario provided in an embodiment of the present application is shown. It may be applied in the patent field, which places requirements on the language of patent documents. The term data application server may be a patent translation server connected to a patent term library, which may be located locally at the patent translation server or in the cloud. The patent translation server may upload the patent terms in the patent term library to the term data verification server, which verifies them and sends the term matching degrees back. Making use of the term matching degrees, the patent translation server may receive a patent document to be translated uploaded by the patent uploading system and, through semantic recognition of that document, determine one or more target domain categories associated with it. It then either selects terms in the target domain categories for use during translation or, once a preliminary translation is complete, selects terms in the target domain categories to replace non-professional words in the preliminarily translated patent document, so as to produce a higher-quality translated patent document.
After translation is completed, the patent translation server can send the translated high-quality patent document to the agent client of the corresponding agent, which saves the agent translation cost. In addition, the patent translation server can send the translated high-quality patent document to the client of the patent examination unit of the corresponding examiner, which reduces the difficulty of examining the patent document.
Further, referring to fig. 5, there is shown a text processing system architecture diagram of another machine translation scenario provided by the embodiment of the present application, in which the machine translation scenario relies on a large number of high-quality professional terms, and the industry involved in machine translation is very wide, such as: medical, educational, automotive, and the like. Enterprises in different industries have respective emphasis on customizing requirements of machine translation, and the establishment of a domain-specific machine translation model not only needs high-quality domain bilingual data, but also needs domain-class term data which plays an essential and important role in establishing a high-quality domain machine translation model.
Assume that the enterprise corresponding to the enterprise client belongs to the automobile industry. In a machine translation scenario, the term data application server may be the end that implements the translation service and may be connected to an automobile term library. The term data application server can upload automobile terms to the term data verification server, which verifies them and sends the term matching degrees back. The term data application server can then correct, screen and reclassify the automobile terms in the automobile term library according to the term matching degrees. Finally, the term data application server can send the translated text to the enterprise client of the corresponding enterprise user, providing a high-quality translation service for the enterprise user's domain.
Further, various fields in which the term data application server can be applied are described in detail:
For example, referring to FIG. 6, a text processing system architecture diagram for the medical field provided by an embodiment of the present application is shown. In the medical field, the term data application server may be a medical record management server connected to a medical term library, which may be located locally at the medical record management server or in the cloud. The medical record management server may upload the medical terms in the medical term library to the term data verification server, which verifies them and sends the term matching degrees back. The medical record management server may then correct, filter and reclassify the terms in the medical term library according to the term matching degrees. Afterwards, the medical record management server may receive medical record files uploaded by hospital doctors through a terminal and, based on the updated medical term library, select higher-quality terms to replace irregular words in the medical record files, thereby standardizing them. The corrected medical record file may be stored locally or sent to the corresponding patient client, providing the patient with a high-quality medical record file.
In the logistics field, the term data application server may be a logistics management server that manages the logistics documents generated in each logistics link. Based on a logistics term library, it selects terms to replace irregular words in the logistics documents, or selects terms to serve as header contents when logistics documents are created.
In the automobile field, the term data application server may be an automobile forum server that provides a browsable automobile forum website. While the website is being built and maintained, terms are selected from an automobile term library to support that work.
For the terms of each domain category, the term data verification server can provide the term data application server with the matching degree between each term and the domain category to which it belongs. Based on these verification results, the term data application server can clean the terms in its local term library, for example by removing terms with a low matching degree, or by ranking terms by matching degree so that services built on the terms can choose among them according to the ranking. For example, when the medical record management server replaces an irregular word in a medical record file and several terms are associated with that word, the term with the greatest matching degree may be selected for the replacement.
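The cleaning, ranking and selection described above can be sketched in a few lines. The function names and the threshold value of 0.5 are assumptions for illustration only; the patent does not state how a "small" matching degree is defined.

```python
def clean_and_rank(match_degrees, threshold=0.5):
    """Drop terms below the (assumed) threshold and rank the rest.

    match_degrees maps each term to its matching degree for one domain.
    Returns (term, degree) pairs sorted from best to worst match.
    """
    kept = [(term, deg) for term, deg in match_degrees.items() if deg >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def best_replacement(candidates, match_degrees):
    """Among several candidate terms, pick the one with the highest degree."""
    return max(candidates, key=lambda term: match_degrees.get(term, 0.0))
```

For instance, with degrees of 0.92, 0.77 and 0.41 for three terms, `clean_and_rank` keeps the first two in that order, and `best_replacement` returns whichever candidate has the highest degree.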
In this embodiment of the present application, the term data verification server may pre-construct a corresponding identification data set for each field category, acquire corresponding association content based on each term, and determine a matching degree between the term and the field category to which the term belongs according to the association content of the term and the identification data set of the field category to which the term belongs.
Specifically, the data in the identification data set is used for accurately reflecting the characteristics of the domain category, in an implementation manner, the identification data set may include reference structured terms or reference text paragraphs predefined for the domain category, and the defined texts may accurately reflect the characteristics of the domain category, where the reference structured terms may be recognized proprietary terms or high-quality terms in the domain category, and the reference text paragraphs may also be called seed texts, which are recognized text paragraphs with high confidence for accurately describing the domain.
For example, for the "medical" domain category, the reference structured terms may include "treating", "disease", "health" and "treatment", and the reference text paragraphs may include a relatively authoritative passage that describes and defines "medical".
When a term is in text form, its associated content may include associated words of the term, such as synonyms, near-synonyms, aliases and Chinese/English names, and may also include text paragraphs that further explain the term.
For example, the associated content of the term "coronary heart disease" in the medical field may include associated words, such as the alias "coronary atherosclerotic heart disease", the English name "coronary atherosclerotic heart disease", and associated terms such as "visit" and "medical". It may also include an associated text paragraph: "Diagnosis of coronary heart disease depends mainly on typical clinical symptoms, combined with auxiliary examination to find evidence of myocardial ischemia or coronary artery obstruction."
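One way to picture the two kinds of data described here is as a pair of simple records. The class and field names below are hypothetical and chosen only to mirror the wording of the description; the patent does not prescribe any data layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IdentificationDataSet:
    """Reference data that characterizes one domain category."""
    domain: str
    reference_terms: List[str] = field(default_factory=list)
    reference_paragraphs: List[str] = field(default_factory=list)


@dataclass
class AssociatedContent:
    """Content gathered for one term: associated words and explanatory text."""
    term: str
    associated_words: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)


medical = IdentificationDataSet(
    domain="medical",
    reference_terms=["treating", "disease", "health", "treatment"],
)
chd = AssociatedContent(
    term="coronary heart disease",
    associated_words=["coronary atherosclerotic heart disease", "visit", "medical"],
)
```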
In one implementation, if the term, its associated content and the identification data set are all in text form, the matching degree between the term and its domain category may be determined as follows: extract a first text feature from the associated content of the term and a second text feature from the data in the identification data set of the domain category, analyze the degree of association between the first and second text features, and convert that degree of association into the matching degree between the term and the domain category.
Of course, the way of calculating the matching degree between a term and its domain category in the embodiments of the present application is not limited to comparing text features. For example, the matching degree may also be determined simply from word frequency, namely how often the reference structured terms of the identification data set appear in the associated content of the term.
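Under one reading of the word-frequency approach, the matching degree is the share of words in the associated content that are reference structured terms. The sketch below follows that reading; the function name, the restriction to single-word reference terms and the normalization by text length are all assumptions.

```python
import re
from collections import Counter


def frequency_match(associated_text, reference_terms):
    """Fraction of words in associated_text that are reference terms.

    Assumes single-word reference terms; multi-word terms would need
    phrase matching, which is outside this sketch.
    """
    words = re.findall(r"[a-z]+", associated_text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    # A set avoids double counting if a reference term is listed twice.
    hits = sum(counts[term] for term in {t.lower() for t in reference_terms})
    return hits / len(words)
```

A higher value means the reference terms of the domain category occur more densely in the content associated with the term.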
For example, for the term "coronary heart disease" in the medical field category, the associated content of the term includes associated words such as "visit", "treatment", and "medicine", and the associated text paragraph "diagnosis of coronary heart disease mainly depends on typical clinical symptoms, and then is combined with auxiliary examination to find evidence of myocardial ischemia or coronary artery obstruction". The identification data set of the medical field category includes reference structured terms, such as "treatment", "disease", "health", and "treatment", and the reference text paragraph "the medical health system is clearly positioned, Chinese medical health creates a series of magnifications, and all the aspects of medical examination, clinical diagnosis, etc. are achieved".
A first score can be obtained by extracting vector features of the associated words in the associated content of the term and of the reference structured terms in the identification data set of the medical field category, calculating the vector distance between them, and converting that distance into a score. A second score can be obtained in the same way from the vector features of the associated text paragraph of the term and of the reference text paragraph in the identification data set. Different weights can be set for the comparison result between the associated words and the reference structured terms and for the comparison result between the associated text paragraph and the reference text paragraph; the weighted sum of the first score and the second score is then used as the matching degree between the term "coronary heart disease" and the medical field category. In the above example, the associated words of the term are highly similar and closely related to the reference structured terms of the identification data set, and the associated text paragraph of the term is closely related to the reference text paragraph, so both the first score and the second score are relatively high. The matching degree between the term "coronary heart disease" and the medical field category is therefore high, and the data quality of this term in the medical field category is high.
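The weighted combination of the two scores can be illustrated with a minimal sketch. The embodiment does not specify which vector features or which distance are used, so the sketch assumes simple word-count vectors and cosine similarity as the score converted from the vector distance; the function names and the default weights of 0.5 are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_degree(assoc_words, ref_terms, assoc_paragraph, ref_paragraph,
                    word_weight=0.5, paragraph_weight=0.5):
    """Weighted sum of a word-level score and a paragraph-level score."""
    # First score: associated words vs. reference structured terms.
    first_score = cosine_similarity(Counter(assoc_words), Counter(ref_terms))
    # Second score: associated paragraph vs. reference paragraph, using
    # lowercase whitespace tokens as a stand-in for real text features.
    second_score = cosine_similarity(
        Counter(assoc_paragraph.lower().split()),
        Counter(ref_paragraph.lower().split()),
    )
    return word_weight * first_score + paragraph_weight * second_score
```

In practice the word-count vectors would likely be replaced by learned embeddings, and the weights tuned per domain category; neither choice is fixed by the description above.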
In this embodiment of the application, referring to fig. 1, in one implementation, the associated content obtaining module of the term data verification server may use a term as a search entry, search a knowledge website for the corresponding entry page, and use part or all of the content in that entry page as the associated content of the term.
For example, for the term "coronary heart disease" in the medical field, an entry page corresponding to the entry of "coronary heart disease" may be retrieved at an encyclopedic knowledge site, and the entry page may include the above alias, english name, text paragraph for describing the entry, and other associated information.
Referring to fig. 1, in another implementation, the associated content obtaining module of the term data verification server may use a term as a query term to query a search engine for associated web pages. The corresponding query result may include a plurality of web page links associated with the term, ordered in descending order of the degree of association between their content and the term. In the embodiment of the present application, the query result may be filtered, and part or all of the content in the web pages corresponding to the links remaining after filtering is used as the associated content of the term. For example, the web page corresponding to the link with the highest degree of association may be selected and the remaining links filtered out, or the web pages corresponding to several top-ranked links may be selected and the remaining links filtered out.
It should be noted that, the acquisition source of the associated content in the embodiment of the present application is not limited to the knowledge website and the search engine, but may also include other sources such as a word bank and a database, and the embodiment of the present application is not limited to this.
It should be noted that, to ensure high accuracy when solving the matching degree between a term and the domain category to which it belongs, the data volume of the associated content of the term and the data volume of the identification data set of the domain category both need to be large. For example, the number of source web pages used as the associated content of the term needs to be greater than 50, or the amount of text in the entry page used as the associated content of the term needs to be greater than 1,000 characters; the number of reference structured terms in the identification data set needs to be greater than 10,000, and the number of words in the reference text paragraphs needs to be greater than 100,000. Each data volume threshold may be set according to the actual situation, which is not limited in the embodiment of the present application. In addition, the associated text passages and the reference text passages may be sentences or paragraphs.
In the embodiment of the application, the associated content of a term can greatly expand the dimension of the term, so that the definition range of the term is no longer limited to the term word itself but is broadened by associated content of greater breadth. Similarly, the identification data set can greatly expand the dimension of the corresponding domain category. When the matching degree is determined from the associated content of the term and the identification data set of the domain category to which the term belongs, the associated content and the identification data set provide richer semantic information, which helps improve the accuracy of the matching degree calculation and thus the accuracy of quality management of the term data. In addition, the whole quality management strategy can be realized through automatic information mining, information analysis, and information comparison, which greatly reduces manual participation, improves production efficiency, and reduces production cost.
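The data volume requirement can be expressed as a simple pre-check before the matching degree is solved. The sketch below uses the example thresholds given above (50 web pages, 1,000 characters, 10,000 reference structured terms, 100,000 words); the class and function names are hypothetical, and the thresholds are meant to be adjusted to the actual situation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataVolumeThresholds:
    """Example thresholds; every value is configurable."""
    min_source_pages: int = 50
    min_entry_chars: int = 1_000
    min_reference_terms: int = 10_000
    min_reference_words: int = 100_000


def has_enough_data(source_pages, entry_chars, reference_terms,
                    reference_words, thresholds=DataVolumeThresholds()):
    # The associated content is sufficient if either source is large enough.
    associated_ok = (source_pages > thresholds.min_source_pages
                     or entry_chars > thresholds.min_entry_chars)
    # The identification data set must satisfy both of its requirements.
    reference_ok = (reference_terms > thresholds.min_reference_terms
                    and reference_words > thresholds.min_reference_words)
    return associated_ok and reference_ok
```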
Referring to fig. 7, an embodiment of the present application provides a flowchart of steps of a text processing method, including:
Step 101, obtaining a term of at least one domain category.
In the embodiments of the present application, a term belonging to a certain domain category may serve as a specific identifier within that domain. A term may be a conventional symbol that uses text as its carrier to express or define a professional concept; a term may also use other carriers, such as symbols.
Specifically, referring to fig. 1, the source of term data may take several forms. In one implementation, the data source may be an encyclopedic knowledge website: the domain category is used as a search entry to retrieve the entry content corresponding to that domain category, and term mining is performed on the entry content to obtain the terms of the domain category. In another implementation, the data source may be a search engine: web page data associated with the domain category is retrieved through the search engine, and terms are mined from that web page data.
Further, the data source may be a lexicon or a database. The lexicon may include a dictionary lexicon, such as an industrial dictionary or a medical dictionary, or a network lexicon, an input method lexicon, and the like; the database may include a translation software database, a commodity database, and the like. Because lexicons and databases already record attributes of words such as their characteristics and categories, terms of various domain categories can be mined through simple screening.
Step 102, obtaining the associated content of the term.
In the embodiment of the present application, the associated content of a term can greatly expand the dimension of the term, so that the definition range of the term is no longer limited to the term word itself but is broadened by associated content of greater breadth. In the case of a term in text form, the associated content of the term may include associated words of the term, such as synonyms, near-synonyms, aliases, and Chinese/English names, and may also include text passages that further explain the term.
For example, the associated content of the term "coronary heart disease" in the medical field may include associated words, such as the alias "coronary atherosclerotic heart disease", the English name "coronary atherosclerotic heart disease", and related words such as "visit" and "medical", and may also include an associated text paragraph: "diagnosis of coronary heart disease depends mainly on typical clinical symptoms, and is combined with auxiliary examination to find evidence of myocardial ischemia or coronary artery obstruction".
Referring to fig. 1, in an implementation manner, an associated content obtaining module of a term data verification server may use a term as a search entry, perform search on a knowledge website for a corresponding entry page, and use part or all of content in the corresponding entry page as associated content of the term.
In another implementation, the associated content obtaining module of the term data verification server may use a term as a query term to query a search engine for associated web pages; the corresponding query result may include a plurality of web page links associated with the term, ordered in descending order of the relevance between their content and the term.
Step 103, determining the matching degree between the term and the domain class to which the term belongs according to the associated content of the term and the identification data set of the domain class to which the term belongs.
In an embodiment of the present application, in one implementation, if the term, the associated content of the term, and the identification data set are all in text form, the matching degree between the term and the domain category to which it belongs may be determined as follows: a first text feature is extracted from the associated content of the term, a second text feature is extracted from the data in the identification data set of the domain category, and the degree of association between the first text feature and the second text feature is analyzed and then converted into the matching degree between the term and the domain category.
Of course, the way of calculating the matching degree between the term and the domain category to which it belongs in the embodiments of the present application is not limited to the above. For example, the matching degree may also be determined simply by counting the word frequency with which the structured terms in the associated content of the term appear in the identification data set.
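The word-frequency alternative can be sketched in a few lines. This is one possible reading of the description: each associated word of the term is looked up among the reference structured terms of the identification data set, and the occurrence counts are summed. The function name and the sample words are illustrative assumptions.

```python
from collections import Counter


def word_frequency_match(associated_words, reference_terms):
    """Sum how often each associated word occurs among the reference terms."""
    reference_counts = Counter(reference_terms)
    # Counter returns 0 for words that never occur in the reference data.
    return sum(reference_counts[word] for word in associated_words)


# Illustrative data only: "treatment" occurs twice in the reference terms.
score = word_frequency_match(
    ["visit", "treatment", "medicine"],
    ["treatment", "disease", "health", "treatment"],
)
```

A raw count like this grows with the size of the identification data set, so a real system would probably normalize it or compare it against a threshold chosen per domain category.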
In summary, in the embodiments of the present application, the associated content of a term can greatly expand the dimension of the term, so that the definition range of the term is no longer limited to the term word itself but is broadened by associated content of greater breadth. Similarly, the identification data set can greatly expand the dimension of the corresponding domain category. When the matching degree is determined from the associated content of the term and the identification data set of the domain category to which the term belongs, the associated content and the identification data set provide richer semantic information, which helps improve the accuracy of the matching degree calculation and thus the accuracy of quality management of the term data. In addition, the whole quality management strategy can be realized through automatic information mining, information analysis, and information comparison, which greatly reduces manual participation, improves production efficiency, and reduces production cost.
Referring to fig. 8, a flowchart illustrating specific steps of an embodiment of a text processing method of the present application is shown.
Step 201, obtaining a term of at least one domain category.
Specifically, for this step, refer to the description of step 101; details are not repeated here.
Step 202, acquiring the associated webpage data as the associated content according to the terms.
In the embodiment of the application, rich network resources can be relied on, and data associated with a term can be searched from the network resources as its associated content. Web page resources provided by websites are the most commonly used, but the network resources include, without limitation, web page resources, lexicon resources, database resources, and the like.
Optionally, step 202 may include:
Substep 2021, obtaining the web page data from the target knowledge website as the associated content according to the term.
In this embodiment of the application, the associated content may include webpage data acquired from a target knowledge website, the target knowledge website may be an encyclopedic knowledge website, a proprietary domain knowledge website, and the like, and data in the target knowledge website may be established in a format of entry-entry content, so that terms may be used as query entries, corresponding entry content is acquired in the target knowledge website, and webpage data corresponding to the entry content is used as the associated content.
Optionally, sub-step 2021 may comprise:
Substep A1, in at least one target knowledge website, querying with the term as a query entry and obtaining the entry page corresponding to the term.
Referring to fig. 1, the associated content acquiring module of the term data verification server may use a term as a search entry, search the target knowledge website for the corresponding entry page, and use part or all of the content in that entry page as the associated content of the term.
For example, for the term "coronary heart disease" in the medical field, an entry page corresponding to the entry of "coronary heart disease" may be retrieved at an encyclopedic knowledge site, and the entry page may include the above alias, english name, text paragraph for describing the entry, and other associated information.
Substep 2022, obtaining the web page data from the target search engine as the associated content according to the term.
In the embodiment of the present application, the associated content may also include web page data acquired from a target search engine, the target search engine stores a large amount of web page data, and may query the web page links of the associated web pages through corresponding query indexes.
Optionally, sub-step 2022 may comprise:
Substep A2, in at least one target search engine, querying with the term and obtaining a query result including a plurality of web page data.
Referring to fig. 1, an associated content obtaining module of the term data verification server may use a term as a query term to perform a query on an associated web page in a search engine, and a corresponding query result may include a plurality of web page links associated with the term, and according to logic of the search engine, the plurality of web page links may be ordered in an order from a large degree of association between content and the term to a small degree of association between content and the term.
Substep A3, selecting at least one target web page data from the query result as the associated content.
Specifically, the query result may be filtered, and part or all of the content in the web pages corresponding to the links remaining after filtering is used as the associated content of the term. For example, the web page corresponding to the link with the highest relevance may be selected as the target web page data and the remaining links filtered out, or the web pages corresponding to several top-ranked links may be selected as the target web page data and the remaining links filtered out.
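The selection in substep A3 can be sketched as a top-k cut over the ranked query result. The result format (a dictionary with `url` and `relevance` keys) and the function name are hypothetical; a real search engine API would expose its own result structure.

```python
def select_target_pages(query_results, top_k=1):
    """Keep the top_k most relevant results and drop the rest."""
    # Search engines usually return results already ranked; sorting again
    # keeps the function correct even if the input order is not guaranteed.
    ranked = sorted(query_results, key=lambda r: r["relevance"], reverse=True)
    return ranked[:top_k]
```

With `top_k=1` this keeps only the link with the highest relevance; a larger `top_k` keeps several top-ranked links, matching the two options described above.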
Further, the target web page data acquired through the search engine may contain a considerable amount of noise data (in the present application, text irrelevant to the content of the term), so further noise filtering may be performed on the selected target web page data. Specifically, in the embodiment of the present application, some text features of noise data may be predefined to establish a corresponding noise data template; the template is then used to find data in the target web page data whose text features match those of the noise data, and that data is screened out as noise, thereby ensuring the data quality of the associated content.
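A noise data template of this kind could, for instance, be a list of regular expressions describing typical page residue. The patterns below are invented examples, not templates taken from the embodiment, and the line-by-line granularity is likewise an assumption.

```python
import re

# Hypothetical noise templates: navigation residue and copyright footers.
NOISE_TEMPLATES = [
    re.compile(r"\b(log in|sign up|share)\b", re.IGNORECASE),
    re.compile(r"all rights reserved", re.IGNORECASE),
]


def filter_noise(lines, templates=NOISE_TEMPLATES):
    """Drop every line whose text matches at least one noise template."""
    return [
        line for line in lines
        if not any(template.search(line) for template in templates)
    ]
```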
Step 203, parsing the associated content according to a preset structured parsing rule to obtain a structured term to be analyzed and/or a text paragraph to be analyzed in the associated content; the structured terms to be analyzed include terms presented in a page in structured form.
Optionally, step 203 may be implemented by obtaining the structured terms to be analyzed and/or the text paragraphs to be analyzed from the web page data according to the structured parsing template corresponding to the web page data.
In the embodiment of the application, the associated content may include an entry page acquired from the target knowledge website and a web page acquired from the target search engine, both of which are structured page data. The actual requirement of term verification in the embodiment of the present application is to expand the dimension of the term through associated content in the form of terms and text paragraphs. Accordingly, a structured parsing rule for page data is predefined to parse out the structured terms to be analyzed and/or the text paragraphs to be analyzed in the associated content; a structured parsing template is established according to the preset structured parsing rule, and the associated content is parsed through the template to obtain the parsing result. The structured parsing rule is set based on the position of the content to be parsed within the page data structure, or on the code structure of that content within the page data structure, so that the content to be extracted can be parsed out from that position or code structure.
Specifically, for an entry page obtained from a target knowledge site, the page structure of the entry page is relatively fixed, for example, in a HyperText Markup Language (HTML) code of a knowledge site a, a structured term to be analyzed may have a corresponding structured tag:
<dt>XX</dt>
<dd title="XX">
For example, in the knowledge website A, in the entry page corresponding to the term "coronary atherosclerotic heart disease", the alias "coronary heart disease" of the term may be a structured term to be analyzed, which appears in the HTML code as follows:
<dt>Alias</dt>
<dd title="coronary heart disease">
The entry page of a term can be parsed by adding structured tags in this format to the structured parsing template, so that the terms in the entry page that carry structured tags in this format are used as the structured terms to be analyzed.
In addition, in the HTML code of the knowledge website a, the text paragraphs to be analyzed may have corresponding paragraph tags:
<div class="para" label-module="para">XXX</div>
For example, in the knowledge website A, in the entry page corresponding to the term "coronary atherosclerotic heart disease", the following text paragraph of the term may be a text paragraph to be analyzed: "The risk factors for coronary heart disease include modifiable and non-modifiable risk factors. Understanding and intervening with risk factors can help prevent and treat coronary heart disease." It appears in the HTML code as follows:
<div class="para" label-module="para">The risk factors for coronary heart disease include modifiable and non-modifiable risk factors. Understanding and intervening with risk factors can help prevent and treat coronary heart disease.</div>
In the embodiment of the application, the entry page can be parsed by adding paragraph tags in this format to the structured parsing template, so that the paragraphs in the entry page that are enclosed in paragraph tags in this format are used as the text paragraphs to be analyzed.
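A structured parsing template for the tag formats shown above can be sketched with the `HTMLParser` class from the Python standard library. The sketch assumes that structured terms sit in the `title` attribute of `dd` tags and that text paragraphs sit in `div` tags whose `class` is `para`, as in the knowledge website A examples; the class name `EntryPageParser` is hypothetical, and other websites would need their own rules.

```python
from html.parser import HTMLParser


class EntryPageParser(HTMLParser):
    """Collects <dd title="..."> terms and <div class="para"> paragraphs."""

    def __init__(self):
        super().__init__()
        self.structured_terms = []
        self.paragraphs = []
        self._para_depth = 0   # > 0 while inside a "para" div
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "dd" and attrs.get("title"):
            self.structured_terms.append(attrs["title"])
        elif tag == "div":
            if self._para_depth > 0:
                self._para_depth += 1      # nested div inside a paragraph
            elif attrs.get("class") == "para":
                self._para_depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._para_depth > 0:
            self._para_depth -= 1
            if self._para_depth == 0:
                self.paragraphs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._para_depth > 0:
            self._buffer.append(data)


def parse_entry_page(html):
    parser = EntryPageParser()
    parser.feed(html)
    parser.close()
    return parser.structured_terms, parser.paragraphs
```

Calling `parse_entry_page` on an entry page returns the structured terms to be analyzed and the text paragraphs to be analyzed as two lists.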
It should be noted that the structured parsing rule may be defined differently for different knowledge websites. In the embodiment of the present application, the page structure characteristics of each knowledge website serving as a source of associated content may be analyzed in advance to establish a corresponding structured parsing rule, so that each website is processed differentially. Similarly, different web pages acquired through the target search engine are even more likely to require different structured parsing rules, and the same differential processing can be applied to them.
In addition, the above example is established when the associated content is a text, and when the associated content includes data in other forms, corresponding text information may be extracted from the data in other forms through a corresponding analysis means, for example, when the associated content includes a large number of web page pictures and videos, corresponding text may be extracted from the pictures and video data through an Optical Character Recognition (OCR) means.
For example, referring to fig. 9, which shows a flowchart of a matching degree solving process of the present application, for the medical field terms, the logistics field terms, and the automobile field terms shown in fig. 1, the verification module of the term data verification server in fig. 9 may first perform structured parsing on the associated content of the medical terms, the logistics terms, and the automobile terms, so as to obtain parsing results of three fields including structured terms to be analyzed and/or text paragraphs to be analyzed, so as to subsequently perform a process of solving the matching degree between a term and a category of the field according to the parsing results and the identification data sets of the fields.
Step 204, determining the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs.
In the embodiment of the present application, referring to fig. 9, for analysis results and identification data sets of associated contents in the medical field, the logistics field, and the automobile field, in the process of solving the matching degree between a term and a category of the field based on the analysis results and the identification data sets, a method based on text features/machine learning models/structured terms may be adopted to realize the calculation of the matching degree.
Optionally, the identification data set includes: preset reference inter-word N-gram relationship features; step 204 may include:
Substep 2041, in the case that the associated content includes at least one text paragraph to be analyzed, acquiring the inter-word N-gram relationship features to be analyzed from the text paragraph to be analyzed.
In one implementation of calculating the matching degree between a term and a domain category, the text paragraphs to be analyzed in the associated content are matched against the preset reference inter-word N-gram relationship features (n-gram features) in the identification data set, where the preset reference n-gram features can be obtained by analyzing the reference text paragraphs in the identification data set.
Specifically, an n-gram feature is a text feature defined as follows: if a sentence $S$ consists of $m$ words $w_1 w_2 w_3 \dots w_m$, then the set of its n-grams is $\{\, w_i w_{i+1} \dots w_{i+n-1} \mid 1 \le i \le m-n+1 \,\}$. By this definition, an n-gram feature is an association relationship established by $n$ consecutive words in a sentence.
Here n may be an integer greater than or equal to 1. A 1-gram is a language feature containing a single word; for example, for the text paragraph "myocardial angina and myocardial infarction with severe consequences, and how atherosclerosis, angina pectoris, and myocardial infarction are formed and developed", each word forms a 1-gram.
The 2-gram is a language feature obtained after a corresponding relationship is established between 2 consecutive words, and for the text passage, the 2-gram may include: "atherosclerosis", "sclerosis-myocardium", "myocardial-angina", and the like. Similarly, n-gram features such as 3-grams, 4-grams, etc. can also be built in sequence.
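The n-gram definition translates directly into code: for a sentence of m words, the window of n consecutive words is slid from the first position to position m-n+1. The sketch below assumes the sentence has already been segmented into a list of words (word segmentation itself, which Chinese text requires, is outside its scope), and the function name is hypothetical.

```python
def ngrams(words, n):
    """Return all n-grams (as tuples) of n consecutive words, in order."""
    if n < 1:
        raise ValueError("n must be an integer greater than or equal to 1")
    # There are m - n + 1 windows; the range is empty when n > m.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```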
It should be noted that the preset reference inter-word N-gram relationship features included in the identification data set may be obtained by analyzing the reference text paragraphs included in the identification data set. A reference text paragraph may be a widely recognized, high-confidence text paragraph that accurately describes the domain, and may also be called a seed text. For the "medical" domain category, the reference text paragraphs may include text paragraphs that describe and define the "medical" domain authoritatively, such as: "Medical treatment is a Chinese term with two meanings: treatment and the curing of disease", "China has thousands of years of medical history, but this term has appeared only in recent decades as a new term adopted for international alignment; previously, the word treatment was mostly used. However, medical care also includes health care", and "A medical accident is an accident in which a medical institution or its medical staff, in the course of medical activities, violates medical and health administration laws, administrative regulations, departmental rules, or diagnosis and nursing standards and conventions, and negligently causes bodily harm to the patient."
In the embodiment of the application, the n-gram features (including 1-grams, 2-grams, 3-grams, and so on) of all reference text paragraphs of the domain category can be extracted, and a reference inter-word N-gram relationship feature library containing all of the extracted n-gram features is established.
Substep 2042, matching the inter-word N-gram relationship features to be analyzed with the reference inter-word N-gram relationship features, and determining the matching degree between the term and the field category to which the term belongs.
Specifically, in the embodiment of the present application, before the operation of matching the inter-word N-gram feature to be analyzed with the reference inter-word N-gram feature is performed, the number of occurrences of each N-gram feature in the preset reference inter-word N-gram feature library may be counted, and the corresponding relationship between each N-gram feature and the number of occurrences may be established.
For example, taking the text passage from substep 2041 about how myocardial angina, myocardial infarction, and atherosclerosis are formed and developed as a reference text passage, the number of occurrences of each 1-gram, 2-gram, 3-gram, and so on can be counted in the preset reference inter-word N-gram relationship feature library: the 1-gram "atherosclerosis" occurs 1 time and the 1-gram "myocardium" occurs 2 times; the 2-gram "myocardial-angina" occurs 2 times, the 2-gram "atherosclerosis" occurs 1 time, and so on.
After the corresponding relationship between each n-gram feature and the occurrence number is established, the number corresponding to each n-gram feature extracted from the associated text of the term can be inquired from the corresponding relationship based on the n-gram features (including 1-gram, 2-gram and 3-gram …) extracted from the associated text of the term, the inquired number of each n-gram is added, and the addition result can be used as the matching degree between the term and the field category to which the term belongs.
Further, in the case that the sum result is greater than or equal to a predetermined threshold, the term may be considered to belong to the domain class, and the data quality of the term is high.
For example, referring to the above example, if the N-gram features extracted from the associated text of the term include 1-gram "atherosclerosis", 2-gram "myocardial-angina" and 2-gram "atherosclerosis", the three N-gram features are queried in the preset reference inter-word N-gram feature library, and the number of corresponding occurrences is 1, 2 and 1, respectively, i.e., the matching degree between the term and the domain category to which the term belongs may be 4.
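The counting-and-summing procedure of substep 2042 can be sketched as follows. The reference library is a counter of n-gram occurrences built from segmented reference text paragraphs, and the matching degree is the sum of the counts looked up for every n-gram extracted from the associated text. The toy word lists, the upper bound `max_n=3`, and all function names are assumptions made for illustration; the toy data is not the segmentation used in the example above, although it also happens to sum to 4.

```python
from collections import Counter


def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_reference_library(reference_paragraphs, max_n=3):
    """Count every 1-gram .. max_n-gram over all reference paragraphs."""
    library = Counter()
    for words in reference_paragraphs:
        for n in range(1, max_n + 1):
            library.update(ngrams(words, n))
    return library


def matching_degree(associated_words, library, max_n=3):
    """Sum the reference counts of every n-gram found in the associated text."""
    return sum(
        library[gram]
        for n in range(1, max_n + 1)
        for gram in ngrams(associated_words, n)
    )


def belongs_to_domain(degree, threshold):
    return degree >= threshold


library = build_reference_library(
    [["myocardial", "angina", "and", "myocardial", "infarction"]]
)
# "myocardial" counts 2, "angina" counts 1, ("myocardial", "angina") counts 1.
degree = matching_degree(["myocardial", "angina"], library)  # 4
```

Because the sum is not normalized, longer associated texts naturally score higher; the preset threshold would have to take that into account.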
It should be noted that the foregoing describes the matching degree calculation process taking Chinese as an example; the same idea can be applied to other languages or forms of symbols. In addition, to ensure high accuracy when solving the matching degree between a term and the domain category to which it belongs, the data volume of the associated content of the term and the data volume of the identification data set of the domain category need to be large, that is, both may be required to exceed a preset data volume threshold. Each data volume threshold may be set according to the actual situation, which is not limited in the embodiment of the present application.
Optionally, the identification data set includes: a machine learning model; step 204 may include:
Substep 2043, in the case that the associated content includes at least one text paragraph to be analyzed, obtaining input features from the text paragraph to be analyzed.
Substep 2044, inputting the input features into the machine learning model, and determining the degree of match between the term and the domain class to which the term belongs.
In another implementation of calculating the matching degree between a term and a domain category, the matching degree can be calculated by applying a preset machine learning model in the identification data set to the text paragraphs to be analyzed in the associated content, where the preset machine learning model can be obtained by training on the reference text paragraphs and reference structured terms in the identification data set.
The identification data set can include one or more machine learning models; for example, the machine learning models can include a language model, a domain vector space model, a classification model based on a K-means clustering algorithm (k-means) or a K-nearest neighbor algorithm (k-nn), and the like.
The machine learning model can be understood as a mathematical model, that is, a scientific or engineering model constructed by using mathematical logic and mathematical language. A mathematical model is a mathematical structure that expresses, exactly or approximately, the characteristics or quantitative dependencies of an object system in mathematical language; such a structure is a purely relational structure of the system described by means of mathematical symbols. The mathematical model may be one or a set of algebraic, differential, integral or statistical equations, or a combination thereof, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. In addition to mathematical models described by equations, there are also models described by other mathematical tools, such as algebra, geometry, topology and mathematical logic. A mathematical model describes the behavior and characteristics of the system rather than its actual structure. The model may be trained by a machine learning method, a deep learning method, or the like. The machine learning method may include linear regression, decision trees, random forests, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), k-means classification, k-NN classification, and the like; the deep learning method may include a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), a Gated Recurrent Unit (GRU), and the like.
Specifically, the identification data set may comprise a language model (such as an n-gram model), a domain vector space model, and a classification model based on a K-means clustering algorithm (k-means) or a K-nearest neighbor algorithm (k-NN).
In an implementation manner, the machine learning model may be a domain vector space model, which is a classification model. Different texts can be expressed by their respective vector features, and the vector distance between vector features reflects the degree of semantic association between the texts, so the embodiment of the present application may use the domain vector space model to determine the matching degree between a term and the domain category to which the term belongs.
The domain vector space model can be obtained by training based on the reference text paragraphs and the reference structured terms in the identification data set of a domain category. The training target is to take the vector features of the associated content of an input term as input features, to output, through calculation of vector distances, the probability that the term belongs to the domain category, and to use this probability as the matching degree between the term and the domain category to which the term belongs. During training, the domain vector space model can establish a verification set from the vector features of the reference text paragraphs and the reference structured terms; in practical application, the matching degree between a term and the domain category to which the term belongs can be obtained by calculating the vector distance between the input features and the vector features in the verification set.
For example, the embodiment of the present application may extract the vector features of the associated content and of each text paragraph to be analyzed; calculate the vector distance between the associated content and each text paragraph to be analyzed according to these vector features; select, from all the vector distances, the target vector distances that are greater than or equal to a preset distance threshold; determine, according to the target vector distances, a matching value that reflects the matching degree between the term and its corresponding domain category; and determine that the term belongs to the domain category when the matching value is greater than or equal to a preset first matching threshold. A weight can be set for each text paragraph to be analyzed according to the accuracy of its data. When there are multiple target vector distances, a weighted average of the target vector distances can be calculated according to the weights of the corresponding text paragraphs to be analyzed, and the resulting weighted average can be used as the matching value.
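The thresholding and weighted averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the "vector distance" is taken to be a similarity-style score (cosine similarity here) for which a larger value means a closer match, since the text keeps distances greater than or equal to the threshold; the function names, vectors, weights and both thresholds are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_value(
    content_vector: Sequence[float],
    paragraphs: List[Tuple[Sequence[float], float]],
    distance_threshold: float,
) -> Optional[float]:
    """Weighted average of the scores that reach the distance threshold.

    `paragraphs` holds (vector, weight) pairs, one per text paragraph to be
    analyzed. Returns None when no score reaches the threshold.
    """
    selected = []
    for vector, weight in paragraphs:
        score = cosine_similarity(content_vector, vector)
        if score >= distance_threshold:
            selected.append((score, weight))
    total_weight = sum(weight for _, weight in selected)
    if not selected or total_weight == 0:
        return None
    return sum(score * weight for score, weight in selected) / total_weight


# Hypothetical vectors and weights, for illustration only.
content_vector = [1.0, 0.0]
paragraphs = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 1.0)]
value = matching_value(content_vector, paragraphs, distance_threshold=0.5)

# Compare against a hypothetical first matching threshold of 0.8.
belongs = value is not None and value >= 0.8
```

If a true distance metric were used instead, for which smaller means closer, the comparison against the threshold would need to be reversed.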
In addition, the machine learning model may be a neural network model, which can also serve as a classification model. That is, the training target of the neural network model is to take the convolution features of the associated content of an input term as input features, to output, through calculation on the convolution features, the probability that the term belongs to the domain category, and to use this probability as the matching degree between the term and the domain category to which the term belongs.
In one implementation, the machine learning model may be a classification model constructed based on a k-means or k-nn method, or a classification model constructed based on a binary classification method. The classification model can output the class of the term considered by the model or the probability of the term belonging to each class according to the characteristic of the associated content of the term based on the classification method adopted by the classification model, so as to obtain the matching degree between the term and the class of the field to which the term belongs according to the output results.
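A classification model of the k-NN kind mentioned above can be sketched as follows. This is a minimal illustration, assuming that feature vectors for the reference data are already available; the function name, the two-dimensional vectors and the category labels are hypothetical. The fraction of the k nearest neighbors that carry a category label is used as the probability of that category.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def knn_category_probabilities(
    query: Sequence[float],
    labeled: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> Dict[str, float]:
    """Return, for each category among the k nearest neighbors, its vote share."""
    # Sort the labeled reference vectors by Euclidean distance to the query.
    neighbours = sorted(labeled, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return {label: count / len(neighbours) for label, count in votes.items()}


# Hypothetical labeled reference vectors for two domain categories.
labeled = [
    ([0.0, 0.0], "medical"),
    ([0.1, 0.0], "medical"),
    ([1.0, 1.0], "finance"),
    ([0.9, 1.0], "finance"),
]
probabilities = knn_category_probabilities([0.05, 0.1], labeled, k=3)
```

The probability obtained for the domain category to which the term is assigned could then be read as the matching degree between the term and that category.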
Optionally, the identification data set includes: preset reference structured terms; step 204 may include:
Substep 2045, in a case that the associated content includes at least one structured term to be analyzed, matching the structured term to be analyzed with the reference structured terms, and determining the matching degree between the term and the domain category to which the term belongs.
In an embodiment of the present application, the preset reference structured term included in the identification data set may be a reference structured term predefined for the domain category, and the reference structured term may be a proprietary term or a high quality term recognized in the domain category, for example, for the "medical" domain category, the reference structured term may include: terms such as "treating", "disease", "health", "treatment", and the like.
In one implementation, the embodiment of the present application may take the number of reference structured terms contained among all the structured terms to be analyzed of the associated content as a matching value reflecting the matching degree between the term and the domain category to which the term belongs, and, in a case that the matching value is greater than or equal to a second preset number threshold, determine that the term belongs to the domain category.
For example, if the structured terms to be analyzed of the associated content of the term "coronary heart disease" include "treatment", "medical treatment" and "treatment", then a matching value of 2 can be derived against the above preset reference structured terms "treatment", "disease", "health" and "treatment".
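The counting of reference structured terms can be sketched as follows. This is a hedged illustration: the function name, the term lists and the threshold are hypothetical, and the sketch assumes that each distinct structured term is counted once.

```python
from typing import Iterable, Set, Tuple


def structured_term_match(
    terms_to_analyze: Iterable[str],
    reference_terms: Set[str],
    count_threshold: int,
) -> Tuple[int, bool]:
    """Count the distinct analyzed terms found among the reference terms.

    Returns the matching value and whether it reaches the count threshold.
    """
    matched = set(terms_to_analyze) & reference_terms
    value = len(matched)
    return value, value >= count_threshold


# Hypothetical reference structured terms for a "medical" domain category.
reference_terms = {"treatment", "disease", "health", "medical care"}

# Hypothetical structured terms parsed from the associated content of a term.
terms_to_analyze = ["treatment", "medical care", "diagnosis"]

value, belongs = structured_term_match(terms_to_analyze, reference_terms, count_threshold=2)
```

Here two of the three analyzed terms appear among the reference terms, so the matching value is 2 and the hypothetical threshold of 2 is reached.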
Optionally, the method may further include:
Step 205, obtaining sample texts belonging to the domain category.
In the embodiment of the present application, referring to fig. 1, the sample text serving as the data source of the identification data set of a domain category may likewise be obtained from a website, a lexicon, or a database. Because the identification data set serves a function similar to that of a verification set, the requirement on its data quality is high. In one case, high quality of the data in the identification data set may be ensured, so as to meet the accuracy requirement of the embodiment of the present application on the matching degree between a term and the domain category to which the term belongs. In another case, a large data amount in the identification data set may be ensured, so as to meet the same accuracy requirement.
Specifically, for the process of obtaining the sample text, reference may be made to the process of obtaining the associated text in fig. 1 in the above embodiment, for example, obtaining it from a data source such as a lexicon, a database, a knowledge website, or a search engine. In addition, the sample text may also be defined for the domain category by a developer, which is not limited in the embodiment of the present application.
Step 206, constructing an identification data set of the domain category according to the sample text.
In an implementation manner, the process of constructing the identification data set of the domain category from the sample text may refer to the content of step 203, that is, parsing the sample text according to a preset structured parsing rule to obtain a reference structured term and/or a reference text paragraph to be analyzed in the sample text, and directly constructing the reference structured term and the reference text paragraph to be analyzed as the identification data set, or constructing the reference structured term and the text feature (vector feature, n-gram feature) of the reference text paragraph to be analyzed as the identification data set.
In another implementation, a training set may be constructed from the sample text, and a machine learning model as in sub-step 2044 may be constructed as the recognition data set, so as to achieve the purpose of determining the degree of matching between the term and the domain class to which the term belongs through the machine learning model.
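For the first of these implementations, building an identification data set from N-gram text features can be sketched as follows. This is a minimal illustration, assuming that the reference text paragraphs have already been parsed from the sample text and segmented into tokens; the function name and the sample paragraphs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_reference_ngram_library(
    paragraphs: Iterable[List[str]], max_n: int = 2
) -> Counter:
    """Count every 1-gram up to max_n-gram over all reference text paragraphs."""
    library: Counter = Counter()
    for tokens in paragraphs:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                library[tuple(tokens[i:i + n])] += 1
    return library


# Hypothetical, already segmented reference paragraphs of one domain category.
sample_paragraphs = [
    ["coronary", "heart", "disease"],
    ["heart", "disease", "treatment"],
]
library = build_reference_ngram_library(sample_paragraphs)
```

The resulting counts could then serve as the reference inter-word N-gram feature library that is queried when the matching degree is calculated.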
In summary, in the embodiment of the present application, the associated content of a term can greatly expand the dimensions of the term, so that the definition range of the term is no longer limited to the term word itself but is greatly expanded by the broader associated content. Similarly, the identification data set can greatly expand the dimensions of the corresponding domain category. In the process of determining the matching degree by using the associated content of the term and the identification data set of the domain category to which the term belongs, the associated content and the identification data set provide richer semantic information, which helps to improve the accuracy of the matching degree calculation and thus the accuracy of quality management of term data. In addition, the whole quality management strategy can be realized through automatic information mining, information analysis and information comparison, which greatly reduces manual participation, improves production efficiency, and reduces production cost.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of acts, since some steps may be performed in other orders or simultaneously according to the embodiments. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the embodiments of the present application.
On the basis of the above embodiments, the present embodiment further provides a text processing apparatus, which is applied to electronic devices such as a terminal device and a server.
Referring to fig. 10, a structural block diagram of an embodiment of a text processing apparatus of the present application is shown, which may specifically include the following modules:
a first obtaining module 301, configured to obtain terms of at least one domain category;
a second obtaining module 302, configured to obtain associated content of the term;
Optionally, the second obtaining module 302 includes:
an obtaining sub-module, configured to obtain associated web page data as the associated content according to the term.
Optionally, the obtaining sub-module includes:
a third obtaining unit, configured to obtain, according to the term, the web page data from a target knowledge website as the associated content;
Optionally, the third obtaining unit includes:
a first obtaining subunit, configured to obtain, in at least one target knowledge website, a term page corresponding to the term by using the term as a query term.
And/or, a fourth obtaining unit, configured to obtain, according to the term, the web page data from a target search engine as the associated content.
Optionally, the fourth obtaining unit includes:
a second obtaining subunit, configured to perform a query with the term in at least one target search engine and obtain a query result including a plurality of pieces of web page data;
a determining subunit, configured to select at least one piece of target web page data from the query result as the associated content.
A verification module 303, configured to determine a matching degree between the term and the domain category to which the term belongs according to the associated content of the term and the identification data set of the domain category to which the term belongs.
Optionally, the verification module 303 includes:
a parsing sub-module, configured to parse the associated content according to a preset structured parsing rule to obtain structured terms to be analyzed and/or text paragraphs to be analyzed in the associated content; the structured terms to be analyzed include: terms that are presented in a page in a structured form;
Optionally, in a case that the associated content is web page data, the parsing sub-module includes:
a parsing unit, configured to obtain the structured terms to be analyzed and/or the text paragraphs to be analyzed from the web page data according to a structured parsing template corresponding to the web page data.
A verification sub-module, configured to determine the matching degree between the term and the domain category to which the term belongs according to the structured terms to be analyzed and/or the text paragraphs to be analyzed and the identification data set of the domain category to which the term belongs.
Optionally, the identification data set includes: preset reference inter-word N-gram features; the verification sub-module includes:
a first obtaining unit, configured to obtain inter-word N-gram features to be analyzed according to the text paragraph to be analyzed in a case that the associated content includes at least one text paragraph to be analyzed;
a first verification unit, configured to match the inter-word N-gram features to be analyzed with the reference inter-word N-gram features and determine the matching degree between the term and the domain category to which the term belongs.
Optionally, the identification data set includes: a machine learning model; the verification sub-module includes:
a second obtaining unit, configured to obtain input features according to the text paragraph to be analyzed in a case that the associated content includes at least one text paragraph to be analyzed;
a second verification unit, configured to input the input features into the machine learning model and determine the matching degree between the term and the domain category to which the term belongs.
Optionally, the identification data set includes: preset reference structured terms; the verification sub-module includes:
a third verification unit, configured to match the structured term to be analyzed with the reference structured terms in a case that the associated content includes at least one structured term to be analyzed, and determine the matching degree between the term and the domain category to which the term belongs.
Optionally, the apparatus further includes:
a third obtaining module, configured to obtain sample texts belonging to the domain category;
a construction module, configured to construct the identification data set of the domain category according to the sample texts.
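How the three main modules cooperate can be sketched as follows. This is only an illustrative composition, not the structure of the apparatus in fig. 10: the class name, the callables standing in for modules 301 to 303, and the stub data are all hypothetical, and the stubs replace real data sources such as knowledge websites or search engines.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class TextProcessingApparatus:
    # Stand-in for the first obtaining module 301: domain category -> terms.
    get_terms: Callable[[str], List[str]]
    # Stand-in for the second obtaining module 302: term -> associated content.
    get_associated_content: Callable[[str], str]
    # Stand-in for the verification module 303: (content, domain) -> matching degree.
    verify: Callable[[str, str], float]

    def run(self, domain: str) -> Dict[str, float]:
        """Return the matching degree of every term of the given domain category."""
        return {
            term: self.verify(self.get_associated_content(term), domain)
            for term in self.get_terms(domain)
        }


# Hypothetical identification data: reference structured terms per domain category.
reference_terms: Dict[str, Set[str]] = {"medical": {"treatment", "disease"}}

apparatus = TextProcessingApparatus(
    get_terms=lambda domain: ["coronary heart disease"],
    get_associated_content=lambda term: "a disease that needs treatment",
    verify=lambda content, domain: float(
        len(set(content.split()) & reference_terms[domain])
    ),
)
result = apparatus.run("medical")
```

Passing the modules in as callables keeps the sketch independent of any particular data source or matching strategy; each of the matching approaches described in the method embodiment could be supplied as the `verify` callable.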
In summary, in the embodiments of the present application, the associated content of a term can greatly expand the dimensions of the term, so that the definition range of the term is no longer limited to the term word itself but is greatly expanded by the broader associated content. Similarly, the identification data set can greatly expand the dimensions of the corresponding domain category. In the process of determining the matching degree by using the associated content of the term and the identification data set of the domain category to which the term belongs, the associated content and the identification data set provide richer semantic information, which helps to improve the accuracy of the matching degree calculation and thus the accuracy of quality management of term data. In addition, the whole quality management strategy can be realized through automatic information mining, information analysis and information comparison, which greatly reduces manual participation, improves production efficiency, and reduces production cost.
The embodiments of the present application also provide a non-volatile readable storage medium in which one or more modules (programs) are stored; when the one or more modules are applied to a device, they may cause the device to execute the instructions of the method steps in the embodiments of the present application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform a method as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the present disclosure may be implemented as an apparatus, which may include electronic devices such as a terminal device, a server (cluster), etc., using any suitable hardware, firmware, software, or any combination thereof, to perform a desired configuration. Fig. 11 schematically illustrates an example apparatus 700 that may be used to implement various ones of the embodiments described in the present application.
For one embodiment, fig. 11 illustrates an exemplary apparatus 700 having one or more processors 702, a control module (chipset) 704 coupled to at least one of the processor(s) 702, a memory 706 coupled to the control module 704, a non-volatile memory (NVM)/storage 708 coupled to the control module 704, one or more input/output devices 710 coupled to the control module 704, and a network interface 712 coupled to the control module 704.
The processor 702 may include one or more single-core or multi-core processors, and the processor 702 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 700 can be used as a terminal device, a server (cluster), or other devices described in this embodiment.
In some embodiments, the apparatus 700 may include one or more computer-readable media (e.g., the memory 706 or the NVM/storage 708) having instructions 714 and one or more processors 702 in combination with the one or more computer-readable media configured to execute the instructions 714 to implement modules to perform the actions described in this disclosure.
For one embodiment, control module 704 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 702 and/or any suitable device or component in communication with control module 704.
The control module 704 may include a memory controller module to provide an interface to the memory 706. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 706 may be used, for example, to load and store data and/or instructions 714 for the apparatus 700. For one embodiment, memory 706 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 706 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, control module 704 may include one or more input/output controllers to provide an interface to NVM/storage 708 and input/output device(s) 710.
For example, NVM/storage 708 may be used to store data and/or instructions 714. NVM/storage 708 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drive(s) (HDD (s)), one or more Compact Disc (CD) drive(s), and/or one or more Digital Versatile Disc (DVD) drive (s)).
NVM/storage 708 may include storage resources that are physically part of the device on which apparatus 700 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 708 may be accessible over a network via input/output device(s) 710.
Input/output device(s) 710 may provide an interface for apparatus 700 to communicate with any other suitable device; input/output device(s) 710 may include communication components, audio components, sensor components, and so forth. Network interface 712 may provide an interface for apparatus 700 to communicate over one or more networks. Apparatus 700 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 702 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of control module 704. For one embodiment, at least one of the processor(s) 702 may be packaged together with logic for one or more controllers of control module 704 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 702 may be integrated on the same die with logic for one or more controller(s) of control module 704. For one embodiment, at least one of the processor(s) 702 may be integrated on the same die with logic for one or more controllers of control module 704 to form a system on a chip (SoC).
In various embodiments, the apparatus 700 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, apparatus 700 may have more or fewer components and/or different architectures. For example, in some embodiments, apparatus 700 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
The detection device can adopt a main control chip as a processor or a control module, sensor data, position information and the like are stored in a memory or an NVM/storage device, a sensor group can be used as an input/output device, and a communication interface can comprise a network interface.
Since the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true scope of the embodiments of the present application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The text processing method, the text processing apparatus, the electronic device, and the storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific implementation and the application scope. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (24)

1. A method of text processing, the method comprising:
obtaining terms of at least one domain category;
acquiring the associated content of the term;
and determining the matching degree between the term and the domain class to which the term belongs according to the associated content of the term and the identification data set of the domain class to which the term belongs.
2. The method according to claim 1, wherein the determining the matching degree between the term and the domain category to which the term belongs according to the associated content of the term and the identification data set of the domain category to which the term belongs comprises:
parsing the associated content according to a preset structured parsing rule to obtain structured terms to be analyzed and/or text paragraphs to be analyzed in the associated content; the structured terms to be analyzed include: terms that are presented in a page in a structured form;
and determining the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs.
3. The method according to claim 2, wherein, in a case that the associated content is web page data, the parsing the associated content according to a preset structured parsing rule to obtain structured terms to be analyzed and/or text paragraphs to be analyzed in the associated content comprises:
acquiring the structured terms to be analyzed and/or the text paragraphs to be analyzed from the web page data according to a structured parsing template corresponding to the web page data.
4. The method of claim 2, wherein the identification data set comprises: preset reference inter-word N-gram features;
the determining the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs comprises:
in a case that the associated content comprises at least one text paragraph to be analyzed, acquiring inter-word N-gram features to be analyzed according to the text paragraph to be analyzed;
and matching the inter-word N-gram features to be analyzed with the reference inter-word N-gram features, and determining the matching degree between the term and the domain category to which the term belongs.
5. The method of claim 2, wherein identifying the set of data comprises: a machine learning model;
the determining the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs comprises:
in the case that the associated content comprises at least one text paragraph to be analyzed, acquiring input features according to the text paragraph to be analyzed;
and inputting the input features into the machine learning model, and determining the matching degree between the term and the domain category to which the term belongs.
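As an illustrative, non-limiting sketch of the model-based variant, the following Python example extracts bag-of-words input features from a paragraph and passes them to a small logistic scorer whose output is taken as the matching degree. The model type, the feature choice, and the hand-set weights are all assumptions for illustration; the claim only requires some machine learning model, and a real one would be trained on sample text of the domain category.

```python
import math
from collections import Counter


def extract_features(paragraph):
    """Bag-of-words counts over a whitespace-tokenized paragraph."""
    return Counter(paragraph.lower().split())


class TinyLogisticModel:
    """Hypothetical stand-in for a trained domain classifier."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, features):
        # Weighted sum of feature counts, squashed to the range (0, 1).
        z = self.bias + sum(
            self.weights.get(word, 0.0) * count for word, count in features.items()
        )
        return 1.0 / (1.0 + math.exp(-z))


# Hand-set weights, for illustration only.
model = TinyLogisticModel({"dosage": 1.5, "tablet": 1.0, "football": -2.0}, bias=-1.0)
match_degree = model.predict_proba(extract_features("dosage of one tablet"))
```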
6. The method of claim 2, wherein the identification data set comprises: a preset reference structured term;
the determining the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs comprises:
in the case that the associated content comprises at least one structured term to be analyzed, matching the structured term to be analyzed against the reference structured term, and determining the matching degree between the term and the domain category to which the term belongs.
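As an illustrative, non-limiting sketch of structured term matching, the following Python example compares the structured terms extracted from a page with a set of reference structured terms and reports the share that match. Case-insensitive exact matching and the ratio score are assumptions; the function name `structured_term_match_degree` is hypothetical.

```python
def structured_term_match_degree(terms_to_analyze, reference_terms):
    """Fraction of distinct extracted structured terms found in the reference set."""
    extracted = {t.strip().lower() for t in terms_to_analyze if t.strip()}
    reference = {t.strip().lower() for t in reference_terms}
    if not extracted:
        return 0.0
    return len(extracted & reference) / len(extracted)


# Hypothetical reference structured terms for a pharmaceutical domain category.
reference_terms = {"indications", "dosage", "contraindications"}
degree = structured_term_match_degree(
    ["Indications", "Dosage", "Box office"], reference_terms
)
```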
7. The method of any one of claims 1-6, further comprising:
acquiring sample texts belonging to the domain category;
and constructing the identification data set of the domain category according to the sample texts.
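As an illustrative, non-limiting sketch of constructing an identification data set, the following Python example collects the word N-grams that recur across sample texts of a domain category and keeps them as reference features. Treating the identification data set as a set of frequent N-grams, and the `min_count` threshold, are assumptions; the claim covers other forms of identification data set as well. The name `build_reference_ngrams` is hypothetical.

```python
from collections import Counter


def build_reference_ngrams(sample_texts, n=2, min_count=2):
    """Keep word n-grams that occur at least min_count times across the samples."""
    counts = Counter()
    for text in sample_texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram for gram, count in counts.items() if count >= min_count}


# Hypothetical sample texts for a pharmaceutical domain category.
samples = [
    "take one tablet daily",
    "one tablet after meals",
    "store below 25 degrees",
]
reference = build_reference_ngrams(samples)
```

The frequency threshold filters out N-grams that appear only once, which are less likely to be characteristic of the domain category.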
8. The method according to any one of claims 1-6, wherein said obtaining the associated content of the term comprises:
acquiring associated web page data as the associated content according to the term.
9. The method according to claim 8, wherein the obtaining associated web page data as associated content according to the term comprises:
acquiring the web page data from a target knowledge website as the associated content according to the term;
and/or acquiring the web page data from a target search engine as the associated content according to the term.
10. The method of claim 9, wherein the obtaining the web page data from a target knowledge website as the associated content according to the term comprises:
in at least one target knowledge website, using the term as a query term to acquire a term page corresponding to the term.
11. The method of claim 9, wherein the obtaining the web page data from a target search engine as the associated content according to the term comprises:
in at least one target search engine, querying with the term to obtain a query result comprising a plurality of pieces of web page data;
and selecting at least one piece of target web page data from the query result as the associated content.
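As an illustrative, non-limiting sketch of selecting target web page data from a query result, the following Python example keeps the first few results whose text mentions the term. The selection criterion, the result format (a list of dicts with `url` and `text` keys), and the injected `search_fn` callable are assumptions; the claim does not say how target pages are chosen, and no real search engine API is used here.

```python
def select_target_pages(term, search_fn, top_k=2):
    """Query with the term and keep up to top_k results that mention it."""
    results = search_fn(term)
    relevant = [page for page in results if term.lower() in page["text"].lower()]
    return relevant[:top_k]


def fake_search(term):
    """Hypothetical stand-in for a search engine client; returns canned results."""
    return [
        {"url": "https://example.com/a", "text": "Aspirin is a common analgesic."},
        {"url": "https://example.com/b", "text": "Unrelated sports news."},
        {"url": "https://example.com/c", "text": "Dosage guidance for aspirin tablets."},
    ]


pages = select_target_pages("aspirin", fake_search)
```

Passing the search function in as a parameter keeps the selection logic independent of any particular search engine.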
12. A text processing apparatus, characterized in that the apparatus comprises:
a first obtaining module, configured to obtain a term of at least one domain category;
a second obtaining module, configured to obtain associated content of the term;
and a verification module, configured to determine the matching degree between the term and the domain category to which the term belongs according to the associated content of the term and the identification data set of the domain category to which the term belongs.
13. The apparatus of claim 12, wherein the verification module comprises:
a parsing sub-module, configured to parse the associated content according to a preset structured parsing rule to obtain a structured term to be analyzed and/or a text paragraph to be analyzed in the associated content; the structured term to be analyzed comprises: a term that is presented in a page in a structured form;
and a verification sub-module, configured to determine the matching degree between the term and the domain category to which the term belongs according to the structured term to be analyzed and/or the text paragraph to be analyzed and the identification data set of the domain category to which the term belongs.
14. The apparatus of claim 13, wherein, in the case that the associated content is web page data, the parsing sub-module comprises:
a parsing unit, configured to acquire the structured term to be analyzed and/or the text paragraph to be analyzed from the web page data according to a structured parsing template corresponding to the web page data.
15. The apparatus of claim 13, wherein the identification data set comprises: preset N-element relation characteristics between reference words; the verification sub-module comprises:
a first obtaining unit, configured to obtain N-element relation characteristics between words to be analyzed according to the text paragraph to be analyzed, in the case that the associated content comprises at least one text paragraph to be analyzed;
and a first verification unit, configured to match the N-element relation characteristics between the words to be analyzed against the N-element relation characteristics between the reference words, and to determine the matching degree between the term and the domain category to which the term belongs.
16. The apparatus of claim 13, wherein the identification data set comprises: a machine learning model; the verification sub-module comprises:
a second obtaining unit, configured to obtain input features according to the text paragraph to be analyzed, in the case that the associated content comprises at least one text paragraph to be analyzed;
and a second verification unit, configured to input the input features into the machine learning model and to determine the matching degree between the term and the domain category to which the term belongs.
17. The apparatus of claim 13, wherein the identification data set comprises: a preset reference structured term; the verification sub-module comprises:
a third verification unit, configured to match the structured term to be analyzed against the reference structured term, in the case that the associated content comprises at least one structured term to be analyzed, and to determine the matching degree between the term and the domain category to which the term belongs.
18. The apparatus of any one of claims 12 to 17, further comprising:
a third obtaining module, configured to obtain sample texts belonging to the domain category;
and a construction module, configured to construct the identification data set of the domain category according to the sample texts.
19. The apparatus according to any one of claims 12 to 17, wherein the second obtaining module comprises:
an acquisition submodule, configured to acquire associated web page data as the associated content according to the term.
20. The apparatus of claim 19, wherein the acquisition submodule comprises:
a third obtaining unit, configured to obtain, according to the term, the web page data from a target knowledge site as the associated content;
and/or, a fourth obtaining unit, configured to obtain, according to the term, the web page data from a target search engine as the associated content.
21. The apparatus of claim 20, wherein the third obtaining unit comprises:
a first acquisition subunit, configured to acquire, in at least one target knowledge website, a term page corresponding to the term by using the term as a query term.
22. The apparatus of claim 20, wherein the fourth obtaining unit comprises:
a second acquisition subunit, configured to query with the term in at least one target search engine and to obtain a query result comprising a plurality of pieces of web page data;
and a determining subunit, configured to select at least one piece of target web page data from the query result as the associated content.
23. An electronic device, comprising: a processor; and
memory having stored thereon executable code which, when executed, causes the processor to perform the method of one or more of claims 1-11.
24. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method of one or more of claims 1-11.
CN202011589620.5A 2020-12-28 2020-12-28 Text processing method and device Pending CN114692620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011589620.5A CN114692620A (en) 2020-12-28 2020-12-28 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011589620.5A CN114692620A (en) 2020-12-28 2020-12-28 Text processing method and device

Publications (1)

Publication Number Publication Date
CN114692620A true CN114692620A (en) 2022-07-01

Family

ID=82132612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011589620.5A Pending CN114692620A (en) 2020-12-28 2020-12-28 Text processing method and device

Country Status (1)

Country Link
CN (1) CN114692620A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115080751B (en) * 2022-08-16 2022-11-11 之江实验室 Medical standard term management system and method based on general model
CN116562271A (en) * 2023-07-10 2023-08-08 之江实验室 Quality control method and device for electronic medical record, storage medium and electronic equipment
CN116562271B (en) * 2023-07-10 2023-10-10 之江实验室 Quality control method and device for electronic medical record, storage medium and electronic equipment
CN117809827A (en) * 2024-03-01 2024-04-02 吉林大学 Nursing information management system based on Internet of things

Similar Documents

Publication Publication Date Title
US11720572B2 (en) Method and system for content recommendation
Alzubi et al. COBERT: COVID-19 question answering system using BERT
Hegazi et al. Preprocessing Arabic text on social media
US10515125B1 (en) Structured text segment indexing techniques
US8560300B2 (en) Error correction using fact repositories
KR20180048624A (en) A training device of the Q&A system and a computer program for it
CN114692620A (en) Text processing method and device
KR102155768B1 (en) Method for providing question and answer data set recommendation service using adpative learning from evoloving data stream for shopping mall
US20190188271A1 (en) Supporting evidence retrieval for complex answers
US20220405484A1 (en) Methods for Reinforcement Document Transformer for Multimodal Conversations and Devices Thereof
US10706045B1 (en) Natural language querying of a data lake using contextualized knowledge bases
US11194798B2 (en) Automatic transformation of complex tables in documents into computer understandable structured format with mapped dependencies and providing schema-less query support for searching table data
US20230030086A1 (en) System and method for generating ontologies and retrieving information using the same
US11580100B2 (en) Systems and methods for advanced query generation
CN117094334A (en) Data processing method, device and equipment based on large language model
US11227183B1 (en) Section segmentation based information retrieval with entity expansion
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
US20220180215A1 (en) System and computer network for knowledge search and analysis
Dhole et al. NLP based retrieval of medical information for diagnosis of human diseases
Francis et al. SmarTxT: A Natural Language Processing Approach for Efficient Vehicle Defect Investigation
Fan et al. Few-shot named entity recognition framework for forestry science metadata extraction
Hao et al. QSem: A novel question representation framework for question matching over accumulated question–answer data
US20230109411A1 (en) Computer-implemented method of searching large-volume un-structured data with feedback loop and data processing device or system for the same
US12032565B2 (en) Systems and methods for advanced query generation
DeVille et al. Text as Data: Computational Methods of Understanding Written Expression Using SAS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination