WO2019080428A1 - Method for obtaining target document and application server - Google Patents

Method for obtaining target document and application server

Info

Publication number
WO2019080428A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
synonym
search keyword
keyword
information
Prior art date
Application number
PCT/CN2018/077627
Other languages
French (fr)
Chinese (zh)
Inventor
阮晓雯
周瑜
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • the present application relates to the field of data analysis technologies, and in particular to a target document acquisition method and an application server.
  • the present application proposes a target document acquisition method and an application server to solve the problem.
  • the present application provides a target document acquisition method, the method comprising the steps of: acquiring at least one document and document information corresponding to the document, and preprocessing the document information; acquiring a search keyword; establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table; inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword; calculating, according to a preset keyword frequency and density algorithm, a word frequency and density score of the search keyword in the documents output by the document selection model, and ranking the documents by relevance according to the word frequency and density score; and outputting, according to a preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
  • the present application further provides an application server, including a memory and a processor, where the memory stores a target document acquisition system executable on the processor, and the steps of the target document acquisition method described above are implemented when the target document acquisition system is executed by the processor.
  • the present application further provides a computer readable storage medium storing a target document acquisition system, the target document acquisition system being executable by at least one processor, such that the at least one processor performs the steps of the target document acquisition method described above.
  • the target document acquisition method, the application server, and the computer readable storage medium proposed by the present application first obtain a search keyword; second, establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table; third, input the preprocessed document information into the document selection model, which processes the document information according to the search keyword; then calculate, according to a preset keyword frequency and density algorithm, the word frequency and density score of the search keyword in the documents output by the document selection model, and rank the documents by relevance according to the word frequency and density score; and finally output, according to a preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
  • the target document acquisition method, the application server, and the computer readable storage medium proposed by the present application can obtain target documents on the network quickly and accurately and can be applied in different regions, thereby improving efficiency and reducing cost.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the application server of the present application.
  • FIG. 2 is a schematic diagram of the program modules of an embodiment of the target document acquisition system of the present application.
  • FIG. 3 is a schematic flowchart of a first embodiment of the target document acquisition method of the present application.
  • FIG. 4 is a schematic flowchart of a second embodiment of the target document acquisition method of the present application.
  • FIG. 5 is a schematic flowchart of a third embodiment of the target document acquisition method of the present application.
  • FIG. 6 is a schematic flowchart of a fourth embodiment of the target document acquisition method of the present application.
  • FIG. 7 is a schematic flowchart of a fifth embodiment of the target document acquisition method of the present application.
  • the application server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus. Note that FIG. 1 shows only the application server 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the application server 1 may be a computing device such as a rack server, a blade server, or a tower server.
  • the application server 1 may be an independent server or a server cluster composed of multiple servers.
  • the memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the application server 1, such as a hard disk or memory of the application server 1.
  • the memory 11 may also be an external storage device of the application server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the application server 1.
  • the memory 11 can also include both the internal storage unit of the application server 1 and its external storage device.
  • the memory 11 is generally used to store the operating system and the various types of application software installed on the application server 1, such as the program code of the target document acquisition system 200. The memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the application server 1.
  • the processor 12 is configured to run program code or process data stored in the memory 11, such as running the target document acquisition system 200 and the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the application server 1 and other electronic devices.
  • the present application proposes a target document acquisition system 200.
  • FIG. 2 is a program module diagram of the first embodiment of the target document acquisition system 200 of the present application.
  • the target document acquisition system 200 includes a series of computer program instructions stored in the memory 11; when the computer program instructions are executed by the processor 12, the target document acquisition operations of the embodiments of the present application can be implemented.
  • the target document acquisition system 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the target document acquisition system 200 is divided into an acquisition module 21, an establishment module 22, a call module 23, a processing module 24, a comparison module 25, and an output module 26, where:
  • the acquisition module 21 is configured to acquire at least one document and document information corresponding to the document, and to preprocess the document information.
  • the documents may be various documents from insurance institutions and medical institutions; the insurance institutions and medical institutions have databases that store medical insurance reimbursement documents, drug lists, and the like; the medical institutions include hospitals, clinics, and so on established in different places.
  • the preprocessing includes the steps of: segmenting the document to obtain at least one word; performing part-of-speech analysis on the word to obtain first information of the word; and taking as candidate words the words whose part of speech is a preset part of speech or whose first information is preset first information.
  • the parts of speech include: nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, conjunctions, auxiliary words, etc.
  • the first information includes: person name, institution name, place name, time, date, percentage, etc.
  • the name of a pharmaceutical compound is often an important candidate word, and the part of speech of a pharmaceutical compound name is usually a noun, so the preset part of speech can be noun.
  • the above processing steps can be implemented with the following tools.
  • when the document is a Chinese document, ICTCLAS (Institute of Computing Technology Chinese Lexical Analysis System), the HIT-IRLAS lexical analyzer of Harbin Institute of Technology, and similar analyzers can be used.
  • when the target document is an English document, the Stanford Parser (also known as the Stanford lexical analyzer) can be used.
  • the candidate words may also be subjected to shallow syntactic analysis or chunk analysis to form chunk structure information, and the chunk structure information is then also used as candidate words; for example, the chunk structure information may be a non-recursive noun phrase, a verb phrase, etc.
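The candidate-word selection described above can be illustrated with a minimal Python sketch. It assumes segmentation and tagging have already been done by some external analyzer; the `Token` type, the tag strings, and the sample tokens are hypothetical and are not part of the application.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str                # part of speech, e.g. "noun" (tag names are assumed)
    entity: Optional[str]   # "first information", e.g. "place name", or None


def select_candidates(tokens, preset_pos=frozenset({"noun"}),
                      preset_entities=frozenset()):
    """Keep words whose part of speech or first information is in a preset set."""
    return [t.word for t in tokens
            if t.pos in preset_pos or t.entity in preset_entities]


# Hypothetical output of a segmenter plus part-of-speech tagger.
tokens = [
    Token("patient", "noun", None),
    Token("received", "verb", None),
    Token("glucose", "noun", None),
    Token("Shenzhen", "noun", "place name"),
    Token("twice", "adverb", None),
]
candidates = select_candidates(tokens)
```

With the default preset part of speech (noun), only the three noun tokens are kept as candidate words.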
  • the acquisition module 21 is further configured to acquire a search keyword.
  • there may be one or more search keywords; for example, when the method is used to extract recurrent hypoglycemia documents, the search keyword may be defined as “glucose”. Users can set different keywords according to different needs.
  • the establishment module 22 is configured to establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table.
  • word segmentation is a fuzzy process; establishing the document selection model allows the required documents to be selected accurately from the candidate words obtained by preprocessing.
  • the character deletion table includes the characters among the candidate words that are inconsistent with the search keyword. In some cases the target document includes many sentences, symbols, words, and the like, and the words obtained by preprocessing and segmenting the target document may include characters and words that do not meet the requirements; the character deletion table is created to handle the results of erroneous or inappropriate word segmentation.
  • the synonym and near-synonym table includes synonyms and near-synonyms of the search keyword, and may also include the corresponding vocabulary in other languages.
  • synonyms are a group of words that have similar meanings or are related to each other. The same word can have multiple synonyms in the same language.
  • for a keyword to be searched, different people in different places may write it in different ways.
  • for example, two different words for “computer” are synonyms. In different fields, even the same word can have different meanings, so selecting the technical field is also important for correct retrieval.
  • the specification parameter table includes multiple parameters corresponding to the search keyword.
  • for example, the specification parameters of glucose include the amount and the frequency of use.
  • when the search keyword has defined specification parameters, the candidate words can be located accurately using the specification parameter table.
  • establishing the document selection model includes the following steps: analyzing the search keyword to determine its technical field; within that technical field, setting up the character deletion table according to the analysis result; within that technical field, obtaining synonyms and near-synonyms of the keyword from a database and establishing the synonym and near-synonym table; within that technical field, analyzing the keyword and selecting its specification parameters to establish the specification parameter table; and dynamically updating the character deletion table, the synonym and near-synonym table, and the specification parameter table.
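One way to picture the three tables and their dynamic update is the following Python sketch. The class name, field names, and sample entries are illustrative assumptions; the application does not specify how the tables are stored.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentSelectionModel:
    """Hypothetical container for the three tables of one technical field."""
    technical_field: str
    deletion_table: set = field(default_factory=set)   # characters to delete
    synonym_table: dict = field(default_factory=dict)  # keyword -> set of synonyms
    spec_table: dict = field(default_factory=dict)     # keyword -> list of parameter names

    def update(self, keyword, deletions=(), synonyms=(), spec_params=()):
        """Dynamically extend the tables with newly obtained information."""
        self.deletion_table.update(deletions)
        self.synonym_table.setdefault(keyword, set()).update(synonyms)
        params = self.spec_table.setdefault(keyword, [])
        for name in spec_params:
            if name not in params:   # keep parameter order, avoid duplicates
                params.append(name)


model = DocumentSelectionModel(technical_field="medical")
model.update("glucose",
             deletions={"*", "#"},
             synonyms={"dextrose", "葡萄糖"},   # includes a foreign-language entry
             spec_params=["amount", "frequency of use"])
model.update("glucose", synonyms={"D-glucose"})  # a later dynamic update
```

Keeping the tables per technical field reflects the point that the same word can mean different things in different fields.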
  • the call module 23 is configured to input the preprocessed document information into the document selection model, and the document selection model processes the document information.
  • the preprocessed document information is input to the document selection model, and the document selection model processes the document information according to preset conditions.
  • the processing includes: calling the character deletion table to delete from the document information the characters and words that are wrong, redundant, or obviously unrelated to the search keyword; calling the synonym and near-synonym table to substitute for the search keyword, searching with the substituted keywords, and saving the document information that matches the search keyword or its synonyms and near-synonyms; and calling the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and saving the document information that matches the data in the specification parameter table.
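The three processing steps can be sketched in Python as below. This is a simplified reading, not the application's implementation: matching is plain case-insensitive substring search, and the sketch assumes a document is kept only if it passes both the synonym step and the specification parameter step. The function name and sample strings are hypothetical.

```python
def process_document(text, keyword, deletion_chars, synonyms, spec_params):
    """Return the cleaned text if the document is kept, otherwise None."""
    # Step 1: delete characters listed in the character deletion table.
    cleaned = "".join(ch for ch in text if ch not in deletion_chars)
    lowered = cleaned.lower()

    # Step 2: search for the keyword and each synonym or near-synonym.
    terms = {keyword, *synonyms}
    if not any(term.lower() in lowered for term in terms):
        return None

    # Step 3: require a match against the specification parameter table
    # (assumption: at least one parameter name must appear in the text).
    if spec_params and not any(p.lower() in lowered for p in spec_params):
        return None
    return cleaned


kept = process_document("Dextrose## 50 ml, frequency: twice daily",
                        keyword="glucose",
                        deletion_chars={"#", "*"},
                        synonyms={"dextrose"},
                        spec_params=["amount", "frequency"])
dropped = process_document("Aspirin 100 mg", "glucose", {"#", "*"},
                           {"dextrose"}, ["amount", "frequency"])
```

The first record is kept because it mentions a synonym of the keyword and one specification parameter; the second is dropped at the synonym step.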
  • the following step may further be included: establishing a bracket recognition model that identifies the different ways in which brackets are used, so as to obtain accurate classification data.
  • the bracket recognition model can identify the different functions of brackets, including a unilateral relationship, a parallel relationship, and an inclusion relationship. In the unilateral relationship, the bracket serves as a separator that segments the document information in the target document; in the parallel relationship, the brackets in the document information give aliases of some words; in the inclusion relationship, the brackets contain specific parameter information for some nouns in the document information.
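The three bracket functions can be illustrated with a rough rule-based Python sketch. The rules are assumptions chosen only for illustration (the application does not say how the bracket recognition model decides): an unpaired bracket is treated as a separator, a bracket pair containing digits as parameter information, and any other pair as an alias.

```python
import re

PAIR = re.compile(r"\(([^()]*)\)")  # a non-nested pair of parentheses


def classify_brackets(text):
    """Return (content, kind) tuples; kind is 'parallel', 'inclusion' or 'unilateral'."""
    results = []
    for match in PAIR.finditer(text):
        inner = match.group(1).strip()
        # Assumed rule: digits suggest parameter information, otherwise an alias.
        kind = "inclusion" if any(ch.isdigit() for ch in inner) else "parallel"
        results.append((inner, kind))
    # Brackets left over after removing the pairs act as separators, e.g. "1)".
    remainder = PAIR.sub("", text)
    unpaired = sum(remainder.count(ch) for ch in "()")
    results.extend(("", "unilateral") for _ in range(unpaired))
    return results


labels = classify_brackets("1) glucose (dextrose) injection (50 ml)")
```

Here `(dextrose)` is labelled as an alias, `(50 ml)` as parameter information, and the `)` in `1)` as a separator.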
  • the processing module 24 is configured to calculate, according to a preset keyword frequency and density algorithm, a word frequency and density score of the keyword in the documents output by the document selection model, and to rank the documents by relevance according to the word frequency and density score.
  • the keyword frequency and density score M is:
  • M = log(total number of documents / (number of documents containing the keyword + 1)) * exp(count(keyword), S), where count(keyword) is the number of times the query word is hit in the search result, log(total number of documents / (number of documents containing the keyword + 1)) is the importance of the keyword in the query results, and S is a preset parameter.
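A small Python sketch of the score follows. The text does not define the two-argument `exp(count(keyword), S)`; the sketch assumes it means count(keyword) raised to the power S, which is only one possible reading. The function and argument names are hypothetical.

```python
import math


def keyword_score(total_docs, docs_with_keyword, hit_count, s=1.0):
    """Score M = importance * hit_count ** s (the power form is an assumption)."""
    # Importance term: log(total documents / (documents containing keyword + 1)).
    importance = math.log(total_docs / (docs_with_keyword + 1))
    # Frequency term: assumed reading of exp(count(keyword), S).
    return importance * hit_count ** s


m = keyword_score(total_docs=100, docs_with_keyword=9, hit_count=3, s=1.0)
```

Under this reading, a keyword that appears in almost every document gets an importance near zero, so its score stays small however often it is hit.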
  • the comparison module 25 is configured to compare the relevance of each document with a preset relevance threshold.
  • the output module 26 is configured to output, according to the preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
  • when the target document acquisition method is applied to the medical field, for example to acquire recurrent hypoglycemia documents, the following steps may further be included: analyzing the filtered recurrent hypoglycemia document to obtain the patient's identity information; obtaining the patient's historical medical data from the database according to the identity information; obtaining data on the patient's glucose use, disease detection, and treatment mode from the historical medical data; and obtaining all recurrent hypoglycemia documents of the patient according to the above data.
  • the present application also proposes a target document acquisition method.
  • FIG. 3 is a schematic flowchart of the first embodiment of the target document acquisition method of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
  • Step S110: acquire at least one document and document information corresponding to the document, and preprocess the document information.
  • Step S120: acquire a search keyword.
  • Step S130: establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table.
  • word segmentation is a fuzzy process; establishing the document selection model allows the required documents to be selected accurately from the candidate words obtained by preprocessing.
  • a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table can obtain the desired documents quickly and accurately.
  • Step S140: input the preprocessed document information into the document selection model, and the document selection model processes the document information according to the search keyword.
  • the document selection model is invoked to process the document information, so that matching documents can be obtained quickly.
  • the processing further includes: establishing a bracket recognition model that identifies the different ways in which brackets are used, so as to obtain accurate classification data.
  • the content in brackets is diverse, including an explanation of the preceding word, a quantitative description, synonyms, foreign words, etc. A bracket may also serve merely as a sentence separator. Establishing the bracket recognition model for these different situations helps to obtain better results.
  • Step S150: calculate, according to a preset keyword frequency and density algorithm, a word frequency and density score of the search keyword in the documents output by the document selection model, and rank the documents by relevance according to the word frequency and density score.
  • the keyword frequency and density score M is:
  • M = log(total number of documents / (number of documents containing the keyword + 1)) * exp(count(keyword), S), where count(keyword) is the number of times the query word is hit in the search result, log(total number of documents / (number of documents containing the keyword + 1)) is the importance of the keyword in the query results, and S is a preset parameter.
  • Step S160: output, according to the preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
  • setting a relevance threshold makes it easier to obtain the required documents accurately; the user can also fine-tune the relevance threshold after reviewing the results, and this feedback improves the retrieval method.
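Ranking and thresholding can be sketched in a few lines of Python. The document identifiers and scores are made-up values, and the sketch assumes the relevance of a document is simply its score.

```python
def select_target_documents(scores, threshold):
    """Rank documents by score (highest first) and keep those above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Strictly greater than the threshold, as in step S160.
    return [doc_id for doc_id, score in ranked if score > threshold]


# Hypothetical relevance scores keyed by document identifier.
scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
targets = select_target_documents(scores, threshold=0.4)
```

Raising or lowering `threshold` after reviewing the output corresponds to the fine-tuning by the user mentioned in the text.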
  • step S110, “acquiring at least one document and document information corresponding to the document, and preprocessing the document information”, specifically includes the following steps:
  • Step S210: segment the document to obtain at least one word.
  • Step S220: perform part-of-speech analysis on the words to obtain first information of the words.
  • the parts of speech include: nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, conjunctions, auxiliary words, etc.
  • the first information includes: person name, institution name, place name, time, date, percentage, etc.
  • the name of a pharmaceutical compound is often an important candidate word, and the part of speech of a pharmaceutical compound name is usually a noun, so the preset part of speech can be noun.
  • Step S230: take as candidate words the words whose part of speech is a preset part of speech or whose first information is preset first information.
  • step S130, “establishing a document selection model based on the character deletion table, the synonym and near-synonym table, and the specification parameter table”, includes the following steps:
  • Step S310: analyze the search keyword to determine the technical field of the search keyword.
  • a search keyword often carries a meaning specific to its field, from which the technical field of the search keyword can be determined. For example, if the search keyword is “binary tree”, the technical field can be narrowed down to computer algorithms and the like.
  • Step S320: within the technical field, set up the character deletion table according to the analysis result.
  • the character deletion table includes the characters among the candidate words that are inconsistent with the search keyword. In some cases the target document includes many sentences, symbols, words, and the like, and the words obtained by preprocessing and segmenting the target document may include characters and words that do not meet the requirements; the character deletion table is created to handle the results of erroneous or inappropriate word segmentation.
  • Step S330: within the technical field, obtain synonyms and near-synonyms of the keyword from a database and establish the synonym and near-synonym table.
  • the synonym and near-synonym table includes synonyms and near-synonyms of the search keyword, and may also include the corresponding vocabulary in other languages.
  • synonyms are a group of words that have similar meanings or are related to each other. The same word can have multiple synonyms in the same language.
  • for a keyword to be searched, different people in different places may write it in different ways.
  • for example, two different words for “computer” are synonyms. In different fields, even the same word can have different meanings, so selecting the technical field is also important for correct retrieval.
  • Step S340: within the technical field, analyze the keyword and select its specification parameters to establish the specification parameter table.
  • the specification parameter table includes multiple parameters corresponding to the search keyword.
  • for example, the specification parameters of glucose include the amount and the frequency of use.
  • when the search keyword has defined specification parameters, the candidate words can be located accurately using the specification parameter table.
  • Step S350: dynamically update the character deletion table, the synonym and near-synonym table, and the specification parameter table.
  • the character deletion table, the synonym and near-synonym table, and the specification parameter table may be dynamically updated according to newly obtained information, which makes the three tables more complete and the document selection model based on them more accurate.
  • FIG. 6 is a schematic flowchart of a fourth embodiment of the target document acquisition method of the present application.
  • the step of “inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword” specifically includes the following steps:
  • Step S410: call the character deletion table to delete from the document information the characters and words that are wrong, redundant, or obviously unrelated to the search keyword.
  • Step S420: call the synonym and near-synonym table to substitute for the search keyword, search with the substituted keywords, and save the document information that matches the search keyword or its synonyms and near-synonyms.
  • Step S430: call the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and save the document information that matches the data in the specification parameter table.
  • FIG. 7 is a schematic flowchart of a fifth embodiment of the target document acquisition method of the present application.
  • after the step of “outputting, according to a preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold”, the following steps may further be included:
  • Step S610: analyze the filtered target document to obtain the identity information of the patient.
  • the selected target document may be a recurrent hypoglycemia document.
  • Step S620: obtain the patient's historical medical data from the database according to the identity information of the patient.
  • Step S630: obtain data such as the patient's glucose use, disease detection, and treatment mode from the historical medical data.
  • Step S640: obtain all recurrent hypoglycemia documents of the patient according to the above data.
  • the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

Abstract

Disclosed in the present application is a method for obtaining a target document. The method comprises: obtaining search keywords; establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table; inputting preprocessed document information to the document selection model, so that the document selection model processes the document information according to the search keywords; calculating, according to a preset keyword frequency and density algorithm, word frequencies and density scores of the search keywords in documents output by the document selection model, and performing relevance ranking on the documents according to the word frequencies and the density scores; and outputting, according to a preset relevance threshold, a target document with relevance greater than the preset relevance threshold in the documents. The present application also provides an application server and a computer readable storage medium. By means of the method for obtaining a target document, the application server, and the computer readable storage medium provided by the present application, a target document can be quickly obtained.

Description

Target document acquisition method and application server
The present application claims priority to Chinese Patent Application No. 201710994507.7, filed with the Chinese Patent Office on October 23, 2017 and entitled "Target document acquisition method and application server", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data analysis technologies, and in particular to a target document acquisition method and an application server.
Background
With the advent of the information age, people store large amounts of information on large-capacity storage devices, use database management systems to integrate and manage it, and obtain the information they need by querying databases. At present, retrieval based on keyword matching runs into many problems because of lexical ambiguity and inconsistent query conditions and forms of expression. For example, under medical insurance policy, the logic for restricted insulin use falls into two kinds, one of which is restriction to recurrent hypoglycemia. Expressed as a data feature, this means two or more glucose usage records, so natural language processing is needed to capture the "glucose" field information in drug records. However, different cities use different writing formats and conventions when entering data, so the data is often hard to parse correctly. Capturing "glucose" by natural language processing directly on the raw data gives unsatisfactory results that may even deviate from the true results. If special handling is implemented for one region, it has to be redone when migrating to another region, which adds considerable time cost.
Therefore, in view of the above problems, there is an urgent need for a new retrieval method that obtains accurate retrieval results, adapts to the conditions of different regions, and reduces cost.
Summary of the Invention
In view of this, the present application proposes a target document acquisition method and an application server to solve the problems described above.
First, to achieve the above object, the present application proposes a target document acquisition method comprising the steps of: acquiring at least one document and document information corresponding to the document, and preprocessing the document information; acquiring a search keyword; establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table; inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword; calculating, according to a preset keyword frequency and density algorithm, a word frequency and density score of the search keyword in the documents output by the document selection model, and ranking the documents by relevance according to the word frequency and density score; and outputting, according to a preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
In addition, to achieve the above object, the present application further provides an application server comprising a memory and a processor, the memory storing a target document acquisition system executable on the processor, wherein the target document acquisition system, when executed by the processor, implements the steps of the target document acquisition method described above.
Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a target document acquisition system, the target document acquisition system being executable by at least one processor to cause the at least one processor to perform the steps of the target document acquisition method described above.
Compared with the prior art, the target document acquisition method, application server, and computer-readable storage medium proposed by the present application first acquire a search keyword; next establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table; then input the preprocessed document information into the document selection model, which processes the document information according to the search keyword; then calculate, according to a preset keyword frequency and density algorithm, the word frequency and density score of the search keyword in the documents output by the document selection model, and rank the documents by relevance according to that score; and finally output, according to a preset relevance threshold, the target documents whose relevance is greater than the threshold. With the method, application server, and computer-readable storage medium proposed by the present application, target documents on the network can be obtained quickly and accurately, and the approach can be applied to different regions, which greatly improves efficiency and reduces cost.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the application server of the present application;
FIG. 2 is a schematic diagram of the program modules of an embodiment of the target document acquisition system of the present application;
FIG. 3 is a schematic flowchart of the first embodiment of the target document acquisition method of the present application;
FIG. 4 is a schematic flowchart of the second embodiment of the target document acquisition method of the present application;
FIG. 5 is a schematic flowchart of the third embodiment of the target document acquisition method of the present application;
FIG. 6 is a schematic flowchart of the fourth embodiment of the target document acquisition method of the present application;
FIG. 7 is a schematic flowchart of the fifth embodiment of the target document acquisition method of the present application.
The realization of the objects, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, that combination should be regarded as nonexistent and outside the scope of protection claimed by the present application.
Referring to FIG. 1, a schematic diagram of an optional hardware architecture of the application server 1 of the present application is shown. In this embodiment, the application server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus. Note that FIG. 1 shows only the application server 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The application server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. It may be a standalone server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 1, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the application server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the application server 1. The memory 11 may of course include both an internal storage unit of the application server 1 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 1, such as the program code of the target document acquisition system 200. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the target document acquisition system 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the application server 1 and other electronic devices.
The hardware structure and functions of the devices related to the present application have now been described in detail. The embodiments of the present application are presented below on the basis of this description.
First, the present application proposes a target document acquisition system 200.
Referring to FIG. 2, a program module diagram of the first embodiment of the target document acquisition system 200 of the present application is shown.
In one embodiment, the target document acquisition system 200 comprises a series of computer program instructions stored in the memory 11; when these instructions are executed by the processor 12, the target document acquisition operations of the embodiments of the present application are carried out. In some embodiments, the target document acquisition system 200 may be divided into one or more modules according to the specific operations implemented by the respective parts of the computer program instructions. For example, in FIG. 2 the target document acquisition system 200 is divided into an acquisition module 21, an establishment module 22, an invocation module 23, a processing module 24, a comparison module 25, and an output module 26. Specifically:
The acquisition module 21 is configured to acquire at least one document and document information corresponding to the document, and to preprocess the document information.
Specifically, the documents may be various records from insurance institutions and medical institutions. Insurance institutions and medical institutions have databases that store records such as medical insurance reimbursement forms and drug lists; medical institutions include hospitals, clinics, and the like located in different places.
Specifically, the preprocessing includes the following steps: segmenting the document into words to obtain at least one word; performing part-of-speech analysis on the words to obtain first information for each word; and taking as candidate words those words whose part of speech is a preset part of speech or whose first information is preset first information.
Specifically, the parts of speech include nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, conjunctions, particles, and so on; the first information includes person names, institution names, place names, times, dates, percentages, and so on. For example, in the pharmaceutical field the names of drug compounds are often important candidate words, and the part of speech of a drug compound name is usually a noun, so the preset part of speech may be noun.
Specifically, the processing steps above can be carried out with existing tools. When the document is in Chinese, ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) of the Chinese Academy of Sciences or the HIT-IRLAS lexical analyzer of Harbin Institute of Technology may be used; when the document is in English, the Stanford Parser (also called the Stanford lexical analyzer) may be used. Preferably, shallow syntactic analysis or chunk analysis may also be performed on the candidate words to form chunk structure information, which is then also used as candidate words; for example, the chunk structure information may be non-recursive noun phrases, verb phrases, and so on.
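The candidate selection rule of the preprocessing step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes segmentation and tagging have already been done by an external analyzer, and the `TaggedWord` type, the tag names, and `select_candidates` are hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple, Optional


class TaggedWord(NamedTuple):
    """A segmented word with its tags (hypothetical structure for this sketch)."""
    text: str
    pos: str                # part of speech, e.g. "noun"
    entity: Optional[str]   # "first information", e.g. "date"; None if absent


def select_candidates(words: List[TaggedWord],
                      target_pos=frozenset({"noun"}),
                      target_entities=frozenset()) -> List[str]:
    """Keep words whose part of speech is preset OR whose first information is preset."""
    return [
        w.text for w in words
        if w.pos in target_pos
        or (w.entity is not None and w.entity in target_entities)
    ]


# Illustrative input; in practice the tags would come from a lexical analyzer.
tagged = [
    TaggedWord("glucose", "noun", None),
    TaggedWord("injection", "noun", None),
    TaggedWord("administered", "verb", None),
    TaggedWord("2017-10-25", "numeral", "date"),
]
candidates = select_candidates(tagged, target_entities=frozenset({"date"}))
print(candidates)
```

Keeping the tagger outside the function means the same rule can be applied to Chinese or English documents regardless of which analyzer produced the tags.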
The acquisition module 21 is further configured to acquire a search keyword.
Specifically, there may be one search keyword or several. When the method is used to extract documents for recurrent hypoglycemia, the search keyword may be set to "glucose". Users can set different keywords according to their needs.
The establishment module 22 is configured to establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table.
Specifically, the processing applied to the words after the document is segmented as described above is fuzzy processing; establishing a document selection model makes it possible to precisely select the required documents based on the candidate words obtained from preprocessing.
Specifically, the character deletion table contains characters among the candidate words that clearly do not match the search keyword. In some cases the target document contains multiple sentences, symbols, words, and so on, and the words obtained by segmenting the target document during preprocessing may include characters and words that do not meet the requirements. The character deletion table is established to handle words resulting from erroneous or improper segmentation.
Specifically, the synonym and near-synonym table contains the synonyms and near-synonyms of the search keyword, and may also contain the corresponding foreign-language terms in different languages. Synonyms are a group of words with similar or related meanings, and one word may have several synonyms in the same language. A keyword to be searched may be written differently by different people in different places; for example, 计算机 and 电脑 are synonyms (both mean "computer"). The same word can also have different meanings in different fields, so selecting the technical field is also important for correct retrieval.
Specifically, the specification parameter table contains multiple parameters corresponding to the search keyword. Taking glucose as an example, in recurrent hypoglycemia the specification parameters of glucose include dosage, frequency of use, and so on. When the search keyword is constrained by specification parameters, the specification parameter table can be used to locate candidate words precisely.
Specifically, establishing the document selection model includes the following steps: analyzing the search keyword to obtain its technical field; within that technical field, setting up the character deletion table according to the analysis result; within that technical field, obtaining the synonyms and near-synonyms of the keyword from a database and establishing the synonym and near-synonym table; within that technical field, analyzing the keyword, selecting its specification parameters, and establishing the specification parameter table; and dynamically updating the character deletion table, the synonym and near-synonym table, and the specification parameter table.
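One possible in-memory shape for the three tables, including the dynamic update step, is sketched below. The class name, field names, and sample entries are assumptions made for illustration; the application does not specify a data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DocumentSelectionModel:
    """Hypothetical container for the three tables of one technical field."""
    domain: str
    deletion_chars: Set[str] = field(default_factory=set)
    # keyword -> its synonyms, near-synonyms, and foreign-language equivalents
    synonyms: Dict[str, Set[str]] = field(default_factory=dict)
    # keyword -> {parameter name -> set of accepted values}
    spec_params: Dict[str, Dict[str, Set[str]]] = field(default_factory=dict)

    def update(self, deletion_chars=(), synonyms=None, spec_params=None) -> None:
        """Merge newly obtained information into the existing tables."""
        self.deletion_chars.update(deletion_chars)
        for keyword, terms in (synonyms or {}).items():
            self.synonyms.setdefault(keyword, set()).update(terms)
        for keyword, params in (spec_params or {}).items():
            table = self.spec_params.setdefault(keyword, {})
            for name, values in params.items():
                table.setdefault(name, set()).update(values)


model = DocumentSelectionModel(domain="medicine")
model.update(
    deletion_chars={"*", "#"},
    synonyms={"glucose": {"dextrose"}},
    spec_params={"glucose": {"unit": {"g", "ml"}}},
)
# A later update adds a foreign-language equivalent without losing earlier entries.
model.update(synonyms={"glucose": {"葡萄糖"}})
print(sorted(model.synonyms["glucose"]))
```

Merging rather than overwriting is one way to read "dynamic update": entries learned for one region remain available when data from another region is added.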
The invocation module 23 is configured to input the preprocessed document information into the document selection model, which processes the document information.
Specifically, the preprocessed document information is supplied to the document selection model as input, and the document selection model processes it according to preset conditions. The processing includes: invoking the character deletion table to delete characters and words in the document information that are erroneous, redundant, or clearly unrelated with respect to the search keyword; invoking the synonym and near-synonym table to substitute for the search keyword, searching with the substituted keyword, and saving the document information that matches the search keyword or its synonyms and near-synonyms; and invoking the specification parameter table to compare the specification parameters associated with the search keyword and its synonyms and near-synonyms, and saving the document information that matches the data in the specification parameter table.
Specifically, processing the document information may further include establishing a bracket recognition model that identifies the different ways brackets are used, so as to obtain precise classification data.
Specifically, the bracket recognition model can distinguish the different functions of brackets, namely the unilateral relation, the parallel relation, and the inclusion relation. In the unilateral relation, brackets act as separators that split the document information in the target document; in the parallel relation, brackets in the document information give an alias of a word; in the inclusion relation, the content of the brackets is specific parameter information for a noun.
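The three bracket relations can be illustrated with a small rule-based sketch. The rules here are assumptions chosen for the example (a quantity pattern marks the inclusion relation, a known-alias lookup marks the parallel relation, anything else is treated as a separator); the application does not state how its bracket recognition model decides.

```python
import re
from typing import List, Set, Tuple

# Matches the content of ASCII or full-width parentheses (no nesting).
BRACKET = re.compile(r"[（(]([^（）()]*)[）)]")
# Hypothetical pattern for a quantity such as "500ml" or "5 %".
QUANTITY = re.compile(r"^\s*\d+(\.\d+)?\s*(%|mg|g|ml|l)\s*$", re.IGNORECASE)


def classify_brackets(text: str, known_aliases: Set[str]) -> List[Tuple[str, str]]:
    """Label each bracketed span as 'inclusion', 'parallel', or 'unilateral'."""
    results = []
    for match in BRACKET.finditer(text):
        content = match.group(1).strip()
        if QUANTITY.match(content):
            kind = "inclusion"    # parameter information for the preceding noun
        elif content.lower() in known_aliases:
            kind = "parallel"     # alias of the preceding word
        else:
            kind = "unilateral"   # bracket used only as a separator
        results.append((content, kind))
    return results


labels = classify_brackets(
    "glucose (dextrose) injection (500ml) (ward 3)",
    known_aliases={"dextrose"},
)
print(labels)
```

In a fuller system the alias set could be read from the synonym and near-synonym table, so that both components stay consistent.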
The processing module 24 is configured to calculate, according to a preset keyword frequency and density algorithm, the word frequency and density score of the keyword in the documents output by the document selection model, and to rank the documents by relevance according to that score.
Specifically, the keyword frequency and density score M is:
M = ∑ log(total number of documents / (number of documents containing the keyword + 1)) * exp(count(keyword), S), where count(keyword) is the number of times the query term is hit in the retrieval results, log(total number of documents / (number of documents containing the keyword + 1)) is the importance of the keyword in the query results, and S is a preset parameter.
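The score can be sketched in code under two explicit assumptions that the text does not settle: the two-argument `exp(count(keyword), S)` is read here as `count ** S`, and `count(keyword)` is taken per document so that documents can be ranked against each other. The function name and the default value of `s` are illustrative only.

```python
import math
from typing import List, Sequence


def keyword_score(keywords: Sequence[str],
                  doc_tokens: List[str],
                  corpus: List[List[str]],
                  s: float = 0.5) -> float:
    """Sum over keywords of log(N / (n_kw + 1)) * count_kw ** s.

    ASSUMPTION: exp(count, S) in the text is interpreted as count ** S.
    ASSUMPTION: count is the number of hits inside this one document.
    """
    total_docs = len(corpus)
    score = 0.0
    for keyword in keywords:
        docs_with_keyword = sum(1 for doc in corpus if keyword in doc)
        importance = math.log(total_docs / (docs_with_keyword + 1))
        hits = doc_tokens.count(keyword)
        score += importance * hits ** s
    return score


corpus = [
    ["glucose", "glucose", "insulin", "glucose", "glucose"],
    ["insulin"],
    ["saline"],
    ["aspirin"],
]
score_first = keyword_score(["glucose"], corpus[0], corpus)
score_second = keyword_score(["glucose"], corpus[1], corpus)
print(score_first, score_second)
```

Note that the logarithmic factor becomes zero or negative when a keyword appears in nearly every document, so a keyword that common contributes little or nothing to the ranking.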
The comparison module 25 is configured to compare the relevance of each document with the preset relevance threshold.
The output module 26 is configured to output, according to the preset relevance threshold, the target documents whose relevance is greater than the threshold.
Further, when the target document acquisition method is applied in the medical field, for example to obtain documents for recurrent hypoglycemia, it may also include the following steps: analyzing the selected recurrent-hypoglycemia documents to obtain the patient's identity information; obtaining the patient's historical diagnosis and treatment data from the database according to that identity information; obtaining data such as the patient's glucose use, disease tests, and treatment methods from the historical diagnosis and treatment data; and obtaining all of the patient's recurrent-hypoglycemia documents based on the above data.
In addition, the present application also proposes a target document acquisition method.
Referring to FIG. 3, a schematic flowchart of the first embodiment of the target document acquisition method of the present application is shown. In this embodiment, the execution order of the steps in the flowchart shown in FIG. 3 may be changed and some steps may be omitted, depending on requirements.
Step S110: acquire at least one document and document information corresponding to the document, and preprocess the document information.
Step S120: acquire a search keyword.
Step S130: establish a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table.
Specifically, the processing applied to the words after the document is segmented as described above is fuzzy processing; establishing a document selection model makes it possible to precisely select the required documents based on the candidate words obtained from preprocessing. For example, a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table makes it possible to obtain the desired documents quickly and accurately.
Step S140: input the preprocessed document information into the document selection model, which processes the document information according to the search keyword.
Specifically, invoking the document selection model to process the document information makes it possible to obtain matching documents quickly. The processing further includes establishing a bracket recognition model that identifies the different ways brackets are used, so as to obtain precise classification data. The content of brackets varies widely and includes explanations of the preceding word, quantitative descriptions, synonyms, foreign-language terms, and so on; brackets may also serve merely to separate parts of a sentence. Establishing the bracket recognition model according to these different cases helps to obtain better results.
Step S150: calculate, according to a preset keyword frequency and density algorithm, the word frequency and density score of the search keyword in the documents output by the document selection model, and rank the documents by relevance according to that score.
Specifically, the keyword frequency and density score M is:
M = ∑ log(total number of documents / (number of documents containing the keyword + 1)) * exp(count(keyword), S), where count(keyword) is the number of times the query term is hit in the retrieval results, log(total number of documents / (number of documents containing the keyword + 1)) is the importance of the keyword in the query results, and S is a preset parameter.
Step S160: output, according to the preset relevance threshold, the target documents whose relevance is greater than the preset relevance threshold.
Specifically, setting a relevance threshold makes it easier to obtain the required documents accurately. Users can also fine-tune the relevance threshold based on the results they review, so that this feedback improves the retrieval method.
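Ranking and thresholding (steps S150 and S160) can be sketched as below, assuming that each document's relevance is simply its score M. The function name, document identifiers, and score values are illustrative and not taken from the application.

```python
from typing import Dict, List


def select_target_documents(scores: Dict[str, float], threshold: float) -> List[str]:
    """Rank documents by score (highest first) and keep those strictly above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score > threshold]


# Hypothetical relevance scores for three documents.
scores = {"doc_a": 1.39, "doc_b": 0.0, "doc_c": 2.1}
targets = select_target_documents(scores, threshold=1.0)
print(targets)
```

Because the threshold is an ordinary argument, a user who reviews the output can raise or lower it and rerun the selection, which matches the feedback loop described above.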
FIG. 4 is a schematic flowchart of the second embodiment of the target document acquisition method of the present application. In the first embodiment, the preprocessing in step S110, "acquiring at least one document and document information corresponding to the document, and preprocessing the document information", specifically includes the following steps:
Step S210: segment the document into words to obtain at least one word.
Step S220: perform part-of-speech analysis on the words to obtain first information for each word.
Specifically, the parts of speech include nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, conjunctions, particles, and so on; the first information includes person names, institution names, place names, times, dates, percentages, and so on. For example, in the pharmaceutical field the names of drug compounds are often important candidate words, and the part of speech of a drug compound name is usually a noun, so the preset part of speech may be noun.
Step S230: take as candidate words those words whose part of speech is a preset part of speech or whose first information is preset first information.
如图5所示,是本申请目标文档获取方法的第三实施方式的流程示意图。在第一实施方式中,步骤S130“建立基于字符删除表,同义近义词表及规格参数表的文档选择模型”中文档选择模型的建立具体包括以下步骤:As shown in FIG. 5, it is a schematic flowchart of a third embodiment of the target document obtaining method of the present application. In the first embodiment, the step S130 "establishing a document selection model based on the character deletion table, the synonymous synonyms table and the specification parameter table" includes the following steps:
步骤S310,对所述检索关键词进行分析,获得所述检索关键词的技术领域。Step S310, analyzing the search keyword to obtain a technical field of the search keyword.
具体地,所述检索关键词往往代表其特殊领域的特殊意义,借此可以确定所述检索关键词的技术领域,例如若所述检索关键词为“二叉树”,则可以缩小技术领域到计算机、算法等。Specifically, the search keyword often represents a special meaning of its special field, whereby the technical field of the search keyword can be determined. For example, if the search keyword is “binary tree”, the technical field can be reduced to a computer. Algorithms, etc.
步骤S320,在所述技术领域,根据分析结果设置字符删除表。Step S320, in the technical field, setting a character deletion table according to the analysis result.
具体地,所述字符删除表中包括与所述候选词语中明显与检索关键字不相符的字符,在一些情况下,目标文档中包括多个句子、符号、词语等,在预处理中对目标文档进行分词后获得的词语可能包括一些不符合要求的字符、词语,建立字符删除表对因错误、不当分词后的词语进行处理。Specifically, the character deletion table includes a character that is inconsistent with the search keyword among the candidate words, and in some cases, the target document includes a plurality of sentences, symbols, words, and the like, and the target is preprocessed. The words obtained after the document is segmented may include some characters and words that do not meet the requirements, and the character deletion table is created to deal with the words after the mistake and the inappropriate word segmentation.
步骤S330,在所述技术领域,从数据库中获得所述关键词的同义词、近义词并建立同义近义词表。Step S330, in the technical field, synonym and synonym of the keyword are obtained from a database and a synonym synonym table is established.
具体地,所述同义近义词表包括与检索关键词对应的同义词、近义词,还可以包括不同语言对应的外语词汇。同义词是指一组意思相近或者相互关联的词汇,同一词汇可以有多个同一语言的同义词。对于需要检索的关键词 来说,在不同地方不同人对其有不同的写法,例如,计算机与电脑是同义词。在不同领域,就算是一样的词语也具有不同的意义,因此,选定技术领域对正确检索也具有重要意义。Specifically, the synonymous synonym table includes synonyms and synonyms of the search keywords, and may also include foreign language vocabularies corresponding to different languages. A synonym is a group of words that have similar meanings or are related to each other. The same word can have multiple synonyms in the same language. For keywords that need to be searched, different people have different ways of writing them in different places. For example, computers and computers are synonymous. In different fields, even the same words have different meanings. Therefore, the selected technical field is also important for correct retrieval.
步骤S340,在所述技术领域,对所述关键词分析后选取所述关键词的规格参数建立所述规格参数表。Step S340, in the technical field, selecting the specification parameter of the keyword after analyzing the keyword to establish the specification parameter table.
具体地,所述规格参数表中包括对应检索关键词的多种参数。以葡萄糖为例,在反复发作低血糖病症中,葡萄糖的规格参数包括用量、使用频率等。当检索关键词具有规格参数限定时,使用规格参数表可以对候选词语进行精确定位。Specifically, the specification parameter table includes multiple parameters corresponding to the search keyword. Taking glucose as an example, in the case of recurrent hypoglycemia, the specification parameters of glucose include the amount and frequency of use. When the search keyword has a specification parameter definition, the candidate parameter can be accurately positioned using the specification parameter table.
Step S350: dynamically update the character deletion table, the synonym and near-synonym table, and the specification parameter table.
Specifically, as more information becomes available, the character deletion table, the synonym and near-synonym table, and the specification parameter table can be updated dynamically according to the information obtained. This makes the three tables more complete, which in turn makes the document selection model based on them more accurate.
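The dynamic update of step S350 could look like the following merge, assuming the three tables are held as Python sets and dictionaries. The container layout and the function `update_tables` are assumptions made for illustration; the application does not specify how the tables are stored or updated.

```python
def update_tables(tables, new_info):
    """Merge newly obtained entries into the three tables in place and return them."""
    # Character deletion table: a set of tokens.
    tables["deletion"].update(new_info.get("deletion", set()))
    # Synonym and near-synonym table: keyword -> set of alternatives.
    for word, synonyms in new_info.get("synonyms", {}).items():
        tables["synonyms"].setdefault(word, set()).update(synonyms)
    # Specification parameter table: keyword -> parameter -> set of values.
    for word, params in new_info.get("specs", {}).items():
        target = tables["specs"].setdefault(word, {})
        for name, values in params.items():
            target.setdefault(name, set()).update(values)
    return tables


tables = {"deletion": {"##"}, "synonyms": {"glucose": {"dextrose"}}, "specs": {}}
update_tables(tables, {
    "deletion": {"@@"},
    "synonyms": {"glucose": {"葡萄糖"}},
    "specs": {"glucose": {"dosage": {"D1"}}},
})
```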
FIG. 6 is a schematic flowchart of a fourth embodiment of the target document acquisition method of the present application. In this embodiment, the processing in the step "inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword" specifically includes the following steps:
Step S410: call the character deletion table to delete, from the document information, characters and words that are erroneous, redundant, or obviously related relative to the search keyword.
Step S420: call the synonym and near-synonym table to substitute the search keyword, search with the substituted search keyword, and save the document information that matches the search keyword or its synonyms and near-synonyms.
Step S430: call the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and save the document information that matches the data in the specification parameter table.
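Steps S410 to S430 can be sketched as three successive filters over pre-segmented documents. This is one possible reading, not the application's implementation: the document layout (`id`, `tokens`, `params`), the function name, and the decision to treat S420 and S430 as cumulative filters are all assumptions.

```python
def select_documents(docs, keyword, deletion_table, synonyms, spec_params):
    """Return the ids of documents that pass the three table-based steps."""
    # Search terms are the keyword plus its synonyms and near-synonyms.
    terms = {keyword} | synonyms.get(keyword, set())
    required = spec_params.get(keyword, {})
    selected = []
    for doc in docs:
        # S410: drop tokens listed in the character deletion table.
        tokens = [t for t in doc["tokens"] if t not in deletion_table]
        # S420: keep documents that mention the keyword or one of its synonyms.
        if not terms & set(tokens):
            continue
        # S430: keep documents whose parameters match the specification table.
        if all(doc["params"].get(n) in vals for n, vals in required.items()):
            selected.append(doc["id"])
    return selected


docs = [
    {"id": 1, "tokens": ["dextrose", "##"], "params": {"dosage": "D1"}},
    {"id": 2, "tokens": ["saline"], "params": {"dosage": "D1"}},
    {"id": 3, "tokens": ["glucose"], "params": {"dosage": "D9"}},
]
result = select_documents(
    docs, "glucose", {"##"}, {"glucose": {"dextrose"}}, {"glucose": {"dosage": {"D1"}}}
)
```

In this toy input only document 1 survives: document 2 never mentions the keyword or a synonym, and document 3 fails the parameter comparison.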
FIG. 7 is a schematic flowchart of a fifth embodiment of the target document acquisition method of the present application. In this embodiment, when the target document acquisition method is used to acquire recurrent hypoglycemia documents, the following steps may also be performed after the step "outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold":
Step S610: analyze the selected target documents to obtain the patient's identity information.
Specifically, the selected target documents may be recurrent hypoglycemia documents.
Step S620: obtain the patient's historical diagnosis and treatment data from the database according to the patient's identity information.
Step S630: obtain, from the historical diagnosis and treatment data, data such as the patient's glucose use, disease testing, and treatment methods.
Step S640: obtain all of the patient's recurrent hypoglycemia documents according to the above data.
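The follow-up lookup in steps S610 to S640 might be sketched as below. The record fields (`patient_id`, `doc_id`, `glucose_used`, `diagnosis`), the in-memory `history_db` dictionary, and the matching rule are hypothetical stand-ins; the application does not define the database schema or the exact selection criterion.

```python
def recurrent_hypoglycemia_documents(target_docs, history_db):
    """Map each patient found in the target documents to their matching historical documents."""
    results = {}
    for doc in target_docs:
        # S610: read the patient's identity from the selected target document.
        patient_id = doc["patient_id"]
        # S620: fetch the patient's historical diagnosis and treatment records.
        history = history_db.get(patient_id, [])
        # S630/S640: keep records showing glucose use with a hypoglycemia diagnosis
        # (an assumed criterion for this sketch).
        results[patient_id] = [
            record["doc_id"]
            for record in history
            if record.get("glucose_used") and record.get("diagnosis") == "hypoglycemia"
        ]
    return results


history_db = {
    "P1": [
        {"doc_id": "A", "glucose_used": True, "diagnosis": "hypoglycemia"},
        {"doc_id": "B", "glucose_used": False, "diagnosis": "fracture"},
        {"doc_id": "C", "glucose_used": True, "diagnosis": "hypoglycemia"},
    ]
}
found = recurrent_hypoglycemia_documents([{"patient_id": "P1"}], history_db)
```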
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A target document acquisition method, applied to an application server, characterized in that the method comprises the steps of:
    acquiring at least one document and document information corresponding to the document, and preprocessing the document information;
    acquiring a search keyword;
    establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table;
    inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword;
    calculating, according to a preset keyword word-frequency and density algorithm, a word-frequency and density score of the search keyword in the documents output by the document selection model, and ranking the documents by relevance according to the word-frequency and density score; and
    outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold.
  2. The target document acquisition method according to claim 1, characterized in that the preprocessing in the step "acquiring at least one document and document information corresponding to the document, and preprocessing the document information" further comprises the steps of:
    segmenting the document into words to obtain at least one word;
    performing part-of-speech analysis on the words to obtain first information of the words; and
    taking, as candidate words, the words that have a predetermined part of speech or whose first information is preset first information.
  3. The target document acquisition method according to claim 1, characterized in that the character deletion table includes characters in the candidate words that clearly do not match the search keyword; the synonym and near-synonym table includes synonyms and near-synonyms corresponding to the search keyword; and the specification parameter table includes multiple parameters corresponding to the search keyword.
  4. The target document acquisition method according to claim 3, characterized in that the step of establishing the target document selection model comprises:
    analyzing the search keyword to obtain the technical field of the search keyword;
    within the technical field, setting the character deletion table according to the analysis result;
    within the technical field, obtaining synonyms and near-synonyms of the keyword from a database and establishing the synonym and near-synonym table;
    within the technical field, analyzing the keyword and selecting specification parameters of the keyword to establish the specification parameter table; and
    dynamically updating the character deletion table, the synonym and near-synonym table, and the specification parameter table.
  5. The target document acquisition method according to claim 1, characterized in that, in the step "inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword", the processing comprises:
    calling the character deletion table to delete, from the document information, characters and words that are erroneous, redundant, or obviously related relative to the search keyword;
    calling the synonym and near-synonym table to substitute the search keyword, searching with the substituted search keyword, and saving the document information that matches the search keyword or its synonyms and near-synonyms; and
    calling the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and saving the document information that matches the data in the specification parameter table.
  6. The target document acquisition method according to claim 5, characterized in that, in the step "inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword", the processing further comprises:
    establishing a bracket recognition model that recognizes the different ways in which brackets are used, so as to obtain precise classification data.
  7. The target document acquisition method according to claim 1, characterized in that the keyword word-frequency and density score M is: M = Σ log(total number of documents / (number of documents containing the search keyword + 1)) * exp(count(search keyword), S), where count(search keyword) is the number of times the search keyword is hit in the search results, log(total number of documents / (number of documents containing the search keyword + 1)) is the importance of the search keyword in the query results, and S is a preset parameter.
  8. The target document acquisition method according to claim 1, characterized in that, when the target document acquisition method is used to acquire recurrent hypoglycemia documents, the method further comprises, after the step "outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold", the steps of:
    analyzing the selected target documents to obtain the patient's identity information;
    obtaining the patient's historical diagnosis and treatment data from a database according to the patient's identity information;
    obtaining, from the historical diagnosis and treatment data, data such as the patient's glucose use, disease testing, and treatment methods; and
    obtaining all of the patient's recurrent hypoglycemia documents according to the above data.
  9. An application server, characterized in that the application server comprises a memory and a processor, the memory storing a target document acquisition system executable on the processor, the target document acquisition system, when executed by the processor, implementing the following steps:
    acquiring at least one document and document information corresponding to the document, and preprocessing the document information;
    acquiring a search keyword;
    establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table;
    inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword;
    calculating, according to a preset keyword word-frequency and density algorithm, a word-frequency and density score of the search keyword in the documents output by the document selection model, and ranking the documents by relevance according to the word-frequency and density score; and
    outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold.
  10. The application server according to claim 9, characterized in that the preprocessing in the step "acquiring at least one document and document information corresponding to the document, and preprocessing the document information" further comprises the steps of:
    segmenting the document into words to obtain at least one word;
    performing part-of-speech analysis on the words to obtain first information of the words; and
    taking, as candidate words, the words that have a predetermined part of speech or whose first information is preset first information.
  11. The application server according to claim 9, characterized in that the character deletion table includes characters in the candidate words that clearly do not match the search keyword; the synonym and near-synonym table includes synonyms and near-synonyms corresponding to the search keyword; and the specification parameter table includes multiple parameters corresponding to the search keyword.
  12. The application server according to claim 11, characterized in that the step of establishing the target document selection model comprises:
    analyzing the search keyword to obtain the technical field of the search keyword;
    within the technical field, setting the character deletion table according to the analysis result;
    within the technical field, obtaining synonyms and near-synonyms of the keyword from a database and establishing the synonym and near-synonym table;
    within the technical field, analyzing the keyword and selecting specification parameters of the keyword to establish the specification parameter table; and
    dynamically updating the character deletion table, the synonym and near-synonym table, and the specification parameter table.
  13. The application server according to claim 9, characterized in that, in the step "inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword", the processing comprises:
    calling the character deletion table to delete, from the document information, characters and words that are erroneous, redundant, or obviously related relative to the search keyword;
    calling the synonym and near-synonym table to substitute the search keyword, searching with the substituted search keyword, and saving the document information that matches the search keyword or its synonyms and near-synonyms; and
    calling the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and saving the document information that matches the data in the specification parameter table.
  14. The application server according to claim 9, characterized in that the keyword word-frequency and density score M is: M = Σ log(total number of documents / (number of documents containing the search keyword + 1)) * exp(count(search keyword), S), where count(search keyword) is the number of times the search keyword is hit in the search results, log(total number of documents / (number of documents containing the search keyword + 1)) is the importance of the search keyword in the query results, and S is a preset parameter.
  15. The application server according to claim 9, characterized in that, when the target document acquisition method is used to acquire recurrent hypoglycemia documents, the following steps are further included after the step "outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold":
    analyzing the selected target documents to obtain the patient's identity information;
    obtaining the patient's historical diagnosis and treatment data from a database according to the patient's identity information;
    obtaining, from the historical diagnosis and treatment data, data such as the patient's glucose use, disease testing, and treatment methods; and
    obtaining all of the patient's recurrent hypoglycemia documents according to the above data.
  16. A computer-readable storage medium, the computer-readable storage medium storing a target document acquisition system, the target document acquisition system being executable by at least one processor to cause the at least one processor to perform the following steps:
    acquiring at least one document and document information corresponding to the document, and preprocessing the document information;
    acquiring a search keyword;
    establishing a document selection model based on a character deletion table, a synonym and near-synonym table, and a specification parameter table;
    inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword;
    calculating, according to a preset keyword word-frequency and density algorithm, a word-frequency and density score of the search keyword in the documents output by the document selection model, and ranking the documents by relevance according to the word-frequency and density score; and
    outputting, according to a preset relevance threshold, the target documents among the documents whose relevance is greater than the preset relevance threshold.
  17. The computer-readable storage medium according to claim 16, characterized in that the preprocessing in the step "acquiring at least one document and document information corresponding to the document, and preprocessing the document information" further comprises the steps of:
    segmenting the document into words to obtain at least one word;
    performing part-of-speech analysis on the words to obtain first information of the words; and
    taking, as candidate words, the words that have a predetermined part of speech or whose first information is preset first information.
  18. The computer-readable storage medium according to claim 16, characterized in that the character deletion table includes characters in the candidate words that clearly do not match the search keyword; the synonym and near-synonym table includes synonyms and near-synonyms corresponding to the search keyword; and the specification parameter table includes multiple parameters corresponding to the search keyword.
  19. The computer-readable storage medium according to claim 18, characterized in that the step of establishing the target document selection model comprises:
    analyzing the search keyword to obtain the technical field of the search keyword;
    within the technical field, setting the character deletion table according to the analysis result;
    within the technical field, obtaining synonyms and near-synonyms of the keyword from a database and establishing the synonym and near-synonym table;
    within the technical field, analyzing the keyword and selecting specification parameters of the keyword to establish the specification parameter table; and
    dynamically updating the character deletion table, the synonym and near-synonym table, and the specification parameter table.
  20. The computer-readable storage medium according to claim 16, characterized in that, in the step "inputting the preprocessed document information into the document selection model, the document selection model processing the document information according to the search keyword", the processing comprises:
    calling the character deletion table to delete, from the document information, characters and words that are erroneous, redundant, or obviously related relative to the search keyword;
    calling the synonym and near-synonym table to substitute the search keyword, searching with the substituted search keyword, and saving the document information that matches the search keyword or its synonyms and near-synonyms; and
    calling the specification parameter table to compare and analyze the specification parameters corresponding to the search keyword and its synonyms and near-synonyms, and saving the document information that matches the data in the specification parameter table.
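The word-frequency and density score recited in claims 7 and 14, together with the threshold-based output of claims 1, 9, and 16, can be illustrated with a small sketch. One point is an explicit assumption: the claims write a two-argument `exp(count, S)` without defining it, and the code below reads it as `count` raised to the power `S`. That reading, the function names, and the sample numbers are hypothetical and may not match the applicant's intent.

```python
import math


def keyword_score(total_docs, docs_with_keyword, hit_count, s):
    """Score one keyword: log(N / (n + 1)) times hit_count ** s (assumed reading of exp(count, S))."""
    importance = math.log(total_docs / (docs_with_keyword + 1))
    return importance * hit_count ** s


def document_score(total_docs, keyword_stats, s):
    """Sum the per-keyword scores; keyword_stats holds (docs_with_keyword, hit_count) pairs."""
    return sum(keyword_score(total_docs, n, c, s) for n, c in keyword_stats)


def select_targets(scores, threshold):
    """Rank document ids by score (highest first) and keep those above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score > threshold]


score = keyword_score(total_docs=100, docs_with_keyword=9, hit_count=4, s=0.5)
targets = select_targets({"a": 1.0, "b": 5.0, "c": 3.0}, threshold=2.0)
```

With `S` below 1 under this reading, repeated hits of the same keyword raise the score with diminishing returns, while the logarithmic factor gives more weight to keywords that appear in few documents.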
PCT/CN2018/077627 2017-10-23 2018-02-28 Method for obtaining target document and application server WO2019080428A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710994507.7 2017-10-23
CN201710994507.7A CN108427702B (en) 2017-10-23 2017-10-23 Target document acquisition method and application server

Publications (1)

Publication Number Publication Date
WO2019080428A1 (en) 2019-05-02

Family

ID=63155679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077627 WO2019080428A1 (en) 2017-10-23 2018-02-28 Method for obtaining target document and application server

Country Status (2)

Country Link
CN (1) CN108427702B (en)
WO (1) WO2019080428A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215754A (en) * 2018-09-10 2019-01-15 平安科技(深圳)有限公司 Medical record data processing method, device, computer equipment and storage medium
CN109815499B (en) * 2019-01-25 2023-05-23 杭州凡闻科技有限公司 Information association method and system
CN111859896B (en) * 2019-04-01 2022-11-25 长鑫存储技术有限公司 Formula document detection method and device, computer readable medium and electronic equipment
CN110135264A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Data entry method, device, computer equipment and storage medium
CN114627581B (en) * 2022-05-16 2022-08-05 深圳零匙科技有限公司 Coerced fingerprint linkage alarm method and system for intelligent door lock

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021866A (en) * 2007-03-13 2007-08-22 白云 Method for criminating electronci file and relative degree with certain field and application thereof
CN102955812A (en) * 2011-08-29 2013-03-06 阿里巴巴集团控股有限公司 Method and device for building index database as well as method and device for querying
CN105630940A (en) * 2015-12-21 2016-06-01 天津大学 Readability indicator based information retrieval method
CN106570058A (en) * 2016-09-29 2017-04-19 山东浪潮商用系统有限公司 Searching method and search engine

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3931214B2 (en) * 2001-12-17 2007-06-13 日本アイ・ビー・エム株式会社 Data analysis apparatus and program
CN101004753B (en) * 2007-01-25 2010-08-11 北京搜狗科技发展有限公司 Method and system for recognizing conception type files
CN103678576B (en) * 2013-12-11 2016-08-17 华中师范大学 The text retrieval system analyzed based on dynamic semantics

Also Published As

Publication number Publication date
CN108427702A (en) 2018-08-21
CN108427702B (en) 2021-02-09

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25/09/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18871227

Country of ref document: EP

Kind code of ref document: A1