CN112035500A - Knowledge base updating method, device, server and computer storage medium - Google Patents

Knowledge base updating method, device, server and computer storage medium

Info

Publication number
CN112035500A
Authority
CN
China
Prior art keywords
source document
text
unit
text unit
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010904022.6A
Other languages
Chinese (zh)
Other versions
CN112035500B (en
Inventor
申亚坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd filed Critical Bank of China Ltd
Priority to CN202010904022.6A priority Critical patent/CN112035500B/en
Publication of CN112035500A publication Critical patent/CN112035500A/en
Application granted granted Critical
Publication of CN112035500B publication Critical patent/CN112035500B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a method, a device, a server and a computer storage medium for updating a knowledge base, wherein the method comprises the steps of obtaining a source document; splitting a source document into a plurality of text units according to a preset splitting rule; executing preprocessing operation on the text unit to obtain a preprocessed text unit; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling; and sending each preprocessed text unit to an editing terminal, and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base. According to the scheme, the source document is divided into the text units, so that the different text units of one source document can be edited by the editing terminals in parallel, and the speed of extracting knowledge from one source document is remarkably improved.

Description

Knowledge base updating method, device, server and computer storage medium
Technical Field
The present invention relates to the field of text processing technologies, and in particular, to a method and an apparatus for updating a knowledge base, a server, and a computer storage medium.
Background
The knowledge base system is a common system of a bank, business knowledge corresponding to various businesses of the bank is stored in the knowledge base, and a business worker can timely find the corresponding business knowledge from the knowledge base, so that the business is handled for a client according to the business knowledge.
One of the main sources of business knowledge in the knowledge base is that related personnel edit a source document containing business knowledge at a corresponding editing terminal so as to extract the business knowledge in the source document, and then the business knowledge is sent to a server deploying the knowledge base system through the editing terminal, so that the process of adding the business knowledge to the knowledge base is completed.
However, a source document is often rich in content, and the time required for extracting all business knowledge in the source document is long, so that the efficiency of adding knowledge into the knowledge base is low.
Disclosure of Invention
Based on the above shortcomings in the prior art, the present application provides a method, an apparatus, a server and a computer storage medium for updating a knowledge base, so as to provide a solution for efficiently adding new knowledge to the knowledge base.
The first aspect of the present application provides a method for updating a knowledge base, including:
acquiring a source document;
splitting the source document into a plurality of text units according to a preset splitting rule;
executing preprocessing operation on the text unit to obtain a preprocessed text unit; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling;
and sending each preprocessed text unit to an editing terminal, and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base.
Optionally, the obtaining the source document includes:
receiving a network address of a source document sent by a client;
and pulling the source document from the corresponding website according to the network address of the source document.
Optionally, the splitting the source document into a plurality of text units according to a preset splitting rule includes:
identifying and obtaining each primary title of the source document;
and determining the text corresponding to each primary title as a text unit.
Optionally, the performing a preprocessing operation on the text unit to obtain a preprocessed text unit includes:
identifying synonyms in the text unit by utilizing a semantic identification algorithm, and marking each identified synonym in the text unit; and the marked text unit is used as a preprocessed text unit.
A second aspect of the present application provides an apparatus for updating a knowledge base, including:
an acquisition unit configured to acquire a source document;
the splitting unit is used for splitting the source document into a plurality of text units according to a preset splitting rule;
the processing unit is used for executing preprocessing operation on each text unit to obtain preprocessed text units; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling;
and the updating unit is used for sending each preprocessed text unit to the editing terminal and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base.
Optionally, when the obtaining unit obtains the source document, the obtaining unit is specifically configured to:
receiving a network address of a source document sent by a client;
and pulling the source document from the corresponding website according to the network address of the source document.
Optionally, when the splitting unit splits the source document into a plurality of text units according to a preset splitting rule, the splitting unit is specifically configured to:
identifying and obtaining each primary title of the source document;
and determining the text corresponding to each primary title as a text unit.
Optionally, when the processing unit executes a preprocessing operation on the text unit to obtain a preprocessed text unit, the processing unit is specifically configured to:
identifying synonyms in the text unit by utilizing a semantic identification algorithm, and marking each identified synonym in the text unit; and the marked text unit is used as a preprocessed text unit.
A third aspect of the present application provides a server comprising a memory and a processor;
wherein the memory is for storing a computer program;
the processor is configured to execute the computer program, and in particular to implement the method for updating a knowledge base provided in any of the first aspects of the present application.
A fourth aspect of the present application provides a computer storage medium for storing a computer program, which, when executed, is particularly adapted to implement the method for updating a knowledge base provided in any of the first aspects of the present application.
The application provides a method, a device, a server and a computer storage medium for updating a knowledge base, wherein the method comprises the steps of obtaining a source document; splitting a source document into a plurality of text units according to a preset splitting rule; executing preprocessing operation on the text unit to obtain a preprocessed text unit; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling; and sending each preprocessed text unit to an editing terminal, and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base. According to the scheme, the source document is divided into the text units, so that the different text units of one source document can be edited by the editing terminals in parallel, and the speed of extracting knowledge from one source document is remarkably improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for updating a knowledge base according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a source document and a text unit obtained by splitting according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an apparatus for updating a knowledge base according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to better understand the method for updating the knowledge base provided in the embodiment of the present application, a brief description is first given below of the knowledge base and the business knowledge stored in the knowledge base.
As is well known, banks handle a wide variety of businesses for both corporate and individual customers, including but not limited to issuing credit certificates, paying fees on behalf of customers, and purchasing financial products. Each business has its own handling procedure and associated handling specifications; for example, a business may be restricted to city-level branches, or may require that the customer meet certain conditions. In addition, when a customer consults a bank operator about a business, the operator may need to give the customer an overview of that business.
For a business, the handling process, the handling specification and the text description of the business overview are collectively called business knowledge corresponding to the business, and the bank knowledge base system is a database system for storing the business knowledge of each business currently supported by the bank.
Once the knowledge base system is established, a business worker can access it from a business handling terminal when needed and retrieve the corresponding business knowledge using the business name as a keyword. That is, the business worker retrieves the handling process and handling specification of the business currently being handled for a client, or the business overview of the business the client is currently consulting about, and then provides the corresponding service to the client according to the retrieved knowledge.
Through the application scene of the knowledge base system, it can be found that in order to make the knowledge base fully play a role, the knowledge base needs to be updated in time, namely new business knowledge is added to the knowledge base in time. The new business knowledge may come from a source document corresponding to a new online business of the bank, or may come from a source document used for describing specific changed items after the bank changes a certain business. The existing method for updating the knowledge base relies on an editing terminal to edit the whole source document, and long time is needed for extracting new business knowledge from the source document, so that the efficiency of updating the knowledge base is low.
In order to extract business knowledge contained in a source document from a source document with a longer length more quickly and update newly added business knowledge to a knowledge base of a bank in time, an embodiment of the application provides an updating method of the knowledge base, and a plurality of editing terminals can simultaneously edit different parts of one source document by splitting the source document, so that parallel processing of the source document is realized, and the efficiency of extracting new knowledge from the source document is obviously improved.
The execution subject of the method provided by any embodiment of the application may be regarded as a knowledge base system deployed on a server of a bank.
The method for updating a knowledge base provided in an embodiment of the present application is described in detail below with reference to the accompanying drawings. Referring to fig. 1, the method may include the following steps:
S101, acquiring a source document.
The source document obtained in step S101 is used to refer to a document containing business knowledge that needs to be added to the knowledge base.
Generally, after a bank develops a new service, the relevant personnel formulate the corresponding service flow and service specification for the new service and write the corresponding service profile, and these contents are usually recorded together in one document. Before the new service enters the service system of the bank and is opened to target clients, the service knowledge of the new service needs to be added to the knowledge base so that service personnel can consult it when handling the service later. The document recording the above information is the source document in step S101.
In addition, after a service is changed, generally, a service flow or a service specification is changed, a document for recording the changed service flow or service specification is also formed, and before a subprogram corresponding to the service in the service system is updated, the changed service knowledge (i.e., the changed service flow or service specification) needs to be updated to the knowledge base, where the document in which the changed service flow or service specification is recorded is the source document in step S101.
There are a number of alternative methods of obtaining the source document. First, a bank employee who stores a source document may access the knowledge base on a corresponding work terminal, and then actively upload the source document of the electronic version to the knowledge base, that is, the obtaining of the source document in step S101 may specifically be receiving the source document uploaded by the work terminal.
Secondly, a large bank often has branches in a plurality of regions, and the computer systems of the branches are connected with each other to form an intranet system of the bank. A source document may be generated in a branch of a certain region, whereas the knowledge base system is generally deployed on a server of the head office. In this case, an employee of the branch can access the knowledge base system (also called a knowledge base for short) through a work terminal and enter the network address of the source document on an input interface provided by the knowledge base system, where the network address is the address, within the intranet system of the bank, of the server storing the source document. After obtaining the network address, the knowledge base system can directly access the server storing the source document and pull the source document from it. That is, another alternative way to execute step S101 is to receive the network address of the source document sent by the client, and to pull the source document from the corresponding server according to that network address.
Thirdly, when a bank employee does not hold an electronic copy of the source document but holds the corresponding paper document, the paper document can be scanned with a scanning device to obtain a plurality of scanned images showing the source document. The scanned images are then uploaded to the knowledge base system, which recognizes the source document from the scanned images by using an image recognition algorithm.
S102, splitting the source document into at least one text unit according to a preset splitting rule.
The splitting rules may include splitting by paragraph, splitting by title, and splitting by whole piece. In executing step S102, it may be determined which splitting rule to use for splitting according to a specific source document.
Splitting by paragraph means that each natural segment in the source document is determined as a text unit.
Splitting by paragraph can be applied in either of the following situations:
the source document is long and has only one overall document title, with no primary titles set under that title; or the source document is long and has a plurality of primary titles, but the content corresponding to each primary title is too long.
That is, in step S102, the characters of the source document may first be counted to obtain its total number of characters. If the total is less than or equal to a preset first character number threshold, the source document is judged not suitable for the paragraph splitting rule. If the total is greater than the first character number threshold, it is further determined whether the source document is provided with primary titles and, if so, whether the number of characters of the content corresponding to each primary title is greater than a preset second character number threshold.
The source document is judged suitable for the paragraph splitting rule in either of two cases: the total number of characters is greater than the first character number threshold and the source document has no primary titles; or the total number of characters is greater than the first character number threshold, the source document has primary titles, and the number of characters of the content corresponding to each primary title is greater than the second character number threshold. In that case, in step S102, each natural segment of the source document is identified and extracted one by one, and each extracted natural segment is determined as one text unit.
If neither of the two conditions is met, the source document is judged not suitable for the paragraph splitting rule.
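The applicability check for the paragraph splitting rule can be sketched as follows. This is only an illustrative sketch: the function name, the parameter names and the threshold values used in the example are assumptions, since the application does not fix concrete values.

```python
from typing import Sequence


def paragraph_rule_applies(
    total_chars: int,
    section_char_counts: Sequence[int],
    first_threshold: int,
    second_threshold: int,
) -> bool:
    """Return True when splitting by paragraph suits the source document.

    section_char_counts holds the character count of the content under each
    primary title; an empty sequence means the document has no primary titles.
    All names here are hypothetical and chosen for illustration.
    """
    if total_chars <= first_threshold:
        return False  # the document is not long enough
    if not section_char_counts:
        return True  # long document without primary titles
    # Long document whose every primary-title section is itself too long.
    return all(count > second_threshold for count in section_char_counts)


# Example with assumed thresholds of 2000 and 800 characters.
print(paragraph_rule_applies(5000, [], 2000, 800))
print(paragraph_rule_applies(5000, [900, 300], 2000, 800))
```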
Splitting according to the titles means that the content corresponding to each primary title in the source document is determined as a text unit.
Referring to fig. 2, a source document may be provided with a plurality of subordinate primary headings in addition to the general document title (e.g., "XX business description" in fig. 2), each primary heading corresponding to the content of an aspect of the source document, for example, the primary headings in fig. 2 have "business profile", "business process", "required material", "notice", etc.
After detecting that the source document has set a plurality of first-level titles in a style similar to that of fig. 2, the rule of splitting by title may be directly applied to the source document, and at this time, the specific implementation process of step S102 may be:
firstly, recognizing each primary title of a source document one by one, and then determining the text content corresponding to each primary title as a text unit.
The text content corresponding to one primary title may be understood as the text content from the one primary title to the next primary title. For the last first-level title in the source document, all the text contents from the first-level title to the end of the source document belong to the text content corresponding to the last first-level title of the source document.
Taking fig. 2 as an example, for the primary heading "service profile", the text content between "service profile" and "transaction flow" may be determined as a text unit corresponding to the primary heading "service profile".
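Splitting by title can be illustrated with the minimal sketch below. How primary titles are recognized is not specified in the application, so the sketch takes a caller-supplied predicate `is_primary_title`; that predicate, the function name and the sample document are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def split_by_title(
    lines: List[str], is_primary_title: Callable[[str], bool]
) -> Dict[str, str]:
    """Map each primary title to the text between it and the next primary title.

    Text that precedes the first primary title (for example the overall
    document title) is not assigned to any text unit. Titles are assumed to
    be unique within one source document.
    """
    units: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        if is_primary_title(line):
            current = line.strip()
            units[current] = []
        elif current is not None:
            units[current].append(line)
    # The last title keeps everything up to the end of the document.
    return {title: "\n".join(body).strip() for title, body in units.items()}


# Hypothetical document modelled on fig. 2.
titles = {"Business profile", "Business process"}
document = [
    "XX business description",
    "Business profile",
    "This business serves individual customers.",
    "Business process",
    "Step one: submit the application.",
    "Step two: the branch reviews the application.",
]
text_units = split_by_title(document, lambda line: line.strip() in titles)
```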
Whole-piece splitting means that the whole source document is directly used as one text unit after irrelevant information in the source document is deleted. Irrelevant information in a source document refers to information unrelated to the business corresponding to the source document. For example, the source document may record the department or person who composed it, or the titles of other documents it refers to; such information is not directly related to the business corresponding to the source document and can therefore be considered irrelevant information.
In other words, whole-piece splitting identifies the irrelevant information in the source document, deletes all of it, and takes the remaining source document as one text unit.
Whole-piece splitting is suitable for a short, concise source document, that is, one with little content. Therefore, after the source document is obtained, if its total number of characters obtained through statistics is smaller than a preset third character number threshold, the whole-piece splitting rule can be used to split the source document; otherwise, if the total number of characters of the source document is larger than the third character number threshold, it is determined that the source document is not suitable for the whole-piece splitting rule.
It should be noted that, in addition to splitting by using a certain splitting rule alone, two splitting rules may also be used to split the source document. For example, the two splitting rules of splitting by paragraph and splitting by title may be applied in combination. Specifically, if the source document is provided with a plurality of first-level titles, splitting the source document according to the titles, determining the content corresponding to each first-level title as a text unit, then, for each text unit, detecting whether the number of characters of the text unit is greater than a preset fourth character number threshold, if the number of characters of one text unit is greater than the fourth character number threshold, further performing splitting according to paragraphs on the text unit, and further splitting the text unit into a plurality of smaller text units.
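The combined use of the two rules can be sketched as follows: text units already obtained by title are split further by paragraph when they exceed the fourth character number threshold. The sketch assumes that natural segments are separated by line breaks, which the application does not state; the function names and the threshold value in the example are likewise hypothetical.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text into natural segments, assuming one segment per line."""
    return [segment.strip() for segment in text.split("\n") if segment.strip()]


def refine_units(units: List[str], fourth_threshold: int) -> List[str]:
    """Split every over-long text unit by paragraph; keep the others as-is."""
    refined: List[str] = []
    for unit in units:
        if len(unit) > fourth_threshold:
            refined.extend(split_paragraphs(unit))
        else:
            refined.append(unit)
    return refined


# Example with an assumed threshold of 40 characters.
units_by_title = ["A short unit.", "a" * 30 + "\n" + "b" * 30]
smaller_units = refine_units(units_by_title, fourth_threshold=40)
```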
S103, preprocessing operation is performed on the text unit to obtain a preprocessed text unit.
The preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling.
That is, in step S103, any one or more of the preprocessing operations may be selected and executed according to actual situations, for example, each of the preprocessing operations may be executed in sequence, and after all the preprocessing operations are executed, a preprocessed text unit is obtained.
Each of the pretreatment operations is described below:
First, text translation. Text translation means translating the words in a text unit that do not belong to a target language into words of the target language. Generally, the target language may be Chinese, in which case text translation can be understood as translating the non-Chinese words in a text unit into the corresponding Chinese words.
The specific processing method may be as follows. A foreign language vocabulary library is established in advance, which stores a plurality of foreign language words (here meaning non-Chinese words) commonly found in the banking field, together with the Chinese word corresponding to each foreign language word. After a text unit is obtained, it is checked for foreign language words. If at least one foreign language word is detected, each detected word is looked up in the foreign language vocabulary library; if the word is recorded there, the corresponding Chinese word is extracted, and the extracted Chinese words together form the translation result for the foreign language words of the text unit.
Optionally, when performing text translation, each foreign language word in the text unit may be replaced by a corresponding chinese word, or a bracket may be added after each foreign language word, and the chinese word corresponding to the foreign language word is filled in the bracket.
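The glossary lookup and the two output modes (replacement, or a bracketed annotation after the foreign word) can be sketched as below. The sketch assumes that a foreign language word is a run of Latin letters, which is a simplification; the glossary content and all names are hypothetical.

```python
import re
from typing import Dict

# Assumption: a "foreign language word" is a run of ASCII letters.
FOREIGN_WORD = re.compile(r"[A-Za-z]+")


def translate_unit(text: str, glossary: Dict[str, str], annotate: bool = False) -> str:
    """Translate glossary words in a text unit.

    With annotate=False each known foreign word is replaced by its Chinese
    word; with annotate=True the Chinese word is added in brackets after it.
    Words missing from the glossary are left unchanged.
    """

    def substitute(match: "re.Match[str]") -> str:
        word = match.group(0)
        chinese = glossary.get(word)
        if chinese is None:
            return word
        return f"{word}({chinese})" if annotate else chinese

    return FOREIGN_WORD.sub(substitute, text)


# Hypothetical one-entry vocabulary library.
glossary = {"ATM": "自动取款机"}
replaced = translate_unit("客户可在ATM取款", glossary)
annotated = translate_unit("客户可在ATM取款", glossary, annotate=True)
```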
Second, paragraph division. Paragraph division applies when a text unit contains a plurality of natural segments. For such a text unit, paragraph division means marking each natural segment in the text unit; optionally, each natural segment may further be marked with its ordinal position among the natural segments of the source document.
Third, abstract extraction. A text unit is first divided into a plurality of sentences according to its punctuation marks: the text between every two adjacent sentence-ending marks (a period, or another mark indicating that a sentence ends, such as an exclamation mark or a question mark) is taken as one sentence. The divided sentences are then clustered according to the similarity between every two sentences; the number of cluster centers may be preset, for example, to 4. After clustering is completed, the N sentences having the largest number of adjacent sentences are selected from each cluster as the central sentences of the cluster, and the central sentences of all clusters together form the abstract of the text unit. N is a preset positive integer and may generally be set to 1 or 2.
For any two sentences in a cluster, if the similarity between the two sentences is greater than a certain threshold, the two sentences are determined to be a group of adjacent sentences in the cluster.
Optionally, the method for calculating the similarity between the two sentences may be:
for each sentence, converting each vocabulary of the sentence into a corresponding word vector by using a word vector model, then accumulating the word vectors of all the vocabularies contained in one sentence to obtain the sentence vector of the sentence, and finally, for each two sentences, calculating the cosine similarity of the sentence vectors of the two sentences, wherein the calculated result is the similarity of the two sentences.
The specific calculation method of the cosine similarity of the two vectors can refer to the related prior art, and is not described in detail here.
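The sentence similarity and the selection of central sentences within one cluster can be sketched as follows. The clustering step itself is omitted; the sketch starts from the sentence vectors of a single cluster. The tiny word vectors, the similarity threshold and all function names are assumptions for illustration, and words missing from the word vector table are treated as zero vectors.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def sentence_vector(words: Sequence[str], word_vectors: Dict[str, Vector]) -> Vector:
    """Accumulate the word vectors of all words in a sentence."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vectors.get(word, [0.0] * dim)):
            total[i] += value
    return total


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def central_sentences(cluster: List[Vector], threshold: float, n: int) -> List[int]:
    """Return the indices of the n sentences with the most adjacent sentences.

    Two sentences are adjacent when their similarity exceeds the threshold.
    Ties keep the original sentence order.
    """
    counts = [
        sum(1 for j, other in enumerate(cluster) if j != i and cosine(vec, other) > threshold)
        for i, vec in enumerate(cluster)
    ]
    ranked = sorted(range(len(cluster)), key=lambda i: counts[i], reverse=True)
    return ranked[:n]


# Hypothetical two-dimensional word vectors and one cluster of four sentences.
vectors = {"fee": [1.0, 0.0], "rate": [0.0, 1.0], "loan": [1.0, 1.0]}
cluster = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
centre = central_sentences(cluster, threshold=0.99, n=1)
```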
Fourth, synonym labeling. Specifically, synonyms in the text unit are identified by using a semantic recognition algorithm, and each identified synonym is labeled in the text unit.
The semantic recognition algorithm may be as follows: for every two words in the text unit, the context windows of the two words are determined; the context vectors corresponding to the two context windows are then generated by using a word vector model constructed in advance; finally, the similarity of the two context vectors is calculated, and if it is greater than a preset threshold, the two words are judged to be synonyms.
The word vector model (word2vec) is an existing mathematical model; after being trained on a large corpus, it can convert each word in a text unit into a corresponding word vector. In a text unit, the context window of a word refers to the combination of the M words preceding the word and the M words following it, where M is a positive integer and may generally be set to 5.
The context vector corresponding to the context window of a word is generated as follows: the word vector model converts the M preceding words and the M following words into corresponding word vectors, giving 2M word vectors of identical dimension; these 2M word vectors are then added together to obtain the context vector corresponding to the context window of the word.
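The context-vector comparison can be sketched as below. A plain dictionary stands in for a trained word vector model, and a word near the start or end of the text unit simply gets a shorter window; both choices, along with the names, the toy vectors and the threshold, are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def context_vector(tokens: List[str], index: int, word_vectors: Dict[str, Vector], m: int = 5) -> Vector:
    """Add the vectors of the m words before and the m words after tokens[index]."""
    window = tokens[max(0, index - m):index] + tokens[index + 1:index + 1 + m]
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in window:
        for i, value in enumerate(word_vectors.get(word, [0.0] * dim)):
            total[i] += value
    return total


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def are_synonyms(tokens: List[str], i: int, j: int, word_vectors: Dict[str, Vector],
                 threshold: float, m: int = 5) -> bool:
    """Judge two words synonymous when their context vectors are similar enough."""
    similarity = cosine(context_vector(tokens, i, word_vectors, m),
                        context_vector(tokens, j, word_vectors, m))
    return similarity > threshold


# Toy example: "fee" and "charge" occur in identical contexts.
tokens = ["pay", "the", "fee", "now", "pay", "the", "charge", "now"]
toy_vectors = {"pay": [1.0, 0.0, 0.0], "the": [0.0, 1.0, 0.0], "now": [0.0, 0.0, 1.0]}
fee_vs_charge = are_synonyms(tokens, 2, 6, toy_vectors, threshold=0.9, m=1)
fee_vs_pay = are_synonyms(tokens, 2, 0, toy_vectors, threshold=0.9, m=1)
```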
And S104, sending each preprocessed text unit to an editing terminal, and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base.
After receiving a preprocessed text unit, the editing terminal may issue prompt information, which can be a pop-up window on the computer desktop or a prompt short message sent to the mobile phone of the corresponding employee. The prompt information prompts the employee corresponding to the editing terminal to process the received text unit in time, so as to generate the business knowledge corresponding to the text unit.
Optionally, after receiving the knowledge fed back by the editing terminal, the received knowledge may be sent to the auditing terminal again for auditing, and after the auditing terminal passes the auditing, the audited knowledge is written into the knowledge base.
Optionally, each time a text unit is sent to an editing terminal, the server may monitor the progress of the employee of the editing terminal in processing the text unit in real time, and set the processing status identifier of the text unit correspondingly, for example, if the processing progress is processing, the processing status identifier may be set to yellow, and if the processing progress is processing completion, the processing status identifier may be set to green.
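The status marking in this example could be sketched as below. The names `Progress`, `STATUS_COLOR` and `update_status` are hypothetical and introduced only for illustration; the text specifies only the two colors and the two progress states.

```python
from enum import Enum


class Progress(Enum):
    """Processing progress of a text unit at an editing terminal."""
    PROCESSING = "processing"
    COMPLETED = "completed"


# Mapping taken from the example in the text: yellow while processing, green when done.
STATUS_COLOR = {
    Progress.PROCESSING: "yellow",
    Progress.COMPLETED: "green",
}


def update_status(status_table, unit_id, progress):
    """Record the status identifier of a text unit and return it."""
    status_table[unit_id] = STATUS_COLOR[progress]
    return status_table[unit_id]
```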
Optionally, besides being sent to the auditing terminal for auditing by related personnel, the method can also automatically audit knowledge fed back by the editing terminal. The automatic review may specifically include identifying and deleting wrongly written words from the fed-back knowledge, translating words in the fed-back knowledge that do not belong to the target language into words in a corresponding target language (the target language may be chinese), and deleting duplicate statements in the fed-back knowledge.
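Of the three automatic audit operations, the deletion of duplicate statements is the simplest to illustrate. The sketch below assumes that a statement is a run of text ending in Chinese or Western sentence-final punctuation and that only exact repeats count as duplicates; both are assumptions, since the text does not define either. Correcting wrongly written words and translating words would need language resources that are outside the scope of this sketch.

```python
import re


def remove_duplicate_statements(text):
    """Split text into statements and keep only the first occurrence of each one."""
    # Split right after sentence-final punctuation (Chinese or Western).
    # Caveat: a plain "." also occurs in decimals such as "4.5", which this
    # simple rule would split; real input would need a more careful splitter.
    statements = [s.strip() for s in re.split(r"(?<=[。！？.!?])\s*", text) if s.strip()]
    seen = set()
    kept = []
    for statement in statements:
        if statement not in seen:
            seen.add(statement)
            kept.append(statement)
    return kept
```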
Further, under the condition that a plurality of editing terminals feed back a plurality of service knowledge, the server can screen the received plurality of service knowledge, and combine every two services with the same or similar contents into one service knowledge, so as to avoid the storage space waste of the knowledge base caused by the storage of repeated service knowledge in the knowledge base.
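A minimal sketch of this screening step is given below. The text does not say how "similar" is measured, so the sketch assumes character-level similarity from the standard library's `difflib.SequenceMatcher` with a placeholder threshold of 0.9, and it assumes that the first-received item of a similar pair is the one that is kept.

```python
from difflib import SequenceMatcher


def merge_similar_knowledge(items, threshold=0.9):
    """Keep one representative for each group of identical or similar items."""
    merged = []
    for item in items:
        # Skip the item if it is close enough to one that has already been kept.
        if any(SequenceMatcher(None, item, kept).ratio() >= threshold for kept in merged):
            continue
        merged.append(item)
    return merged
```

Each incoming item is compared against every kept item, so the cost grows quadratically with the number of items; for the batch sizes implied by a handful of editing terminals this is unlikely to matter.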
According to the method for updating the knowledge base, the source document containing the business knowledge is split into at least one text unit according to the preset splitting rule, so that the longer source document is distributed to a plurality of editing terminals to be processed in parallel, the speed of extracting the business knowledge from the source document is increased, and the updating efficiency of the knowledge base is further improved.
In combination with the method for updating a knowledge base provided in any embodiment of the present application, an embodiment of the present application further provides an apparatus for updating a knowledge base. Referring to Fig. 3, the apparatus may include the following units:
an obtaining unit 301, configured to obtain a source document;
a splitting unit 302, configured to split the source document into a plurality of text units according to a preset splitting rule;
a processing unit 303, configured to perform a preprocessing operation on each text unit to obtain a preprocessed text unit, where the preprocessing operation comprises any one or a combination of text translation, paragraph division, abstract extraction and synonym labeling; and
an updating unit 304, configured to send each preprocessed text unit to the editing terminal, and write the knowledge generated by the editing terminal according to the preprocessed text units into the knowledge base.
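How the four units cooperate can be sketched as a small pipeline. This is only an illustration: the class name `KnowledgeBaseUpdater` and its fields are hypothetical, each unit is reduced to a plain callable, the round trip to the editing terminal is replaced by a stand-in function `edit`, and the knowledge base is modeled as a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KnowledgeBaseUpdater:
    obtain: Callable[[str], str]       # obtaining unit 301: address -> source document
    split: Callable[[str], List[str]]  # splitting unit 302: document -> text units
    preprocess: Callable[[str], str]   # processing unit 303: text unit -> preprocessed unit
    edit: Callable[[str], str]         # stand-in for the editing terminal: unit -> knowledge

    def update(self, address: str, knowledge_base: List[str]) -> List[str]:
        """Role of updating unit 304: dispatch each preprocessed unit and store the result."""
        document = self.obtain(address)
        for unit in self.split(document):
            knowledge_base.append(self.edit(self.preprocess(unit)))
        return knowledge_base
```

In this sketch the text units are handled one after another; in the described scheme they would be sent to several editing terminals and processed in parallel.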
When the obtaining unit 301 obtains the source document, it is specifically configured to:
receiving a network address of a source document sent by a client;
and pulling the source document from the corresponding website according to the network address of the source document.
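Pulling the source document by its network address might look like the sketch below, which uses only the Python standard library. The function name `pull_source_document`, the UTF-8 default encoding, the timeout and the restriction to `http` and `https` addresses are assumptions made for illustration; the text does not specify a protocol or an encoding.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def pull_source_document(network_address, timeout=10.0, encoding="utf-8",
                         allowed_schemes=("http", "https")):
    """Fetch the source document at network_address and return it as text."""
    scheme = urlparse(network_address).scheme
    if scheme not in allowed_schemes:
        # Reject unexpected schemes (for example file:) sent by a client.
        raise ValueError(f"unsupported address scheme: {scheme!r}")
    with urlopen(network_address, timeout=timeout) as response:
        return response.read().decode(encoding)
```

Checking the scheme before fetching is a defensive choice, since the address is supplied by a client and `urlopen` would otherwise also accept local file addresses.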
When the splitting unit 302 splits the source document into a plurality of text units according to a preset splitting rule, it is specifically configured to:
identifying each primary title of the source document;
and determining the text corresponding to each primary title as a text unit.
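The splitting rule above can be sketched as follows. How a primary title is recognized is not specified here, so the sketch assumes, purely for illustration, that primary titles are lines of the form "1. Title" while lower-level titles look like "1.1 Title"; it also assumes that the text corresponding to a primary title runs up to the next primary title and that any text before the first primary title is ignored.

```python
import re

# Hypothetical title format: "1. Deposits" is a primary title, "1.1 Term deposits" is not.
PRIMARY_TITLE = re.compile(r"^\d+\.\s+\S.*$")


def split_by_primary_titles(document, title_pattern=PRIMARY_TITLE):
    """Return a list of (primary title, text under that title) pairs."""
    units = []
    current = None
    for line in document.splitlines():
        if title_pattern.match(line):
            # A new primary title starts a new text unit.
            current = {"title": line.strip(), "body": []}
            units.append(current)
        elif current is not None:
            current["body"].append(line)
        # Lines before the first primary title are dropped (assumption).
    return [(u["title"], "\n".join(u["body"]).strip()) for u in units]
```

The pattern is a parameter so that a different heading convention can be substituted without changing the splitting logic.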
When the processing unit 303 performs a preprocessing operation on the text unit to obtain a preprocessed text unit, it is specifically configured to:
identifying synonyms in the text unit by using a semantic recognition algorithm, and marking each identified synonym in the text unit, where the marked text unit is used as the preprocessed text unit.
The specific working principle of the device for updating a knowledge base provided in the embodiments of the present application may refer to corresponding steps in the method for updating a knowledge base provided in any embodiment of the present application, and details thereof are not described here.
The present application provides an apparatus for updating a knowledge base, comprising: an obtaining unit 301, configured to obtain a source document; a splitting unit 302, configured to split the source document into a plurality of text units according to a preset splitting rule; a processing unit 303, configured to perform a preprocessing operation on each text unit to obtain preprocessed text units, where the preprocessing operation comprises any one or a combination of text translation, paragraph division, abstract extraction and synonym labeling; and an updating unit 304, configured to send each preprocessed text unit to the editing terminal, and write the knowledge generated by the editing terminal according to the preprocessed text units into the knowledge base. Because the source document is split into a plurality of text units, different text units of the same source document can be edited by a plurality of editing terminals in parallel, which significantly increases the speed of extracting knowledge from a source document.
The embodiment of the present application further provides a server, the structure of which is shown in fig. 4, and the server includes a memory 401 and a processor 402.
Wherein the memory 401 is used for storing a computer program;
the processor 402 is configured to execute the above computer program, and is specifically configured to implement the method for updating a knowledge base provided in any embodiment of the present application.
The embodiment of the present application further provides a computer storage medium for storing a computer program, where the computer program is specifically configured to implement the method for updating a knowledge base provided in any embodiment of the present application when executed.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
Those skilled in the art can make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for updating a knowledge base, comprising:
acquiring a source document;
splitting the source document into at least one text unit according to a preset splitting rule;
executing preprocessing operation on the text unit to obtain a preprocessed text unit; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling;
and sending each preprocessed text unit to an editing terminal, and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base.
2. The updating method according to claim 1, wherein the obtaining a source document comprises:
receiving a network address of a source document sent by a client;
and pulling the source document from the corresponding server according to the network address of the source document.
3. The updating method according to claim 1, wherein the splitting the source document into at least one text unit according to a preset splitting rule includes:
identifying and obtaining each primary title of the source document;
and determining the text corresponding to each primary title as a text unit.
4. The updating method of claim 1, wherein the performing a preprocessing operation on the text unit to obtain a preprocessed text unit comprises:
identifying synonyms in the text unit by utilizing a semantic identification algorithm, and marking each identified synonym in the text unit; and the marked text unit is used as a preprocessed text unit.
5. An apparatus for updating a knowledge base, comprising:
an acquisition unit configured to acquire a source document;
the splitting unit is used for splitting the source document into at least one text unit according to a preset splitting rule;
the processing unit is used for executing preprocessing operation on each text unit to obtain preprocessed text units; the preprocessing operation comprises any one or combination of text translation, paragraph division, abstract extraction and synonym labeling;
and the updating unit is used for sending each preprocessed text unit to the editing terminal and writing the knowledge generated by the editing terminal according to the preprocessed text units into a knowledge base.
6. The updating apparatus according to claim 5, wherein the obtaining unit, when obtaining the source document, is specifically configured to:
receiving a network address of a source document sent by a client;
and pulling the source document from the corresponding server according to the network address of the source document.
7. The updating apparatus according to claim 5, wherein when the splitting unit splits the source document into at least one text unit according to a preset splitting rule, the splitting unit is specifically configured to:
identifying and obtaining each primary title of the source document;
and determining the text corresponding to each primary title as a text unit.
8. The updating apparatus according to claim 5, wherein the processing unit performs a preprocessing operation on the text unit, and when obtaining a preprocessed text unit, is specifically configured to:
identifying synonyms in the text unit by utilizing a semantic identification algorithm, and marking each identified synonym in the text unit; and the marked text unit is used as a preprocessed text unit.
9. A server, comprising a memory and a processor;
wherein the memory is for storing a computer program;
the processor is adapted to execute the computer program, in particular to implement the method of updating a knowledge base according to any of claims 1 to 4.
10. A computer storage medium for storing a computer program, which, when executed, is particularly adapted to implement the method of updating a knowledge base of any one of claims 1 to 4.
CN202010904022.6A 2020-09-01 2020-09-01 Knowledge base updating method, device, server and computer storage medium Active CN112035500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010904022.6A CN112035500B (en) 2020-09-01 2020-09-01 Knowledge base updating method, device, server and computer storage medium

Publications (2)

Publication Number Publication Date
CN112035500A true CN112035500A (en) 2020-12-04
CN112035500B CN112035500B (en) 2024-01-26

Family

ID=73591553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010904022.6A Active CN112035500B (en) 2020-09-01 2020-09-01 Knowledge base updating method, device, server and computer storage medium

Country Status (1)

Country Link
CN (1) CN112035500B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999480A (en) * 2012-11-09 2013-03-27 中国电子科技集团公司第十五研究所 Method and system for editing document
CN104573006A (en) * 2015-01-08 2015-04-29 南通大学 Construction method of public health emergent event domain knowledge base
CN107220296A (en) * 2017-04-28 2017-09-29 北京拓尔思信息技术股份有限公司 The generation method of question and answer knowledge base, the training method of neural network and equipment
CN110889280A (en) * 2018-09-06 2020-03-17 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN111144116A (en) * 2019-12-25 2020-05-12 国网江苏省电力有限公司电力科学研究院 Document knowledge structuralization extraction method and device

Also Published As

Publication number Publication date
CN112035500B (en) 2024-01-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant