CN109597879B

CN109597879B - Service behavior relation extraction method and device based on 'citation relation' data

Info

Publication number: CN109597879B
Application number: CN201811463779.5A
Authority: CN
Inventors: 蓝建敏
Original assignee: Excellence Information Technology Co ltd
Current assignee: Excellence Information Technology Co ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2022-03-29
Anticipated expiration: 2038-11-30
Also published as: CN109597879A

Abstract

The invention discloses a business behavior relation extraction method and device based on citation relation data. The method comprises the following steps: collecting corpora, preprocessing the corpora and constructing a corpus; extracting business behavior words from all document titles in the corpus, classifying the business behavior words according to business fields, and forming a business behavior word bank corresponding to each business field; extracting the relation data between all document titles and the titles of the documents they cite from the corpus to construct a citation relation database; and counting, according to the citation relation database, the number of occurrences of the business behavior words and of the cited business behavior words and the number of times they occur together, generating business behavior relations, and constructing a business behavior relation database. The invention makes the extracted correlations more business-relevant, brings them closer to business reality than distances between individual words, and improves the accuracy of task-based knowledge retrieval.

Description

Service behavior relation extraction method and device based on 'citation relation' data

Technical Field

The invention relates to the technical field of big data mining, in particular to a business behavior relation extraction method and device based on citation relation data.

Background

In studying and practicing the prior art, the inventors of the present invention found the following.

The first traditional method is to read and study a large number of files manually, identify the business behaviors in them, and establish the correlations among those behaviors. Because the relations between business behaviors are constructed entirely by hand, this method involves a heavy workload, covers only a narrow range, and yields inaccurate relation weights.

The second traditional method is to calculate the distance between business behavior words with the word2vec algorithm and derive the correlation between business behaviors from that distance. The correlations calculated this way have little business relevance and cannot really meet the needs of searching for related knowledge.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method and a device for extracting business behavior relations based on 'citation relation' data, which make the extracted correlations more business-relevant, bring them closer to business reality than distances between individual words, and improve the accuracy of task-based knowledge retrieval.

In order to solve the above problem, an embodiment of the present invention provides a method for extracting a business behavior relationship based on "citation relationship" data, including:

collecting corpora, preprocessing the corpora and constructing a corpus;

extracting business behavior words from all document titles in the corpus, classifying the business behavior words according to business fields, and forming a business behavior word bank corresponding to each business field;

extracting the relation data of all file titles and the titles of the cited files from the corpus to construct a cited relation database;

and counting, according to the citation relation database, the number of occurrences of the business behavior words and of the cited business behavior words and the number of times they occur together, generating business behavior relations, and constructing a business behavior relation database.

Further, collecting the corpora specifically includes searching for existing corpora and downloading and crawling corpora from the internet; preprocessing the corpora specifically includes corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal.

Further, the extracting of the business behavior words from all document titles in the corpus specifically includes:

parsing and segmenting all document titles in the corpus;

collecting business behavior words, including known business behavior words, continuously derived business behavior words and business behavior words that need to be converted;

screening and testing business behavior words;

and carrying out initial classification and reasoning on the business behavior words.

Further, extracting the relation data between all document titles and the cited document titles from the corpus to construct a citation relation database specifically includes:

analyzing the content of each file in the corpus, and extracting the relation data between the file title and the titles of the files it cites;

according to the file titles, marking a business behavior label on each file to form citation relation data, and constructing a citation relation database; the citation relation data comprises a file title, a behavior tag, a cited file title and a cited behavior tag.

Another embodiment of the present invention further provides a device for extracting business behavior relationship based on "citation relationship" data, including:

the corpus database module is used for collecting corpora, preprocessing the corpora and constructing a corpus;

the business behavior word bank module is used for extracting business behavior words from all file titles in the corpus and classifying the business behavior words according to business fields to form a business behavior word bank corresponding to each business field;

the citation relation database module is used for extracting the relation data between all file titles and the cited file titles from the corpus and constructing a citation relation database;

and the business behavior relation library module is used for counting, according to the citation relation database, the number of occurrences of the business behavior words and of the cited business behavior words and the number of times they occur together, generating business behavior relations and constructing a business behavior relation library.

Further, the corpus module is specifically configured to: search for existing corpora, and download and crawl corpora from the network; and perform corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal on the corpora.

Further, the business behavior lexicon module is specifically configured to:

screening and testing business behavior words;

Further, the citation relation database module is specifically configured to:

Yet another embodiment of the present invention further provides a "citation relationship" data-based business behavior relationship extraction device, which is characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the business behavior relationship extraction device implements the "citation relationship" data-based business behavior relationship extraction method as described above.

By implementing the embodiments of the invention, the extracted correlations become more business-relevant and closer to business reality than distances between individual words, and the accuracy of task-based knowledge retrieval is improved.

Drawings

Fig. 1 is a schematic flowchart of a business behavior relationship extraction method based on "citation relationship" data according to an embodiment of the present invention;

Fig. 2 is another schematic flow chart of a business behavior relationship extraction method based on "citation relationship" data according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a business behavior relationship extraction apparatus based on "citation relationship" data according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a first aspect, please refer to FIGS. 1-2. One embodiment of the present invention provides a business behavior relationship extraction method based on "citation relationship" data, including:

S1, collecting corpora, preprocessing the corpora and constructing a corpus.

Collecting the corpora specifically includes searching for existing corpora and downloading and crawling corpora from the internet; preprocessing the corpora specifically includes corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal.

In a specific embodiment, the method mainly collects and organizes the special policy documents and leaders' speeches of central and provincial governments published on official government websites.

It can be understood that many organizations, such as business departments and companies, accumulate a great deal of paper or electronic text data as their business develops. Where permitted, these data can be lightly consolidated, and the paper texts digitized, to serve as the corpus.

Standard open data sets from home and abroad can also be used, such as the Sogou corpus and the People's Daily corpus for Chinese. Alternatively, a crawler can be written to capture data before the subsequent steps are carried out.

In one embodiment, corpus preprocessing accounts for about 50%-70% of the workload of a complete Chinese natural language processing application, so developers spend most of their time on it. Preprocessing is completed in four main steps: corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal.

1. Corpus cleaning

Data cleaning, as the name implies, means keeping the content of interest in the corpus and cleaning out the uninteresting content, which is regarded as noise. It includes extracting information such as titles, abstracts and body text from the original text, and removing advertisements, tags, HTML, JavaScript code and comments from captured web page content. Common data cleaning methods are manual deduplication, alignment, deletion and annotation, or rule-based content extraction, regular expression matching, extraction by part of speech and named entity, and batch processing with scripts or code.
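The regular-expression route mentioned above can be sketched as follows. This is a minimal illustration and not the cleaning procedure of the embodiment: the function name `clean_page` and the patterns are assumptions, and a production system would more likely use a real HTML parser.

```python
import html
import re

# Illustrative patterns (assumptions, not taken from the source).
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def clean_page(raw_html: str) -> dict:
    """Return the title and the plain body text of a captured web page."""
    match = _TITLE.search(raw_html)
    title = html.unescape(match.group(1)).strip() if match else ""

    text = _SCRIPT_STYLE.sub(" ", raw_html)   # drop JS and CSS blocks
    text = _COMMENT.sub(" ", text)            # drop HTML comments
    text = _TITLE.sub(" ", text)              # title is returned separately
    text = _TAG.sub(" ", text)                # drop remaining tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = _WHITESPACE.sub(" ", text).strip()
    return {"title": title, "body": text}


page = (
    "<html><head><title>Notice on Budget</title>"
    "<script>var a = 1;</script></head>"
    "<body><!-- ad --><p>Body &amp; text</p></body></html>"
)
print(clean_page(page))
```

Separating the title from the body matters here because the later steps work on file titles rather than on full text.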

2. Word segmentation

Chinese corpus data consist of short or long texts, such as a sentence, an abstract, a paragraph, or a whole article. The characters and words within sentences and paragraphs run together without delimiters and carry meaning in combination. For text mining analysis, the minimum unit of text processing is expected to be the character or the word, so the whole text must first be segmented into words.

Common word segmentation approaches are segmentation based on string matching, segmentation based on understanding, segmentation based on statistics, and segmentation based on rules; each approach corresponds to a number of specific algorithms.

The main difficulties of current Chinese word segmentation algorithms are ambiguity resolution and new word recognition. For example, the sentence "羽毛球拍卖完了" can be segmented as "羽毛 / 球拍 / 卖 / 完 / 了" (the feather rackets are sold out) or as "羽毛球 / 拍卖 / 完 / 了" (the badminton auction is finished); without the context of other sentences, it is hard to know which reading is intended.
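To make the string-matching family and its ambiguity problem concrete, here is a minimal sketch of forward maximum matching, one well-known string-matching algorithm. The toy dictionary, the `max_len` parameter and the sample sentence (an assumed Chinese form of the badminton example) are illustrative and not taken from the embodiment.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (an assumption for illustration only).
toy_dictionary = {"羽毛", "羽毛球", "球拍", "拍卖", "卖", "完", "了"}
print(forward_max_match("羽毛球拍卖完了", toy_dictionary))
```

Because the matcher is greedy, it commits to "羽毛球" and never considers the alternative reading that starts with "羽毛" and "球拍", which is why string matching alone cannot resolve this kind of ambiguity.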

3. Part-of-speech tagging

Part-of-speech tagging assigns a part-of-speech class, such as adjective, verb or noun, to each character or word. This lets later processing incorporate more useful linguistic information. Part-of-speech tagging is a classic sequence labeling problem, although it is not necessary for every Chinese natural language processing task. For example, common text classification does not need part-of-speech information, whereas tasks such as sentiment analysis and knowledge reasoning do. The following figure shows a common classification of Chinese parts of speech.

Common part-of-speech tagging methods can be divided into rule-based and statistics-based methods. Statistics-based methods include tagging based on maximum entropy, tagging that outputs the statistically most probable part of speech, and tagging based on hidden Markov models (HMM).
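As an illustration of the statistics-based family, the simplest variant, which outputs the tag each word carries most often in training data, can be sketched as follows. This is a most-frequent-tag baseline, not a maximum entropy or HMM tagger; the function names, the tiny tagged sample and the default tag `n` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_tokens(tokens, model, default_tag="n"):
    """Tag each token with its most probable tag; unseen words get default_tag."""
    return [(token, model.get(token, default_tag)) for token in tokens]


# Toy training data (hypothetical): "审批" appears twice as a verb, once as a noun.
training = [
    [("审批", "v"), ("预算", "n")],
    [("审批", "v"), ("项目", "n")],
    [("预算", "n"), ("审批", "n")],
]
model = train_most_frequent_tag(training)
print(tag_tokens(["审批", "预算", "资金"], model))
```

An HMM tagger would additionally model transitions between neighbouring tags, which this baseline ignores.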

4. Stop-word removal

Stop words generally refer to words that contribute nothing to the text features, such as punctuation, modal particles and personal pronouns. In general text processing, stop-word removal is therefore the step that follows word segmentation. For Chinese, however, stop-word removal is not a fixed operation: the stop-word dictionary is determined by the specific scenario. In sentiment analysis, for example, modal particles and exclamation marks should be retained, because they contribute to expressing the degree of tone and the emotional color.
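The scenario-dependent dictionary described above can be sketched as follows. The two stop-word sets and the sample tokens are hypothetical, and real dictionaries are much larger.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical stop-word sets for illustration.
GENERAL_STOP_WORDS = {"的", "了", "啊", "，", "。", "！"}
# For sentiment analysis, keep modal particles and exclamation marks.
SENTIMENT_STOP_WORDS = GENERAL_STOP_WORDS - {"啊", "！"}

tokens = ["今天", "的", "天气", "真", "好", "啊", "！"]
print(remove_stop_words(tokens, GENERAL_STOP_WORDS))
print(remove_stop_words(tokens, SENTIMENT_STOP_WORDS))
```

Keeping the dictionary as a parameter, rather than hard-coding it, lets the same preprocessing step serve both general text processing and sentiment analysis.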

S2, extracting the business behavior words from all the file titles in the corpus, and classifying the business behavior words according to the business fields to form a business behavior word bank corresponding to each business field.

Wherein, the extracting of the business behavior words from all the document titles in the corpus is specifically:

screening and testing business behavior words;

In particular embodiments, targeted business behavior words make it possible for customers with a high conversion rate to find the entry.

The method mainly comprises the following steps:

1. Collecting business behavior words.

(1) Continuously derived business behavior words;

(2) Existing business behavior words that most people are not aware of (that is, they do not know that these words have a conversion rate). It can be understood that the system records a word only when a user searches for it, and the core words are found from those searches.

2. Screening the business behavior words.

As new words keep emerging and old words keep disappearing, the system continuously checks whether new words have been generated. Not every word is useful, however: words that clearly do not meet users' needs should be cut, so obviously useless business behavior words are removed. Business behavior words whose usefulness cannot be judged at this stage are passed on to testing.

3. Testing the business behavior words.

The conversion rate of the business behavior words is checked with a testing tool. The conversion rate cannot be judged by the word alone, however; every link in the chain, such as customer service and website content, also determines whether the conversion rate is high or low. Business behavior words that pass the test are effective business behavior words.

4. Classifying and reasoning about the business behavior words.
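The collect, screen and classify steps of S2 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure of the embodiment: candidates are taken to be verbs (tag `v`) or words from a known list, screening is reduced to a frequency threshold `min_count`, the business field of each title is assumed to be given, and the conversion-rate test is omitted.

```python
from collections import Counter, defaultdict


def build_word_banks(tagged_titles, known_words=frozenset(), min_count=2):
    """Build one business behavior word bank per business field.

    tagged_titles: iterable of (field, [(token, pos), ...]) pairs.
    """
    counts = defaultdict(Counter)
    for field, tokens in tagged_titles:
        for token, pos in tokens:
            # Collecting: verbs and already known words are candidates.
            if pos == "v" or token in known_words:
                counts[field][token] += 1
    # Screening: keep known words and sufficiently frequent candidates.
    return {
        field: sorted(
            word for word, n in counter.items()
            if n >= min_count or word in known_words
        )
        for field, counter in counts.items()
    }


# Hypothetical, English-glossed sample titles.
titles = [
    ("finance", [("approve", "v"), ("budget", "n")]),
    ("finance", [("approve", "v"), ("funds", "n")]),
    ("finance", [("audit", "v")]),
    ("personnel", [("appoint", "v")]),
    ("personnel", [("appoint", "v"), ("staff", "n")]),
]
print(build_word_banks(titles))
```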

S3, extracting the relation data between all the file titles and the cited file titles from the corpus to construct a citation relation database. Specifically, the method comprises the following steps:

In a particular embodiment, when a document is being handled, the citation relation data are extracted from the library of references it cites into the citation relation database.
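As stated earlier in this description, each item of citation relation data comprises a file title, a behavior tag, a cited file title and a cited behavior tag. A minimal sketch of such a record, and of labeling titles from a word bank, might look like this. The names `CitationRecord`, `label_title` and `build_records`, the longest-match labeling rule and the sample titles are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CitationRecord:
    file_title: str
    behavior_tag: str
    cited_file_title: str
    cited_behavior_tag: str


def label_title(title: str, lexicon: Iterable[str]) -> Optional[str]:
    """Return the longest word-bank entry contained in the title, if any."""
    matches = [word for word in lexicon if word in title]
    # Ties on length are broken alphabetically so the result is deterministic.
    return max(matches, key=lambda w: (len(w), w)) if matches else None


def build_records(title_pairs: Iterable[Tuple[str, str]],
                  lexicon: Iterable[str]) -> List[CitationRecord]:
    """Turn (title, cited title) pairs into labeled citation records."""
    lexicon = list(lexicon)
    records = []
    for title, cited_title in title_pairs:
        tag = label_title(title, lexicon)
        cited_tag = label_title(cited_title, lexicon)
        if tag and cited_tag:  # skip pairs that cannot be labeled
            records.append(CitationRecord(title, tag, cited_title, cited_tag))
    return records


# Hypothetical sample data.
lexicon = ["budget approval", "approval", "fund allocation"]
pairs = [
    ("Notice on budget approval for 2018", "Measures for fund allocation"),
    ("Holiday schedule", "Measures for fund allocation"),
]
records = build_records(pairs, lexicon)
print(records)
```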

S4, according to the citation relation database, counting the number of occurrences of the business behavior words and of the cited business behavior words and the number of times they occur together, generating business behavior relations, and constructing a business behavior relation database.

In a specific embodiment, the correlation between business behaviors is evaluated based on the number of co-occurrences.
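The counting in S4 can be sketched as below. The input is assumed to be a list of (behavior tag, cited behavior tag) pairs taken from the citation relation database. The source only says that correlation is evaluated from the number of co-occurrences, so the normalized `weight` field, the `min_count` threshold and the function name are illustrative assumptions.

```python
from collections import Counter


def build_behavior_relations(tag_pairs, min_count=1):
    """Count occurrences and co-occurrences and return weighted relations."""
    pairs = list(tag_pairs)
    behavior_counts = Counter(tag for tag, _ in pairs)
    cited_counts = Counter(cited for _, cited in pairs)
    pair_counts = Counter(pairs)

    relations = []
    for (tag, cited), together in pair_counts.items():
        if together < min_count:
            continue
        relations.append({
            "behavior": tag,
            "cited_behavior": cited,
            "co_occurrences": together,
            "behavior_count": behavior_counts[tag],
            "cited_count": cited_counts[cited],
            # Assumed normalization: share of this behavior's citations
            # that point to the cited behavior.
            "weight": together / behavior_counts[tag],
        })
    # Strongest relations first; names break ties deterministically.
    relations.sort(key=lambda r: (-r["co_occurrences"], r["behavior"], r["cited_behavior"]))
    return relations


# Hypothetical sample pairs.
sample_pairs = (
    [("budget approval", "fund allocation")] * 3
    + [("budget approval", "audit")]
    + [("procurement", "fund allocation")]
)
for relation in build_behavior_relations(sample_pairs):
    print(relation)
```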

Compared with constructing business behavior relations manually or with the word2vec algorithm, the method has the following advantages:

(1) Manual work and machine processing are combined, so efficiency is higher than with purely manual construction.

(2) The strong relations between business activities implied by the citation relation data are closer to business reality than distances between individual words, so the resulting relations are stronger than business relations constructed with the word2vec algorithm.

In a second aspect, as shown in fig. 3, another embodiment of the present invention further provides a business behavior relationship extracting apparatus based on "citation relationship" data, including:

a corpus module 21, used for collecting corpora, preprocessing the corpora and constructing a corpus.

The corpus module 21 is specifically configured to: search for existing corpora, and download and crawl corpora from the network; and perform corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal on the corpora.

A business behavior word bank module 22, configured to extract business behavior words from all document titles in the corpus, and classify the business behavior words according to business fields to form a business behavior word bank corresponding to each business field.

The business behavior word bank module 22 is specifically configured to:

screening and testing business behavior words;

The method mainly comprises the following steps:

1. Collecting business behavior words.

(1) Continuously derived business behavior words;

2. Screening the business behavior words.

3. Testing the business behavior words.

4. Classifying and reasoning about the business behavior words.

A citation relation database module 23, used for extracting the relation data between all the file titles and the cited file titles from the corpus to construct a citation relation database.

Wherein, the citation relation database module 23 is specifically configured to:

A business behavior relation library module 24, used for counting, according to the citation relation database, the number of occurrences of the business behavior words and of the cited business behavior words and the number of times they occur together, generating business behavior relations and constructing a business behavior relation library.

The foregoing is directed to the preferred embodiment of the present invention, and it is understood that various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, and it is intended that such changes and modifications be considered as within the scope of the invention.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims

1. A business behavior relation extraction method based on 'citation relation' data is characterized by comprising the following steps:

collecting corpora, preprocessing the corpora and constructing a corpus, wherein collecting the corpora comprises searching for existing corpora and downloading and crawling corpora from the internet, and preprocessing the corpora specifically comprises corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal;

2. The method for extracting business behavior relationship based on "citation relationship" data according to claim 1, wherein the business behavior words are extracted from all document titles in the corpus, specifically:

screening and testing business behavior words;

3. The method for extracting business behavior relationship based on "citation relationship" data as claimed in claim 1, wherein the relation data between all document titles and the cited document titles are extracted from the corpus to construct a citation relation database, specifically:

4. A business behavior relation extraction device based on 'citation relation' data is characterized by comprising:

the corpus module is specifically used for: searching for existing corpora, and downloading and crawling corpora from the network; and performing corpus cleaning, word segmentation, part-of-speech tagging and stop-word removal on the corpora;

5. The device for extracting business behavior relationship based on "citation relationship" data as claimed in claim 4, wherein the business behavior thesaurus module is specifically configured to:

screening and testing business behavior words;

6. The device for extracting business behavior relationship based on "citation relationship" data as claimed in claim 4, wherein the citation relationship database module is specifically configured to:

7. A citation relationship data-based business behavior relationship extraction device, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the citation relationship data-based business behavior relationship extraction method according to any one of claims 1 to 3 when executing the computer program.