CN110347903A

CN110347903A - Intelligent information assessment and marketing system based on statistical language model algorithm

Info

Publication number: CN110347903A
Application number: CN201910647150.4A
Authority: CN
Inventors: 吴俊哲; 吴剑东
Original assignee: Jiangsu Dongwang Information Technology Co Ltd
Current assignee: Jiangsu Dongwang Information Technology Co Ltd
Priority date: 2019-07-17
Filing date: 2019-07-17
Publication date: 2019-10-18

Abstract

The invention discloses an intelligent information assessment and marketing system based on a statistical language model algorithm. The key points of the technical solution are the following series of steps: a statistical language model; a bidirectional matching word segmentation algorithm; cooperation of the statistical language model algorithm and the bidirectional matching segmentation algorithm; keyword extraction; and internal evaluation. The advantages of the invention are that it can replace manual search engine optimization of information, saving a large amount of manual labor and thus labor cost; it offers higher processing speed and efficiency than manual work; and it can process network text data at massive scale with more accurate results, which helps improve the accuracy of the assessment results and the guidance they provide for subsequent marketing.

Description

Intelligent information assessment and marketing system based on statistical language model algorithm

Technical field

The present invention relates to a web information search tool, and more specifically to an intelligent information assessment and marketing system based on a statistical language model algorithm.

Background art

Search engine optimization works by understanding how search engines crawl internet pages, index them, and determine the ranking of search results for particular keywords, and then optimizing web pages accordingly so that they rank higher in search engines, thereby increasing website traffic and ultimately promoting the website's sales or publicity.

This prior art is commonly referred to as SEO (Search Engine Optimization).

SEO relates to the present invention mainly in three aspects:

First, optimizing web page META tags: the content title, the keywords, and the description are all targets of tag optimization.

Second, optimizing the links inside the website, which are a principal factor affecting the website's click-through rate; related links and anchor-text links should all be optimized to meet the needs of website users.

Third, compressing and improving the web page code, mainly by maintaining the uniqueness of the site home page, the inner pages, and the main business links within the website.

The prior art has the following disadvantage: current SEO is performed manually, which is time-consuming and laborious and is easily affected by the skill of the practitioner.

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide an intelligent information assessment and marketing system based on a statistical language model algorithm.

To achieve the above object, the present invention adopts the following technical solution: an intelligent information assessment and marketing system based on a statistical language model algorithm, comprising the following steps:

Step 1: statistical language model. A statistical language model algorithm is used; a statistical language model can describe the statistical properties of a word sequence, for example by learning the joint probability distribution of the words in the sequence. If w1 to wm denote, in order, the words of a sentence, the probability of the sentence can be written simply as:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$

where the conditional probabilities in the model can be computed from word frequencies:

$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$

Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary; if a match succeeds, the word is cut out. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching, which are combined to form the bidirectional matching segmentation algorithm;

Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm cooperate to obtain part-of-speech tags for the target characters;

Step 4: keyword extraction. The keyword extraction formula is:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}$$

where j denotes a document and i denotes a word in that document;

tf denotes the number of times a word appears in a document;

df denotes the number of documents in the entire corpus that contain the word;

N denotes the total number of documents in the entire corpus;

From the formula: the more often a word appears in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the larger its idf value, log(N/df); therefore, the larger the tf-idf value of a word, the more likely the word is a keyword;

Step 5: internal evaluation. The keywords obtained in step 4 are assessed on five features: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Different weight coefficients are assigned to these five features to determine whether the information content tends toward academic wording; the higher the analysis index value, the more valuable the information content is considered to be.
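The keyword extraction of step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the weight is the standard product of tf and log(N/df), that documents have already been segmented into word lists, and the function names `tf_idf` and `top_keywords` are hypothetical, not taken from the patent.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    documents: list of documents, each a list of already-segmented words.
    Assumed weight: tf(word, doc) * log(N / df(word)).
    """
    n_docs = len(documents)
    # df: number of documents in the corpus that contain each word.
    df = Counter()
    for words in documents:
        df.update(set(words))
    weights = []
    for words in documents:
        tf = Counter(words)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

def top_keywords(doc_weights, k=3):
    """Return the k words with the largest weight in one document."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)[:k]
```

A word that appears in every document gets weight 0 because log(N/N) = 0, which matches the observation that words found throughout the corpus are poor keyword candidates.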

Preferably, in step 1, assuming that the current word depends only on the one word before it, the formula simplifies to:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_m \mid w_{m-1})$$

Using this formula, the probability that any word occurs after another word can be counted; its accuracy depends on the size of the statistical sample.
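The simplified model can be estimated directly from counts. The sketch below is illustrative only: it assumes a corpus of already-segmented sentences, applies no smoothing, and the function names are hypothetical rather than part of the patent.

```python
from collections import Counter

def train_bigram(sentences):
    """Count single words and adjacent word pairs over segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) as count(prev, word) / count(prev).

    count(prev) also includes sentence-final occurrences of prev,
    which is a simplification of this sketch.
    """
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words, unigrams, bigrams):
    """Approximate P(w1..wm) as P(w1) times the product of P(wi | w(i-1))."""
    total = sum(unigrams.values())
    if not words or total == 0:
        return 0.0
    p = unigrams[words[0]] / total
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word, unigrams, bigrams)
    return p
```

Because unseen word pairs receive probability 0 here, the estimates become more reliable only as the corpus grows, which is the dependence on sample size noted above.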

Preferably, in step 2, the principle of the forward maximum matching algorithm is:

1) Scanning from left to right, take the first m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;

2) Match the string against the entries in the dictionary;

3) If the match succeeds, cut the string out as a word;

4) If the match fails, remove the last character of the string and match again; repeat this process until the whole text has been segmented;

The reverse maximum matching algorithm is similar in principle to the forward maximum matching algorithm, except that the scanning direction is right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.
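The forward, reverse, and bidirectional procedures described above can be sketched as follows. The patent does not state how the two results are compared, so the rule used here (prefer fewer words, then fewer single-character words, otherwise the reverse result) is an assumed heuristic, and all function names are hypothetical. Unmatched single characters are emitted as one-character words so that scanning always advances.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, always cutting the longest dictionary entry."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest window first, dropping the last character on failure.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, dictionary):
    """Scan right to left, dropping the leftmost character on failure."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words

def bidirectional_max_match(text, dictionary):
    """Compare both segmentations using the assumed heuristic."""
    fwd = forward_max_match(text, dictionary)
    rev = reverse_max_match(text, dictionary)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev
    singles = lambda ws: sum(len(w) == 1 for w in ws)
    return fwd if singles(fwd) < singles(rev) else rev
```

Passing the dictionary as a Python `set` keeps each lookup constant-time, which matters when the dictionary is large.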

Preferably, the system further comprises step 6: external assessment by duplicate checking. The external assessment mainly checks the information content for duplication; if the published information can be found everywhere on the internet, the information itself has no value.
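The text does not specify how duplication is measured. As one possible illustration only, the sketch below compares character shingles with Jaccard similarity; the technique, the shingle length `k`, the `threshold` value, and the function names are all assumptions and are not taken from the patent.

```python
def shingles(text, k=5):
    """Return the set of k-character substrings of text, ignoring whitespace."""
    text = "".join(text.split())
    if len(text) <= k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_duplicate(candidate, published, threshold=0.8, k=5):
    """Report whether candidate closely matches any already-published text."""
    cand = shingles(candidate, k)
    return any(jaccard(cand, shingles(p, k)) >= threshold for p in published)
```

Under this sketch, content flagged as a duplicate would be treated as having little value for publication, in line with the external assessment described above.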

Compared with the prior art, the present invention has the following advantages: it can replace manual search engine optimization of information, saving a large amount of manual labor and thus labor cost; it offers higher processing speed and efficiency than manual work; and it can process network text data at massive scale with more accurate results, which helps improve the accuracy of the assessment results and the guidance they provide for subsequent marketing.

Brief description of the drawings

Fig. 1 is a flow diagram of the overall logic in an embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;

Fig. 2 is a flow diagram of the bidirectional matching segmentation algorithm in the embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;

Fig. 3 is a flow diagram of keyword extraction in the embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;

Fig. 4 is a flow diagram of the internal evaluation in the embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;

Fig. 5 is a flow diagram of the external assessment by duplicate checking in the embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention.

Detailed description of the embodiments

An embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention is further described below with reference to the accompanying drawings.

An intelligent information assessment and marketing system based on a statistical language model algorithm, comprising the following steps:

However, the full formula of step 1 is extremely expensive to compute. If we assume that the current word depends only on the one word before it, the formula simplifies to:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_m \mid w_{m-1})$$

Using this formula, the probability that any word occurs after another word can be counted; its accuracy depends on the size of the statistical sample;

Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary; if a match succeeds, the word is cut out. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching, which are combined to form the bidirectional matching segmentation algorithm. The principle of the forward maximum matching algorithm is:

1) Scanning from left to right, take the first m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;

2) Match the string against the entries in the dictionary;

3) If the match succeeds, cut the string out as a word;

4) If the match fails, remove the last character of the string and match again; repeat this process until the whole text has been segmented;

The reverse maximum matching algorithm is similar in principle to the forward maximum matching algorithm, except that the scanning direction is right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation;

Step 4: keyword extraction. The keyword extraction formula is:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}$$

where j denotes a document and i denotes a word in that document;

tf denotes the number of times a word appears in a document;

df denotes the number of documents in the entire corpus that contain the word;

N denotes the total number of documents in the entire corpus;

Step 5: internal evaluation. The keywords obtained in step 4 are assessed on five features: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Different weight coefficients are assigned to these five features to determine whether the information content tends toward academic wording; the higher the analysis index value, the more valuable the information content is considered to be;

Step 6: external assessment by duplicate checking. The external assessment mainly checks the information content for duplication; if the published information can be found everywhere on the internet, the information itself has no value.
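The internal evaluation of step 5 amounts to combining five feature frequencies using weight coefficients. A minimal sketch, assuming the analysis index is a plain weighted sum: the feature keys and the function name `academic_index` are hypothetical, and the patent gives neither the weight values nor the exact combination rule.

```python
# Hypothetical keys for the five features named in step 5.
FEATURES = (
    "passive_voice",
    "nouns",
    "complex_noun_phrases",
    "specific_verbs",
    "technical_terms",
)

def academic_index(feature_values, weights):
    """Return the weighted sum of the five feature frequencies.

    feature_values: {feature: measured frequency} for one piece of content.
    weights: {feature: weight coefficient}; the values are chosen by the
    operator and are an assumption of this sketch.
    """
    missing = [f for f in FEATURES if f not in feature_values or f not in weights]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return sum(weights[f] * feature_values[f] for f in FEATURES)
```

With such an index, two pieces of content can be ranked by comparing their scores, the higher score indicating content that leans more toward academic wording.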

The present invention performs word segmentation and part-of-speech tagging on the information content by means of the statistical language model algorithm and the bidirectional matching algorithm, and extracts the keywords of the information content by means of the keyword extraction algorithm.

With the above functions, the present invention can replace manual search engine optimization of information. Its effects include:

1) A large amount of manual labor is saved, thereby saving labor cost;

2) Processing speed and efficiency are higher than those of manual work; network text data at massive scale can be processed, at an average of at least 500,000 documents per hour;

3) Because the segmentation algorithm and the keyword extraction algorithm are based on a massive corpus, the extracted keywords often reflect the main features of an article better than manual extraction. Keywords can be called the cornerstone of any search application: for users and search engines alike, keywords are the medium through which the two sides interact. The accuracy of keyword extraction determines the marketing result;

4) Through the external assessment function, the present invention always recommends the most original information content, greatly improving search engine friendliness;

5) Through feedback on marketing results, the present invention can learn by itself to improve its marketing strategy, and thus has a certain capacity to grow.

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the concept of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims

1. An intelligent information assessment and marketing system based on a statistical language model algorithm, characterized by comprising the following steps:

Step 1: statistical language model. A statistical language model algorithm is used; a statistical language model can describe the statistical properties of a word sequence, for example by learning the joint probability distribution of the words in the sequence. If w1 to wm denote, in order, the words of a sentence, the probability of the sentence can be written simply as:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$

Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary; if a match succeeds, the word is cut out. According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching, which are combined to form the bidirectional matching segmentation algorithm;

Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm cooperate to obtain part-of-speech tags for the target characters;

Step 4: keyword extraction. The keyword extraction formula is:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}$$

where j denotes a document and i denotes a word in that document;

tf denotes the number of times a word appears in a document;

df denotes the number of documents in the entire corpus that contain the word;

N denotes the total number of documents in the entire corpus;

From the formula: the more often a word appears in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the larger its idf value, log(N/df); therefore, the larger the tf-idf value of a word, the more likely the word is a keyword;

Step 5: internal evaluation. The keywords obtained in step 4 are assessed on five features: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Different weight coefficients are assigned to these five features to determine whether the information content tends toward academic wording; the higher the analysis index value, the more valuable the information content is considered to be.

2. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in step 1, assuming that the current word depends only on the one word before it, the formula simplifies to:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_m \mid w_{m-1})$$

Using this formula, the probability that any word occurs after another word can be counted; its accuracy depends on the size of the statistical sample.

3. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in step 2, the principle of the forward maximum matching algorithm is:

1) Scanning from left to right, take the first m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;

2) Match the string against the entries in the dictionary;

3) If the match succeeds, cut the string out as a word;

4) If the match fails, remove the last character of the string and match again; repeat this process until the whole text has been segmented;

The reverse maximum matching algorithm is similar in principle to the forward maximum matching algorithm, except that the scanning direction is right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.

4. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that it further comprises step 6: external assessment by duplicate checking. The external assessment mainly checks the information content for duplication; if the published information can be found everywhere on the internet, the information itself has no value.