CN110347903A - Intelligent information assessment and marketing system based on statistical language model algorithm - Google Patents
Intelligent information assessment and marketing system based on statistical language model algorithm
- Publication number
- CN110347903A
- Authority
- CN
- China
- Prior art keywords
- word
- language model
- matching
- statistical language
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The invention discloses an intelligent information assessment and marketing system based on a statistical language model algorithm. The technical solution comprises a series of steps: building a statistical language model, applying a bidirectional matching word segmentation algorithm, combining the statistical language model algorithm with the bidirectional matching segmentation algorithm, extracting keywords, and performing an internal evaluation. The advantage of the invention is that it can replace manual search engine optimization of information, saving a large amount of manual labor and therefore labor cost. It processes information faster and more efficiently than human workers, can handle network text data at massive scale, and produces more accurate results, which helps improve the accuracy of the assessment and the guidance it gives to subsequent marketing.
Description
Technical field
The present invention relates to a web information search tool, and more specifically to an intelligent information assessment and marketing system based on a statistical language model algorithm.
Background
Search engine optimization works by understanding how search engines crawl internet pages, index them, and determine the ranking of results for particular keyword searches. Web pages are then optimized accordingly so that they rank higher in search engines, which increases site traffic and ultimately promotes the site's sales or publicity.
The prior art is commonly referred to as SEO (Search Engine Optimization).
Three aspects of SEO are mainly relevant to the present invention:
First, optimizing web page META tags: the content title, the keywords, and the content description are all targets of tag optimization.
Second, optimizing links within the website: internal links are a principal factor in a site's click-through rate, so related links and anchor-text links should be optimized to meet the needs of site users.
Third, compressing and improving web page code, mainly by keeping the site home page unique and maintaining the links between the site's pages and its main business.
The prior art has the following disadvantage: SEO is currently performed manually, which is time-consuming and laborious and makes the results dependent on the skill of the practitioner.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide an intelligent information assessment and marketing system based on a statistical language model algorithm.
To achieve this object, the present invention adopts the following technical solution: an intelligent information assessment and marketing system based on a statistical language model algorithm, comprising the following steps:
Step 1: statistical language model. A statistical language model algorithm is used; a statistical language model describes the statistical properties of word sequences, for example by learning the joint probability distribution of the words in a sequence. If w_1 to w_m denote, in order, the words of a sentence, the probability of the sentence can be written simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is cut out. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching, and the two are combined to form the bidirectional matching segmentation algorithm.
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to assign part-of-speech tags to the target characters.
Step 4: keyword extraction. The keyword extraction formula is:
$$\mathrm{tfidf}_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf_{i,j} is the number of times word i occurs in document j;
df_i is the number of documents in the entire corpus that contain word i;
N is the total number of documents in the entire corpus.
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger its inverse document frequency log(N/df). Therefore, the larger the TF-IDF value of a word, the more likely the word is a keyword.
Step 5: internal evaluation. For the keywords obtained in Step 4, five features are assessed: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields an analysis index showing whether the information content tends toward academic wording. The higher the analysis index, the more valuable the information content is considered to be.
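The keyword weighting of Step 4 can be illustrated with a minimal sketch. It assumes the common TF-IDF form tf × log(N/df) with a natural logarithm and documents that are already segmented into word lists; the function names `tf_idf` and `top_keywords` and the toy corpus are hypothetical and do not come from the patent.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # each document counts at most once per word
    weights = []
    for words in docs:
        tf = Counter(words)    # raw occurrence count inside this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

def top_keywords(doc_weights, k):
    """The k highest-weighted words; ties are broken alphabetically."""
    return sorted(doc_weights, key=lambda w: (-doc_weights[w], w))[:k]

# Toy corpus (assumed): three already-segmented documents.
docs = [
    ["seo", "keyword", "keyword", "ranking"],
    ["seo", "link", "anchor"],
    ["seo", "page", "code"],
]
weights = tf_idf(docs)
keywords = top_keywords(weights[0], 2)
```

In this toy corpus "seo" appears in every document, so its weight is zero, while "keyword" is both frequent in the first document and rare elsewhere, which is the behaviour the step describes.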
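Preferably, in Step 1, the current word is assumed to depend only on the one word before it, so the calculation formula simplifies to:
$$P(w_1, w_2, \ldots, w_m) \approx P(w_1) \prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
With this formula, the probability that any word occurs after another given word can be estimated from counts; its accuracy depends on the size of the statistical sample.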
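As a rough illustration of this simplification, the sketch below estimates P(word | previous word) as count(previous, word) / count(previous) from a tiny pre-segmented corpus. The corpus, the function names `train_bigram` and `bigram_prob`, and the choice to return 0.0 for an unseen previous word are assumptions made for the example, not details given in the patent.

```python
from collections import Counter

def train_bigram(sentences):
    """Count single words and adjacent word pairs in pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0  # assumed fallback: no evidence for an unseen previous word
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus (assumed): each sentence is already a list of words.
corpus = [
    ["search", "engine", "optimization"],
    ["search", "engine", "ranking"],
    ["search", "results"],
]
unigrams, bigrams = train_bigram(corpus)
p = bigram_prob("search", "engine", unigrams, bigrams)  # 2 of 3 occurrences
```

With only three sentences the estimates are coarse, which reflects the remark that accuracy depends on the size of the statistical sample.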
Preferably, in Step 2, the forward maximum matching algorithm works as follows:
1) Scanning from left to right, take the first m characters of the sentence to be segmented as the candidate string, where m is the length of the longest entry in the initial dictionary;
2) Match the candidate string against the entries in the dictionary;
3) If the match succeeds, cut the candidate string out as a word;
4) If the match fails, remove the last character of the candidate string and match again; repeat this process until the entire text has been segmented.
The reverse maximum matching algorithm follows the same principle as forward maximum matching, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.
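The matching procedure above can be sketched as follows. The patent does not state how the forward and reverse results are compared, so the rule used here (prefer fewer words, then fewer single-character words, otherwise the reverse result) is a commonly used heuristic adopted as an assumption; the dictionary, the sample sentence, and all function names are likewise illustrative.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, dropping the last character after each failed match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if piece in dictionary or size == 1:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len):
    """Scan right to left, dropping the leftmost character after each failed match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if piece in dictionary or size == 1:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def bidirectional_max_match(text, dictionary):
    """Pick one of the two results using the assumed comparison heuristic."""
    max_len = max(len(entry) for entry in dictionary)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(words):
        # Fewer words first, then fewer single-character words.
        return (len(words), sum(len(w) == 1 for w in words))

    return fwd if cost(fwd) < cost(bwd) else bwd

# Toy dictionary and sentence (assumed for illustration).
dictionary = {"研究", "研究生", "生命", "的", "起源"}
text = "研究生命的起源"
result = bidirectional_max_match(text, dictionary)
```

Here the forward scan greedily takes "研究生" and is left with the stray character "命", while the reverse scan yields "研究 / 生命 / 的 / 起源"; comparing the two results is what lets the bidirectional method recover from that kind of greedy error.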
Preferably, the system further includes Step 6: external evaluation by duplicate checking. The external evaluation mainly checks the information content for duplication: if the published information can already be found everywhere on the internet, the information itself has no value.
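The patent does not say how duplication is measured. One possible approach, shown here purely as an assumed stand-in, is to compare word shingles with Jaccard similarity; the shingle size `k`, the `threshold` value, the function names, and the sample texts are all hypothetical.

```python
def shingles(words, k=3):
    """Set of all k-word windows in a token list."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_duplicate(candidate, published, threshold=0.5, k=3):
    """True if the candidate overlaps heavily with any already published text."""
    cand = shingles(candidate, k)
    return any(jaccard(cand, shingles(doc, k)) >= threshold for doc in published)

# Toy data (assumed): texts already tokenized into words.
published = ["the quick brown fox jumps over the lazy dog".split()]
near_copy = "the quick brown fox jumps over the lazy cat".split()
original = "statistical language models estimate word sequence probabilities".split()

copy_flag = is_duplicate(near_copy, published)
original_flag = is_duplicate(original, published)
```

Under these assumptions the near copy shares six of eight distinct shingles with the published text and is flagged, while the unrelated sentence shares none.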
Compared with the prior art, the present invention has the following advantages: it can replace manual search engine optimization of information, saving a large amount of manual labor and therefore labor cost; it processes information faster and more efficiently than human workers; and it can handle network text data at massive scale with more accurate results, which helps improve the accuracy of the assessment and the guidance it gives to subsequent marketing.
Description of the drawings
Fig. 1 is a flow diagram of the overall logic of an embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;
Fig. 2 is a flow diagram of the bidirectional matching segmentation algorithm in the embodiment;
Fig. 3 is a flow diagram of keyword extraction in the embodiment;
Fig. 4 is a flow diagram of the internal evaluation in the embodiment;
Fig. 5 is a flow diagram of the external evaluation by duplicate checking in the embodiment.
Specific embodiment
The embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm is further described below with reference to the drawings.
An intelligent information assessment and marketing system based on a statistical language model algorithm comprises the following steps:
Step 1: statistical language model. A statistical language model algorithm is used; a statistical language model describes the statistical properties of word sequences, for example by learning the joint probability distribution of the words in a sequence. If w_1 to w_m denote, in order, the words of a sentence, the probability of the sentence can be written simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Computing this formula directly is extremely expensive, however. If the current word is assumed to depend only on the one word before it, the calculation formula simplifies to:
$$P(w_1, w_2, \ldots, w_m) \approx P(w_1) \prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
With this formula, the probability that any word occurs after another given word can be estimated from counts; its accuracy depends on the size of the statistical sample.
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is cut out. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching, and the two are combined to form the bidirectional matching segmentation algorithm. The forward maximum matching algorithm works as follows:
1) Scanning from left to right, take the first m characters of the sentence to be segmented as the candidate string, where m is the length of the longest entry in the initial dictionary;
2) Match the candidate string against the entries in the dictionary;
3) If the match succeeds, cut the candidate string out as a word;
4) If the match fails, remove the last character of the candidate string and match again; repeat this process until the entire text has been segmented.
The reverse maximum matching algorithm follows the same principle as forward maximum matching, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to assign part-of-speech tags to the target characters.
Step 4: keyword extraction. The keyword extraction formula is:
$$\mathrm{tfidf}_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf_{i,j} is the number of times word i occurs in document j;
df_i is the number of documents in the entire corpus that contain word i;
N is the total number of documents in the entire corpus.
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger its inverse document frequency log(N/df). Therefore, the larger the TF-IDF value of a word, the more likely the word is a keyword.
Step 5: internal evaluation. For the keywords obtained in Step 4, five features are assessed: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields an analysis index showing whether the information content tends toward academic wording. The higher the analysis index, the more valuable the information content is considered to be.
Step 6: external evaluation by duplicate checking. The external evaluation mainly checks the information content for duplication: if the published information can already be found everywhere on the internet, the information itself has no value.
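The weighted scoring of Step 5 might look like the sketch below. The patent says only that the five features receive different weight coefficients; the specific weights, the assumption that each feature is expressed as a frequency between 0 and 1, and the names `WEIGHTS` and `analysis_index` are all hypothetical.

```python
FEATURES = (
    "passive_voice",
    "nouns",
    "complex_noun_phrases",
    "specific_verbs",
    "technical_terms",
)

# Hypothetical weights: the patent gives no values, only that they differ.
WEIGHTS = {
    "passive_voice": 0.15,
    "nouns": 0.15,
    "complex_noun_phrases": 0.20,
    "specific_verbs": 0.20,
    "technical_terms": 0.30,
}

def analysis_index(frequencies, weights=WEIGHTS):
    """Weighted sum of the five feature frequencies (missing features count as 0)."""
    return sum(weights[name] * frequencies.get(name, 0.0) for name in FEATURES)

# Assumed feature frequencies for two pieces of content.
academic_text = {
    "passive_voice": 0.4,
    "nouns": 0.5,
    "complex_noun_phrases": 0.3,
    "specific_verbs": 0.2,
    "technical_terms": 0.6,
}
casual_text = {
    "passive_voice": 0.1,
    "nouns": 0.3,
    "complex_noun_phrases": 0.05,
    "specific_verbs": 0.1,
    "technical_terms": 0.05,
}

academic_score = analysis_index(academic_text)
casual_score = analysis_index(casual_text)
```

With these assumed inputs the more academic text receives the higher index, matching the rule that a higher analysis index indicates more valuable content.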
The present invention segments the information content and tags parts of speech using the statistical language model algorithm and the bidirectional matching algorithm, and extracts the keywords of the information content using the keyword extraction algorithm.
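The patent does not detail how the language model and the matching algorithm cooperate. One plausible reading, sketched here as an assumption, is that the bigram model scores the forward and reverse segmentations and the more probable one is kept. The add-one smoothing, the toy counts, and the names `sequence_log_prob` and `pick_segmentation` are illustrative choices, and the score ignores the probability of the first word.

```python
import math
from collections import Counter

def sequence_log_prob(words, unigrams, bigrams, vocab_size):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen pairs from producing log(0).
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def pick_segmentation(candidates, unigrams, bigrams):
    """Return the candidate segmentation with the highest bigram score."""
    vocab_size = len(unigrams)
    return max(
        candidates,
        key=lambda ws: sequence_log_prob(ws, unigrams, bigrams, vocab_size),
    )

# Toy counts (assumed), as if gathered from a segmented corpus.
unigrams = Counter({"研究": 5, "生命": 4, "的": 20, "起源": 3, "研究生": 2, "命": 1})
bigrams = Counter({("研究", "生命"): 3, ("生命", "的"): 3, ("的", "起源"): 2})

forward = ["研究生", "命", "的", "起源"]   # forward maximum matching result
backward = ["研究", "生命", "的", "起源"]  # reverse maximum matching result
best = pick_segmentation([forward, backward], unigrams, bigrams)
```

Both candidates here contain the same number of words; when candidates differ in length, a raw product of probabilities favours the shorter sequence, so a real system would likely normalize the score.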
Through these functions, the present invention can replace manual search engine optimization of information. Its effects include:
1) It saves a large amount of manual labor and therefore labor cost;
2) It processes information faster and more efficiently than human workers and can handle network text data at massive scale, processing on average at least 500,000 documents per hour;
3) Because the segmentation and keyword extraction algorithms are based on a massive corpus, the extracted keywords often reflect the main features of an article better than manually chosen ones. Keywords can be regarded as the cornerstone of any search application: for both users and search engines, keywords are the medium through which the two interact. The accuracy of keyword extraction determines the marketing result;
4) Through the external evaluation function, the present invention always recommends the most original information content, which greatly improves search engine friendliness;
5) Through feedback on marketing results, the present invention can learn by itself to improve its marketing strategy, and so improves to some degree over time.
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.
Claims (4)
1. An intelligent information assessment and marketing system based on a statistical language model algorithm, characterized in that it comprises the following steps:
Step 1: statistical language model. A statistical language model algorithm is used; a statistical language model describes the statistical properties of word sequences, for example by learning the joint probability distribution of the words in a sequence. If w_1 to w_m denote, in order, the words of a sentence, the probability of the sentence can be written simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires an initial, sufficiently large dictionary (the technical term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is cut out. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching, and the two are combined to form the bidirectional matching segmentation algorithm;
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to assign part-of-speech tags to the target characters;
Step 4: keyword extraction. The keyword extraction formula is:
$$\mathrm{tfidf}_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf_{i,j} is the number of times word i occurs in document j;
df_i is the number of documents in the entire corpus that contain word i;
N is the total number of documents in the entire corpus.
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger its inverse document frequency log(N/df). Therefore, the larger the TF-IDF value of a word, the more likely the word is a keyword;
Step 5: internal evaluation. For the keywords obtained in Step 4, five features are assessed: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields an analysis index showing whether the information content tends toward academic wording. The higher the analysis index, the more valuable the information content is considered to be.
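2. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in Step 1, the current word is assumed to depend only on the one word before it, so the calculation formula simplifies to:
$$P(w_1, w_2, \ldots, w_m) \approx P(w_1) \prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
With this formula, the probability that any word occurs after another given word can be estimated from counts; its accuracy depends on the size of the statistical sample.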
3. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in Step 2, the forward maximum matching algorithm works as follows:
1) Scanning from left to right, take the first m characters of the sentence to be segmented as the candidate string, where m is the length of the longest entry in the initial dictionary;
2) Match the candidate string against the entries in the dictionary;
3) If the match succeeds, cut the candidate string out as a word;
4) If the match fails, remove the last character of the candidate string and match again; repeat this process until the entire text has been segmented.
The reverse maximum matching algorithm follows the same principle as forward maximum matching, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method to determine the correct segmentation.
4. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that it further comprises Step 6: external evaluation by duplicate checking. The external evaluation mainly checks the information content for duplication: if the published information can already be found everywhere on the internet, the information itself has no value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910647150.4A CN110347903A (en) | 2019-07-17 | 2019-07-17 | Intelligent information assessment and marketing system based on statistical language model algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910647150.4A CN110347903A (en) | 2019-07-17 | 2019-07-17 | Intelligent information assessment and marketing system based on statistical language model algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110347903A true CN110347903A (en) | 2019-10-18 |
Family
ID=68175016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910647150.4A Pending CN110347903A (en) | 2019-07-17 | 2019-07-17 | Intelligent information assessment and marketing system based on statistical language model algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347903A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818677A (en) * | 2021-02-22 | 2021-05-18 | 康美健康云服务有限公司 | Information evaluation method and system based on Internet |
CN116227488A (en) * | 2023-05-09 | 2023-06-06 | 北京拓普丰联信息科技股份有限公司 | Text word segmentation method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598532A (en) * | 2014-12-29 | 2015-05-06 | 中国联合网络通信有限公司广东省分公司 | Information processing method and device |
CN105426360A (en) * | 2015-11-12 | 2016-03-23 | 中国建设银行股份有限公司 | Keyword extracting method and device |
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
CN109902288A (en) * | 2019-01-17 | 2019-06-18 | 深圳壹账通智能科技有限公司 | Intelligent clause analysis method, device, computer equipment and storage medium |
CN109918657A (en) * | 2019-02-28 | 2019-06-21 | 云孚科技(北京)有限公司 | A method of extracting target keyword from text |
CN110019556A (en) * | 2017-12-27 | 2019-07-16 | 阿里巴巴集团控股有限公司 | A kind of topic news acquisition methods, device and its equipment |
-
2019
- 2019-07-17 CN CN201910647150.4A patent/CN110347903A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191018 |