CN110347903A - Intelligent information assessment and marketing system based on statistical language model algorithm - Google Patents

Intelligent information assessment and marketing system based on statistical language model algorithm

Info

Publication number
CN110347903A
Authority
CN
China
Prior art keywords
word
language model
matching
statistical language
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910647150.4A
Other languages
Chinese (zh)
Inventor
吴俊哲
吴剑东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Dongwang Information Technology Co Ltd
Original Assignee
Jiangsu Dongwang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Dongwang Information Technology Co Ltd
Priority to CN201910647150.4A
Publication of CN110347903A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses an intelligent information assessment and marketing system based on a statistical language model algorithm. The key points of the technical solution are a series of steps: building a statistical language model, applying a bidirectional matching word segmentation algorithm, combining the statistical language model algorithm with the bidirectional matching segmentation algorithm, extracting keywords, and performing an internal evaluation. The advantage of the invention is that it can replace manual search engine optimization of information, saving a large amount of manual labor and therefore labor cost. Its processing speed and efficiency are higher than those of manual work, it can process network text data at massive scale, and its results are more accurate, which helps improve the accuracy of the assessment results and the guidance they provide for subsequent marketing.

Description

Intelligent information assessment and marketing system based on statistical language model algorithm
Technical field
The present invention relates to a web information search tool, and more specifically to an intelligent information assessment and marketing system based on a statistical language model algorithm.
Background art
Optimizing for search engine indexing means understanding how the various search engines crawl web pages, index them, and determine the ranking of search results for particular keywords, and then optimizing web pages accordingly so that they rank higher in search engines, thereby increasing website traffic and ultimately improving the website's sales or publicity.
This practice is commonly known as SEO (Search Engine Optimization).
SEO relates to the present invention mainly in three aspects:
First, optimizing the META tags of a web page: the content title, the keywords, and the description are all targets of tag optimization.
Second, optimizing the internal links of the website, which are the main factor affecting its click-through rate: related links and anchor-text links should all be optimized to meet the needs of the website's users.
Third, compressing and improving the web page code, mainly to preserve the uniqueness of the site's home page, its inner pages, and its links to the main business.
The prior art has the following disadvantage: SEO is currently done manually, which is time-consuming and laborious and is easily affected by the skill of the practitioner.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide an intelligent information assessment and marketing system based on a statistical language model algorithm.
To achieve the above object, the present invention adopts the following technical solution: an intelligent information assessment and marketing system based on a statistical language model algorithm, comprising the following steps:
Step 1: statistical language model. A statistical language model algorithm is used. A statistical language model describes the statistical properties of a word sequence, for example by learning the joint probability distribution of the words in the sequence. If w1 to wm denote, in order, the words of a sentence, the probability of the sentence can be expressed simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires a sufficiently large initial dictionary (the technical-term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is split off. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching; combining the two gives the bidirectional matching segmentation algorithm;
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to obtain part-of-speech tags for the target text;
Step 4: extract keywords. The keyword extraction formula is:
$$w_{ij} = tf_{ij} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf denotes the number of times a word occurs in a document;
df denotes the number of documents in the entire corpus that contain the word;
N denotes the total number of documents in the entire corpus;
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger log(N/df). Therefore, the larger the tf-idf value of a word, the more likely the word is a keyword;
Step 5: internal evaluation. For the keywords obtained in step 4, five features are evaluated: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields a measure of whether the information content tends toward an academic vocabulary. The higher the analysis index value, the more valuable the information content is considered to be.
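The weighting in step 5 can be illustrated with a minimal sketch. The source names the five features but gives neither the weights nor the scoring formula, so the weighted sum, the feature keys, and the weight values below are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical feature names and weights: the source lists the five
# features but does not publish the weight coefficients.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "passive_voice": 0.15,
    "nouns": 0.15,
    "complex_noun_phrases": 0.20,
    "specific_verbs": 0.20,
    "technical_terms": 0.30,
}


def academic_index(features: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of feature frequencies, each expected in [0, 1]."""
    missing = set(weights) - set(features)
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")
    return sum(weights[name] * features[name] for name in weights)
```

With weights that sum to 1, the index stays in the same range as the input frequencies, which makes scores comparable across documents.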
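Preferably, in step 1, the current word is assumed to depend only on the one word before it, so that the calculation formula simplifies to:
$$P(w_1, \ldots, w_m) \approx P(w_1)\prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
Using the above formula, the probability that any word occurs after another word can be estimated; the accuracy depends on the size of the statistical sample.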
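A minimal sketch of this bigram estimate, assuming the corpus has already been segmented into lists of words. The class name `BigramModel` and the toy corpus are hypothetical, and no smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter
from typing import Iterable, List


class BigramModel:
    """Maximum-likelihood bigram estimates from pre-segmented sentences."""

    def __init__(self, sentences: Iterable[List[str]]):
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for words in sentences:
            self.unigrams.update(words)
            # Count each adjacent (previous word, current word) pair.
            self.bigrams.update(zip(words, words[1:]))

    def prob(self, prev: str, word: str) -> float:
        """P(word | prev) = count(prev, word) / count(prev)."""
        if self.unigrams[prev] == 0:
            return 0.0
        return self.bigrams[(prev, word)] / self.unigrams[prev]


corpus = [
    ["search", "engine", "optimization"],
    ["search", "engine", "ranking"],
    ["search", "results"],
]
model = BigramModel(corpus)
print(model.prob("search", "engine"))  # 2 of the 3 "search" tokens precede "engine"
```

A larger corpus gives more reliable counts, which is the dependence on sample size noted above.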
Preferably, in step 2, the forward maximum matching algorithm works as follows:
1) scanning from left to right, take m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;
2) match the string against the entries in the dictionary;
3) if the match succeeds, split the string off as a word;
4) if the match fails, remove the last character of the string and match again; repeat this process until the entire text has been segmented;
The reverse maximum matching algorithm is similar to the forward maximum matching algorithm, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method in order to determine the correct segmentation.
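The matching procedure above can be sketched as follows. The source only says the two results are compared, so the tie-break used here (fewer words, then fewer single-character words, then the reverse result) is an assumed heuristic, and the function names and the small dictionary are hypothetical.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan advances.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Scan right to left, dropping the leftmost character on a miss."""
    max_len = max(map(len, dictionary), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text: str, dictionary: Set[str]) -> List[str]:
    fwd = forward_max_match(text, dictionary)
    bwd = backward_max_match(text, dictionary)
    if fwd == bwd:
        return fwd

    # Assumed tie-break: fewer words, then fewer single-character words;
    # if still tied, the reverse result is kept.
    def cost(seg: List[str]):
        return (len(seg), sum(1 for w in seg if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


dictionary = {"研究", "研究生", "生命", "起源"}
print(bidirectional_max_match("研究生命起源", dictionary))
# ['研究', '生命', '起源']
```

In this example the forward scan greedily takes "研究生" and leaves a stray single character, while the reverse scan yields three dictionary words, so the reverse result is selected.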
Preferably, the system further comprises step 6: external evaluation by duplicate checking. The external evaluation is mainly a duplicate check of the information content: if the published information can be found everywhere on the Internet, the information itself has no value.
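The source does not say how the duplicate check is carried out. As one possible illustration, the sketch below compares a candidate text with already published texts using Jaccard similarity over character shingles; the technique, the shingle length `k`, and the `threshold` value are all assumptions, not the patent's stated method.

```python
from typing import Iterable, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Set of overlapping k-character substrings, whitespace-normalised."""
    norm = " ".join(text.split())
    if len(norm) <= k:
        return {norm} if norm else set()
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(candidate: str, published: Iterable[str], k: int = 5) -> float:
    """Highest similarity between the candidate and any published text."""
    cand = shingles(candidate, k)
    return max((jaccard(cand, shingles(doc, k)) for doc in published), default=0.0)


def is_duplicate(candidate: str, published: Iterable[str],
                 threshold: float = 0.8, k: int = 5) -> bool:
    # threshold is a hypothetical cut-off, to be tuned on real data.
    return max_similarity(candidate, published, k) >= threshold
```

A score near 1 means the candidate is close to a verbatim copy of some published text; a score near 0 suggests original content.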
Compared with the prior art, the present invention has the following advantages: it can replace manual search engine optimization of information, saving a large amount of manual labor and therefore labor cost; its processing speed and efficiency are higher than those of manual work; and it can process network text data at massive scale with more accurate results, which helps improve the accuracy of the assessment results and the guidance they provide for subsequent marketing.
Brief description of the drawings
Fig. 1 is a flow diagram of the overall logic in an embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention;
Fig. 2 is a flow diagram of the bidirectional matching segmentation algorithm in the embodiment;
Fig. 3 is a flow diagram of keyword extraction in the embodiment;
Fig. 4 is a flow diagram of the internal evaluation in the embodiment;
Fig. 5 is a flow diagram of the external evaluation by duplicate checking in the embodiment.
Detailed description of the embodiments
The embodiment of the intelligent information assessment and marketing system based on a statistical language model algorithm according to the present invention is further described below with reference to the accompanying drawings.
An intelligent information assessment and marketing system based on a statistical language model algorithm comprises the following steps:
Step 1: statistical language model. A statistical language model algorithm is used. A statistical language model describes the statistical properties of a word sequence, for example by learning the joint probability distribution of the words in the sequence. If w1 to wm denote, in order, the words of a sentence, the probability of the sentence can be expressed simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Computing this formula directly is extremely expensive, however. If the current word is assumed to depend only on the one word before it, the calculation formula simplifies to:
$$P(w_1, \ldots, w_m) \approx P(w_1)\prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
Using the above formula, the probability that any word occurs after another word can be estimated; the accuracy depends on the size of the statistical sample;
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires a sufficiently large initial dictionary (the technical-term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is split off. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching; combining the two gives the bidirectional matching segmentation algorithm. The forward maximum matching algorithm works as follows:
1) scanning from left to right, take m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;
2) match the string against the entries in the dictionary;
3) if the match succeeds, split the string off as a word;
4) if the match fails, remove the last character of the string and match again; repeat this process until the entire text has been segmented;
The reverse maximum matching algorithm is similar to the forward maximum matching algorithm, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method in order to determine the correct segmentation;
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to obtain part-of-speech tags for the target text;
Step 4: extract keywords. The keyword extraction formula is:
$$w_{ij} = tf_{ij} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf denotes the number of times a word occurs in a document;
df denotes the number of documents in the entire corpus that contain the word;
N denotes the total number of documents in the entire corpus;
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger log(N/df). Therefore, the larger the tf-idf value of a word, the more likely the word is a keyword;
Step 5: internal evaluation. For the keywords obtained in step 4, five features are evaluated: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields a measure of whether the information content tends toward an academic vocabulary. The higher the analysis index value, the more valuable the information content is considered to be;
Step 6: external evaluation by duplicate checking. The external evaluation is mainly a duplicate check of the information content: if the published information can be found everywhere on the Internet, the information itself has no value.
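Step 3 does not spell out how the language model and the segmentation algorithm cooperate. One possible reading, covering only the choice between candidate segmentations and not the part-of-speech tagging, is to score the forward and reverse results with the bigram model and keep the more probable one. The sketch below follows that reading; the add-one smoothing, the `<s>` start marker, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def train_counts(sentences: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Collect unigram and bigram counts, with a start-of-sentence marker."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(words: List[str], unigrams: Counter, bigrams: Counter) -> float:
    """Bigram log-probability with add-one smoothing (an assumption)."""
    vocab = max(len(unigrams), 1)
    padded = ["<s>"] + list(words)
    return sum(
        math.log((bigrams[(p, w)] + 1) / (unigrams[p] + vocab))
        for p, w in zip(padded, padded[1:])
    )


def pick_segmentation(candidates: List[List[str]],
                      unigrams: Counter, bigrams: Counter) -> List[str]:
    """Return the candidate segmentation the bigram model finds most likely."""
    return max(candidates, key=lambda seg: log_prob(seg, unigrams, bigrams))
```

Smoothing keeps unseen word pairs from producing a zero probability, so two candidates can still be ranked when neither appears verbatim in the training corpus.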
The present invention performs word segmentation and part-of-speech tagging of the information content by means of the statistical language model algorithm and the bidirectional matching algorithm, and extracts the keywords of the information content by means of the keyword extraction algorithm.
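The keyword extraction step can be sketched as follows, assuming the common form w = tf × log(N / df) with raw term counts and the natural logarithm. The exact variant used by the system is not given in the source, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document weights w_ij = tf_ij * log(N / df_i)."""
    n_docs = len(docs)
    df: Counter = Counter()
    for words in docs:
        df.update(set(words))  # each document counts once per word
    weights = []
    for words in docs:
        tf = Counter(words)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


def top_keywords(weights: Dict[str, float], k: int = 3) -> List[str]:
    """Words with the highest weights, ties broken alphabetically."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


docs = [
    ["seo", "keyword", "keyword", "ranking"],
    ["seo", "link"],
    ["seo", "page"],
]
weights = tf_idf(docs)
print(top_keywords(weights[0], 1))  # ['keyword']
```

Here "seo" appears in every document, so log(N / df) is zero and it is never chosen as a keyword, while "keyword" is both frequent in the first document and rare in the corpus.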
With the above functions, the present invention can replace manual search engine optimization of information. Its effects include:
1) a large amount of manual labor is saved, which reduces labor cost;
2) processing speed and efficiency are higher than those of manual work, and network text data of massive scale can be processed, at an average of at least 500,000 documents per hour;
3) because the segmentation and keyword extraction algorithms are backed by a massive corpus, the extracted keywords often reflect the main features of an article better than manually chosen ones. Keywords can be regarded as the cornerstone of the entire search application: for both the user and the search engine, keywords are the medium through which the two sides interact. The accuracy of keyword extraction determines the marketing result;
4) through the external evaluation function, the present invention always recommends the most original information content, greatly improving search-engine friendliness;
5) through feedback on marketing results, the present invention can learn by itself to improve its marketing strategy and therefore has a certain capacity for growth.
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions under the concept of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (4)

1. An intelligent information assessment and marketing system based on a statistical language model algorithm, characterized by comprising the following steps:
Step 1: statistical language model. A statistical language model algorithm is used. A statistical language model describes the statistical properties of a word sequence, for example by learning the joint probability distribution of the words in the sequence. If w1 to wm denote, in order, the words of a sentence, the probability of the sentence can be expressed simply as:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
where the conditional probabilities in the model can be computed from word frequencies:
$$P(w_i \mid w_1, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_1, \ldots, w_i)}{\mathrm{count}(w_1, \ldots, w_{i-1})}$$
Step 2: bidirectional matching word segmentation algorithm. Segmentation based on string matching, also called mechanical segmentation, requires a sufficiently large initial dictionary (the technical-term dictionary and the general dictionary in Fig. 1). The character string to be segmented is matched against the entries in the dictionary, and whenever a match succeeds the matched word is split off. Depending on the scanning direction, string-matching segmentation is divided into forward matching and reverse matching; combining the two gives the bidirectional matching segmentation algorithm;
Step 3: the statistical language model algorithm and the bidirectional matching segmentation algorithm work together to obtain part-of-speech tags for the target text;
Step 4: extract keywords. The keyword extraction formula is:
$$w_{ij} = tf_{ij} \times \log\frac{N}{df_i}$$
where j denotes a document and i denotes a word in that document;
tf denotes the number of times a word occurs in a document;
df denotes the number of documents in the entire corpus that contain the word;
N denotes the total number of documents in the entire corpus;
From the formula: the more often a word occurs in a document, the larger its tf value; the fewer documents in the entire corpus contain the word, the smaller its df value and the larger log(N/df). Therefore, the larger the tf-idf value of a word, the more likely the word is a keyword;
Step 5: internal evaluation. For the keywords obtained in step 4, five features are evaluated: passive voice, nouns, complex noun phrases, the frequency of specific verbs, and the frequency of technical terms. Each of the five features is given a different weight coefficient, which yields a measure of whether the information content tends toward an academic vocabulary. The higher the analysis index value, the more valuable the information content is considered to be.
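2. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in step 1, the current word is assumed to depend only on the one word before it, so that the calculation formula simplifies to:
$$P(w_1, \ldots, w_m) \approx P(w_1)\prod_{i=2}^{m} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
Using the above formula, the probability that any word occurs after another word can be estimated; the accuracy depends on the size of the statistical sample.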
3. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized in that: in step 2, the forward maximum matching algorithm works as follows:
1) scanning from left to right, take m characters of the sentence to be segmented as the matching string, where m is the length of the longest entry in the initial dictionary;
2) match the string against the entries in the dictionary;
3) if the match succeeds, split the string off as a word;
4) if the match fails, remove the last character of the string and match again; repeat this process until the entire text has been segmented;
The reverse maximum matching algorithm is similar to the forward maximum matching algorithm, except that the scanning direction is from right to left and, when a match fails, the leftmost character is removed. The bidirectional maximum matching method compares the segmentation result of the forward maximum matching method with that of the reverse maximum matching method in order to determine the correct segmentation.
4. The intelligent information assessment and marketing system based on a statistical language model algorithm according to claim 1, characterized by further comprising step 6: external evaluation by duplicate checking. The external evaluation is mainly a duplicate check of the information content: if the published information can be found everywhere on the Internet, the information itself has no value.
CN201910647150.4A 2019-07-17 2019-07-17 Intelligent information assessment and marketing system based on statistical language model algorithm Pending CN110347903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910647150.4A CN110347903A (en) 2019-07-17 2019-07-17 Intelligent information assessment and marketing system based on statistical language model algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910647150.4A CN110347903A (en) 2019-07-17 2019-07-17 Intelligent information assessment and marketing system based on statistical language model algorithm

Publications (1)

Publication Number Publication Date
CN110347903A true CN110347903A (en) 2019-10-18

Family

ID=68175016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910647150.4A Pending CN110347903A (en) 2019-07-17 2019-07-17 Intelligent information assessment and marketing system based on statistical language model algorithm

Country Status (1)

Country Link
CN (1) CN110347903A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818677A (en) * 2021-02-22 2021-05-18 康美健康云服务有限公司 Information evaluation method and system based on Internet
CN116227488A (en) * 2023-05-09 2023-06-06 北京拓普丰联信息科技股份有限公司 Text word segmentation method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN109902288A (en) * 2019-01-17 2019-06-18 深圳壹账通智能科技有限公司 Intelligent clause analysis method, device, computer equipment and storage medium
CN109918657A (en) * 2019-02-28 2019-06-21 云孚科技(北京)有限公司 A method of extracting target keyword from text
CN110019556A (en) * 2017-12-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of topic news acquisition methods, device and its equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191018)