CN101706794A - Information browsing and retrieval method based on semantic entity-relationship model and visualized recommendation - Google Patents


Info

Publication number
CN101706794A
Authority
CN
China
Prior art keywords
semantic entity
entity
user
data
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910199284A
Other languages
Chinese (zh)
Other versions
CN101706794B (en)
Inventor
罗迒哉
范建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Angeray Electronic Technology Co ltd
Original Assignee
Shanghai Xianzhi Information Science & Technology Co Ltd
Application filed by Shanghai Xianzhi Information Science & Technology Co Ltd
Priority to CN2009101992840A
Publication of CN101706794A
Application granted
Publication of CN101706794B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation, comprising the following steps: data are first collected from the Internet at regular intervals; semantic entities and relations are then extracted, and the resulting data are converted into the original semantic entity-relationship model $D_r$, which is added to a historical database after a time delay; the data in the historical database are convolved with the user's learning/forgetting curve to generate a user knowledge model $K_U$ representing what the user already knows; and $K_U$ is used to predict the data in $D_r$. The method has the following advantages: 1. users can directly view the information they are interested in; 2. reasonably good recommendations can be obtained without any input; 3. both text and multimedia information such as videos and images can be queried, and cross-media queries are also supported; and 4. unstructured information can be viewed intuitively.

Description

Information browsing and retrieval method based on semantic entity-relationship model and visualized recommendation
Technical field
The present invention relates to a novel technique for browsing and retrieving massive amounts of information. Based on a semantic entity-relationship model and visualized recommendation, it provides services such as browsing and retrieval of massive unstructured information.
Background art
Massive unstructured data (for example, the Internet) conceals a wealth of information. This information can be valuable to the data owner in many respects. For example, a state security department can infer another country's true attitude toward China from that country's news reports, and an enterprise can detect abnormal transactions in its own management data to keep losses from spreading. However, all of this information is buried deep in large volumes of data. To obtain it, users must browse a large amount of existing data and dig out the parts that interest them. Because the data volume is so large, browsing and organizing this information manually is not feasible.
At present, search engine technology is the most advanced semi-automatic technology for obtaining such information. It decomposes data into simple keywords and indexes and retrieves massive data using inverted file indexes, Boolean retrieval and ranking techniques (for example, PageRank and HITS).
However, existing search engine technology still struggles to meet user needs in this area. First, it requires the user's need to be clear and specific, because only a clear and specific need can be translated into query keywords. In most applications involving massive data, however, the user has no specific need. For example, a user who wants to browse the news generally does not know what has happened (otherwise it would not be news); a financial regulator who wants to monitor abnormal transactions cannot define in advance what "abnormal" means. In such cases the user cannot find suitable keywords to describe the need and therefore can hardly use any search engine technology to obtain the required information.
To solve this problem, recommendation and browsing are indispensable. Because the user's need is unclear, the system must analyze, synthesize and summarize all the data, then present the information most likely to attract the user intuitively and efficiently, so that the user finds the most needed information while browsing. Achieving this requires three functions: first, mining and analyzing the massive data and quantitatively evaluating how much attention the user would pay to each piece of information; second, presenting all the information to the user intuitively and efficiently; third, providing means to browse and analyze the massive information so that the user can find what is really needed. Current search engine technology cannot provide these three functions, so it rarely achieves good results in these fields.
Summary of the invention
The purpose of the invention is to provide an information retrieval method that mines and analyzes massive multimedia unstructured data and presents the information most likely to attract the user intuitively and efficiently, so that the user finds the most needed information while browsing.
To achieve this purpose, the technical solution of the invention provides an information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation, comprising the following steps:
Step 1: collect data periodically from the Internet or a private database;
Step 2: extract semantic entities and relations from the document data, audio data or visual data obtained in step 1, thereby converting the data into a form expressed by semantic entities and relations, where a semantic entity is defined as any entity that has a stable meaning within the time period the user is concerned with, and a relation exists between any pair of semantic entities;
Step 3: by extracting frequencies, convert the data obtained in step 2 into the original semantic entity-relationship model $D_r$, and add $D_r$ to a historical database after a time delay, where the frequency is the frequency of occurrence of a semantic entity or relation;
Step 4: convolve the data in the historical database with the user's learning/forgetting curve to generate a user knowledge model $K_U$ that represents the user's existing knowledge;
Step 5: use the user knowledge model $K_U$ to predict the data in the original semantic entity-relationship model $D_r$, thereby generating the knowledge the user is interested in.
Further, step 5 is followed by step 6: the data obtained in step 5 are either filtered by retrieval and then displayed on a visual user interface by a hyperbolic geometry layout device, or displayed on the visual user interface directly by the hyperbolic geometry layout device.
The invention proposes a semantic entity-relationship model to describe the semantics of massive unstructured data. In this model the semantic entity, as the most basic semantic unit, is the core of the model. The relations among semantic entities serve as a supplement that lets the model carry more information. Each semantic entity may appear in different files, or within the same file, in various forms such as slightly different text, images, video or sound. All of these forms are important material for conveying the semantics of the entity, and the appearance of any form at any position in a file may indicate the appearance of the entity. By building the semantic entity-relationship model, massive multimedia unstructured data can be mined and analyzed so that the information most likely to attract the user is produced automatically. This overcomes the functional shortcomings of earlier search engine technology and lets the user find the most needed information while browsing.
The advantages of the invention are: first, users directly view the information they are interested in without having to read through large amounts of irrelevant information; second, reasonably good recommendations are obtained without any input, so no query keywords need to be entered; third, text can be queried, multimedia information such as video and images can be queried, and cross-media queries are possible; fourth, unstructured information is viewed intuitively, without having to inspect dull content such as data tables.
Description of drawings
Fig. 1 is a flow chart of the information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation provided by the invention;
Fig. 2 is a flow chart of the method for extracting semantic entities from document data;
Fig. 3A shows the ideal semantic entity-relationship model;
Fig. 3B shows the semantic entity-relationship model actually used;
Fig. 4 is a schematic diagram of the computation of the user knowledge model;
Fig. 5 is a flow chart of the implementation of the visualization interface;
Fig. 6 shows the learning/forgetting curve of an ordinary user;
Fig. 7 shows a text semantic entity-relationship model rendered with the visualization technique.
Detailed description of embodiments
The invention is described in detail below with reference to an embodiment.
Embodiment
Fig. 1 shows the flow chart of the information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation provided by the invention. Its steps are:
Step 1: collect data periodically from the Internet or a private database. The data of interest to the user may come from the Internet, from a private database, or from both, so an automatic data acquisition device first fetches new data from these sources at regular intervals.
Step 2: extract semantic entities and relations from the document data, audio data or visual data obtained in step 1, thereby converting the data into a form expressed by semantic entities and relations. This step divides into the extraction of semantic entities and the extraction of relations.
Extraction of semantic entities:
A semantic entity is defined as any entity that has a stable meaning within the time period the user is concerned with. It can be a concrete person, place, organization or commodity, or an abstract event (such as the National Day military parade), time (such as 9/11) or number (such as "70 yards"). Semantic entities are generally time-sensitive. For example, before 1991 "Clinton" appeared very rarely in Chinese text, so the string could not be a semantic entity in that period. Between 1992 and 2004 "Clinton" generally referred to former US President Bill Clinton, and after 2007 it typically refers to Hillary Clinton. Therefore, if the time window of interest differs, the semantic entities the user obtains may differ as well.
Because source files always aim to express rich semantics, semantic entities generally do not appear in isolation; several of them appear together in one sentence, video shot or image to express a complete event. Extracting semantic entities from the different kinds of data is therefore essential. Depending on the characteristics of each data type, different techniques are needed.
(1) Extracting semantic entities from document data
A text semantic entity is embedded in different sentences of different documents to express the corresponding meaning. Table 1 shows examples of the semantic entity "Kokang" (果敢) in sentences.
Fighting continues in the Kokang region of Myanmar
Myanmar government forces and the Kokang allied forces, on the 29th
Kokang First Special Region forum -
The Kokang conflict, I hear
Chinese-speaking Kokang women soldiers of Myanmar, watch online
Table 1
To segment semantic entities out of a text string, all of these different instances must be identified. In the prior art, the technique closest to the invention is named entity recognition from natural language processing, which uses various means to extract person names, place names and organization names from text. The most advanced named entity recognition techniques at present are based on conditional random fields (CRF). However, the semantic entities defined here differ considerably from named entities (for example, "National Day military parade" can be a semantic entity as defined here, but no researcher has ever treated it as a named entity), so even CRF-based named entity recognition does not give satisfactory results.
As shown in Fig. 2, the method proposed by the invention for extracting semantic entities from document data is as follows. All documents to be processed (D1) are decomposed into a word stream by the dictionary word segmentation device S1, using a preset dictionary D2. The CRF boundary prediction device S2 and the statistical feature extraction device S3 then extract the boundary features D3 and the statistical features D4, respectively, of the various candidate character strings. Finally, the boundary features and statistical features of each character string are fed together as a feature vector into the SVM classification device S4 and classified by the SVM algorithm. All character strings that S4 identifies as text semantic entities constitute the semantic entities of step 2. The computation is as follows:
Let the character string $s = (w_1, \ldots, w_n)$ be the string to be processed, where $w_i$ is a word. In a word stream containing $s$, $\xi = (\ldots, c_{-2}, c_{-1}, s, c_1, c_2, \ldots)$, each $c_j$ is a context word of $s$. The CRF boundary prediction device S2 uses the standard CRF algorithm to estimate the probabilities $B_l(s)$ and $B_r(s)$ that the left and right boundaries of each string are the left and right boundaries of a semantic entity. Both probabilities are standard features used by the standard CRF algorithm. In addition, the following statistical feature $R(s)$ and boundary features $E_l(s)$ and $E_r(s)$ can be extracted for $s$:
[Equation image G2009101992840D0000051: definition of $R(s)$]
where $f(s)$ is the frequency with which $s$ occurs in the documents to be processed, $f(w_i)$ is the frequency with which the $i$-th word $w_i$ occurs in the character string $s$, and $n$ is the number of words in $s$. $R(s)$ measures whether $s$ is merely an accidental combination of several words: the larger $R(s)$, the less likely $s$ is an accidental combination, and the more likely $s$ is a semantic entity.
[Equation image G2009101992840D0000052: definition of $E_l(s)$]
where $\Omega(s, c_{-1})$ is the set of all words that can appear immediately to the left of $s$, that is, all left contexts of $s$, and $p(x)$ is the probability of occurrence of such a word $x$. $E_l(s)$ measures how varied the left context of $s$ is: the larger $E_l(s)$, the more the left context of $s$ varies, and the more likely $s$ is a semantic entity.
[Equation image G2009101992840D0000053: definition of $E_r(s)$]
$E_r(s)$ is analogous to $E_l(s)$ and measures how varied the right context of $s$ is.
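The formulas for the boundary features are not reproduced in this text, so the following is only a minimal sketch of one common way to quantify "how varied the context is": the Shannon entropy of the words adjacent to a candidate string. Treating $E_l(s)$ and $E_r(s)$ as entropies is an assumption, as are the function name `context_entropy` and the toy token list.

```python
import math
from collections import Counter

def context_entropy(tokens, candidate, side="left"):
    """Shannon entropy of the words directly left (or right) of `candidate`.

    `tokens` is a segmented word stream and `candidate` a tuple of words.
    Assumption: the patent's E_l / E_r behave like this entropy; the exact
    formula is not available in the text.
    """
    n = len(candidate)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) != candidate:
            continue
        j = i - 1 if side == "left" else i + n
        if 0 <= j < len(tokens):
            neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbours.values())

# Hypothetical word stream: "X Y" appears with varied right neighbours.
tokens = "a X Y b c X Y d a X Y e".split()
left = context_entropy(tokens, ("X", "Y"), "left")
right = context_entropy(tokens, ("X", "Y"), "right")
```

Under this reading, a string whose neighbours change a lot on both sides gets high values and is a better candidate for a semantic entity than a string that always appears inside the same longer phrase.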
All the features above, $(B_l(s), B_r(s), R(s), E_l(s), E_r(s))$, are combined into a feature vector and fed into the SVM classification device S4, and the standard SVM algorithm decides whether the character string $s$ is a semantic entity. After a text string has been processed in this way, it is divided into a sequence of words and semantic entities, so the semantic entities can be extracted. An example of a segmented text follows:
www.chinanews.com, Changsha, June 22 (reporter Fu Yu): This evening, Hunan Governor Zhou Qiang met in Changsha with a delegation led by Zhang Daocheng, Asia-Pacific president of global investment banking at The Hongkong and Shanghai Banking Corporation Limited (HSBC). Zhou Qiang welcomed Zhang Daocheng and his delegation and briefed the guests on Hunan's economic development. He said that, in the face of the international financial crisis, Hunan has kept to maximizing advantages and avoiding disadvantages, responded actively, and earnestly implemented the central government's policies and measures for ensuring growth, protecting people's livelihood and maintaining stability. It has seized the opportunity of the state's plans for ten major industries to upgrade its industrial structure while advancing new-type industrialization and comprehensively building the Changsha-Zhuzhou-Xiangtan "two-oriented society" pilot zone, and it has stabilized foreign trade, continued to open up further, and deepened economic cooperation at home and abroad.
In the original, the detected semantic entities are marked in bold and the rest are ordinary words. As can be seen, the semantic entities defined here include not only traditional named entities such as person and place names, but also the various event names, policy names and the like that are crucial for expressing the meaning of the text. Describing text documents by semantic entities therefore preserves the semantics of the text more completely.
(2) Extracting semantic entities from audio data
Automatic speech recognition is a mature technology, and the most important information in audio is the speech it contains. To extract audio semantic entities, the invention therefore first converts the audio into a text string with automatic speech recognition and then applies the method described above for extracting semantic entities from document data.
(3) Extracting semantic entities from visual data
Extracting semantic entities from visual data can be roughly divided into two steps, segmentation and merging:
Step 2.1: segmentation
Current technology is still unable to extract complete semantics from images and video, so directly extracting visual semantic entities that correspond to text semantic entities is impractical. To address this, the invention defines visual semantic entities simply: for images, each image is treated as a single image semantic entity; for video, each shot is treated as a single video semantic entity. Visual data can then easily be divided into distinct image or video semantic entities.
Step 2.2: merging
To associate image or video semantic entities with text that expresses the same meaning, the image or video semantic entities obtained in step 2.1 must be merged with the text semantic entities that express the same semantics.
For images, the usual source is a picture on a web page. When a picture is embedded in HTML code, the page usually also embeds alternate text for users who cannot display the picture for whatever reason, and provides a caption for the user to read. The alternate text and the caption are very good textual descriptions of the picture's semantics, so the picture can be used to represent the visual features of the semantic entities in that text; that is, the picture is merged with the corresponding text semantic entities. As with audio, the semantic entities in the alternate text and caption are extracted with the method described above for extracting semantic entities from document data.
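As an illustration of the image case, the sketch below collects the descriptive text attached to each picture in a page using Python's standard `html.parser`. It assumes the alternate text lives in the `alt` attribute and uses the `title` attribute as a stand-in for the caption; real pages may place captions in other elements, and the names `ImageTextCollector` and `collect_image_texts` are hypothetical.

```python
from html.parser import HTMLParser

class ImageTextCollector(HTMLParser):
    """Collects (src, [descriptive texts]) for every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # Assumption: alt text and title attribute carry the description.
        texts = [attr[k] for k in ("alt", "title") if attr.get(k)]
        self.images.append((attr.get("src"), texts))

def collect_image_texts(html):
    parser = ImageTextCollector()
    parser.feed(html)
    parser.close()
    return parser.images

page = '<p><img src="a.jpg" alt="military parade" title="National Day"><img src="b.png"></p>'
pairs = collect_image_texts(page)
```

Each collected text would then be passed to the text semantic entity extractor, and the picture merged with whatever entities are found in it.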
For video, the soundtrack describes the semantic content of the video well, so the invention converts the soundtrack into text with automatic speech recognition and extracts text semantic entities from it with the method described above for document data. Each text semantic entity recognized in the soundtrack can be synchronized to a particular shot of the video through the synchronization between soundtrack and picture. The text semantic entity is then merged with the video semantic entities of the five shots before and the five shots after the synchronized shot into a single semantic entity, which gives the semantic entities of step 2.
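The synchronization step can be sketched as follows, assuming shot boundaries are available as a sorted list of start times and the speech recognizer reports a timestamp for each recognized entity. The function names and the choice to include the synchronized shot itself in the merged range are assumptions for illustration.

```python
from bisect import bisect_right

def shot_index(shot_starts, t):
    """Index of the shot containing time t (shot_starts sorted ascending)."""
    return max(bisect_right(shot_starts, t) - 1, 0)

def shots_for_entity(shot_starts, t, window=5):
    """Shots merged with an entity spoken at time t: the synchronized shot
    plus up to `window` shots before and after it, clipped to the video."""
    k = shot_index(shot_starts, t)
    lo = max(k - window, 0)
    hi = min(k + window, len(shot_starts) - 1)
    return list(range(lo, hi + 1))

# Hypothetical shot start times in seconds.
starts = [0.0, 4.0, 9.5, 15.0, 21.0, 30.0, 33.0, 40.0]
merged = shots_for_entity(starts, 16.2, window=2)
```

With the sample data, an entity spoken at 16.2 s falls in shot 3, so a window of two shots selects shots 1 through 5.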
Extraction of relations:
After the semantic entities have been extracted, every document can be decomposed into a sequence of semantic entities and ordinary words. For example, the sentence "Obama is elected US President" can be decomposed into "Obama", "elected" and "US President". The ordinary word "elected" can be regarded as the relation between the semantic entities "Obama" and "US President", so the meaning of the sentence is fully represented by the interaction "elected" between those two entities, as shown in Fig. 3A. However, existing machine learning techniques extract interactions from text, images and video with barely satisfactory accuracy, and extracting the type of an interaction with high accuracy is difficult. To solve this problem, the invention treats interactions as semantic entities too and defines an undirected relation between any two semantic entities that occur together. In other words, ordinary words are also treated as semantic entities (except overly common words, such as "" and "place", which are removed), as shown in Fig. 3B. Under this definition, a relation may exist between any pair of semantic entities.
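The simplified relation definition of Fig. 3B amounts to taking every unordered pair of units that co-occur in a sentence after removing overly common words. A minimal sketch, in which the function name and the stop-word set are hypothetical:

```python
from itertools import combinations

def cooccurrence_relations(units, stopwords=frozenset()):
    """Undirected relations between all distinct units of one sentence.

    `units` is the segmented sentence (semantic entities and ordinary
    words); each relation is a frozenset so that order does not matter.
    """
    kept = sorted({u for u in units if u not in stopwords})
    return [frozenset(pair) for pair in combinations(kept, 2)]

relations = cooccurrence_relations(
    ["Obama", "elected", "the", "US President"], stopwords={"the"}
)
```

For the example sentence this yields three relations, one for each pair among "Obama", "elected" and "US President".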
Step 3: by extracting frequencies, convert the data obtained in step 2 into the original semantic entity-relationship model $D_r$, and add $D_r$ to the historical database after a time delay, where the frequency is the frequency of occurrence of a semantic entity or relation.
A single occurrence of a semantic entity or relation does not mean that the user is interested in it. Users clearly pay different amounts of attention to different semantic entities and relations, so each semantic entity and relation must be given a suitable weight to build the semantic entity-relationship model $D$. Mathematically, $D$ can be expressed as:
$D = \{(e_i, w_i) \mid 1 \le i \le m\} \cup \{(r_j, w_j) \mid 1 \le j \le n\}$, where $e_i$ is a semantic entity and $w_i$ its weight, $r_j$ is a relation between a pair of semantic entities and $w_j$ its weight, $m$ is the number of semantic entities, and $n$ is the number of relations. Ideally the weight would be the user's degree of attention to the semantic entity or relation, but this is clearly difficult to compute directly. The simplest weight, and one that can be obtained quickly, is the frequency of occurrence of the semantic entity or relation in all the data, that is,
$D_r = \{(e_i, f(e_i)) \mid 1 \le i \le m\} \cup \{(r_j, f(r_j)) \mid 1 \le j \le n\}$, where $e_i$ is a semantic entity and $f(e_i)$ the frequency with which it occurs, $r_j$ is a relation between a pair of semantic entities and $f(r_j)$ the frequency with which it occurs, $m$ is the number of semantic entities, and $n$ is the number of relations. This model is called the "original semantic entity-relationship model" because it is the easiest to obtain and is the basis for computing higher-level models.
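The frequency-weighted model $D_r$ can be held in two counters, one for entities and one for relations. The sketch below assumes the input is a list of already segmented sentences and that a relation is counted once per sentence in which both units occur; both choices, and the name `build_raw_model`, are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def build_raw_model(sentences):
    """Return (entity_freq, relation_freq) for a list of segmented sentences."""
    entity_freq = Counter()
    relation_freq = Counter()
    for units in sentences:
        entity_freq.update(units)                      # f(e_i)
        distinct = sorted(set(units))
        relation_freq.update(                          # f(r_j)
            frozenset(pair) for pair in combinations(distinct, 2)
        )
    return entity_freq, relation_freq

entities, relations = build_raw_model([
    ["Obama", "elected", "US President"],
    ["Obama", "visits", "China"],
])
```

Because both counters map an item to a number, the same structure can later hold any other weight, which is what makes it easy to swap frequency for a better measure.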
Step 4: convolve the data in the historical database with the user's learning/forgetting curve to generate the user knowledge model $K_U$, which represents the user's existing knowledge.
Simple information such as frequency does not make a very attractive weight for the semantic entity-relationship model, because semantic entities with very high frequency, such as "China" or "Hu Jintao", are often unattractive. High-frequency semantic entities are unattractive because people already know them well; seeing them brings the user no new information. What users most want to see is information they do not yet know: the less familiar the information, the more likely the user is to be interested. If the user's degree of unfamiliarity is used as the weight of the semantic entity-relationship model, the model will match the user's information needs well.
To compute the user's unfamiliarity, however, the user's existing knowledge must first be known. Because users may obtain information through many channels a computer cannot observe (for example, friends), it might seem that a model of the user's knowledge cannot be estimated accurately. But a careful look at how information travels from its source to the user shows that most of what users know originates in some published historical document (text or video news, web pages, blogs and so on) or in the user's own database; after all, most people have no private channel for obtaining information. Therefore, if all published historical documents (and the historical data of the user's own database) are collected, and the effect on the user's knowledge of both the route by which the information reaches the user and the user's memory is taken into account, the user knowledge model $K_U$ can be computed fairly accurately.
The process from the publication of a piece of information to the user coming to know it (the propagation route) can be approximated as a learning process. After learning the information, the user gradually forgets it (the user's memory capacity), which can be approximated as a forgetting process. The combination of the two can be described by a learning/forgetting curve such as the one shown in Fig. 6. If information is described by semantic entities and their relations, each semantic entity or relation can be regarded as a discrete event occurring on the time axis. Mathematically, the combined effect of these events under the learning/forgetting curve equals the convolution of all the discrete events with the curve. The final user knowledge model $K_U$ is therefore obtained by convolving all the historical data with the user's learning/forgetting curve:
[Equation image G2009101992840D0000091: definition of $K_U$]
where $e$ is an event of interest to the user, $w(e)$ is the degree to which the user is likely to know $e$, $f_U(t)$ is the user's learning/forgetting curve, $g_U(x)$ is the impulse function, $t_e$ is the time at which $e$ occurs, and $l$ is the number of events. The computation is illustrated in Fig. 4.
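Convolving a train of impulses with a curve and evaluating the result at the current time reduces to summing the curve at the elapsed time since each occurrence. The sketch below uses that identity. The curve shape (linear rise over a few days, then exponential decay) and its parameters are purely hypothetical stand-ins for Fig. 6, which is not reproduced in this text.

```python
def learning_forgetting_curve(dt, learn_days=3.0, half_life_days=30.0):
    """Hypothetical curve: knowledge rises linearly while the information
    spreads, then decays with a fixed half-life. dt is elapsed days."""
    if dt < 0:
        return 0.0                      # the event has not happened yet
    if dt < learn_days:
        return dt / learn_days          # learning phase
    return 0.5 ** ((dt - learn_days) / half_life_days)  # forgetting phase

def knowledge_weight(event_times, now, curve=learning_forgetting_curve):
    """Convolution of impulses at event_times with the curve, read at `now`."""
    return sum(curve(now - t) for t in event_times)

def user_knowledge_model(history, now):
    """history maps each entity or relation to its list of occurrence times."""
    return {item: knowledge_weight(times, now) for item, times in history.items()}

model = user_knowledge_model({"China": [0.0, 30.0], "new topic": [32.5]}, now=33.0)
```

An item mentioned repeatedly over a long period accumulates a large weight, while one that appeared only a few hours ago has barely been "learned" and so stays close to zero.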
If the events in the user knowledge model $K_U$ are semantic entities and relations, then $K_U$ consists of semantic entities, relations and their associated weights. $K_U$ therefore has the same mathematical form as the semantic entity-relationship model, which makes computations between the two straightforward.
Step 5: use the user knowledge model $K_U$ to predict the data in the original semantic entity-relationship model $D_r$, thereby generating the knowledge the user is interested in.
Once the user knowledge model $K_U$ and the original semantic entity-relationship model $D_r$ of the current data are available, $K_U$ can be used to measure the unfamiliarity of each semantic entity and relation in $D_r$. If a semantic entity or relation can be predicted by the user knowledge model, the user already knows it well, its unfamiliarity is low, and the user is unlikely to be interested; otherwise its unfamiliarity is high and the user is likely to be interested. The key is therefore to measure how inaccurately the user knowledge model predicts each semantic entity or relation, that is, to measure the difference between the original semantic entity-relationship model and the user knowledge model.
On closer inspection, the original semantic entity-relationship model $D_r$ can be regarded as a discrete probability distribution once its weights are normalized: the semantic entities and relations are the outcomes, and the normalized weights are their probabilities. The difference between two such models can therefore be computed with methods for comparing probability distributions. The KL distance (Kullback-Leibler divergence) is the standard measure of the difference between probability distributions, so it can be used here. According to the KL distance formula, the total difference between the two models is:
$$w(D_r \,\|\, K_U) = \sum_j p(e_j, D_r) \log \frac{p(e_j, D_r)}{p(e_j, K_U)},$$
where $e_j$ is a semantic entity or relation, $p(e_j, D_r)$ is the probability of $e_j$ in the original semantic entity-relationship model $D_r$, and $p(e_j, K_U)$ is the probability of $e_j$ in the user knowledge model $K_U$. Decomposing this formula and removing the outlier and the positively correlated function gives the local unfamiliarity of each semantic entity or relation:
$$w_r(e_j) = \frac{p(e_j, D_r)}{p(e_j, K_U)}$$
The weights computed by this formula are not normalized, which makes comparison and subsequent computation inconvenient, so they need to be normalized. The normalization is:
$$w(e_j) = \frac{w_r(e_j)}{\max_{e_j \in D_r} \{ w_r(e_j) \}}$$
Replacing the weights of the original semantic entity-relationship model $D_r$ with these weights yields a mathematical model that expresses the information of interest to the user well. In this model, semantic entities and relations with weights close to 1 are more likely to attract the user, while those with weights close to 0 mean little to the user and can be omitted.
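The three formulas of step 5 can be sketched together as below. Both models are normalized into probability distributions, the total KL distance is computed, and each item receives the ratio weight scaled by the maximum. The small floor `eps` for items missing from the knowledge model is an assumption added to avoid division by zero; the source does not say how that case is handled.

```python
import math

def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def kl_distance(p_d, p_k, eps=1e-9):
    """Total difference w(D_r || K_U) between the two distributions."""
    return sum(
        p * math.log(p / max(p_k.get(k, 0.0), eps))
        for k, p in p_d.items() if p > 0
    )

def unfamiliarity(raw_freq, knowledge, eps=1e-9):
    """Normalized local unfamiliarity w(e_j) for every item in D_r."""
    p_d = normalize(raw_freq)
    p_k = normalize(knowledge)
    raw = {k: p / max(p_k.get(k, 0.0), eps) for k, p in p_d.items()}
    top = max(raw.values())
    return {k: v / top for k, v in raw.items()}

# Hypothetical data: both items are equally frequent today, but the user
# already knows "China" far better than the other topic.
weights = unfamiliarity({"China": 50, "new topic": 50},
                        {"China": 90, "new topic": 10})
```

In this toy case the less familiar item receives weight 1 and "China" drops to about 0.11, which matches the intent that well-known entities should rank low even when they are frequent.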
Step 6: the data obtained in step 5 is either filtered by retrieval and then displayed on the visual user interface by the hyperbolic geometry layout device, or displayed on the visual user interface directly by the hyperbolic geometry layout device.
Although the semantic entity-relationship model extracted by steps 1 to 5 matches the user's information needs, users will not be interested in this raw information as such. If the raw semantic entities, relations and weights were simply shown to the user in plain form, no user would be satisfied. To let users obtain the information that really interests them quickly and efficiently, the raw data must be converted into a form that is intuitive and easy to browse, and an efficient retrieval method must be provided. The system must therefore provide a user interface that integrates visualization techniques and jointly supports browsing and retrieval of multimedia information, helping the user to complete information analysis, browsing and retrieval tasks. To this end, the present invention proposes several integrated visual interfaces.
The flow for realizing the visual interface is shown in Fig. 5. The semantic entity-relationship model can be seen as consisting of two parts: E = {(e_i, w_i) | 1 ≤ i ≤ m} and R = {(r_j, w_j) | 1 ≤ j ≤ n}. If E is regarded as the vertex set and R as the edge set, the semantic entity-relationship model is a weighted undirected graph. In the visualization field, a graph can be arranged on the display plane by a standard layout algorithm to form the best displayed image, so the first step in the visual interface proposed by the present invention is to lay out the graph. The graph produced by the graph layout device lies in a plane, whereas human vision is fine at the centre and coarse at the periphery. To match this characteristic, a hyperbolic geometry transformation device projects the laid-out graph onto a hyperboloid, and the hyperboloid is then projected onto the display plane. The hyperbolic geometry transformation is also a standard technique in visualization.
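The two projections described above can be sketched as follows. The text does not specify which hyperboloid or which projection is used, so this sketch assumes the upper sheet of the hyperboloid t² − x² − y² = 1 and the usual projection onto the Poincaré disk; the function names and the `scale` parameter are hypothetical, and the planar positions are assumed to come from some earlier graph layout step.

```python
import math


def plane_to_hyperboloid(x, y):
    """Lift a planar layout point onto the sheet t = sqrt(1 + x^2 + y^2)."""
    return x, y, math.sqrt(1.0 + x * x + y * y)


def hyperboloid_to_disk(x, y, t):
    """Project a hyperboloid point onto the unit (Poincare) disk."""
    return x / (1.0 + t), y / (1.0 + t)


def project_layout(positions, scale=1.0):
    """Map planar node positions to display coordinates in the unit disk.

    positions: dict mapping node -> (x, y) from a planar graph layout.
    scale:     hypothetical zoom factor applied before the projection.
    """
    projected = {}
    for node, (x, y) in positions.items():
        hx, hy, ht = plane_to_hyperboloid(x * scale, y * scale)
        projected[node] = hyperboloid_to_disk(hx, hy, ht)
    return projected


if __name__ == "__main__":
    layout = {"focus": (0.0, 0.0), "near": (1.0, 0.0), "far": (30.0, 40.0)}
    for node, (dx, dy) in project_layout(layout).items():
        print(node, round(dx, 3), round(dy, 3))
```

Under these assumptions every node lands strictly inside the unit disk: nodes near the focus keep most of their relative spacing, while distant nodes are compressed towards the rim, which matches the fine-centre, coarse-periphery behaviour the text asks for.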
An embodiment that uses the technique of the present invention to recommend and browse Internet news is given below. In this embodiment, a customized web crawler collects news pages from the network. Every 15 minutes the crawler fetches the RSS file of each news website and extracts from it the list of that site's current news. The crawler then downloads the news items in the list one by one and extracts semantic entities from them. One processed result is as follows:
Semantic entity | Type
--- | ---
【 | 【
Reporter | NN
Wang Yue | PER
Reporter | NN
Wu Mei | PER
Xu Xiaomei | PER
】 | 】
Spring 2009 real estate trade fair | EVN
Will | RB
In | IN
May 7 | CD
[illegible] | IN
Tianjin Sports Center | LOC
Grand | JJ
Opening | VV
。 | SENT
The first column of the table holds the recognized semantic entities and the second column their types. Once the semantic entities have been detected, the frequencies of the semantic entities and of their relations are extracted, which gives the original semantic entity-relationship model. Part of the original semantic entity-relationship model extracted from the news of May 30, 2009 is shown in the tables below:
Frequency | Semantic entity
--- | ---
5421 | China
2226 | United States
1425 | Chinese team
... | ...
718 | Germany
714 | Roh Moo-hyun
... | ...

Frequency | Relation
--- | ---
834 | One <--> be
596 | China <--> women's volleyball
... | ...
746 | H1N1 <--> influenza
454 | Tell <--> reporter
387 | Influenza <--> case
... | ...
Because web crawlers is continuous working, so also can export the original semantic entity-relationship model of different times by Fixed Time Interval continuously.In this embodiment, get 1 day, promptly extract an original semantic entity-relationship model every day for extracting the time interval of original semantic entity-relationship model.According to the difference of using, this time interval can be adjusted.
After a specified delay, the original semantic entity-relationship model is added to a historical database that is used to extract the user knowledge model. The delay is generally taken to be the interval at which original semantic entity-relationship models are extracted, so in this embodiment it is 1 day. To extract the user knowledge model, the user's learning/forgetting curve must be specified. This embodiment uses the step function shown in Fig. 6 to approximate the learning/forgetting curve of an ordinary user.
The user knowledge model at a given time is obtained by convolving the learning/forgetting curve above with the historical data. Some data from the user knowledge model extracted from the news data for May 30, 2009 is given below:
Probability of occurrence | Semantic entity
--- | ---
6.58E-02 | China
2.25E-02 | Beijing
2.23E-02 | United States
... | ...
4.57E-03 | Britain
3.95E-03 | Chengdu
3.91E-03 | Chinese team
... | ...

Probability of occurrence | Relation
--- | ---
1.15E-03 | One <--> be
4.14E-04 | Tell <--> reporter
4.05E-04 | Beijing <--> Olympic Games
... | ...
3.84E-04 | Solve <--> problem
3.57E-04 | One <--> have
2.74E-04 | China <--> market
... | ...
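The convolution of the historical data with a step-shaped learning/forgetting curve can be sketched as below. The exact shape of the curve in Fig. 6 is not reproduced in the text, so the sketch assumes a step that gives weight 1 to the most recent `memory_days` days and weight 0 to anything older; the function names, the per-day dictionary format and the final normalization to probabilities are likewise assumptions.

```python
def step_curve(memory_days):
    """Hypothetical step-shaped forgetting curve: full weight for the
    last `memory_days` days, zero weight for older days."""
    return [1.0] * memory_days


def user_knowledge_model(history, curve):
    """Evaluate the convolution of history and curve at the current day.

    history: list of per-day dicts (item -> frequency), oldest first;
             the last element is the most recent day already in the
             historical database.
    curve:   curve[k] is the weight of the data that is k days old.
    Returns a dict item -> probability (weights normalized to sum 1).
    """
    accumulated = {}
    for age, day in enumerate(reversed(history)):
        if age >= len(curve):
            break                      # older data is fully forgotten
        weight = curve[age]
        if weight == 0:
            continue
        for item, freq in day.items():
            accumulated[item] = accumulated.get(item, 0.0) + weight * freq

    total = sum(accumulated.values())
    if total == 0:
        return {}
    return {item: value / total for item, value in accumulated.items()}


if __name__ == "__main__":
    history = [{"Olympics": 10}, {"China": 2, "H1N1": 2}, {"H1N1": 4}]
    print(user_knowledge_model(history, step_curve(2)))
```

With a two-day step, the oldest day in the example is ignored entirely, so only what the user is assumed to still remember contributes to K_U.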
Once the original semantic entity-relationship model and the user knowledge model have been obtained, the knowledge of interest to the user can be predicted. Part of the data extracted from the news of May 30, 2009 is shown in the tables below:
Weight | Semantic entity
--- | ---
1.00E+00 | Wang Hanyu
8.04E-01 | Lin Wenlong
5.31E-01 | Zhao Yanni
5.15E-01 | Guo Keying
5.03E-01 | Add Wal Suo Wa
4.40E-01 | Dai Mou
... | ...

Weight | Relation
--- | ---
1.00E+00 | Second generation <--> case
2.31E-01 | Lin Wenlong <--> Guo Keying
1.34E-01 | Makeup artist <--> photo studio
... | ...
1.25E-01 | Wang Hanyu <--> Gao Jie
1.07E-01 | Wedding photography <--> photo studio
... | ...
Comparing the examples listed above of the original semantic entity-relationship model, the user knowledge model and the knowledge of interest to the user for the same date shows that the original semantic entity-relationship model and the user knowledge model both consist of ordinary information that cannot attract the user, whereas the knowledge of interest to the user does represent the most attractive news at that time and includes unusual knowledge that the user may be interested in.
Finally, the data is laid out and transformed by hyperbolic geometry in the visualization device and can then be shown to the user. Two examples of visualized knowledge of interest to the user are shown in Fig. 7A and Fig. 7B.

Claims (7)

1. An information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation, comprising the following steps:
Step 1: periodically collecting data from the Internet or a private database;
Step 2: extracting semantic entities and relations from the document data, audio data or visual data obtained in step 1, thereby converting the data into a form represented by semantic entities and relations, wherein a semantic entity is defined as any entity that has a stable meaning within the time period the user is concerned with, and a relation exists between any pair of semantic entities;
Step 3: converting the data obtained in step 2 into an original semantic entity-relationship model D_r by extracting frequencies, the original semantic entity-relationship model D_r being added to a historical database after a delay, wherein the frequency is the frequency of occurrence of a semantic entity or relation;
Step 4: convolving the data in the historical database with the user's learning/forgetting curve to generate a user knowledge model K_U that represents the user's existing knowledge;
Step 5: using the user knowledge model K_U to make predictions on the data in the original semantic entity-relationship model D_r so as to generate the knowledge of interest to the user.
2. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 1, characterized in that step 5 is followed by:
Step 6: the data obtained in step 5 is either filtered by retrieval and then displayed on the visual user interface by the hyperbolic geometry layout device, or displayed on the visual user interface directly by the hyperbolic geometry layout device.
3. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 1, characterized in that the method of extracting semantic entities from document data in step 2 is: all documents to be processed (D1) are decomposed into a word stream by a dictionary word segmentation device (S1) on the basis of a preset dictionary (D2); a CRF boundary prediction device (S2) and a statistical feature extraction device (S3) then extract, respectively, the boundary features and the statistical features of the various character string combinations; finally, the boundary feature (D3) and the statistical feature (D4) of the same character string are fed together as a feature vector into an SVM classification device (S4) and classified by the SVM algorithm, and all character strings identified as textual semantic entities by the SVM classification device (S4) constitute the semantic entities of step 2.
4. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 3, characterized in that the method of extracting semantic entities from audio data in step 2 is: the audio is first converted into a text string by automatic speech recognition, and the semantic entities in it are then extracted by the method of extracting semantic entities from document data according to claim 3, thereby obtaining the semantic entities of step 2.
5. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 3, characterized in that the method of extracting semantic entities from visual data in step 2 is:
Step 2.1: segmentation
Each image is regarded as one semantic entity and each shot in a video is regarded as a single semantic entity; the visual data is divided into a plurality of image semantic entities or video semantic entities according to this rule;
Step 2.2: merging
For an image, textual semantic entities are extracted from the alternate text and the title of the image by the method of extracting semantic entities from document data according to claim 3, and each such textual semantic entity is merged with the image semantic entity obtained in step 2.1 into the same semantic entity, giving the semantic entities of step 2;
For a video, the soundtrack of the video is converted into text by automatic speech recognition and the textual semantic entities in it are extracted by the method of extracting semantic entities from document data according to claim 3; each textual semantic entity recognized in the soundtrack is synchronized to a particular shot in the video according to the synchronization between soundtrack and video; the textual semantic entity is then merged into the same semantic entity with the video semantic entities of the five shots before and the five shots after the shot to which it was synchronized, giving the semantic entities of step 2.
6. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 1, characterized in that the original semantic entity-relationship model D_r of step 3 is expressed mathematically as:
$$ D_r = \{ (e_i, f(e_i)) \mid 1 \le i \le m \} \cup \{ (r_j, f(r_j)) \mid 1 \le j \le n \} $$
where e_i denotes a semantic entity and f(e_i) the frequency with which e_i occurs; r_j denotes a relation between a pair of semantic entities and f(r_j) the frequency with which r_j occurs; m is the number of semantic entities and n is the number of relations.
7. The information browsing and retrieval method based on a semantic entity-relationship model and visualized recommendation according to claim 1, characterized in that the prediction of step 5 comprises:
Step 5.1: normalizing the weight terms in the original semantic entity-relationship model D_r and calculating, according to the KL distance formula, the total difference between the user knowledge model K_U and the original semantic entity-relationship model D_r;
Step 5.2: decomposing this formula and removing the extraneous terms and the monotonically increasing (positively correlated) function to obtain the local unfamiliarity w(e_j) of each semantic entity or relation;
Step 5.3: replacing the weight terms in the original semantic entity-relationship model D_r with the local unfamiliarity w(e_j) to obtain a mathematical model that expresses the information of interest to the user well, in which semantic entities or relations with higher weights are more likely to attract the user, while semantic entities and relations with lower weights mean little to the user and can be omitted.
CN2009101992840A 2009-11-24 2009-11-24 Information browsing and retrieval method based on semantic entity-relationship model and visualized recommendation Expired - Fee Related CN101706794B (en)


Publications (2)

Publication Number Publication Date
CN101706794A true CN101706794A (en) 2010-05-12
CN101706794B CN101706794B (en) 2012-08-22




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160323

Address after: Suzhou City, Jiangsu province 215300 Kunshan Dengyun Road No. 268, room 211

Patentee after: SUZHOU ANGERAY ELECTRONIC TECHNOLOGY Co.,Ltd.

Address before: 879 Lane 200062, Zhongjiang Road, Shanghai, Putuo District

Patentee before: SHANGHAI XIANZHI INFORMATION SCIENCE & TECHNOLOGY CO., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120822