Text-Based Web Page Classification with Use of Visual Information

Vladimír Bartík
Dept. of Information Systems, Faculty of Information Technology
Brno University of Technology
Brno, Czech Republic
e-mail: bartik@fit.vutbr.cz





Abstract: As the number of pages on the web is constantly increasing, there is a need to classify pages into categories to facilitate indexing and searching them. The method proposed here uses both textual and visual information to find a suitable representation of web page content. Several term weights based on TF or TF-IDF weighting are proposed. The modifications are based on the visual areas in which the text appears and on their visual properties. Results of experiments are included in the final part of the paper.

Keywords: web page classification, term weights, text classification, TF-IDF weight, visual information, visual blocks.

I. INTRODUCTION

As the amount of information available on the World Wide Web (WWW) is constantly increasing, there is a growing need to extract useful knowledge from it. Web page classification, also known as web page categorization, is the process of assigning web pages to one of several predefined classes. Classification of web pages is essential to many web information retrieval tasks, such as indexing pages, improving web search, or constructing web directories.

Initially, classification methods were applied primarily to structured databases. To classify semi-structured data in the form of web pages, we have to find a representation of web page content that is suitable for classification methods. A web page contains two main types of information: a visual structure formed by the HTML code, which implicitly carries information about the visual blocks on the page, and unstructured information in the form of text.

Contemporary methods for classification of web data work mainly with the text present on a web page. These text-based classification methods usually use the bag-of-words representation of document contents, in which a document is represented by a vector of TF/IDF weights assigned to individual terms.

However, this representation does not capture visual information. On the other hand, some classification methods based on web page structure have been proposed. In this case, the structure of a document, along with its pictures, is the input for classification algorithms, but text information is not taken into consideration.

Because text plays a crucial role in representing the content of most web pages, it is appropriate to enrich the text representation with visual information. One option is to use information from HTML tags to improve term weighting. This captures several properties of the text, such as font size or text color, which can be used to modify the weights of terms. However, this approach does not reflect other visual properties of a web page, such as its layout or the location of an element.

To capture this type of visual information, we can use page segmentation methods. Segmentation is defined as the process of detecting the organization of visual blocks on a page and analyzing the properties of the component visual elements. Data obtained by segmentation can also be used to improve the web page representation. Segmentation algorithms usually work with rendered web documents and their visual representation.

In this paper, we introduce a new way of modifying term weighting with visual information. A segmentation algorithm is used to detect the visual blocks that form the web page. We are then able to classify the visual blocks into predefined categories, e.g. heading, main text of a page, advertisement or navigation. These types of visual blocks differ in their importance for the representation of a page, and this importance should be taken into account when classifying whole web pages. Therefore, we present modifications of term weights that capture this information.

First, we very briefly describe the classification of visual blocks. Next, modifications of term weighting are introduced, and results of web page classification experiments are described in the final part of this paper.

II. RELATED WORK

A. Term Weighting for Classification

Classification based on text extracted from pages is the most common way to classify web pages. The basic methods are based on a bag-of-words representation with TF or TF/IDF weights [1]. TF denotes the frequency of a term in a document. It should be normalized, for example as:



$$TF(t,d) = 0.5 + 0.5 \cdot \frac{tf(t,d)}{MaxFreq(d)}, \qquad (1)$$

[2010 International Conference on Advances in Social Networks Analysis and Mining. 978-0-7695-4138-9/10 $26.00 © 2010 IEEE. DOI 10.1109/ASONAM.2010.34]

where MaxFreq(d) is the maximum frequency of any term in a document. It is also necessary to consider the inverse document frequency (IDF), which represents the general importance of a term among all documents. The resulting TF/IDF weight is obtained as:

$$TF/IDF(t,d) = TF(t,d) \cdot \left(1 + \log\frac{n}{k}\right), \qquad (2)$$

where n is the count of all documents in the dataset and k is the number of documents containing the term t.
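Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes that documents are already tokenized into lists of terms and that the logarithm is natural (the base is not stated in the text); the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def normalized_tf(tokens):
    """Equation (1): 0.5 + 0.5 * tf(t, d) / MaxFreq(d) for every term of one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_freq for t, c in counts.items()}

def tf_idf(docs):
    """Equation (2): TF(t, d) * (1 + log(n / k)), natural logarithm assumed."""
    n = len(docs)
    # k for each term: number of documents that contain the term at least once
    doc_freq = Counter(t for d in docs for t in set(d))
    return [
        {t: w * (1 + math.log(n / doc_freq[t])) for t, w in normalized_tf(d).items()}
        for d in docs
    ]

# Toy corpus, invented for illustration only
docs = [["web", "page", "web"], ["page", "rank"], ["web", "mining"]]
weights = tf_idf(docs)
```

In this toy corpus, "web" is the most frequent term of the first document, so its normalized TF is 1.0, and it occurs in two of the three documents, so its IDF factor is 1 + log(3/2).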

Another frequently used text representation is the N-gram representation [2], which allows a document to be represented with terms consisting of more than one word.

As mentioned above, the importance of various visual blocks differs; therefore it is necessary to reflect it in the document representation. In [3], the use of information derived from the HTML tags of a page is proposed for classification. A similar method, in which the HTML tags are divided into three groups with a different importance of terms in each group, is described in [4]. The main disadvantage of these two approaches is that web pages are constructed in different ways, and the same information on a page is often represented differently in HTML.

The method proposed in [5] takes four aspects of term frequency into consideration: term frequency, term frequency in headings, frequency of emphasized words in the text, and a word position function, which assumes that the first and last quarters of the text are the most relevant. These four aspects are combined to form the resultant term weights.

B. Other Methods for Document Representation

In [6], TF/IDF weights are replaced by keywords extracted from informative web page blocks. Informative blocks are discovered via the DOM tree of a page, using the assumption that non-content blocks appear repeatedly in a DOM tree.

There are also methods that use a graph representation instead of weight vectors. A graph representation can also preserve structural information about the document. If the similarity between two graphs can be computed, lazy classification algorithms (e.g. k-nearest neighbor) can be used for classification [7]. A combination of graph and vector representation is proposed in [8]. Graphs are processed via a frequent sub-graph mining method, and the frequent sub-graphs obtained become attributes of the vector representation.

A method that represents a web document according to its visual properties is proposed in [9]. Here, the visual adjacency multigraph representation is presented. It is able to represent information about the mutual position of visual parts on a page and the contents of the component parts.

C. Segmentation of Web Pages

To perform segmentation of a web document, it is necessary to render the document and obtain information about the visual areas of the rendered document (not about HTML tags). Several segmentation algorithms have been proposed. Probably the best known is VIPS, presented in [10]. Its result is a tree of visual areas independent of HTML tags. A document is divided into visual blocks based on visual cues such as different fonts or colors, lines and other separators.

Another segmentation method, which is used in the method described here, is proposed in [11]. It uses a bottom-up approach to find visual areas. The result is a set of rectangular visual blocks that are visually separated from the remainder of the web page.

III. ROLE OF VISUAL INFORMATION IN WEB PAGE CLASSIFICATION

As mentioned above, visual information about web page areas can be used to improve web page classification. If visual information is considered, it is necessary to distinguish between content and non-content blocks.

If we are able to obtain the visual properties and position of each visual area automatically, it is possible to determine the importance of each visual area. This is very important because most of the space in a web document is occupied by information that is not relevant to the main topic of the document and should not be considered in the document representation. Examples of unimportant areas of a page are advertisements, navigation bars, links to other pages and copyright information.

A. Classification of Web Page Visual Blocks

Accordingly, to obtain a good document representation that results in accurate classification of pages, we have to distinguish between the following types of visual areas:

• Headings: the main heading and subheadings contained on the page, typically characterized by a font size larger than the rest of the page.

• Main text: the main content of the page; the most important information, which should be included in the representation of a document.

• Date/authors: information about web page authors or the date of page creation; unimportant for further processing.

• Navigation bar: links to other parts of a website, typically constant for all pages of a site; also unimportant for page representation.

• Links: links to other pages related to the current page; some of the links can be irrelevant.

• Others: other parts of a page, e.g. the caption of a whole page or advertisements; also unimportant.

If we have the necessary information about the visual properties and position of a block, the categories described above can be assigned to page blocks via classification. A detailed description of this classification approach, the visual properties used to classify web page areas, and detailed experimental results are given in [12].


B. Role of Categories of Visual Blocks in Page Classification

If classification has assigned a category to each visual block of a web page, this information can be used to enrich the standard TF/IDF scheme. Terms contained in the most important parts receive the highest weight (a standard weight multiplied by a coefficient), terms in less important parts receive a standard weight, and the least important areas are omitted from the representation of document contents. Then, a classification method can be used to assign a category to whole pages.

In the next section, several variants of the TF/IDF scheme are described in detail. They differ primarily in the coefficients used to express the weight of the individual visual block categories.

IV. WEB PAGE REPRESENTATION WITH MODIFIED WEIGHTS

A. Text Preprocessing

Before we can count terms and assign weights to them, it is necessary to perform stop word removal and stemming.

Stop word removal eliminates non-content words that appear in most text documents, such as “the”, “in” or “this”. Its main purpose is to reduce the number of index terms and keep the words that are important for representing the contents of a document.

Stemming is the process of reducing words to their stems (root forms). The reason for stemming is to unify words with similar meaning into one index term. It is performed by a set of rules that remove prefixes and suffixes of words. In our implementation, the Porter stemmer [13] has been used.

After that, we remove words that appear in a very small number of documents, because they are also not important for document classification.
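The three preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the stop word list is a tiny invented subset, `crude_stem` is a placeholder suffix stripper and not the Porter algorithm used in the paper, and the threshold `min_docs` is hypothetical because the text does not give the cut-off for rare words.

```python
import re
from collections import Counter

# Illustrative subset only; a real stop word list is much longer
STOP_WORDS = {"the", "in", "this", "a", "of", "and", "is", "to"}

def crude_stem(word):
    """Placeholder suffix stripper; NOT the Porter stemmer [13] used in the paper."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def drop_rare_terms(docs, min_docs=2):
    """Remove terms that appear in fewer than min_docs documents (threshold is assumed)."""
    doc_freq = Counter(t for d in docs for t in set(d))
    return [[t for t in d if doc_freq[t] >= min_docs] for d in docs]

raw = ["The pages are classified in this paper.", "Classifying a page is the task."]
docs = drop_rare_terms([preprocess(r) for r in raw])
```

The placeholder stemmer maps "classified" and "classifying" to different stems, which is exactly the kind of case a full stemmer such as Porter's is meant to handle better.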

B. Modifications of Standard Weights

This subsection describes various possibilities for modifying the TF, IDF and TF/IDF weights according to knowledge about the categories of the visual blocks to which parts of the text belong. First, the possible modifications of the TF weight are summarized:

• We can set the weight of a term according to the visual block in which it appears. For example, a word in a “heading” block can have higher importance than a word from “links”. This is achieved by multiplying the weight by a coefficient set for each category.

• Some parts of a web page can be omitted from weighting. Some categories do not have content related to the topic of the whole page. It is necessary to choose these non-content categories, for example navigation bars or date/authors.

• It is also possible to modify the equation for normalized term frequency, equation (1). It contains the MaxFreq value, the maximum term frequency in a document. We can now decide whether this value is computed for the whole document, consisting of all visual blocks, or only for its content parts. The first way better reflects the whole size of a page; the second one reflects the length of the main text content on the page.

• It is not certain whether blocks of the category “links” should be omitted from the representation. If these blocks are omitted, we can lose some information about links to related pages. On the other hand, many links usually refer to irrelevant pages.

According to the remarks above, we can define a modified term frequency to represent the text contents of a web document.

Assume that we have a set of web documents $D = \{d_1, \dots, d_n\}$ and a set of terms (words) $T = \{t_1, \dots, t_m\}$ that occur in documents from D.

After rendering the documents from D, we can divide each document into visual blocks, which can be classified into several classes. Let us denote the vector of class labels as $C = (c_1, \dots, c_k)$. Each class is assigned a coefficient according to the significance of the corresponding visual blocks. These coefficients form a vector $V = (v_1, \dots, v_k)$, where $v_j$ is the coefficient of the corresponding class of visual blocks $c_j$.

Then, the modified term frequency of a term $t \in T$ in a web document $d \in D$ is defined as:

$$MTF(t,d) = \sum_{i=1}^{k} F(t,d,c_i) \cdot v_i, \qquad (3)$$

where $F(t,d,c_i)$ is the frequency of term t in all blocks of class $c_i$ in document d. The resultant term weight is thus the sum of the weights over all component visual blocks.

This modified term frequency should also be normalized, as in equation (1). The normalized modified term frequency is defined as:

$$nMTF(t,d) = 0.5 + 0.5 \cdot \frac{MTF(t,d)}{MaxFreq_V(d)}, \qquad (4)$$

where $MaxFreq_V(d)$ is the maximum frequency of any term in the content parts of document d. The content of visual blocks whose class has a coefficient equal to zero in vector V is not considered when computing the $MaxFreq_V$ value.
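Equations (3) and (4) can be sketched in Python as follows. The sketch assumes that a document is given as a list of (block class, tokens) pairs produced by segmentation and block classification, and it reads MaxFreq_V(d) as the highest raw term count over content blocks; both the data layout and this reading are assumptions, as are the function names. The coefficient values are the ones reported in the experimental section.

```python
from collections import Counter

# Coefficients reported in the experiments; unlisted classes count as 0 (omitted)
V = {"main text": 2, "heading": 5, "links": 1}

def modified_tf(blocks, coeffs):
    """Equation (3): sum over block classes of F(t, d, c_i) * v_i."""
    mtf = Counter()
    for block_class, tokens in blocks:
        v = coeffs.get(block_class, 0)
        if v == 0:
            continue  # non-content block: omitted from the representation
        for t in tokens:
            mtf[t] += v
    return dict(mtf)

def max_freq_content(blocks, coeffs):
    """MaxFreq_V(d), read here as the highest raw count over content blocks (assumption)."""
    counts = Counter(t for c, toks in blocks if coeffs.get(c, 0) > 0 for t in toks)
    return max(counts.values(), default=0)

def normalized_mtf(blocks, coeffs):
    """Equation (4): 0.5 + 0.5 * MTF(t, d) / MaxFreq_V(d)."""
    m = max_freq_content(blocks, coeffs)
    return {t: 0.5 + 0.5 * w / m for t, w in modified_tf(blocks, coeffs).items()}

# Hypothetical page, invented for illustration
page = [
    ("heading", ["election"]),
    ("main text", ["election", "vote", "vote"]),
    ("navigation", ["home", "sport"]),
]
nmtf = normalized_mtf(page, V)
```

Under this reading, MTF is weighted by the coefficients while MaxFreq_V is a raw count, so the normalized value can exceed 1 when the coefficients are larger than 1.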

This kind of information can also be used during the preprocessing of documents. We can remove words that occur only in non-content blocks from the representation and thus reduce the size of the weight vectors before the representation is created.
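A minimal sketch of this pruning step is shown below, assuming the same hypothetical layout as before: each document is a list of (block class, tokens) pairs, and a class counts as content when its coefficient is above zero.

```python
def content_vocabulary(documents, coeffs):
    """Terms occurring in at least one block whose class coefficient is above zero."""
    return {
        term
        for blocks in documents
        for block_class, tokens in blocks
        if coeffs.get(block_class, 0) > 0
        for term in tokens
    }

def prune(documents, coeffs):
    """Drop every term that never occurs in a content block of any document."""
    keep = content_vocabulary(documents, coeffs)
    return [
        [(c, [t for t in toks if t in keep]) for c, toks in blocks]
        for blocks in documents
    ]

# Hypothetical documents, invented for illustration
documents = [
    [("main text", ["vote"]), ("navigation", ["home", "vote"])],
    [("navigation", ["home", "contact"])],
]
pruned = prune(documents, {"main text": 2})
```

Here "home" and "contact" occur only in navigation blocks, so they are dropped from the vocabulary before any weights are computed.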

C. Modification of Inverse Document Frequency

The modification of inverse document frequency is similar to the modification of term frequency. In this case, the categories of visual blocks are also considered when determining the number of documents containing the term, see equation (2). The modified inverse document frequency is defined as:

$$MIDF(t) = 1 + \log\frac{n}{k_V}, \qquad (5)$$

where t is a term from the set of terms T, n is the count of all documents in the dataset and $k_V$ is the number of documents in which content visual blocks (those with a coefficient in vector V higher than zero) contain the term t at least once.

The resulting modified TF/IDF weight is obtained as the product of the modified term frequency and the inverse document frequency (standard or modified).
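Equation (5) and the final product can be sketched as follows. As in the earlier sketches, the document layout (lists of (block class, tokens) pairs), the natural logarithm and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def modified_idf(documents, coeffs):
    """Equation (5): 1 + log(n / k_V), counting only occurrences in content blocks."""
    n = len(documents)
    k_v = Counter()
    for blocks in documents:
        # each document contributes at most once per term
        content_terms = {t for c, toks in blocks if coeffs.get(c, 0) > 0 for t in toks}
        k_v.update(content_terms)
    return {t: 1 + math.log(n / k) for t, k in k_v.items()}

def mtf_midf(nmtf, midf):
    """Final weight: normalized modified TF multiplied by modified IDF."""
    return {t: w * midf[t] for t, w in nmtf.items() if t in midf}

# Hypothetical documents, invented for illustration
documents = [
    [("main text", ["vote"]), ("navigation", ["home"])],
    [("main text", ["goal"]), ("navigation", ["home", "vote"])],
]
midf = modified_idf(documents, {"main text": 2})
weights = mtf_midf({"vote": 1.5}, midf)
```

In the second document "vote" occurs only in a navigation block, so it does not count towards k_V and the term is treated as appearing in one of the two documents.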

V. EXPERIMENTAL RESULTS OF CLASSIFICATION

This section describes several variants of the modifications and their influence on classification accuracy.

The WEKA tool has been used for the experiments. We have chosen the four classifiers that gave the best results for our data: two Bayesian classifiers (Naïve Bayes, Bayes Net), a tree-based classifier (FT, Functional Trees) and a Support Vector Machines classifier (SMO, Sequential Minimal Optimization).

A. Description of Datasets Used for Experiments

Two datasets of web documents have been used for the experiments described here.

The first one, the freely available WebKB corpus of web pages, has been used to verify the functionality of the method. It contains 4518 web pages from computer science department websites, classified into six categories: course, department, faculty, project, staff and student.

The second dataset was created manually. It contains web pages taken from several English-language news websites (CNN.com, Reuters.com, nytimes.com, boston.com and usatoday.com). These pages have been manually annotated and categorized into six topics: politics, business, sport, art, health and science. In total, the dataset contains almost 500 pages, distributed approximately evenly among the six topics.

B. Experiments with the WebKB Dataset

The WebKB dataset can be used to verify the functionality of classification, but not to compare standard and modified term weighting, because it contains relatively old web pages with a minimum of non-content blocks. Most pages contain no navigation bars, links or advertisements. Most of their content is formed by the main text, headings and some date/authors information.

TABLE I. CLASSIFICATION RESULTS FOR WEBKB DATASET (ACCURACY, %)

Classifier      TF      TF/IDF  MTF/IDF
Naïve Bayes     68.8    74.8    78.6
Bayes Net       76.4    77.3    80.7
Funct. Trees    83.0    81.4    78.8
SMO             75.1    80.1    72.7



Therefore, the difference in classification accuracy between standard and modified weights is very small. This is caused by the fact that visual information has very little influence on the modified term weights for this dataset. The accuracy was approximately 80% for both weightings, as shown in Table I.

The modified TF/IDF weights are used with the following coefficients for visual blocks: main text: 2; headings: 5; links: 1; other blocks: 0.
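As an illustration of how these coefficients enter equation (3), consider a hypothetical term (the counts are invented for this example) that occurs once in a heading, three times in the main text, twice in links and four times in a navigation bar:

$$MTF(t,d) = 5 \cdot 1 + 2 \cdot 3 + 1 \cdot 2 + 0 \cdot 4 = 13$$

The plain frequency of the same term would be 10; the navigation bar occurrences contribute nothing, while the single heading occurrence contributes as much as five occurrences in links.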

C. Experiments with a Dataset of Pages from News Websites

The second dataset consists of web pages with many non-content blocks; therefore modified weighting is expected to have a greater influence on the accuracy of page classification.

First, it is necessary to compare classification accuracy between standard and modified term weighting to see the effect of the modifications. Table II shows the results with standard TF and TF/IDF weights and with the modified weight.

TABLE II. COMPARISON OF STANDARD AND MODIFIED WEIGHTING (ACCURACY, %)

Classifier      TF      TF/IDF  MTF/IDF
Naïve Bayes     86.7    80.9    86.1
Bayes Net       89.3    90.6    93.4
Funct. Trees    87.0    88.6    90.9
SMO             85.4    88.0    90.5



The setting of coefficients for the modified weights of visual blocks is the same as in the previous experiment.

As Table II shows, using modified weights leads to better classification accuracy for most classification methods; the exception is Naïve Bayes, which performs slightly better with TF weights (86.7% versus 86.1%).

The objective of the second experiment was to discover whether the visual blocks classified as “links” are worth including in the web document representation. This is controlled by the coefficient for the “links” category in the vector V. Three experiments with different values of this coefficient have been performed: 0 (excluded), 1 (low importance) and 5 (same as main text).

TABLE III. COMPARISON OF VARIOUS “LINKS” COEFFICIENT SETTINGS (ACCURACY, %)

Classifier      v_links = 0   v_links = 1   v_links = 5
Naïve Bayes     83.4          86.1          84.9
Bayes Net       87.8          93.4          92.3
Funct. Trees    83.1          90.9          88.2
SMO             80.0          90.5          90.4



The results lead to the conclusion that links are also useful for the representation of the whole web document. Including links with a small coefficient (v_links = 1) increases classification accuracy for all four classifiers, by 2.7 to 10.5 percentage points. On the other hand, raising the coefficient to 5 does not bring better classification results.

The third experiment focuses on the MaxFreq value, which can be computed in two different ways, see equations (1) and (4): for the whole document, or with respect to content blocks only. The setting of coefficients is again the same as in the first experiment.

TABLE IV. CLASSIFICATION RESULTS FOR DIFFERENT MAXFREQ (ACCURACY, %)

Classifier      MaxFreq   MaxFreq_V
Naïve Bayes     86.1      81.6
Bayes Net       93.4      92.2
Funct. Trees    90.9      86.0
SMO             90.5      89.2



The modification of maximum frequency does not improve classification accuracy. As the results in Table IV show, accuracy is worse with the modified MaxFreq value for all four classification methods.

The last experiment examined the influence of inverse document frequency and its modification on classification.

Three measurements were made: in the first one, only the modified TF weight was used (without IDF); in the second, a modified TF/IDF weight with standard IDF computation was used; in the last, a modified TF/IDF weight with modified IDF, see equation (5), was used. The same setting of coefficients for visual blocks as in the first experiment is used again.

TABLE V. COMPARISON OF STANDARD AND MODIFIED WEIGHTING (ACCURACY, %)

Classifier      MTF     MTF/IDF  MTF/MIDF
Naïve Bayes     88.6    86.1     85.3
Bayes Net       91.9    93.4     92.6
Funct. Trees    84.5    90.9     88.9
SMO             81.6    90.5     90.8



The results are better when IDF is used, in either variant (the Naïve Bayes method is the only exception). The difference between the two IDF weights is small, and the modified IDF brings no consistent improvement in accuracy.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented a new way of representing web page content based on visual features. Visual features are used to modify the term weights usually used to represent the text content of a document. The visual information is obtained by page rendering and segmentation. It is then used to express the significance of individual text terms on a page. This is achieved by various modifications of the standard TF/IDF term weights.

Several ways of modification have been proposed here. The experiments showed that web page classification improves with the use of these modifications. A comparison of all variants of the weighting modifications has also been presented.

In future research, we are going to join the process of visual block classification and text-based classification into one process of two-phase classification. This will allow the process to be made automatic. Another open issue is finding the optimal setting of coefficients for visual block significance; in this paper, we have presented only a few possible settings. The concept of modified term weights could also be useful for other web mining tasks, for example clustering of web pages, which can be used to find similar web pages within a dataset.

ACKNOWLEDGMENT

This research has been supported by the Research Plan No. MSM0021630528, “Security-Oriented Research in Information Technology”, and by the BUT FIT grant No. FIT-10-S-2, “Recognition and Presentation of Multimedia Data”.

REFERENCES

[1] G. Salton and C. Buckley: “Term weighting approaches in automatic text retrieval”. Information Processing and Management, Vol. 24, pp. 513–523, 1988.

[2] D. Mladenic: “Turning Yahoo into an automatic Web-page classifier”. In Proceedings of the European Conference on Artificial Intelligence (ECAI’98), pp. 473–474, 1998.

[3] K. Golub and A. Ardo: “Importance of HTML structural elements and metadata in automated subject classification”. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Lecture Notes in Computer Science, Vol. 3652, pp. 368–378, Springer, Berlin, Germany, 2005.

[4] O.-W. Kwon and J.-H. Lee: “Text categorization based on k-nearest neighbor approach for Web site classification”. Information Processing and Management, Vol. 39, Issue 1, pp. 25–44, Pergamon Press, Inc., 2003.

[5] V. Fresno and A. Ribeiro: “An analytical approach to concept extraction in HTML environments”. Journal of Intelligent Information Systems, Vol. 22, Number 3, pp. 215–235, Springer, 2004.

[6] S. Lee, M. Jung and E. Lee: “A novel Web page analysis method for efficient reasoning of user preference”. In Proceedings of the 8th Asia-Pacific Conference on Computer-Human Interaction 2008 (Seoul, Korea), Lecture Notes in Computer Science, Vol. 5068, pp. 86–93, Springer-Verlag, Berlin, Heidelberg, 2008.

[7] A. Schenker, M. Last, H. Bunke and A. Kandel: “Classification of Web documents using graph matching”. International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, pp. 475–496, 2004.

[8] A. Markov and M. Last: “A simple, structure-sensitive approach for Web document classification”. In Proceedings of the Third International Atlantic Web Intelligence Conference (AWIC 2005), Lodz, Poland, Lecture Notes in Computer Science, Vol. 3528, pp. 293–298, Springer, 2005.

[9] M. Kovacevic, M. Diligenti, M. Gori and V. Milutinovic: “Visual adjacency multigraphs - a novel approach for a Web page classification”. In Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM 2004), pp. 38–49, 2004.

[10] D. Cai, S. Yu, J. R. Wen and W. Y. Ma: “VIPS: a Vision-based page segmentation algorithm”. Microsoft Research, 2003.

[11] R. Burget: “Automatic document structure detection for data integration”. In Proceedings of Business Information Systems (BIS 2007), Lecture Notes in Computer Science, Vol. 4439, Poznan, Poland, pp. 391–397, 2007.

[12] R. Burget and I. Rudolfová: “Web page element classification based on visual features”. In Proceedings of the First Asian Conference on Intelligent Information and Database Systems (ACIIDS 2009), pp. 67–72, 2009.

[13] M. F. Porter: “An algorithm for suffix stripping”. Program, Vol. 14(3), pp. 130–137, 1980.


