Text-Based Web Page Classification with Use of Visual Information
Vladimír Bartík
Dept. of Information Systems, Faculty of Information Technology
Brno University of Technology
Brno, Czech Republic
e-mail: bartik@fit.vutbr.cz
Abstract—As the number of pages on the web is constantly increasing, there is a need to classify pages into categories to facilitate indexing or searching them. In the method proposed here, we use both textual and visual information to find a suitable representation of web page content. Several term weights based on TF or TF-IDF weighting are proposed. The modifications are based on the visual areas in which the text appears and on their visual properties. Results of experiments are included in the final part of the paper.
Keywords: web page classification, term weights, text classification, TF-IDF weight, visual information, visual blocks.
I. INTRODUCTION
As the amount of information provided by the World Wide Web (WWW) is constantly increasing, there is a need to obtain useful knowledge from the WWW. Web page classification, also known as web page categorization, is the process of assigning web pages to one of several predefined classes. Classification of web pages is essential to many web information retrieval tasks, such as indexing of pages, improving web search, or constructing web directories.
In the beginning, classification methods were applied primarily to structured databases. To classify semi-structured data in the form of web pages, we have to find a representation of web page content that is suitable for classification methods. There are two main types of information contained on a web page: the visual structure formed by the HTML code, which implicitly holds information about the visual blocks on a page, and unstructured information in the form of text.
Contemporary methods for classification of Web data work mainly with the text information present on a Web page. These text-based classification methods usually use the bag-of-words representation to represent the contents of a document. In this case, a document is represented by a vector of TF/IDF weights assigned to individual terms.
However, this representation of a document does not capture visual information. On the other hand, some classification methods based on web page structure have been proposed. In this case, the structure of a document, along with pictures, is considered as input information for classification algorithms, but text information is not taken into consideration.
Because text plays a crucial role in representing the content of most web pages, it is appropriate to enrich the text representation with visual information. It is possible to use information from HTML tags to improve term weighting. This allows capturing several properties of text, such as font size or text color, and using them to modify the weights of text terms. However, this method does not reflect other visual properties of a web page, such as the layout or the location of an element.
To capture this type of visual information, we can use page segmentation methods. Segmentation is defined as a process of detecting the organization of visual blocks on the page and analyzing the properties of the component visual elements. Data obtained by segmentation can also be used to improve the web page representation. Segmentation algorithms usually work with rendered web documents and their visual representation.
In this paper, we introduce a new way of modifying term weighting with visual information. A segmentation algorithm is used to capture the visual blocks that form the web page. Then, we are able to classify visual blocks into predefined categories, e.g. heading, main text of a page, advertisement or navigation. These types of visual blocks have different importance in the representation of a page. This importance should be taken into account in the classification of whole web pages. Therefore, we present modifications of weights capturing this information.
First, we very briefly describe the way visual blocks are classified. Next, modifications of term weighting are introduced, and some results of web page classification experiments are described in the final part of this paper.
II. RELATED WORK
A. Term Weighting for Classification
Classification based on text extracted from pages is the most common way to classify web pages. The basic methods are based on a bag-of-words representation with TF or TF/IDF weights [1]. TF means the term frequency in a document. It should be normalized, for example as:
$$\mathrm{TF}(t, d) = 0.5 + 0.5 \cdot \frac{tf(t, d)}{\mathit{MaxFreq}(d)}, \qquad (1)$$
2010 International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4138-9/10 $26.00 © 2010 IEEE
DOI 10.1109/ASONAM.2010.34
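where tf(t, d) is the frequency of the term t in the document d and MaxFreq(d) is the maximum frequency of any term in that document. It is also necessary to consider the inverse document frequency (IDF), which represents the general importance of a term among all documents. The resulting TF/IDF weight is obtained as: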
$$\mathrm{TF/IDF}(t, d) = \mathrm{TF}(t, d) \cdot \left(1 + \log\frac{n}{k}\right), \qquad (2)$$
where n is the count of all documents in the dataset and k is the number of documents containing the term t.
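Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the token-list input format and the use of the natural logarithm are assumptions, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter


def normalized_tf(tokens):
    """Equation (1): TF(t, d) = 0.5 + 0.5 * tf(t, d) / MaxFreq(d)."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())  # MaxFreq(d)
    return {t: 0.5 + 0.5 * c / max_freq for t, c in counts.items()}


def tf_idf(docs):
    """Equation (2): TF/IDF(t, d) = TF(t, d) * (1 + log(n / k)).

    docs is a list of token lists; the natural logarithm is an assumption.
    """
    n = len(docs)
    # k for each term: the number of documents that contain it
    doc_freq = Counter(t for tokens in docs for t in set(tokens))
    weights = []
    for tokens in docs:
        tf = normalized_tf(tokens)
        weights.append(
            {t: w * (1 + math.log(n / doc_freq[t])) for t, w in tf.items()}
        )
    return weights


# Hypothetical two-document corpus
docs = [["web", "page", "web"], ["page", "text"]]
weights = tf_idf(docs)
```

In this toy corpus the term "page" occurs in both documents, so its IDF factor is 1 and its weight equals its normalized TF.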
Another frequently used text representation is the N-gram representation [2], which allows representing a document with terms consisting of more than one word.
As mentioned above, the importance of various visual blocks is different; therefore it is necessary to reflect it in the document representation. In [3], the use of information derived from the HTML tags of a page for classification is proposed. A similar method, in which the HTML tags are divided into three groups with different importance of terms in each group, is described in [4]. The main disadvantage of these two approaches is the fact that various web pages are built differently and the same information on a page is often represented in different ways in HTML.
The method proposed in [5] takes four aspects of term frequency into consideration: term frequency, term frequency in headings, frequency of emphasized words in text, and a word position function assuming that the first and last quarters of the text are the most relevant. These four aspects are combined to form the resultant term weights.
B. Other Methods for Document Representation
In [6], TF/IDF weights are replaced by keywords extracted from informative web page blocks. Informative blocks are discovered via the DOM tree of a page, using the assumption that non-content blocks appear repeatedly in a DOM tree.
Methods that use a graph representation instead of weight vectors have also been proposed. This graph representation can also keep the structural information about the document. If it is possible to compute the similarity between two graphs, some lazy classification algorithms (e.g. k-nearest neighbor) can be used for classification [7]. A combination of graph and vector representation is proposed in [8]. Graphs are processed via a frequent sub-graph mining method. The frequent sub-graphs obtained become attributes for the vector representation.
A method that represents a web document according to its visual properties is proposed in [9]. Here, the visual adjacency multigraph representation is presented. It is able to represent information about the mutual position of visual parts on a page and the contents of the component parts.
C. Segmentation of Web Pages
To perform segmentation of a web document, it is necessary to render the document and obtain information about the visual areas of the rendered document (not about HTML tags). Several segmentation algorithms have been proposed. Probably the best-known segmentation algorithm is VIPS, presented in [10]. Its result is a tree of visual areas independent of HTML tags. A document is divided into visual blocks based on its visual cues, such as different fonts or colors, lines and other separators.
Another segmentation method, which is used in the method described here, is proposed in [11]. This method uses a bottom-up approach to find visual areas. The result is a set of rectangular visual blocks visually separated from the remainder of the web page.
III. ROLE OF VISUAL INFORMATION IN WEB PAGE CLASSIFICATION
As mentioned above, visual information about web page areas could be used to improve web page classification. If the visual information is considered, it is necessary to distinguish between content and non-content blocks.
If we are able to obtain the visual properties and position of each visual area automatically, it is possible to determine the importance of each visual area. This is very important because most of the space in a web document is occupied by information that is not relevant to the main topic of the document and should not be considered in the document representation. Examples of unimportant areas of a page are advertisements, navigation bars, links to other pages, and copyright information.
A. Classification of Web Page Visual Blocks
According to these remarks, to obtain a good document representation resulting in accurate classification of pages, we have to distinguish between the following types of visual areas:
• Headings: include the main heading and subheadings contained on the page. They are typically characterized by a font size larger than that of the rest of the page.
• Main text: the main content of the page, the most important information, which should be included in the representation of a document.
• Date/authors: includes information about the web page authors or the date of page creation; unimportant information for further processing.
• Navigation bar: links to other parts of a website, typically constant for all pages of a site. It is also unimportant for page representation.
• Links: links to other pages related to the current page; some of the links can be irrelevant.
• Others: other parts of a page, e.g. the caption of a whole page or advertisements; also unimportant.
If we have the necessary information about the visual properties and position of a block, the categories described above can be assigned to page blocks via classification. A detailed description of this classification approach, the visual properties used to classify web page areas, and detailed results of experiments are given in [12].
B. Role of Categories of Visual Blocks in Page Classification
If we have information about the category of each web page visual block obtained by classification, it is possible to use it to enrich the standard TF/IDF scheme. The text terms contained in the most important parts are given the highest weight (a standard weight multiplied by a coefficient), the terms in less important parts have a standard weight, and the least important areas are omitted from the representation of document contents. Then, a classification method can be used to assign a category to whole pages.
In the next section, several variants of the TF/IDF scheme are described in detail. They differ primarily in the coefficients used to express the weight of the constituent visual block categories.
IV. WEB PAGE REPRESENTATION WITH MODIFIED WEIGHTS
A. Text Preprocessing
Before we can count terms and assign weights to them, it is necessary to perform stop word removal and stemming.
Stop word removal ensures that non-content words appearing in most text documents, such as “the”, “in” or “this”, are removed. The main purpose of stop word removal is to reduce the number of index terms and keep the words that are important for representing the contents of a document.
Stemming is a process of reducing words to their stems (root forms). The reason to use stemming is to unify words with similar meaning into one index term. It is carried out by a set of rules that remove prefixes and suffixes of words. In our implementation, the Porter stemmer [13] has been used.
After that, we remove words that appear in a very small number of documents, because they are also not important for document classification.
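The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word list is a tiny sample, the threshold min_doc_count is a hypothetical parameter (the text does not state the cut-off used), and crude_stem is a toy suffix stripper standing in for the Porter stemmer [13], not an implementation of it.

```python
from collections import Counter

# Illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "in", "this", "a", "of"}


def crude_stem(word):
    """Toy suffix stripper used as a stand-in (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, min_doc_count=2, stem=crude_stem):
    """Lowercase, drop stop words, stem, then drop rare terms.

    min_doc_count is an assumed threshold: terms occurring in fewer
    documents than this are removed.
    """
    stemmed = [
        [stem(w) for w in (tok.lower() for tok in doc) if w not in STOP_WORDS]
        for doc in docs
    ]
    # Number of documents in which each stemmed term occurs
    doc_freq = Counter(t for doc in stemmed for t in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_doc_count] for doc in stemmed]


# Hypothetical tokenized documents
docs = [
    ["The", "pages", "linked", "in", "this", "site"],
    ["A", "page", "linking", "news"],
    ["Links", "of", "the", "page"],
]
result = preprocess(docs)
```

Passing a real Porter stemmer as the stem argument would replace the toy function without changing the rest of the pipeline.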
B. Modifications of Standard Weights
In this subsection, various possibilities to modify the TF, IDF and TF/IDF weights according to knowledge about the categories of the visual blocks to which the parts of text belong are described. First, the possible modifications of the TF weight are summarized. The ways of modifying TF are the following:
• We can set the weight of a term according to the visual block in which it appears. For example, a word in a “heading” block can have higher importance than a word from “links”. This is ensured by multiplying the weight by a coefficient set for each category.
• Some parts of a web page can be omitted from weighting. There are some categories whose content is not related to the topic of the whole page. It is necessary to choose these non-content categories, for example navigation bars or date/authors.
• It is also possible to modify the equation for normalized term frequency, see equation (1). It contains the MaxFreq value, the maximum term frequency in a document. We can now decide whether this value will be computed for the whole document consisting of all visual blocks or only for the content parts. The first way better reflects the whole size of a page; the second one reflects the length of the main text content on a page.
• It is not certain whether blocks of the category “links” should be omitted from the representation. If these blocks are omitted, we can lose some information about links to related pages. On the other hand, many links usually refer to irrelevant pages.
According to the remarks above, we can define a modified term frequency to represent the text contents of a web document.
Assume that we have a set of web documents $D = \{d_1, \dots, d_n\}$ and a set of terms (words) $T = \{t_1, \dots, t_m\}$, which occur in documents from D.
After rendering the documents from D, we can divide each document into visual blocks, which can be classified into several classes. Let us denote the vector of class labels as $C = (c_1, \dots, c_k)$. Each class is assigned a coefficient according to the significance of the corresponding visual blocks. This is denoted as a vector of coefficients $V = (v_1, \dots, v_k)$, where $v_j$ is the coefficient of the corresponding class of visual blocks $c_j$.
Then, the modified term frequency of a term t ∈ T in a web document d ∈ D is defined as:
$$\mathrm{MTF}(t, d) = \sum_{i=1}^{k} F(t, d, c_i) \cdot v_i, \qquad (3)$$
where $F(t, d, c_i)$ is the frequency of term t in all blocks of class $c_i$ in a document d. The resultant term weight is obtained as the sum of the weights for all component visual blocks.
This modified term frequency should also be normalized, as in equation (1). The normalized modified term frequency is defined as:
$$\mathrm{nMTF}(t, d) = 0.5 + 0.5 \cdot \frac{\mathrm{MTF}(t, d)}{\mathit{MaxFreq}_V(d)}, \qquad (4)$$
where $\mathit{MaxFreq}_V(d)$ is the maximum frequency of any term in the content parts of a document. The content of visual blocks classified into classes whose coefficient in vector V is equal to zero is not considered when computing the $\mathit{MaxFreq}_V$ value.
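Equations (3) and (4) can be sketched as below. The data layout (a document as a list of (class label, tokens) pairs), the function names and the coefficient values are assumptions made for this illustration. The sketch reads the maximum frequency as a raw term count, restricted to content blocks when content_only is true; other readings of equation (4) are possible.

```python
from collections import Counter


def modified_tf(blocks, coeffs):
    """Equation (3): MTF(t, d) = sum over classes of F(t, d, c_i) * v_i.

    blocks: list of (class_label, tokens) pairs for one document.
    coeffs: dict mapping a class label c_i to its coefficient v_i.
    """
    mtf = Counter()
    for label, tokens in blocks:
        v = coeffs.get(label, 0)
        if v == 0:
            continue  # non-content block: omitted from the representation
        for t in tokens:
            mtf[t] += v
    return dict(mtf)


def max_freq(blocks, coeffs, content_only=True):
    """Maximum raw term frequency, over content blocks only (MaxFreq_V)
    or over the whole document (MaxFreq)."""
    counts = Counter()
    for label, tokens in blocks:
        if content_only and coeffs.get(label, 0) == 0:
            continue
        counts.update(tokens)
    return max(counts.values(), default=0)


def normalized_mtf(blocks, coeffs, content_only=True):
    """Equation (4): nMTF(t, d) = 0.5 + 0.5 * MTF(t, d) / MaxFreq_V(d)."""
    m = max_freq(blocks, coeffs, content_only)
    if m == 0:
        return {}
    return {t: 0.5 + 0.5 * w / m for t, w in modified_tf(blocks, coeffs).items()}


# Hypothetical document and coefficients
blocks = [
    ("heading", ["web", "mining"]),
    ("main_text", ["web", "page", "web"]),
    ("navigation", ["home", "web", "home", "home", "home"]),
]
coeffs = {"heading": 3.0, "main_text": 1.0, "navigation": 0.0}
nmtf = normalized_mtf(blocks, coeffs)
```

Under this reading, a term that is frequent in highly weighted blocks can receive a normalized weight above 1, because the numerator is coefficient-weighted while the denominator is a raw count.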
This kind of information can also be used during the preprocessing of documents. We can remove the words that occur only in non-content blocks from the representation and thus reduce the size of the weight vectors before the representation is created.
C. Modification of Inverse Document Frequency
The modification of inverse document frequency is similar to the modifications of term frequency. In this case, the categories of visual blocks are also considered when determining the number of documents containing the term, see equation (2). The modified inverse document frequency is defined as:
$$\mathrm{MIDF}(t) = 1 + \log\frac{n}{k_V}, \qquad (5)$$
where t is a term from the set of terms T, n is the count of all documents in the dataset and $k_V$ is the number of documents in which content visual blocks (those having a coefficient in vector V higher than zero) contain the term t at least once.
The resulting modified TF/IDF weight is obtained as the product of the modified term frequency and the modified inverse document frequency.
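Equation (5) can be sketched in the same style. The data layout (each document as a list of (class label, tokens) pairs), the function name and the natural logarithm are assumptions for this illustration.

```python
import math


def modified_idf(docs, coeffs):
    """Equation (5): MIDF(t) = 1 + log(n / k_V).

    docs: list of documents, each a list of (class_label, tokens) pairs.
    coeffs: dict mapping a class label to its coefficient in V.
    k_V counts only documents whose content blocks (coefficient > 0)
    contain the term at least once.
    """
    n = len(docs)
    k_v = {}
    for blocks in docs:
        content_terms = set()
        for label, tokens in blocks:
            if coeffs.get(label, 0) > 0:
                content_terms.update(tokens)
        for t in content_terms:
            k_v[t] = k_v.get(t, 0) + 1
    return {t: 1 + math.log(n / k) for t, k in k_v.items()}


# Hypothetical corpus: "web" appears in document 2 only inside navigation
docs = [
    [("main_text", ["web", "page"]), ("navigation", ["home"])],
    [("main_text", ["page"]), ("navigation", ["home", "web"])],
]
coeffs = {"main_text": 1, "navigation": 0}
midf = modified_idf(docs, coeffs)
```

Here the occurrence of "web" in the navigation bar of the second document is ignored, so the term counts as appearing in one document only and gets a higher MIDF than the standard IDF would give it. The final weight of a term is then the product of its normalized modified term frequency and this value.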
V. EXPERIMENTAL RESULTS OF CLASSIFICATION
Several variants of the modifications and their influence on the accuracy of classification are described in this section.
The WEKA tool has been used for the experiments. We have chosen four classifiers that give the best results for our data: two Bayesian classifiers (Naïve Bayes, BayesNet), a tree-based classifier (FT, Functional Trees) and a Support Vector Machines classifier (SMO, Sequential Minimal Optimization).
A. Description of Datasets Used for Experiments
Two datasets containing web documents have been used for the experiments described here.
The first of them, the freely available WebKB corpus of web pages, has been used to verify the functionality of the method. It contains 4518 web pages from computer science department websites. They are classified into six categories: course, department, faculty, project, staff and student.
The second dataset was created manually. It contains web pages taken from several English-language news websites (CNN.com, Reuters.com, nytimes.com, boston.com and usatoday.com). These pages have been manually annotated. They are categorized into six topics: politics, business, sport, art, health and science. In total, the dataset contains almost 500 pages, approximately evenly distributed among the six topics.
B. Experiments with the Web-KB Dataset
The Web-KB dataset can be used to verify the functionality of classification, but not to compare standard term weighting and modified weights, because this dataset contains relatively old web pages with a minimum of non-content blocks. On most pages, no navigation bars, links or advertisements are present. Most of the content is formed by the main text, headings and some date/authors information.
TABLE I. CLASSIFICATION RESULTS FOR WEB-KB DATASET
(classification accuracy in %)

Classifier      TF      TF/IDF   MTF/IDF
Naïve Bayes     68.8    74.8     78.6
BayesNet        76.4    77.3     80.7
Funct. Trees    83.0    81.4     78.8
SMO             75.1    80.1     72.7
Therefore, the difference in classification accuracy between standard and modified weights is very small. This is caused by the fact that visual information has very little influence on the modified weighting of terms. The accuracy was approximately 80% for both weightings, as shown in Table I.
The modified TF/IDF weights are used with the following coefficients for visual blocks: main text: 2; headings: 5; links: 1; other blocks: 0.
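As a hypothetical illustration of these coefficients, assume that a term occurs once in a heading block, three times in the main text, twice in link blocks and four times in a navigation bar, which falls under the other blocks with coefficient 0. The counts are invented for this example. Equation (3) then gives:

$$\mathrm{MTF}(t, d) = 1 \cdot 5 + 3 \cdot 2 + 2 \cdot 1 + 4 \cdot 0 = 13,$$

whereas the plain frequency of the term over all blocks would be 10. The single occurrence in the heading contributes more than the two occurrences in links, and the navigation bar contributes nothing.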
C. Experiments with a Dataset of Pages from News Websites
The second dataset consists of web pages with a lot of non-content blocks; therefore it is expected that modified weighting will have a higher influence on the accuracy of page classification.
First, it is necessary to compare classification accuracy between standard and modified term weighting to see the effect of the modifications. Table II shows the results with standard TF and TF/IDF weights and with the modified weight.
TABLE II. COMPARISON OF STANDARD AND MODIFIED WEIGHTING
(classification accuracy in %)

Classifier      TF      TF/IDF   MTF/IDF
Naïve Bayes     86.7    80.9     86.1
BayesNet        89.3    90.6     93.4
Funct. Trees    87.0    88.6     90.9
SMO             85.4    88.0     90.5
The setting of coefficients for modified weights of visual blocks is the same as in the previous experiment.
As can be seen from Table II, using modified weights leads to better classification accuracy for most classification methods, except for Naïve Bayes, which performs slightly better with TF weights.
The objective of the second experiment was to discover whether the visual blocks classified as “links” are useful to include in the web document representation. This is realized by setting the coefficient for the “links” category in the vector V. Three experiments with different values of the coefficient for the “links” category have been performed. It was set to 0 (excluded), 1 (low importance) and 5 (same as main text).
TABLE III. COMPARISON OF VARIOUS “LINKS” COEFFICIENT SETTINGS
(classification accuracy in %)

Classifier      v_links = 0   v_links = 1   v_links = 5
Naïve Bayes     83.4          86.1          84.9
BayesNet        87.8          93.4          92.3
Funct. Trees    83.1          90.9          88.2
SMO             80.0          90.5          90.4
The results lead to the conclusion that links are also useful for the representation of the whole web document. Using links with a small coefficient causes a small increase in classification accuracy. On the other hand, increasing the coefficient further does not bring better classification results.
The third experiment is focused on the MaxFreq value, which can be computed in two different ways, see equations (1) and (4). The maximum frequency can be computed for the whole document or with respect to content blocks only. The setting of coefficients is again the same as in the first experiment.
TABLE IV. CLASSIFICATION RESULTS FOR DIFFERENT MAXFREQ
(classification accuracy in %)

Classifier      MaxFreq   MaxFreq_V
Naïve Bayes     86.1      81.6
BayesNet        93.4      92.2
Funct. Trees    90.9      86.0
SMO             90.5      89.2
The modification of maximum frequency does not improve classification accuracy. As can be seen from the results shown in Table IV, the accuracy is worse for the modified MaxFreq value with all four classification methods.
In the last experiment, the influence of inverse document frequency and its modification on classification was examined.
Three measurements were carried out: in the first one, only the modified TF weight was used (without IDF); then, a modified TF/IDF weight with the standard IDF computation was used; finally, modified TF/IDF with the modified IDF weight, see equation (5), was used. The same setting of coefficients for visual blocks as in the first experiment is used again.
TABLE V. COMPARISON OF STANDARD AND MODIFIED WEIGHTING
(classification accuracy in %)

Classifier      MTF     MTF/IDF   MTF/MIDF
Naïve Bayes     88.6    86.1      85.3
BayesNet        91.9    93.4      92.6
Funct. Trees    84.5    90.9      88.9
SMO             81.6    90.5      90.8
The results are better with the use of either IDF weight (the Naïve Bayes method is the only exception). The difference between the two IDF weights is small; the modified IDF does not bring any clear improvement in accuracy.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have presented a new way of web page content representation based on visual features. Visual features are used to modify the term weights usually used to represent the text content of a document. The visual information is obtained by page rendering and segmentation. Then, it is used to express the significance of the component text terms on a page. This is achieved by various modifications of standard TF/IDF term weights.
Several ways of modification have been proposed here. The experiments then showed an improvement in web page classification with the use of these modifications. A comparison of all variants of the weighting modifications has also been presented.
In future research, we are going to join the process of visual block classification and text-based classification into one process of two-phase classification. This will allow making the process automatic. There is also the issue of finding the optimal setting of coefficients for visual block significance. In this paper, we have only presented a few possible settings of these coefficients. The concept of modified term weights could also be useful for other web mining tasks, for example clustering of web pages, which can be used to find similar web pages within some dataset.
ACKNOWLEDGMENT
This research has been supported by the Research Plan No. MSM0021630528, “Security-Oriented Research in Information Technology”, and by the BUT FIT grant No. FIT-10-S-2, “Recognition and Presentation of Multimedia Data”.
REFERENCES
[1] G. Salton and C. Buckley: “Term weighting approaches in automatic text retrieval”. Information Processing and Management, Vol. 24, 1988, pp. 513–523.
[2] D. Mladenic: “Turning Yahoo into an automatic Web-page classifier”. In Proceedings of the European Conference on Artificial Intelligence (ECAI’98), pp. 473–474, 1998.
[3] K. Golub and A. Ardo: “Importance of HTML structural elements and metadata in automated subject classification”. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005). Lecture Notes in Computer Science, vol. 3652, pp. 368–378, Springer, Berlin, Germany, 2005.
[4] O.-W. Kwon and J.-H. Lee: “Text categorization based on k-nearest neighbor approach for Web site classification”. Information Processing and Management, Vol. 39, Issue 1, pp. 25–44, Pergamon Press, Inc., 2003.
[5] V. Fresno and A. Ribeiro: “An analytical approach to concept extraction in HTML environments”. Journal of Intelligent Information Systems, Vol. 22, Number 3, pp. 215–235, Springer, 2004.
[6] S. Lee, M. Jung and E. Lee: “A novel Web page analysis method for efficient reasoning of user preference”. In Proceedings of the 8th Asia-Pacific Conference on Computer-Human Interaction 2008 (Seoul, Korea). Lecture Notes in Computer Science, vol. 5068. Springer-Verlag, Berlin, Heidelberg, pp. 86–93, 2008.
[7] A. Schenker, M. Last, H. Bunke and A. Kandel: “Classification of Web documents using graph matching”. International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, pp. 475–496, 2004.
[8] A. Markov and M. Last: “A simple, structure-sensitive approach for Web document classification”. In Proceedings of the Third International Atlantic Web Intelligence Conference, AWIC 2005, Lodz, Poland, Lecture Notes in Computer Science, Vol. 3528, Springer, pp. 293–298, 2005.
[9] M. Kovacevic, M. Diligenti, M. Gori and V. Milutinovic: “Visual adjacency multigraphs - a novel approach for a Web page classification”. In Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM 2004), pp. 38–49, 2004.
[10] D. Cai, S. Yu, J. R. Wen and W. Y. Ma: “VIPS: a Vision-based page segmentation algorithm”. Microsoft Research, 2003.
[11] R. Burget: “Automatic document structure detection for data integration”. In Proceedings of Business Information Systems (BIS 2007), Lecture Notes in Computer Science, Vol. 4439, Poznan, Poland, pp. 391–397, 2007.
[12] R. Burget and I. Rudolfová: “Web page element classification based on visual features”. In Proceedings of the First Asian Conference on Intelligent Information and Database Systems (ACIIDS 2009), pp. 67–72, 2009.
[13] M. F. Porter: “An algorithm for suffix stripping”. Program, Vol. 14(3), pp. 130–137, 1980.