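关关采集器 (GuanGuan Collector) collects mainly by regular expressions. Here is some of the regex notation it uses:
\d* matches digits (zero or more)
\s* matches whitespace, including line breaks
.+? matches any characters (must not be empty)
.* matches any characters (may be empty)
() marks the part we want to capture
((.|\n)*) the chapter body, line breaks included
=====杰奇 (Jieqi) equivalents=====
!!!! is equivalent to ([^><]*)
~~~~ is equivalent to ([^><'"]*)
^^^^ is equivalent to ([^><\d]*)
$$$$ is equivalent to ([\d]*)
**** is equivalent to (.*)
=====Other basics=====
. Matches any single character. For example, the regex r.t matches the strings rat, rut and r t, but not root.
$ Matches the end of a line. For example, the regex weasel$ matches the end of the string "He's a weasel", but not the string "They are a bunch of weasels."
^ Matches the start of a line. For example, the regex ^When in matches the start of the string "When in the course of human events", but not "What and When in the".
* Matches zero or more of the character immediately before it. For example, the regex .* matches any number of any characters.
\ The escape character. It makes the metacharacters listed here match as ordinary characters. For example, the regex \$ matches a dollar sign rather than the end of a line, and likewise \. matches a literal dot rather than acting as the any-character wildcard.
Universal image rule: <[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*>
Bonus: collection rules for 藏海阁文学网 (all text):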
<RuleConfigInfo xmlns:xsi="http://www./2001/XMLSchema-instance" xmlns:xsd="http://www./2001/XMLSchema">
  <RuleVersion> <RegexName /> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </RuleVersion>
  <RuleID> <RegexName>RuleID</RegexName> <Pattern>1</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </RuleID>
  <GetSiteName> <RegexName>GetSiteName</RegexName> <Pattern>藏海阁</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteName>
  <GetSiteCharset> <RegexName>GetSiteCharset</RegexName> <Pattern>utf-8</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteCharset>
  <GetSiteUrl> <RegexName>GetSiteUrl</RegexName> <Pattern>http://www./</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteUrl>
  <NovelSearchUrl> <RegexName>NovelSearchUrl</RegexName> <Pattern>http://www./Book/Search.aspx</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearchUrl>
  <NovelSearchData> <RegexName>NovelSearchData</RegexName> <Pattern>SearchKey={SearchKey}&SearchClass=1</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearchData>
  <NovelSearch_GetNovelKey> <RegexName>NovelSearch_GetNovelKey</RegexName> <Pattern><div id="CListTitle"><a href="/Book/(\d*)/Index.aspx" target="_blank"><b>{SearchKey}</b></a></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearch_GetNovelKey>
  <NovelListUrl> <RegexName>NovelListUrl</RegexName> <Pattern>http://www./type/1/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelListUrl>
  <NovelList_GetNovelKey> <RegexName>NovelList_GetNovelKey</RegexName> <Pattern><a href="http://www./books/(\d*)/" id=".+?" title=".+?">(.+?)</a></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelList_GetNovelKey>
  <NovelUrl> <RegexName>NovelUrl</RegexName> <Pattern>http://www./books/{NovelKey}/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelUrl>
  <NovelErr> <RegexName>NovelErr</RegexName> <Pattern>未找到该编号的书籍信息</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelErr>
  <NovelName> <RegexName>NovelName</RegexName> <Pattern><h1>(.+?)</h1></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelName>
  <NovelAuthor> <RegexName>NovelAuthor</RegexName> <Pattern>作者:(.+?)</span></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelAuthor>
  <LagerSort> <RegexName>LagerSort</RegexName> <Pattern>书籍类别:(.+?)</span></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </LagerSort>
  <SmallSort> <RegexName>SmallSort</RegexName> <Pattern>书籍类别:(.+?)</span></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </SmallSort>
  <NovelIntro> <RegexName>NovelIntro</RegexName> <Pattern><div>内容简介:((.|\n)*?)</div>\s*</li></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern><span(.|\n)+?</span>|<p>|<a.+?</a>|</div></FilterPattern> </NovelIntro>
  <NovelKeyword> <RegexName>NovelKeyword</RegexName> <Pattern><h1>(.+?)</h1></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelKeyword>
  <NovelDegree> <RegexName>NovelDegree</RegexName> <Pattern>连载状态:(.+?)</span></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelDegree>
  <NovelCover> <RegexName>NovelCover</RegexName> <Pattern><a class="pic"><img src="(.+?)"</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelCover>
  <NovelDefaultCoverUrl> <RegexName>NovelDefaultCoverUrl</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelDefaultCoverUrl>
  <NovelInfo_GetNovelPubKey> <RegexName>NovelInfo_GetNovelPubKey</RegexName> <Pattern>连载状态:(.+?)</span></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelInfo_GetNovelPubKey>
  <PubCookies> <RegexName>PubCookies</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubCookies>
  <PubIndexUrl> <RegexName>PubIndexUrl</RegexName> <Pattern>http://www./books/{NovelKey}/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubIndexUrl>
  <PubIndexErr> <RegexName>PubIndexErr</RegexName> <Pattern>这里必须填写</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubIndexErr>
  <PubVolumeContent> <RegexName>PubVolumeContent</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubVolumeContent>
  <PubVolumeSplit> <RegexName>PubVolumeSplit</RegexName> <Pattern><h3></Pattern> <Method>Spilt</Method> <Options>None</Options> <FilterPattern /> </PubVolumeSplit>
  <PubVolumeName> <RegexName>PubVolumeName</RegexName> <Pattern>Title">(.+?)</div></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern> </FilterPattern> </PubVolumeName>
  <PubChapterName> <RegexName>PubChapterName</RegexName> <Pattern><li><a href=" http://www./book/\d*/\d*/">([^<]+?)</a></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubChapterName>
  <PubChapter_GetChapterKey> <RegexName>PubChapter_GetChapterKey</RegexName> <Pattern><li><a href="( http://www./book/\d*/\d*/)">[^<]+?</a></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubChapter_GetChapterKey>
  <PubContentUrl> <RegexName>PubContentUrl</RegexName> <Pattern>{ChapterKey}</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentUrl>
  <PubContentErr> <RegexName>PubContentErr</RegexName> <Pattern>这里必须填写</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentErr>
  <PubContent_GetTextKey> <RegexName>PubContent_GetTextKey</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContent_GetTextKey>
  <PubTextUrl> <RegexName>PubTextUrl</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubTextUrl>
  <PubContentText> <RegexName>PubContentText</RegexName> <Pattern><div id="zjneirong" style="font-size:14px;width:100%;">((.|\n)+?)<hr</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern><div.+?>|<div>|</div>|<DIV.+?>|</DIV>|<script(.|\n)+?</script>|<style(.|\n)+?</style>|<a(.|\n)+?</a></FilterPattern> </PubContentText>
  <PubContentReplace> <RegexName>PubContentReplace</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentReplace>
  <PubContentImages> <RegexName>PubContentImages</RegexName> <Pattern><[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*></Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentImages>
</RuleConfigInfo>
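To see how a Pattern and a FilterPattern from the rules above might work together, here is a rough Python sketch. It assumes the collector keeps the first capture group of Pattern and then deletes everything FilterPattern matches; that behaviour is a guess from the rule layout, not something confirmed here. The rules are written for the collector's own regex engine, so Python's `re` is only a stand-in, and `apply_rule` and `SAMPLE_HTML` are made-up names for the example.

```python
import re
from typing import Optional

# Copied from the NovelIntro rule above.
PATTERN = r"<div>内容简介:((.|\n)*?)</div>\s*</li>"
FILTER_PATTERN = r"<span(.|\n)+?</span>|<p>|<a.+?</a>|</div>"


def apply_rule(html: str, pattern: str, filter_pattern: str = "") -> Optional[str]:
    """Return capture group 1 of `pattern` with `filter_pattern` matches removed.

    Assumption: this mirrors what the collector does with a rule entry.
    """
    match = re.search(pattern, html)
    if match is None:
        return None  # the page does not contain what the rule looks for
    text = match.group(1)
    if filter_pattern:
        # Delete every fragment the filter matches (ads, tags, links).
        text = re.sub(filter_pattern, "", text)
    return text.strip()


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = (
    "<li><div>内容简介:<span>广告</span><p>第一行\n"
    '第二行<a href="/x">链接</a></div>\n</li>'
)

intro = apply_rule(SAMPLE_HTML, PATTERN, FILTER_PATTERN)
print(intro)
```

Not every rule carries over to Python unchanged: the PubContentImages pattern uses a variable-length lookbehind, which Python's `re` module rejects, so it would have to be rewritten with an ordinary capture group first.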
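The 杰奇 (Jieqi) placeholder table earlier in this post can also be tried out in code. The sketch below swaps each placeholder for the regex listed in that table and escapes everything else as literal text. Escaping the literal parts is my own assumption about how a 杰奇-style rule should be read, and `jieqi_to_regex`, the sample rule and the sample HTML are invented for the example.

```python
import re

# Placeholder -> regex, exactly as listed in the table.
JIEQI_TO_REGEX = {
    "!!!!": r"([^><]*)",
    "~~~~": r"""([^><'"]*)""",
    "^^^^": r"([^><\d]*)",
    "$$$$": r"([\d]*)",
    "****": r"(.*)",
}


def jieqi_to_regex(rule: str) -> str:
    """Expand placeholders in `rule`; treat all other text as literal."""
    # The outer group makes re.split keep the placeholders in the result.
    splitter = "(" + "|".join(re.escape(p) for p in JIEQI_TO_REGEX) + ")"
    parts = re.split(splitter, rule)
    return "".join(
        JIEQI_TO_REGEX[part] if part in JIEQI_TO_REGEX else re.escape(part)
        for part in parts
    )


# Hypothetical rule and page fragment.
rule = '<a href="/Book/$$$$/Index.aspx">!!!!</a>'
html = '<a href="/Book/123/Index.aspx">某书</a>'

found = re.search(jieqi_to_regex(rule), html)
print(found.groups() if found else None)
```

Matching on the result rather than comparing regex strings keeps the example independent of how `re.escape` happens to escape spaces and punctuation.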