1. Put HanLP's data directory (dictionaries and models) on HDFS, then set the root path in the project's hanlp.properties accordingly, for example:

       root=hdfs://localhost:9000/tmp/

2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface so that HanLP loads its resources through the Hadoop FileSystem API:

       // Reads and writes HanLP resources via HDFS instead of the local file system.
       public static class HadoopFileIoAdapter implements IIOAdapter {
           @Override
           public InputStream open(String path) throws IOException {
               Configuration conf = new Configuration();
               FileSystem fs = FileSystem.get(URI.create(path), conf);
               return fs.open(new Path(path));
           }

           @Override
           public OutputStream create(String path) throws IOException {
               Configuration conf = new Configuration();
               FileSystem fs = FileSystem.get(URI.create(path), conf);
               return fs.create(new Path(path));
           }
       }

3. Register the IOAdapter and create the segmenter:

       private static Segment segment;
       static {
           HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
           segment = new CRFSegment();
       }

After that, segment can be used for word segmentation inside Spark operations (a minimal end-to-end sketch follows below).

Source: 云聪's blog.
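For concreteness, here is a minimal sketch of a Spark driver that ties the three steps together; it is an illustration added here, not code from the original post. The class name WordSegmentJob and the HDFS input/output paths are hypothetical placeholders, and the sketch assumes the HadoopFileIoAdapter and the static segment field from steps 2 and 3 are declared in this same class, so each executor initializes its own segmenter when the class is loaded rather than having one serialized from the driver. It also assumes spark-core, hadoop-client, and hanlp are on the classpath.

       import java.util.stream.Collectors;

       import org.apache.spark.SparkConf;
       import org.apache.spark.api.java.JavaRDD;
       import org.apache.spark.api.java.JavaSparkContext;

       import com.hankcs.hanlp.HanLP;
       import com.hankcs.hanlp.seg.Segment;
       import com.hankcs.hanlp.seg.CRF.CRFSegment;   // package location may differ across HanLP versions

       public class WordSegmentJob {   // hypothetical class name

           // Step 2: the HadoopFileIoAdapter class shown above goes here.

           // Step 3: register the adapter and build the segmenter once per JVM.
           private static Segment segment;
           static {
               HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
               segment = new CRFSegment();
           }

           public static void main(String[] args) {
               JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hanlp-hdfs-segment"));

               // Placeholder paths; any HDFS text input works.
               JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/input/docs.txt");

               // Referencing the static `segment` field inside the lambda avoids shipping the
               // segmenter from the driver: each executor runs the static initializer when it
               // loads this class. segment.seg(...) returns a List<Term>; join the words with spaces.
               JavaRDD<String> segmented = lines.map(line ->
                       segment.seg(line).stream()
                              .map(term -> term.word)
                              .collect(Collectors.joining(" ")));

               segmented.saveAsTextFile("hdfs://localhost:9000/output/segmented");
               sc.stop();
           }
       }

The static initializer doubles as a per-executor setup hook: because the lambda only references a static field, nothing HanLP-specific needs to be serialized with the task, which is why step 3 wires the adapter and segmenter up in a static block.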