1. Spark Cluster Environment
The Spark installation and configuration follow 《Spark 安装》 (Spark Installation). This environment uses six workstations, laid out as follows:
| No. | Hostname | IP | Role |
| --- | --- | --- | --- |
| 1 | bdml-c01 | 192.168.200.170 | client |
| 2 | bdml-m01 | 192.168.200.171 | namenode / resourcemanager / master |
| 3 | bdml-s01 | 192.168.200.172 | datanode / nodemanager / worker |
| 4 | bdml-s01 | 192.168.200.173 | datanode / nodemanager / worker |
| 5 | bdml-s01 | 192.168.200.174 | datanode / nodemanager / worker |
| 6 | bdml-s01 | 192.168.200.175 | datanode / nodemanager / worker |
The TensorFlowOnSpark installation follows 《Getting Started TensorFlowOnSpark on Hadoop Cluster》. That guide is somewhat misleading: it led me to set up a dedicated virtual machine just to compile TensorFlow, when in fact no compilation is needed at all unless you require the RDMA feature. Setting up Google's Bazel build environment to compile TensorFlow cost me quite a bit of time.
2. Software Versions
RedHat 7.2 / CentOS 7.2
Hadoop 2.6.0
Spark 1.6.0
Scala 2.10.6
Python 2.7.12
TensorFlow 1.0.1
TensorFlowOnSpark master
3. Installation
The biggest pitfall during installation is version conflicts. After several attempts the versions finally matched; doing a clean re-install and writing the steps down made everything straightforward.
1) Software list:
Python-2.7.12.tgz
setuptools-23.0.0.tar.gz
pip-8.1.2.tar.gz
pbr-0.11.0-py2.py3-none-any.whl
funcsigs-1.0.2-py2.py3-none-any.whl
six-1.10.0-py2.py3-none-any.whl
mock-2.0.0-py2.py3-none-any.whl
protobuf-3.1.0.post1-py2.py3-none-any.whl
pydoop-1.2.0.tar.gz
numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
scipy-0.17.1-cp27-cp27mu-manylinux1_x86_64.whl
wheel-0.29.0-py2.py3-none-any.whl
tensorflow-1.0.0-cp27-none-linux_x86_64.whl
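Most of the packages above are wheels, and a wheel's compatibility can be read straight off its filename, which follows the PEP 427 pattern name-version-pythontag-abitag-platformtag. A small parsing sketch (illustrative only, not a full PEP 427 implementation):

```python
# Parse a wheel filename into its PEP 427 components (illustrative sketch).
def parse_wheel(filename):
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the python/abi/platform tags.
    name_version, py_tag, abi_tag, plat_tag = stem.rsplit("-", 3)
    name, version = name_version.split("-", 1)
    return {"name": name, "version": version,
            "python": py_tag, "abi": abi_tag, "platform": plat_tag}

info = parse_wheel("numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl")
# cp27/cp27mu means CPython 2.7 built with wide unicode (--enable-unicode=ucs4),
# which is exactly why the interpreter below is configured with ucs4.
```

If a wheel's abi tag reads `cp27mu` but the interpreter was built without ucs4, pip will refuse to install it, which is a common source of the version conflicts mentioned above.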
2) Build and install Python 2.7.12; first set up the build environment:
- yum install zlib-devel -y
- yum install bzip2-devel -y
- yum install openssl-devel -y
- yum install ncurses-devel -y
- yum install sqlite-devel -y
- yum install readline-devel -y
Extract Python-2.7.12.tgz, then run configure, make, and make install:
- ./configure --prefix="/home/hadoop/Python" --enable-unicode=ucs4
- make && make install
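The yum devel packages installed above are what allow the Python build to include its optional C-extension modules; if one is missing at configure time, the corresponding module is silently left out of the build. A quick sanity check to run with the freshly built interpreter (~/Python/bin/python), as an illustrative sketch:

```python
# Check that the stdlib modules backed by the *-devel packages were built.
import importlib

required = ["zlib", "bz2", "ssl", "curses", "sqlite3", "readline"]
missing = []
for mod in required:
    try:
        importlib.import_module(mod)
    except ImportError:
        missing.append(mod)

# An empty list means the devel packages were all in place at build time.
print("missing modules:", missing)
```

If anything is listed as missing, install the corresponding devel package and rebuild Python before going further; pip and pydoop both depend on several of these.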
3) 将相关包安装到Python
按顺序先安装setuptools-23.0.0.tar.gz和pip-8.1.2.tar.gz,其它的包可以用pip install安装,如果版本装错了,用pip uninstall删除,查看用pip list。但是pydoop需要编译安装,目前这个包不支持python3。安装这边包需要设置jvm环境,否则会包找不到jni.h。
安装pydoop
- ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/jni.h ./jni.h
- ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/linux/jni_md.h ./jni_md.h
- tar -xvf pydoop-1.2.0.tar.gz
- cd pydoop-1.2.0
- python setup.py build
- python setup.py install
After the installation completes, package the Python directory as Python.zip and upload it to HDFS:
- cd ~/Python
- zip -r Python.zip *
- mv Python.zip ../
- hdfs dfs -put Python.zip /user/hadoop/
- hdfs dfs -ls /user/hadoop/
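Note that the archive must contain the bin/ and lib/ subdirectories of the Python prefix, since the executors unpack it and run Python/bin/python out of it; this is why zip needs the recursive flag. An equivalent sketch using Python's own zipfile module (the ~/Python path is the prefix chosen above):

```python
import os
import zipfile

def zip_tree(src_dir, zip_path):
    """Recursively archive src_dir, storing paths relative to it."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src_dir))
    return zip_path

# zip_tree(os.path.expanduser("~/Python"), os.path.expanduser("~/Python.zip"))
```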
4) Build tensorflow-hadoop-1.0-SNAPSHOT.jar
Download the source from https://github.com/tensorflow/ecosystem and build the package with Maven. The example files fail to compile, so skip the tests:
- mvn package -Dmaven.test.skip=true
Upload tensorflow-hadoop-1.0-SNAPSHOT.jar to HDFS:
- hdfs dfs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
- hdfs dfs -ls /user/hadoop
5) Install TensorFlowOnSpark
Download it from https://github.com/yahoo/TensorFlowOnSpark and extract it into /home/hadoop (the hadoop user's home directory).
Build tfspark.zip:
- cd TensorFlowOnSpark/src
- zip -r ../tfspark.zip *
At this point the installation is essentially complete.
4. Running the MNIST Test Program
1) Prepare the data
Download the MNIST dataset and copy it into the MLdata/mnist directory under /home/hadoop:
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
Create the zip file:
- cd MLdata/mnist
- zip -r mnist.zip *
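The four MNIST files use the simple big-endian idx binary layout: a 32-bit magic number, then the dimension sizes, then raw bytes. A minimal header parser for the images files (whose magic number is 0x00000803), as an illustrative sketch:

```python
import struct

def parse_idx3_header(data):
    """Parse the 16-byte header of an idx3 (images) file: magic, count, rows, cols."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 0x00000803, "not an idx3 images file"
    return count, rows, cols

# Synthetic header for 60000 images of 28x28, as in train-images-idx3-ubyte:
header = struct.pack(">IIII", 0x00000803, 60000, 28, 28)
print(parse_idx3_header(header))  # (60000, 28, 28)
```

The labels files use the two-dimension idx1 layout with magic 0x00000801; mnist_data_setup.py reads both formats when converting the data in the next step.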
2) Running in feed_dict mode; the steps are as follows:
- # step 1: set environment variables
- export PYTHON_ROOT=~/Python
- export LD_LIBRARY_PATH=${PATH}
- export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
- export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
- export PATH=${PYTHON_ROOT}/bin/:$PATH
- export QUEUE=default
-
- # step 2: upload files to HDFS
- hdfs dfs -rm /user/${USER}/mnist/mnist.zip
- hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip
-
- # step 3: convert the images and labels to CSV files
- hdfs dfs -rm -r /user/${USER}/mnist/csv
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 4 \
- --executor-memory 4G \
- --archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
- TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
- --output mnist/csv \
- --format csv
-
-
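In the csv format each example becomes one line of text: every 28x28 image is flattened into 784 comma-separated pixel values, with the labels written to a parallel output directory. The round trip can be sketched like this (hypothetical values, no external dependencies):

```python
# Sketch of the image <-> CSV-line round trip behind the csv format.
def image_to_csv_line(pixels):
    """Flatten one image (a list of pixel ints) into a CSV line."""
    return ",".join(str(p) for p in pixels)

def csv_line_to_image(line):
    """Recover the flat pixel list from a CSV line."""
    return [int(p) for p in line.split(",")]

image = [0] * 784          # a blank 28x28 image, flattened
image[28 * 14 + 14] = 255  # one white pixel near the middle
line = image_to_csv_line(image)
assert csv_line_to_image(line) == image
```

CSV is human-inspectable with hdfs dfs -cat, which makes it convenient for a first run; the TFRecords path in section 3) below is the more compact alternative.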
- # step 4: train
- hadoop fs -rm -r mnist_model
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 3 \
- --executor-memory 8G \
- --py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
- --conf spark.dynamicAllocation.enabled=false \
- --conf spark.yarn.maxAppAttempts=1 \
- --conf spark.yarn.executor.memoryOverhead=6144 \
- --archives hdfs:///user/${USER}/Python.zip#Python \
- --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
- ${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
- --images mnist/csv/train/images \
- --labels mnist/csv/train/labels \
- --mode train \
- --model mnist_model
-
-
- # step 5: inference
- hadoop fs -rm -r predictions
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 3 \
- --executor-memory 8G \
- --py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
- --conf spark.dynamicAllocation.enabled=false \
- --conf spark.yarn.maxAppAttempts=1 \
- --conf spark.yarn.executor.memoryOverhead=6144 \
- --archives hdfs:///user/${USER}/Python.zip#Python \
- --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
- ${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
- --images mnist/csv/test/images \
- --labels mnist/csv/test/labels \
- --mode inference \
- --model mnist_model \
- --output predictions
-
-
- # step 6: view the results (there may be multiple files)
- hdfs dfs -ls predictions
- hdfs dfs -cat predictions/part-00001
- hdfs dfs -cat predictions/part-00002
- hdfs dfs -cat predictions/part-00003
-
- # view the Spark job status in the web UI
- http://bdml-m01:8088/cluster/apps/
3) Running in QueueRunner mode; the steps are as follows:
- # step 1: set environment variables
- export PYTHON_ROOT=~/Python
- export LD_LIBRARY_PATH=${PATH}
- export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
- export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
- export PATH=${PYTHON_ROOT}/bin/:$PATH
- export QUEUE=default
-
- # step 2: upload files to HDFS
- hdfs dfs -rm /user/${USER}/mnist/mnist.zip
- hdfs dfs -rm -r /user/${USER}/mnist/tfr
- hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip
-
- # step 3: convert the images and labels to TFRecords
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 4 \
- --executor-memory 4G \
- --archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
- --jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
- ${HOME}/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
- --output mnist/tfr \
- --format tfr
-
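A TFRecords file is a flat sequence of length-delimited records: each record is a little-endian uint64 length, a 4-byte CRC of that length, the serialized payload, and a 4-byte CRC of the payload (the CRCs are masked crc32c values). A simplified reader that skips CRC verification, as an illustrative sketch only:

```python
import struct

def read_tfrecords(stream):
    """Yield raw record payloads from a TFRecords byte stream.

    Per-record layout: uint64 length (little-endian), 4-byte length CRC,
    `length` bytes of data, 4-byte data CRC. CRCs are not verified here.
    """
    while True:
        header = stream.read(12)   # length + length CRC
        if len(header) < 12:
            return
        (length,) = struct.unpack("<Q", header[:8])
        data = stream.read(length)
        stream.read(4)             # skip the data CRC
        yield data
```

In the real pipeline each payload is a serialized tf.train.Example protobuf; the tensorflow-hadoop jar supplied via --jars is what lets Spark write this format.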
- # step 4: train
- hadoop fs -rm -r mnist_model
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 4 \
- --executor-memory 4G \
- --py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
- --conf spark.dynamicAllocation.enabled=false \
- --conf spark.yarn.maxAppAttempts=1 \
- --conf spark.yarn.executor.memoryOverhead=4096 \
- --archives hdfs:///user/${USER}/Python.zip#Python \
- --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
- ${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
- --images mnist/tfr/train \
- --format tfr \
- --mode train \
- --model mnist_model
-
- # step 5: inference
- hadoop fs -rm -r predictions
- ${SPARK_HOME}/bin/spark-submit \
- --master yarn \
- --deploy-mode cluster \
- --queue ${QUEUE} \
- --num-executors 4 \
- --executor-memory 4G \
- --py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
- --conf spark.dynamicAllocation.enabled=false \
- --conf spark.yarn.maxAppAttempts=1 \
- --conf spark.yarn.executor.memoryOverhead=4096 \
- --archives hdfs:///user/${USER}/Python.zip#Python \
- --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
- ${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
- --images mnist/tfr/test \
- --mode inference \
- --model mnist_model \
- --output predictions
-
- # step 6: view the results (there may be multiple files)
- hdfs dfs -ls predictions
- hdfs dfs -cat predictions/part-00001
- hdfs dfs -cat predictions/part-00002
- hdfs dfs -cat predictions/part-00003
-
- # view the Spark job status in the web UI
- http://bdml-m01:8088/cluster/apps/
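The difference from the feed_dict run above is who drives the input: with feed_dict the Spark executors push partitions of data into the graph step by step, while in QueueRunner mode background threads read TFRecords from HDFS and keep an input queue filled that the training op dequeues from. Stripped of TensorFlow, the producer/consumer shape looks like this (plain Python threads, illustrative only):

```python
import queue
import threading

def producer(q, records):
    # Plays the role of a QueueRunner thread: keeps the input queue filled.
    for r in records:
        q.put(r)
    q.put(None)  # sentinel: no more data

def consume(records):
    q = queue.Queue(maxsize=4)
    t = threading.Thread(target=producer, args=(q, records))
    t.start()
    seen = []
    while True:
        item = q.get()  # analogous to the dequeue op in the training loop
        if item is None:
            break
        seen.append(item)
    t.join()
    return seen

assert consume([1, 2, 3]) == [1, 2, 3]
```

Because the readers run inside the graph rather than in the Spark task, QueueRunner mode keeps the training op busy while I/O happens in the background, at the cost of the extra TFRecords conversion step.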