https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN
Getting Started with TensorFlowOnSpark on a Hadoop Cluster
Edited by leewyang 20 days ago · 7 revisions
Before you begin, you should be familiar with TensorFlow and have access to a Hadoop grid with Spark installed. If your grid has GPU nodes, they must have CUDA installed locally.
Install Python 2.7
From the grid gateway, download and install Python into a local folder. This Python installation will be distributed to the Spark executors, so that any custom dependencies, including TensorFlow, are available to the executors.
# download and extract Python 2.7
export PYTHON_ROOT=~/Python
curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvf Python-2.7.12.tgz
rm Python-2.7.12.tgz
# compile into local PYTHON_ROOT
pushd Python-2.7.12
./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
make
make install
popd
rm -rf Python-2.7.12
# install pip
pushd "${PYTHON_ROOT}"
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py
# install pydoop (and any other custom dependencies); TensorFlow itself is built and installed in the next step
${PYTHON_ROOT}/bin/pip install pydoop
# Note: add any extra dependencies here
popd
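Before packaging this interpreter, it is worth a quick sanity check that it starts and that its `ssl` module imports (pip cannot reach PyPI without it). A hedged sketch; the fallback to the system `python3` is illustrative only, so the snippet also runs on a machine without the local build:

```shell
# sanity check before packaging: the interpreter should start, and its ssl
# module should import (pip depends on it to download packages).
# PYTHON_ROOT is the build prefix from the steps above; the python3 fallback
# below is purely illustrative so this snippet runs anywhere.
PYBIN="${PYTHON_ROOT}/bin/python"
[ -x "$PYBIN" ] || PYBIN="$(command -v python3)"   # illustrative fallback
"$PYBIN" -c 'import ssl, zlib; print("stdlib ok")'
```

If the `ssl` import fails, the `./configure` step above likely did not find the OpenSSL headers, and pip installs will fail later.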
Install and compile TensorFlow w/ RDMA support
git clone git@github.com:yahoo/tensorflow.git
# follow build instructions to install into ${PYTHON_ROOT}
Install and compile the Hadoop InputFormat/OutputFormat for TFRecords
git clone https://github.com/tensorflow/ecosystem.git
# follow build instructions to generate tensorflow-hadoop-1.0-SNAPSHOT.jar
# copy jar to HDFS for easier reference
hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
Create a Python w/ TensorFlow zip package for Spark
pushd "${PYTHON_ROOT}"
zip -r Python.zip *
popd
# copy this Python distribution into HDFS
hadoop fs -put ${PYTHON_ROOT}/Python.zip
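The `#Python` suffix on the `--archives` option in the later spark-submit commands is what makes this zip usable: YARN localizes the archive into each executor's container working directory and unpacks it under the alias after `#`. A minimal local simulation of that layout (temp-dir paths and the stub interpreter are illustrative only):

```shell
# what --archives hdfs:///user/${USER}/Python.zip#Python does per executor:
# YARN unpacks the localized zip under the alias after '#', so the relative
# path Python/bin/python (the PYSPARK_PYTHON setting used below) resolves
# inside every container working directory.
# simulated here with a temp dir and a stub interpreter.
container=$(mktemp -d)              # stand-in for a YARN container dir
cd "$container"
mkdir -p Python/bin                 # '#Python' alias -> root of the unpacked zip
printf '#!/bin/sh\n' > Python/bin/python
chmod +x Python/bin/python
test -x Python/bin/python && echo "Python/bin/python resolves"
```

This is why `SPARK_YARN_USER_ENV` below sets `PYSPARK_PYTHON=Python/bin/python` as a relative path rather than an absolute one.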
Install TensorFlowOnSpark
Next, clone this repo and build a zip package for Spark:
git clone git@github.com:yahoo/TensorFlowOnSpark.git
pushd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
popd
Run the MNIST example
Download/zip the MNIST dataset
mkdir ${HOME}/mnist
pushd ${HOME}/mnist >/dev/null
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
popd >/dev/null
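Interrupted downloads are a common failure at this step; `gzip -t` verifies each file is a complete gzip stream without unpacking it. A sketch (the dummy `.gz` exists only so this runs offline; on the gateway, run the loop in `${HOME}/mnist` against the four real files):

```shell
# hedged check: each MNIST download should be a valid, complete gzip stream.
# gzip -t tests integrity without extracting. A dummy file is created here
# so the snippet runs without network access.
work=$(mktemp -d)
cd "$work"
printf 'dummy idx bytes' | gzip > train-images-idx3-ubyte.gz   # stand-in file
for f in *.gz; do
  gzip -t "$f" && echo "$f OK"
done
```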
Convert the MNIST zip files into HDFS files
# set environment variables (if not already done)
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=gpu
# for CPU mode:
# export QUEUE=default
# remove --conf spark.executorEnv.LD_LIBRARY_PATH
# remove --driver-library-path
# save images and labels as CSV files
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 4G --archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/mnist_data_setup.py --output mnist/csv --format csv
# save images and labels as TFRecords
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 4G --archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist --jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/mnist_data_setup.py --output mnist/tfr --format tfr
Run distributed MNIST training (using feed_dict)
# for CPU mode:
# export QUEUE=default
# set --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server"
# remove --driver-library-path
# for CDH (per @wangyum):
# set --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server"
# hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 27G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images mnist/csv/train/images --labels mnist/csv/train/labels --mode train --model mnist_model
# to use infiniband, add --rdma
Run distributed MNIST inference (using feed_dict)
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 27G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images mnist/csv/test/images --labels mnist/csv/test/labels --mode inference --model mnist_model --output predictions
Run distributed MNIST training (using QueueRunners)
# for CPU mode:
# export QUEUE=default
# set --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server"
# remove --driver-library-path
# for CDH (per @wangyum):
# set --conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server"
# hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 27G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/train --format tfr --mode train --model mnist_model
# to use infiniband, add --rdma
Run distributed MNIST inference (using QueueRunners)
# hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 27G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" --driver-library-path="/usr/local/cuda-7.5/lib64" TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/test --mode inference --model mnist_model --output predictions