2017-02-23

Setting Up a GPU-Accelerated TensorFlow Environment on Ubuntu

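Hardware and software environment

Ubuntu 16.10
GTX 750 Ti (an NVIDIA graphics card is required; the newer the better, since newer cards have a higher Compute Capability)
NVIDIA CUDA 8.0
NVIDIA driver 375.26
gcc version 4.9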

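1. Basic environment setup

Because Ubuntu was freshly installed on this machine, I installed the basic tools I use on Linux along with the Python scientific computing libraries. Take whatever you need.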

Basic development tools

Install vim: sudo apt-get install vim
Install zsh:

sudo apt-get install zsh
chsh -s /usr/bin/zsh

Install git: sudo apt-get install git
Install oh-my-zsh: sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
Install autojump: sudo apt-get install autojump

Installing the Python scientific computing libraries

Install Anaconda

# Download Anaconda, then run the installer
bash Anaconda2-4.3.0-Linux-x86_64.sh
# Switch to the Tsinghua mirror to speed up conda
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes

Install OpenJDK: sudo apt-get install openjdk-8-jdk
Install PyCharm

2. Installing the NVIDIA environment

First, here is the passage from the TensorFlow website on the NVIDIA requirements for GPU support: If you are installing TensorFlow with GPU support using one of the mechanisms described in this guide, then the following NVIDIA software must be installed on your system:

CUDA® Toolkit 8.0. For details, see NVIDIA’s documentation. Ensure that you append the relevant Cuda pathnames to the LD_LIBRARY_PATH environment variable as described in the NVIDIA documentation.
The NVIDIA drivers associated with CUDA Toolkit 8.0.
cuDNN v5.1. For details, see NVIDIA’s documentation. Ensure that you create the CUDA_HOME environment variable as described in the NVIDIA documentation.
GPU card with CUDA Compute Capability 3.0 or higher. See NVIDIA documentation for a list of supported GPU cards.
The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface. This library provides advanced profiling support. To install this library, issue the following command:

$ sudo apt-get install libcupti-dev

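Apart from the libcupti-dev library at the end, which can be installed directly with apt-get, the bulk of the work is installing the CUDA® Toolkit and cuDNN. This is where the pitfalls begin.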

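Installing CUDA

Follow the steps given in NVIDIA’s documentation:

Before installing, verify one by one that the system meets the requirements (Pre-installation Actions).
Download the CUDA Toolkit. On Ubuntu, the deb (local) package is recommended because it makes installation easier.
Add the deb package to the package manager, then install with apt-get.
Run the post-installation verification.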

A few points need attention during the post-installation verification:

CUDA environment variables

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Note: this path must match the installed NVIDIA driver version
export LPATH=/usr/lib/nvidia-375:$LPATH
export LIBRARY_PATH=/usr/lib/nvidia-375:$LIBRARY_PATH
# Environment variable required by TensorFlow
export CUDA_HOME=/usr/local/cuda-8.0

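The nastiest part here is the LIBRARY_PATH environment variable, which the official documentation does not mention at all. Without it, compiling the CUDA samples fails in the 3_Imaging samples with this error: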

/usr/bin/ld: cannot find -lnvcuvid
collect2: error: ld returned 1 exit status
Makefile:346: recipe for target 'cudaDecodeGL' failed
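
Since a missing variable only shows up later as a linker error, it can help to check the configuration before compiling anything. The following is a minimal sketch, assuming the paths from the exports above (CUDA 8.0 and the nvidia-375 driver directory). The helper name check_cuda_env and the EXPECTED table are hypothetical, not part of any NVIDIA tool.

```python
import os

# Variable names and paths are taken from the exports above; adjust them
# if your CUDA or driver version differs. check_cuda_env is a hypothetical helper.
EXPECTED = {
    "PATH": "/usr/local/cuda-8.0/bin",
    "LD_LIBRARY_PATH": "/usr/local/cuda-8.0/lib64",
    "LIBRARY_PATH": "/usr/lib/nvidia-375",
    "CUDA_HOME": "/usr/local/cuda-8.0",
}


def check_cuda_env(env):
    """Return the sorted names of variables that lack the expected entry."""
    missing = []
    for name, expected in EXPECTED.items():
        # Split on ':' and ignore trailing slashes so 'lib64/' matches 'lib64'.
        entries = [e.rstrip("/") for e in env.get(name, "").split(":") if e]
        if expected not in entries:
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    problems = check_cuda_env(os.environ)
    if problems:
        print("Not set as expected: " + ", ".join(problems))
    else:
        print("All CUDA-related variables contain the expected paths.")
```

Run it in the same shell that will build the samples, because the exports only affect shells that have loaded them.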

Switching to an older gcc

Ubuntu 16.10 ships with gcc 6.2, which is too new for CUDA, so compilation fails with:

error -- unsupported GNU version! gcc versions later than 5 are not supported!

As the message shows, CUDA 8.0 does not support gcc versions later than 5. A good solution found online is to switch versions with Ubuntu's update-alternatives command:

sudo apt-get install gcc-4.9 g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 30 --slave /usr/bin/g++ g++ /usr/bin/g++-6 
sudo update-alternatives --config gcc

After running sudo update-alternatives --config gcc, you will see the priority of each gcc version and can select the one to use.
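
To confirm that the switch took effect, the active compiler version can be compared against the limit quoted in the error message. This is a small sketch that relies on gcc -dumpversion, which prints only the version number. The function names and the constant are illustrative.

```python
import subprocess

# CUDA 8.0 rejects gcc versions later than 5 (see the error message above).
MAX_SUPPORTED_GCC_MAJOR = 5


def gcc_major(dumpversion_text):
    """Parse the major version from the output of `gcc -dumpversion`."""
    return int(dumpversion_text.strip().split(".")[0])


def is_supported(dumpversion_text):
    """True if the given gcc version is not later than the supported major."""
    return gcc_major(dumpversion_text) <= MAX_SUPPORTED_GCC_MAJOR


if __name__ == "__main__":
    try:
        out = subprocess.check_output(["gcc", "-dumpversion"]).decode()
    except (OSError, subprocess.CalledProcessError):
        print("gcc was not found on PATH")
    else:
        print("gcc %s usable with CUDA 8.0: %s" % (out.strip(), is_supported(out)))
```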

Compiling and testing the samples

Follow the Recommended Actions (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post) steps to compile the CUDA samples. If Finished building CUDA samples appears, every sample compiled successfully. From the NVIDIA_CUDA-8.0_Samples directory, run ./bin/x86_64/linux/release/nbody to see the following result:

[Screenshot: tensorflow_gpu_2017-02-22_01]

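Setting up cuDNN

Before downloading cuDNN you need to register as an NVIDIA developer. After that, extract the downloaded package and copy its contents into CUDA's library and header directories.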

tar -xzvf cudnn-8.0-linux-x64-v5.1.tgz
# Extracting the archive produces a cuda directory
sudo cp cuda/lib64/* /usr/local/cuda/lib64
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
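
One way to check that the header landed in the right place, and that it is the version TensorFlow expects, is to read the version macros out of cudnn.h. The sketch below assumes the header defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL, and that it was copied to the path used in the cp command above. The function name is hypothetical.

```python
import re


def cudnn_version(header_text):
    """Return (major, minor, patchlevel) parsed from the text of cudnn.h."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Look for a line such as "#define CUDNN_MAJOR 5".
        pattern = r"^\s*#define\s+%s\s+(\d+)" % macro
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            raise ValueError("macro %s not found" % macro)
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Path assumed from the cp command above.
    path = "/usr/local/cuda/include/cudnn.h"
    try:
        with open(path) as f:
            print("cuDNN version: %d.%d.%d" % cudnn_version(f.read()))
    except IOError:
        print("cannot read " + path)
```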

3. Installing TensorFlow

There are several ways to install TensorFlow. Here I used pip directly, with Python 2.7.

TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp27-none-linux_x86_64.whl
sudo pip install --upgrade $TF_BINARY_URL
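
The Python version matters because the wheel file name encodes which interpreter it targets: cp27 in the URL above means CPython 2.7. The sketch below splits the name according to the wheel naming convention (name, version, Python tag, ABI tag, platform tag). Both helper names are hypothetical, and the interpreter check is deliberately simplified: it only handles single tags of the form cpXY.

```python
import sys


def wheel_tags(url):
    """Split a wheel file name into its name, version, and compatibility tags."""
    filename = url.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    # Wheel names follow {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    # (an optional build tag, absent here, would add a sixth field).
    name, version, python_tag, abi_tag, platform_tag = filename[:-4].split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(python_tag, version_info=sys.version_info):
    """True if a tag like 'cp27' matches the given CPython major/minor version."""
    return python_tag == "cp%d%d" % (version_info[0], version_info[1])


if __name__ == "__main__":
    url = ("https://storage.googleapis.com/tensorflow/linux/gpu/"
           "tensorflow_gpu-1.0.0-cp27-none-linux_x86_64.whl")
    tags = wheel_tags(url)
    print(tags)
    print("matches this interpreter: %s" % matches_interpreter(tags["python"]))
```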

Once everything is in place, start ipython and enter:

import tensorflow as tf 
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

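If you see output like the following, GPU acceleration has been installed successfully. The line Creating TensorFlow device (/gpu:0) shows that TensorFlow is using the GTX 750 Ti.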

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 750 Ti
major: 5 minor: 0 memoryClockRate (GHz) 1.0845
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.53GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0)
Hello, TensorFlow!
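
Instead of reading the log, the visible devices can also be listed from Python. The sketch below assumes that the device_lib module under tensorflow.python.client is available in the installed version. It is an internal module rather than a documented public API, so treat this as an illustration. gpu_device_names is a hypothetical helper.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        # Internal TensorFlow module; its location may change between versions.
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this interpreter")
    else:
        names = gpu_device_names(device_lib.list_local_devices())
        print("GPU devices: %s" % (names or "none found"))
```

An empty result on a machine with a working driver usually means the CPU-only package was installed or one of the CUDA libraries could not be loaded.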
