Setting Up a GPU-Accelerated TensorFlow Environment on Ubuntu

Hardware and software environment

  • Ubuntu 16.10

  • GTX 750 Ti (an NVIDIA card is required; the newer the better, since newer cards have a higher Compute Capability)

  • NVIDIA CUDA 8.0

  • NVIDIA driver 375.26

  • gcc version 4.9

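1. Basic environment setup

Ubuntu was freshly installed on this machine, so I installed the basic Linux tools I use myself plus the Python scientific computing libraries. Take whatever you need.

Basic development tools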

  • Install vim: sudo apt-get install vim
  • Install zsh and make it the default shell:
sudo apt-get install zsh
chsh -s /usr/bin/zsh
  • Install git: sudo apt-get install git
  • Install oh-my-zsh: sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
  • Install autojump: sudo apt-get install autojump

Installing the Python scientific computing libraries

# Download the Anaconda installer first, then run it
bash Anaconda2-4.3.0-Linux-x86_64.sh
# Switch to the Tsinghua mirror to speed up conda
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes

  • Install OpenJDK: sudo apt-get install openjdk-8-jdk

  • Install PyCharm

2. Installing the NVIDIA environment

First, here are the NVIDIA requirements for GPU support, quoted from the TensorFlow website: If you are installing TensorFlow with GPU support using one of the mechanisms described in this guide, then the following NVIDIA software must be installed on your system:

  • CUDA® Toolkit 8.0. For details, see NVIDIA’s documentation. Ensure that you append the relevant Cuda pathnames to the LD_LIBRARY_PATH environment variable as described in the NVIDIA documentation.

  • The NVIDIA drivers associated with CUDA Toolkit 8.0.

  • cuDNN v5.1. For details, see NVIDIA’s documentation. Ensure that you create the CUDA_HOME environment variable as described in the NVIDIA documentation.

  • GPU card with CUDA Compute Capability 3.0 or higher. See NVIDIA documentation for a list of supported GPU cards.

  • The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface. This library provides advanced profiling support. To install this library, issue the following command:

$ sudo apt-get install libcupti-dev

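Apart from the libcupti-dev library at the end, which can be installed directly with apt-get, the bulk of the work is installing the CUDA® Toolkit and cuDNN. This is where the pitfalls begin.

Installing CUDA

Following the steps in NVIDIA's documentation:

  • Before installing, verify one by one that the system meets the requirements (Pre-installation Actions)
  • Download the CUDA Toolkit; on Ubuntu the deb (local) package is recommended because it is easier to install
  • Add the deb package to the package manager, then install it with apt-get
  • Run the post-installation verification

A few points need attention during the post-installation verification:

Configuring the CUDA environment variables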

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Note: this path must match your NVIDIA driver version
export LPATH=/usr/lib/nvidia-375:$LPATH
export LIBRARY_PATH=/usr/lib/nvidia-375:$LIBRARY_PATH
# Environment variable required by TensorFlow
export CUDA_HOME=/usr/local/cuda-8.0

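The most frustrating part here is the LIBRARY_PATH environment variable. The official documentation does not mention it at all, but without it, compiling the CUDA samples fails in the 3_Imaging group with this error:

/usr/bin/ld: cannot find -lnvcuvid
collect2: error: ld returned 1 exit status
Makefile:346: recipe for target 'cudaDecodeGL' failed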
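
Since a missing variable only shows up late, during the samples build, it can help to check the environment up front. Below is a minimal Python sketch of such a check. The function names check_cuda_env and path_entries are my own, and the expected paths are simply the ones from the export lines in this post, so adjust them if your CUDA or driver version differs.

```python
import os

# Expected locations, copied from the export lines in this post (assumptions
# about your machine; change them for other CUDA or driver versions).
CUDA_ROOT = "/usr/local/cuda-8.0"
DRIVER_LIB_DIR = "/usr/lib/nvidia-375"


def path_entries(value):
    """Split a colon-separated path variable, dropping empty entries and trailing slashes."""
    return [entry.rstrip("/") for entry in value.split(":") if entry]


def check_cuda_env(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    expected = [
        ("PATH", CUDA_ROOT + "/bin"),
        ("LD_LIBRARY_PATH", CUDA_ROOT + "/lib64"),
        ("LIBRARY_PATH", DRIVER_LIB_DIR),
    ]
    for name, needed in expected:
        if needed not in path_entries(env.get(name, "")):
            problems.append("%s does not contain %s" % (name, needed))
    if env.get("CUDA_HOME", "").rstrip("/") != CUDA_ROOT:
        problems.append("CUDA_HOME is not set to %s" % CUDA_ROOT)
    return problems


if __name__ == "__main__":
    # Check the environment of the current shell and list anything missing.
    for problem in check_cuda_env(os.environ):
        print(problem)
```

Run it from the same shell you will build the samples in; no output means all four variables look right.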

Switching to an older gcc compiler

The gcc that ships with Ubuntu 16.10 is version 6.2, which is too new for CUDA, so the build reports an error:

error -- unsupported GNU version! gcc versions later than 5 are not supported!

As the message shows, CUDA 8.0 does not support any gcc newer than version 5. A good fix I found online is to use Ubuntu's update-alternatives command to switch versions:

sudo apt-get install gcc-4.9 g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.9
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 30 --slave /usr/bin/g++ g++ /usr/bin/g++-6
sudo update-alternatives --config gcc

After running sudo update-alternatives --config gcc, you can see the priorities of the different gcc versions and pick the one to use.
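
To confirm which gcc is active after switching, you can compare its major version against the limit from the error message. Here is a small sketch of that comparison in Python. The helper names are hypothetical, and the limit of 5 is taken only from the error text quoted in this post, not from any official compatibility table.

```python
import re

# Highest gcc major version accepted, per the error message quoted in this post.
MAX_GCC_MAJOR = 5


def gcc_major(version_text):
    """Extract the major version number from text such as '4.9.4' or '6.2.0'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError("unrecognized gcc version: %r" % version_text)
    return int(match.group(1))


def gcc_ok_for_cuda8(version_text):
    """Return True when the given gcc version is not newer than major version 5."""
    return gcc_major(version_text) <= MAX_GCC_MAJOR


if __name__ == "__main__":
    # Example version strings; replace them with your own gcc version.
    for version in ("4.9.4", "6.2.0"):
        print("gcc %s ok: %s" % (version, gcc_ok_for_cuda8(version)))
```

You can feed it the output of gcc -dumpversion to check the compiler that is currently selected.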

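Compiling and testing the samples

Follow the Recommended Actions (http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post) steps to compile the CUDA samples. If "Finished building CUDA samples" appears, all of the samples compiled successfully. You can then run ./bin/x86_64/linux/release/nbody from the NVIDIA_CUDA-8.0_Samples directory, which shows the following: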

[Image: tensorflow_gpu_2017-02-22_01]

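Configuring cuDNN

Before downloading cuDNN you need to register as an NVIDIA developer. After that, extract the downloaded package and copy its contents into CUDA's library and header directories.

tar -xzvf cudnn-8.0-linux-x64-v5.1.tgz
# Extracting produces a directory named cuda
sudo cp cuda/lib64/* /usr/local/cuda/lib64
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/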
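
A quick way to confirm that the right header landed in place is to read the version out of cudnn.h. The sketch below assumes the header defines the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, which is how I understand cuDNN 5.x to be laid out; if your header differs, the function just returns None. The function name cudnn_version is my own.

```python
import os
import re


def cudnn_version(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h, or None if not found."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Look for a line of the form: #define CUDNN_MAJOR 5
        pattern = r"^\s*#\s*define\s+%s\s+(\d+)" % name
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Destination used by the cp command in this post.
    header = "/usr/local/cuda/include/cudnn.h"
    if os.path.exists(header):
        with open(header) as handle:
            print(cudnn_version(handle.read()))
    else:
        print("cudnn.h not found at " + header)
```

For this setup the major and minor parts should come out as 5 and 1, matching the cuDNN v5.1 requirement.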

3. Installing TensorFlow

There are several ways to install TensorFlow. Here I installed it directly with pip, on Python 2.7.

TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp27-none-linux_x86_64.whl
sudo pip install --upgrade "$TF_BINARY_URL"

Once everything is set up, start ipython and enter:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

If the output lists your GPU as a device, as below, the GPU-accelerated installation succeeded.

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 750 Ti
major: 5 minor: 0 memoryClockRate (GHz) 1.0845
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.53GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0)
Hello, TensorFlow!
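
The "major: 5 minor: 0" line in this log is the card's Compute Capability, which the requirements list says must be 3.0 or higher. As a rough sketch, the log can be checked with a few lines of Python. The parsing assumes the log layout shown above (TensorFlow 1.0), and the function names are hypothetical.

```python
import re

# "CUDA Compute Capability 3.0 or higher", from the requirements quoted in this post.
MIN_CAPABILITY = (3, 0)

# Excerpt of the startup log shown above.
SAMPLE_LOG = """\
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 750 Ti
major: 5 minor: 0 memoryClockRate (GHz) 1.0845
pciBusID 0000:01:00.0
"""


def parse_devices(log_text):
    """Return a list of (name, (major, minor)) pairs found in the log text."""
    names = re.findall(r"^name: (.+)$", log_text, re.MULTILINE)
    caps = re.findall(r"^major: (\d+) minor: (\d+)", log_text, re.MULTILINE)
    return [
        (name.strip(), (int(major), int(minor)))
        for name, (major, minor) in zip(names, caps)
    ]


def meets_requirement(capability):
    """Return True when a (major, minor) capability is at least MIN_CAPABILITY."""
    return capability >= MIN_CAPABILITY


if __name__ == "__main__":
    for name, capability in parse_devices(SAMPLE_LOG):
        print("%s: capability %d.%d, supported: %s"
              % (name, capability[0], capability[1], meets_requirement(capability)))
```

The GTX 750 Ti reports 5.0, so it clears the 3.0 minimum.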