HIP: Heterogeneous-computing Interface for Portability
In addition to providing a portable C++ programming environment for GPUs, HIP is designed to ease the porting of existing CUDA code into the HIP environment. This section describes the available tools and provides practical suggestions on how to port CUDA code and work through common issues.
The hipexamine-perl.sh tool will scan a source directory to determine which files contain CUDA code and how much of that code can be automatically hipified.
hipexamine-perl scans each code file (cpp, c, h, hpp, etc.) found in the specified directory. For each file that contains CUDA code, it prints a summary; for example, the summary for kmeans_cuda_kernel.cu reports how many CUDA references can be converted to HIP, a breakdown of the CUDA functionality used, warnings for code that looks like a CUDA API but was not converted, and a lines-of-code count.
hipexamine-perl also presents a summary at the end of the process for the statistics collected across all files. This has a similar format to the per-file reporting, and also includes a list of all kernels that have been called.
For each input file FILE, this script will:
This is useful for testing improvements to the hipify toolset.
The hipconvertinplace-perl.sh script will perform in-place conversion for all code files in the specified directory. This can be quite handy when dealing with an existing CUDA code base, since the script preserves the existing directory structure and filenames, so existing #include references continue to work. After converting in place, you can review the code before committing the changes.
CUDA Library | ROCm Library | Comment |
---|---|---|
cuBLAS | rocBLAS | Basic Linear Algebra Subroutines |
cuFFT | rocFFT | Fast Fourier Transform Library |
cuSPARSE | rocSPARSE | Sparse BLAS + SPMV |
cuSolver | rocSOLVER | LAPACK library |
AMGX | rocALUTION | Sparse iterative solvers and preconditioners with Geometric and Algebraic MultiGrid |
Thrust | rocThrust | C++ parallel algorithms library |
CUB | rocPRIM | Low-level optimized parallel primitives |
cuDNN | MIOpen | Deep learning solver library |
cuRAND | rocRAND | Random number generator library |
EIGEN | EIGEN – HIP port | C++ template library for linear algebra: matrices, vectors, numerical solvers |
NCCL | RCCL | Communications primitives library based on the MPI equivalents |
All HIP projects target either the AMD or the NVIDIA platform. The platform affects which headers are included and which libraries are used for linking.
- `HIP_PLATFORM_AMD` is defined if the HIP platform targets AMD. Note, `HIP_PLATFORM_HCC` was previously defined if the HIP platform targeted AMD; it is deprecated.
- `HIP_PLATFORM_NVIDIA` is defined if the HIP platform targets NVIDIA. Note, `HIP_PLATFORM_NVCC` was previously defined if the HIP platform targeted NVIDIA; it is deprecated.

Often, it's useful to know whether the underlying compiler is HIP-Clang or nvcc. This knowledge can guard platform-specific code or aid in platform-specific performance tuning.
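For instance, a minimal sketch of guarding platform-specific code with these defines (the block-size constants are illustrative only):

```cpp
#include <hip/hip_runtime.h>

#if defined(__HIP_PLATFORM_AMD__)
// Compiled only when the HIP platform targets AMD.
constexpr int kPreferredBlockSize = 256;
#elif defined(__HIP_PLATFORM_NVIDIA__)
// Compiled only when the HIP platform targets NVIDIA.
constexpr int kPreferredBlockSize = 128;
#endif
```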
The HIP-Clang compiler directly generates the host code (using the Clang x86 target) rather than passing it to another host compiler. Thus, it has no equivalent of the nvcc `__CUDACC__` define.
nvcc makes two passes over the code: one for host code and one for device code. HIP-Clang makes multiple passes over the code: one for the host code, and one for each target architecture for the device code. `__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or nvcc) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace #ifdef checks on the `__CUDA_ARCH__` define.

Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, and it doesn't represent the feature capability of the target device.
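A minimal sketch of using `__HIP_DEVICE_COMPILE__` to separate host-only and device-only paths in shared code (the helper function is illustrative):

```cpp
#include <hip/hip_runtime.h>

__host__ __device__ inline int defaultChunkSize() {
#if defined(__HIP_DEVICE_COMPILE__)
    // Device compilation pass (HIP-Clang or nvcc).
    return 64;
#else
    // Host compilation pass.
    return 1024;
#endif
}
```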
|Define | HIP-Clang | nvcc | Other (GCC, ICC, Clang, etc.) |
|---|---|---|---|
|HIP-related defines: | | | |
|`__HIP_PLATFORM_AMD__` | Defined | Undefined | Defined if targeting AMD platform; undefined otherwise |
|`__HIP_PLATFORM_NVIDIA__` | Undefined | Defined | Defined if targeting NVIDIA platform; undefined otherwise |
|`__HIP_DEVICE_COMPILE__` | 1 if compiling for device; undefined if compiling for host | 1 if compiling for device; undefined if compiling for host | Undefined |
|`__HIPCC__` | Defined | Defined | Undefined |
|`__HIP_ARCH_*` | 0 or 1 depending on feature support (see below) | 0 or 1 depending on feature support (see below) | 0 |
|nvcc-related defines: | | | |
|`__CUDACC__` | Undefined | Defined if source code is compiled by nvcc; undefined otherwise | Undefined |
|`__NVCC__` | Undefined | Defined | Undefined |
|`__CUDA_ARCH__` | Undefined | Unsigned representing compute capability (e.g., "130") if in device code; 0 if in host code | Undefined |
|hip-clang-related defines: | | | |
|`__HIP__` | Defined | Undefined | Undefined |
|HIP-Clang common defines: | | | |
|`__clang__` | Defined | Undefined | Defined if using Clang; otherwise undefined |
Some CUDA code tests `__CUDA_ARCH__` for a specific value to determine whether the machine supports a certain architectural feature. For instance:
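A representative sketch of this pattern (not taken from a particular project):

```cpp
// Double-precision floating point first appeared with compute capability 1.3,
// so CUDA code often guards double-precision paths like this:
#if (__CUDA_ARCH__ >= 130)
// ... code that requires compute capability 1.3 or later ...
#endif
```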
This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.
The `__HIP_ARCH_*` defines can replace comparisons of `__CUDA_ARCH__` values:
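For example, the compute-capability test above can be rewritten as a portable feature test (a sketch; the guarded code is omitted):

```cpp
// Before: tied to a specific CUDA compute capability.
// #if (__CUDA_ARCH__ >= 130)

// After: works on both AMD and NVIDIA devices.
#if __HIP_ARCH_HAS_DOUBLES__
// ... double-precision code path ...
#endif
```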
For host code, the `__HIP_ARCH_*` defines are set to 0. You should only use the HIP_ARCH fields in device code.
Host code should query the architecture feature flags in the device properties that hipGetDeviceProperties returns, rather than testing the "major" and "minor" fields directly:
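A minimal sketch of such a query, assuming the application cares about shared-memory 32-bit integer atomics (error checking omitted):

```cpp
#include <hip/hip_runtime.h>

bool deviceHasSharedInt32Atomics(int deviceId) {
    hipDeviceProp_t deviceProp;
    hipGetDeviceProperties(&deviceProp, deviceId);

    // Query the architecture feature flag rather than testing
    // deviceProp.major / deviceProp.minor directly.
    return deviceProp.arch.hasSharedInt32Atomics != 0;
}
```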
The table below shows the full set of architectural properties that HIP supports.
|Define (use only in device code) | Device Property (run-time query) | Comment |
|---|---|---|
|32-bit atomics: | | |
|`__HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__` | hasGlobalInt32Atomics | 32-bit integer atomics for global memory |
|`__HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__` | hasGlobalFloatAtomicExch | 32-bit float atomic exchange for global memory |
|`__HIP_ARCH_HAS_SHARED_INT32_ATOMICS__` | hasSharedInt32Atomics | 32-bit integer atomics for shared memory |
|`__HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__` | hasSharedFloatAtomicExch | 32-bit float atomic exchange for shared memory |
|`__HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__` | hasFloatAtomicAdd | 32-bit float atomic add in global and shared memory |
|64-bit atomics: | | |
|`__HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__` | hasGlobalInt64Atomics | 64-bit integer atomics for global memory |
|`__HIP_ARCH_HAS_SHARED_INT64_ATOMICS__` | hasSharedInt64Atomics | 64-bit integer atomics for shared memory |
|Doubles: | | |
|`__HIP_ARCH_HAS_DOUBLES__` | hasDoubles | Double-precision floating point |
|Warp cross-lane operations: | | |
|`__HIP_ARCH_HAS_WARP_VOTE__` | hasWarpVote | Warp vote instructions (any, all) |
|`__HIP_ARCH_HAS_WARP_BALLOT__` | hasWarpBallot | Warp ballot instructions |
|`__HIP_ARCH_HAS_WARP_SHUFFLE__` | hasWarpShuffle | Warp shuffle operations (shfl_*) |
|`__HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__` | hasFunnelShift | Funnel shift two input words into one |
|Sync: | | |
|`__HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__` | hasThreadFenceSystem | threadfence_system |
|`__HIP_ARCH_HAS_SYNC_THREAD_EXT__` | hasSyncThreadsExt | syncthreads_count, syncthreads_and, syncthreads_or |
|Miscellaneous: | | |
|`__HIP_ARCH_HAS_SURFACE_FUNCS__` | hasSurfaceFuncs | |
|`__HIP_ARCH_HAS_3DGRID__` | has3dGrid | Grids and groups are 3D |
|`__HIP_ARCH_HAS_DYNAMIC_PARALLEL__` | hasDynamicParallelism | |
Makefiles can conditionally provide a default HIP_PATH if one is not already set, for example by querying `hipconfig --path`.
HIP can depend on either rocclr or CUDA as its runtime.
hipLaunchKernelGGL is a variadic macro which accepts as parameters the kernel name and the launch configuration (grid dims, group dims, dynamic shared memory size, stream), followed by a variable number of kernel arguments. This sequence is then expanded into the appropriate kernel launch syntax depending on the platform. While this can be a convenient single-line kernel launch syntax, the macro implementation can cause issues when nested inside other macros. For example, consider the following:
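A sketch of the kind of wrapper that runs into trouble (the wrapper macro, the kernel `vAdd`, and its argument `Ad` are illustrative, not from the original):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vAdd(float* A) { A[threadIdx.x] += 1.0f; }

// A tracing wrapper that parenthesizes its argument. Because the launch macro
// can expand to more than a single expression, the extra ( ) around `command`
// can break compilation.
#define MY_LAUNCH(command, doTrace)                   \
    {                                                 \
        if (doTrace) printf("TRACE: %s\n", #command); \
        (command); /* nested ( ) may fail to compile */ \
    }

void launchIt(float* Ad)
{
    MY_LAUNCH(hipLaunchKernelGGL(vAdd, dim3(1), dim3(256), 0, 0, Ad), true);
}
```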
Avoid nesting macro parameters inside parentheses. Here's an alternative that will work:
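The same wrapper without the extra parentheses around the expanded launch (again with illustrative names):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vAdd(float* A) { A[threadIdx.x] += 1.0f; }

// The launch macro is no longer wrapped in parentheses when it is expanded.
#define MY_LAUNCH(command, doTrace)                   \
    {                                                 \
        if (doTrace) printf("TRACE: %s\n", #command); \
        command;                                      \
    }

void launchIt(float* Ad)
{
    MY_LAUNCH(hipLaunchKernelGGL(vAdd, dim3(1), dim3(256), 0, 0, Ad), true);
}
```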
hipcc is a portable compiler driver that will call nvcc or HIP-Clang (depending on the target system) and attach all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler. The hipconfig script may be helpful in identifying the target platform, compiler and runtime. It can also help set options appropriately.
Here are the main compiler options supported on AMD platforms by HIP-Clang.
Option | Description |
---|---|
--amdgpu-target=<gpu_arch> | [DEPRECATED] This option is being replaced by `--offload-arch=<target>`. Generate code for the given GPU target. Supported targets are gfx701, gfx801, gfx802, gfx803, gfx900, gfx906, gfx908, gfx1010, gfx1011, gfx1012, gfx1030, gfx1031. This option can appear multiple times on the same command line to generate a fat binary for multiple targets. |
--fgpu-rdc | Generate relocatable device code, which allows kernels or device functions to call device functions defined in different translation units. |
-ggdb | Equivalent to -g plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code. |
--gpu-max-threads-per-block=<num> | Generate code to support up to the specified number of threads per block. |
-O<n> | Specify the optimization level. |
--offload-arch=<target> | Specify the AMD GPU target ID. |
-save-temps | Save the compiler-generated intermediate files. |
-v | Show the compilation steps. |
hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (nvcc or the AMD compiler). We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also knows how to link and manage the GPU objects.
hipcc adds -lm by default to the link command.
CUDA code often uses nvcc for accelerator code (defining and launching kernels, typically defined in .cu or .cuh files). It also uses a standard compiler (g++) for the rest of the application. nvcc is a preprocessor that employs a standard host compiler (gcc) to generate the host code. Code compiled using this tool can employ only the intersection of language features supported by both nvcc and the host compiler. In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported—for example, recent nvcc versions lack Clang host-compiler capability.
HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same ABI as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
hipcc links to libstdc++ by default. This provides better compatibility between g++ and HIP.
If you pass "--stdlib=libc++" to hipcc, hipcc will use the libc++ library. Generally, libc++ provides a broader set of C++ features while libstdc++ is the standard for more compilers (notably including g++).
When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including std::string, std::vector and other containers) must use the same standard-library implementation. These interfaces include the following:

- Functions or kernels defined in HIP and called from a standard compiler
- Functions defined in a standard compiler and called from HIP
Applications with these interfaces should use the default libstdc++ linking.
Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to nvcc, may choose to use libc++.
The hip_runtime.h and hip_runtime_api.h files define the types, functions and enumerations needed to compile a HIP program:

- hip_runtime_api.h: defines all the HIP runtime APIs (for example, hipMalloc) and the types required to call them. A source file that only calls HIP APIs, but neither defines nor launches any kernels, can include hip_runtime_api.h.
- hip_runtime.h: includes hip_runtime_api.h and additionally provides the types and defines required to create and launch kernels.
CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer hip_runtime.h instead of hip_runtime_api.h.
You can compile hip_runtime_api.h using a standard C or C++ compiler (e.g., gcc or ICC). The HIP include paths and defines (`__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`) must be passed to the standard compiler; hipconfig can return the necessary options, for example via `hipconfig --cpp_config`.
You can capture the hipconfig output and pass it to the standard compiler, for example from a makefile.
nvcc includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included. Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers. If the compilation process reports that it cannot find necessary APIs (for example, "error: identifier hipSetDevice is undefined"), ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate). The hipify-perl script automatically converts "cuda_runtime.h" to "hip_runtime.h," and it converts "cuda_runtime_api.h" to "hip_runtime_api.h", but it may miss nested headers or macros.
The HIP-Clang path provides an empty cuda.h file. Some existing CUDA programs include this file but don't require any of the functions.
Many existing CUDA projects use the ".cu" and ".cuh" file extensions to indicate code that should be run through the nvcc compiler. For quick HIP ports, leaving these file extensions unchanged is often easier, as it minimizes the work required to change file names in the directory and #include statements in the files.
For new projects or ports which can be re-factored, we recommend the use of the extension ".hip.cpp" for source files, and ".hip.h" or ".hip.hpp" for header files. This indicates that the code is standard C++ code, but also provides a unique indication for make tools to run hipcc when appropriate.
Code should not assume a warp size of 32 or 64. See Warp Cross-Lane Functions for information on how to write portable wave-aware code.
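One portable approach is to read the warp size at run time rather than hard-coding it (a minimal sketch):

```cpp
#include <hip/hip_runtime.h>

int queryWarpSize(int deviceId) {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, deviceId);
    // 32 on current NVIDIA GPUs, 64 on most AMD GPUs; never assume either.
    return props.warpSize;
}
```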
Kernel code should use `__attribute__((amdgpu_flat_work_group_size(<min>, <max>)))`.
For example:
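A sketch of a kernel annotated to allow launches of up to 1024 threads per block (the kernel itself is illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void __attribute__((amdgpu_flat_work_group_size(1, 1024)))
scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
```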
HIP support for hipMemcpyToSymbol is complete. This feature allows device code to define a data symbol that can be accessed on the host side. The symbol can be in __constant__ or device space.
Note that the symbol name needs to be encased in the HIP_SYMBOL macro, as shown in the code example below. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and hipGetSymbolSize.
For example:
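A condensed sketch of the device-side symbol and the host-side copy (sizes and names are illustrative; error checking omitted):

```cpp
#include <hip/hip_runtime.h>

constexpr int LEN = 512;
constexpr size_t SIZE = LEN * sizeof(int);

// Device-side data symbol, reachable from the host through hipMemcpyToSymbol.
__constant__ int Value[LEN];

__global__ void Get(int* out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = Value[tid];
}

int main()
{
    int A[LEN], B[LEN];
    for (int i = 0; i < LEN; ++i) { A[i] = i; B[i] = 0; }

    int* Ad = nullptr;
    hipMalloc(reinterpret_cast<void**>(&Ad), SIZE);

    // The symbol name must be wrapped in the HIP_SYMBOL macro.
    hipMemcpyToSymbol(HIP_SYMBOL(Value), A, SIZE, 0, hipMemcpyHostToDevice);

    hipLaunchKernelGGL(Get, dim3(1), dim3(LEN), 0, 0, Ad);
    hipMemcpy(B, Ad, SIZE, hipMemcpyDeviceToHost);

    hipFree(Ad);
    return 0;
}
```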
To get a pointer's memory type in HIP/HIP-Clang, use the hipPointerGetAttributes API. The first parameter of the API is a hipPointerAttribute_t, whose 'memoryType' member indicates whether the input pointer was allocated on the device or on the host.
For example:
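A minimal sketch (note that recent HIP releases rename the 'memoryType' member to 'type'):

```cpp
#include <hip/hip_runtime.h>

void classifyPointers()
{
    double* devPtr = nullptr;
    hipMalloc(reinterpret_cast<void**>(&devPtr), sizeof(double));

    hipPointerAttribute_t attr;
    hipPointerGetAttributes(&attr, devPtr);
    // attr.memoryType is hipMemoryTypeDevice for this allocation.

    double* hostPtr = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&hostPtr), sizeof(double));
    hipPointerGetAttributes(&attr, hostPtr);
    // attr.memoryType is hipMemoryTypeHost for this allocation.

    hipHostFree(hostPtr);
    hipFree(devPtr);
}
```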
__threadfence_system() makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to the CPU and to other GPU devices. Some implementations can provide this behavior by flushing the GPU L2 cache. HIP/HIP-Clang does not provide this functionality. As a workaround, users can set the environment variable HSA_DISABLE_CACHE=1 to disable the GPU L2 cache. This will affect all memory accesses in all kernels, and so may have a performance impact.
Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the nvcc path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the __ldg instruction for this purpose.
AMD compilers currently load all data into both the L1 and L2 caches, so __ldg is treated as a no-op.
We recommend the following for functional portability:

- For programs that use textures only to benefit from improved caching, use the __ldg instruction.
- Programs that use texture objects and texture references should work well on HIP.
On an AMD platform, set the AMD_LOG_LEVEL environment variable to log HIP application execution information.
The value of the setting controls the logging level.
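The levels are commonly documented as the following enumeration (shown here as a sketch; higher levels include the output of lower levels):

```cpp
// AMD_LOG_LEVEL values
enum LogLevel {
    LOG_NONE    = 0,  // no logging
    LOG_ERROR   = 1,  // errors only
    LOG_WARNING = 2,  // warnings and errors
    LOG_INFO    = 3,  // informational messages and below
    LOG_DEBUG   = 4   // detailed debug output
};
```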
The logging mask (set via the AMD_LOG_MASK environment variable) selects which types of functionality are logged during the execution of a HIP application; it is a bit mask whose values can be combined.
To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to stderr the HIP-clang (or nvcc) commands that hipcc generates.
If you pass a ".cu" file, HIP-Clang will attempt to compile it as a CUDA language file. You must tell the compiler that it's in fact a C++ file: use the "-x c++" option.
See the utils/vim or utils/gedit directories to add handy highlighting to hip files.