

22.4.10 AMD GPU

ROCGDB provides support for systems that have heterogeneous agents associated with commercially available AMD GPU devices (see Heterogeneous Debugging) when the AMD ROCm platform is installed.

The following AMD GPU chips are supported:

ROCGDB supports the following source languages:

HIP

The HIP Programming Language is supported.

When compiling, the -ggdb option should be used to produce debugging information suitable for use by ROCGDB. The --amdgpu-target option is used to specify the AMD GPUs that the executable is required to support. For example, to compile a HIP program that can utilize “Vega 10” and “Vega 7nm” AMD GPU devices, with no optimization:

hipcc -O0 -ggdb --amdgpu-target=gfx900 --amdgpu-target=gfx906 \
        bit_extract.cpp -o bit_extract

The AMD ROCm compiler maps HIP source language device function work-items to the lanes of an AMD GPU wavefront, which are represented in ROCGDB as heterogeneous lanes.

Assembly Code

Assembly code kernels are supported.

Other Languages

Other languages, including OpenCL and Fortran, are currently supported as the minimal pseudo-language, provided they are compiled specifying at least the AMD GPU Code Object V3 and DWARF 4 formats. See Unsupported Languages.

AMD GPU heterogeneous agents are not listed by the ‘info agents’ command until the inferior has started executing the program.

The AMD GPU heterogeneous queue types reported by the ‘info agents’ command are:

HSA

An HSA AQL queue. The ‘(Single)’ suffix indicates it uses the single-producer protocol, the ‘(Multi)’ suffix the multi-producer protocol, and the ‘(Coop)’ suffix the multi-producer cooperative dispatch protocol.

PM4

An AMD PM4 queue.

DMA

A DMA queue.

XGMI

An XGMI queue.

AMD GPU supports the following address spaces for the ‘info dispatches’ command:

Shared

Per work-group storage.

Private

Per work-item storage.

The ‘info dispatches’ command uses the following BNF syntax for AMD GPU heterogeneous dispatch fences:

fence     ::== [ barrier ] [ separator ] [ acquire ] [ separator ] [ release ]
separator ::== "|"
barrier   ::== "B"
acquire   ::== "A" scope
release   ::== "R" scope
scope     ::== system | agent
system    ::== "s"
agent     ::== "a"

Where:

separator

The elements are separated by ‘|’.

barrier

If present, indicates that the next heterogeneous packet will not be initiated until the heterogeneous dispatch completes.

acquire

Indicates an acquire memory fence was performed before initiating the heterogeneous dispatch.

release

Indicates a release memory fence will be performed when the heterogeneous dispatch completes.

system

Indicates the memory fence is performed at the system memory scope.

agent

Indicates the memory fence is performed at the agent memory scope.
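As an illustration of the grammar above, a fence string can be split on ‘|’ and each element interpreted. This is only a sketch; the fence value below is hypothetical and not taken from actual ‘info dispatches’ output:

```shell
# Sketch: interpret a hypothetical dispatch fence string per the
# grammar above ('B' barrier, 'A'/'R' acquire/release, 's'/'a' scope).
fence='B|As|Ra'
IFS='|' read -ra parts <<<"$fence"
for p in "${parts[@]}"; do
  case $p in
    B)  echo 'barrier: next packet waits for dispatch completion' ;;
    As) echo 'acquire fence at system scope' ;;
    Aa) echo 'acquire fence at agent scope' ;;
    Rs) echo 'release fence at system scope' ;;
    Ra) echo 'release fence at agent scope' ;;
  esac
done
```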

An AMD GPU wavefront is represented in ROCGDB as a thread.

AMD GPU supports the following reggroup values for the ‘info registers reggroup’ command:

The number of scalar and vector registers is configured when a wavefront is created. Only allocated registers are displayed. Scalar registers are reported as 32-bit signed integer values. Vector registers are reported as a wavefront size vector of signed 32-bit values. The pc is reported as a function pointer value. The exec register is reported as a wavefront size-bit unsigned integer value. The vcc and xnack_mask pseudo registers are reported as a wavefront size-bit unsigned integer value. The flat_scratch pseudo register is reported as a 64-bit unsigned integer value.

The info sharedlibrary command will show the AMD GPU code objects together with the CPU code objects. For example:

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007fd120664ac0  0x00007fd120682790  Yes (*)     /lib64/ld-linux-x86-64.so.2
...
0x00007fd0125d8ec0  0x00007fd015f21630  Yes (*)     /opt/rocm-3.5.0/hip/lib/../../lib/libamd_comgr.so
0x00007fd11d74e870  0x00007fd11d75a868  Yes (*)     /lib/x86_64-linux-gnu/libtinfo.so.5
0x00007fd11d001000  0x00007fd11d00173c  Yes         file:///home/rocm/examples/bit_extract#offset=6477&size=10832
0x00007fd11d008000  0x00007fd11d00adc0  Yes (*)     memory://95557/mem#offset=0x7fd0083e7f60&size=41416
(*): Shared library is missing debugging information.
(gdb)

The code object path for AMD GPU code objects is shown as a URI (Uniform Resource Identifier) with a syntax defined by the following BNF grammar:

code_object_uri ::== file_uri | memory_uri
file_uri        ::== "file://" file_path [ range_specifier ]
memory_uri      ::== "memory://" process_id range_specifier
range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
file_path       ::== URI_ENCODED_OS_FILE_PATH
process_id      ::== DECIMAL_NUMBER
number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

Where:

number

A C integral literal where hexadecimal values are prefixed by ‘0x’ or ‘0X’, and octal values by ‘0’.

file_path

The file’s path specified as a URI encoded UTF-8 string. In URI encoding, every character that is not in the regular expression ‘[a-zA-Z0-9/_.~-]’ is encoded as two uppercase hexadecimal digits preceded by ‘%’. Directories in the path are separated by ‘/’.

offset

A 0-based byte offset to the start of the code object. For a file URI, it is from the start of the file specified by the file_path, and if omitted defaults to 0. For a memory URI, it is the memory address and is required.

size

The number of bytes in the code object. For a file URI, if omitted it defaults to the size of the file. It is required for a memory URI.

process_id

The identity of the process owning the memory. For Linux it is the C unsigned integral decimal literal for the process pid.
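For example, the components of a ‘file://’ code object URI (using the path from the ‘info sharedlibrary’ listing above) can be pulled apart with shell parameter expansion. This is only an illustrative sketch, not something ROCGDB itself provides:

```shell
# Sketch: split a file:// code object URI into path, offset, and size.
uri='file:///home/rocm/examples/bit_extract#offset=6477&size=10832'
path=${uri#file://}; path=${path%%#*}        # URI-encoded file path
spec=${uri#*#}                               # range specifier
offset=${spec#offset=}; offset=${offset%%&*}
size=${spec#*size=}
echo "$path $offset $size"
```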

AMD GPU code objects are loaded into each AMD GPU device separately. The info sharedlibrary command will therefore show the same code object loaded multiple times. As a consequence, setting a breakpoint in AMD GPU code will result in multiple breakpoints if there are multiple AMD GPU devices.

If the source language runtime defers loading code objects until kernels are launched, then setting breakpoints may result in pending breakpoints that will be set when the code object is finally loaded.

The AMD GPU heterogeneous entities have the following target identifier and convenience variable formats:

Agent Target ID

The AMD GPU agent target identifier agent_systag string has the following format:

AMDGPU Agent (GPUID target-agent-id)

It is used in the ‘Target ID’ column of the ‘info agents’ command and is available using the $_agent_systag convenience variable.

Queue Target ID

The AMD GPU queue target identifier queue_systag string has the following format:

AMDGPU Queue agent-id:queue-id (QID target-queue-id)

It is used in the ‘Target ID’ column of the ‘info queues’ command and is available using the $_queue_systag convenience variable.

Dispatch Target ID

The AMD GPU dispatch target identifier dispatch_systag string has the following format:

AMDGPU Dispatch agent-id:queue-id:dispatch-id (PKID target-packet-id)

It is used in the ‘Target ID’ column of the ‘info dispatches’ command and is available using the $_dispatch_systag convenience variable. The target-packet-id corresponds to the dispatch packet that initiated the dispatch.

Thread Target ID

The AMD GPU thread target identifier (systag) string has the following format:

AMDGPU Thread agent-id:queue-id:dispatch-id:wave-id (work-group-x,work-group-y,work-group-z)/work-group-thread-index

It is used in the ‘Target ID’ column of the ‘info threads’ command and is available using the $_thread_systag convenience variable.

Lane Target ID

The AMD GPU lane target identifier (lane_systag) string has the following format:

AMDGPU Lane agent-id:queue-id:dispatch-id:wave-id/lane-index (work-group-x,work-group-y,work-group-z)[work-item-x,work-item-y,work-item-z]

It is used in the ‘Target ID’ column of the ‘info lanes’ command and is available using the $_lane_systag convenience variable.

$_dispatch_pos

The string returned by the $_dispatch_pos debugger convenience variable has the following format:

(work-group-x,work-group-y,work-group-z)/work-group-thread-index
$_thread_workgroup_pos

The string returned by the $_thread_workgroup_pos debugger convenience variable has the following format:

work-group-thread-index
$_lane_workgroup_pos

The string returned by the $_lane_workgroup_pos debugger convenience variable has the following format:

[work-item-x,work-item-y,work-item-z]

Where:

agent-id
queue-id
dispatch-id
wave-id

The AMD GPU target agent identifier, queue identifier, dispatch identifier, and wave identifier respectively. The identifiers are global across all inferiors.

target-agent-id
target-queue-id

The AMD GPU target driver agent identifier and queue identifier, respectively. The identifiers are per process.

target-packet-id

The AMD GPU target driver packet identifier. The identifier is per queue.

work-group-x
work-group-y
work-group-z

The grid position of the thread’s work-group within the heterogeneous dispatch.

work-group-thread-index

The thread’s number within the heterogeneous work-group.

lane-index

The heterogeneous lane index within the thread.

work-item-x
work-item-y
work-item-z

The position of the heterogeneous lane’s work-item within the heterogeneous work-group.
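The target identifier formats above can be illustrated by splitting an example thread systag string into its fields. The identifier values below are hypothetical:

```shell
# Sketch: parse a hypothetical AMD GPU thread Target ID string into
# the fields described above.
systag='AMDGPU Thread 1:2:3:4 (0,1,2)/5'
re='^AMDGPU Thread ([0-9]+):([0-9]+):([0-9]+):([0-9]+) \(([0-9]+),([0-9]+),([0-9]+)\)/([0-9]+)$'
if [[ $systag =~ $re ]]; then
  echo "agent=${BASH_REMATCH[1]} queue=${BASH_REMATCH[2]}" \
       "dispatch=${BASH_REMATCH[3]} wave=${BASH_REMATCH[4]}"
  echo "work-group=(${BASH_REMATCH[5]},${BASH_REMATCH[6]},${BASH_REMATCH[7]})" \
       "work-group-thread-index=${BASH_REMATCH[8]}"
fi
```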

AMD GPU heterogeneous agents support the following address spaces:

global

the default global virtual address space

group

the per heterogeneous work-group shared address space (LDS (Local Data Store))

private

the per heterogeneous lane private address space (Scratch)

generic

the generic address space that can access the global, group, or private address spaces (Flat)

The set debug amdgpu log-level level command can be used to enable diagnostic messages for the AMD GPU target, where level can be:

off

no logging is enabled

error

fatal errors are reported

warning

fatal errors and warnings are reported

info

fatal errors, warnings, and info messages are reported

verbose

all messages are reported

The show debug amdgpu log-level command displays the current AMD GPU target log level.

For example, the following will enable information messages and send the log to a new file:

(gdb) set debug amdgpu log-level info
(gdb) set logging overwrite
(gdb) set logging file log.out
(gdb) set logging debugredirect on
(gdb) set logging on

If you want to print the log to both the console and a file, omit the set logging debugredirect command. See Logging Output.

ROCGDB AMD GPU support is currently a prototype and has the following restrictions. Future releases aim to address these restrictions.

  1. The debugger convenience variables, convenience functions, and commands described in Heterogeneous Debugging are not yet implemented unless noted below.

    The info agents, info queues, info dispatches, queue find, and dispatch find command line interface commands are supported. However, they have no Python bindings, and their prototype machine interface command support is undocumented and subject to change.

    The debugger convenience variable $_wave_id is available; it returns a string that has the format:

    (work-group-x,work-group-y,work-group-z)/work-group-thread-index
    

    Where:

    work-group-x
    work-group-y
    work-group-z

    The grid position of the thread’s work-group within the heterogeneous dispatch.

    work-group-thread-index

    The thread’s number within the heterogeneous work-group.

    The address space qualification of addresses described in Heterogeneous Debugging is not implemented. However, the default address space for AMD GPU threads is generic. This allows a generic address to be used to read or write in the global, group, or private address spaces. On the AMD ROCm platform, an AMD GPU generic address for a global address has the same value as the global address; for a group address it has the most significant 32-bits of the address set to 0x00010000; and for a private address it has the most significant 32-bits of the address set to 0x00020000. A generic private address only accesses lane 0 of the currently focused wavefront. A group address accesses the group segment memory shared by all wavefronts that are members of the same work-group as the currently focused wavefront.
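    The generic address encoding above can be checked with a quick shell arithmetic sketch; the group (LDS) byte offset used here is hypothetical:

```shell
# Sketch: form a generic address for a group (LDS) offset per the
# encoding above (upper 32 bits of 0x00010000 select the group segment).
group_tag=0x00010000
offset=0x100                      # hypothetical LDS byte offset
generic=$(( (group_tag << 32) | offset ))
printf '0x%016x\n' "$generic"     # 0x0001000000000100
```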

  2. The AMD ROCm compiler currently does not support generating valid AMD GPU DWARF information for symbolic variables and call frame information. As a consequence:

    The AMD ROCm compiler currently adds the -gline-tables-only, -mllvm -disable-dwarf-locations, and -mllvm -amdgpu-spill-cfi-saved-regs options for AMD GPU when the -ggdb option is specified. These ensure that source line information is generated, that invalid DWARF for source variables is not, and that registers not currently supported by CFI generation are saved so the CFI information is correct. If these options are not used, the invalid DWARF may cause ROCGDB to report that it is unable to read memory (such as when reading arguments in a backtrace).

  3. Only AMD GPU Code Object V3 and above is supported. This is the default for the AMD ROCm compiler. The following error will be reported for incompatible code objects:
    Error while mapping shared library sections:
    `file:///rocm/bit_extract#offset=6751&size=3136': ELF file ABI version (0) is not supported.
    
  4. DWARF 5 is not yet supported. There is no support for compressed or split DWARF.

    DWARF 4 is the default for the AMD ROCm compiler.

  5. No support yet for AMD GPU core dumps.
  6. When in all-stop mode, AMD GPU does not currently prevent new wavefronts from being created, and these wavefronts may report breakpoint hits. However, ROCGDB is configured by default not to remove breakpoints while stopped at the command line in all-stop mode. This prevents breakpoints from being missed by wavefronts created while stopped at the command line in all-stop mode. The set breakpoint always-inserted off command can be used to change the default so that breakpoints are removed while stopped at the command line in all-stop mode, but this may result in new wavefronts missing breakpoints.
  7. Resuming from a breakpoint when a large number of threads have hit a breakpoint can currently take up to 10 seconds on a fully occupied single AMD GPU device. The techniques described in Heterogeneous Debugging can be used to mitigate this. Once continued from the first breakpoint hit, the responsiveness of commands is normally better. Other techniques that can improve responsiveness are:
  8. Currently each AMD GPU device can only be in use by one process that is being debugged by ROCGDB. The Linux cgroups facility can be used to limit which AMD GPU devices are used by a process. In order for a ROCGDB process to access the AMD GPU devices of the process it is debugging, the AMD GPU devices must be included in the ROCGDB process cgroup.

    Therefore, multiple ROCGDB processes can each debug a process provided the cgroups specify disjoint sets of AMD GPU devices. However, a single ROCGDB process cannot debug multiple inferiors that use AMD GPU devices even if those inferiors have cgroups that specify disjoint AMD GPU devices. This is because the ROCGDB process must have all the AMD GPU devices in its cgroups and so will attempt to enable debugging for all AMD GPU devices for all inferiors it is debugging.

    It is suggested to use Docker rather than cgroups directly to limit the AMD GPU devices visible inside a container:

    1. ‘/dev/kfd’ must be mapped into the container.
    2. The ‘/dev/dri/renderD<render-minor-number>’ and ‘/dev/drm/card<node-number>’ files corresponding to each AMD GPU device that is to be visible must be mapped into the container. Note that non-AMD GPU cards may also be present.

      The render-minor-number for a device can be obtained by looking at the ‘drm_render_minor’ field value from:

      cat /sys/class/kfd/kfd/topology/nodes/<node-number>/properties
      
    3. Make sure the container user is a member of the render group for Ubuntu 20.04 onward and the video group for all other distributions.
    4. Specify the ‘--cap-add=SYS_PTRACE’ and ‘--security-opt seccomp=unconfined’ options.
    5. Install the AMD ROCm packages in the container. See https://github.com/RadeonOpenCompute/ROCm-docker.

    All processes running in the container will see the same subset of devices. By having two containers with non-overlapping sets of AMD GPUs, it is possible to use ROCGDB in both containers at the same time since each AMD GPU will only have one ROCGDB process accessing it.

    For example:

    docker run -it --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
        --device=/dev/kfd --device=/dev/drm/card0 --device=/dev/dri/renderD128 \
        --group-add render ubuntu:20.04 /bin/bash
    
  9. The until command does not work when multiple AMD GPUs are present as ROCGDB has limitations when there are multiple code objects that have the same breakpoint set. The workaround is to use ‘tbreak line; continue’.
  10. The HIP runtime currently performs deferred code object loading by default. AMD GPU code objects are not loaded until the first kernel is launched. Before then, all breakpoints have to be set as pending breakpoints.

    If source line positions are used that only correspond to source lines in unloaded code objects, then ROCGDB may not set pending breakpoints, and may instead set breakpoints at unpredictable places in the loaded code objects if they contain code from the same file. This can result in unexpected breakpoint hits being reported. When the code object containing the source lines is loaded, the incorrect breakpoints will be removed and replaced by the correct ones. This problem can be avoided by setting breakpoints in unloaded code objects using only symbol or function names.

    The HIP_ENABLE_DEFERRED_LOADING environment variable can be used to disable deferred code object loading by the HIP runtime. This ensures all code objects will be loaded when the inferior reaches the beginning of the main function.

    For example,

    export HIP_ENABLE_DEFERRED_LOADING=0
    

    Note: If deferred code object loading is disabled and the application performs a fork, then the program may crash.

  11. Memory violations are reported to the wavefronts that cause them. However, the program location at which they are reported may be after the source statement that caused them. The AMD ROCm runtime can currently cause the inferior to terminate before the memory violation is reported. This can be avoided by setting a breakpoint in abort and using the non-stop mode (see Non-Stop Mode). This will prevent the AMD ROCm runtime from terminating the inferior, while allowing ROCGDB to report the memory violation.
  12. ROCGDB may report errors if execution is continued after the AMD ROCm runtime aborts the application. For example:
    warning: unable to open /proc file '/proc/1234/status'
    

    This can happen due to memory violations in the AMD GPU code as described in the previous item. To prevent the errors, do not continue the application after the AMD ROCm runtime has invoked abort.

  13. ROCGDB supports watchpoints, but limits the capabilities to the lowest common denominator of the heterogeneous agents in the system. Hardware-supported watchpoints are used when possible; otherwise software emulation is used. Software emulation involves single-stepping and reading memory to determine if values have changed, and as a result performs substantially slower than hardware watchpoints.

    The AMD GPU supported architectures provide a maximum of 4 hardware write watchpoints. Precise read watchpoints or access watchpoints are not supported.

    The x86 architecture provides 4 hardware watchpoints that can each monitor up to 8 bytes.

    When ROCGDB is used with x86 and AMD GPU devices, hardware watchpoints are therefore limited to at most 4 write watchpoints that have a collective size of up to 32 bytes. The collective size is calculated by adding the size of each watchpoint rounded up to a multiple of 8 bytes. Software emulation will be used for watchpoints that exceed the hardware limitations.
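    The collective-size rule above can be checked with a quick calculation; the watchpoint sizes below are hypothetical:

```shell
# Sketch: collective size of hypothetical watchpoints, with each
# watchpoint size rounded up to a multiple of 8 bytes as described above.
sizes=(1 4 8 13)
total=0
for s in "${sizes[@]}"; do
  total=$(( total + (s + 7) / 8 * 8 ))
done
echo "$total"    # 8+8+8+16 = 40 bytes > 32, so software emulation is needed
```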

    Currently, watchpoints are only created on the CPU, and not the AMD GPU, until the AMD ROCm runtime is initialized. With deferred code object loading disabled this does not happen until the inferior reaches the beginning of the main function. With deferred code object loading enabled this does not happen until the first kernel is executed. This also means that, when the inferior is re-run, watchpoints are only re-activated on the CPU, not on the AMD GPU.

  14. When single stepping there can be times when ROCGDB appears to wait indefinitely for the single step to complete. If this happens, ‘Ctrl-C’ can be used to cancel the single step command so it can be tried again.
  15. The HIP runtime currently loads code objects from memory, including when loading modules from a file, which results in code object URIs being reported as ‘memory://’.

    The HSA_LOADER_ENABLE_MMAP_URI environment variable can be used to request that the AMD ROCm runtime attempt to determine the file containing the code object memory so that ‘file://’ URIs can be reported.

    For example,

    export HSA_LOADER_ENABLE_MMAP_URI=1
    
  16. AMD GPU does not currently support calling inferior functions.
  17. ROCGDB does not support following a forked process.
  18. The gdbserver is not supported.
  19. No language-specific support for Fortran or OpenCL. No OpenMP language extension support for C, C++, or Fortran.
  20. Does not support the AMD ROCm HCC compiler or runtime available as part of releases before ROCm 3.5.
  21. AMD GPU does not currently support the compiler address, memory, or thread sanitizers.
  22. ROCGDB support for AMD GPU is not currently available under virtualization.
  23. Performing an instruction single step when an AMD GPU wavefront is positioned on an S_ENDPGM instruction may cause the AMD GPU hardware to hang.
