Previous: S12Z, Up: Architectures [Contents][Index]
ROCGDB provides support for systems that have heterogeneous agents associated with commercially available AMD GPU devices (see Heterogeneous Debugging) when the AMD ROCm platform is installed.
The following AMD GPU chips are supported:
ROCGDB name        Compiler name
‘vega10’           ‘gfx900’
‘vega20’           ‘gfx906’
‘arcturus’         ‘gfx908’
‘aldebaran’        ‘gfx90a’
‘navi10’           ‘gfx1010’
‘navi12’           ‘gfx1011’
‘navi14’           ‘gfx1012’
‘sienna_cichlid’   ‘gfx1030’
‘navy_flounder’    ‘gfx1031’
ROCGDB supports the following source languages:
The HIP Programming Language is supported.
When compiling, the -ggdb option should be used to produce debugging information suitable for use by ROCGDB. The --offload-arch option is used to specify the AMD GPU chips that the executable is required to support. For example, to compile a HIP program that can utilize “Vega 10” and “Vega 7nm” AMD GPU devices, with no optimization:
hipcc -O0 -ggdb --offload-arch=gfx900 --offload-arch=gfx906 \
    bit_extract.cpp -o bit_extract
The AMD ROCm compiler maps HIP source language device function work-items to the lanes of an AMD GPU wavefront, which are represented in ROCGDB as heterogeneous lanes.
Assembly code kernels are supported.
Other languages, including OpenCL and Fortran, are currently supported as the minimal pseudo-language, provided they are compiled specifying at least the AMD GPU Code Object V3 and DWARF 4 formats. See Unsupported Languages.
AMD GPU heterogeneous agents are not listed by the ‘info agents’ command until the inferior has started executing the program.
The AMD GPU heterogeneous queue types reported by the ‘info agents’ command are:
An HSA AQL queue. The ‘(Single)’ suffix indicates it uses the single-producer protocol, ‘(Multi)’ suffix indicates the multi-producer protocol, and ‘(Coop)’ suffix indicates the multi-producer cooperative dispatch protocol.
An AMD PM4 queue.
A DMA queue.
An XGMI queue.
AMD GPU supports the following address spaces for the ‘info dispatches’ command:
Per work-group storage.
Per work-item storage.
The ‘info dispatches’ command uses the following BNF syntax for AMD GPU heterogeneous dispatch fences:
fence           ::== [ barrier ] [ separator ] [ acquire ]
                     [ separator ] [ release ]
separator       ::== "|"
barrier         ::== "B"
acquire         ::== "A" scope
release         ::== "R" scope
scope           ::== system | agent
system          ::== "s"
agent           ::== "a"
Where:
separator
The elements are separated by ‘|’.
barrier
If present, indicates that the next heterogeneous packet will not be initiated until the heterogeneous dispatch completes.
acquire
Indicates an acquire memory fence was performed before initiating the heterogeneous dispatch.
release
Indicates a release memory fence will be performed when the heterogeneous dispatch completes.
system
Indicates the memory fence is performed at the system memory scope.
agent
Indicates the memory fence is performed at the agent memory scope.
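As an illustration of how the fence grammar composes, the following Python sketch (a hypothetical helper, not part of ROCGDB) transcribes the BNF into a regular expression and checks a fence string against it:

```python
import re

# Direct transcription of the dispatch fence BNF:
#   fence ::== [ barrier ] [ separator ] [ acquire ] [ separator ] [ release ]
# barrier is "B", acquire is "A" plus a scope, release is "R" plus a scope,
# and scope is "s" (system) or "a" (agent).  Every element is optional.
FENCE_RE = re.compile(r"^(B)?(\|)?(A[sa])?(\|)?(R[sa])?$")

def is_valid_fence(fence: str) -> bool:
    """Return True if `fence` matches the dispatch fence grammar."""
    return FENCE_RE.match(fence) is not None
```

For example, ‘B|As|Rs’ describes a dispatch with a barrier, a system-scope acquire fence, and a system-scope release fence.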
An AMD GPU wavefront is represented in ROCGDB as a thread.
AMD GPU supports the following reggroup values for the ‘info registers reggroup …’ command:
The number of scalar and vector registers is configured when a wavefront is created. Only allocated registers are displayed. Scalar registers are reported as 32-bit signed integer values. Vector registers are reported as a wavefront size vector of signed 32-bit values. The pc is reported as a function pointer value. The exec register is reported as a wavefront size-bit unsigned integer value. The vcc and xnack_mask pseudo registers are reported as wavefront size-bit unsigned integer values. The flat_scratch pseudo register is reported as a 64-bit unsigned integer value.
The info sharedlibrary command will show the AMD GPU code objects together with the CPU code objects. For example:
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007fd120664ac0  0x00007fd120682790  Yes (*)     /lib64/ld-linux-x86-64.so.2
...
0x00007fd0125d8ec0  0x00007fd015f21630  Yes (*)     /opt/rocm-3.5.0/hip/lib/../../lib/libamd_comgr.so
0x00007fd11d74e870  0x00007fd11d75a868  Yes (*)     /lib/x86_64-linux-gnu/libtinfo.so.5
0x00007fd11d001000  0x00007fd11d00173c  Yes         file:///home/rocm/examples/bit_extract#offset=6477&size=10832
0x00007fd11d008000  0x00007fd11d00adc0  Yes (*)     memory://95557/mem#offset=0x7fd0083e7f60&size=41416
(*): Shared library is missing debugging information.
(gdb)
The code object path for AMD GPU code objects is shown as a URI (Uniform Resource Identifier) with a syntax defined by the following BNF:
code_object_uri ::== file_uri | memory_uri
file_uri        ::== "file://" file_path [ range_specifier ]
memory_uri      ::== "memory://" process_id range_specifier
range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
file_path       ::== URI_ENCODED_OS_FILE_PATH
process_id      ::== DECIMAL_NUMBER
number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
Where:
number
A C integral literal where hexadecimal values are prefixed by ‘0x’ or ‘0X’, and octal values by ‘0’.
file_path
The file’s path specified as a URI encoded UTF-8 string. In URI encoding, every character that is not in the regular expression ‘[a-zA-Z0-9/_.~-]’ is encoded as two uppercase hexadecimal digits preceded by ‘%’. Directories in the path are separated by ‘/’.
offset
A 0-based byte offset to the start of the code object. For a file URI, it is from the start of the file specified by the file_path, and if omitted defaults to 0. For a memory URI, it is the memory address and is required.
size
The number of bytes in the code object. For a file URI, if omitted it defaults to the size of the file. It is required for a memory URI.
process_id
The identity of the process owning the memory. For Linux it is the C unsigned integral decimal literal for the process pid.
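The grammar above can be exercised with a short Python sketch. This is an illustrative helper under the stated BNF, not a ROCGDB or ROCm API; the tuple layout returned is an assumption of this example:

```python
from urllib.parse import unquote

def parse_code_object_uri(uri: str):
    """Split a code object URI into (scheme, path_or_pid, offset, size).

    Per the BNF: a file URI's offset defaults to 0 and its size to None
    (meaning the file size); a memory URI requires both.  Numbers use C
    literal syntax ("0x.." hex, leading "0" octal, otherwise decimal),
    which int(x, 0) handles directly.
    """
    if uri.startswith("file://"):
        scheme, rest = "file", uri[len("file://"):]
    elif uri.startswith("memory://"):
        scheme, rest = "memory", uri[len("memory://"):]
    else:
        raise ValueError("unsupported code object URI: " + uri)

    # The range specifier is introduced by "#" or "?".
    for sep in "#?":
        if sep in rest:
            rest, range_spec = rest.split(sep, 1)
            break
    else:
        range_spec = None

    offset, size = 0, None
    if range_spec is not None:
        fields = dict(f.split("=", 1) for f in range_spec.split("&"))
        offset = int(fields["offset"], 0)
        size = int(fields["size"], 0)
    elif scheme == "memory":
        raise ValueError("memory URI requires offset and size")
    path = unquote(rest) if scheme == "file" else rest
    return scheme, path, offset, size
```

Applied to the file URI from the info sharedlibrary example above, this yields the path ‘/home/rocm/examples/bit_extract’ with offset 6477 and size 10832.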
AMD GPU code objects are loaded into each AMD GPU device separately.
The info sharedlibrary command will therefore show the same code object loaded multiple times. As a consequence, setting a breakpoint in AMD GPU code will result in multiple breakpoints if there are multiple AMD GPU devices.
If the source language runtime defers loading code objects until kernels are launched, then setting breakpoints may result in pending breakpoints that will be set when the code object is finally loaded.
The AMD GPU heterogeneous entities have the following target identifier and convenience variable formats:
The AMD GPU agent target identifier agent_systag string has the following format:
AMDGPU Agent (GPUID target-agent-id)
It is used in the ‘Target ID’ column of the ‘info agents’ command and is available using the $_agent_systag convenience variable.
The AMD GPU queue target identifier queue_systag string has the following format:
AMDGPU Queue agent-id:queue-id (QID target-queue-id)
It is used in the ‘Target ID’ column of the ‘info queues’ command and is available using the $_queue_systag convenience variable.
The AMD GPU dispatch target identifier dispatch_systag string has the following format:
AMDGPU Dispatch agent-id:queue-id:dispatch-id (PKID target-packet-id)
It is used in the ‘Target ID’ column of the ‘info dispatches’ command and is available using the $_dispatch_systag convenience variable. The target-packet-id corresponds to the dispatch packet that initiated the dispatch.
The AMD GPU thread target identifier (systag) string has the following format:
AMDGPU Thread agent-id:queue-id:dispatch-id:wave-id (work-group-x,work-group-y,work-group-z)/work-group-thread-index
It is used in the ‘Target ID’ column of the ‘info threads’ command and is available using the $_thread_systag convenience variable.
The AMD GPU lane target identifier (lane_systag) string has the following format:
AMDGPU Lane agent-id:queue-id:dispatch-id:wave-id/lane-index (work-group-x,work-group-y,work-group-z)[work-item-x,work-item-y,work-item-z]
It is used in the ‘Target ID’ column of the ‘info lanes’ command and is available using the $_lane_systag convenience variable.
$_dispatch_pos
The string returned by the $_dispatch_pos debugger convenience variable has the following format:

(work-group-x,work-group-y,work-group-z)/work-group-thread-index

$_thread_workgroup_pos
The string returned by the $_thread_workgroup_pos debugger convenience variable has the following format:

work-group-thread-index

$_lane_workgroup_pos
The string returned by the $_lane_workgroup_pos debugger convenience variable has the following format:

[work-item-x,work-item-y,work-item-z]
Where:
agent-id, queue-id, dispatch-id, wave-id
The AMD GPU target agent identifier, queue identifier, dispatch identifier, and wave identifier respectively. The identifiers are global across all inferiors.
target-agent-id, target-queue-id
The AMD GPU target driver agent identifier and queue identifier respectively. The identifiers are per process.
target-packet-id
The AMD GPU target driver packet identifier. The identifier is per queue.
work-group-x, work-group-y, work-group-z
The grid position of the thread’s work-group within the heterogeneous dispatch.
work-group-thread-index
The thread’s number within the heterogeneous work-group.
lane-index
The heterogeneous lane index within the thread.
work-item-x, work-item-y, work-item-z
The position of the heterogeneous lane’s work-item within the heterogeneous work-group.
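To make the lane target identifier format concrete, the following Python sketch (a hypothetical helper written against the format string above, not a ROCGDB API; the short field names are this example's own) splits a lane_systag into its numeric fields:

```python
import re

# Regex transcription of:
#   AMDGPU Lane agent-id:queue-id:dispatch-id:wave-id/lane-index
#   (work-group-x,work-group-y,work-group-z)[work-item-x,work-item-y,work-item-z]
LANE_SYSTAG_RE = re.compile(
    r"^AMDGPU Lane (?P<agent>\d+):(?P<queue>\d+):(?P<dispatch>\d+):"
    r"(?P<wave>\d+)/(?P<lane>\d+) "
    r"\((?P<wgx>\d+),(?P<wgy>\d+),(?P<wgz>\d+)\)"
    r"\[(?P<wix>\d+),(?P<wiy>\d+),(?P<wiz>\d+)\]$"
)

def parse_lane_systag(systag: str) -> dict:
    """Return the lane target identifier fields as a dict of ints."""
    m = LANE_SYSTAG_RE.match(systag)
    if m is None:
        raise ValueError("not a lane target identifier: " + systag)
    return {name: int(value) for name, value in m.groupdict().items()}
```

For example, ‘AMDGPU Lane 1:2:3:4/5 (0,0,1)[31,0,0]’ identifies lane 5 of wave 4, in the work-group at grid position (0,0,1), executing work-item (31,0,0).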
AMD GPU heterogeneous agents support the following address spaces:
global
the default global virtual address space
group
the per heterogeneous work-group shared address space (LDS (Local Data Store))
private
the per heterogeneous lane private address space (Scratch)
generic
the generic address space that can access the global, group, or private address spaces (Flat)
A wavefront can report memory violation and address watch access events. However, the program location at which they are reported may be after the machine instruction that caused them. This can result in the reported source statement being incorrect. The following commands can be used to control this behavior:
set amdgpu precise-memory mode
set amdgpu precise-memory controls how the AMD GPU detects memory violations and address watch events, where mode can be:
off
The program location may not be immediately after the instruction that caused the memory violation or address watch event. This is the default.
on
Requests that the program location will be immediately after the instruction that caused a memory violation or address watch event. Enabling this mode may make the AMD GPU execution significantly slower as it has to wait for each memory operation to complete before executing the next instruction.
For example:
(gdb) set amdgpu precise-memory off
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is off
(gdb)
If a memory violation or address watch access event is reported for an AMD GPU thread that supports controlling precise memory detection when the mode is ‘off’, then the message includes an indication that the position may not be accurate. For example:
(gdb) run
Thread 6 "bit_extract" received signal SIGSEGV, Segmentation fault.
0x00007ffee6a0a028 in bit_extract_kernel (C_d=<optimized out>,
    A_d=<optimized out>, N=<optimized out>) at bit_extract.cpp:38
38        size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
('set amdgpu precise-memory on' is not enabled so reported location may not be accurate)
(gdb)
The precise memory mode cannot be enabled until the inferior is started or attached. If at that time all AMD GPU chips accessible to the inferior support the ‘on’ mode, then it is enabled. For example:
(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb) run
...
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled)
(gdb)
Alternatively, if at that time any of the AMD GPU chips accessible to the inferior do not support the ‘on’ mode, then a warning is reported and the mode is not enabled. For example:
(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb) run
AMDGPU precise memory access reporting could not be enabled
(gdb)
If the inferior is already executing when setting the ‘on’ mode, then a warning will be reported immediately. For example:
(gdb) set amdgpu precise-memory on
AMDGPU precise memory access reporting could not be enabled
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb)
Otherwise, setting the ‘on’ mode will enable it immediately. For example:
(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled)
(gdb)
show amdgpu precise-memory
show amdgpu precise-memory displays the currently requested AMD GPU precise memory setting. If ‘on’ has been requested, the message also indicates if it is currently enabled. For example:
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb)
The set debug amdgpu log-level level command can be used to enable diagnostic messages for the AMD GPU target. The show debug amdgpu log-level command displays the current AMD GPU target log level. See set debug amdgpu.
For example, the following will enable information messages and send the log to a new file:
(gdb) set debug amdgpu log-level info
(gdb) set logging overwrite
(gdb) set logging file log.out
(gdb) set logging debugredirect on
(gdb) set logging on
If you want to print the log to both the console and a file, omit the set logging debugredirect on command. See Logging Output.
ROCGDB AMD GPU support is currently a prototype and has the following restrictions. Future releases aim to address these restrictions.
The info agents, info queues, info dispatches, queue find, and dispatch find commands are supported in the command line interface. However, they have no Python bindings.
The debugger convenience variable $_wave_id is available, which returns a string that has the format:

(work-group-x,work-group-y,work-group-z)/work-group-thread-index

Where:

work-group-x, work-group-y, work-group-z
The grid position of the thread’s work-group within the heterogeneous dispatch.
work-group-thread-index
The thread’s number within the heterogeneous work-group.
The address space qualification of addresses described in Heterogeneous Debugging is not implemented. However, the default address space for AMD GPU threads is generic. This allows a generic address to be used to read or write in the global, group, or private address spaces. For the AMD ROCm platform, the AMD GPU generic address value for global addresses is the same; for group addresses it has the most significant 32 bits of the address set to 0x00010000; and for private addresses it has the most significant 32 bits of the address set to 0x00020000. A generic private address only accesses lane 0 of the currently focused wavefront. A group address accesses the group segment memory shared by all wavefronts that are members of the same work-group as the currently focused wavefront.
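The generic-address encoding just described can be sketched in a few lines of Python. This is an illustrative classifier under the stated 0x00010000 / 0x00020000 upper-32-bit values, not part of ROCGDB or the ROCm runtime:

```python
# Upper-32-bit values for AMD GPU generic addresses, as stated above
# for the AMD ROCm platform.
GROUP_APERTURE = 0x00010000
PRIVATE_APERTURE = 0x00020000

def classify_generic_address(addr: int) -> str:
    """Classify a 64-bit generic address by its most significant 32 bits.

    group addresses carry 0x00010000 in the upper 32 bits, private
    addresses carry 0x00020000, and anything else is a global address.
    """
    upper = (addr >> 32) & 0xFFFFFFFF
    if upper == GROUP_APERTURE:
        return "group"
    if upper == PRIVATE_APERTURE:
        return "private"
    return "global"
```

For example, a generic address of 0x0001000000001234 would resolve into the group (LDS) segment at offset 0x1234.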
Variable information is not currently available for use in the print command and breakpoint conditions. This includes static variables, local variables, function arguments, and any language types. However, global symbols for functions and variables can be specified, and source line information is available.
The backtrace command can only show the current frame and parent frames that are fully inlined. Function or kernel arguments will not be displayed; instead, an empty formal argument list may be shown.
The next command may not step over function calls, and may instead stop at the first statement of the called function.
The AMD ROCm compiler currently adds the -gline-tables-only, -mllvm -disable-dwarf-locations, and -mllvm -amdgpu-spill-cfi-saved-regs options for AMD GPU when the -ggdb option is specified. These ensure that source line information is generated, that invalid DWARF for source variables is not generated, and that registers not currently supported by the CFI generation are saved so the CFI information is correct. If these options are not used, the invalid DWARF may cause ROCGDB to report that it is unable to read memory (such as when reading arguments in a backtrace).
Error while mapping shared library sections: `file:///rocm/bit_extract#offset=6751&size=3136': ELF file ABI version (0) is not supported.
DWARF 4 is the default for the AMD ROCm compiler.
The set breakpoint always-inserted on command can be used to change the default to remove breakpoints when at the command line in all-stop mode, but this may result in new wavefronts missing breakpoints.
A breakpoint can be set using tbreak so only one thread reports the breakpoint and the other threads hitting the breakpoint will be continued. A similar effect can be achieved by deleting the breakpoint manually when it is hit.
Therefore, multiple ROCGDB processes can each debug a process provided the cgroups specify disjoint sets of AMD GPU devices. However, a single ROCGDB process cannot debug multiple inferiors that use AMD GPU devices even if those inferiors have cgroups that specify disjoint AMD GPU devices. This is because the ROCGDB process must have all the AMD GPU devices in its cgroups and so will attempt to enable debugging for all AMD GPU devices for all inferiors it is debugging.
It is suggested to use Docker rather than cgroups directly to limit the AMD GPU devices visible inside a container:
The render-minor-number for a device can be obtained by looking at the ‘drm_render_minor’ field value from:
cat /sys/class/kfd/kfd/topology/nodes/<node-number>/properties
All processes running in the container will see the same subset of devices. By having two containers with non-overlapping sets of AMD GPUs, it is possible to use ROCGDB in both containers at the same time since each AMD GPU will only have one ROCGDB process accessing it.
For example:
docker run -it --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 \
    --group-add render ubuntu:20.04 /bin/bash
If source line positions are used that only correspond to source lines in unloaded code objects, then ROCGDB may not set pending breakpoints, and instead set breakpoints in unpredictable places of the loaded code objects if they contain code from the same file. This can result in unexpected breakpoint hits being reported. When the code object containing the source lines is loaded, the incorrect breakpoints will be removed and replaced by the correct ones. This problem can be avoided by only setting breakpoints in unloaded code objects using symbol or function names.
The HIP_ENABLE_DEFERRED_LOADING environment variable can be used to disable deferred code object loading by the HIP runtime. This ensures all code objects will be loaded when the inferior reaches the beginning of the main function.
For example,
export HIP_ENABLE_DEFERRED_LOADING=0
Note: If deferred code object loading is disabled and the application performs a fork, then the program may crash.
This can be worked around by setting a breakpoint on abort and using the non-stop mode (see Non-Stop Mode). This will prevent the AMD ROCm runtime from terminating the inferior, while allowing ROCGDB to report the memory violation.
warning: unable to open /proc file '/proc/1234/status'
This can happen due to memory violations in the AMD GPU code as described in the previous item. To prevent the errors, do not continue the application after the AMD ROCm runtime has invoked abort.
The AMD GPU supported architectures provide a maximum of 4 hardware write watchpoints. Precise read watchpoints or access watchpoints are not supported.
The x86 architecture provides 4 hardware watchpoints that can each monitor up to 8 bytes.
When ROCGDB is used with x86 and AMD GPU devices, hardware watchpoints are therefore limited to at most 4 write watchpoints that have a collective size of up to 32 bytes. The collective size is calculated by adding the size of each watchpoint rounded up to a multiple of 8 bytes. Software emulation will be used for watchpoints that exceed the hardware limitations.
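The collective-size rule above can be written out directly. This is a hypothetical helper illustrating the arithmetic (each watchpoint rounded up to a multiple of 8 bytes, at most 4 slots and 32 bytes in total), not a ROCGDB interface:

```python
def collective_size(watchpoint_sizes):
    """Sum of watchpoint sizes, each rounded up to a multiple of 8 bytes."""
    return sum((size + 7) // 8 * 8 for size in watchpoint_sizes)

def fits_in_hardware(watchpoint_sizes):
    """True if the watchpoints fit the 4-slot, 32-byte hardware budget;
    otherwise software emulation would be used."""
    return (len(watchpoint_sizes) <= 4
            and collective_size(watchpoint_sizes) <= 32)
```

For example, three watchpoints of 1, 8, and 9 bytes have a collective size of 8 + 8 + 16 = 32 bytes and still fit, while four 9-byte watchpoints (collective size 64 bytes) would fall back to software emulation.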
Currently, watchpoints are only created on the CPU, and not the AMD GPU, until the AMD ROCm runtime is initialized. With deferred code object loading disabled, this does not happen until the inferior reaches the beginning of the main function. With deferred code object loading enabled, this does not happen until the first kernel is executed. This also means that, when the inferior is re-run, watchpoints are only re-activated on the CPU, not on the AMD GPU.
The HSA_ENABLE_SDMA environment variable can be set to ‘0’ to prevent the AMD ROCm runtime from using DMA for transfers between the CPU and AMD GPU.
The HSA_LOADER_ENABLE_MMAP_URI environment variable can be used to request that the AMD ROCm runtime attempt to determine the file containing the code object memory so that ‘file://’ URIs can be reported.
For example,
export HSA_LOADER_ENABLE_MMAP_URI=1
gdbserver is not supported.