First and foremost, consult the official GNU manual for detailed information about the gdb debugger.
You can run your code and use gdb in both batch and interactive modes.
The GNU DDD graphical interface debugger is available on the head nodes. It could be useful when examining core files--especially if you're not familiar with gdb.
Getting started with gdb on the clusters [running interactive or batch]:
Don't forget to recompile all your code with the -g option so that it contains symbol information for the debugger.
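As a minimal sketch of that recompile step, the commands below rebuild a program with -g. The file name hello_world.c is borrowed from the example output on this page. The mpicc wrapper name and the optional -O0 flag are assumptions, not something this page specifies, so substitute whatever compiler command you normally use on the clusters.

```shell
# Assumed names: mpicc (compiler wrapper) and hello_world.c (source file).
CC="${CC:-mpicc}"
# -g adds the symbol information gdb needs; -O0 (optional, an assumption here)
# turns off optimization so reported line numbers track the source closely.
CFLAGS="-g -O0"

if command -v "$CC" >/dev/null 2>&1 && [ -f hello_world.c ]; then
    # $CFLAGS is intentionally unquoted so it splits into separate flags.
    "$CC" $CFLAGS -o hello_world hello_world.c
else
    echo "skipped build: need $CC on PATH and hello_world.c in this directory" >&2
fi
```

Every object file that goes into the executable needs the same flag, so a project built with make would normally carry -g in its CFLAGS rather than on a single command line.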
Use "qsub -I -V" to get an interactive job, then run MPI jobs with vmirun_gdb. In batch jobs, use the same method: replace vmirun with vmirun_gdb. NOTE: batch jobs run under gdb may hang in the run state until their walltime limit is exceeded.
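For the batch case, a job script might look like the sketch below, which writes a hypothetical debug_job.pbs to disk. The script name, the walltime value, the ./hello_world path, and the argument form given to vmirun_gdb are all assumptions; keep whatever arguments you already pass to vmirun. The walltime is short on purpose, because of the hang described in the NOTE.

```shell
# Write a hypothetical PBS job script; every name and value here is illustrative.
cat > debug_job.pbs <<'EOF'
#!/bin/sh
#PBS -l walltime=00:10:00
#PBS -V
# A short walltime limits how long a job hung under gdb can hold its nodes.
cd "$PBS_O_WORKDIR"
# Same command line you would give vmirun, with only the launcher name changed
# (assumed form).
vmirun_gdb ./hello_world
EOF

echo "submit with: qsub debug_job.pbs"
```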
Example output [C & gdb]:
VMI: Loading protocol Shmem from /etc/vmi/Devices/Shmem/libVMIShmem.so.
VMI: Loading protocol GM from /etc/vmi/Devices/GM/libVMIGM.so.
VMI/GM: Max # of Myrinet Nodes: 159, Max Ports: 8. PPP -> 1
VMI: Following protocol drivers were loaded:
[1] Shmem VMI Protocol Driver v0.1
(C) 1999 NCSA, University of Illinois
[2] GM-1.4 (Multi Port) VMI Protocol Driver v0.3
(C) 2001 NCSA, University of Illinois
Started VMI on node titan114(0)
Started VMI on node titan114(1)
Started VMI on node titan115(2)
Started VMI on node titan115(3)
titan115
titan115
titan114
titan114
[New Thread 1024 (LWP 1398)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 1398)]
0x4000000000002171 in main (argc=1, argv=0x80000fffffffb328) at hello_world.c:43
43      badptr=0;
In some cases, you'll be interested in examining core files [to get a backtrace listing].
Use vmirun_gdbcore instead of vmirun in your batch script. It will set up mpich-vmi and your environment to generate a core file on program errors [mpich-vmi normally traps signals and cleans up without creating a core file].
Your program should create a core file when it gets to the error.
You can examine the core file with gdb sometime after the job has run [at your convenience].
Batch example output [C & gdb]:
[arnoldg@hn03 6581]$ gdb ~arnoldg/mpi/hi_vmi core
GNU gdb Red Hat Linux 7.x (5.0rh-15) (MI_OUT)
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/u/ncsa/arnoldg/mpi/hi_vmi'.
Program terminated with signal 11, Segmentation fault.
Error while mapping shared library sections:
/etc/vmi/Devices/Shmem/libVMIShmem.so: No such file or directory.
Error while mapping shared library sections:
/etc/vmi/Devices/GM/libVMIGM.so: No such file or directory.
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libpthread.so.0...done.
warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 3806)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 3806): generic error
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Error while reading shared library symbols:
/etc/vmi/Devices/Shmem/libVMIShmem.so: No such file or directory.
Error while reading shared library symbols:
/etc/vmi/Devices/GM/libVMIGM.so: No such file or directory.
#0 0x080497aa in main (argc=1, argv=0xbfffe1a4) at hello_world.c:41
41      *badptr=3.5;
(gdb) list
36      printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
37
38
39      {
40      badptr=0;
41      *badptr=3.5;
42      }
43
44
45      MPI_Finalize();
(gdb) backtrace
#0 0x080497aa in main (argc=1, argv=0xbfffe1a4) at hello_world.c:41
#1 0x40055316 in __libc_start_main (main=0x8049700 <main>, argc=1,
ubp_av=0xbfffe1a4, init=0x8049270 <_init>, fini=0x805eaa0 <_fini>,
rtld_fini=0x4000d2fc <_dl_fini>, stack_end=0xbfffe19c)
at ../sysdeps/generic/libc-start.c:129
(gdb)
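
The same inspection can also be scripted, so that the listing and backtrace are captured without an interactive session. The sketch below relies on gdb's -batch and -x options. The file names bt.gdb and backtrace.txt and the ./hi_vmi path are placeholders chosen for illustration, not names used on the clusters.

```shell
# Commands for gdb to run against the core file (bt.gdb is a hypothetical name).
cat > bt.gdb <<'EOF'
list
backtrace
EOF

PROG="${PROG:-./hi_vmi}"   # placeholder path to the executable that dumped core
CORE="${CORE:-core}"       # core file left behind by the failed job

if command -v gdb >/dev/null 2>&1 && [ -f "$PROG" ] && [ -f "$CORE" ]; then
    # -batch makes gdb exit once the commands finish; -x reads them from a file.
    gdb -batch -x bt.gdb "$PROG" "$CORE" > backtrace.txt 2>&1
else
    echo "skipped: need gdb, $PROG and $CORE" >&2
fi
```

Run from the directory that holds the core file, this leaves gdb's output in backtrace.txt for later reading.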