Using the GNU debugger [gdb] with NCSA Linux clusters.

First and foremost, consult the official GNU manual for detailed information about gdb:

The GNU GDB manual

You can run your code under gdb in both batch and interactive modes.

The GNU DDD graphical front end to gdb is available on the head nodes. It can be useful when examining core files--especially if you're not familiar with gdb.

Getting started with gdb on the clusters [running interactively or in batch]:

Don't forget to recompile all your code with the -g option so that it contains symbol information for the debugger.
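For example, a debug rebuild might look like the sketch below. The mpicc wrapper and the -O0 flag are assumptions on my part--use whatever compiler command you normally build with, adding -g:

```shell
# Rebuild with symbol information (-g). Adding -O0 is optional but keeps
# the optimizer from reordering code, so gdb's line numbers match the
# source. "mpicc" is an assumed MPICH compiler wrapper, not confirmed
# for this site.
mpicc -g -O0 -o hi_vmi hello_world.c
```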

Use "qsub -I -V" to get an interactive job; you can then run MPI jobs with vmirun_gdb.  Batch jobs use the same method--replace vmirun with vmirun_gdb in your batch script.  NOTE: batch jobs run under gdb may hang in the run state until their walltime limit is exceeded.
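An interactive session might look like the following sketch. The resource list and the -np argument are placeholders, not confirmed syntax--pass vmirun_gdb whatever arguments you normally give vmirun:

```shell
# Request an interactive job, exporting the current environment (-V).
# The nodes/walltime resource list is an assumed example.
qsub -I -V -l nodes=2:ppn=2,walltime=00:30:00

# Once on the compute node, launch the MPI program under gdb by
# substituting vmirun_gdb for vmirun:
vmirun_gdb -np 4 ~/mpi/hi_vmi
```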

Example output [C & gdb]:

VMI: Loading protocol Shmem from /etc/vmi/Devices/Shmem/libVMIShmem.so.
VMI: Loading protocol GM from /etc/vmi/Devices/GM/libVMIGM.so.
VMI/GM: Max # of Myrinet Nodes: 159, Max Ports: 8. PPP ->  1
VMI: Following protocol drivers were loaded:
         [1] Shmem VMI Protocol Driver v0.1
          (C) 1999 NCSA, University of Illinois
         [2] GM-1.4 (Multi Port) VMI Protocol Driver v0.3
          (C) 2001 NCSA, University of Illinois
Started VMI on node titan114(0)
Started VMI on node titan114(1)
Started VMI on node titan115(2)
Started VMI on node titan115(3)
titan115
titan115
titan114
titan114
[New Thread 1024 (LWP 1398)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 1398)]
0x4000000000002171 in main (argc=1, argv=0x80000fffffffb328)
    at hello_world.c:43
43         badptr=0;
 
 
 



Getting started with gdb on the clusters [running in batch mode to create a core file]:

In some cases, you'll want to examine core files [to get a backtrace listing].

Use vmirun_gdbcore instead of vmirun in your batch script.  It will set up mpich-vmi and your environment to generate a core file on program errors [mpich-vmi normally traps signals and cleans up without creating a core file].
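A minimal batch script sketch, assuming PBS-style directives and a -np argument as in the interactive case--adjust both to your site and to the arguments you normally pass vmirun:

```shell
#!/bin/sh
# Assumed example resource request; change to match your job.
#PBS -l nodes=2:ppn=2,walltime=00:30:00

cd $PBS_O_WORKDIR

# vmirun_gdbcore sets up mpich-vmi and the environment so that a core
# file is left behind on program errors, instead of the usual signal
# trapping and cleanup.
vmirun_gdbcore -np 4 ./hi_vmi
```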

Your program should create a core file when it reaches the error.
You can examine the core file with gdb at any time after the job has run.

Batch example output [C & gdb]:

 [arnoldg@hn03 6581]$ gdb ~arnoldg/mpi/hi_vmi core

GNU gdb Red Hat Linux 7.x (5.0rh-15) (MI_OUT)
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/u/ncsa/arnoldg/mpi/hi_vmi'.
Program terminated with signal 11, Segmentation fault.
Error while mapping shared library sections:
/etc/vmi/Devices/Shmem/libVMIShmem.so: No such file or directory.
Error while mapping shared library sections:
/etc/vmi/Devices/GM/libVMIGM.so: No such file or directory.
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libpthread.so.0...done.

warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 3806)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 3806): generic error
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Error while reading shared library symbols:
/etc/vmi/Devices/Shmem/libVMIShmem.so: No such file or directory.
Error while reading shared library symbols:
/etc/vmi/Devices/GM/libVMIGM.so: No such file or directory.
#0  0x080497aa in main (argc=1, argv=0xbfffe1a4) at hello_world.c:41
41         *badptr=3.5;
(gdb) list
36              printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
37
38
39      {
40         badptr=0;
41         *badptr=3.5;
42      }
43
44
45              MPI_Finalize();
(gdb) backtrace
#0  0x080497aa in main (argc=1, argv=0xbfffe1a4) at hello_world.c:41
#1  0x40055316 in __libc_start_main (main=0x8049700 <main>, argc=1,
    ubp_av=0xbfffe1a4, init=0x8049270 <_init>, fini=0x805eaa0 <_fini>,
    rtld_fini=0x4000d2fc <_dl_fini>, stack_end=0xbfffe19c)
    at ../sysdeps/generic/libc-start.c:129
(gdb)
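Beyond the list and backtrace commands shown above, a few standard gdb commands cover most post-mortem work on a core file:

```
(gdb) backtrace          # where the program died, with the call stack
(gdb) frame 0            # select a stack frame by number
(gdb) list               # show source around the current line
(gdb) info locals        # values of local variables in this frame
(gdb) print badptr       # inspect a specific variable or expression
(gdb) quit
```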