GDB attach-to-process debugging: reading the assembly code of large-scale evolutionary algorithms

For my PhD, I experiment with evolutionary algorithms. Developing them can occasionally be heavy on computational resources, and I really want the program code to run as lean as possible. I parallelize the code, strip all unnecessary bloat, optimize it, and still it may take a month of CPU time to compute the experimental data. Getting a successful and competent experiment run may take several months. So I break the computation into pieces and manually run the experiment part by part. And still, it takes too much time. Then I decide to have a look at the running code on the server. The data is valuable, so recompiling with debugging symbols and re-running is not an option. I go with the assembly code of the executable. This is where this story begins.

Today, I noticed that the computation of one of the experiment runs had suddenly become unresponsive and had possibly got stuck somehow. Since there was still no progress after a few hours, I decided to dig into it. I started debugging the running Linux process by attaching the debugger, using a copy of the executable that I had carefully backed up before starting the experiment run. Since the program was started using nohup, superuser privileges were necessary for attachment:

$ tar xf cec2011-algorithm_source.tgz; cd codes/cec2011
$ ps aux | grep cec2011adv
ales 27284 94.5 0.7 358228 26040 pts/4 tl Jun10 1002:36 ./cec2011adv funcs 21 22
$ sudo gdb cec2011adv 27284
(gdb) bt
#8  0x00007f651cbc1dc4 in mlfMY_FUNCTION14 () from ./libCEC2011.so
#9  0x0000000000413830 in cost_function14(double*, double*) ()
#10 0x000000000040f55a in jDEtevc2006::selectionTrialVsTarget(int) ()
#11 0x000000000040f29f in jDEtevc2006::optimizeOneGeneration(int&, int) ()
#12 0x000000000040e099 in aDEtevc2012::optimizeOneGeneration(int&, int) ()
#13 0x000000000040e159 in aDEtevc2012::optimize(double*, double*) ()
#14 0x000000000040240f in OptDispatcherMPI::workerOptimizeCEC2011Function(int, int, int) ()
#15 0x0000000000402749 in OptDispatcherMPI::optimizeCEC2011FunctionParallel(int, int, std::vector<double, std::allocator<double> >&) ()
#16 0x0000000000402f0d in OptDispatcherMPI::runOptimizationOnAllCEC2011Functions() ()
#17 0x000000000040fed7 in main ()

Knowing that the most useful piece of information is the current value of the number-of-evaluations variable, I go searching for it in the memory of the process. But without a symbol table for the optimized executable, this gets heavy; disassembly comes to the rescue.

(gdb) frame 10
(gdb) disassemble
Dump of assembler code for function _ZN11jDEtevc200622selectionTrialVsTargetEi:
   0x000000000040f540 <+0>:     push   %rbp
   0x000000000040f541 <+1>:     mov    %esi,%ebp
   0x000000000040f543 <+3>:     push   %rbx
   0x000000000040f544 <+4>:     mov    %rdi,%rbx
   0x000000000040f547 <+7>:     sub    $0x8,%rsp
   0x000000000040f54b <+11>:    movslq 0x68(%rbx),%rax
   0x000000000040f54f <+15>:    mov    0x50(%rdi),%rdi
   0x000000000040f553 <+19>:    lea    (%rdi,%rax,8),%rsi
   0x000000000040f557 <+23>:    callq  *0x8(%rbx)
=> 0x000000000040f55a <+26>:    mov    0x40(%rbx),%edx
   0x000000000040f55d <+29>:    mov    0x58(%rbx),%rsi
   0x000000000040f561 <+33>:    mov    %rbx,%rdi
   0x000000000040f564 <+36>:    mov    (%rbx),%rax
   0x000000000040f567 <+39>:    addl   $0x1,0x44(%rbx)
   0x000000000040f56b <+43>:    imul   %ebp,%edx
   0x000000000040f56e <+46>:    movslq %edx,%rdx
   0x000000000040f571 <+49>:    lea    (%rsi,%rdx,8),%rdx
   0x000000000040f575 <+53>:    mov    0x50(%rbx),%rsi
   0x000000000040f579 <+57>:    callq  *0x28(%rax)
   0x000000000040f57c <+60>:    test   %al,%al
   0x000000000040f57e <+62>:    jne    0x40f590 <_ZN11jDEtevc200622selectionTrialVsTargetEi+80>
   0x000000000040f580 <+64>:    add    $0x8,%rsp
   0x000000000040f584 <+68>:    pop    %rbx
   0x000000000040f585 <+69>:    pop    %rbp
   0x000000000040f586 <+70>:    retq
   0x000000000040f587 <+71>:    nopw   0x0(%rax,%rax,1)
   0x000000000040f590 <+80>:    mov    0x40(%rbx),%edx
   0x000000000040f593 <+83>:    mov    0x58(%rbx),%rcx
   0x000000000040f597 <+87>:    mov    %rbx,%rdi
   0x000000000040f59a <+90>:    mov    (%rbx),%rax
   0x000000000040f59d <+93>:    mov    0x50(%rbx),%rsi
   0x000000000040f5a1 <+97>:    imul   %ebp,%edx
   0x000000000040f5a4 <+100>:   movslq %edx,%rdx
   0x000000000040f5a7 <+103>:   lea    (%rcx,%rdx,8),%rdx
   0x000000000040f5ab <+107>:   callq  *0x30(%rax)
   0x000000000040f5ae <+110>:   mov    (%rbx),%rax
   0x000000000040f5b1 <+113>:   mov    0x48(%rbx),%rdx
   0x000000000040f5b5 <+117>:   mov    %rbx,%rdi
   0x000000000040f5b8 <+120>:   mov    0x50(%rbx),%rsi
   0x000000000040f5bc <+124>:   callq  *0x28(%rax)
   0x000000000040f5bf <+127>:   test   %al,%al
   0x000000000040f5c1 <+129>:   je     0x40f580 <_ZN11jDEtevc200622selectionTrialVsTargetEi+64>
   0x000000000040f5c3 <+131>:   mov    (%rbx),%rax
   0x000000000040f5c6 <+134>:   mov    0x48(%rbx),%rdx
   0x000000000040f5ca <+138>:   mov    %rbx,%rdi
   0x000000000040f5cd <+141>:   mov    0x50(%rbx),%rsi
   0x000000000040f5d1 <+145>:   mov    0x30(%rax),%rax
   0x000000000040f5d5 <+149>:   add    $0x8,%rsp
   0x000000000040f5d9 <+153>:   pop    %rbx
   0x000000000040f5da <+154>:   pop    %rbp
   0x000000000040f5db <+155>:   jmpq   *%rax
End of assembler dump.

Yeah, that is the disassembly, the hard way. 🙂 After studying it carefully, I got the idea to look at the source code of the function at hand:

void jDEtevc2006::selectionTrialVsTarget(int i) {
    evaluate(trial, trial+idxFE); // evaluation

    if (isBetterFirst(trial, pop+i*L)) { // selection
        replaceFirstVectorIntoSecond(trial, pop + i*L);

        if (isBetterFirst(trial, best))
            replaceFirstVectorIntoSecond(trial, best);
    }
}

This yields the location of the NFevals variable in the assembly code:

0x000000000040f567 <+39>:    addl   $0x1,0x44(%rbx)

Here it is: on entry to the function, %rdi holds the this pointer, which gets copied into %rbx (at <+4>), and the NFevals variable sits at offset 0x44 from it:

(gdb) info registers rbx
rbx            0x1bebb10        29276944
(gdb) print *(int*)(0x1bebb10+0x44)
$0 = 19274

Well, it sure seemed unimaginable, but here it is: NFevals is 19274 at the point where the program loop breaks, and now I can go and see with a debugger what happened there. (The code problem was then fixed.)

Lesson learned? It sure is handy to have some GDB trickery up your sleeve.

