Strace is one of those tools that can be quite intimidating. But if you dont know how to use it, you are at a major disadvantage.
 

  • What does strace do? Well, it traces any system calls that a process makes.
  • Okay, why is that useful? Have you ever had an issue, or an error that no one else had? Well how the heck do you solve it? That is when we start tracing system calls.
  • Where can I find more details? The man page. https://linux.die.net/man/1/strace

 

Attaching to a running process.

Lets start from the most common use case. We want to attach to a process and see what it is doing:

[root@b08s02ur ~]# cat attach.sh
#!/bin/bash

for i in `seq 1 10`;
do
  sleep 10

done
[root@b08s02ur ~]#

 

We are going to loop through a sleep to see what it looks like when we attach.

[root@b08s02ur ~]# bash attach.sh &
[1] 3609
[root@b08s02ur ~]# ps aufx | grep attach
root      3609  0.0  0.0 113120  1380 pts/0    S    22:54   0:00  |       \_ bash attach.sh
root      3613  0.0  0.0 112648   956 pts/0    S+   22:54   0:00  |       \_ grep --color=auto attach
[root@b08s02ur ~]# strace -p 3609
Process 3609 attached
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3614
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3614, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffd57fefd10, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn()                          = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f1e5d877250}, {0x43e670, [], SA_RESTORER, 0x7f1e5d877250}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1e5e243a10) = 3618
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e670, [], SA_RESTORER, 0x7f1e5d877250}, {SIG_DFL, [], SA_RESTORER, 0x7f1e5d877250}, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3618
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3618, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffd57fefd10, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn()                          = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f1e5d877250}, {0x43e670, [], SA_RESTORER, 0x7f1e5d877250}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1e5e243a10) = 3619
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43e670, [], SA_RESTORER, 0x7f1e5d877250}, {SIG_DFL, [], SA_RESTORER, 0x7f1e5d877250}, 8) = 0
wait4(-1, ^CProcess 3609 detached
 <detached ...>
[root@b08s02ur ~]#

This should be pretty clear. We attached to process 3609, and what you are seeing is the sleep. But obviously we see the system call instead which is wait4(man wait4). The only parameter I used was -p, which allows me to specify the PID of the process to attach to.

 

Starting a process with strace.

Now, lets try starting the process using strace.

[root@b08s02ur ~]# cat start.sh
#!/bin/bash

if [ ! -x /tmp/cantread ];
then
  exit 1
fi
[root@b08s02ur ~]# bash start.sh
[root@b08s02ur ~]# echo $?
1
[root@b08s02ur ~]#

Sometimes the process exits so fast we cant always attach. The way to troubleshoot something like this is to start it with strace.

[root@b08s02ur ~]# strace -s 99 bash start.sh
execve("/usr/bin/bash", ["bash", "start.sh"], [/* 38 vars */]) = 0
brk(0)                                  = 0x2130000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f091a4a1000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=37298, ...}) = 0
mmap(NULL, 37298, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f091a497000
close(3)                                = 0
open("/lib64/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\316\0\0\0\0\0\0@\0\0\0\0\0\0\0\270\242\2\0\0\0\0\0\0\0\0\0@\0008\0\7\0@\0\34\0\33\0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\\O\2"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=174520, ...}) = 0
mmap(NULL, 2268928, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f091a058000
mprotect(0x7f091a07d000, 2097152, PROT_NONE) = 0
mmap(0x7f091a27d000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f091a27d000
close(3)                                = 0
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\16\0\0\0\0\0\0@\0\0\0\0\0\0\0\0E\0\0\0\0\0\0\0\0\0\0@\0008\0\7\0@\0!\0 \0\1\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\330\37\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=19776, ...}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0919e54000
mprotect(0x7f0919e56000, 2097152, PROT_NONE) = 0
mmap(0x7f091a056000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f091a056000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\34\2\0\0\0\0\0@\0\0\0\0\0\0\0000? \0\0\0\0\0\0\0\0\0@\0008\0\n\0@\0K\0J\0\6\0\0\0\5\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0000\2\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2118128, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f091a496000
mmap(NULL, 3932672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0919a93000
mprotect(0x7f0919c4a000, 2093056, PROT_NONE) = 0
mmap(0x7f0919e49000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f0919e49000

[...]

open("start.sh", O_RDONLY)              = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc893cd580) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "#!/bin/bash\n\nif [ ! -x /tmp/cantread ];\nthen \n  exit 1\nfi\n", 80) = 58
lseek(3, 0, SEEK_SET)                   = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0
fcntl(255, F_GETFD)                     = -1 EBADF (Bad file descriptor)
dup2(3, 255)                            = 255
close(3)                                = 0
fcntl(255, F_SETFD, FD_CLOEXEC)         = 0
fcntl(255, F_GETFL)                     = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(255, {st_mode=S_IFREG|0644, st_size=58, ...}) = 0
lseek(255, 0, SEEK_CUR)                 = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "#!/bin/bash\n\nif [ ! -x /tmp/cantread ];\nthen \n  exit 1\nfi\n", 58) = 58
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
faccessat(AT_FDCWD, "/tmp/cantread", X_OK) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1)                           = ?
+++ exited with 1 +++
[root@b08s02ur ~]#

Oh look, we can see it exited non 0, and we can see what happened directly above that. It appears it looked for a file that didnt exist, and then exited. This will be your typical scenario when you have a poorly documented binary looking for poorly packaged configuration files, and vague errors. If you add “-s 99999” it increases the snap length. (eg: those strings will turn to …’s much later.) As a rule, I use “-s 99999” when I am investigating. I do this because you could easily miss an important error you are looking for because it gets truncated.

 

Strace a process with children

What if the thing we are looking for is complex as heck, like troubleshooting a PHP error via Apache? Sure we could turn up error logging, and set a php_errorlog in the php.ini. But, for example sake, lets use strace.

[root@core ~]# cat /var/www/example/lame.php
<?php
fopen('/tmp/cantread', 'r');
?>
[root@core ~]# 

Since this file doesnt exist, it should simply throw an error. But how do we catch that among the noise of all the other threads? Well, we definitely need to increase the snap length(-s), and we also need to log it(-o). But there is also a neat flag that follows the child processes(-fF).

[root@core html]# strace -s 9999 -o my_log -fF  /etc/init.d/httpd start 
Starting httpd: [Sat Sep 02 10:17:41 2017] [warn] NameVirtualHost 64.85.161.114:443 has no VirtualHosts
[Sat Sep 02 10:17:41 2017] [warn] NameVirtualHost *:80 has no VirtualHosts
                                                           [  OK  ]

Now, the reason we are starting apache with strace instead of attaching, is because strace doesn’t seem to immediately attached to the preforked children. So we would have to either kill off the child processes, or wait for them to recycle naturally for strace to attach.

Another way to solve it, would be to change your prefork settings for apache to keep 1 spare server. But I find just starting it with strace to be easier.
 

Now, in another terminal lets query for the page.

[root@core ~]# curl core.example.com/lame.php

[root@core ~]# 

Okay, so now, if everything worked out, we should be able to see the error in our strace log file. Lets kill our strace, and check out the log. (if you strace something like httpd, you probably wont be able to ctrl+c, so just kill it from your other terminal or if its an init script, run stop.

Now, when you follow child processes be prepared for the log file to be a bit large. So you really need an idea of what you are looking for. In this case, I can grep for the cantread file, or lame.php

[root@core html]# grep cantread my_log
13800 read(16, "<?php\nfopen('/tmp/cantread', 'r');\n?>\n\n", 8192) = 39
13800 lstat64("/tmp/cantread", 0xbfc0be08) = -1 ENOENT (No such file or directory)
13800 open("/tmp/cantread", O_RDONLY)   = -1 ENOENT (No such file or directory)
13800 write(16, "[02-Sep-2017 10:21:53] PHP Warning:  fopen(/tmp/cantread) [<a href='function.fopen'>function.fopen</a>]: failed to open stream: No such file or directory in /var/www/example/lame.php on line 2\n", 190) = 190
[root@core html]# 

As you can see we get the no such file error for /tmp/cantread. Its worth noting that we can actually see the see the call to write the error to the php log.

 

Strace has been one of the secret weapons in my arsenal for years. Sometimes you just need a way to tease out more information on whats going on, and the code is not always readily available.