Some may recall my strange pppoe hanging-up watchdog-script restart problem from a few months back (original email included at bottom), to which no one replied ;-)
After a few serendipitous epiphanies I stumbled upon the possibility of kernel "session id" as the culprit. I could get around that by starting pppoe stuff with setsid (either in perl or right at the command line). That got me my own session (removing me from the same session as the watchdog script). But that didn't solve it.
That led me to process groups, but that was not it either.
Then I stumbled across /proc/*/cgroup in the filesystem and noticed that the ppp processes were (still) tied to my watchdog script via the systemd unit name (call it my-watchdog). For example, where pppd is PID 13480:
    # cat /proc/13480/cgroup
    10:perf_event:/
    9:blkio:/
    8:net_cls:/
    7:freezer:/
    6:devices:/
    5:memory:/
    4:cpuacct,cpu:/system/my-watchdog.service
    3:cpuset:/
    2:name=systemd:/system/my-watchdog.service
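A quick way to repeat this check for any process is sketched below. It assumes the /proc/<pid>/cgroup format shown above (hierarchy:controllers:path). The function names and the CGROUP_PROC_ROOT override are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Print the cgroup membership lines for PID $1.
# CGROUP_PROC_ROOT is a hypothetical override (defaults to /proc) so the
# function can be pointed at a copy of the files.
cgroup_lines() {
    root="${CGROUP_PROC_ROOT:-/proc}"
    cat "$root/$1/cgroup"
}

# Succeed if PID $1 sits in a cgroup whose path ends in "/<unit>" in any
# hierarchy, e.g. /system/my-watchdog.service.
in_unit_cgroup() {
    cgroup_lines "$1" | awk -F: -v unit="$2" '
        { n = length($3) - length(unit) }
        n > 0 && substr($3, n) == "/" unit { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Example: in_unit_cgroup 13480 my-watchdog.service && echo "still tied to the watchdog"
```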
So that was the tie between the watchdog and the ppp* programs! Time to learn cgroups... argh.
To make a long story short, this is all systemd's fault! Surprise! It makes sense now, as this bug only appeared for me around the time Fedora switched to systemd. Before that I was using inittab and then upstart to control these things, and those tools don't use cgroups (well, upstart might?). Systemd, as a "feature", puts each *.service into its own eponymous cgroup. All child processes of the service then inherit that same cgroup.
Then when the parent *.service gets restarted or killed, either the kernel or systemd (haven't deduced which; ideas?) propagates the kill signal to all members of the cgroup. That answers the original question of why pppd dies along with my script.
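On the "which one propagates" question: systemd documents a per-unit KillMode setting (see systemd.kill(5)) whose default, control-group, means that stopping a unit signals every process remaining in that unit's cgroup. That points at systemd rather than the kernel. The sketch below shows one way to see what a unit is set to. It assumes the installed systemctl supports "show -p" and reports a KillMode property; the helper name is invented for illustration.

```shell
#!/bin/sh
# Read "Key=Value" lines (as printed by `systemctl show`) on stdin and
# print only the value of KillMode, if present.
kill_mode_from_show() {
    sed -n 's/^KillMode=//p'
}

# Usage (my-watchdog.service is the unit name used in this email):
#   systemctl show -p KillMode my-watchdog.service | kill_mode_from_show
```

If that prints control-group, setting KillMode=process in the [Service] section is the documented way to have only the main process signalled. Whether the Fedora 19 build honours it is an assumption.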
The solution I settled on (for now), out of the 20 or so ways it seemed one could fix it, is to put the following in the [Service] section of my my-watchdog.service file:
ControlGroup=cpuacct:/ name=systemd:/
Which then causes the relevant /proc/*/cgroup files to look like (relevant lines only):
    4:cpuacct,cpu:/
    2:name=systemd:/
Thus tying them to the "root" cgroup and eliminating the propagation of kill signals. (Which hints that it is systemd propagating, not the kernel, unless the kernel explicitly excludes the root cgroup.)
I also achieved the same thing by writing directly to the /sys/fs/cgroup/ files, but that only achieves temporary independence and would have to be done each time ppp* is started.
    echo <pid> >> /sys/fs/cgroup/cpu,cpuacct/tasks
    echo <pid> >> /sys/fs/cgroup/systemd/tasks
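Since the two echo commands above have to be repeated for every ppp* process each time one is started, they could be wrapped in a small helper. This is only a sketch: it assumes the same /sys/fs/cgroup layout as above, must run as root, and the function name and the CGROUP_FS_ROOT override are hypothetical.

```shell
#!/bin/sh
# Move PID $1 into the root cgroup of the cpu,cpuacct and systemd
# hierarchies by appending it to each hierarchy's top-level tasks file.
# CGROUP_FS_ROOT is a hypothetical override, defaulting to /sys/fs/cgroup.
detach_from_unit() {
    pid="$1"
    root="${CGROUP_FS_ROOT:-/sys/fs/cgroup}"
    for hier in cpu,cpuacct systemd; do
        tasks="$root/$hier/tasks"
        if [ -w "$tasks" ]; then
            echo "$pid" >> "$tasks" || return 1
        else
            echo "cannot write $tasks" >&2
            return 1
        fi
    done
}

# Example, run after the connection is up:
#   for p in $(pgrep -x pppd; pgrep -x pppoe); do detach_from_unit "$p"; done
```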
That's actually a viable solution, and perhaps better, as it allows me to keep my-watchdog in its own cgroup as systemd wants to do whilst separating out just the ppp* stuff. BUT, systemd's online docs indicate systemd will in the future block manual fs configuration like this.
Also, the systemd docs and current Fedora 19 implementation seem to be a bit at odds and many of the options they discuss do not work when tried (invalid lvalue, etc).
I also think that my solution here may not work in the future as they lock down cgroups more.
I think perhaps the "real" solution to all of this is to somehow make ppp* a unit in systemd and have my-watchdog use systemd to control it. That way we achieve cgroup independence without doing anything weird. But, pppoe (client) doesn't really fit the systemd *.service mold as such and no one I can see has implemented it that way yet. This may be required in the future, however.
The moral of the story is, if you use systemd and your service files/programs launch a process as a "side effect", that process will die when your service gets restarted/killed. If you are starting anything daemon-like that you want to survive, you'll need to implement workarounds as described in this email.
On 20130324 17:53:30, Trevor Cordes wrote:
I have a script that starts/restarts PPPoE/pppd (MTS aDSL) as required on a box. It used to work A-OK. After updates a while back (probably from Fedora 14 -> 16 I think) a weird thing started happening:
When my script is stopped (to be restarted after an update for instance), it instantly kills pppd and so my pppoe connection.
    pppd[1572]: Terminating on signal 15
    pppd[1572]: Connect time 7.5 minutes.
    pppd[1572]: Sent 136980 bytes, received 76924 bytes.
    pppd[1572]: Connection terminated.
    pppoe[1573]: read (asyncReadFromPPP): Session 14782: Input/output error
    pppoe[1573]: Sent PADT
    pppd[1572]: Exit.
sig 15 is SIGTERM.
I can see before this happens that the processes involved sit at the root of the process tree, owned by init (PID 1):
    root 4393 1 0 17:30 ? 00:00:00 /bin/bash /sbin/pppoe-connect /etc/sysconfig/network-scripts/ifcfg-ppp0
    root 4417 4393 0 17:30 ? 00:00:00 /usr/sbin/pppd pty /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U -m 1412 ipparam ppp0 linkname ppp0 noipdefault noauth default-asyncmap defaultroute hide-p$
    nobody 4418 4417 0 17:30 ? 00:00:00 /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U -m 1412
One would think that if it's been "daemonized" to be owned by init then killing the calling script would not affect it. But it does.
What my script actually calls is ifup, and then Fedora's network magic in turn calls pppoe/pppd. ifup itself exits right after doing this (which is probably why its children become children of init).
I tried adding a nohup before the ifup, but it doesn't help. I guess that makes sense since it's SIGTERM, not SIGHUP.
I'm out of ideas to make sure this stuff stays up after restarting my script, and don't know what could have changed in Fedora to make this happen.
Thanks!