Some may recall my strange problem from a few months back, where pppoe
hung up whenever my watchdog script was restarted (original email
included at bottom), to which no one replied ;-)
After a few serendipitous epiphanies I suspected the kernel "session id"
was the culprit. I tried to get around that by starting the pppoe stuff
with setsid (either in perl or right at the command line). That put the
ppp* processes in their own session (no longer the watchdog script's
session), but it didn't solve the problem.
That led me to process groups, which were not it either.
Then I stumbled across /proc/*/cgroup and noticed that the ppp processes
were (still) tied to my watchdog script via the systemd unit name (call
it my-watchdog). For example, where pppd is PID 13480:
# cat /proc/13480/cgroup
10:perf_event:/
9:blkio:/
8:net_cls:/
7:freezer:/
6:devices:/
5:memory:/
4:cpuacct,cpu:/system/my-watchdog.service
3:cpuset:/
2:name=systemd:/system/my-watchdog.service
So that was the tie between the watchdog and the ppp* programs! Time to
learn cgroups... argh.
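For anyone who wants to check their own processes, here is a minimal
sketch that pulls the name=systemd line out of a PID's cgroup file. It
assumes the cgroup layout shown in the listing above. The function name
systemd_cgroup_of and its optional second argument (an alternate /proc
root, only there so it can be tried against a copy) are my own inventions,
not anything systemd provides.

```shell
# Print the name=systemd cgroup path for a PID.
# $1 = PID, $2 = optional proc root (hypothetical testing hook, default /proc).
systemd_cgroup_of() {
    pid=$1
    proc_root=${2:-/proc}
    # Each line is "hierarchy-id:controllers:path". Select the name=systemd
    # hierarchy and print everything after the second colon.
    awk -F: '$2 == "name=systemd" { sub(/^[^:]*:[^:]*:/, ""); print }' \
        "$proc_root/$pid/cgroup"
}
```

With the listing above, "systemd_cgroup_of 13480" would print
/system/my-watchdog.service.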
To make a long story short, this is all systemd's fault! Surprise! It
makes sense now, as this bug only appeared for me around the time Fedora
switched to systemd. Before that I was using inittab and then upstart to
control these things, and those tools don't use cgroups (well, upstart
might?). Systemd, as a "feature", puts each *.service into its own
eponymous cgroup, and every child process of the service inherits that
cgroup.
Then when the parent *.service gets restarted or killed, either the kernel
or systemd (haven't deduced which; ideas?) propagates the kill signal to
all members of the cgroup. That explains why stopping my watchdog took
pppd down with it.
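To see in advance which processes would be caught by such a kill, one can
scan /proc for everything sharing the service's cgroup. This is only a
sketch under the same assumptions as the listing above (a name=systemd
line in each cgroup file). The function name pids_in_systemd_cgroup and
the optional proc-root argument are hypothetical helpers of mine.

```shell
# List every PID whose name=systemd cgroup equals the given path.
# $1 = cgroup path, e.g. /system/my-watchdog.service
# $2 = optional proc root (hypothetical testing hook, default /proc)
pids_in_systemd_cgroup() {
    want=$1
    proc_root=${2:-/proc}
    for f in "$proc_root"/[0-9]*/cgroup; do
        # Skip unmatched globs and processes that exited meanwhile.
        [ -r "$f" ] || continue
        path=$(awk -F: '$2 == "name=systemd" { sub(/^[^:]*:[^:]*:/, ""); print }' \
            "$f" 2>/dev/null)
        if [ "$path" = "$want" ]; then
            pid=${f%/cgroup}      # strip the trailing /cgroup
            echo "${pid##*/}"     # keep only the PID directory name
        fi
    done
}
```

Running "pids_in_systemd_cgroup /system/my-watchdog.service" before
restarting the service shows whether pppd and pppoe are on the list.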
The solution I settled on (for now), out of the roughly 20 ways it seemed
one could fix it, is to put the following in the [Service] section of my
my-watchdog.service file:
ControlGroup=cpuacct:/ name=systemd:/
That causes the relevant /proc/*/cgroup files to look like this
(relevant lines only):
4:cpuacct,cpu:/
2:name=systemd:/
This ties them to the "root" cgroup and eliminates the propagation of
kill signals. (That hints that it is systemd doing the propagating, not
the kernel, unless the kernel explicitly excludes the root cgroup.)
I also achieved the same thing by writing directly to the /sys/fs/cgroup/
files, but that only achieves temporary independence and would have to be
done each time ppp* is started.
echo <pid> >> /sys/fs/cgroup/cpu,cpuacct/tasks
echo <pid> >> /sys/fs/cgroup/systemd/tasks
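Since this has to be repeated every time ppp* starts, the two echo
commands can be wrapped in a small function. This is a sketch only: the
name move_to_root_cgroup is mine, the hierarchy names are the two used
above, and the optional second argument (an alternate cgroup root) is a
hypothetical hook for trying it out without touching the real files. It
needs root when pointed at /sys/fs/cgroup.

```shell
# Move a PID into the root cgroup of the cpu,cpuacct and systemd hierarchies.
# $1 = PID, $2 = optional cgroup root (testing hook, default /sys/fs/cgroup)
move_to_root_cgroup() {
    pid=$1
    cg_root=${2:-/sys/fs/cgroup}
    for h in cpu,cpuacct systemd; do
        # Writing a PID to the top-level tasks file reassigns that task
        # to the root cgroup of the hierarchy.
        echo "$pid" >> "$cg_root/$h/tasks" || return 1
    done
}
```

The watchdog would call this for each ppp* PID right after starting the
connection, e.g. "move_to_root_cgroup 13480".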
That's actually a viable solution, and perhaps better, as it allows me to
keep my-watchdog in its own cgroup as systemd wants to do whilst
separating out just the ppp* stuff. BUT, systemd's online docs indicate
systemd will in the future block manual fs configuration like this.
Also, the systemd docs and current Fedora 19 implementation seem to be a
bit at odds and many of the options they discuss do not work when tried
(invalid lvalue, etc).
I also think that my solution here may not work in the future as they lock
down cgroups more.
I think perhaps the "real" solution to all of this is to somehow make ppp*
a unit in systemd and have my-watchdog use systemd to control it. That
way we achieve cgroup independence without doing anything weird. But,
pppoe (client) doesn't really fit the systemd *.service mold as such and
no one I can see has implemented it that way yet. This may be required in
the future, however.
The moral of the story is, if you use systemd and your service
files/programs launch a process as a "side effect", that process will die
when your service gets restarted/killed. If you are starting anything
daemon-like that you want to survive, you'll need to implement workarounds
as described in this email.
On 20130324 17:53:30, Trevor Cordes wrote:
> I have a script that starts/restarts PPPoE/pppd (MTS aDSL) as required on
> a box. It used to work A-OK. After updates a while back (probably from
> Fedora 14 -> 16 I think) a weird thing started happening:
>
> When my script is stopped (to be restarted after an update for instance),
> it instantly kills pppd and so my pppoe connection.
>
> pppd[1572]: Terminating on signal 15
> pppd[1572]: Connect time 7.5 minutes.
> pppd[1572]: Sent 136980 bytes, received 76924 bytes.
> pppd[1572]: Connection terminated.
> pppoe[1573]: read (asyncReadFromPPP): Session 14782: Input/output error
> pppoe[1573]: Sent PADT
> pppd[1572]: Exit.
>
> sig 15 is SIGTERM.
>
> I can see before this happens that the processes involved are at the
> root of the process tree, owned by init (PID 1):
>
> root 4393 1 0 17:30 ? 00:00:00 /bin/bash /sbin/pppoe-connect /etc/sysconfig/network-scripts/ifcfg-ppp0
> root 4417 4393 0 17:30 ? 00:00:00 /usr/sbin/pppd pty /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U -m 1412 ipparam ppp0 linkname ppp0 noipdefault noauth default-asyncmap defaultroute hide-p$
> nobody 4418 4417 0 17:30 ? 00:00:00 /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U -m 1412
>
> One would think that if it's been "daemonized" to be owned by init then
> killing the calling script would not affect it. But it is.
>
> What my script actually calls is ifup and then Fedora's network magic in
> turn calls pppoe/pppd. ifup itself exits right after doing this (which is
> probably why its children become children of init).
>
> I tried adding a nohup before the ifup, but it doesn't help. I guess that
> makes sense since it's SIGTERM, not SIGHUP.
>
> I'm out of ideas to make sure this stuff stays up after restarting my
> script, and don't know what could have changed in Fedora to make this
> happen.
>
> Thanks!