SOLVED: PPPoE/pppd dying - Roundtable

21 Dec 2013


Some may recall my strange problem from a few months back, where my PPPoE 
connection hung up whenever my watchdog script restarted (original email 
included at bottom), to which no one replied ;-)
After a few serendipitous epiphanies I stumbled upon the possibility of 
kernel "session id" as the culprit.  I could get around that by starting 
pppoe stuff with setsid (either in perl or right at the command line).  
That got me my own session (removing me from the same session as the 
watchdog script).  But that didn't solve it.
That led me to process groups, but that was not it either.
Then I stumbled across /proc/*/cgroup in the filesystem and noticed that the 
ppp processes were (still) tied to my watchdog script via the systemd unit 
name (call it my-watchdog).  So, where pppd is PID 13480:
# cat /proc/13480/cgroup
10:perf_event:/
9:blkio:/
8:net_cls:/
7:freezer:/
6:devices:/
5:memory:/
4:cpuacct,cpu:/system/my-watchdog.service
3:cpuset:/
2:name=systemd:/system/my-watchdog.service
So that was the tie between the watchdog and the ppp* programs!  Time to 
learn cgroups... argh.
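Checking that tie for any PID can be sketched as a pair of small shell functions. This assumes the cgroup v1 layout shown in the listing above (id:controllers:path); the function names and the PROC_ROOT override are made up for illustration and are not part of any tool.

```shell
# Print the cgroup path a PID has in the "name=systemd" hierarchy.
# PROC_ROOT is a hypothetical override (default /proc) so the function can be
# tried against a saved copy of the cgroup file.
systemd_cgroup_of() {
    awk -F: '$2 == "name=systemd" { print $3 }' "${PROC_ROOT:-/proc}/$1/cgroup"
}

# Succeed if the PID's systemd cgroup path ends in the given unit name.
in_unit_cgroup() {
    case "$(systemd_cgroup_of "$1")" in
        */"$2") return 0 ;;
        *)      return 1 ;;
    esac
}

# Example use:
#   in_unit_cgroup 13480 my-watchdog.service && echo "still tied to the watchdog"
```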
To make a long story short, this is all systemd's fault!  Surprise!  It 
makes sense now, as this bug only appeared for me around the time Fedora 
switched to systemd.  Before that I was using inittab and then upstart to 
control these things and those tools don't use cgroups (well, upstart 
might?).  Systemd, as a "feature", puts each *.service into its own 
eponymous cgroup.  All child processes of the service then inherit the same 
cgroup membership.
Then when the parent *.service gets restarted or killed, either the kernel 
or systemd (haven't deduced which; ideas?) propagates the kill signal to 
all members of the cgroup.  That explains why pppd died along with my script.
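On the kernel-or-systemd question: the systemd.kill(5) man page describes a KillMode= setting whose default, control-group, has systemd itself signal every process remaining in the unit's cgroup when the unit is stopped. That points at systemd as the one doing the propagating. A minimal sketch of a unit fragment that, per that man page, limits the kill to the main process only. It was not tried on the Fedora 19 box described here, so treat it as an assumption rather than a confirmed fix:

```ini
# my-watchdog.service (fragment)
[Service]
# Default is control-group: all processes in the unit's cgroup get signalled.
# "process" signals only the service's main process on stop/restart.
KillMode=process
```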
The solution I settled on (for now), out of the 20 or so ways it seemed one 
could fix it, is to put the following in the [Service] section of my 
my-watchdog.service file:
ControlGroup=cpuacct:/ name=systemd:/
Which then causes the relevant /proc/*/cgroup files to look like 
(relevant lines only):
4:cpuacct,cpu:/
2:name=systemd:/
Thus tying them to the "root" cgroup and eliminating the propagation of 
kill signals.  (Which hints that it is systemd propagating, not the 
kernel unless the kernel explicitly excludes the root cgroup.)
I also achieved the same thing by writing directly to the /sys/fs/cgroup/ 
files, but that only achieves temporary independence and would have to be 
done each time ppp* is started.
echo <pid> >> /sys/fs/cgroup/cpu,cpuacct/tasks
echo <pid> >> /sys/fs/cgroup/systemd/tasks
That's actually a viable solution, and perhaps better, as it allows me to 
keep my-watchdog in its own cgroup as systemd wants to do whilst 
separating out just the ppp* stuff.  BUT, systemd's online docs indicate 
systemd will in the future block manual fs configuration like this.
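The two echo lines above can be wrapped in a helper that the watchdog calls each time it brings the link up. This is only a sketch: the function name and the CGROUP_ROOT override are hypothetical, the hierarchy names are the ones from this box, and writing to the real files needs root.

```shell
# Move one PID into the root cgroup of the cpu,cpuacct and systemd hierarchies.
# CGROUP_ROOT is a hypothetical override (default /sys/fs/cgroup) so the
# function can be tried against a scratch directory first.
move_to_root_cgroup() {
    pid=$1
    root=${CGROUP_ROOT:-/sys/fs/cgroup}
    for hier in cpu,cpuacct systemd; do
        # Same write as the manual echo lines; stop on the first failure.
        echo "$pid" >> "$root/$hier/tasks" || return 1
    done
}

# Example use, after the link is up:
#   for p in $(pgrep -x pppd) $(pgrep -x pppoe); do move_to_root_cgroup "$p"; done
```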
Also, the systemd docs and current Fedora 19 implementation seem to be a 
bit at odds and many of the options they discuss do not work when tried 
(invalid lvalue, etc).
I also think that my solution here may not work in the future as they lock 
down cgroups more.
I think perhaps the "real" solution to all of this is to somehow make ppp* 
a unit in systemd and have my-watchdog use systemd to control it.  That 
way we achieve cgroup independence without doing anything weird.  But, 
pppoe (client) doesn't really fit the systemd *.service mold as such and 
no one I can see has implemented it that way yet.  This may be required in 
the future, however.
The moral of the story is, if you use systemd and your service 
files/programs launch a process as a "side effect", that process will die 
when your service gets restarted/killed.  If you are starting anything 
daemon-like that you want to survive, you'll need to implement workarounds 
as described in this email.
On 20130324 17:53:30, Trevor Cordes wrote:
...
I have a script that starts/restarts PPPoE/pppd (MTS aDSL) as required on
a box.  It used to work A-OK.  After updates a while back (probably from
Fedora 14 -> 16, I think) a weird thing started happening:
When my script is stopped (to be restarted after an update for instance),
it instantly kills pppd and so my pppoe connection.
pppd[1572]: Terminating on signal 15
pppd[1572]: Connect time 7.5 minutes.
pppd[1572]: Sent 136980 bytes, received 76924 bytes.
pppd[1572]: Connection terminated.
pppoe[1573]: read (asyncReadFromPPP): Session 14782: Input/output error
pppoe[1573]: Sent PADT
pppd[1572]: Exit.
sig 15 is SIGTERM.
Before this happens, I can see that the top of the process tree involved is
owned by init (PID 1):
root      4393     1  0 17:30 ?        00:00:00 /bin/bash /sbin/pppoe-connect /etc/sysconfig/network-scripts/ifcfg-ppp0
root      4417  4393  0 17:30 ?        00:00:00 /usr/sbin/pppd pty /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U  -m 1412    ipparam ppp0 linkname ppp0 noipdefault noauth default-asyncmap defaultroute hide-p$
nobody    4418  4417  0 17:30 ?        00:00:00 /usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth0 -T 60 -U -m 1412
One would think that if it's been "daemonized" to be owned by init then
killing the calling script would not affect it.  But it does.
What my script actually calls is ifup, and then Fedora's network magic in
turn calls pppoe/pppd.  ifup itself exits right after doing this (which is
probably why its children become children of init).
I tried adding a nohup before the ifup, but it doesn't help.  I guess that
makes sense since it's SIGTERM, not SIGHUP.
I'm out of ideas to make sure this stuff stays up after restarting my
script, and don't know what could have changed in Fedora to make this
happen.
Thanks!