#linuxcnc-devel Logs
Nov 14 2018
#linuxcnc-devel Calendar
03:28 AM rmu: seems hm2_rpspi is using excessive read- and write memory barriers
03:31 AM rmu: I'm pretty sure that those are not needed iff running realtime stuff on one core with isolcpus
05:05 AM rmu: still debugging rpi stuff.. there is definitely something strange going on here. I see no reason why e.g. pid calculaction that usually has a thread time of around 600 can spike up to 35000
05:06 AM rmu: strange thing is that high function execution times line up, as if the whole CPU was slowed down for 0,5ms or so
06:46 AM jepler: https://github.com/golang/go/issues/1435 well this explains something I didn't understand before, why using setuid() in rtapi_app led to large latencies. glibc's setuid() wrapper actually interrupts all threads in the process in order to do its dirty work! Going more linux-specific, and using the relevant syscalls directly, might be able to work around it, but hurt portability.
06:46 AM jepler: sub-*processes* that open privileged file descriptors and pass them back is probably a long term and portable design, in case our freebsd porter ever comes back
06:48 AM jepler: rmu: interesting findings, does the CPU do thermal throttling or anything of that nature, that is documented?
06:50 AM rmu: jepler: i put a lot of effort into making sure no thermal or voltage throttling is happening, vcgencmd get_throttled reports 0x0 while this is happening
06:51 AM rmu: i tried different clock settings of "core" between 250 and 400 MHz and turned off frequency scaling
06:53 AM rmu: next thing i will try is doing some old-school gpio debugging with the oscilloscope, wiggling pins from within the hal components to see if they REALLY are slower or if the clock is doing funky things
06:56 AM rmu: i can get throttling to kick in wich "cpuburn-a53", but usually while running linuxcnc it is running at about 50°C
07:03 AM rmu: i think i will prepare my findings with some screenshots and open raspberry pi kernel issue or ask on the forum
07:19 AM rene_dev_: rmu does the rpi have a jtag interface?
07:19 AM Tom_itx: pretty sure that's how you program it
07:19 AM rene_dev_: try removing the memory barriers, and see what happens...
07:20 AM rene_dev_: no, its programmed by downloading stuff from the internet ;D
07:26 AM rene_dev_: hmm, looks like jtag tracing is not all that easy on a pi
07:58 AM mozmck: it seems that I can run the tests one at a time which failed and they pass. I got 15 that failed last time I tried and I had 15 segfaults in dmesg: halcmd[11249]: segfault at 7f0351f37000 ip 00007f03519053ab sp 00007ffd564e7ed0 error 4 in liblinuxcnchal.so.0
07:58 AM jepler: mozmck: what do you mean "one at a time"? linuxcnc's runtests runs the tests sequentially
07:59 AM mozmck: I also have a VM with ubuntu 18.04 (on this same machine) so I compiled master in that and I got only three failed because they could not find -lieee
07:59 AM mozmck: jepler: I ran the ones that failed like this: runtests tests/abs.0
07:59 AM mozmck: That runs only the named test
08:11 AM jepler: OK, when I read what you said I thought maybe you believed that "runtests" runs multiple tests concurrently. It doesn't.
08:12 AM jepler: It's possible you are seeing a problem that only manifests when your system is "under load", or only in a fraction of all runs... you could repeatedly run one test that you know fails in the hopes that it will eventually fail.
08:12 AM mozmck: Oh, no - I just went back a manually ran the tests that failed
08:12 AM jepler: this shell syntax will expand to "tests/abs.0" a bunch of times: runtests tests/abs.0{,,,}{,,,}{,,,}
08:13 AM mozmck: yep, failed 11 out of 64 runs
08:14 AM jepler: you are on linuxcnc master branch, or some other ref?
08:14 AM mozmck: master
08:14 AM mozmck: cleaned and rebuilt (several times now)
08:15 AM jepler: what OS?
08:15 AM mozmck: I'm running in linuxmint 17.3 (same basically as (x)ubuntu 14.04
08:15 AM jepler: uspace then?
08:16 AM mozmck: I've run 2.7 without issues for years on this OS, and just rebuilt and ran the runtests on it without any fails last night.
08:16 AM mozmck: yes, uspace - but not a preempt-rt kernel on this machine
08:16 AM jepler: and it's hardware you trust too?
08:17 AM jepler: > Runtest: 64 tests run, 64 successful, 0 failed + 0 expected
08:17 AM mozmck: Well, I have trusted it as my main devel machine for several years now - but anything will fail eventually
08:17 AM jepler: this is my debian 9 machine, uspace, no RT kernel
08:18 AM jepler: are the segfault 'ip' addresses consistent or different? (but they might be different due to ASLR too)
08:18 AM mozmck: I would expect that if I was having machine problems they would show up in other things - I run a lot of programs all the time.
08:18 AM jepler: I expect tests/abs.0 to be as reliable as anything
08:19 AM mozmck: ip addresses are all different (what is ASLR?)
08:19 AM jepler: address space layout randomization
08:19 AM mozmck: ah
08:19 AM jepler: are the last 3 characters in the ip consistent?
08:19 AM jepler: "3ab"
08:19 AM mozmck: yes
08:20 AM jepler: OK so that likely means it's the same code, just loaded at a different randomized address in each run
08:20 AM mozmck: I see. so how can I go about finding that code?
08:21 AM jepler: (ASLR is a mitigation that makes certain bugs harder to exploit, because the exploit can't depend on any of the program's code being at a fixed address)
08:21 AM mozmck: I've heard of it, but didn't recognize the acronym. Makes sense though.
08:21 AM jepler: ideally you'd be able to force your linux to give you a core dump file that you can do postmortem debugging on with gdb
08:22 AM jepler: but I .. don't even know how to do that these days
08:22 AM mozmck: Maybe I have a buggy library or something that is causing it.
08:22 AM mozmck: I'll see if I can find info on doing that.
08:23 AM jepler: you could disable ASLR, then the address should be the same each time
08:23 AM jepler: if that's the case, then you should be able to find out what function that specific address is inside using gdb on a non-crashed halcmd
08:23 AM mozmck: Any idea why halcompile on 18.04 might be giving the error: /usr/bin/x86_64-linux-gnu-ld: cannot find -lieee?
08:24 AM jepler: (For instance, start halcmd, find its pid, "gdb halcmd 12345", and then "x/i 0x00007f03519053ab"
08:25 AM jepler: I don't know why linuxcnc would be linking to -lieee but that library is provided by libc6-dev on my debian 9 system
08:27 AM jepler: by grepping the configure file I notice that -lieee ends up in TCL_LIBS because it is specified by tclConfig.sh but I wouldn't expect that to "get into" halcompile
08:27 AM mozmck: hah, so I did "echo 0 | sudo tee /proc/sys/kernel/randomize_va_space" thanks to askubuntu and now I have run the 64 tests 3 times without a failure.
08:28 AM jepler: well that's a stinker!
08:28 AM mozmck: Well, I have libc6-dev installed
08:28 AM mozmck: Yeah!
08:29 AM mozmck: maybe ASLR is buggy :-)
08:29 AM jepler: I have this on my system: kernel.randomize_va_space = 2
08:30 AM mozmck: Yeah, I think that is the default - full randomization
08:31 AM jepler: OK, turn it back on and try to get a core dump then?
08:32 AM jepler: at least you can provoke a crash every few minutes, I was reading a "go" bug in which it seemed the bug was best reproduced with a 48-core machine and reproduced about once a day if you were lucky :-/
08:32 AM mozmck: I'll try in a few minutes. I'm running the full runtests now - so far no fails...
08:32 AM jepler: afk for a bit
08:32 AM mozmck: thanks
08:35 AM rmu: rene_dev_: I'm sure there is a JTAG somewhere in this SOC, but i'm not aware of accessible JTAG pins
08:35 AM rmu: rene_dev_: what do you have in mind?
08:41 AM rene_dev_: rmu https://www.segger.com/products/debug-probes/j-trace/
08:41 AM rene_dev_: but looks like that is no use for the pi
08:45 AM rmu: i used to work with lauterbach stuff, but in the end, those traces never helped that much.
08:47 AM rene_dev_: it would help if you know whats going on while its wasting time :D
08:50 AM rmu: situation is worse in 1ms cycle time vs. 0.333ms
08:51 AM rene_dev_: trigger a gpio in the usb irq
08:51 AM rene_dev_: its probably not all that easy :D
08:52 AM rmu: i don't believe cache or memory pressure alone is the culprit, because PID calculation sometimes spike from about 600ns to 35000ns
08:52 AM sync: ah you can jtag a rpi
08:53 AM sync: not sure if you can do etm trace on them
08:54 AM rmu: you surely can't exfil the complete instruction trace of 4 cores running at 1.4ghz
08:55 AM rmu: plus the evil stuff probably happens in the videocore and that does something evil on the bus (just a suspicion)
08:56 AM sync: well it supports etm
08:56 AM rene_dev_: I would just start disabling stuff, like usb, video, ... to see when stuff changes
08:57 AM rene_dev_: or just not use a rpi^^
08:57 AM rmu: the SPI read routine is reading the spi data register and busy waiting until transfer completes, so it is a small tight loop, transfer size also is small-ish, i don't see why that takes between 60000 and 250000 ns
08:57 AM rmu: yeah, i'm going to dump it. it seems to work somehow, but i don't like this ugliness.
08:58 AM rmu: running over X or on console doesn't make a difference
09:00 AM mozmck: well interesting. The full runtests ran with 0 failures with ASLR disabled. After enabling again, on the first run of 64 abs.0 - 16 failed
09:02 AM rene_dev_: Im looking, but its hard to find a real datasheet for the cpu
09:04 AM rene_dev_: who maintains the rt patch for the pi? could well be that they missed some stuff
09:04 AM rene_dev_: because rt preempt changes a lot of irq and lock behaviour
09:09 AM rmu: i don't think /proc/interrupts lies about what core services what IRQ
09:10 AM mozmck: jepler: so I enabled core dumps with ulimit -c unlimited, and gdb tell me the segfault is this:
09:10 AM mozmck: #0 test_and_set_bit (nr=0, addr=0x7f105bf17000) at rtapi/rtapi_bitops.h:48
09:10 AM mozmck: 48 unsigned long oldval = __sync_fetch_and_or(laddr + loff, 1lu << boff);
09:12 AM rmu: rene_dev_: don't bother looking for rpi datasheets, even if you found some, they would be incomplete or wrong
09:14 AM rene_dev_: rmu do you need servo interface, or step direction?
09:31 AM rmu: servo interface
09:31 AM rmu: analog servo
09:34 AM rmu: so running with 3khz would be possible and improve the jitter in the execution times, BUT smart serial is not fast enough... so that would need a hm2_7i90 read/write funct that only looks at the serve resp. only the smart serial stuff
09:42 AM rmu: man page of hm2_rpspi says "It should be noted that the Rpi3 must have an adequate 5V power supply and the power should be properly decoupled right on the 40-pin I/O header. At high speeds and noise on the supply, there is the possibility of noise throwing off the SoC's PLL(s), resulting in
09:42 AM rmu: strange behaviour."
09:43 AM rene_dev_: thats why I prefer ethernet
09:44 AM rene_dev_: which kernel do you use?
09:46 AM rmu: 4.14.74-rt44-v7-rmu+
09:47 AM rmu: just the rt-preempt branch from official rpf sources
09:47 AM rmu: i can make kernel packages available if somebody is interested
09:48 AM rene_dev_: this one? https://github.com/raspberrypi/linux/tree/rpi-4.14.y-rt
09:48 AM rmu: yes
09:57 AM rene_dev_: Showing 472 changed files with 17,474 additions and 5,183 deletions.
10:06 AM rmu: AFAIK this is "just" the rt-preempt stuff rebased on the raspberry pi foundation kernel
10:06 AM rmu: or vice versa
11:26 AM mozmck: So I see that the 2.7 rtapi_bitops.h file implements these functions using some assembly, but master uses compiler atomic functions. Seems that the old functions worked with ASLR enabled (on my particular setup), but the new ones randomly fail unless ASLR is disabled.
11:29 AM seb_kuzminsky: mozmck: that's surprising to me
11:31 AM mozmck: I see that the __sync_ functions are deprecated as of gcc 4.8, but I wouldn't think that would matter in this way.
11:33 AM mozmck: my gcc is 4.8.4 and kernel is the ubuntu 4.4.0-138-generic
11:38 AM rene_dev_: seb_kuzminsky did you see my question yesterday?
11:38 AM seb_kuzminsky: err, probably not, i'll read back
11:38 AM rene_dev_: 02:13 <rene_dev_> seb_kuzminsky whats #5400 or #_current_tool on startup?
11:38 AM rene_dev_: 02:14 <rene_dev_> I know what it is, but Im looking for documentation backing that up.
11:38 AM seb_kuzminsky: oh, i see it
11:38 AM seb_kuzminsky: i don't know the answer to that question off the top of my head
11:39 AM rene_dev_: so the behaviout is that its -1, until you do the first m61 or toolchange
11:39 AM rene_dev_: which indicates that it doesnt know what tool is loaded
11:39 AM seb_kuzminsky: i'd guess it's T0 on nonrandom machines, and whatever the tool table file says is in P0 on random toolchangers
11:39 AM seb_kuzminsky: T-1? ok, i guess that makes sense
11:40 AM rene_dev_: ideally I would like to make 5400 presistent, so it reloads the old tool on startup
11:40 AM rene_dev_: because that is how hardware behaves :D
11:41 AM rene_dev_: but I dont know if that should go into 2.7...
11:41 AM JT-Shop: you could change tools while the machine is off on many machines
11:41 AM seb_kuzminsky: i agree with JT
11:41 AM rene_dev_: you can also do that while the machine is running
11:42 AM seb_kuzminsky: i agree with that too :-)
11:42 AM rene_dev_: the machine never knows for sure what is in the spindle/turret
11:42 AM rene_dev_: the question is what it should expect to be in the spindle/turret
11:43 AM rene_dev_: IMHO it should expect to have the last tool
11:43 AM rene_dev_: gmoccapy has that implemented in the UI, where it reloads the last tool on startup
11:43 AM seb_kuzminsky: i think the current behavior is fine, and i don't want to change it in 2.7. If you feel strongly that it should remember #5400 in master i will not object
11:43 AM rene_dev_: ok
11:43 AM seb_kuzminsky: doing it in the UI seems like a mistake to me
11:43 AM rene_dev_: maybe make it configurable
11:43 AM rene_dev_: yes
11:43 AM mozmck: What is the effect of it not remembering?
11:44 AM rene_dev_: it is optional, but it doesnt belong there
11:44 AM rene_dev_: when you turn on the machine, you need to tell it what tool is in
11:44 AM seb_kuzminsky: i'm opposed to making it configurable - we should pick the best option and stick with it. we have too many knobs already
11:44 AM rene_dev_: and thats annoying
11:44 AM JT-Shop: I think it's a bad plan to assume the same tool in the spindle between running linuxcnc
11:44 AM * JT-Shop goes back to chicken winterizing
11:44 AM seb_kuzminsky: it's sure safer to forget what tool's in the spindle and make the user remind us with m61 or m6
11:45 AM rene_dev_: why? on atc machines you usually cant change tools while the machine is off
11:45 AM mozmck: so what does it do if you don't tell it the tool? refuse to run? I have not used the tool table so far at all.
11:45 AM rene_dev_: that depents on your atc implementation
11:45 AM seb_kuzminsky: it runs, but it doesn't know what tlo and tool diameter are
11:46 AM rene_dev_: it should refuse to run, because it doesnt know if it needs to unload a tool, before picking up a new one
11:46 AM rene_dev_: both would crash the toolchanger
11:46 AM mozmck: ah, ok. It would seem that a warning or refusing to run would be good.
11:46 AM seb_kuzminsky: it could crash a nonrandom toolchanger, but not a random toolchanger (since that one tracks what's in the spindle)
11:47 AM seb_kuzminsky: i honestly don't have enough experience with nonrandom toolchangers to have a good intuition of how they should behave
11:47 AM rene_dev_: both should track that, because the hardware behaves exactly the same when you restart linuxcnc
11:47 AM jthornton: I can manually change tools on the atc then shut the machine down...
11:47 AM rene_dev_: you can always manually change tools, and not tell the machine. thats an operator error then
11:48 AM rene_dev_: if you want to crash, there is always a way
11:48 AM jthornton: it's a coder error to assume anything...
11:49 AM rene_dev_: im just thinking what the default is, and the default is that the tool remains in the spindle, unless you change it
11:49 AM rene_dev_: and when you change it, you have to tell the machine
11:50 AM rene_dev_: but when you dont change it, I dont feel the need to tell the machine
11:50 AM mozmck: That sounds reasonable.
11:51 AM mozmck: Either that or make it similar to the way homing works - if you power down you have to re-home before running code.
11:52 AM rene_dev_: at least #5400 and #_current_tool are the same now
11:52 AM mozmck: So tool could do the same - refuse to run code until you tell it what tool is in the spindle
11:52 AM rene_dev_: that would be very annoing for all the pepole that dont even use the tooltable, or dont have a toolchanger
11:54 AM mozmck: Yeah, refusing to run could have an ini override or something like homing does.
11:55 AM rene_dev_: yes, my idea was a ini setting which tool to load on startup. last, 0, or -1
11:57 AM mozmck: I'm thinking of something like NO_FORCE_HOMING = 1
11:58 AM rene_dev_: the assumption of not needing to track the pindle tool on non-random changers is just wrong
11:58 AM mozmck: So NO_FORCE_TOOL = 1 would mean it would run even if no tool is defined/selected - otherwise it would refuse to run code.
11:58 AM rene_dev_: because you do need to track, because you need to know if and where to put the tool when you pick up a new one
11:59 AM rmu: +1 for persisting the tool in the spindle
12:00 PM rmu: it would be most annoying in a ATC if you have a tool loaded and the machine doesn't know which one it is
12:00 PM rene_dev_: I have observed pepole using machines I retrofitted
12:00 PM mozmck: rene_dev_: have you had a chance to look at the reverse-run issue again?
12:02 PM mozmck: bbl - rebooting
12:02 PM JT-Shop: anytime you force machine specific configuration options you will be wrong at least half the time... things like that should be up to the system integrator to configure
12:04 PM rmu: ini setting that specifies last, specific tool, or empty seems sensible
12:05 PM rene_dev_: mozmck rob promised to look at it, but he didnt yet. I think I know how to fix it, and will try again this week.
12:05 PM mozmck: rene_dev_: ok
12:05 PM rene_dev_: yes, its always integrator specific
12:06 PM rene_dev_: but my observation is, that usually if you tell pepole that they need to tell the machine which tool is in the spindle
12:06 PM rene_dev_: the answer will be that they dont know :D
12:06 PM rene_dev_: because they forgot to label them
12:06 PM jepler: mozmck: can you use your core dump and generate full backtrace? Might need to pastebin it, it could be long.
12:06 PM rene_dev_: and if they get it wrong, you will crash the toolchanger
12:06 PM jepler: you can do that with the "bt" command in gdb
12:07 PM mozmck: jepler: I'll try...
12:08 PM jepler: you could probably substitute just the 2.7 code for the atomic operation and see if that fixes anything; I think it's unlikely, but this whole thing feels unlikely.
12:09 PM jepler: or you could 'git bisect' between 2.7 and master, but that would be a lengthy process
12:09 PM rene_dev_: they are tools to automate that
12:09 PM mozmck: https://pastebin.com/004FQ29T
12:23 PM jepler: thanks
12:24 PM jepler: so this is occurring when halcmd reinitializes itself after spawning a sub-program (for 'loadrt' in this case)
12:25 PM jepler: it looks like a hal file with a long string of 'loadusr -w true' should be able to trigger it more reliably
12:25 PM jepler: maybe repeat it 1000x
12:26 PM jepler: each time rtapi_init is called, it has to choose a new "unique identifier" which is just an integer that counts up forever
12:27 PM jepler: but in the case of the crash, the shared memory segment that has just been supposedly mapped by rtapi_shmem_getptr, is not valid
12:28 PM jepler: printf("note [%5d] uuid_mem = %p\n", getpid(), uuid_mem);
12:28 PM jepler: if (uuid_shmem_base == 0) {
12:29 PM jepler: on my system, the same address is printed each time, at least within the same process, because the first mapping made is retained even if rtapi_exit() is called
12:29 PM jepler: errr
12:29 PM jepler: maybe that's the change actually...!
12:29 PM jepler: in 2.7 rtapi_exit does nothing, in master it deletes the segment
12:30 PM jepler: commit 7c0c274d93b4fa5e24ce53f1569eae02d01c2721
12:30 PM jepler: uspace: delete the "uuid"(sic) shared memory at exit
12:30 PM jepler: Reported-By: Edward Tomasz Napierala <trasz@FreeBSD.org>
12:31 PM jepler: maybe try reverting this
12:31 PM jepler: I don't know what problem it was supposed to address :-/ that's all the notes I Have
12:31 PM jepler: afk again
12:45 PM mozmck: jepler: I tried reverting that commit and it still has the failures here
02:57 PM andypugh: jepler: Did you get the latest Laundry file?
02:58 PM andypugh: I think I know Squadron Leader Bradshaw.
02:58 PM andypugh: https://www.linkedin.com/in/simonbradshaw/
02:59 PM andypugh: (Given that both myself and the chap in question were at Stross’s birthday party in 2014 it can’t be coincidence.
04:30 PM jepler: andypugh: I've gotten a little ways into it. Can't use up a good book like that all at once
04:30 PM jepler: man I want to go to Stross's birthday party
04:30 PM jepler: mozmck: dang I really wish I could reproduce this failure locally! Yours is a 64 bit system, right? (mine is)
04:41 PM andypugh: I got a bit lucky. It was at Worldcon and Stross has exactly the same birthday as a friend of mine, and he knows Stross, so they held a joint 50th party, and I was one of Cosmic’s mates.
04:41 PM jepler: indeed
04:44 PM andypugh: Squadron Leader Bradshaw has written some SF himself, one is pretty good “First to the Moon” joint with Steven Baxter. But I can’t currently find the text online
04:45 PM andypugh: (Alternate history, no WW1, british moon shot in wood and bakelite rocket ends in partial disaster)
04:46 PM andypugh: Basically the British Interplanetary Society moon expedition design from 1936
04:49 PM andypugh: (Incidentally, did you look at the Linkedin profile, I think his hobby is collecting degrees, he has three Masters degrees and a couple of others on top)
06:01 PM rene_dev_: andypugh Dont know what you are talking about, but I know a guy called Bradshaw that writes and publishes books: http://www.jrbpub.net/books/
06:29 PM andypugh: Different one.
06:30 PM andypugh: The latest Charles Stross book (SF) features a character fairly clearly named after one of my friends from college.
10:44 PM Tom_itx is now known as Tom_L