#linuxcnc-devel | Logs for 2015-08-06

Back
[06:43:23] <carper> thinking of building a cnc cam/spindle grinder at some point in the next 12 months about the size of a 9 x 36 lathe using a slant bed with hybrid steppers/encoders with an autoprobing to. can anyone see any problems with the idea? i just dont want to start the bad practice of using a machine post grider on my cnc lathe.
[06:43:53] <carper> https://www.sherlinedirect.com/index.cfm?fuseaction=product.display&Product_ID=1032
[06:44:31] <carper> grinder
[06:51:39] <carper> upps wrong room
[07:12:27] <KGB-linuxcnc> 03John Thornton 052.7 c0d55e5 06linuxcnc 10docs/src/Master_Documentation.txt 10docs/src/Master_Getting_Started.txt 10docs/src/index.tmpl Docs: make the html and pdf docs include all chapters * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=c0d55e5
[07:12:27] <KGB-linuxcnc> 03John Thornton 052.7 ad3328b 06linuxcnc 10(75 files in 9 dirs) Docs: move files from the common directory that are no longer common. * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=ad3328b
[07:12:27] <KGB-linuxcnc> 03John Thornton 052.7 d11a828 06linuxcnc 10docs/src/gui/axis.txt Docs: add info on axisui pins * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=d11a828
[07:35:12] <linuxcnc-build> build #1666 of 1404.rip-wheezy-rtpreempt-amd64 is complete: Failure [4failed compile runtests] Build details are at http://buildbot.linuxcnc.org/buildbot/builders/1404.rip-wheezy-rtpreempt-amd64/builds/1666 blamelist: John Thornton <bjt128@gmail.com>
[07:49:33] <linuxcnc-build> build #3325 of 0000.checkin is complete: Failure [4failed] Build details are at http://buildbot.linuxcnc.org/buildbot/builders/0000.checkin/builds/3325 blamelist: John Thornton <bjt128@gmail.com>
[07:58:47] <skunkworks> That would be really cool to be able to select the reatlime at boot time on the livecd.
[08:45:16] <skunkworks> building (make and make setuid) seems like it would take a boot a long time..
[09:40:16] <cradek> yeah getting the multiple kernels on the cd is easy. getting the corresponding linuxcnc installed is hard
[09:41:07] <cradek> I think making it easy to switch between the versions, just using apt in a normal way, is the most important goal
[09:54:19] <seb_kuzminsky> wow, the halui/mdi test fails often now
[09:54:46] <jepler> nobody's touching that stuff, why would it get worse?
[09:55:07] <seb_kuzminsky> last one touching it was... me, early july
[09:55:52] <seb_kuzminsky> i was trying to fix occasional failures we were having, guess it didnt work
[09:59:41] <seb_kuzminsky> there's a bad behavior in halui in several places, where it tries to do something, and if it fails it doesn't print a message about it
[10:01:13] <seb_kuzminsky> i added some error reporting (logging) to halui about the same time, but i guess it's not any of the things that i added checks for
[10:05:08] <seb_kuzminsky> oh wait
[10:05:15] <seb_kuzminsky> halui *did* say what was wrong
[10:05:23] <seb_kuzminsky> it's one of the new checks i added
[10:05:29] <cradek> ooh
[10:05:49] <seb_kuzminsky> but the output wasn't flushed, so it came after all the other output, down at the bottom after the test failed
[10:06:00] <seb_kuzminsky> halui: emcCommandSend: no echo from Task after 2.000 seconds
[10:06:00] <seb_kuzminsky> halui: sendMdiCommand: failed to Set Mode MDI
[10:07:00] <seb_kuzminsky> our logging sure is wonky
[10:15:01] <jepler> how frequently does that test fail? 40 iterations here and no failures yet
[10:15:06] <jepler> wheezy, vanilla
[10:15:18] <cradek> 2 seconds is a long time
[10:16:01] <seb_kuzminsky> maybe my VMs suck? maybe the hypervisor host was doing something? idfk
[10:17:23] <seb_kuzminsky> it's happened several times, usually on rtpreempt, sometimes on vanilla, never on rtai
[10:17:37] <jepler> failed on iteration 67 here
[10:17:55] <seb_kuzminsky> i've got 5 instances in my "unexplained buildbot failures" file, but i probably didnt record all of them
[10:17:56] <jepler> halui: emcCommandSend: no echo from Task after 2.000 seconds
[10:17:57] <jepler> halui: old_echo_serial_number=6 emcCommandSerialNumber=7 new_echo_serial_number=6
[10:17:58] <seb_kuzminsky> ooh, yay, repro!
[10:17:59] <jepler> halui: sendMdiCommand: failed to Set Mode MDI
[10:18:06] <jepler> so it's not your VMs
[10:18:22] <seb_kuzminsky> there should be another "halui: " line saying more, nearby?
[10:18:41] <seb_kuzminsky> bbl work :-/
[10:19:03] <jepler> http://paste.ubuntu.com/12014092/ ["result" file]
[10:45:47] <jepler> added more debugging but haven't hit the problem in the next 200 iterations
[11:01:51] <jepler> .. 400
[11:01:52] <jepler> groan
[11:04:56] <skunkworks> taking out the debugging?
[11:14:50] <pcw_home> This seems like a big latency win:
[11:14:52] <pcw_home> [PATCH RT 2/7] mm/slub: move slab initialization into irq enabled region
[11:44:05] <jepler> .. 900 :-/
[11:54:48] <jepler> .. 993 and it failed
[11:55:30] <jepler> too bad I don't really trust how the various output is interleaved
[11:55:43] <jepler> emcTaskPlan() new serial number 6 -> 7
[11:55:46] <jepler> update echo_serial_number 6 -> 7
[11:55:54] <jepler> halui: emcCommandSend: no echo from Task after 2.000 seconds
[11:55:54] <jepler> halui: old_echo_serial_number=6 emcCommandSerialNumber=7 new_echo_serial_number=6
[12:14:25] <jepler> // since emcStatus was passed to the WM init functions, it
[12:14:25] <jepler> // will be updated in the _update() functions above. There's
[12:14:25] <jepler> // no need to call the individual functions on all WM items.
[12:14:25] <jepler> int result = emcStatusBuffer->write(emcStatus);
[12:14:28] <jepler> what is WM ?
[12:23:33] <jepler> not sure if this is significant or not...
[12:23:40] <jepler> in a failed run the first "setting mode to Manual" comes at ~2s
[12:23:41] <jepler> 2.089: setting mode to Manual
[12:23:53] <jepler> in a random good run it's earlier
[12:23:53] <jepler> 0.386: setting mode to Manual
[12:24:26] <jepler> 0.396: setting mode to Manual
[12:25:39] <jepler> another failed run: 2.704: setting mode to Manual
[12:28:25] <skunkworks> does your skin crawl looking at that code?
[12:28:41] <jepler> veteran programmers have already had all their skin flayed off...
[12:28:55] <jepler> speaking of which, lunchtime!
[12:28:57] <skunkworks> heh
[13:27:55] <jepler> boh and I think I missed pasting this log http://paste.ubuntu.com/12014583/
[13:35:24] <jepler> 9.700: setting mode to Manual
[13:35:35] <jepler> the next failed run didn't do its first mode setting until 9.7s!
[13:36:38] <jepler> http://paste.ubuntu.com/12015132/
[13:42:53] <archivist> I like to use spare nixies to repair old test gear :)
[13:44:48] <cradek> I've never seen a single thing that uses the b5971 tubes. probably why they are so rare.
[13:45:57] <archivist> I fired up a HP 3591A a few days ago t to test the probe I got with LVDT mesurement
[13:49:28] <archivist> nixies seem ok but the counter part is duff, need to make an extender and scope it
[13:49:58] <cradek> nixies don't fail, they only eventually darken
[13:50:16] <cradek> and the burroughs long life don't really ever darken that I've seen
[13:52:07] <archivist> I have seen inter electrode faults
[13:52:34] <cradek> huh - manufacturing defects, or shock, or something else you think?
[13:53:14] <archivist> probably the seal failing and gas contamination
[13:55:55] <jepler> 1438885300 halui: old_echo_serial_number=3 emcCommandSerialNumber=4 new_echo_serial_number=3
[13:55:58] <jepler> 1438885301 update echo_serial_number 3 -> 4
[13:56:21] <jepler> so in this run I can see that halui gave up waiting on the serial number to change before it was changed by task
[13:57:17] <jepler> it looks like *something* happens, and it may be system load, that makes task sluggish
[13:57:43] <jepler> consistently I see the first 'setting mode to Manual' be long in the test that fails and short in other tests, >1.5s vs <.5s
[14:01:16] <jepler> 1438885301 emcTaskPlan() new serial number 3 -> 4
[14:01:27] <jepler> actually the command wasn't even received by task until after halui gave up waiting on it
[14:08:43] <cradek> yay, something makes it slow is a much better failure to see than a deadlock
[14:10:36] <jepler> http://paste.ubuntu.com/12015373/
[14:11:02] <jepler> well this one breaks the pattern, first "setting mode" at 0.347
[14:11:28] <jepler> 1438886792 emcTaskPlan() new serial number 6 -> 7
[14:11:30] <jepler> 1438886794 halui: old_echo_serial_number=6 emcCommandSerialNumber=7 new_echo_serial_number=6
[14:11:34] <jepler> 1438886801 update echo_serial_number 6 -> 7
[14:12:47] <cradek> have you figured out why the state of the world (serial numbers) is different from what we expect (we have queues now instead)?
[14:15:17] <jepler> c4c7fb9d57143251eb3907886833a61bca035875 which is my version of Sascha's patch for SF#328 implements emcCommandSend with "wait for received" semantics
[14:29:18] <jepler> 27757 1438887711.586678 rename("sim.var", "sim.var.bak" <unfinished ...>
[14:29:18] <jepler> 27757 1438887713.991722 <... rename resumed> ) = 0
[14:29:20] <jepler> boom
[14:29:36] <jepler> so yes, it's a failure that happens in high disk latency situations
[14:29:41] <cradek> wow
[14:29:53] <cradek> 2.4 seconds
[14:30:58] <cradek> does task do any disk IO other than at startup and shutdown?
[14:31:40] <cradek> oh sure it does, it reads gcode if nothing else
[14:31:41] <jepler> it reads part programs and can write stuff like probe logs directed by gcode
[14:31:44] <cradek> duh
[14:32:48] <jepler> it opens halui.ini a surprising number of times, but all at atartup
[14:33:46] <seb_kuzminsky> task should probably use something like uspace_rtapi_app's harden_rt()
[14:34:17] <jepler> that won't help disk I/O not go out to lunch
[14:34:59] <seb_kuzminsky> and harden_rt() should probably call sched_setscheduler(SCHED_RR) and sched_setparam() to raise its priority
[14:35:17] <seb_kuzminsky> but yes, that probably(?) won't help (much?) with disk latency
[14:35:41] <jepler> particularly not in a VM
[14:36:14] <skunkworks> random thought - could that be the issue with gmoccopy? did he add some extra disk access?
[14:36:35] <cradek> what issue?
[14:37:02] <jepler> the "ui stops responding for a few seconds" bug I think
[14:37:04] <seb_kuzminsky> the thing rick lair reported, where latency went way up
[14:37:12] <skunkworks> yes - rick
[14:37:23] <jepler> I forget what operation triggers the latency
[14:37:23] <cradek> oh the halui button was slow to respond
[14:37:28] <skunkworks> yes
[14:37:49] <cradek> he bisected that and found the breakage but I haven't seen any progress on it
[14:38:12] <jepler> what operation does he do in halui to trigger the unresponsiveness?
[14:38:16] <jepler> was it a mode switch?
[14:38:23] <cradek> no
[14:38:32] <cradek> I think it's pause and resume while running gcode
[14:38:34] <seb_kuzminsky> it was an interp pause, iirc
[14:38:36] <seb_kuzminsky> yeah
[14:38:38] <seb_kuzminsky> bug #427
[14:38:57] <seb_kuzminsky> http://sourceforge.net/p/emc/bugs/427/
[14:39:49] <seb_kuzminsky> we should probably run all our tasks with SCHED_RR, and Task should have the highest priority
[14:40:17] <jepler> pause/resume/step don't do anything in task that makes strace -e file report any syscalls
[14:40:31] <skunkworks> halui pause/resume?
[14:40:33] <jepler> abort and machine on/off/estop all do
[14:40:38] <jepler> skunkworks: no, axis pause/resume
[14:40:50] <skunkworks> he specifically is using halui
[14:40:57] <jepler> for this particular problem (task takes a long time calling rename()) it should not be UI dependent
[14:41:01] <jepler> -> rick's problem is not this problem
[14:41:06] <skunkworks> ok
[14:42:00] <jepler> well it is possible that his halui is sending commands that are different than what aixs pause/resume/step send though
[14:43:08] <cradek> I think gmoccapy might control halui somehow
[14:43:16] <cradek> I suspect something is giving him a flood of messages
[14:43:20] <jepler> I don't know why you'd design your software that way
[14:44:01] <jepler> anyway, that rename() looks like a smoking gun, and I've also shown to my own satisfaction that it happens a lot more readily while the system is doing a lot of I/O
[14:44:08] <jepler> I don't know what to propose doing about it however
[14:44:45] <jepler> preview-generating UIs need to have a way to know the var file is up to date, and knowing that the mode change command has been acted on is that way..
[14:45:03] <seb_kuzminsky> it's valuable to have a reliable backup of the var file, like that rename gives us
[14:45:09] <jepler> a mode change is normally instant, so task isn't ack'ing it via echo_serial_number before completing it
[14:45:54] <cradek> well having a wall-clock based timeout is the bug, right?
[14:46:00] <jepler> well
[14:46:20] <seb_kuzminsky> cradek: yeas, but...
[14:47:08] <cradek> there used to be a heartbeat. wonder where that was. if it was something that task twiddles when it's bored otherwise, that would be a timeout that makes sense to use.
[14:47:43] <cradek> xemc used to show it
[14:47:48] <cradek> I don't know from where
[14:48:13] <seb_kuzminsky> that would be better than just wallclock time, but i think you still need a wallclock timeout too
[14:48:29] <seb_kuzminsky> one of the things task could do wrong is not twiddle the heardbeat counter, and the test should time out if that happens
[14:48:35] <cradek> emcTaskUpdate increments heartbeat
[14:49:15] <cradek> what's the harm in just making the wall-clock timeout much much longer?
[14:49:31] <cradek> like minutes
[14:49:44] <seb_kuzminsky> that's probably fine
[14:49:46] <cradek> it will still tell us about any real deadlock
[14:49:55] <jepler> if task is unresponsive for a minute that's bad, because you can't software-estop
[14:49:59] <seb_kuzminsky> we'd like to know if Task goes out to lunch for seconds and seconds
[14:50:03] <seb_kuzminsky> yeah
[14:50:05] <jepler> you might even think it's bad if task is unresponseive for a second
[14:50:11] <seb_kuzminsky> or 10 ms
[14:50:13] <cradek> well it does
[14:50:23] <cradek> because it does file IO
[14:50:27] <cradek> ok, question answered
[14:51:21] <skunkworks> Task is unresponsive. Click ok to estop..
[14:51:33] <seb_kuzminsky> *pull the plug to estop
[14:51:38] <cradek> and even if we fix that somehow, the gui can still do it, and you still can't software anything-stop
[14:52:03] <jepler> "task was unresponsive for 60s. don't you wish you had a physical estop chain?"
[14:52:09] <jepler> [OK] [Cancel]
[14:52:09] <skunkworks> heh
[14:52:14] <cradek> no kidding
[14:52:15] <seb_kuzminsky> the further up from the core we can push the badness, the better i'll feel
[14:52:58] <seb_kuzminsky> specifically i think it'd be really neat if Task and IO were reliable, and any badness generally came from bad GUIs
[14:52:59] <cradek> well a real machine tool probably won't have buildbot or VMs running on it
[14:53:16] <cradek> yes, that would be nice
[14:53:20] <jepler> making task realtime would be a long term goal
[14:53:39] <jepler> of course someday we'll just rewrite task from a blank screen and it'll be good this time
[14:53:43] <cradek> we've already moved some of it into motion, so it is
[14:53:45] <skunkworks> do we know if anyone has had this issue on real computer hardware?
[14:54:21] <skunkworks> *non vm
[14:54:26] <seb_kuzminsky> i dont know of any in-the-wild problems from this particular thing, but the general thing of "much of linuxcnc runs in non-realtime" *might* be what's biting rick lair
[14:55:58] <skunkworks> well - rick lairs problem seems to be something that changed in gmoccopy
[14:56:19] <skunkworks> changing from responsive to not responsive
[14:56:26] <skunkworks> or some definition of
[14:58:07] <skunkworks> well then - back to the o-word issue. :)
[15:04:08] <seb_kuzminsky> why is task updating the var file in this test? it's not changing any of the persisted parameters is it?
[15:04:27] <cradek> it rewrites it when you sync (part of switching modes)
[15:04:39] <jepler> seb_kuzminsky: task isn't smart enough to know whether the var file contents would be different if written
[15:04:54] * seb_kuzminsky imagines a dirty flag on each parameter
[15:05:16] <cradek> that doesn't fix any fundamental problem
[15:05:21] <seb_kuzminsky> yeah
[15:13:07] <seb_kuzminsky> rename(2) causes sync() on ext3 fs, which i bet is the slow part
[15:13:35] <seb_kuzminsky> i bet it'd be much faster to treat the var file as a log and just append changed values as needed
[15:14:11] <seb_kuzminsky> at startup task would read through the log, updating on each line, then maybe writing out a condensed vars file with just one line per parameter
[15:14:16] <jepler> so now the var file grows forever?
[15:14:23] <jepler> per session?
[15:14:34] <seb_kuzminsky> the file would grow in an unbounded but slow way until task is restarted
[15:15:15] <seb_kuzminsky> it's 14 bytes per line
[15:16:07] <jepler> you'd have to check how existing programs use the var file
[15:16:16] <seb_kuzminsky> is it more than just task?
[15:16:22] <cradek> yep
[15:16:26] <cradek> that's why it's rewritten
[15:16:27] <jepler> for instance, if they do it by reading until hitting the *first* instance of a variable number, your idea is sunk
[15:16:42] <cradek> if it was just task, why bother rewriting it during runtime
[15:16:54] <jepler> axis implicitly reads the var file via rs274 so it can show the correct offsets in preview
[15:17:08] <jepler> librs274 that is
[15:17:12] <seb_kuzminsky> cradek: to persist the params asap, in case it crashes?
[15:17:34] <cradek> ok, that too
[15:17:39] <seb_kuzminsky> jepler: hrm, so the vars file is yet another wonky ipc mechanism we have, i didn't realize that
[15:17:50] <jepler> seb_kuzminsky: most of the time we manage to forget it
[15:18:03] <seb_kuzminsky> maybe a big array of floats in shm would be better
[15:18:30] <jepler> still breaks existing code
[15:18:33] <seb_kuzminsky> sure
[15:18:49] <jepler> if the var file were arranged right you could seek(offset); write(new value of var)
[15:19:03] <seb_kuzminsky> ... and we're kind of trying to get away from shm between task & the uis, and moving towards networking
[15:19:16] <seb_kuzminsky> jepler: not if you're using ascii encoding like we are now
[15:19:16] <jepler> e.g., if every line was known to be 32 bytes long or whatever would let you print any floating point value with full accuracy and be a nice power of 2
[15:19:27] <seb_kuzminsky> or you'd have to reserve space on each line
[15:19:29] <seb_kuzminsky> right
[15:19:57] <seb_kuzminsky> that doesn't give you the backup we have now, which is nice to have
[15:19:57] <cradek> remember the var file is also a special list of keys, not just a key/value store
[15:20:04] <jepler> I've spent enough years yelling at people for writing the var file, though, that I know other programs probably do
[15:20:19] <jepler> there used to be a GUI just for writing that file
[15:20:27] <jepler> instead of MDI #1234=567.89
[15:20:33] <cradek> yes, I hope it's gone now
[15:20:54] <cradek> either way, I bet others have reinvented it
[15:21:05] <cradek> and the units-related bugs that go with that bad idea
[15:21:12] <jepler> grnrnnnarmg
[15:21:24] <jepler> yeah and a similar thing for programs writing the tool table
[15:21:40] <jepler> but we don't offer an alternative for *reading* the var file; you can't MDI that
[15:21:40] <cradek> yeah, I suppose so
[15:21:44] <jepler> like you can for setting vars and tools
[15:24:51] <skunkworks> (DEBUG [#8]) works in mdi... ;)
[15:25:05] <jepler> "%6d % 24.17g\n" fits in 32 bytes and should allow any FP number to be printed with full precision
[15:25:13] <jepler> skunkworks: no, because only one UI out of all UIs can receive a debug message
[15:25:21] <skunkworks> ah
[15:25:35] <jepler> so if you run badideaui + worseideaui, and one of them does (DEBUG #8), no guarantee that one sees the message
[15:25:38] <skunkworks> you mean the best ui?
[15:26:51] <jepler> one of those two is probably the best ui out of those two
[15:49:18] <seb_kuzminsky> sounds like some changes went into linux 3.1 to make ext3 slower but safer, specifically relating to fsync (which rename uses)
[15:49:47] <seb_kuzminsky> maybe that's why this shows up on our wheezy machines (3.2 for uspace and 3.4 for rtai) and not on lucid
[15:50:02] <seb_kuzminsky> hmm, no, we should see it on precise too, and i dont think we do (or maybe i just didnt notice)
[15:50:34] <seb_kuzminsky> http://kernelnewbies.org/Linux_3.1#head-7536ebf800edb3e888431812b283b63ed2b2b7ea
[15:52:15] <cradek> interesting
[15:54:41] <mozmck> doesn't precise use ext4 by default?
[15:54:48] <seb_kuzminsky> i'll check
[15:55:12] <jepler> this system seems to be using ext4 on kernel 3.2.
[15:55:15] <seb_kuzminsky> yep, you're right, precise uses ext4
[15:55:19] <seb_kuzminsky> *our precise VMs
[15:55:42] <seb_kuzminsky> hm, my wheezy machine uses ext4 too
[15:55:55] <seb_kuzminsky> (which already had the slow, safe behavior prior to 3.1)
[15:57:13] <mozmck> Looks like ext4 was default for lucid as well: https://wiki.ubuntu.com/LucidLynx/ReleaseNotes
[15:57:27] <seb_kuzminsky> our wheezy VMs use ext4 with barriers enabled
[15:58:06] <jepler> /dev/mapper/install14-home /home ext4 rw,relatime,user_xattr,barrier=1,data=ordered 0 0
[15:58:10] <seb_kuzminsky> heh yep, our lucid vms use ext4 too
[15:58:16] <seb_kuzminsky> ok, never mind, red herring
[16:29:59] <Patang> Anyone here know how (if possible) to assign a value to a gmoccapy-variable from a gcode subroutine?
[16:30:21] <cradek> I don't know what a gmoccapy-variable is
[16:32:11] <Patang> well, a glade-variable might be more correct....
[16:34:07] <cradek> what's it displaying?
[16:37:48] <Patang> Its a just a setting for blockheight, ment to be entered manually in GUI, but Im making a routine using 2 toolsensors to get it automatically measured
[16:39:26] <cradek> it sounds like you need to use the remap facility to customize something, instead of trying to bend a preconceived hardcoded setup
[16:40:12] <Patang> basically Im just wondering if there is a variant of "gmoccapy.blockheight = 24" available in gcode subroutines
[16:40:50] <cradek> there's no way to set anything from gcode unless it's a hal pin
[16:41:06] <Patang> ok, thanks
[16:41:46] <andypugh> Patang: Open a terminal and type “halcmd chow pin gm*” and see if there is anything likely
[16:41:59] <andypugh> No, don’t type that. The computer will act puzzled
[16:42:13] <andypugh> halcmd show pin gm*
[16:43:14] <seb_kuzminsky> chow
[16:43:41] <Patang> I thought hal-pins were read-only in gcode
[16:44:47] <Patang> but then Im very new at this
[16:46:47] <Patang> ref http://linuxcnc.org/docs/html/remap/structure.html#_named_parameters_and_hal_items_a_id_remap_referto_hal_items_a
[16:50:15] <andypugh> It’s possible to read them
[16:50:40] <andypugh> But you can also output on an “analog output” that is netted to a pin in HAL.
[16:51:38] <Patang> hm, ok. Is that M68 then?
[16:58:24] <andypugh> I forget, but it’s in the 60’s
[17:00:07] <Patang> thanks, Ill work something out with this
[17:27:11] <seb_kuzminsky> the vars file gets written sequentially, but there seems to be no way to communicate to the readers that it's complete. for example, we don't write to a temp file and atomically rename it when we're finished writing
[17:27:47] <seb_kuzminsky> so readers race the writer, and may reach eof before the writer (task) is done writing, and erroneously not see the vars that task hasn't written yet
[17:31:18] <mozmck> erroneously?
[17:34:51] <seb_kuzminsky> yeah that seems like a bug
[17:36:58] <mozmck> Yeah, I was being a little nit-picky - perhaps erroneously :) I think the error is not that the readers don't see the vars that haven't been written yet, but that there is no blocking mechanism to keep them from reading while the file is being written.
[17:37:17] <seb_kuzminsky> ah, yeah
[17:37:46] <seb_kuzminsky> i'd argue that the error is that we try to use rewriting of files as an IPC mechanism in the first place :-/
[17:37:57] <mozmck> I know a lot of things use a lockfile, I wonder if that would work here?
[17:38:08] <mozmck> hah! probably so.
[17:38:42] <seb_kuzminsky> we could use flock(2) to arbitrate access to the vars file
[17:38:57] <seb_kuzminsky> that might introduce even worse latencies in task than what we have now
[17:39:02] <mozmck> heh
[17:39:20] <mozmck> what is using vars for IPC?
[17:39:34] <seb_kuzminsky> a poorly written gui could flock the vars file indefinitely, preventing task from acquiring the lock needed to re-write it
[17:40:15] <seb_kuzminsky> task writes the vars file, and axis reads it to learn things like coordinate system offsets
[17:41:47] <seb_kuzminsky> we have the current G5x coordinate system offset in EMC_TASK_STAT
[17:42:00] <seb_kuzminsky> and the G92 offset
[17:42:17] <mozmck> So I wonder if it would be better for the gui's to ask task for the settings, instead of reading the file directly?
[17:42:18] <seb_kuzminsky> i wunner what else is needed
[17:42:40] <seb_kuzminsky> the guis already read the task stat buffer, so they have those fields
[17:43:02] <mozmck> but they read the file directly?
[17:43:04] <seb_kuzminsky> there are params that get written to the vars file that do not appear in the task stat buffer, maybe some of those are needed by some of the guis?
[17:43:10] <mozmck> oh
[17:43:16] <seb_kuzminsky> the guis read the file directly, says jepler
[17:44:36] <mozmck> I would think having a single reader and writer would seem be ideal, then you just query the reader/writer - but I don't know what all is involved.
[17:45:18] <seb_kuzminsky> in the old-school shared-memory way, the params should all live in a giant array of floats, with task having a read/write mapping to it and everyone else who cares having a read-only mapping to it
[17:45:50] <seb_kuzminsky> in the new-school networky way, guis should connect to task and subscribe to the params they care about, and task should send them updates when those params change
[17:45:53] <mozmck> Does the OS do read/write controls?
[17:46:03] <seb_kuzminsky> on shared memory?
[17:46:05] <mozmck> yes
[17:46:27] <andypugh> Does #define lead to textual substitution in strings?
[17:46:29] <seb_kuzminsky> yeah, you'd use mmap(2), which lets you request read-only or read-write mappings
[17:47:02] <seb_kuzminsky> andypugh: inside quoted strings, cpp does *not* do macro expansion
[17:47:06] <seb_kuzminsky> but you can do this:
[17:47:13] <seb_kuzminsky> #define NAME "Seb"
[17:47:26] <seb_kuzminsky> char *str = "My name is " NAME;
[17:47:46] <mozmck> ok, I haven't done any of that. I would do it the networky way :) I do that kind of thing in totally non-networked programs.
[17:47:54] <andypugh> Ah, that explains why http://git.linuxcnc.org/gitweb?p=linuxcnc.git;a=blob;f=src/emc/rs274ngc/interp_internal.hh;h=34a6c9ef961036e8956270c2e29c26d39b4c6289;hb=HEAD#l160 does not mess up http://git.linuxcnc.org/gitweb?p=linuxcnc.git;a=blob;f=src/emc/rs274ngc/interp_o_word.cc;h=084631e898b3ce00a9e5df5316b636fbe8bed1a8;hb=HEAD#l132
[17:50:23] <seb_kuzminsky> andypugh: one gripe i have with C is that you can't easily render the names of enums or #define macros as strings, makes for gross error-prone hacks like that
[17:51:26] <skunkworks> andypugh: I am cheering you on...
[17:51:35] <andypugh> I think I prefer “case 1: //O - IF”
[17:52:21] <seb_kuzminsky> i dont see where axis reads the parameter file
[17:53:33] <seb_kuzminsky> axis and gremlin both make a copy of the parameter file (?) but dont open it
[17:56:19] <seb_kuzminsky> and gremlin uses the g5x_offset field of the stat struct, just like you'd expect, rather than try to read it from a flaky file, like a crazy person would do
[17:59:31] <andypugh> I am finding this hard, I don’t really do C++, so stuff like this baffles me
[17:59:32] <andypugh> typedef std::map<const char *, offset, nocase_cmp>::iterator offset_map_iterator;
[18:00:22] <mozmck> I like C++ :) the new C++11 has some very nice improvements though that can make things more readable.
[18:00:47] <seb_kuzminsky> andypugh: me too
[18:01:31] <mozmck> They are typedef'ing that so when they want an iterator of type std::map<const char*, offset, nocase_cmp> later, they can just say offset_map_iterator instead.
[18:02:31] <andypugh> Typedef I get
[18:02:52] <andypugh> Its std::map<…. that I can’t parse
[18:03:05] <mozmck> yeah, it's a template class, which can be hairy
[18:03:33] <mozmck> a map is a key->value mapping container
[18:03:59] <seb_kuzminsky> i dont think anyone reads the var file but task
[18:04:21] <seb_kuzminsky> bbl
[18:04:21] <andypugh> I didn’t know C++ was so high-level
[18:04:27] <mozmck> the stuff in the <...> are the types of the key, value, and a pointer to a comparison function I think - I'd have to look up maps
[18:04:34] <andypugh> Or is that some sort of macro?
[18:04:59] <mozmck> No, not a macro, c++ is pretty high level in many ways.
[18:05:24] <mozmck> http://en.cppreference.com/w/cpp/container/map
[18:05:29] <andypugh> Of more immediate use, is there a way to find where a function is being called from when inside the function?
[18:07:14] <mozmck> I'm not sure what you are asking?
[18:08:30] <andypugh> I asume not then. Maybe it would be possible in an interpreted language
[18:08:34] <mozmck> You mean while running, have the function tell you who called it?
[18:08:49] <andypugh> Yes. I can see that is basically impossible.
[18:09:16] <andypugh> There would be a return pointer, but that’s not going to give any useful info about source code
[18:09:17] <mozmck> I don't know. You can get a backtrace from gdb.
[18:09:42] <mozmck> You can also set breakpoints, and see the call stack.
[18:09:44] <cradek> seb_kuzminsky: AXIS/gremlin runs another interpreter to generate the preview, and that interpreter needs the var file
[18:09:47] <andypugh> If I knew how to use gdb I wouldn’t be asking the question
[18:10:22] <mozmck> I haven't done a lot of that. Most of my debugging for a while has been in microcontrollers. I flash an LED or something :)
[18:10:49] <mozmck> I often add printf() statements to track what get's called and when.
[18:12:19] <mozmck> I've barely used gdb either. Mostly through eclipse when I do though.
[18:16:58] <andypugh> I installed Eclipse with that intention, but LinuxCNC is rather a complex system to set up Eclipse for.
[18:17:34] <andypugh> (For example, the eclipse gdb gui wants to knwo the name of the executable)
[18:18:48] <mozmck> I haven't debugged linuxcnc using gdb, but I see your point, I don't know what executable you'd load to debug anyhow.
[18:19:15] <Tom_itx> i had nothing but trouble with eclipse when using it with avr-gcc
[18:20:44] <mozmck> I've liked eclipse pretty well. Especially for it's editing and code reading stuff. Hitting F3 over any variable and it taking me to where it is defined saves a lot of time. As well as a bunch of other stuff.
[18:22:44] <andypugh> That’s _nearly_ working for me after rebuilding the index. It’s only taking me to the file, not the line of code
[18:25:44] <cradek> andypugh: milltask is what you want to debug
[18:25:53] <cradek> gdb path/to/milltask `pidof milltask`
[18:25:57] <cradek> will attach to it
[18:26:29] <andypugh> I think I tried the GUI version f that, but possibly when I didn’t have GDB installed
[18:27:05] <cradek> unfortunately I bet guis don't help if you don't already know how to use the debugger
[18:27:29] <cradek> getting attached to the running program is simple, but sometimes getting the program interrupted where you care about what's happening is harder
[18:27:59] <mozmck> Well, guis can make that part easier.
[18:28:02] <cradek> for an error you can just set a breakpoint at the spot where the error is issued, then continue, then do whatever causes the error to happen
[18:28:14] <andypugh> I have added a printf to the control_find_o_word function, and the output is curious.
[18:28:15] <andypugh> http://www.pastebin.ca/3091490
[18:28:49] <andypugh> There are a lot more calls to the function than expected, and many fail.
[18:29:50] <cradek> I think that fail means "nope it's not yet defined" which is maybe normal for every O line as you pass it?
[18:30:14] <andypugh> I don’t know.
[18:30:47] <andypugh> I am resentful of the fact that this isn’t my buggy code but I seem to be trying to fix it
[18:31:14] <cradek> that sucks
[18:31:31] <cradek> I'm pretty sure it's a remap problem, so it's not ken's either
[18:33:04] <andypugh> Well, there is something about the context in which the O-word parsing is happening that is confusing a test. It has the look of a rather brain-dead test to me
[19:20:28] <KGB-linuxcnc> 03Sebastian Kuzminsky 05seb/2.7/task-latency-reporting 4ebfe45 06linuxcnc 10src/emc/task/emctaskmain.cc task: report if main loop takes too long * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=4ebfe45
[22:24:28] <KGB-linuxcnc> 03Sebastian Kuzminsky 05seb/2.7/task-latency-reporting 1fb7ca9 06linuxcnc 10src/emc/task/emctaskmain.cc task: more latency reporting * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=1fb7ca9
[22:47:46] <seb_kuzminsky> i think (at least part of) the secret to fixing Task's latency is to revert 724f26975a2
[22:48:06] <seb_kuzminsky> or, you know, do something similar in a different way
[22:49:36] <cradek> when in doubt, be smarter the second time?
[22:53:56] <seb_kuzminsky> http://try.redis.io/
[22:55:13] <cradek> do you mean use a database instead of the var file?
[22:56:57] <cradek> I must not understand what problem you're working on solving
[22:58:39] <seb_kuzminsky> yeah, that's the thought
[22:59:24] <cradek> I agree (from earlier) it should obviously be write then mv, not mv then write
[23:00:11] <seb_kuzminsky> i think it needs to be cp (to make the backup) then write, then rename
[23:00:17] <seb_kuzminsky> (if we still think the backup is useful)
[23:00:19] <seb_kuzminsky> but!
[23:00:25] <cradek> I don't think it's ever bitten anyone, unlike the remap bug that's on my mind
[23:00:40] <cradek> doubt the backup is ever useful
[23:00:52] <seb_kuzminsky> maybe it made sense to write your own key-value ipc/storage mechanism in what 1990 or whatever, but in 2015 it's looking increasingly silly
[23:01:04] <cradek> with one backup the change you wanted would be nuked anyway
[23:01:44] <cradek> how about if it's simple, has been working for 25 years, is documented, and causes no dependencies?
[23:01:55] <seb_kuzminsky> the value of the backup (i guess) is if task crashes while writing the new file? then you could manually go fix things (if you were paying attention)
[23:02:00] <seb_kuzminsky> (which i most likely would not be)
[23:02:11] <cradek> yeah I wouldn't either
[23:02:12] <seb_kuzminsky> * and broken
[23:02:45] <seb_kuzminsky> not that task crashes very often
[23:02:45] <cradek> sorry, what's broken?
[23:02:50] <seb_kuzminsky> i dont think i've ever seen it crash
[23:03:22] <seb_kuzminsky> oh, the broken part is task writes the file often, in a way that doesnt let the readers know if they're reading a partial or a complete file
[23:03:54] <seb_kuzminsky> oh, and task takes 2+ seconds to do so, sometimes
[23:04:13] <seb_kuzminsky> during which time software estop and stuff stops working temporarily
[23:04:37] <cradek> wouldn't any solution cause file IO, which has the same problem?
[23:04:51] <cradek> or would it be multithreaded or something
[23:04:52] <seb_kuzminsky> having another thread do the file io asynchronously would fix the second problem
[23:05:12] <seb_kuzminsky> with some external db, another thread does the io
[23:05:13] <cradek> while making the first problem worse
[23:05:19] <cradek> ?
[23:05:30] <cradek> (I don't know anything about this stuff)
[23:05:59] <seb_kuzminsky> it's true that you can't do atomic updates of complex data types in redis
[23:06:04] <seb_kuzminsky> so the first problem remains
[23:06:13] <seb_kuzminsky> the second problem (task pauses for seconds) goes away
[23:06:16] <cradek> I can't debug task (remap) as-is and having more asynchronous threads doesn't feel like it would help
[23:06:38] <seb_kuzminsky> hmm, yeah, i see your point
[23:06:39] <cradek> maybe that's just got me down, it's not you
[23:06:49] <seb_kuzminsky> heh
[23:07:36] <seb_kuzminsky> i think threads with lots of shared state are hard to debug, but threads with clean interfaces (like external nosql dbs have these days) are pretty straight forward
[23:08:08] <seb_kuzminsky> i dont know much about them, except we use redis for our product at work and the unix geeks who work with it like it
[23:17:34] <seb_kuzminsky> oop, wait, redis supports transactions, so it can fix problem 1 as well as problem 2: http://redis.io/topics/transactions
[23:29:03] <seb_kuzminsky> redis is tiny, the server is < 500 kB installed, the C library is 96 kB
[23:29:23] <seb_kuzminsky> there's a C and a python library for clients (and some other languages that i dont care about)
[23:36:25] <seb_kuzminsky> task: 2703 cycles, min=0.000022, max=5.355035, avg=0.005549
[23:36:25] <seb_kuzminsky> halui: emcCommandSend: no echo from Task after 1.000 seconds
[23:38:53] <seb_kuzminsky> sometimes task is "eager" and doesn't sleep, thus the shorter-than 0.001 sec cycles
[23:39:49] <cradek> wonder what the standard deviation is
[23:40:27] <seb_kuzminsky> not sure
[23:40:31] <cradek> I guess what I really wonder is how many long ones there are
[23:40:43] <seb_kuzminsky> the new version of my branch tells you!
[23:40:46] <seb_kuzminsky> hold on!
[23:41:16] <seb_kuzminsky> task: 2163 cycles, min=0.000017, max=0.040818, avg=0.002104, 1030 latency excursions
[23:41:51] <seb_kuzminsky> so 1030/2163 cycles took longer than 2X the [TASK]CYCLE_TIME
[23:42:46] <seb_kuzminsky> and the average is just greater than 2x the CYCLE_TIME...
[23:43:10] <seb_kuzminsky> note to self: task is not realtime
[23:43:51] <seb_kuzminsky> in related news, i pitched a new project to my dayjob, that they should let me work on realtime linux kernels
[23:44:08] <seb_kuzminsky> i dont think they'll accept my offer
[23:44:09] <cradek> yeah if you ask for .001 and you routinely do file io, nope nope
[23:44:19] <cradek> whoah
[23:44:23] <cradek> that would be nice...
[23:44:30] <cradek> but aren't you a bsd shop?
[23:46:13] * seb_kuzminsky didges the question
[23:46:32] <seb_kuzminsky> i just ran the redis-benchmark program, on a vm running halui/mdi in a tight loop
[23:46:36] <cradek> 'sokay
[23:47:01] <seb_kuzminsky> it can do 20k "set that key to this value" operations per second
[23:47:14] <seb_kuzminsky> and same number of gets
[23:47:34] <seb_kuzminsky> that's... 50 us per parameter
[23:48:09] <seb_kuzminsky> so maybe we'd have to be smart and only save the ones that are changed
[23:49:22] <seb_kuzminsky> although if we save, say, 119 params (typical number for us), that's 6 ms, not great but better than the tens of ms we get when we call rename()
[23:59:46] <KGB-linuxcnc> 03Sebastian Kuzminsky 05seb/2.7/task-latency-reporting ee8b81b 06linuxcnc 10src/emc/task/emctaskmain.cc task: better latency warnings * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=ee8b81b