#linuxcnc-devel | Logs for 2014-09-01

[06:52:18] <jepler> all night long!
[06:56:23] <jepler> (no bad latencies in an overnight run)
[07:28:10] <KGB-linuxcnc> 03Jeff Epler 05jepler/revive-usrmot a79b413 06linuxcnc 10src/emc/usr_intf/Submakefile 03src/emc/usr_intf/usrmot.c Revert "usrmot: it was broken; remove it" * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=a79b413
[07:28:10] <KGB-linuxcnc> 03Jeff Epler 05jepler/revive-usrmot 96df8b8 06linuxcnc 10src/emc/usr_intf/usrmot.c Fill out new required fields in emcmot_command_t * 14http://git.linuxcnc.org/?p=linuxcnc.git;a=commitdiff;h=96df8b8
[08:26:13] <jepler> hm, pretty short list of master branch contributors in august
[10:32:20] <jepler> http://emergent.unpythonic.net/files/sandbox/spi-goodlatency-14h.png
[10:32:50] <jepler> oh maybe only 13 hours :-P
[12:00:06] <pcw_home> That's a lot better, what are the units of tmax? (on x86 they are CPU clocks)
[12:57:45] <jepler> pcw_home: units of ns
[13:03:47] <pcw_home> ahh so different than on x86
[13:09:32] <jepler> there's not a timestamp counter that's available in userspace, so it just calls clock_gettime which returns ns regardless of what the underlying time source counts in
[13:09:37] <jepler> .. which is nice because you know what the units mean
[13:14:55] <pcw_home> Yeah I always mentally divide the tmax figure by CPU clock speed...
[13:15:04] <jepler> the downside is each clock_gettime call is like an I/O, takes about a us
[13:17:47] <pcw_home> I guess the x86 timestamp is a CPU register so almost 0 overhead
[13:19:55] <jepler> there might be some registers that can be optionally enabled in userspace. not sure if they're architectural or vary from part to part. http://neocontra.blogspot.com/2013/05/user-mode-performance-counters-for.html
[13:22:27] <jepler> I guess his notes imply they're part of the arm 7a architecture
[13:27:03] <pcw_home> I guess 1 usec adds up if every component does it
[13:30:43] <pcw_home> I still think there is some interrupt affinity/priority magic that could improve latency
[13:30:44] <pcw_home> (you can see latency spikes on any component's time with video access, for example)
[13:32:15] <pcw_home> Ideally all (non-required) interrupts should be deferred until linuxcnc's thread is done
[13:33:45] <jepler> the file /proc/interrupts can be used to monitor where IRQs are being delivered
[13:35:42] <jepler> and /proc/irq/#/smp_affinity_list can be used to read and set the IRQ's CPU affinity
[13:36:00] <jepler> so for instance you could try putting (highest CPU #) in the ethernet card's irq affinity file
[13:36:28] <jepler> and removing it from the affinity of e.g., the video card (I've got an irq for i915@pci:0000:00:02.0)
[13:38:04] <jepler> in the case of the odroid, I can see by /proc/interrupts that only a few interrupts are delivered to CPU3 -- MCT (system timer), rescheduling interrupts, and just a few function call interrupts
[13:38:32] <jepler> and that spi-s3c64xx doesn't see any interrupts even though it has a /proc/interrupts line
[13:48:59] <pcw_home> What I think is desirable but probably not easily accomplished would be to
[13:49:01] <pcw_home> disable _ALL_ interrupts to _all_ CPUs except required synchronous interrupts
[13:49:02] <pcw_home> (say Ethernet or base threads) during the time linuxcnc's servo thread is running
[13:51:09] <pcw_home> though there might be deadlock potentials by doing that
[13:59:27] <jepler> oh you think delivering the IRQs to whatever CPU can be bad
[13:59:36] <jepler> not just stopping them from being delivered from the RT CPU
[14:01:47] <pcw_home> Just trying to understand spikes even in non-I/O-related component times
[14:02:11] <pcw_home> even in RTAI
[14:04:00] <pcw_home> plot pid time and do some fancy video or just move mouse: spike forest
[14:07:39] <CaptHindsight> "interrupt affinity/priority magic" I think some of that kernel magic is broken
[14:09:01] <CaptHindsight> and some video drivers use "tricks" to cut in on the IRQ line
[14:10:22] <CaptHindsight> DMAs should not be long enough to violate any real-time threads
[14:11:38] <CaptHindsight> and setting irq affinity should really pin it to the core that is specified
[14:12:35] <CaptHindsight> and irq priority should really set priority, but we were seeing some odd behaviors while testing the new RTAI on the 4-core APUs
[14:13:13] <CaptHindsight> I think in time we'll get this all sorted out
[14:14:42] <CaptHindsight> I don't think anyone has been playing latency watchdog for the kernel, it only gets attention when there is some noticeable regression
[14:15:32] <pcw_home> It seems pretty clear that linuxcnc's thread gets interrupted, causing "hal latencies" for lack of a better term
[14:25:03] <jepler> going to try a new kernel with userspace performance counters enabled
[14:25:12] <jepler> .. 17 hours and no bad latencies
[14:25:47] <jepler> http://paste.debian.net/118794/
[14:25:54] <pcw_home> what did you change that got rid of the 10 ms latencies?
[14:26:08] <jepler> pcw_home: moved dma acquisition to spi driver registration time
[14:26:13] <jepler> rather than every spi transaction
[14:27:53] <pcw_home> looks like it would just about run at 2 KHz
[14:28:18] <jepler> it'll help when I rebase onto a branch that is rid of pet_watchdog
[14:28:32] <jepler> not sure how long motion typically runs..
[14:29:34] <pcw_home> how many cores does the Odroid CPU have?
[14:29:34] <jepler> t1 = rdtsc32();
[14:29:34] <jepler> t2 = rdtsc32();
[14:29:34] <jepler> printf("%u %u %u\n", t1, t2, t2-t1);
[14:29:37] <jepler> 2147499433 2147499470 37
[14:29:38] <jepler> whee
[14:29:39] <jepler> pcw_home: 4
[14:30:26] <pcw_home> 37 is a lot better than the 1000s the system call probably takes
[14:31:20] <jepler> I assume these are now units of CPU cycles (CPU speed = 1.7GHz)
[14:31:34] <jepler> rdtsc32() reports 14213 cycles for clock_gettime()
[14:31:48] <jepler> that's more than I thought
[14:31:54] <pcw_home> wow that's expensive
[14:32:17] <jepler> hmm
[14:32:39] <jepler> 1 call is ~14k, 100 clock_gettime calls in a row is only 99k
[14:33:16] <jepler> first call in a program is very slow
[14:33:22] <jepler> second call is fast
[14:33:27] <jepler> or maybe "most" calls are fast
[14:33:32] <pcw_home> caching?
[14:33:52] <pcw_home> maybe you just had a bad one
[15:37:22] <jepler> the timestamp counter must be virtualized or something, it doesn't seem to count during sleeps
[15:38:42] <jepler> t1 = rdtsc; usleep(1000*i); t2 = rdtsc; print t2-t1
[15:38:51] <jepler> this always prints about the same value, even though it gets slower as i gets bigger
[15:49:12] <jepler> well, for posterity: http://emergent.unpythonic.net/files/sandbox/0001-THIS-DOESN-T-WORK-use-performance-counter-for-timest.patch
[15:54:01] <skunkworks_> jepler: still going?
[15:54:20] <jepler> skunkworks_: interrupted a few times to try various things
[15:57:30] <jepler> now trying it with a 500us servo thread
[15:58:43] <jepler> I'm certainly seeing larger variations in the single-cycle velocity estimation
[16:01:56] <jepler> rtapi_app runs about 25% CPU according to top
[16:03:20] <pcw_home> I will have a DPLL latched stepgen sometime today
[16:09:45] <jepler> pcw_home: cool
[16:10:26] <jepler> pcw_home: did you say that via spi, it's not actually possible to reprogram the eeprom?
[16:10:50] <jepler> .. if it were possible, I'd put it on my list to look what it takes to accomplish that in mesaflash
[16:25:22] <PCW> I don't think mesaflash knows how to program via SPI though the firmware does support it
[16:28:14] <PCW> (lack of standard high level SPI interface)
[16:28:16] <PCW> you can jumper for secondary (EPP), connect to EPP host, cycle power, jumper for primary, load SPI config though
[16:29:52] <PCW> it's easy to forget and program both EEPROMs with SPI
[16:30:24] <jepler> /dev/spidev is the "standard" I would write to
[16:30:57] <micges> PCW: how can you program via spi? you have 2 commands according to 7i90 doc: read and write
[16:32:26] <PCW> Flash interface hardware is the same as on PCI cards (two 8 bit registers at 0x70 and 0x74)
[16:33:00] <micges> ahh
[16:33:18] <PCW> (in hm2 space so need 32 bit access)
[16:34:53] <micges> jepler: so code to write flash is in mesaflash, just need to add spi support
[16:35:24] <jepler> may not be very fast unless it's batched up
[16:36:18] <PCW> I think you can write the page data without any polling except at 256 byte intervals
[16:38:22] <micges> yeah that way it works on 7i80 programming
[16:38:25] <PCW> even at 100 MHz SPI clock (640 ns/64 bit command) the 8 bit shift out will be done before the next hm2 write command
[16:41:35] <PCW> the 7I80 is a bit smarter since the processor waits for the shift to complete, but it has to, its next data is already in local RAM
[16:43:03] <PCW> and the parser's overhead for loops is ~160 ns per hm2 write, faster than the flash interface shift register
[16:43:36] <micges> I see
[16:44:11] <PCW> actually the 7I80 doesn't go through HM2 space for that, but in any case it polls locally
[16:45:01] <PCW> but the SPI interface can't outrun the flash SPI interface so bursts should be OK
[16:46:22] <jepler> ok
[16:48:02] <PCW> long bursts (so only 32 bits per 8) would limit SPI clock to 50 MHz (with no handshaking or no-op SPI commands inserted)
[16:55:32] <jepler> closing in on 1 hour at 2kHz (thread max-time 315124 [ns])
[16:56:03] <PCW> very promising
[16:57:07] <PCW> micges: what do you want for a ICAP cookie?
[16:58:22] <micges> probably something with 5 and A ;)
[16:58:36] <PCW> I can put "ICAP" in the text
[16:58:46] <micges> 1CAB
[16:59:07] <PCW> done
[16:59:51] <PCW> will add to cards that have write only ICAP
[17:00:20] <micges> ok
[17:00:22] <PCW> (SP6 PCI,SPI,EPP)
[17:01:46] <PCW> HM2_serial is different since its basically the same as Ethernet and so has a read/write port in separate address space
[17:04:46] <PCW> actual ICAP cookie = 0x1cab1cab
[17:05:15] <micges> I like it
[17:11:34] <PCW> the latest 7I90s have 16 mbit flash chips so have room for 4 configs each
[17:12:24] <PCW> with ICAP it becomes possible to test a config in "free" flash space without risk
[17:13:22] <micges> seems to be worth adding a directly-addressed block to read/write flash and reboot too
[17:17:01] <jepler> pulling the plug on spi also triggers the encoder velocity estimation errors
[17:17:17] <jepler> I mean, pulling the cable
[17:20:26] <PCW> Yeah so any bad data
[17:22:52] <PCW> if the 0xAAAAAAAA is not read the whole hm2 data processing section could be skipped
[17:23:01] <jepler> I sure could add that
[17:25:06] <PCW> I will look into CRCs (maybe CCITT16) for TX and RX in the last 32 bits of data in a frame (no CRC for the last 32 clocks of course)
[17:27:20] <jepler> seems like you really only need to crc one side
[17:27:25] <jepler> "the correct side"
[17:27:30] <jepler> but maybe that's harder
[17:30:15] <PCW> for the short cable lengths used, CRC is probably a bit of overkill (and without buffering the frame, the interface will have already written data with bad CRCs)
[17:30:17] <PCW> so there's no telling what has happened
[17:56:15] <jepler> pcw_home: I don't think I can go looking for the 0xAAAAAAAA magic cookie. except when debugging it, I read back zeros
[17:57:17] <jepler> I suspect that /CS is being managed behind my back
[17:57:27] <jepler> and there aren't big enough delays for it to actually get deasserted
[17:57:52] <jepler> it's inconvenient to attach a probe, boo
[18:01:19] <jepler> if it weren't actually reading back correctly almost all of the time, I'd see feedback velocity glitches; it reads back a zero cookie much more frequently than that
[18:02:40] <PCW> the 0xAAAAAAAA could be echoed whenever a command was expected instead
[18:05:28] <PCW> all that frame timeout is good for is recovering from an aborted burst transfer (resync-ing)
[18:14:13] <PCW> the timeout could be much shorter assuming all data for one frame fits in the FIFO
[18:14:15] <PCW> but what is the actual CS timing?
[18:55:39] <jepler> weeks ago in the logic analyzer it seemed like CS was deasserted between each transaction
[18:55:58] <jepler> today I can't conveniently attach the logic analyzer, the second board doesn't have the extra pins I had put on for that purpose
[18:59:04] <jepler> > All SPI transfers start with the relevant chipselect active. Normally it stays selected until after the last transfer in a message. Drivers can affect the chipselect signal using cs_change.
[18:59:49] <jepler> maybe I should try setting cs_change
[19:00:55] <jepler> > (ii) When the transfer is the last one in the message, the chip may stay selected until the next transfer. On multi-device SPI busses with nothing blocking messages going to other devices, this is just a performance hint; starting a message to another device deselects this one. But in other cases, this can be used to ensure correctness. Some devices need protocol transactions to be built from a series of spi_message submissi
[19:01:02] <jepler> https://www.kernel.org/doc/htmldocs/device-drivers/API-struct-spi-transfer.html
[19:06:46] <jepler> no, that doesn't seem to make a difference in whether I read the cookie back.
[19:11:42] <jepler> hm2_tp_pwmgen_read is the source of the zero-byte read operation
[19:13:17] <jepler> PCW: I expect that if you enable some tp_pwmgens on ethernet you'll get more packets as well
[19:13:31] <jepler> I guess ethernet is guarded against sending zero-length reads somewhere or other
[19:13:46] <jepler> but my code merrily encodes the SPI transaction for "read zero words starting at..."
[19:13:50] <jepler> or did until now, anyway
[19:16:51] <jepler> well, bug entered. cheers, andy
[19:17:52] <jepler> yay, that pushes a typical read transaction down to 80us
[19:19:12] <skunkworks_> neat
[19:19:57] <jepler> PCW: do I need to not only cs_change but ensure a long-enough delay with cs deasserted?
[19:20:20] <jepler> something about a 50us timeout right?
[19:26:40] <jepler> whee, merges nicely with seb's watchdog branch, which cuts a good 20us off every period
[19:27:24] <jepler> and I guess about 60us off of tmax
[20:17:33] <pcw_home> the 50 usec cs deasserted delay resets the burst state machine
[20:18:55] <pcw_home> shorter cs de-assertions just clear the bit count
[20:20:17] <pcw_home> I think a zero length read on Ethernet will set the parse error bit
[20:20:36] <pcw_home> the SPI stuff is not that smart
[20:23:54] <pcw_home> I have not (and I doubt if anyone else) tried the tppwmgen with Ethernet
[20:24:12] <pcw_home> I did try the resolver module the other day
[20:27:17] <pcw_home> I patched the stepgen for DPLL sampling but don't have a bit file yet
[20:27:18] <pcw_home> I think I will make the latched stepgen position logic conditional on the DPLL being present
[20:27:20] <pcw_home> since it's a lot of wasted FPGA space for the 32-bit latches if not used
[20:45:11] <jepler> seb_kuzminsky: someone on a debian mailing list mentioned in passing that by making documentation building part of the arch-indep part, it can be skipped on slower architectures like arm
[20:45:24] <jepler> seb_kuzminsky: I'll look into this if I get a chance, but thought you might be interested to know
[20:47:01] <jepler> seb_kuzminsky: so if monkeyed properly, dpkg-buildpackage -b would build everything, -B just the arch-dependent stuff (linuxcnc) and -A the arch-independent stuff (docs)
[21:28:41] <cradek> jthornton: would you fix another spam?
[21:28:52] <cradek> umm I'll send you the url in private I guess