Tuesday, January 31, 2006

Kernel patching for fun and profit

One of the nice things about Linux is that if a feature you need isn't available, you can write your own. The obvious drawback, of course, is that you need to be a programmer with some mad skillz to pull that off. Fortunately, there are a few of those folks out there, and here are some of the kernel patches I use and where to get them. I'm thinking of starting a kernel patch repository here, because frankly there isn't one anywhere else!
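
By the way, if you've never applied a kernel patch before, the procedure is basically the same for all of these. A rough sketch (the version number and patch filename here are made up, so substitute your own):

cd /usr/src/linux-2.6.15
# dry run first to make sure the patch applies cleanly
patch -p1 --dry-run < /tmp/suspend2-2.6.15.patch
# looks good? apply it for real, then reconfigure and rebuild
patch -p1 < /tmp/suspend2-2.6.15.patch
make menuconfig && make && make modules_install && make install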

Suspend2
Those of us with laptops know how nice suspending to disk can be. I use it every day: I suspend my laptop before I leave the office, and when I plug it in back at home, I'm right back where I was half an hour before. That comes in real handy when you're reading something online and want to finish when you get home. Anyway, this is the patch that emulates Windows' hibernation feature. Not a small or easy change to implement, but worth it in the end.

OpenMosix & OpenSSI
These allow you to turn all of the computers in your location into one big supercomputer. Sort of...Beowulf-style clustering is the best-known Linux parallel-processing hack. The problem with Beowulf is that you have to write your software to directly support parallel processing. Your software does all the work; Beowulf just lets it do it. OpenMosix & OpenSSI do this at the kernel level, meaning your kernel moves your processes between machines. You don't get the level of performance you do with Beowulf, but for things like render farms, it comes in quite handy. Unfortunately, neither supports the 2.6 kernel yet, but OpenMosix is pretty close.

User Mode Linux
Essentially, this is a console-mode VMware. It lets you compile a special kernel that can be run ("booted") just like a regular executable. Good for testing software before you put it into production.
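
For instance, once you've built the UML kernel (the build literally spits out a binary named "linux"), booting it looks something like this -- the root filesystem image name is just whatever you created or downloaded:

# boot the UML kernel with 128MB of RAM and a root filesystem image
./linux ubd0=root_fs.img mem=128M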

USB/IP Project
Easily one of my favorites. And the fact that it works on recent kernels, so I can actually use it, doesn't hurt either. :) Basically, this is just a driver that emulates a USB hub. The hub, however, transmits/receives all traffic to/from the USB device connected to it over an IP-based network. In other words, connect a USB device to your machine, and any other machine with this driver can use it as if it were a local device. Imagine a USB-enclosed hard drive that everyone can mount and back their stuff up to. Or a desk in your house with a server that lives underneath: you plug your scanner into the server, and to use it you just take your wireless laptop over and start working. No plugging things in just to work. Sweet!
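
The userland tools make it pretty painless, too. The commands below are my best recollection of how the client/server pieces fit together -- the exact flags may differ from release to release, and the hostname and bus ID here are made up, so consult the project's docs:

# on the machine the device is physically plugged into:
usbipd -D                            # start the usbip server daemon
usbip bind -b 1-1                    # export the device at bus ID 1-1

# on any machine that wants to use it:
usbip list -r usbhost.example.com    # see what's being exported
usbip attach -r usbhost.example.com -b 1-1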

Thursday, January 26, 2006

A....P....C....DEAD!

I tell you, I've got to stop writing this damn thing. I've had more serious troubles with my server since I started writing about how to do stuff. It's almost like every time I make some progress, something else comes along to force my hand...

The other night, we came home from work to find the house dark. My house is automated with HomeSeer. I don't have a shitload of tasks set up, but the most important one, turning on the outside and living room lights just before sunset, apparently hadn't run. There's usually only one reason for that: loss of power to the server. The server's plugged into a huge surge protector* along with the TV and other electronic equipment in the living room, and a few weeks back I moved it from the floor to an inaccessible spot behind the equipment rack. Beeper, the cat, likes to sleep back there 'cause of the heater, and it's pretty isolated. Unfortunately, the huge switch on the protector is too easily hit by the fat cat. Whenever I couldn't get to my mail at some point in the day, I knew it was due to her. :)

But, as we moved closer to the house, we could hear the alarm beeping. Uh-oh. I pulled out my handy Husky pocket flashlight, and took a tour around the house, peeking in the windows and such. The house was secure, and I could see the clock flashing on the oven. Power outage. Fucking RG&E. Well, at least it wasn't anything serious.

Now, as the three people who read this blog know, I've got my drives set up in RAID arrays. But that don't help much when your drives have become corrupted, or you corrupt the array yourself. I'll spare you the details because, frankly, I'm not 100% sure what I did, or why I had to do it. Suffice it to say, two hours later, I'd pretty much had enough of computers for life!

The next night, I ran to CompUOverpay and grabbed a 350VA APC Back-UPS ES. I made sure it was supported under Linux before buying, of course. :) It's not a bad little UPS for $40. Considering it also buys you a little extra protection from power surges, I'd call it a good investment.

Anywho, fortunately setting it up is easy as pie. The first thing you need to do is install apcupsd, which is a reasonably simple install on pretty much any distro; on Fedora, it's as easy as "yum install apcupsd". Typically, you'd take the time to verify your UPS was being recognized by hotplug before bothering to set up the daemon, but I figured "fuck it". So far, Fedora's been pretty good at that stuff, so let's barrel on!
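
For the more cautious, the whole dance is only a few commands (this is the Fedora-style way of managing services; adjust for your distro):

yum install apcupsd     # grab the daemon
lsusb                   # the UPS should show up as an APC device
chkconfig apcupsd on    # start it at boot...
service apcupsd start   # ...and start it right now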

Even better than expected, the rpm containing apcupsd came pre-configured for a USB UPS (prolly 'cause that's the most common kind now. Ya think?). So, for shits and giggles, I typed "apcaccess" and was rewarded with tons of useful info!

APC : 001,034,0884
DATE : Thu Jan 26 16:10:35 EST 2006
HOSTNAME : someplace.oranother.com
RELEASE : 3.12.1
VERSION : 3.12.1 (06 January 2006) redhat
UPSNAME : someplace.oranother.com
CABLE : USB Cable
MODEL : Back-UPS ES 350
UPSMODE : Stand Alone
STARTTIME: Wed Jan 25 20:44:09 EST 2006
STATUS : ONLINE
LINEV : 120.0 Volts
LOADPCT : 68.0 Percent Load Capacity
BCHARGE : 100.0 Percent
TIMELEFT : 3.9 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME : 0 Seconds
LOTRANS : 088.0 Volts
HITRANS : 138.0 Volts
ALARMDEL : Always
BATTV : 13.5 Volts
LASTXFER : No transfers since turnon
NUMXFERS : 0
TONBATT : 0 seconds
CUMONBATT: 0 seconds
XOFFBATT : N/A
STATFLAG : 0x07000008 Status Flag
MANDATE : 2005-02-16
SERIALNO : XXXXXXXXXX
BATTDATE : 2000-00-00
NOMBATTV : 12.0
FIRMWARE : 00.e5.D USB FW:e5
APCMODEL : Back-UPS ES 350
END APC : Thu Jan 26 16:11:29 EST 2006

Yaay! (I took this at 4 PM the next day, which is why the battery's so well charged.) I see I don't get a whole lot of time before the battery dies, though. The drawbacks of using a dual-proc server. But four minutes is more than enough time to gracefully shut down the server and hopefully protect my data and such.

The first thing I need to address is the fact that I've got a W2K3/Exchange 2003 virtual machine running. That needs to be shut down gracefully first to minimize damage to the database. I have a copy of GSX Server but, unfortunately, the newest version of GSX doesn't support machines built with the newest version of Workstation. I've tried a couple of times to wedge it in there, but finally decided to wait for a new GSX. (I know, there are plenty of ways to do it, and I've tried a few with no success for various reasons. Don't bother, it's not that important at the moment.) Here's the problem: once VMware came out with their "server" products, they removed the ability to shut machines down gracefully at host shutdown (you used to be able to put a line in the VMX file telling it to hibernate the machine on SIGHUP). Since Workstation is a GUI app, I can't just script it, so I'd need a tool to do the job, and I looked at a couple. None did what I needed (essentially: bring focus to that window, hit Ctrl-Z) easily.

Then, I remembered an easier solution: telnet. W2K3 includes a telnet server, and while I have it disabled by default, that's easy enough to change! So, I enabled and started the service and ran this on the Linux host:

autoexpect -f serversdn.exp telnet hostname

Expect is a nifty little scripting language with a specific purpose: automating other console apps. It's perfect for scripting a telnet session because you can tell it "wait for 'ogin:', then send the username". Autoexpect simplifies this further: you tell it the name of the file to save your session to and the command you want it to run, and when you're done, you have an expect script that needs no more than a tiny bit o' tweaking to get you up and running.

So, I scripted it to telnet into the server, shut down the Exchange services**, and then shut down the machine:


set force_conservative 0  ;# set to 1 to force conservative mode even if
                          ;# script wasn't run conservatively originally
if {$force_conservative} {
    set send_slow {1 .1}
    proc send {ignore arg} {
        sleep .1
        exp_send -s -- $arg
    }
}

set timeout -1
spawn telnet server
match_max 100000

# log in
expect "login: "
send -- "administrator\r"
expect "password: "
send -- "easypass\r"

# stop the Exchange services (and friends) before touching the OS
expect "Administrator>"
send -- "net stop MSExchangeIS /y\r"

expect "Administrator>"
send -- "net stop MSExchangeMTA /y\r"

expect "Administrator>"
send -- "net stop MSExchangeSA /y\r"

expect "Administrator>"
send -- "net stop WinHttpAutoProxySvc /y\r"

expect "Administrator>"
send -- "net stop HomeSeerService /y\r"

# everything's down; power the box off
expect "Administrator>"
send -- "tsshutdn 0 /powerdown /delay:0\r"

interact

Does it work? Oh, hell yeah it works! I had to do a little tweaking of the server first, though. On the first few passes, it took two minutes and forty-five seconds to shut down. Since I've got just under four minutes of battery power, that might not leave enough time to shut the Linux host down afterward. Fortunately, I've got a little experience with Winders, too...

Open regedit, and change the following:


"HKCU\Control Panel\Desktop\AutoEndTasks" change from "0" to "1"

"HKCU\Control Panel\Desktop\WaitToKillAppTimeout" This one defaults to 20000 milliseconds, I believe. Change it to 2000.

"HKCU\Control Panel\Desktop\HungAppTimeout" Same as above.

Duplicate the above two entries for HKEY_USERS\.DEFAULT so it'll apply to new users as well.

Finally, change "HKLM\SYSTEM\CurrentControlSet\Control\WaitToKillServiceTimeout" to 2000 as well.
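
If you'd rather not click around in regedit on every box, the same changes can be scripted with reg.exe. A sketch (run it as the user in question so the HKCU entries land in the right place):

@echo off
rem end hung apps automatically instead of waiting on a user to click
reg add "HKCU\Control Panel\Desktop" /v AutoEndTasks /t REG_SZ /d 1 /f
reg add "HKCU\Control Panel\Desktop" /v WaitToKillAppTimeout /t REG_SZ /d 2000 /f
reg add "HKCU\Control Panel\Desktop" /v HungAppTimeout /t REG_SZ /d 2000 /f
rem duplicate the timeouts for new users
reg add "HKU\.DEFAULT\Control Panel\Desktop" /v WaitToKillAppTimeout /t REG_SZ /d 2000 /f
reg add "HKU\.DEFAULT\Control Panel\Desktop" /v HungAppTimeout /t REG_SZ /d 2000 /f
rem and the machine-wide service timeout
reg add "HKLM\SYSTEM\CurrentControlSet\Control" /v WaitToKillServiceTimeout /t REG_SZ /d 2000 /f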


The difference? The Exchange VM now shuts down in one minute and ten seconds. That's a whole lot better, huh?

Now, all I need to do is tell apcupsd what to do when the power goes out, and BOOM! everything shuts down cleanly. This part's easy enough to figure out: edit /etc/apcupsd/apccontrol and put your shutdown commands in the appropriate case blocks.
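
apccontrol is just a shell script that gets called with the event name as its first argument, so the relevant bit ends up looking something like this (the path to the expect script is wherever you stashed it, and ${SHUTDOWN} is a variable the stock script defines up top, if I'm remembering right):

case "$1" in
    # ...the other events (onbattery, offbattery, etc.) stay as shipped...
    doshutdown)
        # take the Exchange VM down gracefully before the host goes
        /etc/apcupsd/serversdn.exp
        ${SHUTDOWN} -h now "apcupsd initiated shutdown"
        ;;
esac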

I did a test run by pulling the cord on the UPS. Within a couple of seconds, I watched the VM shut down and turn itself off. The Linux box then followed soon after without much fuss. I had to tweak the timings, since the VM didn't quite finish shutting down fast enough, but I think I've got it all set now.

Oh, one final step: go into your BIOS and look for a setting called "Restore on AC/Power Loss". Change it to "Full On" or "Power On". ATX-based machines don't automatically power back on, but changing this setting will make it happen. That way, if the power's only out for a short time, your machine'll be back up and running when you come back!


* I don't put a lot of stock in surge protectors. Even the best triacs used to clamp the circuit are generally not fast enough to stop a lightning bolt from killing Stevie and his siblings. However, I WILL generally spend the extra $10-20 and get a good one 'cause they usually come with guarantees that cover zapped equipment. :)

**This is a single machine acting as domain controller and Exchange server. In that combo, it's best to shut down your Exchange services before you shut down the box. If you take the machine down without doing that, it hits a race condition where it tries to shut the services down, but it can't query the domain controller properly because that's going down, too...the short of it is, in this condition, it can take 30-40 minutes for the box to shut itself down. I don't got that kind of time. Oh, and to prevent accidentally doing it when I'm in the machine, I've removed the Shutdown command from the Start menu via a policy and replaced it with a batch file that does it right. Where possible, always put a cover over the power switch. ;-)
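
In case you're curious, that batch file doesn't need to be anything fancy. A sketch using the same commands as the expect script above:

@echo off
rem stop Exchange first so shutdown doesn't race the domain controller
net stop MSExchangeIS /y
net stop MSExchangeMTA /y
net stop MSExchangeSA /y
tsshutdn 0 /powerdown /delay:0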

Monday, January 23, 2006

Rules to entice Open Source adoption

Over the years, I've seen some pretty consistent mistakes made by a large portion of the open source community, mistakes that have mostly kept me out of it. I think the biggest issue is that open sourcers seem to think everyone's a mind reader and just KNOWS everything there is to know about their product, and if you don't, you shouldn't be using Linux anyway. Well, if that's your attitude, fuck you and go somewhere else. It never ceases to amaze me that people will put their software out on the web for others to use, and then bitch when people ask them questions. For those that are interested in having people use their software, read on....

Rule #1. Screenshots should be useful and viewable. Screenshots go a long way toward telling people about your software. They can see whether it's laid out well and whether it has the features they need, presented in a way that makes them easy to use. Sometimes a screenshot will even tell you more about what the software does. Now, I know it's antithetical to the *nix philosophy of "GUI bad, command line GOOD", but too bad. If you don't want to look, don't look. More often than not, I can decide whether a particular piece of software will work for me just by looking at it.

Now, that being said, please read this part: SMALLER IS BETTER! This is the other thing that blows my mind. Wanna piss off an open sourcer? Send them a mail in HTML format. You'll get endless crap about sending "bloated" mail. Then what do they do? They take a screenshot of their entire 1600x1200, 32-bit color desktop to show you their new tray widget. Three hours after the screenshot finishes downloading, you can finally decide if you want to go further. Seriously, cut 'em down. Resize the app to the minimum size necessary to show the functionality, take a screenshot, and then run it through the Gimp to cut down the colorspace and perhaps compress it. Hell, crop out anything that's not your app while you're at it. It is not necessary to see every icon on your desktop in order to see the new word processor you wrote.

Rule #2. Docs are a great way to populate your website with useful information. Back when I used LFS, how often did I find a package that SEEMED to fit my needs, only to discover it required hundreds of other packages, or really didn't fit those needs at all? Too often. Wanna save me some time? Just link your INSTALL and README files on your page. They're plain text, so they're not going to kill your space or bandwidth, and they'll save me a ton of time. I shouldn't have to download a package, untar it, and then open an editor just to find out its prereqs.

Rule #2a. Know your prereqs.
In an ideal world, every developer would have an LFS machine around so they'd know exactly which libs need to be installed for their software to work. There's nothing worse than having a long build appear to finish successfully, only to have the executable blow up when run. Developers who fail to test-build that way first...up against the wall!

Rule #3. About first, history second. The first thing on your website should NOT be a changelog. It doesn't help me to know that you "modified mallocs to use less memory" if I don't know what your software does. Tell me what it is, then put the history and changelogs elsewhere. If I'm interested, I'll look. And a real-world example or two can sometimes go a long way toward helping me decide. Occasionally, I'll come across a project and, after looking at it for some time, still have no idea what it does or why I'd use it. I have, more than once, dismissed a project only to be directed back to it by another site that says "try this, you'll love it!" When I look again, I slap my head. I don't like slapping my head. While we're on this topic, put in an "English" changelog, too. Instead of the entry above, simply say "this version uses less memory".

Rule #4. Not everyone is a programmer. For the love of Cthulhu, take pride in what you do! Not everyone can develop software. It's a gift you have; don't take it lightly! If someone asks a question, "look at the source" is not an answer unless they specifically hand you a code block and ask "how does this work?"

Rule #5. Don't use SourceForge. I'm sorry, but I hate SF. It wouldn't be so bad if I didn't have to fight my way through thirty different, extremely slow-loading pages just to download the tarball. Besides, no one uses SourceForge right anyway. Think I'm being a dick? Find me ONE project on there that actually uses the "Docs" link. Most of the "Project Home Page" links point to an empty index, if you're lucky.

Rule #6. Your software really isn't that revolutionary; work with someone else. How many window managers are there now? The trove at Freshmeat tells me it's got 130 projects in that category. Is that really necessary? Really? You mean to tell me no one can create a basic, simple fucking window manager and then add functionality via plugins? Beyond window managers, though, let's look at some others. Instant messengers: how 'bout using the same config file as one of the others? The information in there should be the same, so if I want to install Gaim and a command-line IM client, I can, without having to worry about how each is configured. Would it really be that hard to keep my username/password for each network in just ONE location? For a group that goes nuts about some of the choices Microsoft has made with Windows, you dorks have sure made a lot more mistakes in terms of simplicity and flexibility!

See, the problem is that too many choices are NOT a strength. You can spout your stale rhetoric about that all you want, but that don't make it true. Too many choices, especially apples-to-oranges choices, only make life more difficult. We have all of these tools, and that's great, but they all have different purposes and functionality. You can't just choose one over the other easily...especially when you want to choose on functionality, but the KDE-type apps don't have what you want and the only ones that do are Gnome apps. Now I gotta install another fucking environment just to use YOUR app. No thanks.

Okay, this rant's over. Seriously, just make it easier on people. Complexity for complexity's sake is a stupid way to put your software out there.

Wednesday, January 4, 2006

Your drive is dead, you are so screwed

Remember me saying how you really need to protect your data? Boy, do I know how to predict failure or what?! :) Seriously, I had a panic attack over the last couple of days as I started getting mails from mdadm:

A DegradedArray event had been detected on md device /dev/md0.

Yipes! So, I catted /proc/mdstat and found all three arrays in a degraded state, each with one drive missing. hdg wasn't showing up in any of them, and an mdadm -Q /dev/hdg confirmed it was no longer part of any array.

A quick inspection of /var/log/messages shows:

Jan 4 16:09:47 alfred kernel: ide: failed opcode was: unknown
Jan 4 16:09:47 alfred kernel: hdg: task_out_intr: status=0x50 { DriveReady SeekComplete }
Jan 4 16:15:47 alfred kernel: hdg: dma_intr: error=0x84 { DriveStatusError BadCRC }


BadCRC?! BADCRC!?!? This hard drive is just over a month old!! It had better not be failing. So, I did me some searching, just to be sure I didn't need to go through the trouble of pulling the drive and shipping it back to the manufacturer (especially since I JUST threw out the damn box!). First, I figured I'd give it a little test to make sure the drive itself wasn't bad. Since I knew it wasn't part of any array, I could play with it however I wanted. So, I repartitioned it and created a new ext2 filesystem on it, then copied a large amount of data to and from the drive, with no new errors in /var/log/messages. Hmmm...
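
If you ever need to run the same kind of sanity check, it amounts to something like this (device, partition, and mount point names are from my box, obviously):

fdisk /dev/hdg                  # wipe the partition table, make one big partition
mke2fs /dev/hdg1                # fresh ext2 filesystem
mount /dev/hdg1 /mnt/test
cp -a /home /mnt/test           # shove a pile of data at it...
diff -r /home /mnt/test/home    # ...and make sure it all reads back intact
tail /var/log/messages          # then check for fresh BadCRC errors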

Unfortunately, you know all that rhetoric about how Linux support is just BUSTLING on the Internet? How it's so much better than commercial support? Yeah, I've heard it, too... After about two hours of searching, I finally came to the conclusion that this error did not indicate a dying drive, but a problem with either the driver, the card, the cable, or the drive (as in, one of these things was not entirely compatible with the others). In other words, there were lots of opinions out there on what these messages meant, but no real information. Most blamed the kernel ("it works fine with a 2.4 kernel"), some blamed the drive's manufacturer ("if it can't keep up with DMA requests, you'll get that error. Get a new drive"), others blamed the APIC ("add noapic to your boot options"). Even on the kernel-dev list, a number of people had posted the exact same problem; few found a solution, and none of their solutions worked for me.

I'll spare you the exact details of everything I had to do to fix this, but suffice it to say I believe the problem was that I had two different-speed drives on the same controller card. That didn't make much sense to me either, since each drive was on its own channel. The real reason is more likely a combination of that and the fact that I'm using the drives in an array (quite a few of the folks with this issue were using arrays). So, I issued the following:

hdparm -X udma3 /dev/hde
hdparm -X udma3 /dev/hdg

This caps the interface speed at UDMA mode 3 (44.4 MB/s). So far, after half an hour, I haven't seen the errors reoccur in the log. I'll try moving the drives up to UDMA4 (66.7 MB/s) at some point, but for now the RAID1 array is rebuilding and I'm not seeing the error, so I'm pretty sure this is a valid fix.
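
In case you're wondering, kicking off that rebuild is a one-liner per array. Something like this (the partition number is from my setup, so check yours against /proc/mdstat first):

mdadm /dev/md0 --add /dev/hdg1    # re-add the missing member; md rebuilds on its own
cat /proc/mdstat                  # watch the recovery percentage tick up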

Update: it's the next day, and I had no errors last night or this morning. The arrays are still holding, so I think we're good. I modified /etc/sysconfig/harddisks so that the equivalent of the following hdparm command gets run on each of my drives at boot time:

hdparm -X udma3 -d1 -c3 /dev/hdx
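
That file is really just a set of variables the Red Hat init scripts feed to hdparm, so I believe the relevant lines end up looking like this (double-check against the comments in the file itself):

# /etc/sysconfig/harddisks
USE_DMA=1                  # same as hdparm -d1
EIDE_32BIT=3               # same as hdparm -c3
EXTRA_PARAMS="-X udma3"    # cap the transfer mode at UDMA3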

The drives will be a little slower and not running at max efficiency, but I can deal. For the most part, these drives are for storage, and I don't need quick-like-a-bunny access. I think this confirms my theory that it's an md driver issue: if one drive is faster than the other, md should "wait" for the slower one to catch up. Of course, that would probably introduce plenty of other timing issues in a driver that's designed to minimize data loss.

This situation could have been a LOT worse had I not set up those arrays (ignoring the fact that it probably wouldn't have occurred had it not been for the arrays...). The machine just chugged along nicely without a burp, even with one drive missing. It's also a good thing I told mdadm to send me alert e-mails; otherwise, the machine would have plodded along without me ever knowing the drive had failed, and I would have found out the seriously hard way: when one of the others died, too...