
support

2010-03-13

Karmic upgrade mini-howto


This is not yet user-friendly. Let us know if you need help!  We'll soon develop a script to automate this upgrade process.  Note that these instructions should be more-or-less suitable for migrating any Debian-based OS, such as Lenny, to a distribution kernel (rather than our managed kernel).

  1. Upgrade from Jaunty with apt-get.  Note that *currently* installing Jaunty and upgrading is the easiest way to run Karmic.
  2. ***DO NOT REBOOT / SHUTDOWN***
  3. Copy /etc/event.d/tty1 to /etc/event.d/xvc0 and replace all 'tty1' with 'xvc0' in the copy.  Remove /etc/event.d/tty*
  4. Install the packages "linux-virtual" and "grub". Note that "grub-pc" (GRUBv2) will NOT work.
  5. Modify /boot/grub/menu.lst so that you use 'root = (hd0)' (the default is (hd0,0)).  Append console=xvc0 to the kernel line.
  6. Modify /etc/fstab, replacing 'sda' with 'xvda' (sda1 becomes xvda1, etc.).
  7. Shut down.
  8. Log in to the cloud-shell (management console).
  9. Run the following commands in the cloud-shell:
       cloud-shell> setpref boot_method pvgrub
       cloud-shell> setpref disk_namespace xen
  10. Log out of the cloud-shell.
  11. Log back into the cloud-shell (this reloads your preferences).
  12. Execute:
       cloud-shell> boot console
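
Until the automation script exists, the file edits in steps 3, 5 and 6 can be sketched as small shell helpers. This is only a rough sketch, not our tested procedure: the function names are made up for illustration, the fstab helper assumes your entries use /dev/sdaN device paths (not UUIDs or labels), and each helper writes to a new file or to stdout so nothing is modified in place.

```shell
# Hypothetical helpers for steps 3, 5 and 6 (names are illustrative only).

# Step 3: create the xvc0 console job from the tty1 job.
# Usage: make_xvc0_job /etc/event.d/tty1 /etc/event.d/xvc0
make_xvc0_job() {
    sed 's/tty1/xvc0/g' "$1" > "$2"
}

# Step 5: on uncommented root lines use (hd0) instead of (hd0,0), and
# append console=xvc0 to each kernel line that does not already have it.
# Usage: fix_menu_lst /boot/grub/menu.lst > /tmp/menu.lst.new
fix_menu_lst() {
    sed -e '/^[[:space:]]*root[[:space:]=]/s/(hd0,0)/(hd0)/' \
        -e '/^[[:space:]]*kernel[[:space:]]/{/console=xvc0/!s/$/ console=xvc0/;}' "$1"
}

# Step 6: rename /dev/sdaN devices to /dev/xvdaN.
# Assumption: fstab refers to disks by /dev/sda* paths.
# Usage: rename_fstab_disks /etc/fstab > /tmp/fstab.new
rename_fstab_disks() {
    sed 's|/dev/sda|/dev/xvda|g' "$1"
}
```

Review the generated files (for example with diff) before copying them over the originals, and remember that step 3 still asks you to remove /etc/event.d/tty* afterwards.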

Note: if you build a partitioned instance then your menu.lst may contain the default (hd0,0) entry and you should set boot_method to 'pvgrub_part'. Partitioned instance support has not yet been fully tested.

WARNING: Do NOT upgrade to ext4 if you use grub as described in these instructions, because it does not support ext4.  If you must use ext4, you may attempt a partitioned installation; in that case only the first partition must be ext2/3.
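
A quick way to see whether any configured filesystem is already ext4 is to look at the type column of /etc/fstab. The helper below is a minimal sketch (the name list_ext4_mounts is hypothetical) and it only reports what fstab declares, not what is actually on disk.

```shell
# list_ext4_mounts FSTAB: print the mount point of every uncommented
# fstab entry whose filesystem type (third field) is ext4.
# Hypothetical helper name, for illustration only.
list_ext4_mounts() {
    awk '$1 !~ /^#/ && $3 == "ext4" { print $2 }' "$1"
}
```

Running `list_ext4_mounts /etc/fstab` and getting no output means fstab declares no ext4 filesystems.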

WARNING #2: Archive / restore support is not tested with grub. We should test this.  If it works, let us know.

SAN status

As many of our customers are aware, we've had problems with stability on some of our VPS nodes, specifically our newest and largest nodes.  Most of these issues have been related in one way or another to our iSCSI SAN.  We understand how significant these problems are for our customers, and we hope customers realize how significant these problems are for us -- clearly, problems for customers are our problems too.  We don't like dealing with down machines any more than necessary.  In fact, as a systems administrator, I've always advocated laziness -- the good kind of laziness: do it once, do it right.  I feel that the best systems administrators keep things running smoothly by doing the least amount of work possible.  The point is that we want things to work smoothly even more than you do.  The reality is that we're simply hitting a number of edge cases in the relatively new technology, both hardware and software, which we deploy.

The problems we have had have been numerous and varied.  Some crashes have left little information behind, some have left detailed information, and others have even happened before our own eyes upon running some relatively safe and mundane command.  Some have had interesting causes, some have had interesting solutions.  Some solutions have been deployed already, some will automatically deploy upon reboot (i.e. after the next crash), and others are planned for future maintenance windows.

We're working to resolve these issues as quickly as possible and regain a reputation for reliable service.


Below is a list of issues which we recognize as either currently or recently problematic:

IO errors & read-only filesystems:  Customers have at times noticed IO errors and read-only filesystems when the SAN has "disappeared".  Sometimes this happens after a simple hiccup, sometimes it is a serious outage, but either way, if the error hits a customer instance, that instance is going to need a reboot.  To resolve this, we're changing the iSCSI timeouts.  We recently discovered that the default timeout settings were not appropriate for our configuration.  This change should cause customers' machines to block and wait for the storage to return instead of reporting IO errors.  These timeout settings only matter when something else goes wrong, but they might be the difference between a hiccup and a disaster.
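
To give a concrete idea of the kind of setting involved, here is a sketch that assumes a Linux open-iscsi initiator, where the relevant knob in iscsid.conf is node.session.timeo.replacement_timeout (how long IO stays queued while a session is being re-established). The helper name and the value used are illustrative only; they are not the values we deploy.

```shell
# set_replacement_timeout CONF SECONDS: print CONF with the open-iscsi
# session replacement timeout set to SECONDS (the line is appended if
# the file does not set it yet). Hypothetical helper; assumes an
# iscsid.conf-style "key = value" file.
set_replacement_timeout() {
    key='node.session.timeo.replacement_timeout'
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$1"; then
        sed "s/^[[:space:]]*$key[[:space:]]*=.*/$key = $2/" "$1"
    else
        cat "$1"
        printf '%s = %s\n' "$key" "$2"
    fi
}
```

For example, `set_replacement_timeout /etc/iscsi/iscsid.conf 600 > /tmp/iscsid.conf.new` writes a copy with a 600 second timeout; a longer timeout keeps IO queued through a short SAN hiccup rather than failing it back to the filesystem.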

Swap performance: There are obvious concerns about the performance of the SAN itself, but recently the performance of swap, which is stored on local per-node arrays, seems to be a major factor in the IO wait problems.  In fact, some SAN issues have resulted from problems with local storage.  Poor performance of the local disks used for swap has caused high IO wait on the entire machine, causing time-outs to the SAN.  This is particularly troublesome when customers run out of RAM in their instance, triggering page thrashing.  In this case, one or two customers can ruin it for everyone.  To reduce customer page thrashing, we've been raising the resource ceiling, giving (all) customers more RAM.  One intended improvement is better swap devices.  The best long-term solution, however, would be a capability to constrain IO per instance, which is not currently possible in kernel space.  Piping swap access through userspace would be slower, but the increased reliability might be more than worth it, and this approach is under consideration.
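
Customers who want to check whether their own instance is leaning on swap can read the SwapTotal and SwapFree fields of /proc/meminfo. The helper below is a minimal sketch with a hypothetical name; it reports how much swap is in use, which hints at memory pressure but does not by itself prove thrashing.

```shell
# swap_used_kb [MEMINFO]: print swap in use (SwapTotal - SwapFree) in kB.
# Reads /proc/meminfo by default; a different file can be passed in.
# Hypothetical helper name, for illustration only.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}
```

A large value that keeps growing, together with sustained non-zero si/so columns in vmstat, suggests the instance needs more RAM or a smaller workload.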

Dropping disks: Our SAN filer has been dropping disks left and right.  At one point, it lost an entire mirror set, although we were able to recover it.  Some of these problems have been due to the controller itself, and we've upgraded the firmware to resolve those issues.  However, some of the disks themselves seem to be suffering from "old age" after a single year.  SMART returns "good" and badblock scans are clean, but it appears that "old age" may have increased latency as drives encounter more internal errors.  High disk latency is bad because it triggers drive removals by our "intelligent" disk controller.  We've replaced one third of the disks in our array and moved to three-way mirroring (adding disks) to protect against multiple disk failures.  Unfortunately, it seems that we still need to replace more disks.  I think in this regard we're learning many of the same lessons that Google describes in their study, "Failure Trends in a Large Disk Drive Population".  As a result of these replacements, we've had to run a lot of background reconstructions, affecting performance and IO wait.

SAN OS crashes: A few times, the SAN itself has crashed.  Initially, this was due to controller and OS crashes upon drive hotswap, even from automatic drive removal, an issue now resolved by a firmware update.  Then we experienced, of all things, a CPU fan failure, also resolved.  Finally, this past week, we experienced what appears to have been a hang in the controller's kernel driver.  The only improvements which seem possible here are controller redundancy and OS upgrades.  Most immediately, we'll upgrade the OS.  We'll probably add another controller soon.

iSCSI+ZFS is slow: Our current version of the OpenSolaris OS has bugs and limitations which affect performance for our use case.  The full explanation is beyond the scope of this article, but essentially, limitations in the iSCSI implementation are preventing the system from benefiting from OS, disk, or controller cache.  Worse, they result in additional, unnecessary writes to disk which would otherwise not occur.  This means that performance suffers significantly during normal operation, let alone during background reconstructions or other high-load periods.  A new release of OpenSolaris resolving these issues is scheduled for this month (March 2010).  We are lab-testing and will schedule an upgrade when appropriate.

CLVM: Some of our nodes are using CLVM.  We're not happy with LVM or the Red Hat cluster suite.  It hasn't scaled and has been a source of crashes and reboots.  In January, we completed software modifications which allow us to use iSCSI directly, and new accounts are no longer stored on LVM.  Luckily for those still on the nodes using CLVM, we've addressed most, if not all, of the issues and have recently made a change that may drastically improve reliability and stability.  Still, we're actively seeking to migrate those customers still on CLVM away from it.

Node crashes:  Some of our nodes have simply proven unstable.  Some of the problems we've run into have been strictly software, such as when a node crashes upon performing a "shutdown force" of an instance.  Others, however, have clearly been caused by hardware or kernel drivers.  For those issues which are strictly software, we've been upgrading the affected machines, especially those most frequently affected by crashes, with newer kernels.  As for hardware, we had made the decision to replace some machines.  At this time, we're putting a temporary hold on that idea for further evaluation before we mindlessly throw money at a problem, but we will likely proceed.

Instances are reinstalled poorly: Due to legacy concerns, reimaging is performed in a sub-optimal way.  While this hasn't been explicitly linked to widespread stability problems, or to any recent issues, it has been clearly linked to a few isolated cases from 2008.  Even if our current solution is not a cause of instabilities, it clearly carries a great potential for causing them and must be improved.  In February, we rewrote our installation script in a more stable fashion.  This was not sufficient to solve the (potential) problem, but was a necessary step towards this goal.

2009-08-28

Free incoming calls via Skype now available!


Please note that in addition to our international (+1) number and US toll-free (+1 800) numbers, we can now accept Skype-to-Skype calls to SkypeID "grokthisnet".   Skype-to-Skype calls forward directly into our standard telephone and answering systems, providing an inexpensive option for international customers to contact us by voice without expensive international calling rates.

Please note that due to forwarding delays, Skype calls may require up to 8-10 rings before receiving an answer.  If no one is available to take your call, you will (eventually) be forwarded to our answering system.
