vps
2010-05-12
Web Management Console *PREVIEW*
We've been working on a new web-based management console. It will not replace the console-gui and API, but will supplement and rely upon them. This is an in-development preview of the new interface, so now is a good time to make your voice heard if there is something you'd like to see added or changed. Clearly, we'll be making many changes yet, and some features may be renamed, moved, or may not make it into the immediate release. Screenshots below.
Technical details: It will use our new RESTful API and is expected to be entirely self-contained within HTML, Javascript, and CSS. This will allow for a high level of customizability, and the console can serve as a reference application for those looking to use our API.
2010-04-23
Cloud-shell Scripting! [PART 2]
A short while ago, we added interpreters to the cloud-shell. However, there were still many things it could not do, limitations in loading externally generated scripts, etc. Well, since our initial announcement, we've added a number of really cool features which we hope you'll love.
Bug Fixes
First, we've fixed bugs. The setpref command should now be much more reliable, and changes take effect immediately; even changing the interactive_shell has an immediate effect. The prompt also tells you which interpreter you're using: "cloud-shell[tcl]>" or "cloud-shell[ruby]>".
Command Passing
You can now launch commands via SSH's command passing mechanism. To demonstrate:
$ ssh -l cdemo manage.vps.grokthis.net status
Guest : Running
All commands executed this way run via the 'simple' interpreter; there is no smart interpreter when executing commands this way. Remember, we support SSH keys, so you can use this in an automated process.
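As a sketch of what such an automated process might look like, the helper below runs status over SSH and checks the result. It assumes key-based login is already configured and that the output contains a line of the form "Guest : Running" as in the demonstration above; the function names is_running and check_guest are made up for illustration and are not part of cloud-shell.

```shell
#!/bin/sh
# Sketch for cron-style checks. Assumes an SSH key is already set up and
# that "status" prints a line like "Guest : Running".

# Read status output on stdin; succeed only if the guest is reported running.
is_running() {
    grep -q '^Guest[[:space:]]*:[[:space:]]*Running[[:space:]]*$'
}

# Usage: check_guest ACCOUNT
check_guest() {
    # BatchMode makes ssh fail instead of prompting for a password.
    ssh -o BatchMode=yes -l "$1" manage.vps.grokthis.net status | is_running
}
```

From cron, something like `check_guest cdemo || echo "cdemo is not running"` would then report a guest that is down or unreachable.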
Script Piping
Finally, if command passing isn't cool enough, you can now pipe scripts:
$ ssh -l cdemo manage.vps.grokthis.net ruby < do_this_stuff.rb
$ echo "setpref interactive_shell ruby" | ssh -l cdemo manage.vps.grokthis.net tcl
$ cat <<EOF | ssh -l cdemo manage.vps.grokthis.net javascript
whoami()
EOF
Note that to pipe a script, you must (currently) specify the interpreter on the SSH command line. We intend to add support for a hash-bang (#!) to specify the script type in the script itself...
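Until hash-bang support exists on the server side, a small client-side wrapper can read the script's first line and name the interpreter on the SSH command line for you. This is only a sketch: interp_from_shebang and run_script are hypothetical names, and it assumes the interpreter names are the ones used above (tcl, ruby, javascript).

```shell
#!/bin/sh
# Hypothetical client-side wrapper; not part of cloud-shell itself.

# Print the interpreter named on a script's "#!" line (e.g. "#!ruby").
# A path such as "#!/usr/bin/ruby" is reduced to its last component.
interp_from_shebang() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!'*) ;;
        *) return 1 ;;
    esac
    # Drop "#!", leading blanks, and anything after the first word.
    interp=$(printf '%s\n' "$first_line" |
        sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')
    interp=${interp##*/}
    case "$interp" in
        tcl|ruby|javascript) printf '%s\n' "$interp" ;;
        *) return 1 ;;
    esac
}

# Pipe SCRIPT to the management host, naming the detected interpreter on
# the SSH command line. The "#!" line itself is not sent.
# Usage: run_script ACCOUNT SCRIPT
run_script() {
    interp=$(interp_from_shebang "$2") || {
        echo "no supported #! line in $2" >&2
        return 1
    }
    tail -n +2 "$2" | ssh -l "$1" manage.vps.grokthis.net "$interp"
}
```

With this in place, `run_script cdemo do_this_stuff.rb` would behave like the first piping example, provided do_this_stuff.rb starts with `#!ruby`.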
Chef / CFEngine / Puppet
We haven't done anything with these DevOps engines yet, but we see the potential to do some really great stuff... some sort of integration and/or cookbooks are planned. Stay tuned!
Storage problems finally resolved (we think!)
So, we've had a number of problems with the storage. A few months ago, to help minimize the effect, we moved to three-way mirroring. It turns out that this not only failed to improve the situation, it made it worse. Quite simply, we had several drives which were experiencing repeated "failures" despite testing good and running the latest firmware. It turns out, however, that there was a "super secret" firmware which our drive manufacturer had not told us about; we only discovered it through research and communication with other customers of this particular disk manufacturer.
Three-way mirroring made things worse for another reason altogether: ZFS will either refuse to write to, or will put a heavy bias against, any *mirror* sets which are marked degraded. Normally, I'd expect read performance to be affected negatively while write performance continued as normal on a degraded set. In ZFS, it is quite the contrary: write performance suffers quite horribly.
In the end, we've managed to get the drives operating stably, and we now know that a degraded three-way mirror will not match the performance of a non-degraded two-way mirror, as I might otherwise have expected from experience with Linux software RAID.
What remains is to migrate Rorschach off the clustered LVM volume and onto the direct-iSCSI volumes. Rorschach is the last machine using clustered LVM. Then, we'll have to migrate a few hosts that never made it onto shared storage. All in due time...
2010-03-19
Cloud-shell Scripting!
We now have basic Tcl, Javascript, and Ruby functionality built into cloud-shell! This allows you to script basic functionality from the interactive terminal. User feedback is encouraged; let us know what you think!
What does this mean for you? Well, power users will be able to do a whole lot of really neat things as we expand the library of available methods. Currently, though, you may access any of the existing shell commands from code.
What *will* it mean? Well, we're going to expand the available methods so that you'll be able to write scripts for any sort of neat things. Solaris or Macintosh users might be familiar with OpenFirmware and EFI -- the goal is to make our cloud-shell similar to, but more powerful than, these environments.
UPDATE: Michael Szul (@szul) asked if we had any particular reason for choosing Tcl. Well, for the interactive shell, it simply makes a lot of sense because from now on, all commands you run are actually Tcl procedures. We can do this without a change in the typical user expectations or experience. Stored scripts will be able to use the same familiar syntax you've always used, making it a close analog to a unix shell or powershell script. Essentially, it is easy for the user, despite being a relatively dated and obscure language in 2010. That said, I'm already working on Javascript support, which I feel won't be very useful for interactive use, but would be of great use for non-interactive code, especially for those from a web background.
UPDATE 2: Javascript now supported! Do 'setpref interactive_shell javascript' from the Tcl shell. Note that from then on, everything will be in Javascript syntax, i.e. to get back to Tcl: "setpref('interactive_shell', 'tcl')". We've mapped write/writeln/document.write/document.writeln such that they print to your terminal, for convenience.
UPDATE 3: You may now load script files from the network. Use the fetchuri method to download a script (the remote server MUST set Content-Length). You can then pass that to an eval function, e.g. in Javascript: eval(fetchuri("http://ftp.grokthis.net/pub/installers/karmic.js")); Support for downloading and executing scripts in languages other than that of the interactive shell should be complete sometime tomorrow.
UPDATE 4: Now supporting Ruby!!! Finally, customers who, for whatever reason, want to disable interactive features entirely can revert to the old non-interactive shell by choosing the 'simple' shell.
Example Script (tcl):
Currently, many of the features listed below are not yet incorporated into your shell. This is an example of how scripting will eventually empower users. This example pseudo-script reimages the OS while also setting the CPU architecture and boot methods, and finally, boots the OS. It is a simple example, but hopefully, a powerful one. Ultimately, we intend to base our reimaging software on this concept.
# user settings
set cpu_arch 64
set boot_method pvgrub
set url "ftp://ftp.grokthis.net/disk_images/fedora-12_64.tar.bz2"
# set the guest
set guest [get_guest]
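# Program: reimage only the disk described as "root"
# (pseudo-code: field access such as $disk(desc) is illustrative)
foreach disk [get_storage] {
    if { [string equal $disk(desc) "root"] } {
        reimage $disk $url
    }
}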
setpref cpu_arch $cpu_arch
setpref boot_method $boot_method
boot
# Notify a remote machine of our IP address
fetchuri "http://your-server/instance-manager/callback/create/$guest(ip)"
2010-03-15
Bandwidth monitoring
We've added a new bandwidth command to the cloud-shell management console. It is not yet active for everyone, but it is for over 50% of the customer base. Enjoy!
cloud-shell> bandwidth
cdemo
month                 rx |            tx |         total
------------------------+---------------+---------------
Mar '10             0 MB |     158.76 MB |     158.76 MB
------------------------+---------------+---------------
estimated           0 MB |        327 MB |        327 MB
Connection closed.
cloud-shell>
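How the "estimated" row is computed isn't described here. One plausible reading is a simple linear extrapolation of the month-to-date total, sketched below; the function name estimate_month and the day counts are assumptions for illustration, not the command's actual implementation.

```shell
#!/bin/sh
# Sketch only: linear month-end projection, not necessarily what the
# bandwidth command does internally.

# Usage: estimate_month USED_MB DAYS_ELAPSED DAYS_IN_MONTH
estimate_month() {
    awk -v used="$1" -v elapsed="$2" -v days="$3" \
        'BEGIN { printf "%.0f\n", used * days / elapsed }'
}

# 158.76 MB transferred after 15 of March's 31 days
estimate_month 158.76 15 31
```

This prints 328, close to the 327 MB shown above; the exact figure depends on how far into the month the command is run.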
2010-03-13
SAN status
As many of our customers are aware, we've had problems with stability on some of our VPS nodes, specifically our newest and largest nodes. Most of these issues have been related in one way or another to our iSCSI SAN. We understand how significant these problems are for our customers, and we hope customers realize how significant these problems are for us -- clearly, problems for customers are our problems too. We don't like dealing with down machines any more than necessary. In fact, as a systems administrator, I've always advocated laziness -- the good kind of laziness: do it once, do it right. I feel the best systems administrators keep things running smoothly by doing the least amount of work possible. The point is that we want things to work smoothly even more than you do. The reality is that we're simply hitting a number of edge cases of the relatively new technology, both hardware and software, which we deploy.
The problems we have had have been numerous and varied. Some crashes have left little information behind, some have left detailed information, and others even happened before our own eyes upon running some relatively safe and mundane command. Some have had interesting causes, some have had interesting solutions. Some solutions have been deployed already, some will automatically deploy upon reboot (i.e. after the next crash), and others are planned for future maintenance windows.
We're working to resolve these issues as quickly as possible and regain a reputation for reliable service.
Below is a list of issues which we're recognizing as either currently, or recently, problematic:
IO errors & read-only filesystems: Customers have at times noticed IO errors and read-only filesystems when the SAN has "disappeared". Sometimes this happens after a simple hiccup, sometimes it is a serious outage, but either way, if the error hits a customer instance, they're going to need a reboot. To resolve this, we're changing the iSCSI timeouts. We recently discovered that the default timeout settings were not appropriate for our configuration. This change should cause customers' machines to block and wait for the storage to return. Realistically, these timeout settings only matter when something else has already gone wrong, but they might be the difference between a hiccup and a disaster.
Swap performance: There are obvious concerns about the performance of the SAN itself, but recently the performance of swap -- which is stored on local arrays, per node -- seems to be a major factor in the IO wait problems. In fact, some SAN issues have resulted from problems with local storage: poor performance of the local disks used for swap has caused high IO wait on the entire machine, causing time-outs to the SAN. This is particularly troublesome when customers run out of RAM in their instance, triggering page thrashing. In this case, one or two customers can ruin it for everyone. To reduce customer page thrashing, we've been raising the resource ceiling, giving (all) customers more RAM. One intended improvement is better swap devices. The best long-term solution, however, would be the ability to constrain IO per instance, which is not currently possible in kernel space. Piping swap access through userspace would be slower, but the increased reliability might be more than worth it, and this option is under consideration.
Dropping disks: Our SAN filer has been dropping disks left and right. At one point, it lost an entire mirror set, although we were able to recover it. Some of these problems have been due to the controller itself, and we've upgraded the firmware to resolve those issues. However, some of the disks themselves seem to be suffering from "old age" after a single year. SMART returns "good" and badblock scans are clean, but it appears that "old age" may have affected latency as drives encounter more internal errors. Disk latency is bad because it triggers drive removals by our "intelligent" disk controller. We've replaced 1/3rd of the disks in our array and moved to three-way mirroring (adding disks) to protect against multiple disk failures. Unfortunately, it seems that we still need to replace more disks. I think in this regard we're running into many of the same issues that Google describes in their study, "Failure Trends in a Large Disk Drive Population". As a result of these replacements, we've had to run a lot of background reconstructions, affecting performance and IO wait.
SAN OS crashes: A few times, the SAN itself has crashed. Initially, it happened due to controller & OS crashes upon drive hotswap, even from automatic drive removal, an issue now resolved by a firmware update. Then, we experienced, of all things, a CPU fan failure, also resolved. Finally, this past week, we've experienced what appears to have been a hangup in the controller's kernel driver. The only improvements which seem possible here are controller redundancy and OS upgrades. Most immediately, we'll upgrade the OS. We'll probably add another controller soon.
iSCSI+ZFS is slow: Our current version of the OpenSolaris OS has bugs and limitations which affect performance for our use case. The actual explanation for this is beyond this article, but essentially, limitations in the iSCSI implementation are preventing the system from benefiting from OS, disk, or controller cache. Worse, it results in additional, unnecessary writes to disk which would otherwise not occur. This means that performance suffers significantly during normal operation, let alone during background reconstructions or other high-load periods. A new release of OpenSolaris resolving these issues is scheduled for this month (March 2010); we are lab-testing and will schedule an upgrade when appropriate.
CLVM: Some of our nodes are using CLVM. We're not happy with LVM or the RedHat cluster suite. It hasn't scaled and has been a source of crashes and reboots. In January, we were able to complete software modifications which allow us to use iSCSI directly and new accounts are no longer stored on LVM. Luckily for those still on the nodes using CLVM, we've addressed most/all of the issues and have recently made a change that may drastically improve reliability and stability. Still, we're actively seeking to migrate those customers still on CLVM away from it.
Node crashes: Some of our nodes have simply proven unstable. Some of the problems we've run into have been strictly software, such as when a node crashes upon performing a "shutdown force" of an instance. Some, however, have clearly been hardware or kernel drivers. For those issues which are strictly software, we've been upgrading those machines, especially those most frequently affected by crashes, with newer kernels. As for hardware, we had made the decision to replace some machines. At this time, we're putting a temporary hold on that idea for further evaluation before we mindlessly throw money at a problem, though we will likely proceed.
Instances are reinstalled poorly: Due to legacy concerns, reimaging is performed in a sub-optimal way. While this hasn't been explicitly linked to widespread stability problems, or to any recent issues, it has been clearly linked to a few isolated cases from 2008. Even if our current solution is not a cause of instabilities, it clearly carries great potential for causing them and must be improved. In February, we rewrote our installation script in a more stable fashion. This was not sufficient to solve the (potential) problem, but was a necessary step towards this goal.
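To illustrate the iSCSI timeout change mentioned under "IO errors & read-only filesystems" above, here is a rough sketch. It assumes the nodes use the Linux open-iscsi initiator, where the relevant setting is node.session.timeo.replacement_timeout in iscsid.conf; the helper name set_replacement_timeout and the example value are hypothetical, not our deployed configuration.

```shell
#!/bin/sh
# Sketch only: assumes the open-iscsi initiator, whose iscsid.conf carries
# "node.session.timeo.replacement_timeout" (in seconds). The value used in
# the usage note is illustrative, not the one actually deployed.

# Set the replacement timeout in an iscsid.conf-style file.
# Usage: set_replacement_timeout FILE SECONDS
set_replacement_timeout() {
    conf=$1
    secs=$2
    key='node.session.timeo.replacement_timeout'
    tmp="$conf.tmp.$$"
    if grep -q "^$key[[:space:]]*=" "$conf"; then
        # Rewrite the existing line in place.
        sed "s/^$key[[:space:]]*=.*/$key = $secs/" "$conf" > "$tmp"
    else
        # No such line yet: append one.
        { cat "$conf"; printf '%s = %s\n' "$key" "$secs"; } > "$tmp"
    fi
    mv "$tmp" "$conf"
}
```

For example, `set_replacement_timeout /etc/iscsi/iscsid.conf 600` would ask the initiator to hold IO for ten minutes before failing it back to the guest. With open-iscsi, values in iscsid.conf generally apply to targets discovered afterwards; records for already-known targets would need updating separately with iscsiadm.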
2009-09-17
Booting with Grub (custom kernels!)
This guide will show you how to boot a custom kernel with grub!
To run a custom kernel, you need only two things: a kernel (of course!) and a GRUB configuration.
Let's start with GRUB. This example assumes that you will not be using an initrd file (although you may use one) and that your kernel image will be installed as /boot/vmlinuz:
# mkdir -p /boot/grub
# cat <<EOF > /boot/grub/menu.lst
default 0
timeout 5
title Linux Default
root (hd0)
kernel (hd0)/boot/vmlinuz root=/dev/sda1 ro console=xvc0 clocksource=jiffies
EOF
That is it for GRUB! Now, as for your kernel, you just need to make sure that it is compiled to run as a Xen DomU with paravirtualization. This requires that you use a recent Linux kernel with support for Xen enabled, or that you use a patched kernel such as the one provided by XenSource (http://xenbits.xensource.com/linux-2.6.18-xen.hg).
The compiled kernel image may be placed anywhere inside your filesystem as long as GRUB is configured to point to it. The above example assumes this location is /boot/vmlinuz.
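Before building, it can help to confirm that the kernel configuration actually enables Xen guest support. The check below is a sketch: the function name check_xen_config is made up, and the option list is an assumption based on common Xen-enabled 2.6 kernels, so adjust it for the tree you actually build.

```shell
#!/bin/sh
# Sketch: report whether a kernel .config enables the usual Xen guest options.
# The option list is an assumption; adjust it for your kernel tree.

# Usage: check_xen_config PATH_TO_CONFIG
# Prints each missing option and returns non-zero if any is missing.
check_xen_config() {
    config=$1
    missing=0
    for opt in CONFIG_XEN CONFIG_XEN_BLKDEV_FRONTEND CONFIG_XEN_NETDEV_FRONTEND; do
        # Built-in (=y) or module (=m) both count as enabled here.
        if ! grep -q "^$opt=[ym]\$" "$config"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as `check_xen_config .config` from the kernel source directory. Since the GRUB example above boots without an initrd, the block frontend most likely needs to be built in (=y) rather than built as a module, so that the root disk is reachable at boot.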
Finally, from the management console, execute the command "boot-grub". This command is API-accessible. Unfortunately, for now, you will need to specify this manually whenever booting. If the node crashes, for instance, your VPS will come back up with the system-default image. This caveat will disappear once we complete a migration of all accounts to a grub-based configuration.
2009-08-15
Kernel upgrade procedure
We frequently get asked how to upgrade kernels, how to upgrade modules, etc. This is actually in our knowledge-base, but I thought it would be useful to toss this onto the blog since we just pushed out a very important kernel upgrade, and want to make sure that this information is as accessible as possible!
1. Simply shut down your VPS. You must do a full shutdown and not a "reboot".
2. Then, boot the machine from the management console. IMPORTANT: If you have a 32-bit OS, you must use the command "boot32"; this will be unnecessary in a future update.
3. Update your modules:
wget -O - http://ftp.grokthis.net/pub/linux/modules/install_modules.sh | /bin/bash
4. You're done! You might want to reboot if you depend on specific modules being available during boot.
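After step 3, a quick sanity check is to confirm that modules exist for the kernel you are now running. This sketch assumes modules end up in the standard /lib/modules/<release> directory; what install_modules.sh does internally is not shown here, and have_modules is a made-up helper name.

```shell
#!/bin/sh
# Sketch: confirm that a module directory exists for a given kernel release.
# Assumes the standard /lib/modules/<release> layout.

# Usage: have_modules [RELEASE] [MODULE_ROOT]
# Defaults to the running kernel and /lib/modules.
have_modules() {
    release=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    [ -d "$root/$release" ]
}

if have_modules; then
    echo "modules present for $(uname -r)"
else
    echo "no modules found for $(uname -r)"
fi
```

If the check fails after running the installer, the running kernel and the installed modules probably do not match, which usually means the VPS was rebooted rather than fully shut down and booted again.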
2009-08-09
Enter to win a contest for a free VPS, or a $400 coupon!
RT @grokthis to win a one-year subscription to a VPS /w 1GB RAM and 48GB disk, or a $400 credit to VPS Village. Contest ends 8/14
On 8/15/09, one winner will be selected at random. No purchase necessary.
Regards,
Eric Windisch
2009-05-12
Introducing the GrokThis VPS "Cloud"
In the coming months, we will be extending and enhancing this API, and have plans to support industry standard APIs as they become available. At this time, the ACII API allows developers to obtain a list of available virtual machines and to manage and control them by issuing shutdown, reboot, rescue, and boot commands. Very shortly, this API will also allow the creation of new virtual machines.
A utility billing model is not yet available, but it is expected for Summer 2009.
An example client script in Perl has been posted:
Download Perl Example
The GrokThis.net VPS Cloud is powered by Annelidous, a free software cloud infrastructure management solution developed internally at GrokThis.net and released to the community under the AGPLv3 license.