software
2010-06-25
New API & command interface documentation preview
Very, very soon now, we'll be releasing a new management console. Currently, customers may log in to manage individual instances, with each instance having a unique login. Our new "Cloud Admin Shell" will offer the ability to manage multiple instances under a single account, using the same credentials as for the billing account. It will also support creating and removing instances on the fly! As with the current console, this will be possible via an interactive shell, a RESTful API, and an SSH "API".
Below, we are supplying a preview of the new command/API documentation. We will also provide an end-user manual at the time of release. This is being provided for the sake of public comment -- so let us know what you think!
NAME
cloud-admin-shell - Provide a console interface to control instances.
Command Table
instance [id]
boot
Issues a boot command to an instance. If the string argument "console" is provided, it will also start a serial console. Returns 1 on success, or undef on failure.
status
Indicates if the instance is running or not. Returns 1 on success, or undef on failure.
shutdown [graceful | force]
Sends a shutdown command. By default, this is a graceful shutdown. If a graceful shutdown is not possible, the instance will remain running. To force a non-graceful shutdown, specify the string argument "force". This command is non-blocking; use instance status to determine if the instance is still running. The instance console command may be used to interactively view the status of the shutdown.
reboot
Sends a reboot command: a graceful instance shutdown followed by an instance boot. This will fail if the shutdown cannot be performed gracefully. In that case, the command instance shutdown force must be used.
reimage
On an instance whose status is shutdown, install a new operating system. Implementation and availability depend on the cloud specified.
console
Launch an interactive serial console to the instance. Only available on interactive shells.
instance [id] config [noun]
password [pass] [pass-repeat]
Set a new instance-password. This may be used to log in for management of this instance, separate from the account-wide management offered by cloud-admin-shell. This is useful for authorizing users to manage a single instance under an account without giving them access to the entire account.
cloud [id]
Change the cloud on which this instance operates. Different clouds may operate within separate geographic areas and may incur different usage charges.
add
- - ipv4
Add a new IPv4 address
- - ipv6
Add a new IPv6 address
script [noun]
Commands useful and specific to scripting.
fetchuri [uri]
Fetches a URI. HTTP and FTP have been tested. Servers must specify a Document Length. If the Document Length is not provided by the server, this will return -2. If the Document Length is larger than 2MB, this will return -1. Otherwise, this will download the URI and return its content as a string.
language [interpreter]
Changes the language of the current shell. This is only temporary. To set the shell permanently, use config interactive_shell.
This may be used from scripts if a script should contain multiple languages, but is intended primarily for interactive use.
exit
Exits the shell.
quit
Exits the shell.
help
Returns helpful information.
create [noun]
instance
Create/add an instance.
show [noun]
instance
Show information about an instance. If no arguments are provided, lists all instances.
- - config
Shows the configuration & preferences for an instance.
author
Displays the authors of this application.
version
Displays the current version number.
whoami
Tells you who the current user is.
storage
Accesses the storage of an instance.
message [message-id]
Displays system messages. If no message is indicated, it will return a summary of the last 10 messages. If a message id is specified, it will return the full contents of the requested message.
cloud
Lists available clouds when given no arguments. Given a cloud identifier as an argument, provides information regarding that cloud.
config [noun]
storage
Configure storage.
passwd [new-pass] [new-pass-repeat]
Set or change an account password.
sshkey [key]
Install an SSH key for the current user. Must be provided in OpenSSH format.
pref [variable] [value]
Set preferences for the current user.
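The return codes documented for script fetchuri amount to a check on the document length reported by the server. The Python sketch below only illustrates that logic and is not the shell's implementation: the names fetchuri_result and MAX_BYTES are hypothetical, and reading "2MB" as 2 * 1024 * 1024 bytes is an assumption.

```python
from typing import Callable, Optional, Union

# Assumption: "2MB" is taken to mean 2 * 1024 * 1024 bytes.
MAX_BYTES = 2 * 1024 * 1024


def fetchuri_result(content_length: Optional[int],
                    download: Callable[[], str]) -> Union[int, str]:
    """Mimic the documented return values of `script fetchuri`.

    content_length is the length reported by the server, or None if the
    server did not report one. download is only called when the length
    check passes.
    """
    if content_length is None:
        return -2  # server did not provide a Document Length
    if content_length > MAX_BYTES:
        return -1  # document is larger than the limit
    return download()  # content of the URI, as a string
```

A script calling fetchuri would therefore want to test for -2 and -1 before treating the result as document content.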
2010-05-12
Web Management Console *PREVIEW*
We've been working on a new web-based management console. This will not replace, but will rather supplement and rely upon, the console-gui and API. This is an in-development preview of the new interface. It is a good time to make your voice heard if there is something you'd like to see added or changed. Clearly, we'll be making many changes yet, and some features may be renamed, moved, or may not make it into the immediate release. Screenshots below.
Technical details: It will use our new RESTful API and is expected to be entirely self-contained within HTML, Javascript, and CSS. This will allow for a high level of customizability and will let it serve as a reference application for those looking to use our API.
2010-04-23
Cloud-shell Scripting! [PART 2]
A short while ago, we added interpreters to the cloud-shell. However, there were still many things it could not do, limitations in loading externally generated scripts, etc. Well, since our initial announcement, we've added a number of really cool features which we hope you'll love.
Bug Fixes
First, we've fixed bugs. The setpref command should now be much more reliable and changes will take effect immediately; even changing the interactive_shell will have an immediate effect. You'll also know which interpreter you're using from the prompt: "cloud-shell[tcl]>" or "cloud-shell[ruby]>"
Command Passing
You can now launch commands via SSH's command passing mechanism. To demonstrate:
$ ssh -l cdemo manage.vps.grokthis.net status
Guest : Running
All commands executed this way are run via the 'simple' interpreter; there is no smart interpreter when executing commands this way. Remember, we support SSH keys, so you can use this in an automated process.
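For automated use, the command-passing call can be wrapped in a small script. The following Python sketch is one possible approach, not an official client: the function names are hypothetical, and the parser assumes every line of output has the 'key : value' shape of the single example shown above.

```python
import subprocess
from typing import Dict, List


def build_command(user: str, host: str, command: str) -> List[str]:
    """Build the argv for `ssh -l USER HOST COMMAND`, as in the example above."""
    return ["ssh", "-l", user, host] + command.split()


def parse_status(output: str) -> Dict[str, str]:
    """Parse lines such as 'Guest : Running' into {'Guest': 'Running'}.

    Assumption: status lines have the form 'key : value'; other lines
    are ignored.
    """
    result = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


def run_status(user: str, host: str) -> Dict[str, str]:
    """Run the remote `status` command and return the parsed output."""
    completed = subprocess.run(build_command(user, host, "status"),
                               capture_output=True, text=True, check=True)
    return parse_status(completed.stdout)
```

With an SSH key installed, run_status("cdemo", "manage.vps.grokthis.net") would run without a password prompt, which is what makes this usable from cron or a monitoring job.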
Script Piping
Finally, if command passing isn't cool enough, you can now pipe scripts:
$ ssh -l cdemo manage.vps.grokthis.net ruby < do_this_stuff.rb
$ echo "setpref interactive_shell ruby" | ssh -l cdemo manage.vps.grokthis.net tcl
$ cat <<EOF | ssh -l cdemo manage.vps.grokthis.net javascript
whoami()
EOF
Note that to pipe a script, you must (currently) specify the interpreter on the SSH commandline. We intend to add support for a hash-bang (#!) to specify the script type in the script itself...
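Until hash-bang support exists, a local wrapper can choose the interpreter for you. The Python sketch below picks it from the file extension and then pipes the script over SSH; the extension mapping and function names are assumptions made for illustration, and only the interpreter names (ruby, tcl, javascript) come from the examples above.

```python
import os
import subprocess
from typing import List

# Assumption: this extension-to-interpreter mapping is illustrative only.
INTERPRETERS = {".rb": "ruby", ".tcl": "tcl", ".js": "javascript"}


def interpreter_for(path: str) -> str:
    """Pick the interpreter name to pass on the SSH command line."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return INTERPRETERS[ext]
    except KeyError:
        raise ValueError(f"no interpreter known for {path!r}") from None


def pipe_script(user: str, host: str, path: str) -> int:
    """Equivalent of `ssh -l USER HOST INTERPRETER < PATH`.

    Returns the exit code of the ssh process.
    """
    argv: List[str] = ["ssh", "-l", user, host, interpreter_for(path)]
    with open(path, "rb") as script:
        return subprocess.run(argv, stdin=script).returncode
```

Raising an error for unknown extensions, rather than guessing, keeps a mistyped filename from being sent to the wrong interpreter.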
Chef / CFEngine / Puppet
We haven't done anything with these DevOps engines yet, but we see the potential to do some really great stuff... some sort of integration and/or cookbooks are planned. Keep tuned!
Storage problems finally resolved (we think!)
So, we've had a number of problems with the storage. A few months ago, to help minimize the effect, we moved to three-way mirroring. It turns out that not only did it fail to improve the situation, it made it worse. Quite simply, we had several drives which were experiencing repeated "failures" despite testing good and running the latest firmware. It turns out, however, that there was a "super secret" firmware which our drive manufacturer would not inform us of; this was only discovered through research and communication with other customers of this particular disk manufacturer. The three-way mirroring made this worse for another reason altogether: ZFS will either refuse to write to, or will put a heavy bias against, any *mirror* sets which are marked degraded. Normally, I'd expect that read performance would be affected negatively, but that write performance would continue as normal on a degraded set. In ZFS, it is quite the contrary: write performance suffers quite horribly.
In the end, we've managed to get the drives operating stably, and we now know that a degraded three-way mirror will not match the performance of a non-degraded two-way mirror, as I might otherwise have expected from Linux software RAID.
The remaining task is to migrate Rorschach off the clustered LVM volume, moving it to the direct-iSCSI volumes. Rorschach is the last machine using clustered LVM. Then, we'll have to migrate a few hosts that never made it onto shared storage. All in due time...
2010-03-13
Karmic upgrade mini-howto
This is not yet user-friendly. Let us know if you need help! We'll soon develop a script to automate this upgrade process. Note that these instructions should be more-or-less suitable for migrating any Debian-based OS, such as Lenny, to running a distribution kernel (rather than using our managed kernel).
- Upgrade from Jaunty with apt-get. Note that *currently* installing Jaunty and upgrading is the easiest way to run Karmic.
- ***DO NOT REBOOT / SHUTDOWN***
- Copy /etc/event.d/tty1 to /etc/event.d/xvc0 and replace all 'tty1' with 'xvc0'. Remove /etc/event.d/tty*
- Install the packages "linux-virtual" and "grub". Note that "grub-pc" (GRUBv2) will NOT work.
- Modify /boot/grub/menu.lst so that you use 'root = (hd0)' (the default is hd0,0). Append console=xvc0 to the kernel line.
- Modify /etc/fstab, replace 'sda' with 'xvda'. (sda1 becomes xvda1, etc)
- Shutdown.
- Login to cloud-shell (management console)
- Run following commands in the cloud-shell:
cloud-shell> setpref boot_method pvgrub
cloud-shell> setpref disk_namespace xen
- Logout of the cloud-shell
- Log back into the cloud-shell (this reloads your preferences)
- Execute:
cloud-shell> boot console
Note: if you build a partitioned instance then your menu.lst may contain the default (hd0,0) entry and you should set boot_method to 'pvgrub_part'. Partitioned instance support has not yet been fully tested.
WARNING: Do NOT upgrade to ext4 if you are using grub as per these instructions, since grub does not support ext4. If you must use ext4, you may attempt a partitioned installation; only the first partition must be ext2/3.
WARNING #2: Archive / restore support is not tested with grub. We should test this. If it works, let us know.
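The two text-replacement steps above (the fstab device names and the xvc0 console job) are easy to get wrong by hand. The Python sketch below shows one way they could be scripted; it is not the automation script we have promised. The function names are hypothetical, and it assumes fstab refers to devices by /dev/sdaN paths rather than by UUID or label.

```python
import re


def rewrite_fstab(text: str) -> str:
    """Replace /dev/sdaN device names with /dev/xvdaN.

    Comment lines are left untouched, as are other devices such as
    /dev/sdb1.
    """
    out = []
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("#"):
            out.append(line)
        else:
            out.append(re.sub(r"/dev/sda(\d*)\b", r"/dev/xvda\1", line))
    return "".join(out)


def make_xvc0_job(tty1_job: str) -> str:
    """Derive the xvc0 job text from the tty1 job by renaming the device."""
    return tty1_job.replace("tty1", "xvc0")
```

Both functions work on strings, so the result can be reviewed before it is written back to /etc/fstab or /etc/event.d/xvc0.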
SAN status
As many of our customers are aware, we've had problems with stability on some of our VPS nodes, specifically our newest and largest nodes. Most of these issues have been related in one way or another to our iSCSI SAN. We understand how significant these problems are for our customers, and we hope customers realize how significant these problems are for us -- clearly, problems for customers are our problems too. We don't like dealing with down machines any more than necessary. In fact, as a systems administrator, I've always advocated laziness -- the good kind of laziness: do it once, do it right. I feel the best systems administrators keep things running smoothly by doing the least amount of work possible. The point is that we want things to work smoothly even more than you do. The reality is that we're simply hitting a number of edge cases of the relatively new technology, both hardware and software, which we deploy.
The problems we have had have been numerous and varied. Some crashes have left little information behind, some with detailed information, others even happened before our own eyes upon running some relatively safe and mundane command. Some have had interesting causes, some have had interesting solutions. Some solutions have been deployed already, some will automatically deploy upon reboot (i.e. after the next crash), and others are planned for future maintenance windows.
We're working to resolve these issues as quickly as possible and regain a reputation for reliable service.
Below is a list of issues which we're recognizing as either currently, or recently, problematic:
IO errors & read-only filesystems: Customers have at times noticed IO errors and read-only filesystems when the SAN has "disappeared". Sometimes this happens after a simple hiccup, sometimes it is a serious outage, but either way, if the error hits a customer instance, it is going to need a reboot. To resolve this, we're changing the iSCSI timeouts. We recently discovered that the default timeout settings were not appropriate for our configuration. This change should cause customers' machines to block and wait for the storage to return. Really, these timeout settings are only a concern when something else goes wrong, but they might be the difference between a hiccup and a disaster.
Swap performance: There are obvious concerns about the performance of the SAN itself, but recently, the performance of swap -- which is stored on local arrays, per-node -- seems to be a great issue in relation to the IO wait problems. In fact, some SAN issues have resulted from problems with local storage. Poor performance of local disks used for swap has caused high IO wait on the entire machine, causing time-outs to the SAN. This is particularly troublesome when customers run out of RAM in their instance, triggering page thrashing. In this case, one or two customers can ruin it for everyone. To reduce customer page thrashing, we've been raising the resource ceiling, giving (all) customers more RAM. One intended improvement will be better swap devices. The best long-term solution, however, would be a capability to constrain IO per instance, which is not currently possible in kernel space. Piping swap access through userspace will be slower, but the increased reliability might be more than worth it and is under consideration.
Dropping disks: Our SAN filer has been dropping disks left and right. At one point, it lost an entire mirror set, although we were able to recover it. Some of these problems have been due to the controller itself, and we've upgraded the firmware to resolve those issues. However, some of the disks themselves seem to be suffering from "old age" after a single year. SMART returns "good" and badblock scans are clean, but it appears that "old age" may have affected latency as drives encounter more internal errors. Disk latency is bad: it triggers drive removals by our "intelligent" disk controller. We've replaced 1/3rd of the disks in our array and moved to three-way mirroring (adding disks) to protect against multiple disk failures. Unfortunately, it seems that we still need to replace more disks. I think in this regard, we're learning many of the same lessons that Google reported in their study, "Failure Trends in a Large Disk Drive Population". As a result of these replacements, we've had to run a lot of background reconstructions, affecting performance and IO wait.
SAN OS crashes: A few times, the SAN itself has crashed. Initially, this was due to controller and OS crashes upon drive hotswap, even from automatic drive removal, an issue now resolved by a firmware update. Then we experienced, of all things, a CPU fan failure, also resolved. Finally, this past week, we've experienced what appears to have been a hang in the controller's kernel driver. The only improvements which seem possible here are controller redundancy and OS upgrades. Most immediately, we'll upgrade the OS. We'll probably add another controller soon.
iSCSI+ZFS is slow: Our current version of the OpenSolaris OS has bugs and limitations which affect performance for our use case. The full explanation is beyond the scope of this article, but essentially, limitations in the iSCSI implementation are preventing the system from benefiting from OS, disk, or controller cache. Worse, it results in additional, unnecessary writes to disk which would otherwise not occur. This means that performance suffers significantly during normal operation, let alone during background reconstructions or other high-load periods. A new release of OpenSolaris resolving these issues is scheduled for this month (March 2010); we are lab-testing it and will schedule an upgrade when appropriate.
CLVM: Some of our nodes are using CLVM. We're not happy with LVM or the RedHat cluster suite. It hasn't scaled and has been a source of crashes and reboots. In January, we were able to complete software modifications which allow us to use iSCSI directly and new accounts are no longer stored on LVM. Luckily for those still on the nodes using CLVM, we've addressed most/all of the issues and have recently made a change that may drastically improve reliability and stability. Still, we're actively seeking to migrate those customers still on CLVM away from it.
Node crashes: Some of our nodes have simply proven unstable. Some of the problems we've run into have been strictly software, such as when a node crashes upon performing a "shutdown force" of an instance. Some, however, have clearly been caused by hardware or kernel drivers. For those issues which are strictly software, we've been upgrading those machines, especially those most frequently affected by crashes, with newer kernels. As for hardware, we had made the decision to replace some machines. At this time, we're putting a temporary hold on that idea for further evaluation before we mindlessly throw money at a problem, but we will likely proceed.
Instances are reinstalled poorly: Due to legacy concerns, reimaging is performed in a sub-optimal way. While this hasn't been explicitly linked to widespread stability problems, or to any recent issues, it has been clearly linked to a few isolated cases from 2008. Even if our current solution is not a cause of instability, it clearly carries great potential for causing it and must be improved. In February, we rewrote our installation script in a more stable fashion. This was not sufficient to solve the (potential) problem, but was a necessary step towards this goal.
2010-03-03
User preferences in Cloud Shell
VPS/Cloud customers may now set user preferences via the cloud-shell application. Currently, we're supporting two such preferences and will continue to expand this list to support a wide array of user-configurable options...
The current variables include:
cpu_arch, which should be '32' or '64'. Default is 64.
autoheal, which should be 1 or 0. Default is 1.
The cpu_arch setting will be used to determine which architecture kernel should be loaded on boot. Please note that the boot_method variable may override this setting. WARNING: the legacy command 'boot32' has been removed in favor of this user-preference.
The autoheal setting will determine whether or not the configured instance pool will be automatically restored/rebooted upon failure.
To set or get these variables, use the commands 'setpref' and 'getpref' as such:
cloud-shell> setpref cpu_arch 32
cloud-shell> setpref autoheal 0
cloud-shell> getpref
$VAR1 = {
'client:NNN:pref:cpu_arch' => '32',
'client:NNN:pref:autoheal' => '0'
};
It should be noted that reimaging your OS currently cannot modify these variables. Thus, if you switch between a 32-bit and a 64-bit OS, you must modify the cpu_arch variable manually. This is a forthcoming improvement we must make to the system.
UPDATE: New 'boot_method' preference available!
Currently accepted boot methods are:
- pv - Paravirtualized (default)
- pv_part - Paravirtualized, Partitioned disk
- pvgrub - Paravirtualized, with GRUB
- rescue - Boot into rescue image
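Scripts that drive setpref may want to check a value before sending it. The Python sketch below encodes the allowed values and defaults listed in this post; it is a client-side convenience only. The function names are hypothetical, and how the shell itself reacts to an invalid value is not documented here.

```python
from typing import Dict

# Allowed values and defaults, taken from the text above.
ALLOWED = {
    "cpu_arch": {"32", "64"},
    "autoheal": {"0", "1"},
    "boot_method": {"pv", "pv_part", "pvgrub", "rescue"},
}
DEFAULTS = {"cpu_arch": "64", "autoheal": "1", "boot_method": "pv"}


def validate_pref(name: str, value: str) -> None:
    """Raise if the preference name or value is not one listed above."""
    if name not in ALLOWED:
        raise KeyError(f"unknown preference: {name}")
    if value not in ALLOWED[name]:
        raise ValueError(f"invalid value {value!r} for {name}")


def effective_prefs(overrides: Dict[str, str]) -> Dict[str, str]:
    """Return the defaults with validated user overrides applied."""
    prefs = dict(DEFAULTS)
    for name, value in overrides.items():
        validate_pref(name, value)
        prefs[name] = value
    return prefs
```

Calling effective_prefs({"cpu_arch": "32"}) before issuing the matching setpref command catches typos such as "x86" locally.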
2009-04-24
Introducing a free cloud architecture management framework
We want to let everyone know of a free software cloud management framework we've been building. If you've read the CCIF mailing lists or our twitter, then you're probably already aware of it. This project is called Annelidous (www.annelido.us). It enables the building of public and private cloud infrastructure services (IaaS), API-agnostic frontends, and API proxies. It is licensed under the AGPLv3; more information regarding the license can be found on the annelido.us website and at www.fsf.org.
So far, I'm using Annelidous to run the services of GrokThis.net and VPS Village, but it has potential beyond simply hosting. Runnable code is now available for managing Xen clusters over SSH, and an initial frontend based on xen-shell has been completed. Some work has already begun on backend connectors to EC2 and Vertebra-xen. All of this code is available in a public git repository.
The design goals include the potential to build proxies between various IaaS APIs. As an example, a proxy could be built that allows a frontend with support for the OCCI API to communicate to a cloud which offers the EC2 API. This might also then include the ability to automatically and transparently allow OVF files to be used on EC2.
Frontend applications will be able to use this framework to access a variety of 'Connectors' through a common Perl API. In this sense, I intend for it to provide something analogous to what DBI provides for databases.
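To make the DBI analogy concrete, here is a rough sketch of the connector pattern. It is written in Python purely for illustration; Annelidous itself exposes a Perl API, and every class, method, and registry name below is hypothetical rather than taken from the project.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class Connector(ABC):
    """Common interface a frontend would program against (hypothetical)."""

    @abstractmethod
    def boot(self, instance_id: str) -> bool:
        ...

    @abstractmethod
    def status(self, instance_id: str) -> str:
        ...


class InMemoryConnector(Connector):
    """Toy backend standing in for a real one such as Xen over SSH or EC2."""

    def __init__(self) -> None:
        self._running = set()

    def boot(self, instance_id: str) -> bool:
        self._running.add(instance_id)
        return True

    def status(self, instance_id: str) -> str:
        return "running" if instance_id in self._running else "stopped"


# Backends are registered by name, so frontends never import them directly.
REGISTRY: Dict[str, Type[Connector]] = {"memory": InMemoryConnector}


def connect(name: str) -> Connector:
    """Look up a backend by name, much as DBI selects a database driver."""
    return REGISTRY[name]()
```

A frontend written against Connector would not need to change when a different backend is registered, which is the property the proxy idea above relies on.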
It should be noted that support for billing/accounting modules is being built in as well. So far, it integrates with Ubersmith, a billing manager oriented for web hosting operations.
I have very noble goals with this project, but as of yet, it has only scratched the surface of its potential. There is currently an IRC channel on irc.freenode.org, #annelidous, and a public git repository. You can accept this as a call for both awareness and for assistance, so that we can have a free, open, and interoperable solution for building, connecting to, and supporting future IaaS solutions.
2008-10-08
Upgrading your Rails version
By default, we do not upgrade customers' Rails applications to newer releases of Rails. We have noticed that a large number of customers haven't been upgrading, and just want to make sure that customers know how to do so!
Simply edit your application's config/environment.rb as such to specify the version you would like to upgrade to:
-RAILS_GEM_VERSION = '1.1.6'
+RAILS_GEM_VERSION = '2.1.1'
Then, run 'rake rails:update' from within your application's directory.
Notable upgrades for you would be:
1.2.0
1.2.6
2.0.0
2.0.4
2.1.0
2.1.1
Other versions may be available as well, but these are the earliest and latest point-releases available within each major release at the time of this writing.
Django 1.0 'admin' interface
We've had a few customers asking about why their Django admin interface is no longer working, since the upgrade to Django 1.0.
Changes in Django 1.0 required a couple of changes to the urls.py file in order to access the admin view. Here is an example of how to modify the file:
# Add following lines for 1.0 compat, to top of file
from django.conf.urls.defaults import *
from django.contrib import admin
import os
admin.autodiscover()
# and then under urlpatterns, eg:
urlpatterns = patterns('',
    # Remove old entry, eg:
    # (r'^admin/', include('django.contrib.admin.urls')),
    # Add new entry, eg:
    ('^admin/(.*)', admin.site.root),
)
2008-03-19
Alert: GCC 4.3.0 - Avoid! (for now)
This is a quick notice that we are asking that customers hold off from upgrading GCC to version 4.3 at this time. A problem exists in all current Linux releases (from all vendors) which renders GCC 4.3.0 and higher incompatible. A kernel update will be provided shortly to resolve this issue.
UPDATE: all should now be okay in 2.6.26.5