Rails Page Caching with Subdomains and Advanced Cache Invalidation Cron


Continuing to fight the good fight against Rails’ slowness and processor-intensive handicap, I have been working a lot with page caching over the last day, and I think I finally have a decent setup. This write-up covers the Rails side of caching. It assumes that Apache is already serving up cached pages just fine, that you are using subdomains and have somewhat advanced caching needs, and that you know what caching is and how to make Rails cache and serve cached pages in the first place. Check out this article for some basics (and some advanced things too). A lot of what I have done is based on things there.

So the setup: Artists on the site have subdomains (mine is blainegarrett.adamantinearts.org). By using the request routing gem for Rails, this is easy. In my routes, I can differentiate between “main site” requests and “artist” requests. For example, the two routes below serve up the main page for the respective domain:

map.connect '/',  :controller => 'artist',      :action => 'show', :id => nil,  :conditions => { :subdomain => /^(?!www).+$/i } #subdomain is present but not www
map.connect '/', :action => 'index', :controller => 'index',  :conditions => { :subdomain => /^(www|)$/i } #subdomain is either www or not present (aka main site)
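To see what those two conditions actually accept, here is a minimal sketch that pulls the regexes out into plain Ruby. The site_for helper is hypothetical, and I am assuming the subdomain arrives as a plain string (empty when there is none):

```ruby
# The two subdomain conditions from the routes above, pulled out for testing.
ARTIST_SUBDOMAIN = /^(?!www).+$/i   # present, and not starting with "www"
MAIN_SUBDOMAIN   = /^(www|)$/i      # exactly "www", or empty

# Hypothetical helper: which kind of site would a given subdomain string hit?
def site_for(subdomain)
  return :main   if subdomain =~ MAIN_SUBDOMAIN
  return :artist if subdomain =~ ARTIST_SUBDOMAIN
  :none
end

puts site_for("blainegarrett")
puts site_for("www")
puts site_for("")
```

One thing worth noticing: the negative lookahead rejects anything that merely starts with www, so a subdomain like "wwwfoo" would match neither route.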

This works great for serving up pages, but there is a similar namespace issue for caching of these same pages. For caching to work properly, we need to be able to store the cached pages in a more advanced manner than the default (RAILS_ROOT/public folder). In my application controller, I tell rails where to store the pages by overriding the class variable for the cache directory. Pretty sneaky if you ask me.

@@page_cache_directory  = "#{RAILS_ROOT}/public/caches/" + subdomains + request.domain + "/"
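As a rough sketch of the directory layout this produces, here is the same idea as a standalone function. The page_cache_dir helper and its parameters are made up for illustration (root stands in for RAILS_ROOT), and I am assuming the subdomains come in as an array and the domain as a string:

```ruby
# Hypothetical helper: build the per-host cache directory.
# `root` stands in for RAILS_ROOT; `subdomains` is an array such as
# ["blainegarrett"], and `domain` is a string such as "adamantinearts.org".
def page_cache_dir(root, subdomains, domain)
  host = (subdomains + [domain]).join(".")
  File.join(root, "public", "caches", host) + "/"
end

puts page_cache_dir("/app", ["blainegarrett"], "adamantinearts.org")
puts page_cache_dir("/app", [], "adamantinearts.org")
```

Each host ends up with its own folder under public/caches, so an artist's index.html never collides with the main site's.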

So now, as long as you are in production mode, Rails will write pages to the /public/caches folder, inside folders named for the (sub)domain of the page. The next step is to get Apache to serve those pages up correctly. This is dependent on your server setup, but I wrote an article on getting this to work with Apache/mod_proxy/mod_rewrite. Your rewrite rules will look something like those used in that article.

Once all this works, you should be able to cache pages and serve up cached pages. Now we want to invalidate pages.
Let’s talk first. I don’t like Rails’ cache invalidation for three reasons. Firstly, I don’t trust Rails’ ability to resolve urls via the routes. I just don’t. Secondly, the user shouldn’t have to wait for all the pages that need to be recached to be invalidated. If you have an advanced site, this could be a huge list of pages to delete. Finally, as far as I can tell, Rails’ cache invalidator doesn’t re-cache the page. Therefore, the next person to visit the page has to wait for it to load. I don’t want anyone to have to wait for these pages to load. That is why we are using caching.

So, the goal is to build a sort of queue of urls to invalidate that is populated by Rails and then a cron job that will periodically re-cache all the pages in the queue. Run the cron every minute or so, and depending on load, there shouldn’t be much lag between saving changes and those changes reflected on the site.
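The idea can be sketched in a few lines of plain Ruby. This is only an in-memory stand-in (the UrlQueue class is hypothetical, not part of the real setup) to show the two halves: the web request only records a url, and the periodic job drops an entry only when the recache succeeded:

```ruby
# In-memory stand-in for the queue table (an assumption for illustration;
# the real queue lives in the database).
class UrlQueue
  def initialize
    @urls = []
  end

  # Called from the web request: cheap, just records the url.
  def push(url)
    @urls << url
  end

  # Called from the periodic job: yields each url to the block and drops
  # the entry only when the block reports success.
  def drain
    @urls = @urls.reject { |url| yield(url) }
  end

  # Urls still waiting to be recached.
  def pending
    @urls.dup
  end
end

queue = UrlQueue.new
queue.push("http://blainegarrett.adamantinearts.org/")
queue.push("http://www.adamantinearts.org/")
queue.drain { |url| url.include?("blainegarrett") } # pretend only this one recached
puts queue.pending.inspect
```

Anything that fails to recache simply stays in the queue for the next run.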

Let’s go:

Step 1: Build Rails Model
If you prefer to use rake or other tools, feel free to do it that way.

SQL for the table:

CREATE TABLE `cache_queues` (
`id` MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`url` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM COMMENT = 'Queue of Urls to recache - inserted into by Rails, deleted by Cron';

Simple Ruby Skeleton for the CacheQueue model. The queue method takes in the resolved url, sets itself and saves itself for minimal work in the controller.

class CacheQueue < ActiveRecord::Base
  def queue(url)
    self.url = url
    self.save
  end
end

Step 2: Queue up the page in your controllers
Next we need to edit the controllers that update pages, such as the one for editing your profile info. Instead of invalidating the cache there, we will leave that to the cron. Simply queue up the page. Wherever you might have code like

expire_page(:controller => 'artists', :action => 'show', :subdomain => @member.isa.subdomain)

replace it with:

CacheQueue.new.queue(create_url('',{:subdomain => @member.isa.subdomain})) #Recache the artist's profile

I don’t trust Rails’ routes resolver, so let’s just manually construct the url using my home-rolled create_url(), which is like url_for or link_to. The above code will then insert the record into the cache_queues table. Now we are done with the Rails side of things.
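My create_url() isn’t shown here, but a helper in that spirit could look roughly like the sketch below. Everything in it is an assumption for illustration: the build_url name, the SITE_DOMAIN constant, the http scheme and the fallback to www when no subdomain is given.

```ruby
# Hypothetical stand-in for a create_url-style helper. The site domain and
# the option names are assumptions, not the actual implementation.
SITE_DOMAIN = "adamantinearts.org"

def build_url(path, options = {})
  subdomain = options[:subdomain].to_s
  host = subdomain.empty? ? "www.#{SITE_DOMAIN}" : "#{subdomain}.#{SITE_DOMAIN}"
  path = "/#{path}" unless path.start_with?("/")
  "http://#{host}#{path}"
end

puts build_url("", :subdomain => "blainegarrett")
puts build_url("about")
```

The point is only that the url is assembled by hand from the subdomain and path, without going through the router.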

Step 3: Write our Re-Cacher
I will be writing the re-cache mechanism in PHP since I like PHP and I have lots of code for these sorts of things. If you are a Ruby purist, I’m sure you can adapt my code.

First, we iterate through the cache_queues table fetching all the urls, calling re_cache($url), then deleting the record from the queue. The code is not fancy.

$query = 'SELECT * FROM cache_queues';
$result = mysql_query($query) or die(mysql_error());
while ($row = mysql_fetch_row($result)) {
    // $row[0] is the id, $row[1] is the url
    echo "	* Found url (" . $row[1] .  "). Attempting to update cache. \n";
    if (re_cache($row[1])) {
        $query = 'DELETE FROM cache_queues WHERE id = ' . mysql_escape_string($row[0]) . ' LIMIT 1';
        $result2 = mysql_query($query) or die(mysql_error());
        echo "	* URL Re-Cached and Removed from Queue\n\n";
    }
}

Next, let’s look at the re_cache function:

function re_cache($url) {
    // Get the domain and accommodate www and non-www links. Invalidate both please.
    // Write your own aliasing code here if your setup is more advanced...
    $domain = get_domain($url);

    $domains_to_uncache = array();

    if ($domain == 'www.adamantinearts.org' or $domain == 'adamantinearts.org') {
        $domains_to_uncache[] = 'www.adamantinearts.org';
        $domains_to_uncache[] = 'adamantinearts.org';
    }
    else {
        $domains_to_uncache[] = $domain;
    }

    // Figure out the file name based on the path part of the url.
    // Default to '/' so a url without a path doesn't trigger an undefined index notice.
    $url_bits = parse_url($url);
    $url_path = isset($url_bits['path']) ? $url_bits['path'] : '/';
    if ($url_path == '/') {
        $file_path = '/index.html';
    }
    else {
        $file_path = $url_path . '.html';
    }

    // Finally loop through each domain we want to invalidate the cache for...
    foreach ($domains_to_uncache as $domain) {
        $cache_file_path = '/home/clients/adamantinearts.org/web/public/caches/' . $domain . $file_path;
        if (file_exists($cache_file_path)) {
            echo " Cached File (" . $domain . $file_path . ") exists. Destroy!\n";

            // Throw in obvious attack checks for ../ type attacks... note that this is not a rm -rf => BAD
            // escapeshellarg() keeps odd characters in the path from being interpreted by the shell.
            shell_exec('rm ' . escapeshellarg($cache_file_path));
        }
        $curl_url = 'http://' . $domain . $url_path; // What if it was https?
        echo "\t* Recaching url: $curl_url \n";
        // This will cause the page to recache so the next person to view the page doesn't have to wait for it to cache either.
        shell_exec('curl -sS ' . escapeshellarg($curl_url));
        if (file_exists($cache_file_path)) {
            echo "\t* Sweet. It appears to have re-cached correctly!\n";
        }
        else return false; // Figure out what to do about this later...
    }
    return true;
}

Pretty simple. It could use some work, but it appears to work really well!
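For the Ruby purists, the url-to-file mapping part of re_cache could be adapted along these lines. This is only a sketch: the cache_files_for name and its cache_root and aliases parameters are hypothetical, and it covers just the path logic, not the deleting or the curl call.

```ruby
require "uri"

# Hypothetical Ruby port of the path logic in re_cache: map a queued url to
# the cached files that should be removed. `cache_root` and `aliases` are
# illustrative parameters.
def cache_files_for(url, cache_root, aliases = ["www.adamantinearts.org", "adamantinearts.org"])
  uri  = URI.parse(url)
  path = uri.path.to_s
  # The root page is cached as index.html; everything else as <path>.html
  file = (path.empty? || path == "/") ? "/index.html" : "#{path}.html"
  # www and non-www are aliases of each other, so invalidate both
  hosts = aliases.include?(uri.host) ? aliases : [uri.host]
  hosts.map { |host| File.join(cache_root, host) + file }
end

puts cache_files_for("http://adamantinearts.org/", "/caches")
puts cache_files_for("http://blainegarrett.adamantinearts.org/gallery", "/caches")
```

A main-site url expands to two files (one per alias), while an artist url maps to a single file under that artist's folder.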

Some of the helper functions from my code libraries:

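function get_domain($url) {
    // Return the host part of the url, or false if there isn't one
    $url_bits = parse_url($url);
    if (isset($url_bits['host'])) {
        return $url_bits['host'];
    }
    else {
        return false;
    }
}
function create_valid_url($url) {
    // Clean up the query string especially: url-encode everything after the '?'.
    // preg_replace_callback() replaces the deprecated /e modifier (the closure needs PHP 5.3+).
    $url = preg_replace_callback('/(\?)(.*)/si', function ($m) {
        return $m[1] . urlencode($m[2]);
    }, $url);
    return $url;
}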

And there you go.
