Cassandra and Big Data – building a single-node “cluster”

Cassandra – Getting off the ground.
Continuation of post: Apache Cassandra Project – processing “Big Data”

While researching a project on Big Data services, I knew that I’d need a multi-node cluster to experiment with, but a pile of hardware was not immediately available.

Using the VERY helpful book Cassandra High Performance Cookbook, I was able to build a 3-node cluster on a single machine. This is how I did it:


For this cluster test example, I am using Ubuntu 10 with the following JVM:

      JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0_22

Downloaded Cassandra 1.0.8 package from here:
http://apache.mirrors.tds.net//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz
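If you'd rather grab it from the command line, a wget along these lines does the job (same mirror URL as above; I stage downloads in /tmp):

$ wget -P /tmp http://apache.mirrors.tds.net//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz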

Created new user on system: bigdata
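On Ubuntu, something like this creates the account and switches into it (adjust groups, shell, and so on to your own standards):

$ sudo adduser bigdata
$ su - bigdata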

Created the required base data directories

  $ mkdir commitlog log data saved_caches
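Each instance also gets its own subdirectory under those base directories, matching the paths I use in cassandra.yaml further down. A quick loop covers all three (assuming the cassA/cassB/cassC names used throughout this post):

$ for i in cassA cassB cassC; do mkdir -p data/$i commitlog/$i saved_caches/$i; done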

Moved the downloaded package there

$ cp /tmp/apache-cassandra-1.0.8-bin.tar.gz .

Unzipped and extracted the contents

$ gunzip apache-cassandra-1.0.8-bin.tar.gz
$ tar xvf apache-cassandra-1.0.8-bin.tar

Renamed the long directory to the first instance name, cassA-1.0.8

$ mv apache-cassandra-1.0.8 cassA-1.0.8

Extracted the tarball again and renamed the result for each of the other two planned instances:

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassB-1.0.8  

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassC-1.0.8  

This gave me three instances to configure, each with its own IP address:

  cassA-1.0.8   10.1.100.101
  cassB-1.0.8   10.1.100.102
  cassC-1.0.8   10.1.100.103
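Since all three instances live on one box, those addresses have to exist on that box somewhere. One way to sketch it, assuming an eth0 interface and a /24 test network (adjust the device and netmask to your environment), is to add them as additional addresses:

$ sudo ip addr add 10.1.100.101/24 dev eth0
$ sudo ip addr add 10.1.100.102/24 dev eth0
$ sudo ip addr add 10.1.100.103/24 dev eth0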

Edit the configuration files in each instance (cassA-1.0.8 used as the example):

$ vi cassA-1.0.8/conf/cassandra.yaml 

[...]

# directories where Cassandra should store data on disk.
data_file_directories: 
    - /home/bigdata/data/cassA

# commit log
commitlog_directory: /home/bigdata/commitlog/cassA

# saved caches
saved_caches_directory: /home/bigdata/saved_caches/cassA

[...]

# If blank, Cassandra will request a token bisecting the range of
# the heaviest-loaded existing node.  If there is no load information
# available, such as is the case with a new cluster, it will pick
# a random token, which will lead to hot spots.
initial_token: 0

[...]

# Setting this to 0.0.0.0 is always wrong.
listen_address: 10.1.100.101

[...]

rpc_address: 10.1.100.101

[...]

          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.1.100.101,10.1.100.102,10.1.100.103"
[...]

Setting a separate log file for each instance is recommended. Edit the log4j config to set it:

vi cassA-1.0.8/conf/log4j-server.properties

[...]
log4j.appender.R.File=/home/bigdata/log/cassA.log
[...]

Repeat for instances cassB and cassC, setting the token value for B and C to appropriate values (see Extra Credit below if you need to know how to do *that* part):

#cassB
initial_token: 56713727820156410577229101238628035242

#cassC
initial_token: 113427455640312821154458202477256070485
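A quick sanity check before moving on: grep the per-instance settings out of all three yaml files and confirm that the tokens and addresses differ while the seeds list matches (a rough sketch; the glob assumes the directory names above):

$ grep -E 'initial_token|listen_address|rpc_address|seeds' cass?-1.0.8/conf/cassandra.yaml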

To enable the JMX management console, each instance will require its own port. Edit the env file to set that up.

vi cassA-1.0.8/conf/cassandra-env.sh

[...]
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="8001"
[...]

Repeat for the other two instances, setting JMX_PORT to 8002 and 8003 respectively.
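Once the instances are up, jconsole (shipped with the JDK) can attach to any of these ports if you want to poke at the JMX metrics directly, for example:

$ jconsole 10.1.100.101:8001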

Now, for the final trick, start up the instances:

  cassA-1.0.8/bin/cassandra
  cassB-1.0.8/bin/cassandra
  cassC-1.0.8/bin/cassandra
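If you end up restarting them often, a small loop saves some typing (just a convenience wrapper around the same commands, run from the bigdata home directory):

$ for i in cassA cassB cassC; do $i-1.0.8/bin/cassandra; done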

The cluster members started up and can be seen active in the process table:

$ ps -lf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S bigdata   4554     1  2  80   0 - 226846 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4593     1  2  80   0 - 210824 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4632     1  2  80   0 - 226830 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 R bigdata   5047  3054  0  80   0 -  5483 -      12:16 pts/0    00:00:00 ps -lf

Finally, to check the status, connect to one of the JMX ports with nodetool and check the ring. You only need to connect to one of the cluster’s nodes to see the complete cluster’s status:

$ bin/nodetool -h 10.1.100.101 -port 8001 ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               113427455640312821154458202477256070485     
10.1.100.101    datacenter1 rack1       Up     Normal  21.86 KB        33.33%  0                                           
10.1.100.102    datacenter1 rack1       Up     Normal  20.28 KB        33.33%  56713727820156410577229101238628035242      
10.1.100.103    datacenter1 rack1       Up     Normal  29.1 KB         33.33%  113427455640312821154458202477256070485      

Now, that’s a functional 3-instance cluster running on a single node. These instances are not in separate VMs, but if you wanted to experiment with a larger cluster by running multiple instances across multiple VMs on a single hypervisor, I don’t see why you couldn’t!

In the next article, I’m going to start feeding data into the cluster. Stay tuned for that!


Extra Credit:

To create the token values I needed for this three-node ring, I used the following Perl script. BTW, bignum is required unless you want Perl printing these big numbers in scientific notation:

#!/usr/bin/perl
# Calculate evenly spaced initial_token values for a RandomPartitioner ring.
use bignum;    # without this, Perl prints the big numbers in scientific notation

my $nodes = shift;
print "Calculate tokens for $nodes nodes\n";
print "node 0\ttoken: 0\n" unless $nodes;
exit unless $nodes;

# RandomPartitioner's token space runs from 0 to 2**127.
my $factor = 2**127;
print "factor = $factor\n";

# Hand each node an equal slice of the token space.
for (my $i = 0; $i < $nodes; $i++) {
	my $token = $i * ( $factor / $nodes );
	print "node $i\ttoken: $token\n";
}

Running the script for three nodes gave me the following results (I truncated the fractional part when setting initial_token above):

$ ./maketokens.pl  3

Calculate tokens for 3 nodes
factor = 170141183460469231731687303715884105728
node 0	token: 0
node 1	token: 56713727820156410577229101238628035242.67
node 2	token: 113427455640312821154458202477256070485.34

Additional Comments:

If you are setting up a standard multi-box cluster, make sure you have the following ports opened up on any firewalls. If not, the cluster members won't find each other:

# TCP port, for commands and data
storage_port: 7000

# SSL port, for encrypted communication.  Unused unless enabled in
# encryption_options
ssl_storage_port: 7001
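For example, with iptables you might allow the other cluster members in with rules along these lines (10.1.100.0/24 is the test subnet used in this post; substitute your own source ranges and firewall policy):

$ sudo iptables -A INPUT -p tcp -s 10.1.100.0/24 --dport 7000 -j ACCEPT
$ sudo iptables -A INPUT -p tcp -s 10.1.100.0/24 --dport 7001 -j ACCEPT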

NEXT: Setting up a Java build env to prepare for Cassandra development
