Cassandra – Getting off the ground.
Continuation of post: Apache Cassandra Project – processing “Big Data”
While researching a project on Big Data services, I knew that I’d need a multi-node cluster to experiment with, but a pile of hardware was not immediately available.
Using the VERY helpful book Cassandra High Performance Cookbook I was able to build a 3 node cluster on a single machine. This is how I did it:
For this test cluster I am using Ubuntu 10 with the following JVM:
JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0_22
Downloaded Cassandra 1.0.8 package from here:
http://apache.mirrors.tds.net//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz
Created a new user on the system: bigdata
Created the required base data directories in that user's home directory (/home/bigdata):
$ mkdir commitlog log data saved_caches
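The same step can be sketched as a small re-runnable function. This is only an illustration under my own assumptions: the name `make_base_dirs` and its base-directory argument are made up here, and `mkdir -p` is used so that running it twice does no harm.

```shell
# Hypothetical helper (not part of Cassandra): create the four base
# directories used in this post under a given root directory.
make_base_dirs() {
    base="${1:-$HOME}"          # default to the current user's home
    for d in commitlog log data saved_caches; do
        mkdir -p "$base/$d" || return 1
    done
}

# Usage, assuming the bigdata user's home directory:
# make_base_dirs /home/bigdata
```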
Copied the package into the home directory and started the setup:
$ cp /tmp/apache-cassandra-1.0.8-bin.tar.gz .
Unzipped and extracted the contents
$ gunzip apache-cassandra-1.0.8-bin.tar.gz
$ tar xvf apache-cassandra-1.0.8-bin.tar
Renamed the long directory to the first instance, cassA-1.0.8:
$ mv apache-cassandra-1.0.8 cassA-1.0.8
Extracted again and renamed this to the other two planned instances:
$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassB-1.0.8
$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassC-1.0.8
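The extract-and-rename steps repeat once per instance, so they can be sketched as a loop. The function name `unpack_instances` and its argument order are my own invention for illustration; it assumes the archive has already been gunzipped and that it unpacks into a single top-level directory.

```shell
# Hypothetical helper: unpack one tar archive several times, renaming the
# extracted directory to a different instance name each time.
#   $1 = tar archive, $2 = directory name inside the archive,
#   remaining arguments = instance directory names to create.
unpack_instances() {
    tarball="$1"; inner="$2"; shift 2
    for name in "$@"; do
        if [ -e "$name" ]; then
            echo "skip: $name already exists" >&2
            continue
        fi
        tar xf "$tarball" || return 1
        mv "$inner" "$name" || return 1
    done
}

# Usage with the names from this post:
# unpack_instances apache-cassandra-1.0.8-bin.tar apache-cassandra-1.0.8 \
#     cassA-1.0.8 cassB-1.0.8 cassC-1.0.8
```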
This gave me three instances to configure, each with its own IP address:
cassA-1.0.8    10.1.100.101
cassB-1.0.8    10.1.100.102
cassC-1.0.8    10.1.100.103
Edit the configuration files in each instance (cassA-1.0.8 used as the example):
$ vi cassA-1.0.8/conf/cassandra.yaml
[...]
# directories where Cassandra should store data on disk.
data_file_directories:
    - /home/bigdata/data/cassA

# commit log
commitlog_directory: /home/bigdata/commitlog/cassA

# saved caches
saved_caches_directory: /home/bigdata/saved_caches/cassA
[...]
# If blank, Cassandra will request a token bisecting the range of
# the heaviest-loaded existing node. If there is no load information
# available, such as is the case with a new cluster, it will pick
# a random token, which will lead to hot spots.
initial_token: 0
[...]
# Setting this to 0.0.0.0 is always wrong.
listen_address: 10.1.100.101
[...]
rpc_address: 10.1.100.101
[...]
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.1.100.101,10.1.100.102,10.1.100.103"
[...]
A separate log file for each instance is recommended. Edit the log4j config to set it:
$ vi cassA-1.0.8/conf/log4j-server.properties
[...]
log4j.appender.R.File=/home/bigdata/log/cassA.log
[...]
Repeat for instances cassB and cassC, changing the directories, addresses and log file to match each instance, and setting the token value for B and C to appropriate values (see Extra Credit below if you need to know how to do *that* part):
#cassB
initial_token: 56713727820156410577229101238628035242
#cassC
initial_token: 113427455640312821154458202477256070485
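If you would rather not hand-edit three copies of cassandra.yaml, the simple one-line settings can be changed with sed. The sketch below is an assumption-laden convenience, not something Cassandra ships: `set_yaml_key` is a made-up name, and it only handles top-level `key: value` lines such as listen_address, rpc_address or initial_token. Lists like data_file_directories and the seeds entry still need editing by hand.

```shell
# Hypothetical helper: replace a top-level "key: value" line in a YAML file.
# Limitations: scalar keys only; the value must not contain '|' or '&',
# because those characters are special in this sed expression.
set_yaml_key() {
    file="$1"; key="$2"; value="$3"
    # Write to a temporary file first so a failed sed cannot truncate the original.
    sed "s|^$key:.*|$key: $value|" "$file" > "$file.tmp" \
        && cat "$file.tmp" > "$file" \
        && rm -f "$file.tmp"
}

# Usage with the cassB token from this post:
# set_yaml_key cassB-1.0.8/conf/cassandra.yaml initial_token 56713727820156410577229101238628035242
```

Commented-out lines are left alone because the pattern is anchored to the start of the line.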
To enable the JMX management console, each instance will require its own port. Edit the env file to set that up:
$ vi cassA-1.0.8/conf/cassandra-env.sh
[...]
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="8001"
[...]
Repeated for the other two instances, defining 8002 and 8003 respectively.
Now, for the final trick, start up the instances:
$ cassA-1.0.8/bin/cassandra
$ cassB-1.0.8/bin/cassandra
$ cassC-1.0.8/bin/cassandra
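With more than a handful of instances, the start-up is easier as a loop. This is a rough sketch: `start_instances` is a name I made up, and it relies on my understanding that bin/cassandra puts itself in the background when run without the -f flag, so each call returns quickly.

```shell
# Hypothetical helper: start every instance directory passed as an argument.
# Stops with an error if a directory has no executable bin/cassandra launcher.
start_instances() {
    for dir in "$@"; do
        if [ -x "$dir/bin/cassandra" ]; then
            echo "starting $dir"
            "$dir/bin/cassandra" || echo "failed to start $dir" >&2
        else
            echo "no launcher found in $dir" >&2
            return 1
        fi
    done
}

# Usage with the instance names from this post:
# start_instances cassA-1.0.8 cassB-1.0.8 cassC-1.0.8
```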
Cluster elements started up, and they can be seen active in the process table here:
$ ps -lf
F S UID        PID  PPID  C PRI  NI ADDR SZ     WCHAN  STIME TTY          TIME CMD
0 S bigdata   4554     1  2  80   0 -    226846 futex_ 12:13 pts/0    00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4593     1  2  80   0 -    210824 futex_ 12:13 pts/0    00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4632     1  2  80   0 -    226830 futex_ 12:13 pts/0    00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 R bigdata   5047  3054  0  80   0 -      5483 -      12:16 pts/0    00:00:00 ps -lf
Finally, to check the status, connect to one of the nodes’ JMX ports and check the ring. You only need to connect to one of the cluster’s nodes to see the status of the complete cluster:
$ bin/nodetool -h 10.1.100.101 -port 8001 ring
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               113427455640312821154458202477256070485
10.1.100.101    datacenter1 rack1       Up     Normal  21.86 KB        33.33%  0
10.1.100.102    datacenter1 rack1       Up     Normal  20.28 KB        33.33%  56713727820156410577229101238628035242
10.1.100.103    datacenter1 rack1       Up     Normal  29.1 KB         33.33%  113427455640312821154458202477256070485
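Reading the ring by eye is fine for three nodes, but the check can also be scripted. The sketch below assumes the column layout shown in the output above (Status in the fourth column, State in the fifth) and that data rows start with an IPv4 address; `ring_healthy` is a hypothetical name, not a nodetool feature.

```shell
# Hypothetical helper: read `nodetool ring` output on stdin and succeed only
# if exactly $1 nodes are listed and every one of them is Up and Normal.
ring_healthy() {
    awk -v want="$1" '
        # Data rows start with an IPv4 address; the header and the
        # token-only wrap-around line do not match and are skipped.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            total++
            if ($4 == "Up" && $5 == "Normal") ok++
        }
        END { exit ((total + 0) == (want + 0) && (ok + 0) == (want + 0)) ? 0 : 1 }
    '
}

# Usage against the cluster from this post:
# bin/nodetool -h 10.1.100.101 -port 8001 ring | ring_healthy 3 && echo "ring OK"
```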
Now, that’s a functional three-instance cluster running on a single machine. The instances are not in separate VMs. If you wanted to experiment with a larger cluster by running multiple instances on multiple VMs on a single hypervisor, I don’t really see why you couldn’t!
In the next article, I’m going to start feeding data into the cluster. Stay tuned for that!
Extra Credit:
To create the token values I needed for this three-node ring, I used the following Perl script. BTW, bignum is required unless you want Perl printing these big numbers in scientific notation:
#!/usr/bin/perl
use bignum;

# Number of nodes comes from the first command-line argument.
my $nodes = shift;
print "Calculate tokens for $nodes nodes\n";
print "node 0\ttoken: 0\n" unless $nodes;
exit unless $nodes;

# Tokens are spread evenly over the range 0 .. 2**127.
my $factor = 2**127;
print "factor = $factor\n";

for (my $i=0;$i<$nodes;$i++) {
    my $token = $i * ( $factor / $nodes);
    print "node $i\ttoken: $token\n";
}
Running the script for three nodes gave me the following results (drop the fractional part when copying a value into initial_token):
$ ./maketokens.pl 3
Calculate tokens for 3 nodes
factor = 170141183460469231731687303715884105728
node 0  token: 0
node 1  token: 56713727820156410577229101238628035242.67
node 2  token: 113427455640312821154458202477256070485.34
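The same calculation can be sketched in shell with `bc`, which does arbitrary-precision arithmetic and, at its default scale of 0, truncates the division so the tokens come out as whole numbers. Treat this as an alternative sketch rather than the script used above: `make_tokens` is a made-up name, and it assumes `bc` is installed.

```shell
# Sketch: print evenly spaced integer tokens for N nodes over the range
# 0 .. 2^127, the same range the Perl script above uses.
# Assumes `bc` is available; plain shell arithmetic would overflow at 2^127.
make_tokens() {
    nodes="$1"
    i=0
    while [ "$i" -lt "$nodes" ]; do
        # bc evaluates (i * 2^127) / nodes and discards the remainder.
        token=$(echo "$i * 2^127 / $nodes" | bc)
        printf 'node %s\ttoken: %s\n' "$i" "$token"
        i=$((i + 1))
    done
}

# Usage:
# make_tokens 3
```

For three nodes this should print the same integers used for initial_token earlier in the post.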
Additional Comments:
If you are setting up a standard multi-box cluster, make sure you have the following ports opened up on any firewalls. If not, the cluster members won't find each other:
# TCP port, for commands and data
storage_port: 7000

# SSL port, for encrypted communication.  Unused unless enabled in
# encryption_options
ssl_storage_port: 7001
NEXT: Setting up a Java build env to prepare for Cassandra development