Cassandra – Running some simple tests, including a multi-get strategy.

PREV: Re-Configuring an Empty Cassandra Cluster

Time for the rubber to meet the road: load some data and validate the theoretical concepts gleaned from the documentation.

Here is an example record (IPs have been changed to protect the clueless):

      ip_key: 1598595809
          ip: 10.2.162.225
     prop_id: 1033
    property: Bad Stuff
      threat: 1
   attribute: suspicious
        meta: 10.25.112.7
    detected: 2012-01-05 15:17:14
detected_sec: 1325805434
    reported: 2012-01-06 01:44:02
reported_sec: 1325843042

The preliminary model concept centers on the IP; however, with over 60,000,000 records there are overlaps, so a single IP is not going to survive as the primary key. Getting a distribution out of MySQL takes some time. Here are the top distributions by key: thousands of events per IP, and this is just a short one-month window:

+------------+--------+
| ip_dec     | events |
+------------+--------+
| 3158358206 |   2705 |
|  652542280 |   2506 |
| 3495573656 |   2089 |
| 3232235778 |   2015 |
| 1072721396 |   1528 |
|  652542281 |   1432 |
| 3232235876 |   1427 |
| 3448822506 |   1232 |
| 1280052209 |   1106 |
| 3232235779 |   1086 |
+------------+--------+

Now, Cassandra will support MILLIONS of columns on a single row, so this actually might work, and scale, without using Super Column Families (SCFs). Using the detected-time seconds as the column name with an attribute suffix, then enclosing the data in a JSON blob, could provide the required results. The date key could serve as a secondary index across the columns, or as a time progression. These are concepts that need to be tested, which is precisely the task at hand.

Considering that a good detected time is not always available, and the data is processed in batches, there could be heavy grouping of timestamps. If a variety of issues are detected on a specific IP at the same obfuscated time, data will be lost. That is certainly NOT the desired result. Given this, the datestamp alone is not unique enough for a hash-structured datastore such as Cassandra, without using SCFs.

A structure such as this could deliver the required granularity:

ipstore[$ipkey][$timekey][$propkey] = JSON:{}, JSON:{}, JSON:{}... ;
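As a rough Java sketch of that layout (the variable names here are hypothetical, not the actual loader code), the row key is the IP key and the column name joins the time key and property key, so simultaneous events on the same IP no longer collide:

// Hypothetical sketch: row key = ip_key, column name = "<timekey>:<propkey>",
// column value = the rest of the event serialized as a JSON blob.
String rowKey      = "1598595809";               // ip_key
String columnName  = 1325805434L + ":" + 1033;   // detected_sec + ":" + prop_id
String columnValue = "{\"property\":\"Bad Stuff\",\"threat\":1,"
                   + "\"attribute\":\"suspicious\",\"meta\":\"10.25.112.7\"}";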

To get started with loading data, I wrote a quick test program in Java, compiled it, and ran it:

test1.java – source code

public class test1 {
  public static void main (String [] args) {
    System.out.println("Cassandra Calling!");
  }
}

compiling….

java/src/loader1$ javac test1.java -d ../../class/.

executing…

/java/class$ java test1
Cassandra Calling!

Environment confirmed for compiling loader code. With a model in mind…

ipstore[$ipkey][$propkey][$timestamp] = JSON:{}

...and IP data to load,

ipp < get_a_million.sql > a_million_ips.dta
cass:~$ ls -l
126180075 2012-03-13 13:06 a_million_ips.dta

cass:~$ wc -l a_million_ips.dta
1000001 a_million_ips.dta

...next up is designing the schema builder and loader.

REFERENCE: Setting up a Java build env to prepare for Cassandra development

With the environment confirmed, and a test file (test1.java) written, execute and verify function:

cass:~$ ant -DclassToRun=test1 run
Buildfile: ./build.xml

[...]

run:
     [java] This is Java.... drink up!

VERIFIED.
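For reference, the build.xml behind that invocation looks roughly like this (a minimal sketch reconstructed from the target names in the output; the actual file may differ):

<project name="cassIP" default="run" basedir=".">
    <property name="classToRun" value="test1"/>
    <target name="init">
        <mkdir dir="build/classes"/>
        <mkdir dir="dist/lib"/>
    </target>
    <target name="compile" depends="init">
        <javac srcdir="src" destdir="build/classes"/>
    </target>
    <target name="dist" depends="compile">
        <jar jarfile="dist/lib/cassIP.jar" basedir="build/classes"/>
    </target>
    <target name="run" depends="dist">
        <!-- the class name comes in from the command line: ant -DclassToRun=test1 run -->
        <java classname="${classToRun}" fork="true">
            <classpath>
                <pathelement location="dist/lib/cassIP.jar"/>
            </classpath>
        </java>
    </target>
</project>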

To keep moving forward, I created a Utilities class and a DB connector class. You can look at the source code for those at these two links:

Util Source Code

Cassandra DB Connector Source Code

With the code done, a couple of housekeeping tasks are needed to get it prepared for loading.

Adding the ks33 Keyspace:

[default@unknown] create keyspace ks33;
c7944700-6e2e-11e1-0000-13393ec611bd
Waiting for schema agreement...
... schemas agree across the cluster

[default@unknown] use ks33;
Authenticated to keyspace: ks33

Adding the cf33 ColumnFamily to ks33 Keyspace:

[default@ks33] create column family cf33 with comparator = UTF8Type; 
2501f8b0-6e2f-11e1-0000-13393ec611bd
Waiting for schema agreement...
... schemas agree across the cluster
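
Before pointing the real loader at it, a single-column insert into ks33/cf33 over the raw Thrift interface looks roughly like this (a hedged sketch, not the actual loader source; the host, keys, and values are just the examples from above):

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class InsertOne {
  public static void main(String[] args) throws Exception {
    // Thrift uses framed transport on port 9160
    TFramedTransport tr = new TFramedTransport(new TSocket("10.1.0.123", 9160));
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
    tr.open();
    client.set_keyspace("ks33");

    // row key = ip_key, column name = "<detected_sec>:<prop_id>", value = JSON blob
    Column col = new Column(ByteBuffer.wrap("1325805434:1033".getBytes("UTF-8")));
    col.setValue(ByteBuffer.wrap("{\"property\":\"Bad Stuff\",\"threat\":1}".getBytes("UTF-8")));
    col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds, by convention

    client.insert(ByteBuffer.wrap("1598595809".getBytes("UTF-8")),
                  new ColumnParent("cf33"), col, ConsistencyLevel.ONE);
    tr.close();
  }
}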

Next, load 100 trial rows. Here is a link to the source code:

Source for useMultiGet (tba)
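Until that's posted, the core of the comparison is roughly the following (a hedged sketch against the raw Thrift API, reusing the client setup from the insert example plus the java.util imports; the real test batches the keys into slices and times the loops):

// Individual gets: one round trip per row key.
ColumnPath path = new ColumnPath("cf33");
path.setColumn(ByteBuffer.wrap("1325805434:1033".getBytes("UTF-8")));
List<ByteBuffer> keys = Arrays.asList(
    ByteBuffer.wrap("1598595809".getBytes("UTF-8")),
    ByteBuffer.wrap("3158358206".getBytes("UTF-8")));
for (ByteBuffer key : keys) {
  ColumnOrSuperColumn one = client.get(key, path, ConsistencyLevel.ONE);
  System.out.println(new String(one.getColumn().getValue(), "UTF-8"));
}

// Multi-get: the same column for a whole slice of keys in one round trip.
SlicePredicate pred = new SlicePredicate();
pred.addToColumn_names(ByteBuffer.wrap("1325805434:1033".getBytes("UTF-8")));
Map<ByteBuffer, List<ColumnOrSuperColumn>> rows =
    client.multiget_slice(keys, new ColumnParent("cf33"), pred, ConsistencyLevel.ONE);
System.out.println("fetched " + rows.size() + " rows");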

hpcass@feed0:~/cassIP/java/cBuild$ host=10.1.0.123 port=9160 inserts=100 ks=ks33 cf=cf33 ant -DclassToRun=c01.useMultiGet run
Buildfile: /home/hpcass/cassIP/java/cBuild/build.xml

init:

compile:
    [javac] Compiling 1 source file to /home/hpcass/cassIP/java/cBuild/build/classes

dist:
      [jar] Building jar: /home/hpcass/cassIP/java/cBuild/dist/lib/cassIP.jar

run:
     [java] get time   89062577
     [java] mget time 494039096

BUILD SUCCESSFUL

Here are some results from the multi-get tests. It's actually the inverse of what I hoped: the multi-get seems to rapidly lose its benefit.

5 Item Slices  (1000 item dataset)
=========================================================
run:                    RUN 1      RUN 2      RUN 3   
     [java] get time  339041199  436440551  358115310
     [java] mget time 172484370  174690508  182833140

10 Item Slices  (1000 item dataset)
=========================================================
run:                    RUN 1      RUN 2      RUN 3   
     [java] get time  346512511  332820479  314136351
     [java] mget time 394049160  251152592  234719383

25 Item Slices  (1000 item dataset)
=========================================================
run:                    RUN 1      RUN 2      RUN 3   
     [java] get time  335286775  293802010  295948562
     [java] mget time 464933443  324505741  312226035

What I didn't expect to see, based on the information in the 'Cassandra High Performance Cookbook', was the rapid fall-off in multi-get performance; in fact, at slices of size 25 the performance inverted in all cases, becoming worse than individual gets.

2 Item Slices  (1000 item dataset)
=========================================================
run:                    RUN 1      RUN 2      RUN 3   
     [java] get time  285509637  331970814  317512021
     [java] mget time 104567639   96477512  124040195

One thing I hadn't thought of testing was a slice of size 1, to see if part of the perceived performance at the smaller slice sizes is really cache hits. AH! Look at this: the *test* itself is highly suspect at best. I think this is evidence that the performance 'benefit' of the multi-get is really a cache-hit artifact from extracting the exact same data a second time:

host=10.1.0.123 port=9160 inserts=1000 ks=ks33 cf=cf33 slice=1 ant -DclassToRun=c01.useMultiGet run
Buildfile: /home/hpcass/cassIP/java/cBuild/build.xml

1 Item Slices  (1000 item dataset)
=========================================================
run:                    RUN 1      RUN 2      RUN 3   
     [java] get time  295158535  298466321  283438099
     [java] mget time 109982545  103658894   98260286

This demonstrator's failure to perform is not a failure in and of itself. It provided useful information about concepts recommended in some of the documentation that may not really be true best practices. I long ago developed a healthy skepticism of expert advice taken without verification.
