Time for the rubber to meet the road: load some data and validate the theoretical concepts gleaned from the documentation.
This is an example record (IPs have been changed to protect the clueless):
ip_key:       1598595809
ip:           10.2.162.225
prop_id:      1033
property:     Bad Stuff
threat:       1
attribute:    suspicious
meta:         10.25.112.7
detected:     2012-01-05 15:17:14
detected_sec: 1325805434
reported:     2012-01-06 01:44:02
reported_sec: 1325843042
The preliminary model concept centers on the IP; however, with over 60,000,000 records there are overlaps, so a single IP is not going to survive as the primary key. Even getting a distribution out of MySQL takes some time. Here are the top counts by key: thousands of events per IP, and this is just a short one-month window:
+------------+--------+
| ip_dec     | events |
+------------+--------+
| 3158358206 |   2705 |
|  652542280 |   2506 |
| 3495573656 |   2089 |
| 3232235778 |   2015 |
| 1072721396 |   1528 |
|  652542281 |   1432 |
| 3232235876 |   1427 |
| 3448822506 |   1232 |
| 1280052209 |   1106 |
| 3232235779 |   1086 |
+------------+--------+
Now, Cassandra will support MILLIONS of columns on a single row, so this actually might work, and scale, without using Super Column Families (SCFs). Using the detected-time seconds as the column name, with an attribute suffix, and enclosing the data in a JSON blob could provide the required results. The date key could then act as a secondary index across the columns, or the columns could be walked as a time progression. These are concepts that need to be tested, which is precisely the task at hand.
Considering that a good detected time is not always available, and the data is processed in batches, there could be heavy grouping of timestamps. If a variety of issues are detected on a specific IP at the same obfuscated time, data will be lost, which is certainly NOT the desired result. Given this, the timestamp alone is not unique enough for a hash-structured datastore such as Cassandra, not without SCFs.
A structure such as this could deliver the required granularity:
ipstore[$ipkey][$timekey][$propkey] = JSON:{}, JSON:{}, JSON:{} ... ;
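As a sketch of how that nesting could flatten onto a single wide Cassandra row (row key = the IP, one column per event), the detected-seconds value plus a property suffix can serve as the column name, with the remaining fields serialized into the JSON value. The helper class below is hypothetical, just to make the layout concrete:

public class EventColumn {
    // Column name: detected seconds plus the property id, e.g. "1325805434:1033".
    // UTF8Type compares lexically; epoch seconds stay 10 digits until 2286, so
    // lexical order matches time order here without extra padding.
    static String columnName(long detectedSec, int propId) {
        return detectedSec + ":" + propId;
    }

    // Everything else about the event rides along as a JSON blob in the value.
    static String jsonBlob(String property, int threat, String attribute, String meta) {
        return String.format(
            "{\"property\":\"%s\",\"threat\":%d,\"attribute\":\"%s\",\"meta\":\"%s\"}",
            property, threat, attribute, meta);
    }
}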
To get started with loading data, I wrote a quick test program in Java, compiled it, and ran it:
test1.java – source code
public class test1 {
    public static void main (String [] args) {
        System.out.println("Cassandra Calling!");
    }
}
compiling….
java/src/loader1$ javac test1.java -d ../../class/.
executing…
/java/class$ java test1
Cassandra Calling!
Environment confirmed for compiling loader code. With a model in mind…
ipstore[$ipkey][$propkey][$timestamp] = JSON:{}
…and IP data to load:
ipp < get_a_million.sql > a_million_ips.dta
cass:~$ ls -l
126180075 2012-03-13 13:06 a_million_ips.dta
cass:~$ wc -l a_million_ips.dta
1000001 a_million_ips.dta
...next it's designing the schema builder and loader.
REFERENCE: Setting up a Java build env to prepare for Cassandra development
With the environment confirmed and a test file (test1.java) written, execute and verify that it functions:
cass:~$ ant -DclassToRun=test1 run
Buildfile: ./build.xml
[...]
run:
     [java] This is Java.... drink up!
VERIFIED.
To get moving forward, I created a Utilities class and a DB connector class. You can look at the source code for those here:
Cassandra DB Connector Source Code
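For readers without the link handy, the connector boils down to standard Thrift boilerplate. Here is a minimal sketch, assuming the Cassandra 1.0-era Thrift bindings; the class name and structure are mine, not the linked source:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class CassConnector {
    private TTransport transport;

    // Open a framed Thrift connection and bind it to a keyspace.
    public Cassandra.Client connect(String host, int port, String keyspace) throws Exception {
        transport = new TFramedTransport(new TSocket(host, port));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace(keyspace);
        return client;
    }

    public void close() {
        if (transport != null) transport.close();
    }
}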
With the code done, a couple of housekeeping tasks are needed to get things prepared for loading.
Adding the ks33 keyspace:
[default@unknown] create keyspace ks33;
c7944700-6e2e-11e1-0000-13393ec611bd
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown] use ks33;
Authenticated to keyspace: ks33
Adding the cf33 ColumnFamily to ks33 Keyspace:
[default@ks33] create column family cf33 with comparator = UTF8Type;
2501f8b0-6e2f-11e1-0000-13393ec611bd
Waiting for schema agreement...
... schemas agree across the cluster
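Since the loader will eventually need to build schema too, the same thing can be done from Java instead of cassandra-cli. A sketch against the 1.0-era Thrift API (method names are from the generated Thrift client; error handling omitted):

import java.util.ArrayList;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class SchemaBuilder {
    // Creates ks33/cf33 from code instead of the CLI. Assumes 1.0-era bindings,
    // where replication settings travel in strategy_options.
    static void build(Cassandra.Client client) throws Exception {
        KsDef ks = new KsDef("ks33",
                "org.apache.cassandra.locator.SimpleStrategy",
                new ArrayList<CfDef>());
        ks.putToStrategy_options("replication_factor", "1");   // adjust per cluster
        client.system_add_keyspace(ks);

        client.set_keyspace("ks33");
        CfDef cf = new CfDef("ks33", "cf33");
        cf.setComparator_type("UTF8Type");                     // matches the CLI command above
        client.system_add_column_family(cf);
    }
}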
Next, load 100 trial rows. Here is a link to the source code:
Source for useMultiGet (tba)
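Until that source is posted, here is the shape of the comparison useMultiGet runs: N individual get() calls versus one multiget_slice() over the same row keys, via the raw Thrift client. The names below are mine, the column name is a stand-in, and the real test carves the dataset into slice-sized groups rather than one big multiget:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;

public class MultiGetSketch {
    // Assumes each row already has a column written by the insert pass.
    static void compare(Cassandra.Client client, List<ByteBuffer> keys) throws Exception {
        ColumnPath path = new ColumnPath("cf33");
        path.setColumn("1325805434:1033".getBytes("UTF-8"));   // stand-in column name
        ColumnParent parent = new ColumnParent("cf33");

        long t0 = System.nanoTime();
        for (ByteBuffer key : keys) {
            client.get(key, path, ConsistencyLevel.ONE);       // one round trip per key
        }
        System.out.println("get time  " + (System.nanoTime() - t0));

        // Empty start/finish = all columns, capped at 100 per row.
        SlicePredicate pred = new SlicePredicate();
        pred.setSlice_range(new SliceRange(
                ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));

        long t1 = System.nanoTime();
        client.multiget_slice(keys, parent, pred, ConsistencyLevel.ONE);  // one batched call
        System.out.println("mget time " + (System.nanoTime() - t1));
    }
}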
hpcass@feed0:~/cassIP/java/cBuild$ host=10.1.0.123 port=9160 inserts=100 ks=ks33 cf=cf33 ant -DclassToRun=c01.useMultiGet run
Buildfile: /home/hpcass/cassIP/java/cBuild/build.xml
init:
compile:
    [javac] Compiling 1 source file to /home/hpcass/cassIP/java/cBuild/build/classes
dist:
      [jar] Building jar: /home/hpcass/cassIP/java/cBuild/dist/lib/cassIP.jar
run:
     [java] get time 89062577
     [java] mget time 494039096
BUILD SUCCESSFUL
Here are some results from the multi-get tests. It's actually the inverse of what I hoped for: the multi-get rapidly loses its benefit as the slice size grows.
5 Item Slices (1000 item dataset)
=========================================================
run:                  RUN 1       RUN 2       RUN 3
 [java] get time  339041199   436440551   358115310
 [java] mget time 172484370   174690508   182833140

10 Item Slices (1000 item dataset)
=========================================================
run:                  RUN 1       RUN 2       RUN 3
 [java] get time  346512511   332820479   314136351
 [java] mget time 394049160   251152592   234719383

25 Item Slices (1000 item dataset)
=========================================================
run:                  RUN 1       RUN 2       RUN 3
 [java] get time  335286775   293802010   295948562
 [java] mget time 464933443   324505741   312226035
What I didn't expect to see, based on the information in the 'High Performance Cookbook', was the rapid fall-off in multi-get performance; in fact, at slices of size 25 the relationship inverted and multi-get became the slower of the two.
2 Item Slices (1000 item dataset)
=========================================================
run:                  RUN 1       RUN 2       RUN 3
 [java] get time  285509637   331970814   317512021
 [java] mget time 104567639    96477512   124040195
One thing I didn't think of testing was a slice of size 1, to see whether part of the perceived performance at the smaller slice sizes is really cache hits. AH! Look at this; the *test* itself is highly suspect at best. I think this is evidence that the performance 'benefit' of the multi-get is really a cache-hit artifact from extracting the exact same data a second time:
host=10.1.0.123 port=9160 inserts=1000 ks=ks33 cf=cf33 slice=1 ant -DclassToRun=c01.useMultiGet run
Buildfile: /home/hpcass/cassIP/java/cBuild/build.xml

1 Item Slices (1000 item dataset)
=========================================================
run:                  RUN 1       RUN 2       RUN 3
 [java] get time  295158535   298466321   283438099
 [java] mget time 109982545   103658894    98260286
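If I revisit the harness, one way to control for that artifact would be to give each timing pass its own shuffled, non-overlapping set of row keys, so the multi-get pass can't be served out of caches warmed by the get pass. A quick sketch (names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class KeyShuffle {
    // Carve the key pool into shuffled, non-overlapping per-pass sets.
    // Requires allKeys.size() >= passes * perPass.
    static List<List<String>> disjointKeySets(List<String> allKeys, int passes, int perPass) {
        List<String> pool = new ArrayList<String>(allKeys);
        Collections.shuffle(pool);
        List<List<String>> sets = new ArrayList<List<String>>();
        for (int i = 0; i < passes; i++) {
            sets.add(pool.subList(i * perPass, (i + 1) * perPass));
        }
        return sets;
    }
}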
This demonstrator's failure to perform is not a failure in and of itself. It provided useful information about concepts recommended in some of the documentation that may not really be true best practices. I long ago developed a healthy skepticism of expert advice offered in lieu of verification.