TJ

Next Meeting:

Friday July 8, 2016: In-Person

Milestones

(June)July 8, 2016: Successful execution of the byteCount program with the custom InputFormat against the digital corpus on grace. Requires converting E01-formatted images to raw format. This will demonstrate successful execution of a custom FileInputFormat on actual disk images (an InputFormat sketch follows the task list below).

Dependent Task: Confirm when the transfer of the corpus to hamming will be completed. (HPC Team)
TJ Task: Convert the corpus from E01 to raw format. Build all libraries in /work/DEEP/.
Dependent Task: Requires a libewf install on hamming and grace (we assume we can do a local install).
TJ Task: Transfer the raw corpus into HDFS.
TJ Feasibility Check: Can we convert E01 to raw and transfer to HDFS in parallel?
The constraint is storage capacity when using PBS. Reference the cecombs and jvanbrua scripts in /work.
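
For reference, here is a minimal sketch of the kind of custom FileInputFormat involved, following the standard Hadoop whole-file pattern: each raw image is handed unsplit to a single map task. The class names (RawImageInputFormat, RawImageRecordReader) are placeholders rather than the actual byteCount code, and buffering an entire image in memory is only reasonable for small test images; the real reader would stream the bytes in chunks.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch only: placeholder names, not the actual byteCount InputFormat.
public class RawImageInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Never split a disk image at HDFS block boundaries; each file goes
    // to exactly one map task.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RawImageRecordReader();
    }

    // Emits a single (file path, file bytes) record per input file.
    public static class RawImageRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path file = fileSplit.getPath();
            byte[] contents = new byte[(int) fileSplit.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}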

(June)July 2016: Bulk Extractor execution of a single email scanner in the Hadoop MapReduce paradigm, using Java JNA to call the Bulk Extractor APIs (a JNA sketch follows the task list below).

TJ Task: Local install of bulk_extractor on grace. May require multiple package dependencies.
TJ Task: Gain familiarity with JNA and Bulk Extractor API interaction in the non-MapReduce paradigm (Mark's be-java code).
TJ Task: Finish writing/modifying Mark's Bulk Extractor MapReduce code.
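
For orientation before digging into Mark's be-java code, the JNA pattern looks roughly like the sketch below. The native library name ("bulk_extractor") and the be_scan_buffer() signature are assumptions for illustration, not the real Bulk Extractor API.

import com.sun.jna.Library;
import com.sun.jna.Native;

// Sketch only: library name and entry point are placeholders until checked
// against the actual Bulk Extractor API / Mark's be-java code.
public interface BulkExtractorLib extends Library {
    BulkExtractorLib INSTANCE =
            (BulkExtractorLib) Native.loadLibrary("bulk_extractor", BulkExtractorLib.class);

    // Hypothetical entry point: run one named scanner (e.g. "email") over a byte buffer.
    int be_scan_buffer(String scannerName, byte[] buf, int len);
}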

July 2016: Successful Bulk Extractor execution on grace using corpus raw disk images in HDFS.
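
A rough sketch of how the mapper could tie the two sketches above together; the record types match the RawImageInputFormat placeholder and the scanner call matches the placeholder JNA interface, so none of this is the final code.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: depends on the placeholder BulkExtractorLib interface above.
public class EmailScannerMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {

    @Override
    protected void map(Text imagePath, BytesWritable image, Context context)
            throws IOException, InterruptedException {
        // Hand the raw image bytes to the native email scanner through JNA
        // and emit (image path, hit count) for the reducer to aggregate.
        int hits = BulkExtractorLib.INSTANCE.be_scan_buffer(
                "email", image.getBytes(), image.getLength());
        context.write(imagePath, new IntWritable(hits));
    }
}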

August 2016: Exploratory changes to HDFS settings and the custom InputFormat to examine any performance gains or degradations. Test MapReduce execution of Bulk Extractor on varying disk sizes to determine whether any benefits exist for a particular disk size.
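
One place these experiments can be driven from is the job driver; the sketch below shows the kind of knobs involved. The 128 MB values are examples only (choosing them is the point of the experiments), and the split-size settings only matter if the InputFormat allows splitting at all.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch only: example tuning knobs for the August experiments.
public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default HDFS block size for files written with this configuration.
        conf.set("dfs.blocksize", String.valueOf(128L * 1024 * 1024));
        Job job = Job.getInstance(conf, "bulk_extractor email scan");
        // Bounds on the MapReduce input split size.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}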

September 2016: Ideally start writing thesis.

January 2017: Draft thesis to advisors

March 2017: Final thesis submission for signatures

Task Checklist

Status | Estimated Duration | Responsibility | Task
Done | 1 week | HPC Team | Confirm corpus transfer to hamming
Done | 1 week | TJ | Feasibility check: can we convert E01 to raw and transfer to HDFS in parallel?
Done | 1 week | HPC Team | Install libewf on grace
Done | 1 week | TJ | Convert corpus from E01 to raw, transfer to HDFS
Done | 1 week | TJ | Remove raw images from hamming after conversion
Done | 1 month | TJ | Successful execution of the byteCount program with the custom InputFormat on raw corpus images on grace
Done | 1 week | HPC Team | Install Bulk Extractor dependencies on grace
Done | 1 week | TJ | Install Bulk Extractor
| 2 weeks | TJ | Familiarity using JNA to call Bulk Extractor APIs in the non-MapReduce paradigm
| 2 weeks | TJ | Write MapReduce job to execute Bulk Extractor APIs in the MapReduce paradigm
| 1 month | TJ | Successful execution of the single Bulk Extractor email scanner using the MapReduce paradigm on a subset of corpus data
| 1 month | TJ | Successful execution of the MapReduce Bulk Extractor email scanner against the full corpus; tweak input size and measure performance gain/loss
| 2 months | TJ | Write thesis
| 1 month | Advisors | Review draft thesis
| 2 months | TJ/Advisors | Modify/review/finalize thesis for submission

TODO

  • As you complete milestones, please leave pointers to notes on how to replicate them (link to online howto is fine if it exists).
  • At your soonest convenience, update this page with your notes, including how to get to the error you are currently trying to resolve.
  • By next week, please get your version of WordCount.java to run, and point to a howto so I can repeat it.


FOR Michael

  • Find and revive Mark's Bulk Extractor Code.
  • Read TJ's updated notes and work through tutorials.

Errors

  • RESOLVED 7/14/2015: Within /home/hduser/bin, the command below was executed and produced the error shown:
    • A particular link in the second paragraph was what gave me the idea that the problem was not being in the same directory.
[hduser@hadoopMaster bin]$ hadoop jar wc.jar /hdfs/wordcount/input /hdfs/wordcount/output
Exception in thread "main" java.lang.ClassNotFoundException: /hdfs/wordcount/input
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
[hduser@hadoopMaster bin]$ 

It was pointed out by Mark Gondree and various resources that this could be because the environment variables were not configured correctly. wc.jar is our own compiled version of the WordCount.java found at ApacheWordcount.

Since the environment variables were incorrect, here are the current settings that work, in /home/hduser/.bashrc (which is on all datanodes as well):
#JAVA
export JAVA_HOME=/opt/jdk1.7.0_71
export PATH=$JAVA_HOME/bin:$PATH
#Hadoop
export HADOOP_HOME=/opt/hadoop-2.5.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
export CLASSPATH=`hadoop classpath`
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar


The issue was due to the .jar file being in a different directory than the .class files the program was compiled with. Simply moving the .jar file into the WordCount/ directory made it run without error. The steps used to compile and build the jar file are below:

In /home/hduser/bin/WordCount:
javac WordCount.java
jar cf wc.jar WordCount.*class
To verify it runs:
hadoop jar wc.jar WordCount /hdfs/wordcount/input /hdfs/wordcount/output
Make sure /hdfs/wordcount/output DOES NOT EXIST in HDFS, or the job will not run.
This is a "safety" feature of Hadoop to make sure you do not accidentally overwrite data.
To remove this directory if it exists:
hadoop fs -rm -r -f /hdfs/wordcount/output/
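
An alternative, shown here only as a convenience sketch (it is not part of the current WordCount driver), is to delete the output path programmatically before submitting the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: remove a leftover output directory so the "output already exists"
// check does not stop the next run.
public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/hdfs/wordcount/output");
        if (fs.exists(out)) {
            fs.delete(out, true);  // true = recursive
        }
    }
}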

Hadoop/HDFS start/stop

To start:

  • /opt/hadoop-2.5.1/sbin/start-dfs.sh
  • /opt/hadoop-2.5.1/sbin/start-yarn.sh
Verification on hadoopMaster: jps  
PID NameNode
PID Jps
PID SecondaryNameNode
PID ResourceManager
Verification on slave nodes: jps
PID NodeManager
PID Jps
PID DataNode
MapReduce job Verification:
hadoop jar /opt/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount \
    /hdfs/wordcount/input /hdfs/wordcount/output

To stop:

  • /opt/hadoop-2.5.1/sbin/stop-yarn.sh
  • /opt/hadoop-2.5.1/sbin/stop-dfs.sh