Place pst files in pst-extract/pst/, then run the following scripts:
- bin/explode_psts.sh - runs readpst to convert pst to mbox
- bin/normalize_mbox.sh - converts mbox files to json
- bin/run_spark_tika.sh - uses tika to extract the text of attachments
- bin/run_tika_content_join.sh - joins attachment text with the email json
- bin/run_spark_content_split.sh - removes base64 encoded attachments from the email json and puts the json into a separate directory
- bin/run_spark_emailaddr.sh - email address extraction and community assignment
- bin/run_spark_email_community_assign.sh - assigns communities to email json objects
- bin/run_spark_topic_clustering.sh - assigns topic clustering to the email json objects output by community assign
- bin/run_spark_mitie.sh - runs MITIE to generate entities for each email and adds them to the email json generated by topic clustering
- bin/run_spark_es_ingest_emailaddr.sh - ingests emailaddrs to the ES index
- bin/run_spark_es_ingest_attachments.sh - ingests attachments to the ES index
- bin/run_spark_es_ingest_emails.sh - ingests emails with entities to the ES index
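
Several of the scripts above consume the output of an earlier one, so it may be convenient to chain them. The following is a minimal sketch, not part of the project itself. It assumes the listed order is the intended run order, that each script takes no arguments, and that it is invoked from the repository root. `DRY_RUN` is a hypothetical variable introduced here for illustration; it defaults to printing the plan without running anything.

```shell
#!/bin/sh
# Sketch of a driver for the ingest scripts listed above.
# Assumptions (not confirmed by the source): scripts take no arguments,
# are run from the repository root, and the list order is the run order.
set -eu

STEPS="
bin/explode_psts.sh
bin/normalize_mbox.sh
bin/run_spark_tika.sh
bin/run_tika_content_join.sh
bin/run_spark_content_split.sh
bin/run_spark_emailaddr.sh
bin/run_spark_email_community_assign.sh
bin/run_spark_topic_clustering.sh
bin/run_spark_mitie.sh
bin/run_spark_es_ingest_emailaddr.sh
bin/run_spark_es_ingest_attachments.sh
bin/run_spark_es_ingest_emails.sh
"

run_pipeline() {
  for step in $STEPS; do
    # DRY_RUN is hypothetical: 1 (the default) only prints the plan.
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $step"
    else
      echo "running: $step"
      # With 'set -e', a failing step stops the whole run here.
      "./$step"
    fi
  done
}

run_pipeline
```

Saved as, say, run_all.sh (a hypothetical name), `sh run_all.sh` prints the planned steps, and `DRY_RUN=0 sh run_all.sh` executes them, stopping at the first script that exits with a non-zero status.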
**Location Extraction**
Locations extracted from text
- bin/build_clavin_index.sh - sets up the location index (only needs to be run once)
- bin/run_location_extract.sh - extracts locations from the text body; uses input from the bin/run_spark_content_split task
Locations extracted by IP
- bin/setup_geo2ip.sh - sets up the geoip index
- bin/run_spark_originating_location.sh - extracts location from the IP address
This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.