To install from Hackage, run:
cabal install smap
To install from source, you can use the command above, or download this repo and run:
stack install smap
You will need to install cabal or stack if you don't already have one of them.
The setup:
cat > patients << EOF
Bob Smith
Jane Doe
John Smith
Carol Carell
EOF
cat > has_cold << EOF
Jane Doe
John Smith
EOF
cat > has_mumps << EOF
Jane Doe
Carol Carell
EOF
Sick patients:
$ smap cat has_cold has_mumps
Jane Doe
John Smith
Carol Carell
You can also use - instead of a filename to represent stdin/stdout. (This works for any command.)
$ cat has_cold | smap cat - has_mumps
Jane Doe
John Smith
Carol Carell
If you don't provide any arguments, cat will assume you mean stdin.
$ cat has_cold has_mumps | smap cat
Jane Doe
John Smith
Carol Carell
Healthy patients:
$ smap sub patients has_cold has_mumps
Bob Smith
Patients with both a cold and mumps:
$ smap int has_cold has_mumps
Jane Doe
Patients who only have a cold or mumps, but not both:
$ smap xor has_cold has_mumps
Carol Carell
John Smith
When using smap with sets, the behavior is pretty straightforward. It gets a bit more complicated when dealing with maps.
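As a rough point of comparison, the set examples above can be approximated with standard POSIX tools. This is only a sketch of the semantics, not how smap is implemented; the scratch directory and the variable names are assumptions made for illustration.

```shell
# Recreate the example files in a scratch directory (hypothetical location).
dir=$(mktemp -d)
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > "$dir/patients"
printf '%s\n' 'Jane Doe' 'John Smith' > "$dir/has_cold"
printf '%s\n' 'Jane Doe' 'Carol Carell' > "$dir/has_mumps"

# Union (like `smap cat`): print each distinct line the first time it appears.
union=$(awk '!seen[$0]++' "$dir/has_cold" "$dir/has_mumps")

# Difference (like `smap sub`): drop every patient listed in either file.
healthy=$(grep -vxF -f "$dir/has_cold" "$dir/patients" | grep -vxF -f "$dir/has_mumps")

# Intersection (like `smap int`): keep has_cold lines that also appear in has_mumps.
both=$(grep -xF -f "$dir/has_mumps" "$dir/has_cold")

# Symmetric difference (like `smap xor`): lines that occur exactly once overall.
# Assumes no input file repeats a line internally.
either=$(LC_ALL=C sort "$dir/has_cold" "$dir/has_mumps" | uniq -u)

printf '%s\n' "$union" '--' "$healthy" '--' "$both" '--' "$either"
```

The sort-based symmetric difference has to read all of its input before printing anything, whereas the awk-based union can print lines as they arrive.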
If you provide smap with a filepath, it will construct a map where the keys equal the values. (This is equivalent to a set.) If you pass in +file1,file2 as an argument, smap will construct a map using lines from file1 as keys and lines from file2 as values.
We can get a list of patient last names using cut -f 2 -d ' ' <patient file>. Using those last names as keys:
$ smap cat +<(cut -f 2 -d ' ' patients),patients
Bob Smith
Jane Doe
Carol Carell
To understand the above:
- <(cut -f 2 -d ' ' patients) gets a list of all the patients' last names and creates a virtual file with this list. See bash process substitution.
- +<(cut -f 2 -d ' ' patients),patients constructs a stream where the keys are the last names and the values are the whole names.
cat deduplicates by key, so if we see a second (or third, or fourth, etc.) person from a given family we don't print them out.
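The deduplicate-by-key rule can be sketched with cut, paste, and awk. This only models the behavior described above and is not smap's implementation; the tab-separated key/value layout, the scratch directory, and the variable names are assumptions of the sketch.

```shell
# Recreate the patients file in a scratch directory (hypothetical location).
dir=$(mktemp -d)
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > "$dir/patients"

# Pair each last name (key) with the full line (value), separated by a tab.
pairs=$(cut -d ' ' -f 2 "$dir/patients" | paste - "$dir/patients")

# Print a value only the first time its key is seen.
first_per_family=$(printf '%s\n' "$pairs" | awk -F '\t' '!seen[$1]++ { print $2 }')

printf '%s\n' "$first_per_family"
```

John Smith is dropped because the key Smith was already seen on Bob Smith's line.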
$ smap int +<(cut -f 2 -d ' ' patients),patients <(cut -f 2 -d ' ' has_cold)
Bob Smith
Jane Doe
John Smith
To understand the above:
- <(cut -f 2 -d ' ' patients) gets a list of all the patients' last names.
- +<(cut -f 2 -d ' ' patients),patients constructs a stream where the keys are the last names and the values are the whole names.
- <(cut -f 2 -d ' ' has_cold) gets a list of family names of everyone who has a cold.
So int is filtering the first argument (treated as a key,value stream) by the keys present in the second argument.
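The same key-based filtering can be sketched with awk's two-file idiom. As before, this models the behavior rather than smap's implementation; the tab-separated pairs, the scratch files, and the variable names are assumptions of the sketch.

```shell
# Recreate the example files in a scratch directory (hypothetical location).
dir=$(mktemp -d)
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > "$dir/patients"
printf '%s\n' 'Jane Doe' 'John Smith' > "$dir/has_cold"

# Key/value stream: last name, tab, full name.
cut -d ' ' -f 2 "$dir/patients" | paste - "$dir/patients" > "$dir/pairs"

# Keys to keep: last names of everyone who has a cold.
cut -d ' ' -f 2 "$dir/has_cold" > "$dir/cold_keys"

# First file fills the key set; second file is filtered by it.
# Values are printed for every matching key, with no deduplication.
exposed=$(awk -F '\t' 'NR == FNR { keep[$1]; next } $1 in keep { print $2 }' \
  "$dir/cold_keys" "$dir/pairs")

printf '%s\n' "$exposed"
```

Bob Smith appears in the result even though he is not in has_cold, because the filter compares keys (last names) rather than values.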
If you're processing lots of lines and running up against memory limits,
you can use the --approximate option to keep track of a 64-bit hash
of each line instead of the entire line.
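The idea of remembering a hash rather than the line itself can be sketched in plain shell. This is an illustration only and not smap's code: the function name dedup_by_hash is hypothetical, and cksum (a 32-bit CRC) stands in for the 64-bit hash mentioned above.

```shell
# Deduplicate stdin while remembering only a checksum per line.
# cksum is a 32-bit stand-in for illustration; the names here are hypothetical.
dedup_by_hash() {
  seen=""
  while IFS= read -r line; do
    # cksum prints "<checksum> <byte count>"; keep only the checksum.
    h=$(printf '%s' "$line" | cksum | cut -d ' ' -f 1)
    case " $seen " in
      *" $h "*) ;;                                   # hash already recorded: skip
      *) seen="$seen $h"; printf '%s\n' "$line" ;;   # new hash: record and print
    esac
  done
}

result=$(printf '%s\n' 'Jane Doe' 'John Smith' 'Jane Doe' | dedup_by_hash)
printf '%s\n' "$result"
```

The trade-off in any scheme like this is that two different lines with the same hash would be treated as the same line, which is presumably why the option is called approximate.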