cat data_file.txt | egrep '^X' | sort | uniq | wc -l
This counts the number of unique lines in the text file that begin with X. The great thing is that it's quite readable. You can read the pipeline from left to right:
"we take the data in data_file.txt, extract the lines that begin with X, sort them, discard the (consecutive) dupes, and then count the number of lines."
Unfortunately, most programming languages force you to do this back-to-front. In many languages, you'd have to do something crazy like:
wc_l ( uniq ( sort ( filter (data_file, beginsWithX))))
Function application should be left-to-right, not right-to-left. That feels more natural. I've written a little tool that allows me to program simple tools in the Haskell language in this style. Instead of using sed and awk and perl and python from the command line, I hope to use this tool instead. I'm calling it l2r and it wraps around runhugs, a Haskell interpreter.
l2r "x <- cat \"edge_list.txt\"; x & lines & filter ( head # (=='X') ) & sort & uniq & unlines & putStr"
The important things are the ampersands (&). They act like the pipe (|) in bash, allowing functions to be listed left-to-right in the natural way. x is simply a (lazy) string that holds the entire contents of the file.
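For readers who want to try the idea in plain Haskell, here is a minimal sketch of how operators like these could be defined. This is not the actual l2r source: the definitions of &, # and uniq below are my assumptions, inferred only from how they are used in the one-liner above. The helper uniqueXLines is a hypothetical name, and it tests take 1 rather than head so that empty lines do not crash it.

```haskell
import Data.List (group, sort)

-- Assumed definition: reverse application, value on the left, function on the right.
infixl 1 &
(&) :: a -> (a -> b) -> b
x & f = f x

-- Assumed definition: left-to-right function composition.
infixl 9 #
(#) :: (a -> b) -> (b -> c) -> a -> c
f # g = g . f

-- Like the shell's uniq: drop consecutive duplicates only.
uniq :: Eq a => [a] -> [a]
uniq = concatMap (take 1) . group

-- Hypothetical helper: the pure part of the pipeline, read left to right.
-- take 1 is used instead of head so that empty lines are simply filtered out.
uniqueXLines :: String -> String
uniqueXLines contents =
  contents & lines & filter (take 1 # (== "X")) & sort & uniq & unlines
```

Note that base also ships an operator called & in Data.Function with the same meaning, so the local definition is only there to keep the sketch self-contained.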
You might have hoped that you could simply do something like
l2r "cat \"edge_list.txt\" & lines & filter ( head # (=='X') ) & sort & uniq & unlines & putStr"
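instead. But you can't. I'm not going to explain it fully here, but there is an important difference between x and cat. x is a plain String, but cat "edge_list.txt" is an IO action, a value in the IO monad, which describes a (potentially side-effect-ful) operation that produces a String when it is run. The & operator needs plain data on its left, so the result has to be bound to a name with <- first.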
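The difference can be sketched in ordinary Haskell. I am assuming here that cat behaves like the standard readFile and that & is plain reverse application; neither is taken from the l2r source. The function names countLinesBound and countLinesMapped are made up for illustration.

```haskell
-- Assumption: l2r's cat behaves like the (lazy) readFile from the Prelude.
cat :: FilePath -> IO String
cat = readFile

-- Assumed definition of &: it needs a plain value on its left.
infixl 1 &
(&) :: a -> (a -> b) -> b
x & f = f x

-- Option 1: bind the result of the action to x, then pipe the plain String.
countLinesBound :: FilePath -> IO Int
countLinesBound path = do
  x <- cat path
  return (x & lines & length)

-- Option 2: leave the action unnamed and map the pure pipeline over it.
countLinesMapped :: FilePath -> IO Int
countLinesMapped path = fmap (\x -> x & lines & length) (cat path)
```

Writing cat path & lines directly is rejected by the type checker, because lines expects a String and cat path is an IO String.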
A more useful example for me in my research is:
l2r 'x <- cat "edge_list.txt"; x & lines & map words & map (take 2) & map (map readInt) & filter ( \(a:b:[]) -> a /= b ) & map sort & sort & uniq & map (map show) & map unwords & unlines & putStr'
This code identifies all the unique undirected edges in an edge list representation of a network.
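The same pipeline can be written as a pure function in a standalone Haskell file, which makes each stage easier to check. This is only a sketch: &, uniq and readInt are my guesses at what l2r provides (readInt is assumed to be read at type Int), and uniqueEdges is a hypothetical name. Unlike the lambda in the one-liner, the filter here quietly drops lines with fewer than two fields instead of failing on them.

```haskell
import Data.List (group, sort)

-- Assumed definition: reverse application.
infixl 1 &
(&) :: a -> (a -> b) -> b
x & f = f x

-- Drop consecutive duplicates, like the shell's uniq.
uniq :: Eq a => [a] -> [a]
uniq = concatMap (take 1) . group

-- Assumption: readInt is just read specialised to Int.
readInt :: String -> Int
readInt = read

-- Hypothetical helper: one "a b" line per unique undirected edge, self-loops removed.
uniqueEdges :: String -> String
uniqueEdges contents =
  contents
    & lines
    & map words
    & map (take 2)          -- keep only the two endpoint columns
    & map (map readInt)
    & filter isProperEdge   -- drop self-loops and malformed lines
    & map sort              -- treat (a, b) and (b, a) as the same edge
    & sort
    & uniq
    & map (map show)
    & map unwords
    & unlines
  where
    isProperEdge [a, b] = a /= b
    isProperEdge _ = False
```

For example, the input lines "1 2 0.5", "2 1", "3 3" and "4 1" reduce to the two edges "1 2" and "1 4".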
Finally, here is the l2r code itself:
Any questions or feedback to aaronmcdaid@gmail.com
A simple example to count the length of the lines in a file is:
cat file.txt | l2r 'x <- stdin; x & lines & map length & print'
To count the number of words, not characters, use:
cat file.txt | l2r 'x <- stdin; x & lines & map words & map length & print'
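For comparison, the pure parts of these two one-liners might look like this in a standalone Haskell file. It is a sketch under the assumption that & is reverse application; lineLengths and wordCounts are names I made up for illustration.

```haskell
-- Assumed definition: reverse application.
infixl 1 &
(&) :: a -> (a -> b) -> b
x & f = f x

-- Number of characters on each line.
lineLengths :: String -> [Int]
lineLengths x = x & lines & map length

-- Number of whitespace-separated words on each line.
wordCounts :: String -> [Int]
wordCounts x = x & lines & map words & map length
```

To read from standard input in plain Haskell, where l2r's stdin is not available, something like getContents >>= (print . lineLengths) should do the job.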
I think Perl can be used with both styles, so you pick whichever you find most readable at any place?
Why not just use arrows and (>>>)? You could use a nicer (shorter) operator like &. You could easily have something like:
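import Control.Arrow ((>>>))
from :: IO String -> (String -> IO ()) -> IO ()
from source f = source >>= f
-- here (&) is left-to-right function composition
(&) :: (a -> b) -> (b -> c) -> (a -> c)
(&) = (>>>)
then in main:
main = do
  -- getContents stands in for l2r's stdin, which plain Haskell does not provide as an IO String
  from getContents $ lines & map length & mapM_ print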
Thanks id,
That does make sense. But it might be difficult to explain why two symbols, '$' and '&', are needed. I like the unix command line tools where '|' can be used throughout.
I like all these solutions, but none of them is 100% perfect yet :-)
Maybe I want a single operator which will do 'The Right Thing' regardless of whether it has data, an IO action or a function on its left.