How to count the number of occurrences of each word in each line of a file?



Suppose we have a file, words.txt, in which each line is a single word, and we would like to count the occurrences of each unique word and sort the words by count in descending order.

More specifically, we have words.txt as follows:

  foo
  bar
  foo
  foo
  bar

And we would like to get the result as follows:

  word,count
  foo,3
  bar,2

We use the following steps to approach this problem:
  1. Read and sort the content, then count the occurrences of each unique word
  2. Sort in descending order by count
  3. Print each word and its count in the "<word>,<count>" format
  4. Finally, prepend the column names "word,count" and print the result





1. Read and sort the content, then count the occurrences of each unique word

$ cat words.txt | sort | uniq -c

Note that uniq only merges adjacent matching lines, which is why the input is sorted first. You can always check the options available for a command, e.g., uniq, with the --help option:

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences

Running the command from step 1 gives us the following result so far:

2 bar
3 foo
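
To see why the sort step matters, here is a minimal sketch that runs uniq -c with and without it. The sample function is a hypothetical helper that stands in for words.txt, and the trailing awk only strips the padding that uniq -c puts in front of each count.

```shell
# Hypothetical helper: prints the sample words, one per line.
sample() { printf 'foo\nbar\nfoo\nfoo\nbar\n'; }

# Without sort: uniq merges only adjacent lines, so each run is counted separately.
unsorted=$(sample | uniq -c | awk '{print $1, $2}')

# With sort: identical words become adjacent, so each word gets a single total.
sorted=$(sample | sort | uniq -c | awk '{print $1, $2}')

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted" "$sorted"
```

Without sort, foo and bar each appear twice in the output (1 foo, 1 bar, 2 foo, 1 bar); with sort, we get the expected 2 bar and 3 foo.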

2. Sort in descending order by count

$ cat words.txt | sort | uniq -c | sort -rn
where -n compares the counts as numbers rather than as strings, and -r reverses the order so that the largest counts come first. The result so far is:

3 foo
2 bar
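
The -n flag only makes a visible difference once the counts have different numbers of digits. The following sketch uses made-up counts (2 and 10, not taken from words.txt) to compare a plain reverse sort with a numeric one. LC_ALL=C is set here only to make the string comparison predictable.

```shell
# Hypothetical counts, chosen so that string order and numeric order disagree.
counts() { printf '2 bar\n10 foo\n'; }

# Reverse string sort: "2" sorts after "1", so "2 bar" comes first.
lexical=$(counts | LC_ALL=C sort -r)

# Reverse numeric sort: 10 is greater than 2, so "10 foo" comes first.
numeric=$(counts | LC_ALL=C sort -rn)

printf 'sort -r:\n%s\nsort -rn:\n%s\n' "$lexical" "$numeric"
```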

3. Print each word and its count in the "<word>,<count>" format using the awk command (awk is named after its developers: Aho, Weinberger, and Kernighan)

$ cat words.txt | sort | uniq -c | sort -rn | awk '{print $2","$1}'
Here $1 is the count and $2 is the word, so printing $2 first puts the word before the count. Now we are almost done, with the output as

foo,3
bar,2
and we just need to add the header line on top.
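
As a small illustration of the awk step, the sketch below feeds awk two lines padded the way uniq -c typically pads its counts. The padded function is a hypothetical stand-in for the output of the previous step. By default awk splits fields on runs of whitespace and ignores leading blanks, so the padding does not shift the field numbers. The second form, which sets the output field separator OFS, is an equivalent alternative.

```shell
# Hypothetical stand-in for the output of "uniq -c | sort -rn".
padded() { printf '      3 foo\n      2 bar\n'; }

# Concatenate the word ($2), a literal comma, and the count ($1).
csv=$(padded | awk '{print $2","$1}')

# Same result, letting awk insert the comma through OFS.
csv_ofs=$(padded | awk -v OFS=',' '{print $2, $1}')

printf '%s\n' "$csv"
```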

4. Finally, prepend the column names "word,count" using the header command-line tool from dsutils

$ cat words.txt | sort | uniq -c | sort -rn | awk '{print $2","$1}' | header -a word,count

Putting everything together, we have the final solution as follows:

$ cat words.txt | sort | uniq -c | sort -rn | awk '{print $2","$1}' | header -a word,count
with our desired output:

  word,count
  foo,3
  bar,2
