Guangyuan's Research and Development Blog: How to count the number of occurrences of each word in each line of a file?

Suppose we have a file, words.txt, in which each line contains a single word, and we would like to count the occurrences of each unique word and sort the results in descending order of count.

More specifically, we have words.txt as follows:


  foo
  bar
  foo
  foo
  bar

And we would like to get the result as follows:


  word,count
  foo,3
  bar,2

We use the following steps to approach this problem:

Read and sort the content, then count the occurrences of each unique word
Sort in descending order by count
Print each word and its count in the "<word>,<count>" format
Finally, add the column names "word,count" as a header line and print the result

1. Read and sort the content, then count the occurrences of each unique word


$cat words.txt | sort | uniq -c

Note that you can always check the options available for a command, e.g., uniq, with the --help option:


$uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences

This will give us the following results so far:


2 bar
3 foo
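
The sort step matters because, as the help text says, uniq only filters adjacent matching lines. A minimal sketch of the difference, assuming a temporary file that holds the same five words as words.txt (the variable names are only for illustration):

```shell
# Hypothetical sample file with the same content as words.txt above.
tmp=$(mktemp)
printf 'foo\nbar\nfoo\nfoo\nbar\n' > "$tmp"

# Without sorting, only adjacent duplicates are merged,
# so foo and bar each show up on more than one line.
unsorted=$(uniq -c < "$tmp")

# With sorting, all duplicates become adjacent and are counted once.
sorted=$(sort "$tmp" | uniq -c)

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted" "$sorted"
rm -f "$tmp"
```

The exact width of the leading padding in front of each count depends on the uniq implementation, which is why the later awk step refers to fields rather than to column positions.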

2. Sort in descending order by count


$cat words.txt | sort | uniq -c | sort -rn

where -rn tells sort to compare numerical values (-n) and to sort in reverse, i.e. descending, order (-r). The results so far are:


3 foo
2 bar
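
When several words share the same count, plain sort -rn leaves their relative order to sort's fallback comparison. If a predictable order is wanted, one option is to sort on two keys. The sketch below assumes a hypothetical input in which foo and bar both occur twice; it is a variation on this step, not part of the original pipeline:

```shell
# Hypothetical input with a tie: foo and bar both occur twice.
counts=$(printf 'foo\nbar\nbaz\nfoo\nbar\n' | sort | uniq -c)

# Key 1: the count, compared numerically, in descending order.
# Key 2: the word, in ascending order, used only to break ties.
ranked=$(printf '%s\n' "$counts" | sort -k1,1nr -k2,2)

printf '%s\n' "$ranked"
```

Here bar is listed before foo because the second key sorts tied words alphabetically, and baz comes last with a count of 1.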

3. Print the word and count information in the "<word>,<count>" format using the awk command (awk is named after the initials of its developers: Aho, Weinberger, and Kernighan)


$cat words.txt | sort | uniq -c | sort -nr | awk '{print $2","$1}'

Now we are almost done, with the output


foo,3
bar,2

and just need to add a header line on top.

4. Finally, add the column names "word,count" as a header line using the header command-line tool from dsutils


$cat words.txt | sort | uniq -c | sort -nr | awk '{print $2","$1}' | header -a word,count

Putting everything together, we have the final solution as follows:


$cat words.txt | sort | uniq -c | sort -nr | awk '{print $2","$1}' | header -a word,count

which gives our desired output:


  word,count
  foo,3
  bar,2
