Word Frequency Table
Finding the most common words in a text file is a famous “programming pearl”, originally posed by Jon Bentley (1984). Several interesting solutions have been proposed by Knuth (an exquisite model of “literate programming”, 1986), McIlroy (an engineering example of combining a timeless set of “UNIX tools”, 1986), Hanson (an alternate efficient solution, 1987).
This project requires a simple solution, comparable in style and performance to McIlroy’s pipelined solution (though not as performant as the much more complex solutions of Knuth and Hanson). For this, we will use a simple functional programming style based on .NET LINQ framework.
Problem statement: Given a text file and an integer k, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are the k largest (upper limit) in order of decreasing frequency; breaking any ties in increasing words’ alphabetical order. Words are consecutive sequences of ASCII alphabetical letters (A-Za-z), where letter case is ignored, and all other characters are ignored. A short minimalistic example will clarify these specs.
Sample text file, called test1.txt:
.....
Sample result, for k=3: The output must be a text file with two columns. Extra white spaces are permitted but discarded. Please use this format, because your output will be assessed by a script, using whitespace tolerant textual comparisons.
4 AA
2 CCC
2 DD
To understand this, let us consider the following intermediate steps. Word list extracted from this text file:
Canonical word list – upper case versions to ignore the case:
DD DD CCC AA AA AA CCC BB AA
An unsorted frequency table:
4 AA
1 BB
2 CCC
2 DD
A sorted frequency table: 1st by decreasing frequencies, 2nd by increasing alphabetical word order:
4 AA
2 CCC
2 DD
1 BB
Taking the first k=3 rows only, we obtain the above shown result.
Programs: (1) a C# solution (required); (2) a Node JS solution (required); and
(3) an F# solution (optionally, bonus). Essentially, each program must be totally contained in one single file and use only the standard libraries extant in the labs.
The C# solution must use the .NET LINQ framework and System.IO.
The Node JS solution must only use linq-es2015 (external) and fs (internal). The F# solution must only use the F# core modules and System.IO.
These programs must be named after your upi, e.g. jbon007.cs, jbon007.js, jbon007.fs. The dropbox will reject programs that do not follow this convention.
Compilation: Assuming the C# and F# compilers are on the current PATH, the required C# and the optional F# programs must be compilable in the labs with these simple commands (no external libraries are permitted):
csc jbon007.cs
fsc jbon007.fs
The required Node JS program will not be compiled, but executed directly by Node.
Execution – C# and F#
After the above compilation, the C# and F# programs will have a name like jbon007.exe. This program will be given (1) a mandatory parameter, the text file name (path); and (2) an optional parameter, the integer k, with the default value k=3. If errors occur, an error line must appear. Sample runs and outputs:
>jbon007.exe test1.txt 2
4 AA
2 CCC
>jbon007.exe test1.txt 10
4 AA
2 CCC
2 DD
1 BB
>jbon007.exe test1.txt
4 AA
2 CCC
2 DD
>jbon007.exe test1.txt x
*** Error: ...
>jbon007.exe testx.txt
*** Error: ...
>jbon007.exe
*** Error: ...
In the error texts, in place of ellipses (…), display either system created messages, or your custom messages (presumably more adequate). However, we do not prescribe these texts. The important part is that the programs must catch and display something meaningful for all possible errors.
Execution – Node JS
The JS program will be executed under Node, so the command lines will be longer and errors may look differently – but still all errors must be caught.
>node jbon007.js test1.txt 2
4 AA
2 CCC
>node jbon007.js test1.txt 10
4 AA
2 CCC
2 DD
1 BB
>node jbon007.js test1.txt
4 AA
2 CCC
2 DD
>node jbon007.js test1.txt x
*** Error: ...
>node jbon007.js testx.txt
*** Error: ...
>node jbon007.js
*** Error: ...
Functional style: Besides the results’ accuracy, your submission will also be assessed on its high-level functional style. For this project, we will be specifically looking at a consistent formatting, a reasonable size, and at the number of low-level explicit loops used, i.e. for, foreach, while statements.
Generally, a functional style naturally results in crisper and shorter programs (without artificially squeezing many expressions on one single line). The model solutions have about 50-60 lines, including a few blank lines inserted for readability, and no explicit loops.
As we will see in the lectures, functional programming offers high-level functions that encapsulate and hide most required loops.
Submission
Please submit to the project Dropbox (https://adb.auckland.ac.nz/) the following items:
(1) Your C# source file, upi.cs – required
(2) Your Node JS source file, upi.js – required
(3) Your F# source file, upi.fs – allowed (optional bonus)
References (just in case you are interested to read more about this problem and its very efficient solutions)
Bentley, J., Knuth, D., McIlroy, D.: Programming pearls: A literate program. Commun. ACM 29(6), 471-483 (Jun 1986),
http://doi.acm.org/10.1145/5948.315654
Appendix
A sample bash script solution, based on McIlroy’s pipeline of UNIX tools. Again, no explicit loops. Just in case, as you do not have to study this if are not familiar with UNIX.
cat test1.txt |
tr -cs A-Za-z '\n' | awk 'NF' |
tr a-z A-Z | gsort | uniq -c |
gsort -k1rn -k2 | head -3
#sed 3q
1513067809116994.png
案例cs作业:C# F# nodejs Word Frequency Table
2018-04-21