Utilities
Last additions: 23 December 2000
This is a dynamic (growing and changing) collection of programs and
Unix scripts that I use for playing with text files. The collection is not
complete, by no means bug-free and perhaps not even very original. But the source
is available under the GNU copyleft, and you can do what you want with them.
Hopefully other people who are engaged in Information Retrieval and
Corpus Linguistics will find something here that they can use. I know that
I wished for a collection of similar programs when I started.
All programs compile and run under Linux. As they are not very complicated,
they should run on most Unix systems.
The programs and scripts below often depend on the availability of a weighted index.
This can be generated by e.g. SMART, using a keyword-document weight, or by the
discrim program below, which computes the discrimination value of keywords. You can
download the SMART system here
as smart.11.0.tar.z. After some minor editing of the Makefile it compiles and
runs beautifully on Linux too, but its documentation is lousy. For a short
introduction into the care and feeding of SMART, see my attempts to
teach the use of Smart.
- If some of the programs are of use to you, let me know.
- If you find bugs, please bring them to my attention.
- If you know of other programs that should be in this collection, please
tell me.
Do not forget to check the help function of the programs (by typing 'progname
-h'). It may contain additions or modifications that have not yet found their
way into the man pages.
- How to use Paai's Text Utilities.
- (man-page) prints sentence_number and 2-sentence-chain-similarity to
stdout. Sentence similarity is computed by counting active chains. A chain is
active when a particular word occurs within a certain distance of a previous
occurrence of that same word.
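A minimal reading of that scheme, sketched in Python. The function name, the window unit (sentences) and the exact chain definition are my assumptions, not the original program:

```python
def active_chains(sentences, max_gap=2):
    """Per sentence, count chains that are 'active': a word seen again
    within max_gap sentences of its previous occurrence. (A sketch only;
    the real program may measure distance differently.)"""
    last_seen = {}   # word -> index of the last sentence it occurred in
    result = []
    for i, sent in enumerate(sentences):
        words = set(sent.lower().split())
        active = sum(1 for w in words
                     if w in last_seen and i - last_seen[w] <= max_gap)
        if i > 0:    # similarity is defined between sentence pairs
            result.append((i, active))
        for w in words:
            last_seen[w] = i
    return result

print(active_chains(["the cat sat", "the cat slept", "dogs bark"]))
# -> [(1, 2), (2, 0)]: 'the' and 'cat' recur in sentence 1; nothing in 2
```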
- Extracts entries from a bibtex file and does some formatting. Uses boolean
operators and field control. Extremely handy for command-line junkies who often
need references from their bibliography files.
- hyperg (NEW)
- Computes performance of IR attributable to chance.
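Judging by the name, the chance baseline is presumably the hypergeometric distribution: drawing n documents at random from a collection of N that contains R relevant ones. A sketch of that model (function names are mine, not hyperg's):

```python
from math import comb

def p_hypergeom(k, n, R, N):
    """Probability of retrieving exactly k relevant documents when
    drawing n documents at random from N, of which R are relevant."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

def chance_precision(R, N):
    """Precision a random retrieval achieves on average: just R/N."""
    return R / N

# Toy collection: 4 docs, 2 relevant, retrieve 2 at random.
print(p_hypergeom(1, 2, 2, 4))   # -> 0.666...
print(chance_precision(2, 4))    # -> 0.5
```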
- (man-page) prints sentence_number and mean_of_weights of the words in the
sentence to stdout.
- (man-page) prints sentence_number and 2-sentence-similarity to stdout.
Sentence similarity is computed by Dice's or Jaccard's coefficient.
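For reference, the two coefficients on word sets look like this (a sketch, not the program itself):

```python
def dice(a, b):
    """Dice's coefficient: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard's coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

s1 = "the cat sat on the mat".split()
s2 = "the cat lay on a rug".split()
print(dice(s1, s2), jaccard(s1, s2))   # overlap {the, cat, on}
```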
- (man-page) Computes centroid and discrimination values from weighted
index. The program shows a drastic speed improvement over Dave Dubin's program
below, from which I pinched some code.
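Salton's discrimination value is the change in space density (average document-to-centroid similarity) when a term is removed: positive for good discriminators, negative for broad terms. A small sketch of that idea on sparse term->weight vectors; this is my own illustration, not the program's code:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in c.items()}

def density(docs):
    """Average similarity of the documents to their centroid."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs) / len(docs)

def discrimination_value(term, docs):
    """DV(term) = density without the term minus density with it.
    Removing a good discriminator packs the space closer together."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)

docs = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "c": 1.0}]
print(discrimination_value("a", docs))  # negative: 'a' is everywhere
print(discrimination_value("b", docs))  # positive: 'b' discriminates
```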
- (man-page) computes the approximate mutual information for bigrams and displays
bigrams above a certain frequency threshold with their mutual information:
log(f(x,y) / (f(x)*f(y))) / log 2
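A sketch of that computation in Python. I take the f's as relative frequencies (counts over corpus size); if the program uses raw counts the values shift by a constant:

```python
import math
from collections import Counter

def bigram_mi(tokens, min_freq=2):
    """Pointwise mutual information, in bits, for each bigram at or
    above the frequency threshold:
        MI(x, y) = log2( f(x,y) / (f(x) * f(y)) )"""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    out = {}
    for (x, y), c_xy in bi.items():
        if c_xy < min_freq:
            continue
        f_xy = c_xy / (n - 1)          # relative bigram frequency
        f_x, f_y = uni[x] / n, uni[y] / n
        out[(x, y)] = math.log2(f_xy / (f_x * f_y))
    return out

print(bigram_mi("a b a b a b c".split()))
```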
- (man-page) lists words. A rough-and-ready indexing program.
- (man-page) makes a cross-tab of an index-file
- (man-page) selects various files from a cross-tab file (for svdinterface).
- (man-page) expands a range of 'a,c-f,...' to 'a,c,d,e,f,...'
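The expansion is simple enough to sketch; this version handles single letters only, as in the man-page example (the function name is mine):

```python
def expand_range(spec):
    """Expand a spec like 'a,c-f' into ['a', 'c', 'd', 'e', 'f']."""
    out = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            out.extend(chr(c) for c in range(ord(lo), ord(hi) + 1))
        else:
            out.append(part)
    return out

print(expand_range('a,c-f'))   # -> ['a', 'c', 'd', 'e', 'f']
```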
- (man-page) Converts an arff file to the input format of svdinterface (sparse matrix).
- (man-page) reverses the effect of matrix.
- (man-page) creates a vector representation of documents
- (man-page) computes the centroid of positive and negative classes.
- (man-page) Compares every vector in the input with the query and writes
the similarity to stdout.
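Assuming cosine similarity on sparse term->weight vectors (the measure is not stated in the description), the comparison looks roughly like this:

```python
import math

def cosine(query, vec):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * vec.get(t, 0.0) for t, w in query.items())
    nq = math.sqrt(sum(w * w for w in query.values()))
    nv = math.sqrt(sum(w * w for w in vec.values()))
    return dot / (nq * nv) if nq and nv else 0.0

docs = {"d1": {"cat": 1.0, "mat": 1.0}, "d2": {"dog": 1.0}}
query = {"cat": 1.0}
for name, vec in docs.items():          # one similarity line per vector
    print(name, cosine(query, vec))
```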
- (man-page) Computes the word-document weights according to the atc-variant
of SMART. Very slow as compared with the original, but easier to handle.
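In SMART's three-letter notation, 'a' usually denotes augmented term frequency, 't' the idf factor and 'c' cosine normalisation. A sketch under that reading (not the original code, and the guard for an all-zero vector is my addition):

```python
import math
from collections import Counter

def atc_weights(docs):
    """docs: a list of token lists. Returns, per document, a dict of
    term -> weight: augmented tf ('a') times idf ('t'),
    cosine-normalised ('c')."""
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]
    df = Counter()                       # document frequency per term
    for tf in tfs:
        df.update(tf.keys())
    weighted = []
    for tf in tfs:
        max_tf = max(tf.values())
        w = {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / df[t])
             for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        weighted.append({t: v / norm for t, v in w.items()} if norm else w)
    return weighted

print(atc_weights([["a", "b", "b"], ["a", "c"]]))
```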
- (man-page) converts a three-column index to the sparse matrix format used for SVD.
- (man-page) cuts a filename into substrings.
- (man-page) Converts the output of svdinterface to a rectangular matrix.
- (man-page) Simple command-line front-end for gnuplot.
Programs by other authors
- The WEKA stuff I used till now (2000) is obsolete and superseded by new
programs in Java. Please refer to the WEKA site for more info.
- (help) Displays columns from an arff file (by the WEKA crew).
- (help) arffinfo reads an arff file and displays information about it.
- (help) arffsplit reads an arff file and outputs two files; the proportion
of the output directed into each file is specified by the options (WEKA).
- (compressed file) Sources & documentation of a Hierarchical Cluster
Analysis and Principal Component Analysis program. Compiles just fine under
Linux. You will have to download it separately by clicking the word "CLUSTER".
- SVM-Light V1.0
- SVM-Light is a fully functional and fast implementation of Vapnik's
Support Vector Machine for the pattern recognition problem. The optimization
algorithm used is a refined version of the decomposition algorithm proposed in
[Osuna, et al., 1997]. It will be described in detail in a forthcoming paper.
The implementation has modest memory requirements and can handle problems with
many thousands of support vectors efficiently (quoted from an email by T. Joachims).
Latent Semantic Indexing or SVD
- Parts of Michael Berry's SVDPACK, hacked by Hinrich Schuetze. All that you
need to do Singular Value Decomposition. Pick up the stuff by clicking here.
- Just for completeness here is the Smart-link again.
- This is a perl-script by Ralf Hauser that performs a similar service for
HTML as bibtex does for LaTeX. It uses normal bibtex-files and indeed calls
bibtex itself. I have not yet tested it extensively, but it seems to work most
of the time. Call the program without parameters for help and examples.
- Download all
- The programs. Well, most of them. Do not forget to press the shift-key...
- "The joy of seeing oneself in print"
Last update of this URL: 23 December 1997.