Paai's text utilities

Last additions: 23 December 2000

This is a dynamic (growing and changing) collection of programs and Unix scripts that I use for playing with text files. The collection is not complete, by no means bug-free, and perhaps not even very original. But the source is available under the GNU copyleft, and you can do what you want with it.

Hopefully other people who are engaged in Information Retrieval and Corpus Linguistics will find something here that they can use. I know that I wished for a collection of similar programs when I started.

All programs compile and run under Linux. As they are not very complicated, they should run on most Unix-systems.

The programs and scripts below often depend on the availability of a weighted index. This can be generated by, e.g., SMART, using a keyword-document weight, or by the discrim program below, which computes the discrimination value of keywords. You can download the SMART system here as smart.11.0.tar.z. After some minor editing of the Makefile it compiles and runs beautifully on Linux too, but its documentation is lousy. For a short introduction to the care and feeding of SMART, see my attempts to teach the use of SMART.


Do not forget to check the help function of the programs (by typing 'progname -h'). It may contain additions or modifications that have not yet found their way into the man pages.

PTU: Introduction
How to use Paai's Text Utilities.
chains
(man-page) prints sentence_number and 2-sentence-chain-similarity to stdout. Sentence similarity is computed by counting active chains. A chain is active when a particular word occurs again within a certain distance of an earlier occurrence of that same word.
extract
Extracts entries from a bibtex file and does some formatting. Uses boolean operators and field control. Extremely handy for command-line junkies who often need references from their bibliography files.
hyperg (NEW May '98)
Computes performance of IR attributable to chance.
sent_wgt
(man-page) prints sentence_number and mean_of_weights of the words in the sentence to stdout.
sent_til
(man-page) prints sentence_number and 2-sentence-similarity to stdout. Sentence similarity is computed by Dice's or Jaccard's coefficient.
discrim
(man-page) Computes centroid and discrimination values from weighted index. The program shows a drastic speed improvement over Dave Dubin's program below, from which I pinched some code.
bigrams
(man-page) computes the approximate mutual info for bigrams and displays bigrams above a certain frequency threshold with their mutual info:
log(f(x,y) / (f(x)*f(y))) / log 2
listwords
(man-page) lists words. Rough-and-ready indexing program for relatively small files.
matrix
(man-page) makes a cross-tab of an index-file
wordsel
(man-page) selects various files from cross-tab file. svdinterface.
arg-expand
(man-page) expands a range of 'a,c-f,...' to 'a,c,d,e,f,...'
arfftosvd
(man-page) Converts an arff file to the input format of svdinterface (sparse matrix).
de-matrix
(man-page) inverse effect of matrix.
docvec
(man-page) creates a vector representation of documents
rocchio
(man-page) computes the centroid of positive and negative classes.
simil
(man-page) Compares every vector in with the query and writes the similarity to stdout.
smallsmart
(man-page) Computes the word-document weights according to the atc-variant of SMART. Very slow as compared with the original, but easier to handle.
smarttosvd
(man-page) converts three-column index to sparse matrix as used for svd
splitname
(man-page) cuts filename in substrings.
svdtoarff
(man-page) Converts the output of svdinterface to a rectangular matrix.
doe_gnuplot
(man-page) Simple command-line front-end for gnuplot.
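
The formula given in the bigrams entry above can be sketched in a few lines of Python. This is only an illustration, not the code of the bigrams program itself: it assumes that f(x) and f(x,y) are raw occurrence counts, that a bigram is a pair of adjacent tokens, and the names bigram_mutual_info and min_freq are made up for this example.

```python
import math
from collections import Counter

def bigram_mutual_info(tokens, min_freq=1):
    """Score adjacent word pairs with log(f(x,y) / (f(x)*f(y))) / log 2.

    Assumption: f() are raw counts taken from the token list itself.
    Only bigrams seen at least min_freq times are returned.
    """
    unigram = Counter(tokens)                  # f(x)
    bigram = Counter(zip(tokens, tokens[1:]))  # f(x,y) for adjacent pairs
    scores = {}
    for (x, y), fxy in bigram.items():
        if fxy >= min_freq:
            ratio = fxy / (unigram[x] * unigram[y])
            scores[(x, y)] = math.log(ratio) / math.log(2)
    return scores

if __name__ == "__main__":
    words = "a b a b c".split()
    for pair, score in sorted(bigram_mutual_info(words, min_freq=2).items()):
        print(pair, round(score, 3))
```

With raw counts the score differs from a probability-based mutual information only by a constant that depends on the corpus size, so the ranking of bigrams within one text is unaffected.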

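The two coefficients named in the sent_til entry can be illustrated as follows. This is a minimal sketch, assuming each sentence is reduced to the set of its distinct words; how sent_til itself tokenizes or weights words is not specified here, and the function names dice and jaccard are chosen for this example only.

```python
def dice(sentence_a, sentence_b):
    """Dice's coefficient: 2*|A & B| / (|A| + |B|) over distinct words."""
    a, b = set(sentence_a), set(sentence_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sentences
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(sentence_a, sentence_b):
    """Jaccard's coefficient: |A & B| / |A | B| over distinct words."""
    a, b = set(sentence_a), set(sentence_b)
    union = a | b
    if not union:
        return 0.0  # same convention for two empty sentences
    return len(a & b) / len(union)

if __name__ == "__main__":
    s1 = "the cat sat".split()
    s2 = "the cat ran".split()
    print(dice(s1, s2), jaccard(s1, s2))
```

Both coefficients run from 0 (no shared words) to 1 (identical word sets); Dice gives somewhat higher values than Jaccard for partially overlapping sentences.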
Programs by other authors

WEKA
The WEKA stuff I used until now (2000) is obsolete and has been superseded by new programs in Java. Please refer to the WEKA site for more info.
arffcols
(help) Displays columns from an arff file (by the WEKA crew).
arffinfo
(help) arffinfo reads from and displays information about the arff file (WEKA).
arffsplit
(help) arffsplit reads from and outputs two files, and . The proportion of the output directed into each file is specified by the options (WEKA).

CLUSTER
(compressed file) Sources & documentation of a Hierarchical Cluster Analysis and Principal Component Analysis program. Compiles just fine under Linux. You will have to download it separately by clicking the word "CLUSTER" above.

SVM-Light V1.0 (link)
SVM-Light is a fully functional and fast implementation of Vapnik's Support Vector Machine for the pattern recognition problem. The optimization algorithm used is a refined version of the decomposition algorithm proposed in [Osuna, et al., 1997]. It will be described in detail in a forthcoming paper. The implementation has modest memory requirements and can handle problems with many thousands of support vectors efficiently (quoted from an email by Thorsten Joachims).

Latent Semantic Indexing or SVD
Parts of Michael Berry's SVDPACK hacked by Hinrich Schuetze. All that you need to do Singular Value Decomposition. Pick up the stuff by clicking here.

SMART
Just for completeness here is the Smart-link again.

BIBHTML
This is a Perl script by Ralf Hauser that performs a similar service for HTML as bibtex does for LaTeX. It uses normal bibtex files and indeed calls bibtex itself. I have not yet tested it extensively, but it seems to work most of the time. Call the program without parameters for help and examples.
Download all
The programs. Well, most of them. Do not forget to press the shift-key...

Publications

Bibliography
"La joie de se voir imprimé"

Last update of this URL : 23 dec. 1997.