
oss-license-extract
-------------------

Copyright (c) 2001 Arbor Networks, Inc.
Copyright (c) 2001 Scott Iekel-Johnson <scottij@arbor.net>.

A program to generate a comprehensive license and copyright notice for
a given set of source files.  Particularly useful when redistributing
open source software or auditing existing licenses.

This program is a Perl script which recursively scans a list of files
and directories for program files, and attempts to automatically
extract license and copyright information from them.  Copyright
statements are linked with their accompanying licenses and each
license is compared with the licenses from the other files to
eliminate duplicates.  The output is a comprehensive list containing a
single copy of each unique license and its associated copyright
holders covering all files scanned.  This list is printed to standard
output by default; an output file name can be given on the command
line instead.

Files and directories to scan can be given on the command line and/or
specified in a text file with the -f option.  A program file is
currently defined as one which matches *.c, *.h, *.sh, *.pl, or *.py
(the program actually calls find(1) with those globs to determine
which files to read).
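
The file selection rule above can be sketched in a few lines.  This
is only an illustration in Python, not the script's Perl code: the
real program hands the globs to find(1), and the names PROGRAM_GLOBS,
is_program_file, and find_program_files are hypothetical.

```python
import fnmatch
import os

# Globs listed in this README; the real script passes them to find(1).
PROGRAM_GLOBS = ("*.c", "*.h", "*.sh", "*.pl", "*.py")


def is_program_file(name):
    """Return True if the file name matches one of the program-file globs."""
    return any(fnmatch.fnmatchcase(name, glob) for glob in PROGRAM_GLOBS)


def find_program_files(paths):
    """Yield program files for each path, recursing into directories."""
    for path in paths:
        if os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                for name in sorted(files):
                    if is_program_file(name):
                        yield os.path.join(root, name)
        elif is_program_file(os.path.basename(path)):
            # Assumption: explicitly named files are filtered by the
            # same globs, as they would be by find(1) with -name tests.
            yield path
```

For example, list(find_program_files(["src", "util.pl"])) would walk
the src directory and also keep util.pl, while a file such as
notes.txt would be skipped.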

The program detects licenses by looking for blocks of text that start
with either "Redistribution" or "Permission".  It assumes any
copyright notices will be found before the license, if present.
Multiple licenses in the same file are supported.  Licenses which
contain several common forms of the advertising clause and/or a
warranty will have the authors' names replaced with "copyright
holder".  This avoids treating as unique the large number of
licenses that differ only in the author's name.
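
The detection rule described above can be approximated as follows.
This is a simplified Python illustration, not the script's Perl
code.  It assumes the input is the text of a single comment block,
that a license runs until the next copyright line, the next license
start, or the end of the input, and it omits the "copyright holder"
substitution.  All names (extract_licenses and the three patterns)
are hypothetical.

```python
import re

# A license block starts with one of these two words (per this README).
LICENSE_START = re.compile(r"^(Redistribution|Permission)\b")
COPYRIGHT = re.compile(r"^Copyright\b", re.IGNORECASE)
# Assumption: strip common C, shell, and Perl comment leaders.
COMMENT_LEADER = re.compile(r"^\s*(/\*+|\*+/|\*|#|//)?\s?")


def extract_licenses(comment_text):
    """Return a list of (copyright_lines, license_text) pairs."""
    results = []
    copyrights = []       # copyright lines seen since the last license
    license_lines = None  # None while outside a license block

    def flush():
        results.append((copyrights, "\n".join(license_lines).strip()))

    for raw in comment_text.splitlines():
        line = COMMENT_LEADER.sub("", raw, count=1).rstrip()
        if COPYRIGHT.match(line):
            if license_lines is not None:
                # A new copyright notice ends the current license.
                flush()
                copyrights, license_lines = [], None
            copyrights.append(line)
        elif LICENSE_START.match(line):
            if license_lines is not None:
                # A second license with no new copyright notice.
                flush()
                copyrights = []
            license_lines = [line]
        elif license_lines is not None:
            license_lines.append(line)
    if license_lines is not None:
        flush()
    return results
```

Because copyright lines are only collected before a license starts,
the sketch shares the assumption stated above: a notice that appears
after its license would be attached to the next license instead.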

Some licenses include additional text between the copyright notice and
the start of the license itself.  The program places this text after
all other license and copyright information in the output, once
duplicates have been checked for and removed.  This text can be
excluded using the '-x' option.

Duplicate licenses are detected by converting the text to lower case,
stripping all punctuation and whitespace, and then comparing the
remainder.  As a result, licenses that differ only by transposed
characters, typos, or other subtle changes are treated as two
completely different licenses.  Improvements to this algorithm are
welcomed and encouraged.
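
The comparison described above, together with the merging of
copyright holders under each unique license, might look like the
following.  This is a Python illustration rather than the script's
Perl code; normalize and merge_licenses are hypothetical names, and
"punctuation" is assumed to mean ASCII punctuation.

```python
import string


def normalize(text):
    """Lower-case the text and drop all punctuation and whitespace."""
    return "".join(
        ch for ch in text.lower()
        if ch not in string.punctuation and not ch.isspace()
    )


def merge_licenses(entries):
    """Group (copyright_lines, license_text) pairs by normalized license.

    Returns a list of (license_text, copyright_lines) tuples, keeping
    the first spelling seen for each unique license.
    """
    merged = {}  # normalized text -> (original text, copyright lines)
    for copyrights, text in entries:
        _text, holders = merged.setdefault(normalize(text), (text, []))
        for line in copyrights:
            if line not in holders:
                holders.append(line)
    return list(merged.values())
```

With this scheme "Permission is granted." and "permission  is
granted" collapse into one entry, while "Permission si granted."
(two transposed letters) stays separate, which is the limitation
noted above.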

