dictd (8)
NAME
dictd - a dictionary database server
SYNOPSIS
dictd [options]
DESCRIPTION
dictd is a server for the Dictionary Server Protocol (DICT), a TCP transaction
based query/response protocol that allows a client to access dictionary
definitions from a set of natural language dictionary databases.
Since startup time is significant, the server is designed to run
continuously, and should not be run from inetd(8).
Databases are distributed separately from the server.
BACKGROUND
For many years, the Internet community has relied on the "webster" protocol
for access to natural language definitions. The webster protocol supports
access to a single dictionary and (optionally) to a single thesaurus. In
recent years, the number of publicly available webster servers on the
Internet has dramatically decreased.
Fortunately, several freely-distributable dictionaries and lexicons have
recently become available on the Internet. However, these
freely-distributable databases are not accessible via a uniform interface,
and are not accessible from a single site. They are often small and
incomplete individually, but would collectively provide an interesting and
useful database of English words. Examples include the Jargon file, the
WordNet database, MICRA's version of the 1913 Webster's Revised Unabridged
Dictionary, and the Free Online Dictionary of Computing. (See the DICT
protocol specification (RFC) for references.) Translating and non-English
dictionaries are also becoming available (for example, the FOLDOC
dictionary is being translated into Spanish).
The webster protocol is not suitable for providing access to a large
number of separate dictionary databases, and extensions to the current
webster protocol were not felt to be a clean solution to the
dictionary database problem.
The DICT protocol is designed to provide access to multiple databases.
Word definitions can be requested, the word index can be searched
(using an easily extended set of algorithms), information about the
server can be provided (e.g., which index search strategies are
supported, or which databases are available), and information about a
database can be provided (e.g., copyright, citation, or distribution
information). Further, the DICT protocol has hooks that can be used
to restrict access to some or all of the databases.
dictd(8) is a server that implements the DICT protocol. Bret Martin implemented
another server, and several people (including Bret and myself) have
implemented clients in a variety of languages.
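As a rough illustration of the query/response exchange described above, the following Python sketch builds command lines and splits a response status line. It assumes the DEFINE and MATCH command forms and the three-digit status codes of the DICT protocol RFC; the helper names (quote, define_command, match_command, parse_status) are hypothetical and are not part of dictd or of any client.

```python
def quote(word):
    """Quote a string for a DICT command line.

    A single backslash quotes the next character, so every backslash
    in the word is doubled before the word is wrapped in double quotes.
    """
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def define_command(database, word):
    """Build a DEFINE command line (assumed form: DEFINE database word)."""
    return "DEFINE {} {}\r\n".format(database, quote(word))


def match_command(database, strategy, word):
    """Build a MATCH command line (assumed form: MATCH database strategy word)."""
    return "MATCH {} {} {}\r\n".format(database, strategy, quote(word))


def parse_status(line):
    """Split a response line into its numeric code and the remaining text."""
    code, _, text = line.strip().partition(" ")
    return int(code), text
```

A client would send such a line over a TCP connection to the server port and read status lines back; for example, parse_status applied to a "420" line yields the code 420 that the server sends when its daemon limit is exceeded.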
OPTIONS
-V or --version
Display version information.
-L or --license
Display copyright and license information.
-h or --help
Display help information.
-v or --verbose or -dverbose
Be verbose.
-c file or --config file
Specify configuration file. The default is /etc/dictd.conf,
but may be changed in the dictd.h file at compile time (DICT_CONFIG_FILE).
--port service
Specifies the port (e.g., 2628) or service (e.g., dict) for connections.
The default is 2628, as specified in the DICT Protocol RFC, but may be
changed in the
dictd.h
file at compile time (DICT_DEFAULT_SERVICE).
--depth length
Specify the queue length for
listen(2).
Specifies the number of pending socket connections which are queued by the
operating system. Some operating systems may silently limit this value to
5 (older BSD systems) or 128 (Linux). The default is 10 but may be changed
in the
dictd.h
file at compile time (DICT_QUEUE_DEPTH).
--delay seconds
Specifies the number of seconds a client may be idle before the server will
close the connection. Idle time is defined to be the time the server is
waiting for input and does not include the time the server spends searching
the database. Connections are closed without warning since no provision
for premature connection termination is specified in the DICT protocol
RFC. The default is 600 seconds (10 minutes), but may be changed in the
dictd.h
file at compile time (DICT_DEFAULT_DELAY).
--limit children
Specifies the number of daemons that may be running simultaneously. Each
daemon services a single connection. If the limit is exceeded, a
(serialized) connection will be made by the server process, and a response
code 420 (server temporarily unavailable) will be sent to the client. This
parameter should be adjusted to prevent the server machine from being
overloaded by dict clients, but should not be set so low that many clients
are denied useful connections. The default is 100, but may be changed in
the
dictd.h
file at compile time (DICT_DAEMON_LIMIT).
-l option or --log option
Specify a logging option. Some of the more verbose options are used
primarily for debugging the server code, and are not practical for normal
use.
server
Log server diagnostics. This is extremely verbose.
connect
Log all connections.
stats
Log all children terminations.
commands
Log all commands. This is extremely verbose.
client
Log results of CLIENT command.
found
Log all words found in the databases.
notfound
Log all words not found in the databases.
timestamp
When logging to a file, use a full timestamp like that which syslog would
produce. Otherwise, no timestamp is made, making the files shorter.
host
Log name of foreign host.
min
Set a minimal number of options. If logging is activated (to a file, or
via syslog), and no options are set, then the minimal set of options will
be used.
all
Set all of the options.
none
Clear all of the options.
To facilitate location of interesting information in the log file, entries
are marked with initial letters indicating the class of the line being
logged:
I
Information about the server, connections, or termination statistics.
These lines are generally not designed to be parsed automatically.
E
Error messages.
C
CLIENT command information.
D
Definitions found in the databases searched.
M
Matches found in the database searched.
N
Matches which were not found in the databases searched.
T
Trace of exact line sent by client.
To preserve anonymity of the client, do not use the connect or host
options. Clients may or may not send host information using the CLIENT
command, but this should be an option that is selectable on the client
side.
-d option
Activate a debugging option. There are several, all of which are only
useful to developers. They are documented here for completeness. A list
can be obtained interactively by using
-d
with an illegal option.
verbose
The same as
-v or --verbose.
Adds verbosity to other options.
scan
Debug the scanner for the configuration file.
parse
Debug the parser for the configuration file.
search
Debug the character folding and binary search routines.
init
Report database initialization.
port
Log client-side port number to the log file.
lev
Debug Levenshtein search algorithm.
auth
Debug the authorization routines.
nodetach
Do not detach as a background process. Implies that a copy of the log
file will appear on the standard output.
nofork
Do not fork daemons to service requests. Be a single-threaded server.
This option implies
nodetach,
and is most useful for using a debugger to find the point at which daemon
processes are dumping core.
alt
Debugs altcompare in index.c.
CONFIGURATION FILE
The configuration file defaults to /etc/dictd.conf,
but can be specified on the command line with the -c
option (see above). The configuration file has four distinct sections.
At this time, each section must appear in the specified order, although
only the Database section is required.
Syntax
The following keywords are valid in a configuration file: access, allow,
deny, group, database, data, index, filter, prefilter, postfilter, name,
user, authonly, site. Keywords are case sensitive. String arguments that
contain spaces should be surrounded by double quotes. Without quoting,
strings may contain alphanumeric characters and _, -, ., and *, but not
spaces. Strings must be on a single line and cannot be continued between
lines. Comments start with # and extend to the end of the line.
Access Specification
Access specifications may occur in the Access Section or in the Database
Section. The access specification will be described here.
For allow, deny, and authonly, a star (*) may be used as a wildcard that
matches any number of characters. A question mark (?) may be used as a
wildcard that matches a single character. For example, 10.0.0.* and *.edu
are valid strings.
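The wildcard rules above can be sketched in Python by translating a pattern into a regular expression. This is only an illustration of the matching rules as stated (star for any number of characters, question mark for exactly one); the function names are hypothetical, and whether dictd compares case sensitively is not stated here, so the sketch simply compares the strings as given.

```python
import re


def wildcard_to_regex(pattern):
    """Translate an access pattern into a compiled regular expression.

    '*' matches any number of characters, '?' matches exactly one,
    and every other character matches only itself.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_match(pattern, text):
    """Return True if the whole text matches the access pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None
```

With these rules, 10.0.0.* matches 10.0.0.17 but not 10.0.1.5, and 10.0.0.? matches only addresses whose last component is a single character.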
The syntax is as follows:
allow string
The string specifies a domain name or IP address which is allowed access to the
server (in the Access Section) or to a database (in the Database Section).
deny string
The string specifies a domain name or IP address which is denied access to
the server (in the Access Section) or to a database (in the Database
Section). Note that if reverse DNS is not working, then only the IP number
will be checked. Therefore, it is essential to deny networks based on IP
number, since a denial based on domain name may not always be checked.
authonly string
This form is only useful in the Access Section. The string specifies a
domain name or IP address which is allowed access to the server but not to
any of the databases. All commands are valid except DEFINE, MATCH, and
SHOW DB. More specifically AUTH is a valid command, and commands which
access the databases are not allowed.
user string
This form is only useful in the Database Section. The string specifies a
username that is allowed to access this database after a successful AUTH
command is executed.
site string
Used to specify the filename for the site information file, a flat text
file which will be displayed in response to the SHOW SERVER command. This
section, if present, must be first.
access { access specification }
This section, the second if the Site Section is present, contains access
restrictions for the server and all of the databases collectively.
Per-database control is specified in the Database Section.
database string { database specification }
This section is required. The string specifies the name of the database
(e.g., wn or web1913). The database specification describes the database:
data string
Specifies the filename for the flat text database.
index string
Specifies the filename for the index file.
prefilter string
Specifies the prefilter command. When a chunk of the compressed database
is read, it will be filtered with this filter before being decompressed.
This may be used to provide some additional compression that knows about
the data and can provide better compression than the LZ77 algorithm used by
zlib.
postfilter string
Specifies the postfilter command. When a chunk of the compressed database
is read, it will be filtered with this filter before the offset and length
for the entry are used to access data. This is provided for symmetry with
the prefilter command, and may also be useful for providing additional
database compression.
filter string
Specifies the filter command. After the entry is extracted from the
database, it will be filtered with this filter. This may be used to
provide formatting for the entry (e.g., for html).
Warning:
This is not currently implemented.
name string
Specifies the short name of the database (e.g., "1913 Webster's"). If the
string begins with @, then it specifies the headword to look up in the
dictionary to find the short name of the database. The default is
"@00-database-short", but this may be changed in the dictd.h
file at compile time (DICT_SHORT_ENTRY_NAME).
access { access specification }
Used to restrict access to this particular database.
user string string
The first string specifies the username, and the second string specifies
the shared secret for this username. When the AUTH command is used, the
client will provide the username and a hashed version of the shared
secret. If the shared secret matches, the user is said to have
authenticated, and will have access to databases whose access
specifications allow that user (by name, or by wildcard). If present, this
section must appear last in the configuration file. There may be many user
entries. The shared secret should be kept secret, as anyone who has access
to it can access the shared databases (assuming access is not denied by
domain name).
DETERMINATION OF ACCESS LEVEL
When a client connects, the global access specification is scanned, in
order, until a specification matches. If no access specification exists,
all access is allowed (e.g., the action is the same as if "allow *" was the
only item in the specification). For each item, both the hostname and IP
are checked. For example, consider the following access specification:
allow 10.42.*
authonly *.edu
deny *
With this specification, all clients in the 10.42 network will be allowed
access to unrestricted databases; all clients from *.edu sites will be
allowed to authenticate, but will be denied access to all databases, even
those which are otherwise unrestricted; and all other clients will have
their connection terminated immediately. The 10.42 network clients can
send an AUTH command and gain access to restricted databases. The *.edu
clients must send an AUTH command to gain access to any databases,
restricted or unrestricted.
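The first-match scan of the global access specification can be sketched as follows. This is a minimal Python illustration, not the server's code: the function name and the (action, pattern) tuple layout are hypothetical, Python's fnmatchcase stands in for the wildcard matching (it also accepts bracket classes, which the configuration syntax does not use), and the result when a non-empty specification matches nothing is left as None because the text does not say what happens in that case.

```python
from fnmatch import fnmatchcase


def access_level(spec, hostname, ip):
    """Return the action of the first item that matches the client.

    spec is an ordered list of (action, pattern) pairs, where action is
    "allow", "authonly", or "deny". hostname may be None when reverse
    DNS is not working, in which case only the IP address is checked.
    """
    if not spec:
        # No access specification: same as if "allow *" were the only item.
        return "allow"
    for action, pattern in spec:
        if fnmatchcase(ip, pattern):
            return action
        if hostname is not None and fnmatchcase(hostname, pattern):
            return action
    return None  # not specified by the text; left undecided here


# The example specification from the text above.
EXAMPLE_SPEC = [("allow", "10.42.*"), ("authonly", "*.edu"), ("deny", "*")]
```

Passing None as the hostname shows why denials should use IP addresses: a rule written against a domain name cannot match a client whose name could not be resolved.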
When the AUTH command is sent, the access list for each database is
scanned, in order, just as the global access list is scanned. However,
after authentication, the client has an associated username. For example,
consider the following access specification:
user u1
deny *.com
user u2
allow *
If the client authenticated as u1, then the client will have access to this
database, even if the client comes from a *.com site. In contrast, if the
client authenticated as u2, the client will only have access if it does not
come from a *.com site. In this case, the "user u2" is redundant, since
that client would also match "allow *".
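The per-database scan with usernames can be sketched in the same way. This Python illustration makes several assumptions that are labeled in the code: the function name and tuple layout are hypothetical, an empty specification is treated as unrestricted (mirroring the global rule), and a specification in which nothing matches is treated as a denial.

```python
from fnmatch import fnmatchcase


def database_access(spec, hostname, ip, username=None):
    """Return True if the client may use the database, scanning spec in order.

    spec is an ordered list of (kind, pattern) pairs, where kind is
    "user", "allow", or "deny". username is None before a successful AUTH.
    """
    if not spec:
        return True  # assumption: no specification means unrestricted
    for kind, pattern in spec:
        if kind == "user":
            if username is not None and fnmatchcase(username, pattern):
                return True
        elif fnmatchcase(ip, pattern) or (
            hostname is not None and fnmatchcase(hostname, pattern)
        ):
            return kind == "allow"
    return False  # assumption: nothing matched, so access is refused


# The example specification from the text above.
EXAMPLE_DB_SPEC = [("user", "u1"), ("deny", "*.com"), ("user", "u2"), ("allow", "*")]
```

Because the list is scanned in order, u1 is accepted before the *.com denial is reached, while u2 from a *.com site is stopped by that denial before the "user u2" item is ever examined.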
Warning:
Checks are performed for domain names and for IP addresses. However, if
reverse DNS for a specific site is not working, it is possible that a
domain name may not be available for checking. Make sure that all denials
use IP addresses. (And consider a future enhancement: if a domain name is
not available, should denials that depend on a domain name match anything?
This is the more conservative viewpoint, but it is not currently
implemented.)
SEARCH ALGORITHMS
The DICT standard specifies a few search algorithms that must be
implemented, and permits others to be supported on a server-dependent
basis. The following search strategies are supported by this server. Note
that all strategies are case insensitive. Most ignore non-alphanumeric,
non-whitespace characters.
exact
An exact match. This algorithm uses a binary search and is one of the
fastest search algorithms available.
prefix
Prefix match. This algorithm also uses a binary search and is very fast.
substring
Match a substring anywhere in the headword. This search strategy uses a
modified Boyer-Moore-Horspool algorithm. Since it must search the whole
index file, it is not as fast as the exact and prefix matches.
suffix
Suffix match. This search strategy also uses a modified
Boyer-Moore-Horspool algorithm, and is as fast as the substring search.
re
POSIX 1003.2 (modern) regular expression search. Modern regular
expressions are the ones used by
egrep(1).
These regular expressions allow predefined character classes (e.g.,
[[:alnum:]], [[:alpha:]], [[:digit:]], and [[:xdigit:]] are useful for this
application); use * to match a sequence of 0 or more matches of the previous
atom; use + to match a sequence of 1 or more matches of the previous atom;
use ? to match a sequence of 0 or 1 matches of the previous atom; use ^ to
match the beginning of a word and $ to match the end of a word; and
allow nested subexpressions and alternation with () and |. For example,
"(foo|bar)" matches all words that contain either "foo" or "bar". To match
these special characters, they must be quoted with two backslashes (due to
the quoting characteristics of the server).
Warning:
Regular expression matches can take 10 to 300 times longer than substring
matches. On a busy server, with many databases, this can require more
than 5 minutes of waiting time, depending on the complexity of the regular
expression.
regexp
Old (basic) regular expressions. These regular expressions don't support
|, +, or ?. Groups use escaped parentheses. While modern regular
expressions are generally easier to use, basic regular expressions have a
back reference feature. This can be used to match a second occurrence of
something that was already matched. For example, the following expression
finds all words that begin and end with the same three letters:
^\\(...\\).*\\1$
Note the use of the double backslashes to escape the special characters.
This is required by the DICT protocol string specification (a single
backslash quotes the next character -- we use two to get a single backslash
through to the regular expression engine).
Warning:
Note that the use of backtracking is even slower than the use of general
regular expressions.
soundex
The Soundex algorithm, a classic algorithm for finding words that sound
similar to each other. The algorithm encodes each word using the first
letter of the word and up to three digits. Since the first letter is
known, this search is relatively fast, and it is sometimes good for correcting
spelling errors when the Levenshtein algorithm doesn't help.
lev
The Levenshtein algorithm (string edit distance of one). This algorithm
searches for all words which are within an edit distance of one from the
target word. An "edit" means an insertion, deletion, or transposition.
This is a rapid algorithm for correcting spelling errors, since many
spelling errors are within a Levenshtein distance of one from the original
word.
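The two spelling-correction strategies in the list above, soundex and lev, can be sketched in Python. This is an illustration under stated assumptions, not the server's implementation: the Soundex variant is the classic one (first letter plus up to three digits, padded with zeros), which may differ in detail from what dictd computes, and the edit-distance sketch uses only the three edits named in the text (insertion, deletion, transposition), whereas the textbook Levenshtein distance also counts substitution. All function names are hypothetical.

```python
import string

# Classic Soundex digit groups; vowels, h, w, and y have no code.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """Encode a word as its first letter plus up to three digits."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return code.ljust(4, "0")


def one_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return every string one insertion, deletion, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | insertions) - {word}


def lev_matches(word, headwords):
    """Return the headwords that are one edit away from word, ignoring case."""
    candidates = one_edit_candidates(word.lower())
    return sorted(h for h in headwords if h.lower() in candidates)
```

Generating the candidates first and then looking each one up is what makes this kind of search fast: every candidate can be checked with an exact match instead of scanning the whole index.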
DATABASE FORMAT
Databases for dictd are distributed separately. A database consists of two files. One is a
flat text file, the other is the index.
The flat text file contains dictionary entries (or any other suitable
data), and the index contains tab-delimited tuples consisting of the
headword, the byte offset at which this entry begins in the flat text file,
and the length of the entry in bytes. The offset and length are encoded
in base 64, using the 64-character subset of International
Alphabet IA5 discussed in RFC 1421 (printable encoding) and RFC 1522
(base64 MIME). Encoding the offsets in base 64 saves considerable space
when compared with the usual base 10 encoding, while still permitting tab
characters (ASCII 9) to be used for delimiting fields in a record. Each
record ends with a newline (ASCII 10), so the index file is human readable.
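A minimal Python sketch of reading one index record follows. It assumes that each number is written as positional base-64 digits, most significant digit first, using the alphabet A-Z, a-z, 0-9, +, / (this is different from base64 encoding of bytes); that reading of the format, and the function names, are assumptions and should be checked against a real index file.

```python
# The 64-character alphabet; the position of a character is its digit value.
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
B64_VALUE = {ch: i for i, ch in enumerate(B64_ALPHABET)}


def b64_number(text):
    """Decode a base-64 number, most significant digit first (assumed)."""
    value = 0
    for ch in text:
        value = value * 64 + B64_VALUE[ch]
    return value


def parse_index_line(line):
    """Split a tab-delimited record into (headword, offset, length)."""
    headword, offset, length = line.rstrip("\n").split("\t")
    return headword, b64_number(offset), b64_number(length)
```

Under this assumption a value just below 64^4 (about 16.7 million) fits in four characters, where base 10 would need eight digits.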
The flat text file may be compressed using gzip(1) (not recommended) or dictzip(1)
(highly recommended). Optimal speed will be obtained using an uncompressed
file. However, the gzip compression algorithm works very well on plain text, and can result in
space savings typically between 60 and 80%. Using a file compressed with gzip(1)
is not recommended, however, because random access on the file can only be
accomplished by serially decompressing the whole file, a process which is
prohibitively slow.
dictzip(1) uses the same compression algorithm and file format as does gzip(1),
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip(1) can be decompressed with gzip(1) or zcat(1).
(Note: recompressing a dictzip'd file using, for example, znew(1)
will destroy the random access characteristics of the file. Always
compress data files using dictzip(1).)
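Because a dictzip'd file can be decompressed like any gzip file, an entry can be pulled out with Python's standard gzip module once its offset and length are known from the index. The sketch below deliberately takes the slow, serial path described above: it does not read the dictzip block table, so every byte before the entry is decompressed first. The function name is hypothetical.

```python
import gzip


def read_entry(path, offset, length):
    """Return length bytes starting at offset in the uncompressed data.

    gzip.open gives a file object that decompresses sequentially, so the
    seek below works by decompressing everything before the offset.
    """
    with gzip.open(path, "rb") as data:
        data.seek(offset)
        return data.read(length)
```

This is adequate for inspecting a database by hand, but a server needs the block table that dictzip(1) adds to avoid decompressing from the start on every lookup.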
ACKNOWLEDGEMENTS
Special thanks to Jean-loup Gailly and Mark Adler for writing the zlib
general purpose data compression library. The version contained with
dictd
is not necessarily an original version and
may have been modified,
although any modifications are probably trivial. The key features of the
dictzip
random-access compression algorithm utilize a documented extension of the
gzip format, and do not require any modifications to zlib. For more
information on zlib, please see the zlib home page at
http://quest.jpl.nasa.gov/zlib/.
Special thanks to Henry Spencer for his regex package. The package
contained with
dictd
is not necessarily an original version and
may have been modified.
For more information on regex, please see
ftp://zoo.toronto.edu/pub/regex.shar.
COPYING
The main source files for the
dictd
server and the
dictzip
compression program were written by Rik Faith ([email protected]) and are
distributed under the terms of the GNU General Public License. If you need
to distribute under other terms, write to the author.
The main libraries used by these programs (zlib, regex, libmaa) are
distributed under different terms, so you may be able to use the libraries
for applications which are incompatible with the GPL -- please see the
copyright notices and license information that come with the libraries for
more information, and consult with your attorney to resolve these issues.
BUGS
The regular expression searches do not ignore non-whitespace,
non-alphanumeric characters as do the other searches. In practice, this
isn't much of a problem.
The databases are memory mapped and cannot be updated while the server is
running.
There is no way to get a running server to re-read the configuration file,
so databases cannot be added or deleted on the fly.