rwdedupe(1)                     SiLK Tool Suite                    rwdedupe(1)
NAME
rwdedupe - Eliminate duplicate SiLK Flow records
SYNOPSIS
rwdedupe [--ignore-fields=FIELDS] [--packets-delta=NUM]
[--bytes-delta=NUM] [--stime-delta=NUM] [--duration-delta=NUM]
[--temp-directory=DIR_PATH] [--buffer-size=SIZE]
[--note-add=TEXT] [--note-file-add=FILENAME]
[--compression-method=COMP_METHOD] [--print-filenames]
[--output-path=PATH] [--site-config-file=FILENAME]
{[--xargs] | [--xargs=FILENAME] | [FILE [FILE ...]]}
rwdedupe --help
rwdedupe --help-fields
rwdedupe --version
DESCRIPTION
rwdedupe reads SiLK Flow records from one or more input sources.
Records that appear in the input file(s) multiple times will only
appear in the output stream once; that is, duplicate records are not
written to the output. The SiLK Flows are written to the file
specified by the --output-path switch or to the standard output when
the --output-path switch is not provided and the standard output is not
connected to a terminal.
Note: As part of its processing, rwdedupe re-orders the records before
writing them.
rwdedupe reads SiLK Flow records from the files named on the command
line or from the standard input when no file names are specified and
--xargs is not present. To read the standard input in addition to the
named files, use "-" or "stdin" as a file name. If an input file name
ends in ".gz", the file will be uncompressed as it is read. When the
--xargs switch is provided, rwdedupe will read the names of the files
to process from the named text file, or from the standard input if no
file name argument is provided to the switch. The input to --xargs
must contain one file name per line.
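As a rough sketch of the --xargs form described above, the shell function below passes a list file to rwdedupe. The function name dedupe_from_list and the file names in the comments are hypothetical; only the --xargs and --output-path switches come from this page.

```shell
# Hypothetical helper: deduplicate every file named in a list file.
#   $1 = text file holding one SiLK Flow file name per line
#   $2 = destination for the deduplicated records
dedupe_from_list() {
    rwdedupe --xargs="$1" --output-path="$2"
}

# Possible use (file names are illustrative):
#   ls data*.rw > files.txt
#   dedupe_from_list files.txt data.rw
```

Giving --xargs without an argument reads the same kind of list from the standard input instead.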
By default, rwdedupe will consider one record to be a duplicate of
another when all the fields in the records match exactly. From another
point of view, any difference between two records results in both
records appearing in the output. Note that "all" means every field that
exists on a SiLK Flow record. The complete list of fields is specified
in the description of --ignore-fields in the "OPTIONS" section below.
To have rwdedupe ignore fields in the comparison, specify those fields
in the --ignore-fields switch. When --ignore-fields=FIELDS is
specified, a record is considered a duplicate of another if all fields
except those in FIELDS match exactly. rwdedupe will treat FIELDS as
being identical across all records. Put another way, if the only
difference between two records is in the FIELDS fields, only one of
those records will be written to the output.
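A minimal sketch of how --ignore-fields might be used from the shell follows. The function name dedupe_ignoring is hypothetical, and the choice of the sensor field in the comment is only an assumed use case (the same flow recorded at two collection points); other fields may also need to be ignored in practice.

```shell
# Hypothetical helper: deduplicate while ignoring a set of fields.
#   $1      = comma-separated FIELDS list for --ignore-fields
#   $2      = output file
#   $3 ...  = input files
dedupe_ignoring() {
    fields=$1
    output=$2
    shift 2
    rwdedupe --ignore-fields="$fields" --output-path="$output" "$@"
}

# Possible use (file names are illustrative):
#   dedupe_ignoring sensor merged.rw data1.rw data2.rw
```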
The --packets-delta, --bytes-delta, --stime-delta and --duration-delta
switches allow for "fuzziness" in the input. For example, if
--stime-delta=NUM is specified and the only difference between two
records is in the sTime fields, and the fields are within NUM
milliseconds of each other, only one record will be written to the
output.
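The "differ by NUM or less" rule used by the delta switches can be illustrated with a small shell function. This is only a sketch of the comparison as worded in the "OPTIONS" section, not rwdedupe's implementation; the name within_delta is hypothetical, and the sketch says nothing about which of two matching records is kept.

```shell
# Hypothetical illustration of the delta test for one field.
#   $1, $2 = the two integer field values (e.g. sTime in milliseconds)
#   $3     = the delta (NUM)
# Succeeds (exit status 0) when the values differ by NUM or less.
within_delta() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$3" ]
}

# Possible use:
#   within_delta 1000 1005 5 && echo "treated as the same"
```

With a delta of 0, the default, the test reduces to an exact match.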
During its processing, rwdedupe will try to allocate a large (near 2GB)
in-memory array to hold the records. (You may use the --buffer-size
switch to change this maximum buffer size.) If more records are read
than will fit into memory, the in-core records are temporarily stored
on disk as described by the --temp-directory switch. When all records
have been read, the on-disk files are merged to produce the output.
By default, the temporary files are stored in the /tmp directory.
Because of the sizes of the temporary files, it is strongly recommended
that /tmp not be used as the temporary directory, and rwdedupe will
print a warning when /tmp is used. To modify the temporary directory
used by rwdedupe, provide the --temp-directory switch, set the
SILK_TMPDIR environment variable, or set the TMPDIR environment
variable.
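The order in which the temporary directory is chosen can be sketched as a shell function. The name resolve_tmpdir is hypothetical, and the sketch assumes that an empty value is treated the same as an unset one, which this page does not state.

```shell
# Hypothetical illustration of the temporary-directory precedence.
#   $1 = value that would be given to --temp-directory (may be empty)
# Assumption: empty values are treated as "not set".
resolve_tmpdir() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"              # --temp-directory wins
    elif [ -n "${SILK_TMPDIR:-}" ]; then
        printf '%s\n' "$SILK_TMPDIR"    # then SILK_TMPDIR
    elif [ -n "${TMPDIR:-}" ]; then
        printf '%s\n' "$TMPDIR"         # then TMPDIR
    else
        printf '%s\n' /tmp              # default
    fi
}
```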
OPTIONS
Option names may be abbreviated if the abbreviation is unique or is an
exact match for an option. A parameter to an option may be specified
as --arg=param or --arg param, though the first form is required for
options that take optional parameters.
--ignore-fields=FIELDS
Ignore the fields listed in FIELDS when determining if two flow
records are identical; that is, treat FIELDS as being identical
across all flows. By default, all fields are treated as
significant.
FIELDS is a comma separated list of field-names, field-integers,
and ranges of field-integers; a range is specified by separating
the start and end of the range with a hyphen (-). Field-names are
case-insensitive. Example:
--ignore-fields=stime,12-15
The supported fields are:
sIP,1
source IP address
dIP,2
destination IP address
sPort,3
source port for TCP and UDP, or equivalent
dPort,4
destination port for TCP and UDP, or equivalent
protocol,5
IP protocol
packets,pkts,6
packet count
bytes,7
byte count
flags,8
bit-wise OR of TCP flags over all packets
sTime,9
starting time of flow (milliseconds resolution)
duration,10
duration of flow (milliseconds resolution)
sensor,12
name or ID of sensor at the collection point
in,13
router SNMP input interface or vlanId if packing tools were
configured to capture it (see sensor.conf(5))
out,14
router SNMP output interface or postVlanId
nhIP,15
router next hop IP
class,20,type,21
class and type of sensor at the collection point (represented
internally by a single value)
initialFlags,26
TCP flags on first packet in the flow
sessionFlags,27
bit-wise OR of TCP flags over all packets except the first in
the flow
attributes,28
flow attributes set by flow generator
application,29
guess as to the content of the flow. Some software that
generates flow records from packet data, such as yaf(1), will
inspect the contents of the packets that make up a flow and use
traffic signatures to label the content of the flow. SiLK
calls this label the application; yaf refers to it as the
appLabel. The application is the port number that is
traditionally used for that type of traffic (see the
/etc/services file on most UNIX systems). For example, traffic
that the flow generator recognizes as FTP will have a value of
21, even if that traffic is being routed through the standard
HTTP/web port (80).
--packets-delta=NUM
Treat the packets field on two records as being the same if the
values differ by NUM packets or less. If not specified, the
default is 0.
--bytes-delta=NUM
Treat the bytes field on two records as being the same if the
values differ by NUM bytes or less. If not specified, the default
is 0.
--stime-delta=NUM
Treat the start-time field on two records as being the same if the
values differ by NUM milliseconds or less. If not specified, the
default is 0.
--duration-delta=NUM
Treat the duration field on two records as being the same if the
values differ by NUM milliseconds or less. If not specified, the
default is 0.
--temp-directory=DIR_PATH
Specify the name of the directory in which to store data files
temporarily when more records have been read than will fit into
RAM. This switch overrides the directory specified in the
SILK_TMPDIR environment variable, which overrides the directory
specified in the TMPDIR variable, which overrides the default,
/tmp.
--buffer-size=SIZE
Set the maximum size of the buffer to use for holding the records,
in bytes. A larger buffer means fewer temporary files need to be
created, reducing the I/O wait times. The default maximum for this
buffer is near 2GB. The SIZE may be given as an ordinary integer,
or as a real number followed by a suffix "K", "M" or "G", which
represents the numerical value multiplied by 1,024 (kilo),
1,048,576 (mega), and 1,073,741,824 (giga), respectively. For
example, 1.5K represents 1,536 bytes, or one and one-half
kilobytes. (This value does not represent the absolute maximum
amount of RAM that rwdedupe will allocate, since additional buffers
will be allocated for reading the input and writing the output.)
--output-path=PATH
Write the SiLK Flow records to the specified file or named pipe.
When the standard output is not a terminal and this switch is not
provided or its argument is "-" or "stdout", the records are
written to the standard output.
--note-add=TEXT
Add the specified TEXT to the header of the output file as an
annotation. This switch may be repeated to add multiple
annotations to a file. To view the annotations, use the
rwfileinfo(1) tool.
--note-file-add=FILENAME
Open FILENAME and add the contents of that file to the header of
the output file as an annotation. This switch may be repeated to
add multiple annotations. Currently the application makes no
effort to ensure that FILENAME contains text; be careful that you
do not attempt to add a SiLK data file as an annotation.
--compression-method=COMP_METHOD
Specify how to compress the output. When this switch is not given,
output to the standard output or to named pipes is not compressed,
and output to files is compressed using the default chosen when
SiLK was compiled. The valid values for COMP_METHOD are determined
by which external libraries were found when SiLK was compiled. To
see the available compression methods and the default method, use
the --help or --version switch. SiLK can support the following
COMP_METHOD values when the required libraries are available.
none
Do not compress the output using an external library.
zlib
Use the zlib(3) library for compressing the output, and always
compress the output regardless of the destination. Using zlib
produces the smallest output files at the cost of speed.
lzo1x
Use the lzo1x algorithm from the LZO real time compression
library for compression, and always compress the output
regardless of the destination. This method provides good
compression with less memory and CPU overhead.
best
Use lzo1x if available, otherwise use zlib. Only compress the
output when writing to a file.
--print-filenames
Print to the standard error the names of input files as they are
opened.
--site-config-file=FILENAME
Read the SiLK site configuration from the named file FILENAME.
When this switch is not provided, rwdedupe searches for the site
configuration file in the locations specified in the "FILES"
section.
--xargs
--xargs=FILENAME
Causes rwdedupe to read file names from FILENAME or from the
standard input if FILENAME is not provided. The input should have
one file name per line. rwdedupe will open each file in turn and
read records from it, as if the files had been listed on the
command line.
--help
Print the available options and exit.
--help-fields
Print the description and alias(es) of each field and exit.
--version
Print the version number and information about how SiLK was
configured, then exit the application.
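The SIZE suffixes accepted by --buffer-size can be illustrated with a short conversion function. This is a sketch of the arithmetic described for that switch, not rwdedupe's own parser; the name size_to_bytes is hypothetical, and only the upper-case suffixes named on this page are handled.

```shell
# Hypothetical illustration: convert a SIZE such as 1.5K to bytes.
# K = 1,024; M = 1,048,576; G = 1,073,741,824; no suffix = bytes.
size_to_bytes() {
    awk -v s="$1" 'BEGIN {
        mult = 1
        suffix = substr(s, length(s), 1)
        if (suffix == "K")      mult = 1024
        else if (suffix == "M") mult = 1048576
        else if (suffix == "G") mult = 1073741824
        # Drop the suffix character before multiplying.
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}

# Possible use:
#   size_to_bytes 1.5K
```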
LIMITATIONS
When the temporary files and the final output are stored on the same
file volume, rwdedupe will require approximately twice as much free
disk space as the size of input data.
When the temporary files and the final output are on different volumes,
rwdedupe will require between 1 and 1.5 times as much free space on the
temporary volume as the size of the input data.
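The figures above can be turned into a rough estimate before a run. The sketch below sums the on-disk sizes of the input files and applies the factors of 2 and 1.5; the name space_estimate is hypothetical, and the on-disk size is an assumption that may understate the "size of input data" when the inputs are compressed.

```shell
# Hypothetical estimate of free space needed, in bytes.
#   $@ = input files
# Prints the summed input size, the same-volume figure (2x), and the
# upper temporary-volume figure (1.5x) from this section.
space_estimate() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$(( total + size ))
    done
    printf 'input=%d same_volume=%d temp_volume_max=%d\n' \
        "$total" "$(( total * 2 ))" "$(( total * 3 / 2 ))"
}

# Possible use (file names are illustrative):
#   space_estimate data1.rw data2.rw data3.rw data4.rw
```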
EXAMPLE
In the following examples, the dollar sign ("$") represents the shell
prompt. The text after the dollar sign represents the command line.
Suppose you have made several rwfilter(1) runs to find interesting
traffic:
$ rwfilter --start-date=2008/02/04 ... --pass=data1.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data2.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data3.rw
$ rwfilter --start-date=2008/02/04 ... --pass=data4.rw
You now want to merge that traffic into a single output file, but you
want to ensure that any records appearing in multiple output files are
only counted once. You can use rwdedupe to merge the output files to a
single file, data.rw:
$ rwdedupe data1.rw data2.rw data3.rw data4.rw --output-path=data.rw
ENVIRONMENT
SILK_TMPDIR
When set and --temp-directory is not specified, rwdedupe writes the
temporary files it creates to this directory. SILK_TMPDIR
overrides the value of TMPDIR.
TMPDIR
When set and SILK_TMPDIR is not set, rwdedupe writes the temporary
files it creates to this directory.
SILK_CLOBBER
The SiLK tools normally refuse to overwrite existing files.
Setting SILK_CLOBBER to a non-empty value removes this restriction.
SILK_CONFIG_FILE
This environment variable is used as the value for the
--site-config-file switch when that switch is not provided.
SILK_DATA_ROOTDIR
This environment variable specifies the root directory of the data
repository. As described in the "FILES" section, rwdedupe may use
this environment variable when searching for the SiLK site
configuration file.
SILK_PATH
This environment variable gives the root of the install tree. When
searching for configuration files, rwdedupe may use this
environment variable. See the "FILES" section for details.
SILK_TEMPFILE_DEBUG
When set to 1, rwdedupe prints debugging messages to the standard
error as it creates, re-opens, and removes temporary files.
FILES
${SILK_CONFIG_FILE}
${SILK_DATA_ROOTDIR}/silk.conf
/data/silk.conf
${SILK_PATH}/share/silk/silk.conf
${SILK_PATH}/share/silk.conf
/usr/local/share/silk/silk.conf
/usr/local/share/silk.conf
Possible locations for the SiLK site configuration file which are
checked when the --site-config-file switch is not provided.
${SILK_TMPDIR}/
${TMPDIR}/
/tmp/
Directory in which to create temporary files.
SEE ALSO
rwfilter(1), rwfileinfo(1), sensor.conf(5), silk(7), yaf(1), zlib(3)
SiLK 3.11.0.1                      2016-02-19                      rwdedupe(1)