Text::CSV::Separator(3)  User Contributed Perl Documentation  Text::CSV::Separator(3)
NAME
Text::CSV::Separator - Determine the field separator of a CSV file
VERSION
Version 0.19 - December 4, 2007
SYNOPSIS
use Text::CSV::Separator qw(get_separator);
my @char_list = get_separator(
path => $csv_path,
exclude => $array1_ref, # optional
include => $array2_ref, # optional
echo => 1, # optional
);
my $separator;
if (@char_list) {
    if (@char_list == 1) {    # successful detection
        $separator = $char_list[0];
    } else {                  # several candidates passed the tests
        # Some code here
    }
} else {                      # no candidate passed the tests
    # Some code here
}
# "I'm Feeling Lucky" alternative interface
# Don't forget to include the 'lucky' parameter
my $separator = get_separator(
path => $csv_path,
lucky => 1,
exclude => $array1_ref, # optional
include => $array2_ref, # optional
echo => 1, # optional
);
DESCRIPTION
This module provides a fast detection of the field separator character
(also called field delimiter) of a CSV file, or more generally, of a
character separated text file (also called delimited text file), and
returns it ready to use in a CSV parser (e.g., Text::CSV_XS,
Tie::CSV_File, or Text::CSV::Simple). This may be useful to the
vulnerable (and often ignored) population of programmers who need to
automatically process CSV files from different sources.
The default set of candidates contains the following characters: ','
';' ':' '|' '\t'
The only required parameter is the CSV file path. Optionally, the user
can specify characters to be excluded or included in the list of
candidates.
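As an illustration of how the exclude and include options could shape
the candidate list, here is a minimal sketch. The build_candidates
helper is hypothetical and is not part of this module. It assumes that
excluded characters are removed from the defaults first and that
included characters are then appended if not already present; the
module's own handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV::Separator): combine the
# documented default candidates with optional exclude/include lists.
sub build_candidates {
    my (%opt) = @_;
    my @defaults = (',', ';', ':', '|', "\t");

    # Assumption: exclusions are applied to the defaults first.
    my %excluded   = map { $_ => 1 } @{ $opt{exclude} || [] };
    my @candidates = grep { !$excluded{$_} } @defaults;

    # Assumption: inclusions are appended, skipping duplicates.
    my %present = map { $_ => 1 } @candidates;
    for my $char (@{ $opt{include} || [] }) {
        push @candidates, $char unless $present{$char}++;
    }
    return @candidates;
}

my @candidates = build_candidates(exclude => [':', '|'], include => ['#']);
# Show the tab as \t so the list is readable.
print join(' ', map { $_ eq "\t" ? '\t' : $_ } @candidates), "\n";
```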
The routine returns an array containing the list of candidates that
passed the tests. If it succeeds, this array will contain only one
value: the field separator we are looking for. On the other hand, if no
candidate survives the tests, it will return an empty list.
The technique used is based on the following principle:
· For every line in the file, the number of instances of the
separator character acting as separators must be an integer
constant > 0 , although a line may also contain some instances
of that character as literal characters.
· Most of the other candidates won't appear in a typical CSV
line.
As soon as a candidate fails this condition on a line, it is removed
from the candidates list.
This is the first test done to the CSV file. In most cases, it will
detect the separator after processing the first few lines. In
particular, if the file contains a header line, one line will probably
be enough to get the job done. Processing will stop and return control
to the caller as soon as only one candidate (or none) is left.
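The first pass described above can be sketched roughly as follows. This
is not the module's actual code: first_pass is a hypothetical helper,
and it simplifies the principle by counting raw occurrences, so it does
not allow for candidates that also appear as literal characters inside
field values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV::Separator): keep only the
# candidates whose per-line count is the same positive number on every
# line examined.
sub first_pass {
    my ($lines, @candidates) = @_;
    my %expected;                               # count seen on the first line
    my %alive = map { $_ => 1 } @candidates;

    for my $line (@$lines) {
        for my $char (grep { $alive{$_} } @candidates) {
            # Raw occurrence count; quoted literals are not handled here.
            my $count = () = $line =~ /\Q$char\E/g;
            $expected{$char} //= $count;
            delete $alive{$char}
                if $count == 0 || $count != $expected{$char};
        }
        # Stop as soon as at most one candidate is left.
        last if keys(%alive) <= 1;
    }
    return grep { $alive{$_} } @candidates;
}

my @lines = (
    "name;time;price",
    "apple;10:30;1,5",
    "pear;11:45;2",
);
my @survivors = first_pass(\@lines, ',', ';', ':', '|', "\t");
print "Survivors: @survivors\n";
```

In this sample the header line alone eliminates every candidate except
the semicolon, so the loop stops after the first line.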
If the routine cannot determine the separator in the first pass, it
will do a second pass based on several heuristic techniques. It checks
whether the file has columns consisting of time values, comma-separated
decimal numbers, or numbers containing a comma as the group separator,
which can lead to false positives in files that don't have a header
row. It also measures the variability of the remaining candidates. Of
course, you can always create a CSV file capable of resisting the
siege, but this approach will work correctly in many cases. The
possibility of excluding some of the default candidates may help to
resolve cases with several possible winners. The resulting array
contains the list of possible separators sorted by their likelihood,
with the first array item being the most probable separator.
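To give an idea of the kind of check the second pass may perform, the
following sketch tests whether every colon in a set of lines belongs to
a time-like value. The colons_are_time_values function and its regular
expression are assumptions made for illustration; the module's real
heuristics are not shown in this document and are likely more
elaborate.

```perl
use strict;
use warnings;

# Hypothetical check (not the module's code): true if the lines contain
# colons and every colon is part of a time-like value such as 9:05 or
# 23:59:59.
sub colons_are_time_values {
    my ($lines) = @_;
    my $seen_colon = 0;
    for my $line (@$lines) {
        # Remove time-like values, then look for leftover colons.
        (my $rest = $line) =~ s/\b\d{1,2}:\d{2}(?::\d{2})?\b//g;
        return 0 if $rest =~ /:/;
        $seen_colon ||= ($line =~ /:/);
    }
    return $seen_colon ? 1 : 0;
}

# No header row: both ',' and ':' occur exactly twice on every line,
# so a pure count test cannot tell them apart.
my @lines = ("1,09:00,17:30", "2,10:15,18:45");
if (colons_are_time_values(\@lines)) {
    print "Colons look like time values; prefer another candidate\n";
}
```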
The module also provides an alternative interface with a simpler
syntax, which can be handy if you think that the files your program
will have to deal with aren't too exotic. To use it you only have to
add the lucky => 1 key-value pair to the parameters hash and the
routine will return a single value, so you can assign it directly to a
scalar variable. If no candidate survives the first pass, it will
return "undef". The code skips the 2nd pass, which is usually
unnecessary, so the program won't store counts and won't check any
existing regularities. Hence, it will run faster and will require less
memory. This approach should be enough in most cases.
FUNCTIONS
get_separator(%options)
Returns an array containing the field separator character (or
characters, if more than one candidate passed the tests) of a CSV
file. In case no candidate passes the tests, it returns an empty
list.
The available parameters are:
path Required. The path to the CSV file.
exclude Optional. Reference to an array containing characters to be
excluded from the candidates list.
include Optional. Reference to an array containing characters to be
included in the candidates list.
lucky Optional. If selected, get_separator will return one single
character, or "undef" in case no separator is detected. Off
by default.
echo Optional. Writes to the standard output messages describing
the actions performed. Off by default. This is useful to
keep track of what's going on, especially for debugging
purposes.
EXPORT
None by default.
EXAMPLE
Consider the following scenario: Your program must process a batch of
CSV files, and you know that the separator could be a comma, a
semicolon or a tab. You also know that one of the fields contains time
values. This field will provide a fixed number of colons that could
mislead the detection code. In this case, you should exclude the colon
(and you can also exclude the other default candidate not considered,
the pipe character):
my @char_list = get_separator(
path => $csv_path,
exclude => [':', '|'],
);
if (@char_list) {
my $separator;
if (@char_list == 1) {
$separator = $char_list[0];
} else {
# Some code here
}
}
# Using the "I'm Feeling Lucky" interface:
my $separator = get_separator(
path => $csv_path,
lucky => 1,
exclude => [':', '|'],
);
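The "Some code here" branches depend on your application. One possible
way to handle the three outcomes is sketched below. pick_separator is a
hypothetical helper, not part of this module. It relies on the returned
list being sorted by likelihood, and it takes the list as an argument so
that the sketch can run without a CSV file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV::Separator): choose a
# separator from the list returned by get_separator() and report how
# reliable the choice is.
sub pick_separator {
    my ($candidates, $default) = @_;
    my $n = @$candidates;
    return ($default, 'none') if $n == 0;     # no candidate passed the tests
    # The list is sorted by likelihood, so the first item is the best guess.
    return ($candidates->[0], $n == 1 ? 'unique' : 'ambiguous');
}

# In real code the list would come from get_separator(path => $csv_path).
my ($separator, $status) = pick_separator([';', ','], ',');
print "Using '$separator' ($status)\n";
```

A caller could log a warning or ask for confirmation when the status is
'ambiguous' or 'none' instead of proceeding silently.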
MOTIVATION
Despite the popularity of XML, the CSV file format is still widely used
for data exchange between applications, because of its much lower
overhead: It requires much less bandwidth and storage space than XML,
and it also compresses better (see the
References below).
Unfortunately, there is no formal specification of the CSV format. The
Microsoft Excel implementation is the most widely used and it has
become a de facto standard, but the variations are almost endless.
One of the biggest annoyances of this format is that in most cases you
don't know a priori which field separator character a file uses.
CSV stands for "comma-separated values", but most of the
spreadsheet applications let the user select the field delimiter from a
list of several different characters when saving or exporting data to a
CSV file. Furthermore, in a Windows system, when you save a
spreadsheet in Excel as a CSV file, Excel will use as the field
delimiter the default list separator of your system's locale, which
happens to be a semicolon for several European languages. You can even
customize this setting and use the list separator you like. For these
and other reasons, automating the processing of CSV files is a risky
task.
This module can be used to determine the separator character of a
delimited text file of any kind, but since the aforementioned ambiguity
problems occur mainly in CSV files, I decided to use the Text::CSV::
namespace.
REFERENCES
<http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>
<http://www.xml.com/pub/a/2004/12/15/deviant.html>
SEE ALSO
There's another module on CPAN for this task,
Text::CSV::DetectSeparator, which follows a different approach.
ACKNOWLEDGEMENTS
Many thanks to Xavier Noria for wise suggestions. The author is also
grateful to Thomas Zahreddin, Benjamin Erhart, Ferdinand Gassauer, and
Mario Krauss for valuable comments and bug reports.
AUTHOR
Enrique Nell, <perl_nell@telefonica.net>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Enrique Nell.
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
perl v5.14.0 2007-12-04 Text::CSV::Separator(3)