Text::CSV::Separator(3)  User Contributed Perl Documentation  Text::CSV::Separator(3)

NAME
       Text::CSV::Separator - Determine the field separator of a CSV file

VERSION
       Version 0.19 - December 4, 2007

SYNOPSIS
	   use Text::CSV::Separator qw(get_separator);

	   my @char_list = get_separator(
					   path	   => $csv_path,
					   exclude => $array1_ref, # optional
					   include => $array2_ref, # optional
					   echo	   => 1,	   # optional
					);

	   my $separator;
	   if (@char_list) {
	       if (@char_list == 1) {		# successful detection
		   $separator = $char_list[0];
	       } else {				# several candidates passed the tests
		   # Some code here
	       }
	   } else {				# no candidate passed the tests
	       # Some code here
	   }

	   # "I'm Feeling Lucky" alternative interface
	   # Don't forget to include the 'lucky' parameter

	   my $separator = get_separator(
					   path	   => $csv_path,
					   lucky   => 1,
					   exclude => $array1_ref, # optional
					   include => $array2_ref, # optional
					   echo	   => 1,	   # optional
					);

DESCRIPTION
       This module provides a fast detection of the field separator character
       (also called field delimiter) of a CSV file, or more generally, of a
       character separated text file (also called delimited text file), and
       returns it ready to use in a CSV parser (e.g., Text::CSV_XS,
       Tie::CSV_File, or Text::CSV::Simple).  This may be useful to the
       vulnerable (and often ignored) population of programmers who need to
       automatically process CSV files from different sources.

       The default set of candidates contains the following characters:
       ','  ';'  ':'  '|'  '\t'

       The only required parameter is the CSV file path. Optionally, the user
       can specify characters to be excluded or included in the list of
       candidates.
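
       How the default set, the exclusions and the inclusions might be
       combined can be sketched as follows. This is only an illustration:
       build_candidates is a hypothetical helper, not a function of this
       module, and the order of operations (exclusions applied first,
       inclusions appended without duplicates) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV::Separator.
# Assumption: exclusions are applied to the default set first, and
# user-supplied characters are appended afterwards without duplicates.
sub build_candidates {
    my (%opt) = @_;
    my @default  = (',', ';', ':', '|', "\t");
    my %excluded = map { $_ => 1 } @{ $opt{exclude} || [] };
    my @list     = grep { !$excluded{$_} } @default;
    my %seen     = map { $_ => 1 } @list;
    push @list, grep { !$seen{$_}++ } @{ $opt{include} || [] };
    return @list;
}

my @candidates = build_candidates(
    exclude => [':', '|'],
    include => ['#'],
);
# Show the tab as a visible escape when printing the list.
print join(' ', map { $_ eq "\t" ? '\t' : $_ } @candidates), "\n";
```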

       The routine returns an array containing the list of candidates that
       passed the tests. If it succeeds, this array will contain only one
       value: the field separator we are looking for. On the other hand, if no
       candidate survives the tests, it will return an empty list.

       The technique used is based on the following principle:

       ·       For every line in the file, the number of instances of the
	       separator character acting as separators must be a constant
	       integer greater than 0, although a line may also contain
	       some instances of that character as literal characters.

       ·       Most of the other candidates won't appear in a typical CSV
	       line.

       As soon as a candidate fails to appear in a line, it is removed from
       the candidates list.

       This is the first test performed on the CSV file. In most cases, it
       will detect the separator after processing the first few lines. In
       particular, if the file contains a header line, one line will probably
       be enough to get the job done.  Processing stops and control returns
       to the caller as soon as a single candidate is left (or none at
       all).
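
       The elimination idea behind the first pass can be illustrated with
       a minimal sketch. It is not the module's actual code:
       first_pass_sketch is a hypothetical helper that works on lines
       already held in memory, and it only checks that each candidate
       occurs at least once per line.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV::Separator: keeps only the
# candidates that occur at least once in every line examined.
sub first_pass_sketch {
    my ($lines, $candidates) = @_;
    my %alive = map { $_ => 1 } @$candidates;

    for my $line (@$lines) {
        for my $char (keys %alive) {
            # A candidate that is absent from a line cannot be the separator.
            delete $alive{$char} if index($line, $char) < 0;
        }
        # Stop as soon as one candidate (or none) is left.
        last if keys(%alive) <= 1;
    }
    return sort keys %alive;
}

my @lines = (
    "name;time;price",
    "apple;10:30;1,5",
    "pear;11:45;2,0",
);
my @survivors = first_pass_sketch(\@lines, [',', ';', ':', '|', "\t"]);
print "Survivors: @survivors\n";
```

       Here the header line contains only semicolons, so every other
       candidate is dropped after the first line and the loop ends early.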

       If the routine cannot determine the separator in the first pass, it
       will do a second pass based on several heuristic techniques. It checks
       whether the file has columns consisting of time values, comma-separated
       decimal numbers, or numbers containing a comma as the group separator,
       which can lead to false positives in files that don't have a header
       row. It also measures the variability of the remaining candidates.  Of
       course, you can always create a CSV file capable of resisting the
       siege, but this approach will work correctly in many cases. The
       possibility of excluding some of the default candidates may help to
       resolve cases with several possible winners.  The resulting array
       contains the list of possible separators sorted by likelihood, with
       the first item being the most probable separator.
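
       One way to picture a variability measure is to rank the remaining
       candidates by the variance of their per-line counts, lowest first.
       This is a sketch under assumptions: the text above does not say
       which measure the module uses, and count_variance and
       rank_by_variability are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: population variance of the number of times
# $char occurs in each line.
sub count_variance {
    my ($lines, $char) = @_;
    my @counts = map { my $n = () = /\Q$char\E/g; $n } @$lines;
    my $total = 0;
    $total += $_ for @counts;
    my $mean = $total / @counts;
    my $sq = 0;
    $sq += ($_ - $mean) ** 2 for @counts;
    return $sq / @counts;
}

# Hypothetical helper: most regular candidate first; ties are broken
# by character order so the result is deterministic.
sub rank_by_variability {
    my ($lines, $candidates) = @_;
    my %var = map { $_ => count_variance($lines, $_) } @$candidates;
    return sort { $var{$a} <=> $var{$b} || $a cmp $b } @$candidates;
}

my @rows = (
    "10:30,foo,bar",
    "11:45,note: ok,baz",
    "12:00,x,y",
);
my @ranked = rank_by_variability(\@rows, [':', ',']);
print "Ranked: @ranked\n";
```

       In this sample both characters appear in every line, but the comma
       count is the same on each line while the colon count is not, so
       the comma is ranked first.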

       The module also provides an alternative interface with a simpler
       syntax, which can be handy if you think that the files your program
       will have to deal with aren't too exotic. To use it you only have to
       add the lucky => 1 key-value pair to the parameters hash and the
       routine will return a single value, so you can assign it directly to a
       scalar variable.  If no candidate survives the first pass, it will
       return "undef".  The code skips the second pass, which is usually
       unnecessary, so the program won't store counts and won't check for
       regularities in the data. Hence, it will run faster and require less
       memory. This approach should be enough in most cases.

FUNCTIONS
       get_separator(%options)
	   Returns an array containing the field separator character (or
	   characters, if more than one candidate passed the tests) of a CSV
	   file. In case no candidate passes the tests, it returns an empty
	   list.

	   The available parameters are:

	   path	   Required. The path to the CSV file.

	   exclude Optional. Array containing characters to be excluded from
		   the candidates list.

	   include Optional. Array containing characters to be included in the
		   candidates list.

	   lucky   Optional. If selected, get_separator will return one single
		   character, or "undef" in case no separator is detected. Off
		   by default.

	   echo	   Optional. Writes to the standard output messages describing
		   the actions performed. Off by default.  This is useful to
		   keep track of what's going on, especially for debugging
		   purposes.

EXPORT
       None by default.

EXAMPLE
       Consider the following scenario: Your program must process a batch of
       CSV files, and you know that the separator could be a comma, a
       semicolon or a tab.  You also know that one of the fields contains time
       values. This field will provide a fixed number of colons that could
       mislead the detection code.  In this case, you should exclude the colon
       (and you can also exclude the other default candidate not considered,
       the pipe character):

	   my @char_list = get_separator(
					   path	   => $csv_path,
					   exclude => [':', '|'],
					);

	   if (@char_list) {
	       my $separator;
	       if (@char_list == 1) {
		   $separator = $char_list[0];
	       } else {
		   # Some code here
	       }
	   }

	   # Using the "I'm Feeling Lucky" interface:

	   my $separator = get_separator(
					   path	   => $csv_path,
					   lucky   => 1,
					   exclude => [':', '|'],
					 );

MOTIVATION
       Despite the popularity of XML, the CSV file format is still widely used
       for data exchange between applications, because of its much lower
       overhead: It requires much less bandwidth and storage space than XML,
       and it also performs better under compression (see the References
       below).

       Unfortunately, there is no formal specification of the CSV format.  The
       Microsoft Excel implementation is the most widely used and it has
       become a de facto standard, but the variations are almost endless.

       One of the biggest annoyances of this format is that in most cases you
       don't know a priori which field separator character is used in a
       file.  CSV stands for "comma-separated values", but most of the
       spreadsheet applications let the user select the field delimiter from a
       list of several different characters when saving or exporting data to a
       CSV file.  Furthermore, in a Windows system, when you save a
       spreadsheet in Excel as a CSV file, Excel will use as the field
       delimiter the default list separator of your system's locale, which
       happens to be a semicolon for several European languages. You can even
       customize this setting and use the list separator you like. For these
       and other reasons, automating the processing of CSV files is a risky
       task.

       This module can be used to determine the separator character of a
       delimited text file of any kind, but since the aforementioned ambiguity
       problems occur mainly in CSV files, I decided to use the Text::CSV::
       namespace.

REFERENCES
       <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>

       <http://www.xml.com/pub/a/2004/12/15/deviant.html>

SEE ALSO
       There's another module in CPAN for this task,
       Text::CSV::DetectSeparator, which follows a different approach.

ACKNOWLEDGEMENTS
       Many thanks to Xavier Noria for wise suggestions.  The author is also
       grateful to Thomas Zahreddin, Benjamin Erhart, Ferdinand Gassauer, and
       Mario Krauss for valuable comments and bug reports.

AUTHOR
       Enrique Nell, <perl_nell@telefonica.net>

COPYRIGHT AND LICENSE
       Copyright (C) 2006 by Enrique Nell.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.14.0			  2007-12-04	       Text::CSV::Separator(3)