PowerShell File Processing

PowerShell is good for many things. Its data object model makes for powerful functionality. But turning everything into objects makes PowerShell a dog at handling large data files. In our business we often use both Windows and Linux to process the same data, and Linux generally does not like the files PowerShell creates.

Text File Encoding

To understand what is going on, a little explanation is needed. By default, if text is redirected to a file in Windows PowerShell, the encoding is UCS-2 LE BOM (little-endian UTF-16 with a byte order mark). (PowerShell 6+ changed the default to UTF-8 without a BOM, but everything below assumes Windows PowerShell.)

PS>echo "This is a test" > test.txt

If you open this file in a text editor it looks like a normal file, but if you open it in a hex editor you can see first the BOM ("Byte Order Mark") FF FE and then the text encoded in UTF-16. The BOM makes many Linux and Linux-based utilities choke. E.g. hashcat does not like a file with a BOM.

[Hex dump of test.txt: the BOM bytes FF FE followed by the UTF-16 encoded text]
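If you don't have a hex editor handy, PowerShell 5 and later can dump the raw bytes itself:

# show the first 16 bytes of the file; the FF FE BOM is plainly visible
format-hex -path .\test.txt | select -first 1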

You can also encode a file as UCS-2 BE BOM. In that case the BOM is FE FF and the byte order is reversed: each character's bytes start with the 00. Most Linux utilities like plain UTF-8 files with no BOM. For plain English text, UTF-8 is byte-for-byte identical to ASCII, and such a file is half the size of its UCS-2 LE BOM equivalent because each character takes one byte instead of two. Note that UCS-2 is often referred to interchangeably with UTF-16. To confuse matters more, you can also encode a file as UTF-8 with a BOM.

How do you determine the text encoding of a file? You can always open it up in a hex editor and look at the first few bytes. Here is a table listing different BOMs.

BOM            Encoding
EF BB BF       UTF-8 with BOM
FF FE          UTF-16 Little Endian (UCS-2 LE BOM)
FE FF          UTF-16 Big Endian (UCS-2 BE BOM)
FF FE 00 00    UTF-32 Little Endian
00 00 FE FF    UTF-32 Big Endian
2B 2F 76       UTF-7
(no BOM)       ASCII / UTF-8

Confusing as this is, not being aware of file encoding can cause several issues with PowerShell. Unfortunately, PowerShell does not have a built-in function to detect file encoding. If the file is small enough to open in a text editor like Notepad++, the editor will tell you the encoding, but then the question is moot because you can just search and manipulate the file in the editor. 99% of the time the files you encounter will be encoded either UTF-8 or UCS-2 LE BOM. If you are having issues, it is a good idea to look at the file encoding.

There is a PowerShell script on my GitHub to determine the file encoding.
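If you just want a quick check without a hex editor, a minimal sketch like the following reads the first four bytes and compares them against the table above (Get-FileBom is my own name for it, not a built-in):

function Get-FileBom {
    param([string]$path)
    # read just the first four bytes of the file
    $bytes = New-Object byte[] 4
    $stream = [IO.File]::OpenRead((Resolve-Path $path).Path)
    $count = $stream.Read($bytes, 0, 4)
    $stream.Close()

    # check longer signatures first: UTF-32 LE begins with the same FF FE as UCS-2 LE
    if ($count -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) { return 'UTF-8 BOM' }
    if ($count -ge 4 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE -and $bytes[2] -eq 0 -and $bytes[3] -eq 0) { return 'UTF-32 LE' }
    if ($count -ge 4 -and $bytes[0] -eq 0 -and $bytes[1] -eq 0 -and $bytes[2] -eq 0xFE -and $bytes[3] -eq 0xFF) { return 'UTF-32 BE' }
    if ($count -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) { return 'UCS-2 LE BOM' }
    if ($count -ge 2 -and $bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) { return 'UCS-2 BE BOM' }
    return 'no BOM (ASCII / UTF-8)'
}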

Changing File Encoding

PowerShell does have built-in functions you can use to change file encoding: out-file and add-content. For both of these commands you can add the -encoding parameter and set the encoding to one of unknown, string, unicode, bigendianunicode, utf8, utf7, utf32, ascii, default and oem. By now you should know that unicode = UTF-16 = UCS-2 LE BOM, utf8 = UTF-8 BOM, etc. Why can't they agree on one designation? Note: if you don't designate the encoding, the default in PowerShell is UCS-2 LE BOM.
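For example, to rewrite a UCS-2 LE BOM file for Linux consumption (file names here are placeholders; note that in Windows PowerShell -encoding utf8 still writes a BOM, so ascii is the safer choice for plain English text):

# get-content detects the BOM on read; out-file sets the new encoding
get-content .\unicode.txt | out-file -encoding ascii .\ascii.txt

# add-content works the same way when appending
get-content .\unicode.txt | add-content -encoding utf8 .\combined.txt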

Line Length

If you are not thoroughly confused at this point, there is another quirk to know about PowerShell. By default the maximum line length is 260 characters; when PowerShell reaches this limit it wraps the line. Ugh. This behavior shows up when using the redirect operators and all of the character output functions (like out-file) except add-content. Very often we deal with log files whose lines are longer than 260 characters, and if we process them with PowerShell without taking this into account, PowerShell wraps those lines and mangles the log file.
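If your log lines may be longer than that, two workarounds are common (file names here are placeholders):

# set-content / add-content write each string as-is, so long lines survive
get-content .\long.log | set-content -encoding UTF8 .\copy.log

# if you must use out-file, raise -width well past your longest line
get-content .\long.log | out-file -width 1000000 .\copy.log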

Processing Large Text files in PowerShell

With these quirks and limitations in mind, let's get on to processing large files with PowerShell.

Objects

For PowerShell everything is an object. This can be very useful, but it also introduces a lot of overhead. For simple file manipulation like sorting lines, splitting a file into parts, joining two files together, and getting the unique lines in a file, PowerShell sucks. Luckily PowerShell has direct access to the .NET file handling routines, which let it handle files two to four times bigger than standard PowerShell can, and much faster.

I have a PowerShell library on GitHub that uses the .NET routines to make file operations much faster.

If a file is read with the get-content function, each line is read into memory as a separate object. This provides expanded processing capability but introduces overhead. Instead, a file can be read all at once as a single object using the built-in .NET routines:

[Io.File]::ReadAllText($path)
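As a sketch of the faster approach, here is sorting and de-duplicating a word list through the .NET calls (words.txt and unique.txt are placeholders). Note that .NET resolves relative paths against the process working directory, not the current PowerShell location, so resolve paths first:

# resolve paths up front; .NET has its own idea of the working directory
$inPath  = (Resolve-Path .\words.txt).Path
$outPath = Join-Path (Get-Location) 'unique.txt'

# one call reads the whole file into a string array
$lines = [IO.File]::ReadAllLines($inPath)

# sort and de-duplicate, then write it back in one call
# (WriteAllLines writes UTF-8 without a BOM, which Linux tools like)
$unique = $lines | sort -Unique
[IO.File]::WriteAllLines($outPath, [string[]]$unique)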

Searching Large Text Files with Select-String

PowerShell has grep functionality built in with the Select-String function (alias sls). It takes standard regular expressions, but it too has some quirks. By default, if you search a file it lists each matching line preceded by the full name of the file and the line number. I seldom want this. So searching a simple file with one word per line for the word 'test' will produce output like this.

E:\somefolder\somefile.txt:134:test

Ugly! This can be fixed in one of two ways. The simple (but not complete) way is to return only the Line property.

(sls 'text' f:\powershell\filelibrary.ps1).line | add-content -encoding utf8 test.txt

Results in:

Library of text file handling routines using dot net routines for speed.
type text.txt | out-utf8 c:\example.txt
    takes the text from text.txt and writes it to example.txt encoded utf8 no bom
$reader = [System.IO.File]::OpenText($inFile)
$reader = [System.IO.File]::OpenText($inFile)
$reader = [System.IO.File]::OpenText($inFile)

A little explanation of that construction: the parentheses around sls 'text' f:\powershell\filelibrary.ps1 allow you to access any property of the object that Select-String returns. Here we only want the Line property.

To see all the parts of the MatchInfo object returned, feed the output of a Select-String command to Get-Member.
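For example, using the same search as above:

sls 'text' f:\powershell\filelibrary.ps1 | get-member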

    TypeName: Microsoft.PowerShell.Commands.MatchInfo

    Name         MemberType Definition
    ----         ---------- ----------
    Equals       Method     bool Equals(System.Object obj)
    GetHashCode  Method     int GetHashCode()
    GetType      Method     type GetType()
    RelativePath Method     string RelativePath(string directory)
    ToString     Method     string ToString(), string ToString(string directory)
    Context      Property   Microsoft.PowerShell.Commands.MatchInfoContext Context {get;set;}
    Filename     Property   string Filename {get;}
    IgnoreCase   Property   bool IgnoreCase {get;set;}
    Line         Property   string Line {get;set;}
    LineNumber   Property   int LineNumber {get;set;}
    Matches      Property   System.Text.RegularExpressions.Match[] Matches {get;set;}
    Path         Property   string Path {get;set;}
    Pattern      Property   string Pattern {get;set;}
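Any of these properties can be picked out directly. For instance, to keep just the line number and the matching line from the same search:

sls 'text' f:\powershell\filelibrary.ps1 | select LineNumber, Line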

Matches

Sometimes you want to return only the text that matches your regex and not the whole line, e.g. if you want to extract the IP addresses from a text file. The following line will do this.

sls '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' -allmatches $filename | %{$_.matches.value}

Again, sls returns MatchInfo objects, and we are selecting the value of each match.

-allmatches is used so it will find multiple IP addresses per line.

Regex Captures

Select-String supports capturing only part of the line that matches, which is useful when you want to extract pieces of a line. Capture groups are designated by parentheses and retrieved similarly to the previous example.

sls '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' -allmatches $filename | %{$_.matches | %{$_.groups[1].value}}

Multiple capture groups are retrieved by their index: groups[1] is the first parenthesized group, groups[2] the second, and so on (groups[0] is always the entire match).
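For example, assuming a hypothetical log format of "date time ip method url" (the file name and layout here are made up), two capture groups pull out the timestamp and the IP:

# hypothetical line: 2019-03-01 12:00:05 10.0.0.7 PUT /upload
sls '^(\S+ \S+) (\d{1,3}(?:\.\d{1,3}){3})' .\log.txt | %{
    '{0} -> {1}' -f $_.matches[0].groups[1].value, $_.matches[0].groups[2].value
}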

CSV Files

A large part of the files we process are csv files. PowerShell has good csv processing routines built in.

If you have a large csv log file and you only want to select parts of it, use the import-csv and export-csv functions.

A command like this will filter by date.

import-csv log.csv | where {$_.datetime -gt '2019-03-01 00:00'} | export-csv -notype filtered.csv

Another quirk of PowerShell is that it insists on putting the .NET type information in the first line of the csv (a line like #TYPE System.Management.Automation.PSCustomObject) unless you tell it not to. ☹ That is the reason for the -notype option (short for -NoTypeInformation) given to export-csv.

Processing a csv this way, PowerShell streams one row at a time through the pipeline and does not run into memory limitations.

If you want to filter on multiple fields you can do that with something like this:

where {$_.datetime -gt '2019-03-01 00:00' -and $_.method -eq 'PUT'}

You can also pare down the log file by selecting only the fields you want, and you can reorder the fields for those stupid log files that put the datetime last.

Import-csv log.csv | select datetime, method, url | export-csv -notype output.csv

If you encounter a log file with a non-standard delimiter you can designate it with the -delimiter argument.

import-csv -delim '|' log.csv

Or, for a tab-separated (tsv) file:

import-csv -delim "`t" log.csv

You can designate a header for a csv file that has no header row:

import-csv -header "field1","field2","field3" log.csv

XML Files

PowerShell can process XML files natively. The [xml] type accelerator transforms XML text and returns an XmlDocument object.

[xml](Get-Content -Raw somefile.xml) will parse the xml file and return an object. You can then navigate and select the output like a csv.
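For example (a sketch; books.xml and its element names are hypothetical):

# read the whole file as one string and cast it to an XmlDocument
$doc = [xml](Get-Content -Raw .\books.xml)

# elements become properties, so you can filter and select like a csv
$doc.books.book | where { [int]$_.year -gt 2000 } | select title, author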

There are also four cmdlets that allow you to process xml files: ConvertTo-Xml, Export-Clixml, Import-Clixml and Select-Xml.

JSON Files

PowerShell has two JSON cmdlets, ConvertFrom-Json and ConvertTo-Json. With the limitation that they only convert one complete JSON string at a time, they work well.
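A sketch of both cases (file names here are placeholders):

# a whole multi-line JSON document: read it as a single string with -Raw,
# otherwise ConvertFrom-Json is handed one incomplete line at a time
$config = Get-Content -Raw .\config.json | ConvertFrom-Json

# a JSON-lines log (one complete object per line): convert line by line
Get-Content .\events.jsonl | %{ $_ | ConvertFrom-Json } |
    where { $_.level -eq 'error' }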