Filtering Content

I haven’t had an update in a while because I’ve been working on two very large projects that aren’t quite ready for launch. I’ll have more information about the Athletics Recruiting form application, a form that will look like the Middlebury College web site but collect information about potential athletes directly into Banner, as it moves closer to launch. This post touches on one aspect of that project. The second project is a secret for now, but there should be plenty of interesting content about it beginning in March. To give a tiny hint: it will be the first site using MCMS other than the Middlebury College web site. As you might imagine, this has created some interesting conflicts and challenges.

But this post is about filtering content from one data source and preparing it for insertion into another. Specifically, I’ve been trying to move historical Athletics recruit records from a FileMaker Pro database into Banner. The FMP database is fairly unstructured, not much caring whether you enter “Lit”, “English Lit”, or “English Literature” in the field marked “SAT II Subject Test Name”. Banner is significantly less carefree about this type of data integrity.

I’ve created a program with several custom data structures to hold the values for the data import. Using “structs” is much easier than the way I used to do these types of scripted data transfers: holding the information in plain arrays indexed by position. To give you a concrete example, consider the following two code snippets:

string[] record = new string[4];
record[0] = "101010";
record[1] = "Ian";
record[2] = "McBride";
record[3] = "S";

Or….

public struct Record {
    public string ID;
    public string SRTIDEN_FIRST_NAME;
    public string SRTIDEN_LAST_NAME;
    public string SRTIDEN_MI;
}

Record record = new Record();
record.ID = "101010";
record.SRTIDEN_FIRST_NAME = "Ian";
record.SRTIDEN_LAST_NAME = "McBride";
record.SRTIDEN_MI = "S";

Now the second example definitely contains more lines of actual code, but it’s also much less ambiguous. Consider, too, what happens when, rather than 4 data fields, we really have 50-100. “Now, was birth month the fifteenth array index or the seventeenth?” versus, “OK, just type ‘record.SRTPERS_BIRTH_MON’”. Being able to name each data element the same as the database column into which you’ll insert the value is also tremendously helpful when you’re paging through a couple thousand lines of code.
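
To illustrate that last point, here’s a minimal sketch of what the insert side might look like. The table name, column list, and the System.Data.OracleClient details are my assumptions for this example, not the actual import code:

// the struct field names map directly onto the Banner column names,
// so building the parameterized insert is mostly mechanical;
// "connection" is assumed to be an already-open OracleConnection
string sql = "INSERT INTO SRTIDEN (SRTIDEN_FIRST_NAME, SRTIDEN_LAST_NAME, SRTIDEN_MI) " +
    "VALUES (:first_name, :last_name, :mi)";

OracleCommand cmd = new OracleCommand(sql, connection);
cmd.Parameters.Add(new OracleParameter("first_name", record.SRTIDEN_FIRST_NAME));
cmd.Parameters.Add(new OracleParameter("last_name", record.SRTIDEN_LAST_NAME));
cmd.Parameters.Add(new OracleParameter("mi", record.SRTIDEN_MI));
cmd.ExecuteNonQuery();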

Additionally, rather than paging through a switch statement to handle all of the special cases when translating one data encoding to another, I decided to just use a Hashtable and access it dynamically. Here are a couple more examples, starting with the old way of using a switch statement:

switch (fmp_line["coach"]) {
    case "DAN":
        record.COACH = "11111111";
        break;
    case "JME":
        record.COACH = "22222222";
        break;
    case "RDR":
        record.COACH = "33333333";
        break;
    case "SCM":
        record.COACH = "44444444";
        break;
    default:
        record.COACH = "55555555";
        break;
}

And now using a Hashtable:

// Hashtable lives in System.Collections
Hashtable htCoaches = new Hashtable();
htCoaches["DAN"] = "11111111";
htCoaches["JME"] = "22222222";
htCoaches["RDR"] = "33333333";
htCoaches["SCM"] = "44444444";
htCoaches[""] = "55555555";

record.SYRRLST_RECRUITER_PIDM = (string)htCoaches[cols[(int)htCols["coach_recruiting"]]];

To further explain, the Hashtable htCols is dynamically populated by parsing the first line of the comma separated data file that I get as an extract from FileMaker Pro. The value it returns is the index position of that column in the file, meaning that if the column position changes I don’t have to update my script. This is useful if it’s decided that the CSV needs to contain more or fewer columns. This index is passed to the string array “cols”, which contains one of the column values from each line of the CSV. The value from the CSV is matched against the value in the htCoaches Hashtable, which returns a PIDM to insert into the Banner database.
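
Here’s a minimal sketch of how that header parsing might look; the variable names other than htCols and stream_reader are my own for illustration, and it assumes the header row itself contains no quoted commas:

// read the header row once and map each column name to its index,
// so the script survives columns being added, removed, or reordered
Hashtable htCols = new Hashtable();
string header = stream_reader.ReadLine();
string[] headerNames = header.Split(',');
for (int i = 0; i < headerNames.Length; i++) {
    htCols[headerNames[i].Trim()] = i;
}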

Another interesting problem I ran into is that a comma-separated file can, and will, contain commas that are not meant to be separators. When you export a CSV from Excel, it encloses any field whose value contains a comma in quotes, so that Middlebury, VT becomes “Middlebury, VT”. Unfortunately, the way I was parsing each line of the CSV was simply to split it into chunks at every comma, which meant that “Middlebury, VT” became “Middlebury [break for next column] VT”. This would typically be the place where you’d write a complicated regular expression to make sure you weren’t grabbing “Middlebury, VT” as two separate columns, but I found a better solution.

// feed the current line of the file into a string variable
string line = stream_reader.ReadLine();

// split on the quote marks; Split('"') keeps empty entries, which
// preserves the alternating unquoted/quoted pattern we rely on below
string[] pieces = line.Split('"');
string result = "";

// go to EVERY OTHER piece in the string and look for commas
for (int i = 1; i < pieces.Length; i = i + 2) {
    pieces[i] = pieces[i].Replace(',', ' ');
}

// put it back together (without the quote marks)
for (int i = 0; i < pieces.Length; i++) {
    result += pieces[i];
}

line = result;

Note that we only touch every other piece because the first piece (index 0) is the part of the line up to the first quote mark, which will likely contain commas that really are supposed to break the line into columns. The piece between the first and second quote marks will always contain only commas that were in the original data, and so on, alternating down the line. I chose to replace these commas with a blank space character, but the code could easily be modified to substitute a string you’d never find in the source content, like “<here_is_a_comma>”, which gets translated back into a comma during post-processing.
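
If the embedded commas actually need to survive the trip, that variation might look something like this; the placeholder token here is just the example from above:

// stand in for the real commas with a token that can't occur in the data
string token = "<here_is_a_comma>";
for (int i = 1; i < pieces.Length; i = i + 2) {
    pieces[i] = pieces[i].Replace(",", token);
}

// ... then, after splitting the reassembled line into its columns,
// restore the original commas in each column value
for (int j = 0; j < cols.Length; j++) {
    cols[j] = cols[j].Replace(token, ",");
}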

One more thing before I sign off for the day. I noticed that if you have a spreadsheet with a lot of blank cells and try to save it as a CSV, you won’t always get a comma for each column, meaning that my script would think a line had fewer columns than it really should have. I found this helpful tip for making sure you get a properly formatted CSV:

  1. Select one of the cells in your header row.
  2. Press Shift + Ctrl + 8 to select the current region.
  3. Press Ctrl + H to open the Replace dialog box.
  4. Leave the “Find What” field empty.
  5. In the “Replace With” field, enter a single space character.
  6. Click Replace All.

Then save your Excel file as a CSV and you’re all set for the most fun thing in the world: parsing a CSV with regular expressions and data structures!
