Village Pump (technical) - Widget For Downloading Tables Into .csv Format?

Widget For Downloading Tables Into .csv Format?

Hi, I'm not sure if this is the right place to ask this. I was referred by the Teahouse to come here.

I often find myself looking at a table in Wikipedia (or elsewhere) and thinking to myself: "I'd like to download that data into Excel and rearrange it or merge it with data from elsewhere." But if I just copy the table and paste it, I usually get a long string of information. Instead of tabs between fields and a carriage return at the end of the line, there are just spaces between each of the entries. Using tools like BBedit, Tex-Edit Plus, Word, and lots of perseverance, I can usually recreate the table in a form that I can upload to Excel from this data stream, but it is often a difficult process (especially if the cells of the table have textual data with spaces).

So I would really like someone to create a widget that would make this task easy. What I envision: at the top of each table in Wikipedia there would be an icon. If you clicked on the icon, it would automatically download the table in .csv (comma-separated values) format which could then be easily imported into Excel or another spreadsheet program (Google Docs, etc.). I assume it would be relatively easy to create such a widget since tables have a regular and simple markup language (though text that extends over two or more cells might be a bit tricky). There are already widgets for converting spreadsheets into Wiki's markup language, but as far as I can tell, no easy way to go the other direction.

I see that there is a command line tool called wiki2csv at Wikipedia: Tools#Export: Conversion to other formats, but it seems much too complicated for the average user. I want it to be possible to go to a page like List of Justices of the Supreme Court of the United States and just click on an icon to download the table data in .csv format.

Is this something that already exists somewhere? Is it a good idea? Can/will someone do it? Randy Schutt (talk) 15:40, 12 December 2012 (UTC)

Workin' on it. Writ Keeper ⚇♔ 15:43, 12 December 2012 (UTC)
Not exactly the way you want to do it, but you can, instead, query the web page from Excel. In Excel 2007, it's at Data→Get External Data. You enter the URL, it loads the page in a special dialog, and you then choose which objects you want to import into the spreadsheet. I just tried it with the page you mentioned and it's not ideal – it doesn't manage to pick out the tables separately like it does with other sites I've used it on, but the data you need is there, embedded within a bunch of other stuff, starting at cell FA194. —— 16:04, 12 December 2012 (UTC)
Okay, try User:Writ Keeper/Scripts/tableConverter.js. It's a bit ugly, but it should work (possibly not in IE, though). When you get the download prompt, do pay attention to what the file is named; the file name starts out as weird garbage in my tests, and I don't know how to get around that. So just make sure to name it something sensible, and remember to end it with the ".csv" extension. Any advice on how to sex up the appearance or output would be welcome. Writ Keeper ⚇♔ 17:05, 12 December 2012 (UTC)
Neither the 2004 or 2008 versions of Macintosh Excel seem to be able to import from a url. The 2008 version let's me import an html file from my desktop (File-Import...) but that, of course, involves an extra few steps (edit page on Wikipedia, copy table lines, paste into a new text document, save as an html file). I can do this, but it would be nice for it to be more automated. I don't know how to run the script that Writ_Keeper has created. Perhaps someone else can try it. I was hoping there would be something as simple as what the Census Bureau does, for example, at the bottom of this page: http://quickfacts.census.gov/qfd/states/39000.html . Randy Schutt (talk) 17:52, 12 December 2012 (UTC)
Oh, sorry, just put importScript("User:Writ Keeper/Scripts/tableConverter.js"); into your common.js page. Once you do that (you might have to bypass the browser cache afterwards; instructions for how to do that are at this page), you should see a link that will appear below every wikitable, and when you click on it, it'll prompt you to download a file. As I said, just rename the file to something sensible with the ".csv" file extension, and it should be good. Writ Keeper ⚇♔ 18:19, 12 December 2012 (UTC)
Thanks. I now have a nice "Export as CVS" line at the bottom of the table (though note that it is actually part of the table - a new cell in the first column and below the last row). But when I click on it I don't get the hoped for results. I'm running OS 10.5.8 on my iMac. On Safari 5.0.6, I get a new page with all the table's html code splayed across the screen. With Firefox 16.0.2, I can save the file (with a weird name zMpfGqwp-1.part), which I can rename and open in Excel, but the file has no content. With Chrome 21.0.1180.90 I get a file ("Download") that I can rename and open in Excel. It has the table information, but still includes all the anchors and span html code (could that be stripped out?). Randy Schutt (talk) 18:56, 12 December 2012 (UTC)
Randy, your few extra steps to get HTML into Excel won't work. The text shown on Wikipedia's edit screen is in wikitext format, not HTML. It doesn't turn into HTML format just because you save it in an HTML file. A technique that should work (and is also fewer steps) is to use your browser's File ‣ Save As... command while reading the page. In the Save dialog, select HTML (without images) as the file type. (The default in some browsers is a "web archive" or with-images format such as MHTML. I'm not sure Excel can handle that.) Writ Keeper's script is the easiest option if it works for you (I've not tried it myself). – PartTimeGnome (talk | contribs) 18:49, 12 December 2012 (UTC)
PartTimeGnome, oh yes, you're right. In the past, I've used a text editor to put the proper html headers on the file to make it into a proper html file. But for converting Wikipedia, your solution would be much better/easier. This process doesn't work if I use Excel 2003 for Mac (since it can't open html files), but it does with Excel 2008 for Mac. So this is a solution. Still, I would like a one-button solution and especially to have this be a normal part of Wikipedia (rather than having to install scripts, etc.) But, so far, Writ Keeper's script isn't working very well for me. Maybe I'm asking for too much. Randy Schutt (talk) 19:20, 12 December 2012 (UTC)
Hmm, it's working for me in Firefox and Chrome, though I did have to fix a few bugs. Perhaps you should bypass your cache again and retry. What table are you trying this on? Writ Keeper ⚇♔ 19:27, 12 December 2012 (UTC)
one small problem is that tables in enwiki often contain junk in the form some-string-for-sorting (it's not "junk" per se, but it cause pain for the converter. you probably want to do something like
//instead of: var str = "" + $(el).text; //you want something like: var clone = $(el).clone; clone.find('*').filter(function{return $(this).css('display')=='none'}).remove; var str = clone.text;
peace - קיפודנחש (aka kipod) (talk) 20:42, 12 December 2012 (UTC)
Writ_Keeper, it works pretty well now: On this page List of Justices of the Supreme Court of the United States, Mac OS 10.5.8, Safari 5.0.6 opens up a new page with all the items enclosed in quotation marks and delimited by commas, and all the anchors and span code has been stripped out. I can copy and paste it into Excel and it converts without much effort. Firefox 16.0.2 worked for me once I turned off TACO (oops, my bad). And Chrome 21.0.1180.90 works pretty well too. The script works well on the two tables on this page too: United States Senate elections, 2014 and the second table has lots of empty cells. It mostly works on this page too List of United States Supreme Court Justices by time in office, but the term in office number is not quite right: for example, for Douglas it is 700413358000000000013,358 instead of 13,358. Perhaps the comma in the number is screwing up the conversion. But overall, this is a good solution for me. Thank you very much. Now my wishlist is: make the "Export to cvs" text smaller, make it into a button instead of a cell in the table, put it at the top right of the table instead of the bottom, and make it available to everyone without them having to put the script call in common.js.
what cause the problem you report is not the comma, it's exactly what i was talking about: an invisible field that's placed in sortable tables for sorting purposes. the remedy would be to remove the invisible data before generating the file. my snippet (admittedly untested) is supposed to do that. peace - קיפודנחש (aka kipod) (talk) 21:52, 12 December 2012 (UTC)
Yes, in this case (List of United States Supreme Court Justices by time in office) the Term in Days column is using the Template:Nts command to ensure sortability. Randy Schutt (talk) 22:46, 12 December 2012 (UTC)
on 2nd thought, my snippet is an overshoot. all you have to do is replace var str = "" + $(el).text; with var str = $(':visible', el).text; peace - קיפודנחש (aka kipod) (talk) 00:01, 13 December 2012 (UTC)
Sorry, I was distracted by shenanigans elsewhere. That looks like a great solution, kipod, thanks! Writ Keeper ⚇♔ 01:17, 13 December 2012 (UTC)
I just checked and the new version works great. Thank you very much Writ Keeper and Kipon for your good work. Randy Schutt (talk) 16:44, 14 December 2012 (UTC)
I just added instructions for using this on Writ Keepers page (User:Writ Keeper/Scripts/tableConverter). I hope this is ok.

Read more about this topic:  Village Pump (technical)

Famous quotes containing the word tables:

    The evening falters. Couples in their coats
    Are leaving gaps already, and the rest
    Move tables closer.
    Philip Larkin (1922–1986)