I need Excel help, off to Amazon’s Mechanical Turk
Update: I already got the cleaned file, thanks to several of my readers! Appreciate the help!
OK, I downloaded the latest change.xml file from weblogs.com. If you don’t know what weblogs.com is, this is a service that most weblog tools will “ping,” or let know that someone has just published.
In the early days of blogging Dave Winer and other bloggers would watch this page like a hawk since it would display when new people had just posted. Remember, when I started blogging there were only a couple of hundred bloggers with only a few dozen posts a day. You could read this page just like many of us read TechMeme or TailRank now.
Anyway, I just downloaded the last hour and there were more than 60,000 entries in that file. Whew! OK, I went through brute force and cleaned up just the “As.” Brute force means I just went through and deleted them by hand, not using any macro or scripts.
It’s taking too long to do it by hand (60,000 URLs is too many) and, anyway, it’d be fun to redo this test over and over to see if the numbers of blogs done from each service change depending on the day of week and time of day.
Anyway, here’s what I need done. This is a perfect job for Amazon’s Mechanical Turk. That service lets you spec out a small job, and get someone who has a little extra time to do it to do it for you for a reasonable fee.
On the other hand, I’ll also ask here. Here’s what I need:
1) Take my Excel .XLS file (I’ll clean it up and put it into a column for you) and delete all the URLs that don’t come from blogspot.com; wordpress.com; livejournal.com; spaces.live.com; typepad.com.
That’s it. Easy, huh? Should take one of the programmer types here a few minutes to write an Excel macro to do that. If you’d rather me just hand you a comma-delimited text file, I can do that too. Or, you can just go get the file yourself from weblogs.com (it’s an XML file) and clean it up yourself. I just need the URLs, I don’t care about anything else.

Powered By
August 20th, 2006 at 5:46 pm
Yeah, how soon do you need it done by.
I might have a go at it
contact me if you want me to
August 20th, 2006 at 5:53 pm
How about tomorrow? Or, tonight, if you have time. I’ll contact you.
August 20th, 2006 at 6:14 pm
I just tried to download the xml file, it appears to have a wrong xml format (A string literal was not closed in one of the records), maybe will try later…
August 20th, 2006 at 6:19 pm
Here is a powershell approach.
You can paste this into a PS commandline window and you should end up with a scrubbed.csv file ready for Excel.
If you want, let me know where I can post the file directly.
$wc = New-Object Net.Webclient
$wc.DownloadString(”http://rpc.weblogs.com/changes.xml”) > changes.xml
$data = [xml](get-content changes.xml)
$data.weblogUpdates.weblog | where {$_.url -match “blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com”} | Export-Csv -NoTypeInformation scrubbed.csv
August 20th, 2006 at 6:26 pm
Robert: You came from Microsoft. You should know a way or two to do this in Excel without any macros or programmer help. I’d take a stab at it, but I left my Windows laptop at work. Maybe tomorrow if nobody else has taken a shot.
August 20th, 2006 at 6:32 pm
I sent you an excel file - I created the CSV file using a short Smalltalk script.
August 20th, 2006 at 6:32 pm
I make it
147675 in total
9726 blogspot
698 wordpress
522 livejournal
540 spaces.live
238 typepad
for the last changes.xml.
Typepad users can map different URLs to their typepad weblog, so that may skew things.
And I may have made gross errors, at 3am. :)
August 20th, 2006 at 6:40 pm
Robert, try using an Advanced Filter in Excel. Enter this as the criterion range:
url
“*wordpress*”
“*blogspot*”
“*livejournal*”
“*spaces.live*”
“*typepad*”
August 20th, 2006 at 7:00 pm
Export to a text file, and from a linux command line:
grep “blogspot.com\|wordpress.com\|livejournal.com\|spaces.live.com\|typepad.com” oldxlsfile.txt > newxlsfile.txt
August 20th, 2006 at 7:03 pm
Richard: just because someone worked at Microsoft does that mean they know how to write a kernel-level driver? So, what does that have to do with anything?
Thanks everyone for helping out! Got several versions of the cleaned file now.
August 20th, 2006 at 7:15 pm
Hi Robert,
If you have grep handy, export a csv, and run
grep -v ‘blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com’ file.csv
If you dont, email me the file, and I’ll do it.
August 20th, 2006 at 7:18 pm
I’ll organize the excel file for you tonight. I’m starting up a home business as a virtual assitant and this is the kind of thing that I would do for my clients.
August 20th, 2006 at 7:31 pm
I think a lot of use are looking at different time windows. I get
Windows Live Spaces 795
WordPress 363
LiveJournal 245
TypePad 114
Blogspot 8844
Comparing those numbers with other numbers here suggests wide variation across time. One hour is not going to be a very useful snapshot. Well except for showing huge numbers for Blogspot.com.
One thing I did not filter for but that I assume you will is duplicates. That is to say a brief scan of my output suggests that a lot of people are posting several times in the same hour window. Most of the ones I saw were in Blogspot.com but who knows how many there are.
August 20th, 2006 at 8:17 pm
I can wrap something up rather quickly if all the other solutions don’t pan out. :)
http://steve.emxsoftware.com/LINQ/Solving+all+Scobles+problems+with+Linq+to+XML
August 20th, 2006 at 8:39 pm
Robert: Learning to use several fancy tricks like pivot tables in Excel is considered pretty important at APC so I figured it’s probably important at Microsoft too ;)
August 20th, 2006 at 10:44 pm
Robert! I emailed you a small utiliy that might do the trick. Do check it out.
August 20th, 2006 at 11:56 pm
I downloaded the file at around 9:15pm PST and got a total of 5,709 from the requested sites and a breakdown of:
Blogspot: 4376 (77%)
Spaces: 574(10%)
Wordpres: 512 (9%)
LiveJournal: 159 (3%)
Typepad: 88 (2%)
Laurie
http://www.yourgotogirls.com
August 21st, 2006 at 12:11 am
I was able to do it using Zoho Creator. Please have a look at http://zohocreator.com/sugan/weblogs-tracker/
You can see
* a view that displays the data from weblogs.com that matches your criteria
* form to import the xls to Zoho Creator (using the “Import Data” link at the top of the form)
I have written a simple two line code to check if the url contains blogspot.com, wordpress.com, livejournal.com, spaces.live.com or typepad.com. Only those entries that contain this in the url will get persisted in Zoho Creator.
You can view the script at http://zohocreator.com/sugan/weblogs-tracker/73096/viewscript/
I had taken the xml from weblogs.com (Only the changes made in a span of 5 minutes). The xls that I imported can be found at http://zohocreator.com/sample/xls/weblogs_scobelizer.xls. It has 14,641 entries. But only 796 entries matched your criteria.
Thanks,
Suganya.
August 21st, 2006 at 12:21 am
I am sorry. The xls can be found at http://zohocreator.com/sample/xls/weblogs_scobelizer.xls
Thanks,
Suganya.
August 21st, 2006 at 1:01 am
I had no idea that blogspot was so dominant … I’d really like to see these numbers a oouple of times a day for a week or so. Do you think that weblogs.com might have these statistics readily available or will we have to come up with some sort of script to pull this a couple times a day for a week?
August 21st, 2006 at 1:09 am
wow - I would have used access and attach the worksheet as a table. then use the query designer. second step would be to create a report, and may be a chart for clarity…
just a thought from up north
August 21st, 2006 at 6:18 am
I just wrote a small script which will read the file and write the requested urls into a separate xml file. just copy the following code into a notepad and save it in the same folder as changes.xml. Just ignore the
tags. I could not mail it to you because my email server rejects any vbscript present in the mail.dim fs, ts, ts1
set fs=CreateObject("Scripting.FileSystemObject")
dim s
set ts=fs.opentextfile("changes.xml")
set ts1=fs.createtextfile("changes1.xml")
do while ts.atendofstream=false
s=ts.readline
if len(trim(s))>0 then
if instr(1,s,"url") >0 then
if instr(1,s,"blogspot.com")>0 then
ts1.writeline s
elseif instr(1,s,"wordpress.com")>0 then
ts1.writeline s
elseif instr(1,s,"livejournal.com")>0 then
ts1.writeline s
elseif instr(1,s,"spaces.live.com")>0 then
ts1.writeline s
elseif instr(1,s,"typepad.com")>0 then
ts1.writeline s
end if
else
ts1.writeline s
end if
end if
loop
August 21st, 2006 at 10:46 am
You know, a *real* geek would have just hacked some code in perl or something. :-P
August 21st, 2006 at 12:43 pm
Yeah, wget grep / egrep and alike would do the trick most neatly - if you really would like to go that way, I can give you the stand alone WINDOWS executables.
This is how I earned my living in controlling the last 10 years: using awk, grep and co to prepare data for excel.
As for pure Excel I would agree on use ‘count if’ formula, you would always have your list which can be updated and the system would use “count if” with all the blogservices you would like to look at.
August 22nd, 2006 at 10:56 am
[...] Robert Scoble nearly got as far as submitting a task to Mechanical Turk but simply by discussing this on his blog, the task was completed before getting anyway near the Amazon site. [...]
August 22nd, 2006 at 12:07 pm
[...] Just learned about this from Scoble. Sounds interesting. Anyone that has a lot on their plate can offload stuff to a Mechanical Turk [...]
December 18th, 2006 at 9:33 am
Hello, i love scobleizer.com! Let me in, please :)
November 6th, 2007 at 2:02 pm
http://exceltipsandtricks.blogspot.com/
let me help u