By Alex Plough February 10, 2014
Let’s be honest, nobody decided to become a journalist so they could use Excel every day.
People inclined to write for a living tend to be much more comfortable using Word than those other programs in Microsoft’s Office suite, and many journalists are uncomfortable with – or even downright intimidated by – Excel.
Which is too bad, because a cub reporter today armed with the most basic knowledge of spreadsheets can query more information in a few hours than Woodward and Bernstein could have managed in weeks of cross referencing hand-written records.
If your goal is to write elegant pieces about the arts or about travel, then this article is not for you. But if you want to write about business, public policy or science – and if you are committed to uncovering facts and driving the news agenda – then there is no better set of skills to learn than a basic command of spreadsheet programs like Excel.
Excel is one of those programs that inspire Zen-like emotions amongst it devotees. A practiced user can navigate the data almost without thinking, and a well-structured spreadsheet can be as elegant as a complicated chess move or a beautiful piece of music.
This level of fanaticism is no reason to be intimated. At their most basic level, spreadsheets are simply a way to collect, store, analyze and draw conclusions – as well as story ideas – from information. More importantly, anyone can master the basics.
Excel’s real power lies in its versatility. What started out as a tool for accountants led the computer-driven revolution of the financial industry in the mid-1980s. Traders and analysts were suddenly able quickly compare tons of performance data for companies and industries in real time, and professional investors could managed their portfolio at an unprecedented level of detail.
At the same time, it allowed social scientists and economists to perform calculations without the need for expensive specialized software. Excel also fueled the rise of Computer Assisted Reporting (CAR for short) in newsrooms. One of the pioneers of this movement is Professor Philip Meyer, whose 1973 book ‘Precision Journalism’ set out the blueprint for data-driven journalism.
If you have Microsoft Office installed on your computer, then Excel should already be there. Look for the “Office” tab under “All Programs” and select the Excel icon. Alternatives to Excel include the free software Apache Open Office and also Google Spreadsheets (create a free Google account and go to Google Drive). While both of these programs are excellent for free software, they lack the full functionality of Excel so it is worth buying a copy.
Go ahead and open up Excel and create a new blank workbook. You are looking at container with lots of empty compartments (called cells) for your pieces of data. Notice that each cell has a reference number based on the row and column.
The first step is to understand the difference between a row and a column, this is a simple but sometimes tricky distinction to get your head around. In a well-structured spreadsheet, each row should refer to a single and distinct ‘thing’. Each column, on the other hand, refers to a characteristic describing the ‘thing’ that each row represents.
One test is to ask yourself when you’re looking at data in your spreadsheet, ‘if I changed the order of these rows, would I lose any information?’ If the answer is yes, then the structure of your data needs some work. This will make more sense when we start working with real data.
We’ll be using this spreadsheet from the US Internal Revenue Service (IRS), which holds data on income tax returns from 1913 to 2005. It is one of a number of tables looking at income tax returns found on the IRS website, which is a good place to explore for useful data.
Download the spreadsheet and open it in Excel. Before we can do any analysis, the first step is to work out what the rows and columns represent. (At this stage it is important not to make too many assumptions what this could be. If it’s not obvious then you should take some time to study the data, talk to experts who use it regularly or call up the institution that produced it.
In this case, the ‘thing’ each row represents is a tax year. The columns show various characteristics about each tax year, such as the number of returns and a breakdown of different types of income people reported in that year. There are some simple questions you should always ask about each dataset you work with.
Now to start thinking about what this data can tell you. This is where the journalistic instinct kicks in. Imagine you are talking with someone in a bar who has all this information in their head. What would you ask them? In this case you might start with broad questions and then narrow it down, for example:
Before we can answer these questions, we should do some data hygiene. We want to remove the formatting of this spreadsheet to make it as easy to analyze as possible. To do that we will select, copy and paste all the data into a new sheet.
Start by highlighting the cell containing the year “1913” (column A, row 9) by clicking it with your mouse, this will be the top left corner of our selection. Next scroll down to row 101 and scroll right to the end of our columns. When you can see the bottom right of the data, hold down the shift key and select the cell containing “980259” (column K, row 101).
When you use this method, called a shift click, Excel will select everything in between (and including) the top left and bottom right cells. Just remember to hold down the shift key when choosing the bottom right cell of your selection. A quick trick to check you have picked the right cells is to hold down the ‘Ctrl’ key and tap ‘.’ (Period), which will quickly take you to the four corners of your data.
Copy these selected cells, known in Excel speak as a ‘range’ of cells, using Ctrl+C and look to the bottom right of the Excel window where you’ll see a tab called ‘Sheet1’, clicking the ‘+’ icon next to it will open a new blank sheet. In the new sheet choose the second cell down in column A, right click and choose ‘Paste Special’ and choose ‘Values’ or ‘Text’. We left the first row blank so we can add the names of each column. As there are not too many we can do this by hand. Refer to the original sheet and type in the appropriate names.
The main reason to create a ‘clean’ copy of the data is that you always want to have an untouched copy of the original data. This is a rule to live (or die) by. Before publishing any story based on this data you must be able to repeat your calculations, which can become very tricky once the original data has changed.
Data-driven stories have a greater authority because of their quantitative backbone, however this means your standards of proof are also higher. It is all too easy to manipulate statistics and readers will pounce on mistakes, so you must be able to confidently describe every step you took in your analysis. Another rule is save before each step and give your worksheet a new name (e.g. IRS data 1, IRS data 2 etc), that way if you do find an error it is easier to track it down to a specific version of your data.
There are two ways to answer question one. The first is simple to re-order the rows based on the values in column K (‘Total tax liability’). Highlight the top left cell in the sheet (Row A, Column 1), then look at the top of Excel for a button called ‘Data’. Selecting this will bring up the tools you need to work with data. In the ‘sort and filter’ section, click the ‘Filter’ icon (which is shaped like a funnel).
Now each column should have a small downwards arrow button, make sure that all the columns in your range have this icon. If they don’t, then make sure you are not highlighting just a single column and try scrolling across to make sure there are no empty columns splitting your data. Once you’ve made sure every column has this arrow, click the icon on ‘Total tax liability’ to give you a drop-down menu of options and choose ‘Sort Largest to Smallest’.
We can now see in which year US citizens paid the most tax (the answer should be the year 2000). But if you think about it, this may not be the most accurate answer to our question, as incomes have also risen through the years. A more revealing figure might be the amount paid in taxes relative to amount of income, or rather the percentage of total income that gets eaten up by taxes.
Luckily Excel makes this very easy to find out. First get rid of the Filter buttons by clicking the Funnel icon. Then create a new column heading after ‘Total tax liability’ called ‘Tax as % of Income’. In the first empty row type in this formula and press enter:
=K2/C2
To calculate percentages we divide the part by the whole. In this case we divide ‘Total tax liability’ (K2) by ‘Total income’ (C2). Now highlight this new value and hover your mouse over its bottom right hand corner, the mouse will turn into a small ‘+’ sign. Double click when this happens and Excel will repeat the formula for all the rows in the column.
We can now re-sort the new column using the filter tool to find out in which was the best year for the taxman. The answer is 1981 (Not everything was better in the 80s.) What other percentages might be interesting to look at? How about salaries and wages as a percentage of total income? This reveals a dramatic 50 year shift in how American are making their money.
Think about what the data cannot tell you. In this case, we can only see income that people have reported to the IRS, so it says nothing about the amount lost to tax evasion (the so-called ‘Tax Gap’). Data does not exist in a vacuum. It’s important to know who collected the data, how it was collected and why. Thinking about this context of a dataset will also spur ideas of other data you can combine it with, for example the income tax rates in each year.
The possible combinations are limitless. As we know that each row refers to a year, we can look for any data that can be summarized annually. To get an idea of the vast amounts of data available, look at the St. Louis Federal Reserve data portal (also known as FRED) – just one of many useful websites.
A scientific approach helps with this process: Start with a hypothesis to test (for example, working Americans did not benefit from the President Bush era tax cuts) then think about what data you would need to disprove this statement, find it and start doing some simple calculations.
More often than not the data you’ll be working with is the result of a human being filling out a form. People make mistakes, so remain skeptical. In our example we can be fairly confident that the IRS knows what it’s doing, but not all sources are as reliable. As mentioned earlier, think about who collected the data. Was it a team of government statisticians or was it a lone academic researcher who wanted to make headlines?
Your first question should always be, ‘Does this result make sense?’ If it looks too good to be true, then it usually is. As a rule of thumb, repeat your calculations three times before running to an editor with an exciting result. On any major story you should call up whoever put the database together, or someone who uses it regularly, and take them step-by-step through your method to check for mistakes. While they don’t have to agree with your conclusions, the last thing you want is a simple math error to ruin your whole story.
Try not to become dispirited if your first steps in Excel are a little awkward, these skills do not come naturally to the vast majority of people. Just like learning any new program or language, it takes practice along with trial and error when things go wrong. Once the first few stories are under your belt, your comfort level will quickly improve and you’ll be on your way to mastering one of the most useful skills a business journalist can have.
This entry was posted on Monday, February 10th, 2014 at 10:30 am. It is filed under Tools & Resources and tagged with business journalism, Data, Excel, tutorial. You can follow any responses to this entry through the RSS 2.0 feed.
Comments are closed.