Shout Future: Artificial Intelligence Course

In this industry, it's a tired old cliche to say that we're building the future. But that's true now more than at any time since the Industrial Revolution. The proliferation of personal computers, laptops, and cell phones has changed our lives, but by replacing or augmenting systems that were already in place. Email supplanted the post office; online shopping replaced the local department store; digital cameras and photo sharing sites such as Flickr pushed out film and bulky, hard-to-share photo albums. AI presents the possibility of changes that are fundamentally more radical: changes in how we work, how we interact with each other, how we police and govern ourselves.

Fear of a mythical "evil AI" derived from reading too much sci-fi won't help. But we do need to ensure that AI works for us rather than against us; we need to think ethically about the systems that we're building. Microsoft's CEO, Satya Nadella, writes:

The debate should be about the values instilled in the people and institutions creating this technology. In his book Machines of Loving Grace, John Markoff writes, 'The best way to answer the hard questions about control in a world full of smart machines is by understanding the values of those who are actually building these systems.' It's an intriguing question, and one that our industry must discuss and answer together.

What are our values? And what do we want our values to be? Nadella is deeply right in focusing on discussion. Ethics is about having an intelligent discussion, not about answers, as such—it's about having the tools to think carefully about real-world actions and their effects, not about prescribing what to do in any situation. Discussion leads to values that inform decision-making and action.

The word "ethics" comes from "ethos," which means character: what kind of a person you are. "Morals" comes from "mores," which basically means customs and traditions. If you want rules that tell you what to do in any situation, that's what customs are for. If you want to be the kind of person who exercises good judgement in difficult situations, that's ethics. Doing what someone tells you is easy. Exercising good judgement in difficult situations is a much tougher standard.

Exercising good judgement is hard, in part, because we like to believe that a right answer has no bad consequences; but that's not the kind of world we have. We've damaged our sensibilities with medical pamphlets that talk about effects and side effects. There are no side effects; there are just effects, some of which you might not want. All actions have effects. The only question is whether the negative effects outweigh the positive ones. That's a question that doesn't have the same answer every time, and doesn't have to have the same answer for every person. And doing nothing because thinking about the effects makes us uncomfortable is, in fact, doing something.

The effects of most important decisions aren't reversible. You can't undo them. The myth of Pandora's box is right: once the box is opened, you can't put the stuff that comes out back inside. But the myth is right in another way: opening the box is inevitable. It will always be opened; if not by you, by someone else. Therefore, a simple "we shouldn't do this" argument is always dangerous, because someone will inevitably do it, for any possible "this." You may personally decide not to work on a project, but any ethics that assumes people will stay away from forbidden knowledge is a failure. It's far more important to think about what happens after the box has been opened. If we're afraid to do so, we will be the victims of whoever eventually opens the box.

Finally, ethics is about exercising judgement in real-world situations, not contrived situations and hypotheticals. Hypothetical situations are of very limited use, if not actually harmful. Decisions in the real world are always more complex and nuanced. I'm completely uninterested in whether a self-driving car should run over the grandmothers or the babies. An autonomous vehicle that can choose which pedestrian to kill surely has enough control to avoid the accident altogether. The real issue isn't who to kill, where either option forces you into unacceptable positions about the value of human lives, but how to prevent accidents in the first place. Above all, ethics must be realistic, and in our real world, bad things happen.

That's my rather abstract framework for an ethics of AI. I don't want to tell data scientists and AI developers what to do in any given situation. I want to give scientists and engineers tools for thinking about problems. We surely can't predict all the problems and ethical issues in advance; we need to be the kind of people who can have effective discussions about these issues as we anticipate and discover them.

Talking through some issues

What are some of the ethical questions that AI developers and researchers should be thinking about? Even though we're still in the earliest days of AI, we're already seeing important issues rise to the surface: issues about the kinds of people we want to be, and the kind of future we want to build. So, let's look at some situations that made the news.

Pedestrians and passengers

The self-driving car/grandmother versus babies thing is deeply foolish, but there's a variation of it that's very real. Should a self-driving car that's in an accident situation protect its passengers or the people outside the car? That's a question that is already being discussed in corporate board rooms, as it was at Mercedes recently, which decided that the company's duty was to protect the passengers rather than pedestrians. I suspect that Mercedes' decision was driven primarily by accounting and marketing: who will buy a car that will sacrifice the owner to avoid killing a pedestrian? But Mercedes made an argument that's at least ethically plausible: they have more control over what happens to the person inside the car, so better to save the passenger than to roll the dice on the pedestrians. One could also argue that Mercedes has an ethical commitment to the passengers, who have put their lives in the hands of their AI systems.

The bigger issue is to design autonomous vehicles that can handle dangerous situations without accidents. That's the real ethical choice. How do you trade off cost, convenience, and safety? It's possible to make cars that are more safe or less safe; AI doesn't change that at all. It's impossible to make a car (or anything else) that's completely safe, at any price. So, the ethics here ultimately come down to a tradeoff between cost and safety, to ourselves and to others. How do we value others? Not grandmothers or babies (who will inevitably be victims, just as they are now, though hopefully in smaller numbers), but passengers and pedestrians, Mercedes' customers and non-customers? The answers to these questions aren't fixed, but they do say something important about who we are.

Crime and punishment

COMPAS is commercial software used in many state courts to recommend prison sentences, bail terms, and parole. In 2016, ProPublica published an excellent article showing that COMPAS consistently scores blacks as greater risks for re-offending than whites who committed similar or more serious crimes.

Although COMPAS has been secretive about the specifics of their software, ProPublica published the data on which their reports were based. Abe Gong, a data scientist, followed up with a multi-part study, using ProPublica's data, showing that the COMPAS results were not "biased." Abe is very specific: he means "biased" in a technical, statistical sense. Statistical bias is a statement about the relationship between the outputs (the risk scores) and the inputs (the data). It has little to do with whether we, as humans, think the outputs are fair.

Abe is by no means an apologist for COMPAS or its developers. As he says, "Powerful algorithms can be harmful and unfair, even when they're unbiased in a strictly technical sense." The results certainly had disproportionate effects that most of us would be uncomfortable with. In other words, they were "biased" in the non-technical sense. "Unfair" is a better word that doesn't bring in the trappings of statistics.

The output of a program reflects the data that goes into it. "Garbage in, garbage out" is a useful truism, especially for systems that build models based on terabytes of training data. Where does that data come from, and does it embody its own biases and prejudices? A program's analysis of the data may be unbiased, but if the data reflects arrests, and if police are more likely to arrest black suspects, while letting whites off with a warning, a statistically unbiased program will necessarily produce unfair results. The program also took into account factors that may be predictive, but that we might consider unfair: is it fair to set a higher bail because the suspect's parents separated soon after birth, or because the suspect didn't have access to higher education?

There's not a lot that we can do about bias in the data: arrest records are what they are, and we can't go back and un-arrest minority citizens. But there are other issues at stake here. As I've said before, I'm much more concerned about what happens behind closed doors than what happens in the open. Cathy O'Neil has frequently argued that secret algorithms and secret data models are the real danger. That's really what COMPAS shows. It is almost impossible to discuss whether a system is unfair if we don't know what the system is and how it works. We don't just need open data; we need to open up the models that are built from the data.

COMPAS demonstrates, first, that we need a discussion about fairness, and what that means. How do we account for the history that has shaped our statistics, a history that was universally unfair to minorities? How do we address bias when our data itself is biased? But we can't answer these questions if we don't also have a discussion about secrecy and openness. Openness isn't just nice; it's an ethical imperative. Only when we understand what the algorithms and the data are doing, can we take the next steps and build systems that are fair, not just statistically unbiased.

Child labor

One of the most penetrating remarks about the history of the internet is that it was "built on child labor." The IPv4 protocol suite, together with the first implementations of that suite, was developed in the 1980s, and was never intended for use as a public, worldwide, commercial network. It was released well before we understood what a 21st century public network would need. The developers couldn't foresee more than a few tens of thousands of computers on the internet; they didn't anticipate that it would be used for commerce, with stringent requirements for security and privacy; putting a system on the internet was difficult, requiring handcrafted static configuration files. Everything was immature; it was "child labor," technological babies doing adult work.

Now that we're in the first stages of deploying AI systems, the stakes are even higher. Technological readiness is an important ethical issue. But like any real ethical issue, it cuts both ways. If the public internet had waited until it was "mature," it probably would never have happened; if it had happened, it would have been an awful bureaucratic mess, like the abandoned ISO-OSI protocols, and arguably no less problematic. Unleashing technological children on the world is irresponsible, but preventing those children from growing up is equally irresponsible.

To move that argument to the 21st century: my sense is that Uber is pushing the envelope too hard on autonomous vehicles. And we're likely to pay for that, in vehicles that perhaps aren't as safe as they should be, or that have serious security vulnerabilities. (In contrast, Google is being very careful, and that care may be why they've lost some key people to Uber.) But if you go to the other extreme and wait until autonomous vehicles are "safe" in every respect, you're likely to end up with nothing: the technology will never be deployed. Even if it is deployed, you will inevitably discover risk factors that you didn't foresee, and couldn't have foreseen without real experience.

I'm not making an argument about whether autonomous vehicles, or any other AI, are ready to be deployed. I'm willing to discuss that, and if necessary, to disagree. What's more important is to realize that this discussion needs to happen. Readiness itself is an ethical issue, and one that we need to take seriously. Ethics isn't simply a matter of saying that any risk is acceptable, or (on the other hand) that no risk is acceptable. Readiness is an ethical issue precisely because it isn't obvious what the "right" answer is, or whether there is any "right" answer. Is it an "ethical gray area"? Yes, but that's precisely what ethics is about: discussing the gray areas.

The state of surveillance

In a chilling article, The Verge reports that police in Baltimore used a face identification application called Geofeedia, together with photographs shared on Instagram, Facebook, and Twitter, to identify and arrest protesters. The Verge's report is based on a more detailed analysis by the ACLU. Instagram and the other companies quickly terminated Geofeedia's account after the news went public, though they willingly provided the data before it was exposed by the press.

Applications of AI to criminal cases quickly get creepy. We should all be nervous about the consequences of building a surveillance state. People post pictures to Instagram without thinking of the consequences, even when they're at demonstrations. And, while it's easy to say "anything you post should be assumed to be public, so don't post anything that you wouldn't want anyone to see," it's difficult, if not impossible, to think about all the contexts in which your posts can be put.

The ACLU suggests putting the burden on the social media companies: social media companies should have "clear, public, and transparent policies to prohibit developers from exploiting user data for surveillance." Unfortunately, this misses the point: just as you can't predict how your posts will be used or interpreted, who knows the applications to which software will be put? If we only have to worry about software that's designed for surveillance, our task is easy. It's more likely, though, that applications designed for innocent purposes, like finding friends in crowds, will become parts of surveillance suites.

The problem isn't so much the use or abuse of individual Facebook and Instagram posts, but the scale that's enabled by AI. People have always seen other people in crowds, and identified them. Law enforcement agencies have always done the same. What AI enables is identification at scale: matching thousands of photos from social media against photos from drivers' license databases, passport databases, and other sources, then taking the results and crossing them with other kinds of records. Suddenly, someone who participates in a demonstration can find themselves facing a summons over an old parking ticket. Data is powerful, and becomes much more powerful when you combine multiple data sources.

We don't want people to be afraid of attending public gatherings, or in terror that someone might take a photo of them. (A prize goes to anyone who can find me on the cover of Time. These things happen.) But it's also unreasonable to expect law enforcement to stick to methodologies from the 80s and earlier: crime has certainly moved on. So, we need to ask some hard questions—and "should law enforcement look at Instagram" is not one of them. How does automated face recognition at scale change the way we relate to each other, and are those changes acceptable to us? Where's the point at which AI becomes harassment? How will law enforcement agencies be held accountable for the use, and abuse, of AI technologies? Those are the ethical questions we need to discuss.

Our AIs are ourselves

Whether it's fear of losing jobs or fear of a superintelligence deciding that humans are no longer necessary, it's always been easy to conjure up fears of artificial intelligence.

But marching to the future in fear isn't going to end well. And unless someone makes some fantastic discoveries about the physics of time, we have no choice but to march into the future. For better or for worse, we will get the AI that we deserve. The bottom line of AI is simple: to build better AI, be better people.

That sounds trite, and it is trite. But it's also true. If we are unwilling to examine our prejudices, we will implement AI systems that are "unfair" even if they're statistically unbiased, merely because we won't have the interest to examine the data on which the system is trained. If we are willing to live under an authoritarian government, we will build AI systems that subject us to constant surveillance: not just through Instagrams of demonstrations, but in every interaction we take part in. If we're slaves to a fantasy of wealth, we won't object to entrepreneurs releasing AI systems before they're ready, nor will we object to autonomous vehicles that preferentially protect the lives of those wealthy enough to afford them.

But if we insist on open, reasoned discussion of the tradeoffs implicit in any technology; if we insist that both AI algorithms and models are open and public; and if we don't deploy technology that is grossly immature, but also don't suppress new technology because we fear it, we'll be able to have a healthy and fruitful relationship with the AIs we develop. We may not get what we want, but we'll be able to live with what we get.

Walt Kelly said it best, back in 1971: "we have met the enemy and he is us." In a nutshell, that's the future of AI. It may be the enemy, but only if we make it so. I have no doubt that AI will be abused and that "evil AI" (whatever that may mean) will exist. As Tim O'Reilly has argued, large parts of our economy are already managed by unintelligent systems that aren't under our control in any meaningful way. But evil AI won't be built by people who think seriously about their actions and the consequences of their actions. We don't need to foresee everything that might happen in the future, and we won't have a future if we refuse to take risks. We don't even need complete agreement on issues such as fairness, surveillance, openness, and safety. We do need to talk about these issues, and to listen to each other carefully and respectfully. If we think seriously about ethical issues and build these discussions into the process of developing AI, we'll come out OK.

To create better AI, we must be better people.

February 18, 2017

Executive Summary

For the O’Reilly Data Science Salary Survey, we’ve analyzed input from 983 respondents working in the data space, across a variety of industries, representing 45 countries and 45 US states. Through the results of our 64-question survey, we’ve explored which tools data scientists, analysts, and engineers use, which tasks they engage in, and, of course, how much they make.

Key findings include:

Python and Spark are among the tools that contribute most to salary.
Among those who code, the highest earners are the ones who code the most.
SQL, Excel, R and Python are the most commonly used tools.
Those who attend more meetings earn more.
Women make less than men, for doing the same thing.
Country and US state GDP serve as a decent proxy for geographic salary variation (not as a direct estimate, but as an additional input for a model).
The most salient division between tool and task usage is between those who mostly use Excel, SQL, and a small number of closed source tools, and those who use more open source tools and spend more time coding.
R is used across this division: even people who don’t code much or use many open source tools use R.
A secondary division emerges among the coding half, separating a younger, Python-heavy data scientist/analyst group from a more experienced data scientist/engineer cohort that tends to use a high number of tools and earns the highest salaries.

To see our complete model and input your own metrics to predict salary, see Appendix B: The Regression Model (but beware—there’s a transformation involved: don’t forget to square the result!).

Introduction

O’Reilly Media has collected survey data from data scientists, engineers, and others in the data space, about their skills, tools, and salary. Across our four years of data, many key trends are more or less constant: median salaries, top tools, and correlations among tool usage. For this year’s analysis, we collected responses from September 2015 to June 2016, from 983 data professionals.

In this report, we provide some different approaches to the analysis, in particular conducting clustering on the respondents (not just tools). We have also adjusted the linear model for improved accuracy, using a square root transform and publicly available data on geographical variation in economies. The survey itself also included new questions, most notably about specific data-related tasks and any change in salary.

Salary: The Big Picture

The median base salary of the entire sample was $87K. This figure is slightly lower than in previous years (last year it was $91K), but this discrepancy is fully attributable to shifts in demographics: this year’s sample had a higher share of non-US respondents and respondents aged 30 or younger. Three-fifths of the sample came from the US, and these respondents had a median salary of $106K.

Understanding Interquartile Range

For a number of survey questions, we show graphs of answer shares and the median salaries of respondents who gave particular answers. While median salary is probably the best number to compare how much two groups of people make, it doesn’t say anything about the spread or variation of salaries. In addition to the median, we also show the interquartile range (IQR): two numbers that delineate the salaries of the middle 50%. This range is not a confidence interval, nor is it based on standard deviations.

As an example, the IQR for US respondents was $80K to $138K, meaning one quarter of US respondents had salaries lower than $80K and one quarter had salaries higher than $138K. Perhaps more illustrative of the value of the IQR is comparing the US Northeast and Midwest: the Northeast has a higher median salary ($105K vs. $98K), but the third quartile cutoffs are $133K for the Northeast and $138K for the Midwest. This indicates that there is generally more variation in Midwest salaries, and that among top earners, salaries might be even higher in the Midwest than in the Northeast.
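The comparison above can be sketched in a few lines of Python. This is only an illustration: the two salary lists are invented (in thousands of dollars) and are not survey data, and the "inclusive" quartile method is one of several reasonable conventions, not necessarily the one used in the report.

```python
import statistics


def median_and_iqr(salaries):
    """Return (median, (first_quartile, third_quartile)) for a list of salaries."""
    # quantiles(..., n=4) returns the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical salaries in thousands of dollars, invented for illustration only.
region_a = [70, 85, 95, 105, 110, 120, 133]
region_b = [60, 75, 90, 98, 115, 138, 160]

median_a, iqr_a = median_and_iqr(region_a)
median_b, iqr_b = median_and_iqr(region_b)

print("Region A: median", median_a, "IQR", iqr_a)
print("Region B: median", median_b, "IQR", iqr_b)
```

In this made-up data, region A has the higher median while region B has the higher third-quartile cutoff, which is the same pattern the Northeast and Midwest figures show.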

How Salaries Change

We also collected data on salary change over the last three years. About half of the sample reported a 20% change, and the salary of 12% of the sample doubled. We attempted to model salary change with other variables from the survey, but the model performed much more poorly, with an R² of just 0.221. Many of the same significant features in the salary regression model also appeared as factors in predicted salary change: Spark/Unix, high meeting hours, high coding hours, and building prototype models all predict higher salary growth, while using Excel, gender disparity, and working at an older company predict lower salary growth. Geography also correlated positively with salary change, meaning that

Assessing Your Salary

To use the model for your own salary, refer to the full model in Appendix B: The Regression Model, and add up the coefficients that apply to you. Once all of the constants are added, square the result for a final salary estimate (note: the coefficients are not in dollars). The contribution of a particular coefficient to the eventual salary estimate depends on the other coefficients: the higher the salary, the higher the contribution of each coefficient.

For example, the salary difference between a junior data scientist and a senior architect will be greater in a country with high salaries than somewhere with lower salaries.
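A minimal sketch of the add-then-square step may make this concrete. The function name and every number below are hypothetical placeholders chosen for easy arithmetic; they are not coefficients from Appendix B.

```python
def estimate_salary(coefficients):
    """Add up the applicable coefficients, then square the total."""
    return sum(coefficients) ** 2


# Hypothetical values, not taken from the report's model.
HIGH_BASE = 300.0  # intercept plus geography terms for a high-salary country
LOW_BASE = 200.0   # the same terms for a lower-salary country
SENIORITY = 20.0   # extra coefficient separating a senior role from a junior one

# The same seniority coefficient is worth more where the base is higher.
gap_high = estimate_salary([HIGH_BASE, SENIORITY]) - estimate_salary([HIGH_BASE])
gap_low = estimate_salary([LOW_BASE, SENIORITY]) - estimate_salary([LOW_BASE])

print(gap_high, gap_low)
```

Because the sum is squared, an identical coefficient adds more to a large total than to a small one, which is why the junior-to-senior gap widens in higher-salary countries.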

Factors that Influence Salary: The Regression Model

We have included our full regression model in Appendix B: The Regression Model. For this year’s report, we have made two important changes to the basic, parsimonious linear model we presented in the 2015 report. We have included: 1) external geographic data (GDP by US state and country), and 2) a square root transformation. The transformation adds one step to the linear model: we add up model coefficients, and then square the result. Both of these changes significantly improve the accuracy in salary estimates.

Our model explains about three-quarters of the variance in the sample salaries (with an R² of 0.747). Roughly half of the salary variance is due to geography and experience. Given the important factors that cannot be captured in the survey (for example, we don’t measure competence or evaluate the quality of respondents’ work output), it’s not surprising that a large amount of variance is left unexplained.

Impact of Geography

Geography has a huge impact on salary, but is not adequately captured due to sample size. For example, if a country is represented by only one or two respondents, this isn’t enough to justify giving the country its own coefficient. For this reason, we use broad regional coefficients (e.g., “Asia” or “Eastern Europe”), keeping in mind however that economic differences within a region are huge, and thus the accuracy of the model suffers.

To get around this problem, we’ve used publicly available records of per capita GDP of countries and US states. While GDP itself doesn’t translate to salary, it can serve a proxy function for geographic salary variation. Note that we use per capita GDP on the state and country level; therefore the model is likely to produce an inaccurate estimate with GDP figures for smaller geographic units.

Two exceptions were made to the GDP data before incorporating it into the model. The per capita GDP of Washington DC is $181K—much greater than in neighboring Virginia ($57K) and Maryland ($60K). Many (if not most) data science jobs in Maryland and Virginia are actually in the greater DC metropolitan area, and the survey data suggest that average data science salaries in these three places are not radically different from each other. Using the true $181K figure would produce gross overestimates for DC salaries, and so the per capita GDP figure for DC was replaced with that of Maryland, $60K.

The other exception is California. In all of the salary surveys we have conducted, California has had the highest median salary of any state or country, even though its per capita GDP ($62K) is not ranked so high (nine states have higher per capita GDPs, as do two countries that were represented in the sample, Switzerland and Norway). The anomaly is likely due to the San Francisco Bay Area, where, depending on how the region is defined, per capita GDP is $80K–$90K. As a major tech center, the Bay Area is likely overrepresented in the sample, meaning that the geographic factor attributable to California should be pushed upward; an appropriate compromise was $70K.
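The two exceptions amount to a small override table applied on top of the published figures. The sketch below assumes a simple dictionary lookup; the names `REPORTED_GDP_K`, `OVERRIDES_K`, and `model_gdp` are hypothetical, and only the dollar figures come from the text above.

```python
# Per capita GDP in thousands of dollars, as quoted in the text.
REPORTED_GDP_K = {
    "Washington DC": 181,
    "Virginia": 57,
    "Maryland": 60,
    "California": 62,
}

# Values substituted before modeling: DC takes Maryland's figure,
# and California is pushed up to the compromise value.
OVERRIDES_K = {
    "Washington DC": 60,
    "California": 70,
}


def model_gdp(region):
    """Return the GDP figure fed to the model, preferring an override if one exists."""
    return OVERRIDES_K.get(region, REPORTED_GDP_K[region])


for region in REPORTED_GDP_K:
    print(region, REPORTED_GDP_K[region], "->", model_gdp(region))
```

Keeping the overrides separate from the reported data leaves the original figures intact and makes each manual adjustment easy to audit.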

Considering Gender

There is a difference of $10K between the median salaries of men and women. Keeping all other variables constant—same roles, same skills—women make less than men.

Age, Experience, and Industry

Experience and age are two important variables that influence salary. The coefficient for experience (+3.8) translates to an increase of $2K–$2.5K on average, per year of experience. As for age, the biggest jump is between people in their early and late 20s, but the difference between those aged 31–65 and those over 65 is also significant.
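The dollar range quoted for the experience coefficient can be checked by hand, assuming the squared model output is in dollars (an assumption, though one consistent with the figures in this report). Here S is the salary estimate before the extra year of experience, evaluated at the sample median ($87K) and the US median ($106K):

```latex
\begin{align*}
\Delta(S) &= \left(\sqrt{S} + 3.8\right)^2 - S = 7.6\sqrt{S} + 14.44 \\
\Delta(87{,}000) &\approx 7.6 \times 294.96 + 14.44 \approx 2{,}256 \\
\Delta(106{,}000) &\approx 7.6 \times 325.58 + 14.44 \approx 2{,}489
\end{align*}
```

Both values fall inside the $2K to $2.5K range, and the increment grows with S, as expected from the square root transformation.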

We also asked respondents to rate their bargaining skills on a scale of 1 to 5, and those who gave higher self-evaluations tended to have higher salaries. The difference in salary between two data scientists, one with a bargaining skill “1” and the other with “5”, with otherwise identical demographics and skills, is expected to be $10K–$15K.

Finally, in terms of work-life balance, our results show that once you are working beyond 60 hours, salary estimates actually go down.

How You Spend Your Time

Importance of Tasks

The type of work respondents do was captured through four different types of questions:

involvement in specific tasks
job title
time spent in meetings
time spent coding

For every task, respondents chose from three options: no engagement, minor engagement, or major engagement.

The task with the greatest impact on salary (i.e., the greatest coefficient) was developing prototype models. Respondents who indicated major engagement with this task received on average a $7.4K boost, based on our model. Even minor engagement in developing prototype models had a +4.4 coefficient.

Relevance of Job Titles

When both tasks and job titles are included in the training set, job title “wins” as a better predictor of salary. It’s notable, however, that titles themselves are not necessarily accurate at describing what people do. For example, even among architects there was only a 70% rate of major engagement in planning large software projects, a task that theoretically defines the role. Since job title does perform well as a salary predictor, despite this inconsistency, it may be that “architect,” for example, is a symbol of seniority as much as anything else.

Respondents with “upper management” titles—mostly C-level executives at smaller companies, directors and VPs—had a huge coefficient of +20.2. Engagement in tasks associated with managerial roles also had a positive impact on salary, namely: organizing team projects (+9.7), identifying business problems to be solved with analytics (+1.5/+6.7), and communicating with people outside the company (+5.4).

Time Spent in Meetings

People who spend more time in meetings tend to make more. This is the variable we often use as a reminder that the model does not guarantee that the relationships between significant variables and salary are causative: if someone starts scheduling many meetings (and doesn’t change anything else in their workday) it is unlikely that this will lead to anything positive, much less a raise.1

Role of Coding

The highest median salaries belong to those who code 4–8 hours per week; the lowest to those who don’t code at all. Notably, only 8% of the sample reported that they don’t code at all, significantly down from last year’s 20%. Coding is clearly an integral part of being a data scientist.

1. Of course, we haven’t actually tested this. If you try it out, let us know how it goes.

The Impact of Tool Choice

The Top Tools

The top two tools in the sample were Excel and SQL, each used by 69% of the sample, followed by R (57%) and Python (54%). Compared to last year, Excel is up (from 59%), as is R (from 52%), while SQL and Python are only slightly higher than last year.

Over 90% of the sample reported spending at least some time coding, and 80% used at least one of Python, R, and Java, although only 8% used all three. The most commonly used tools (except for operating systems) were included in the model training data as individual coefficients; of these, Python, JavaScript, and Excel had significant coefficients: +4.6, –2.2 and –7.4, respectively. Less commonly used tools were first grouped together into clusters and aggregate features were included that represent counts of tools used from each cluster. For five clusters that were found to have a significant correlation with salary, coefficients are added on a per-tool basis.2

The cluster with the largest coefficient was centered on Spark and Unix, contributing +3.9 per tool. Spark usage was 20%, up a modest 3% from last year, and it continues to be used by the better-paid individuals in the sample.

In contrast to the largely open source Spark/Unix cluster, the second highest cluster coefficient (+2.4) was assigned to a cluster dominated by proprietary software: Tableau, Teradata, Netezza, Microstrategy, Aster Data, and Jaspersoft. In last year’s report, Teradata also featured as a tool with a large, positive coefficient. The other three clusters with significant coefficients mostly consisted of open source data tools.
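The per-tool cluster coefficients and their caps can be illustrated with a short sketch. The coefficients, tool lists, and caps below are copied from Appendix B; the dictionary keys, function name, and sample tool stack are hypothetical and chosen only for illustration, so this is one reading of the rule rather than the report's actual code.

```python
# Two of the five tool clusters from Appendix B: (coefficient per tool, cap, member tools).
# The keys "spark_unix" and "proprietary" are illustrative labels, not names from the report.
CLUSTERS = {
    "spark_unix": (3.9, 5, {"Spark", "Unix", "Spark MlLib", "ElasticSearch",
                            "Scala", "H2O", "EMC/Greenplum", "Mahout"}),
    "proprietary": (2.4, 3, {"Tableau", "Teradata", "Netezza", "Microstrategy",
                             "Aster Data", "Jaspersoft"}),
}


def cluster_contribution(tools_used, cluster_name):
    """Coefficient times the number of cluster tools used, counted only up to the cap."""
    coefficient, cap, members = CLUSTERS[cluster_name]
    count = len(set(tools_used) & members)
    return coefficient * min(count, cap)


# A hypothetical respondent with three Spark/Unix tools and four proprietary tools.
stack = ["Spark", "Unix", "Scala", "Tableau", "Teradata", "Netezza", "Jaspersoft"]
for name in CLUSTERS:
    print(name, round(cluster_contribution(stack, name), 1))
```

The fourth proprietary tool adds nothing here because that cluster is capped at three tools, which matches the rule described in footnote 2.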

Which Tools to Add to Your Stack

While the model we’ve explained is a good way to get an estimate for how much someone earns given a certain tool stack, it doesn’t necessarily work as a good guide for which tool to learn next. The real question is whether a tool is useful for getting done what you need to get done. If you never have to analyze more data than can fit into memory on your local machine, you might not get any benefit—much less a salary boost—by using a tool that leverages distributed systems, for example.

Salary and Sequences of Tools

In the following sequences of tools, the next tool in the sequence was frequently used by respondents who used all earlier tools, and these sequences had the best salary differentials at each step.

If you know the first tool in a sequence, you might consider learning the second, and so on.

2 Tools are added up to a maximum number. This is because few respondents had more than that number of tools from the cluster, and so if someone uses more, there is no evidence to support continued addition of coefficients.

The Relationship Between Tools and Tasks: Clustering Respondents

Data professionals are not a homogeneous group; there are various types of roles in the space. While it is easier, and more common, to classify roles based on titles, clustering based on tools and tasks is a more rigorous way to define the key divisions between respondents of the survey. Every respondent is assigned to one of four clusters based on their tools and tasks.3

The four clusters were not evenly populated: their shares of the survey sample were 29%, 31%, 23%, and 17%, respectively. They can be described as shown on the right.

A selection of tool and task percentages are described in the sections that follow, and the full profiles of tool/task percentages are found in Appendix A: Full Cluster Profiles.

Operating Systems

In our three previous Data Science Salary Survey reports, the clearest division in tool clusters separated one group of open source, usually GUI-less tools from another consisting of proprietary software, largely developed by Microsoft. Common tools in the open source group have been Linux, Python, Spark, Hadoop, and Java, and common tools in the Microsoft/closed source group include Windows, Excel, Visual Basic, and MS SQL Server. This same division appears when we cluster respondents, and is clearest when we look at the usage of operating systems:

A set of tasks also emphasizes the division between the first two and last two clusters. The following percentages represent respondents who indicated major engagement in these tasks:

For all of the above tasks, the top two percentages were held by clusters 3 or 4 and were both much higher than either percentage for clusters 1 and 2.

Python, Matplotlib, Scikit-Learn

Another set of tools that exposed the primary split between clusters 1/2 and 3/4 are Python and two of its popular packages, Matplotlib (for visualization) and Scikit-Learn (for machine learning):

Survey respondents assigned to clusters 3 and 4 tend to use Python much more than those assigned to 1 and 2, and the relative difference (as a ratio) grows when we look at the two packages: cluster 3 and 4 respondents are 8–10 times as likely to use them as cluster 1 and 2 respondents. Between clusters 3 and 4 there is a difference as well, albeit a smaller one: cluster 3 has a higher Python usage rate, while a larger share of cluster 4 respondents don’t use Python or these packages. It turns out that these are the only tools whose highest usage rate is among cluster 3 respondents.4 Most other tools that are used much more frequently by clusters 3 and 4 than by 1 and 2 are also used more frequently by cluster 4 than by cluster 3.

Cluster 4 rates for two tasks also stand out:

Cluster 4, it seems, is much more of an “open source data engineer” descriptor than cluster 3, which heads in that direction but not nearly to the same extent. It’s not rare for cluster 3 respondents to have used these tools (86% of them used at least one), but on average they only used about 2.2. In comparison, respondents in cluster 4 used an average of 5.3 tools. The fact that ETL and data management are much more important in cluster 4 than in cluster 3 implies that while both might represent data science, cluster 3 tends toward the analyst’s side of the field, and cluster 4 tends toward the engineering or architecture side.

As for the other two clusters, differences between clusters 1 and 2 become apparent once we look at the rest of the aforementioned proprietary tool set. Cluster 2 respondents tended to use these much more frequently.

For most of the tools shown below, cluster 1 has the second highest usage rate, but its rates lag significantly behind those of cluster 2. Cluster 1 respondents tended to use fewer tools in general: just under 8 on average, compared to 10, 13, and 21 for the three other clusters, respectively.

Tasks Without Coding

There are also some tasks that are undertaken by cluster 2 respondents significantly more frequently than those in other clusters:

The first two tasks are functions of an analyst, and are fairly common among cluster 3 and 4 respondents as well. Crucially, none of these tasks depend on being able to code (at least, not as much as the four tasks above that are closely associated with clusters 3 and 4). The low percentages for cluster 1 shed some light on the nature of this cluster: most respondents in the sample whose primary function is not as a data scientist, analyst, or manager seem to be grouped there. This includes programmers who aren’t deep in the space (e.g., Java programmers who only use a few data tools). There are analysts and data scientists in cluster 1, but they tend to have small tool sets, and the composite feature of non-participation in many data tasks and non-use of data tools is what binds cluster 1 together.

Some of the proprietary tools listed above are used by respondents in cluster 4 about as much as those in cluster 1, most notably SQL Server. In other words, they begin to violate the primary cluster 1/2 vs. 3/4 split. A few other tools and tasks take this pattern even further, or simply don’t show large usage differences between clusters:

Tableau, Oracle, Teradata, and Oracle BI usage is higher in clusters 2 and 4, lower in clusters 1 and 3. The same is true for SQL, but like Excel and R, it’s exceptional in its wide usage across all four clusters. In fact, SQL and Excel are the only two tools (or tasks) that are used by over half of the respondents in each cluster. R is not used as much by cluster 1, but usage among the other three clusters is about the same: 67%–69%. Data cleaning and basic exploratory analysis are similarly high for clusters 2, 3, and 4, and much lower for cluster 1. These tasks and tools cut across the cluster boundaries, and don’t seem to have much correlation with the more salient tool/task differences.

Managerial and Business Strategy Tasks

Perhaps even more illustrative of the connection between clusters 2 and 4 are the managerial/business strategy tasks. The implication is that respondents in 2/4 tend to be more senior, which turns out to be true, but only to an extent. In terms of years of experience, clusters 1, 2, and 4 are about the same (8–9 years on average), while for cluster 3 the average is much smaller: only 4.4 years. A similar difference exists for age.

Despite representing the least experienced cohort, cluster 3 isn’t the lowest paid; that distinction goes to cluster 1, with a median salary of $72K. At $84K, cluster 3 is still lower than cluster 2 ($88K), but cluster 4 salaries tended to be far higher than either, with a median of $112K. Cluster 4 respondents tend to use a far greater number of tools than respondents in the other clusters, and many of the tools they commonly use are ones that had positive coefficients in the regression model.

3 We tried a variety of clustering algorithms with various numbers of clusters, and the two best performing models came from KMeans, with two and four clusters. The partition in the 2-cluster model is more or less preserved in the 4-cluster model, so we will use the latter, keeping in mind that there is a primary split between the first two and last two clusters.

4 Excluding tools that didn’t have a significant difference between the top two percentages: Mac OS X, ggplot, Vertica, and Stata.
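Footnote 3 names KMeans as the method behind the respondent clusters. As a rough illustration of the idea, the sketch below runs a plain k-means loop (Lloyd's algorithm) over binary tool-usage vectors. It is not the report's pipeline: the six toy respondents, the column order, and the deterministic choice of starting centroids are all assumptions made to keep the example small and repeatable.

```python
# Hypothetical tool-usage vectors; columns are assumed to be
# (Linux, Python, Spark, Windows, Excel, MS SQL Server), 1 = uses the tool.
respondents = [
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (1, 0, 1, 0, 0, 0),
    (0, 0, 0, 0, 1, 1),
]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    # Simplification: start from the first k distinct points instead of random seeds.
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(respondents, k=2)
print(labels)
```

With only two clusters, the toy data separates along the same open source versus Microsoft tool split that the report describes as the primary division.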

Wrapping Up: What to Consider Next

The regression model we use to predict salary describes relationships between variables, but not where the relationships come from, or whether they are directly causative. For example, someone might work for a company with a colossal budget that can afford high salaries and expensive tools, but this doesn’t mean that their high salary is driven up by their tool choice.

Of course, it’s not so simple with salary. When tools become industry standards, employers begin to expect them, and it can hurt your chances of landing a good job if you are missing key tools: it’s in your interest to keep up with new technology. If you apply for a job at a company that is clearly interested in hiring someone who knows a certain tool, and this tool is used by people who earn high salaries, then you have leverage knowing that it will be hard for them to find an alternative hire without paying a premium.

This information isn’t just for the employees, either. Business leaders choosing technologies need to consider not just the software costs, but labor expenses as well. We hope that the information in this report will aid the task of building estimates for such decisions.

If you made use of this report, please consider taking the 2017 survey. Every year we work to build on the last year’s report, and much of the improvement comes from increased sample sizes. This is a joint research effort, and the more interaction we have with you, the deeper we will be able to explore the data science space. Thank you!

Appendix A: Full Cluster Profiles

Appendix B: The Regression Model

+60.0 Constant: everyone starts with this number

+2.6 Multiply by per capita GDP, in thousands (e.g., for Iowa, 2.6 * 52.8 = 137.28)

-7.8 gender = Female

+3.8 Per year of experience

+7.4 Per bargaining skill “point”

+17.2 Age: 26 to 30

+22.5 Age: 31 to 35

+24.8 Age: 36 to 65

+38.5 Age: over 65

+3.9 Academic speciality is/was mathematics, statistics or physics

+12.2 PhD

-9.7 Currently a student (full- or part-time, any level)

+2.2 industry = Software (incl. SaaS, Web, Mobile)

+3.0 industry = Banking/Finance

-2.0 industry = Advertising/Marketing/PR

-24.5 industry = Education

-3.9 industry = Computers/Hardware

+7.1 industry = Search/Social Networking

+3.6 Company size: 501 to 10,000

+7.7 Company size: 10,000 or more

-4.3 Company age: over 10 years old

-8.2 Coding: 1 to 3 hours/week

-3.0 Coding: 4 to 20 hours/week

-0.5 Coding: Over 20 hours/week

+1.0 Meetings: 1 to 3 hours/week

+9.2 Meetings: 4 to 8 hours/week

+20.6 Meetings: 9 to 20 hours/week

+21.1 Meetings: Over 20 hours/week

+1.0 Workweek: 46 to 60 hours

-2.4 Workweek: Over 60 hours

+20.2 Job title: Upper Management

-0.9 Job title: Engineer/Developer/Programmer

+3.1 Job title: Manager

-1.0 Job title: Researcher

+14.3 Job title: Architect

+4.6 Job title: Senior Engineer/Developer

+4.5 ETL (minor involvement)

-1.9 ETL (major involvement)

-4.9 Setting up/maintaining data platforms (minor involvement)

+4.4 Developing prototype models (minor involvement)

+12.1 Developing prototype models (major involvement)

-1.3 Developing hardware, or working on projects that require expert knowledge of hardware (major)

+9.7 Organizing and guiding team projects (major)

+1.5 Identifying business problems to be solved with analytics (minor)

+6.7 Identifying business problems to be solved with analytics (major)

+5.4 Communicating with people outside your company (major)

+3.2 Most or all of work done using cloud computing

+4.6 Python

-2.2 JavaScript

-7.4 Excel

+1.7 for each of MySQL, PostgreSQL, SQLite, Redshift, Vertica, Redis, Ruby (up to 4 tools)

+3.9 for each of Spark, Unix, Spark MlLib, ElasticSearch, Scala, H2O, EMC/Greenplum, Mahout (up to 5 tools)

+1.5 for each of Hive, Apache Hadoop, Cloudera, Hortonworks, Hbase, Pig, Impala (up to 5 tools)

+2.4 for each of Tableau, Teradata, Netezza (IBM), Microstrategy, Aster Data (Teradata), Jaspersoft (up to 3 tools)

+1.3 for each of MongoDB, Kafka, Cassandra, Zookeeper, Storm, JavaScript InfoVis Toolkit, Go, Couchbase (up to 4 tools)
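To show how the list above combines into a single estimate, here is a minimal sketch that adds up a handful of the coefficients. The values are transcribed from the list; the function name, parameter names, and sample respondent are hypothetical, and the result is in whatever units the report's model uses.

```python
# A subset of the Appendix B coefficients, transcribed as listed.
CONSTANT = 60.0
PER_GDP_THOUSAND = 2.6        # multiplied by per capita GDP in thousands
PER_YEAR_EXPERIENCE = 3.8
AGE_BANDS = {"26 to 30": 17.2, "31 to 35": 22.5, "36 to 65": 24.8, "over 65": 38.5}
FLAGS = {"PhD": 12.2, "student": -9.7, "Python": 4.6, "JavaScript": -2.2, "Excel": -7.4}


def estimate(gdp_per_capita_thousands, years_experience, age_band=None, flags=()):
    """Sum the constant, the scaled terms, and every coefficient that applies."""
    total = CONSTANT
    total += PER_GDP_THOUSAND * gdp_per_capita_thousands
    total += PER_YEAR_EXPERIENCE * years_experience
    if age_band is not None:      # ages outside the listed bands add nothing
        total += AGE_BANDS[age_band]
    total += sum(FLAGS[name] for name in flags)
    return total


# Hypothetical respondent: Iowa (GDP figure from the list's own example),
# five years of experience, aged 26 to 30, with a PhD, using Python.
print(round(estimate(52.8, 5, "26 to 30", ["PhD", "Python"]), 2))
```

A fuller version would add the remaining industry, title, task, and tool-cluster terms in the same additive way.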

Article image: The Seven Virtues, by Brueghel, published by Philippe Galle (source: Wikimedia Commons).

December 23, 2016
