Shout Future

Educational blog about Data Science, Business Analytics and Artificial Intelligence.


Microsoft is undertaking several projects dedicated to sustainability

Microsoft has been a major contributor to Tech for Good and has taken significant steps towards environmental conservation. The company’s going-green mantra is underscored by the $1.1 million its employees raised in 2016 and the 5,949 volunteering hours they put in.
But it doesn’t stop there. Microsoft’s ecosystem allows the firm, its employees, and its business partners to leverage new technologies to improve the sustainability of their companies and communities. The Redmond giant recently partnered with The Nature Conservancy, a nonprofit, to extend support for nonprofits globally.

greening the planet

Microsoft’s commitment to nature is deeply rooted in the technologies it builds. Microsoft announced a $1 billion commitment to bring cloud computing resources to nonprofit organizations around the world, and as part of that commitment the firm donates nearly $2 million every day in products and services to nonprofits.
Microsoft has extended its support to organizations like the World Wildlife Fund, Rocky Mountain Institute, the Carbon Disclosure Project, the Wildlife Conservation Society, and the U.N. Framework Convention on Climate Change’s (UNFCCC) Climate Neutral Now initiative.

Here is a slew of use cases

How is Prashant Gupta’s initiative helping farmers in Andhra Pradesh increase revenue? Gupta, a Cloud + Enterprise Principal Director at Microsoft, is driving significant work for the environment. Earlier, he facilitated a partnership between Microsoft, a United Nations agency, the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), and the Andhra Pradesh government to help groundnut farmers cope with drought.
Gupta and his team leveraged advanced analytics and machine learning to launch a pilot program comprising a Personalized Village Advisory Dashboard for 4,000 farmers in 106 villages in Andhra Pradesh, along with a Sowing App for 175 farmers in one district.
Based on weather conditions, soil data, and other indicators, the Sowing App advises farmers on the best time to sow. The Personalized Village Advisory Dashboard provides insights about soil health, fertilizer recommendations, and seven-day weather forecasts.

Nature Conservancy’s Coastal Resilience program

Microsoft’s Azure cloud platform and The Nature Conservancy’s Coastal Resilience program: Coastal Resilience is a public-private partnership led by The Nature Conservancy to help coastal communities address the devastating effects of climate change and natural disasters. The program has trained and supported over 100 communities globally in the uses and applications of the Natural Solutions Toolkit.
The toolkit contains a suite of geospatial tools and web apps for climate adaptation and resilience planning across land and sea environments. It has helped communities strategize for risk reduction, restoration, and resilience to safeguard local habitats, communities, and economies.
Puget Sound: Puget Sound’s lowland river valleys are a treasure house, delivering a wealth of natural, agricultural, industrial, recreational, and health benefits to the four million people who live in the region. However, these communities face increasing risk of flooding from rising sea levels, more extreme coastal storms, and more frequent river flooding.

High winds hit Puget Sound

The Conservancy’s Washington chapter is building a mapping tool as part of the Coastal Resilience toolkit to reduce the flow of polluted stormwater into Puget Sound. Emily Howe, an aquatic ecologist, is in charge of the project, which revolves around developing the new Stormwater Infrastructure mapping tool. This tool will eventually be integrated into the Puget Sound Coastal Resilience toolset, which will be hosted on Azure.
Furthermore, it will include a high-level heat map of stormwater pollution for the region, combining an overlay of pollution data with human and ecological data to prioritize areas of concern.
Data helps in watershed management: Today, around 1.7 billion people living in the world’s largest cities depend on water flowing from watersheds, and estimates suggest that by 2050 up to two-thirds of the global population will be tapping those watershed sources.
Kari Vigerstol, The Nature Conservancy’s Global Water Funds Director of Conservation, oversaw development of a tool to provide cities with better data, as part of a project to help cities protect their local water sources. The resulting “Beyond the Source” analysis covered 4,000 cities and found that natural solutions can improve water quality for four out of five of them.
Furthermore, the Natural Solutions Toolkit is being leveraged globally to better understand and protect water resources around the world. Through the water security toolkit, cities will be furnished with a more powerful set of tools. Users can also explore data and access proven solutions and funding models using the beta version of the Protecting Water Atlas, a tool that will help improve water quality and supply for the future.

Microsoft is illuminating these places with its innovative array of big data and analytics offerings


Emily Howe

  1. In Finland, Microsoft partnered with CGI to develop a smarter transit system for the city of Helsinki. This data-driven initiative saw Microsoft utilize the city’s existing warehouse systems to create a cloud-based solution that could collate and analyse travel data. Helsinki’s bus team saw a significant reduction in fuel costs and consumption, besides realizing increased travel safety and improved driver performance.
  2. Microsoft Research Lab Asia designed a mapping tool called Urban Air for the market in China. The tool allows users to see, and even predict, air quality levels across 72 cities in China. It furnishes real-time, detailed air quality information using big data and machine learning, and includes a mobile app that is used about three million times per day.
  3. Microsoft is implementing environmental strategies worldwide. The firm is assisting the city of Chicago in designing new ways to gather data, and is helping the city utilize predictive analytics to better address water, infrastructure, energy, and transportation challenges.
  4. Boston is another great example: Microsoft is working to spread information about the city’s variety of urban farming programs, and is counting on the potential of AI and other technology to increase their impact.
  5. Microsoft has also partnered with Athena Intelligence to support the city of San Francisco. As part of this partnership, Microsoft is leveraging Athena’s data processing and visualization platform to gather valuable data about land, food, water, and energy, which will help improve local decision-making.

Outlook


Satya Nadella, CEO of Microsoft

Data is not all that matters; in the end, it’s about how cities can be empowered to take action based on that data. Microsoft has comprehensively supported the expansion of The Nature Conservancy’s innovative Natural Solutions Toolkit. The solution suite is already powering on-the-ground and in-the-water projects around the world, benefiting coastal communities, residents of the Puget Sound, and others globally.
Microsoft is doing an excellent job of delivering on its promise to empower people and organizations globally to thrive in a resource-constrained world. The organization is empowering researchers, scientists and policy specialists at nonprofits by providing them with technology that addresses sustainability.
May 11, 2017 4 comments
Using big data and analytics to curb insurance fraud

While there is no doubt that the insurance segment is witnessing unprecedented annual growth, insurers continue to struggle with loss-leading portfolios and low insurance penetration among consumers. Insurers face increasing pressure to strike the right balance while ensuring adherence to underwriting and claims decisions in the face of regulatory pressures, growth of digital channels and increasing competition. Adding to this is the need to secure the good risks while weeding out the bad.
Insurers are turning their attention towards big data and analytics solutions to help check fraud, recognize misrepresentation and prevent identity theft. With the government’s recent push to adopt digitization, the Aadhaar card plays a crucial role, linking income tax permanent account numbers (PANs), banks, credit bureaus, telecoms and utilities and providing a unified and centralized data registry that profiles an individual’s economic behaviour. The e-commerce boom provides additional data on financial behaviour. 

 Fraudulent practices 

Claims fraud is a threat to the viability of the health insurance business. Although health insurers regularly crack down on unscrupulous healthcare providers, fraudsters continually exploit any new loopholes with forged documents purporting to be from leading hospitals. 
Medical ID theft is one of the most common techniques adopted by fraudsters: by stealing a policyholder’s identity, they have claim funds paid into their own bank accounts. The insurer’s procedures allow the policyholder to send a scanned image of his/her cheque, with bank account details for ID purposes, and it is this image that fraudsters manipulate.
Besides forged documents, other common sources of fraud are the healthcare providers themselves, with cases of ‘upgrading’ (billing for more expensive treatments than those provided), ‘phantom billing’ and ‘ganging’ (billing for services to family members or other individuals accompanying the patient that were never actually delivered).
Health insurers have to take action before an insurance claim is paid, putting an end to the ‘pay-and-chase’ approach. Using data to validate a claim before payment is far more effective than having to ‘chase’ the money afterwards. This approach, however, rests on real-time access to information sources.

 Life insurance’s woes 

India’s life insurers suffer from low persistency rates: more than one in three policies lapse by the end of the second year. This may be attributed to mis-selling, misrepresentation of material facts, premeditated fabrication and, in other cases, suppression of facts.
Life insurers have been facing fraud that is largely data driven and can be curbed with effective use of data analytics. While seeking customer information, insurers should perform checks against public record databases to ensure they have insights into the validity of personal information. This can be achieved through data mining and validation from various sources. For instance, in the US, frauds are committed through stolen social security numbers or driver’s license numbers, or those of deceased individuals. Data accessed from various sources will help identify if the person in question is using multiple identities or multiple people are using the identity presented. 
 The use of public, private and proprietary databases to obtain information not typically found in an individual’s wallet to create knowledge-based authentication questions which are designed to be answered only by the correct individual can also help reduce fraud significantly. 
 Continuous evaluation of existing customers is also critical for early fraud detection. For example, one red flag for potential fraud can involve beneficiary or address changes for new customers. Insurers should verify address changes, as many consumers do not know their identity has been stolen until after it has happened. By applying relationship analytics, insurers can obtain insights into the relationship between the insured, the owner, and the beneficiary, to help determine whether those individuals are linked to other suspicious entities or are displaying suspicious behaviour patterns. 

 Solutions for all 

Like in most developed insurance markets, it is imperative that data on policies, claims and customers be made available on a shared platform, in real-time. Such a platform can allow for real-time enquiries on customers. It can also facilitate screening of the originator of every proposal. Insurers would contribute policy, claims and distributors’ information to the repository on a regular basis. Such data repositories can provide insights to help insurers detect patterns, identify nexus and track mis-selling. 
 Insurance data is dynamic and hence data analytics cannot depend only on past behaviour patterns. So data has to be updated regularly. Predictive analysis can play a significant role in identifying distributor nexus, mis-selling and repeated misrepresentations. Relationship analytics could be used to identify linked sellers and suspected churn among them. 
 These data platform-based solutions are not just about preventing reputational risk and loss of business, but with controlled and more informed risk selection, there could be a positive impact on pricing of products. The whole process of underwriting new business with greater granularity of risk and greater transparency can bring in new customers, but it could also out-price some others. There can be increased scrutiny of agents, brokers and distributors to eliminate any suspects from the system. 
 Successful fraud prevention strategies include shifting towards a proactive approach that detects fraud prior to policy issuance, and leveraging red flags or business rules, real-time identity checks, relationship analytics, and predictive models. Insurers who leverage both internal data and external data analytics will better understand fraud risks throughout their customer life cycles, and will be more prepared to detect and mitigate those risks.
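As a rough illustration of how these pieces fit together, the sketch below combines a simple pre-issuance red-flag rule with a toy relationship-analytics check in Python. The field names, thresholds and sample records are hypothetical and not drawn from any insurer’s actual system; they only show the shape of the approach described above.

```python
# Minimal sketch: rule-based red flags plus simple relationship analytics.
# Field names, thresholds and the toy data are hypothetical, for illustration only.
import networkx as nx

POLICIES = [
    {"id": "P1", "insured": "A. Rao",  "beneficiary": "K. Mehta", "address_changed_days": 12, "tenure_days": 40},
    {"id": "P2", "insured": "S. Iyer", "beneficiary": "K. Mehta", "address_changed_days": None, "tenure_days": 900},
]
KNOWN_SUSPICIOUS = {"K. Mehta"}  # e.g. entities flagged in earlier investigations

def red_flags(policy):
    """Return simple early-tenure red flags for a policy."""
    flags = []
    # Address (or beneficiary) change soon after a new policy is issued.
    if policy["tenure_days"] < 90 and policy["address_changed_days"] is not None:
        flags.append("address change on a new policy")
    return flags

def linked_to_suspicious(policies, suspicious):
    """Build an insured-beneficiary graph and flag policies connected to known suspects."""
    graph = nx.Graph()
    for p in policies:
        graph.add_edge(p["insured"], p["beneficiary"], policy=p["id"])
    flagged = []
    for p in policies:
        component = nx.node_connected_component(graph, p["insured"])
        if component & suspicious:
            flagged.append(p["id"])
    return flagged

for p in POLICIES:
    print(p["id"], red_flags(p))
print("linked to suspicious entities:", linked_to_suspicious(POLICIES, KNOWN_SUSPICIOUS))
```

In practice such rules would run against shared industry repositories and far richer graphs, but the structure, business rules plus link analysis over insured, owner and beneficiary relationships, is the same idea outlined above.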
May 09, 2017 2 comments
Table of Contents:

Introduction
Nature of Data
  1. Time series data
  2. Spatial data
  3. Spatio-temporal data
Categories of Data
  1. Primary data
     1. Direct personal interviews
     2. Indirect oral interviews
     3. Information from correspondents
     4. Mailed questionnaire method
     5. Schedules sent through enumerators
  2. Secondary data
     1. Published sources
     2. Unpublished sources


Data gathering techniques 

Introduction:
Everybody collects, interprets and uses information, much of it in numerical or statistical form, in day-to-day life. People receive large quantities of information every day through conversations, television, computers, radio, newspapers, posters, notices and instructions. Precisely because so much information is available, people need to be able to absorb, select and reject it.

In everyday life, in business and in industry, certain statistical information is necessary, and it is important to know where to find it and how to collect it. As consumers, people have to compare prices and quality before deciding what goods to buy. As employees of a firm, people want to compare their salaries, working conditions, promotion opportunities and so on. The firms, for their part, want to control costs and expand their profits.

One of the main functions of statistics is to provide information that helps in making decisions. Statistics provides this type of information by giving a description of the present, a profile of the past and an estimate of the future.

The following are some of the objectives of collecting statistical information.
1. To describe the methods of collecting primary statistical information.
2. To consider the stages involved in carrying out a survey.
3. To analyse the process involved in observation and interpretation.
4. To define and describe sampling.
5. To analyse the basis of sampling.
6. To describe a variety of sampling methods.

Statistical investigation is a comprehensive process: it requires the systematic collection of data about some group of people or objects, describing and organizing the data, analyzing the data with the help of different statistical methods, summarizing the analysis and using these results for making judgements, decisions and predictions.
The validity and accuracy of the final judgement is most crucial and depends heavily on how well the data was collected in the first place. The quality of the data will greatly affect the conclusions, hence utmost importance must be given to this process and every possible precaution should be taken to ensure accuracy while collecting the data.

Nature of data:
It may be noted that different types of data can be collected for different purposes. Data can be collected in connection with time, with geographical location, or with both time and location.
The following are the three types of data:
1. Time series data
2. Spatial data
3. Spatio-temporal data

Time series data:
It is a collection of a set of numerical values collected over a period of time. The data might have been collected either at regular or at irregular intervals of time.
Spatial data:
If the data collected is connected with a place, then it is termed spatial data. For example, the data may be
1. Number of runs scored by a batsman in different test matches in a test series at different places.
2. District-wise rainfall in a state.
3. Prices of silver in four metropolitan cities.
Spatio-temporal data:
If the data collected is connected with time as well as place, then it is known as spatio-temporal data.
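To make the distinction concrete, here is a small illustrative sketch in Python (pandas) showing one possible representation of each of the three types. The figures are invented purely for illustration.

```python
# Illustrative sketch of the three data types using pandas; all figures are made up.
import pandas as pd

# Time series data: values collected over a period of time
# (e.g. monthly price of silver in one city).
time_series = pd.Series([52.1, 53.4, 51.8],
                        index=pd.period_range("2017-01", periods=3, freq="M"))

# Spatial data: values connected with a place at one point in time
# (e.g. district-wise rainfall in a state, in mm).
spatial = pd.Series({"District A": 812, "District B": 640, "District C": 975})

# Spatio-temporal data: indexed by both place and time.
spatio_temporal = pd.DataFrame(
    {"rainfall_mm": [812, 640, 790, 655]},
    index=pd.MultiIndex.from_product([["District A", "District B"], ["2016", "2017"]],
                                     names=["district", "year"]),
)

print(time_series, spatial, spatio_temporal, sep="\n\n")
```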

Categories of data:
Any statistical data can be classified under two categories depending upon the sources utilized. These categories are,
1. Primary data
2. Secondary data

Primary data:
Primary data is data collected by the investigator himself for the purpose of a specific inquiry or study. Such data is original in character and is generated by surveys conducted by individuals, research institutions or other organisations.
For example, if a researcher is interested in knowing the impact of a noon-meal scheme on school children, he has to undertake a survey and collect data on the opinions of parents and children by asking relevant questions. Data collected for such a purpose is called primary data.

The primary data can be collected by the following five methods.
1. Direct personal interviews.
2. Indirect Oral interviews.
3. Information from correspondents.
4. Mailed questionnaire method.
5. Schedules sent through enumerators.

1. Direct personal interviews:
The persons from whom information is collected are known as informants. The investigator personally meets them and asks questions to gather the necessary information. This method is suitable for intensive rather than extensive field surveys, and suits best an intensive study of a limited field.

Merits:
1. People willingly supply information because they are approached personally. Hence, this method draws more response than any other.
2. The information collected is likely to be uniform and accurate, since the investigator is there to clear the doubts of the informants.
3. Supplementary information on the informant’s personal aspects can be noted. Information on character and environment may help later to interpret some of the results.
4. Answers to questions about which the informant is likely to be sensitive can be gathered by this method.
5. The wording of one or more questions can be altered to suit any informant. Explanations may also be given in other languages. Inconvenience and misinterpretation are thereby avoided.

Limitations:
1. It is very costly and time consuming.
2. It is very difficult, when the number of persons to be interviewed is large and the persons are spread over a wide area.
3. Personal prejudice and bias are greater under this method.

2. Indirect Oral Interviews:
Under this method the investigator contacts witnesses, neighbours, friends or other third parties who are capable of supplying the necessary information. This method is preferred when the required information concerns addiction, or the cause of a fire, theft, murder, etc. If a fire has broken out at a certain place, the persons living in the neighbourhood and witnesses are likely to give information on its cause.
In some cases the police interrogate third parties who are supposed to have knowledge of a theft or a murder in order to get some clues. Enquiry committees appointed by governments generally adopt this method to get people’s views and all possible details of facts relating to the enquiry. This method is suitable whenever direct sources do not exist, cannot be relied upon or would be unwilling to part with the information.
The validity of the results depends upon a few factors, such as the nature of the person whose evidence is being recorded, the ability of the interviewer to draw out information from the third parties by means of appropriate questions and cross-examination, and the number of persons interviewed. For the success of this method, one person or one group alone should not be relied upon.

3. Information from correspondents:
The investigator appoints local agents or correspondents in different places and compiles the information sent by them. Information supplied to newspapers and to some departments of Government comes by this method. The advantage of this method is that it is cheap and appropriate for extensive investigations, but it may not ensure accurate results because the correspondents are likely to be negligent, prejudiced and biased. This method is adopted in those cases where information is to be collected periodically from a wide area over a long time.

4. Mailed questionnaire method:
Under this method a list of questions is prepared and sent to all the informants by post. The list of questions is technically called a questionnaire. A covering letter accompanying the questionnaire explains the purpose of the investigation and the importance of correct information, and requests the informants to fill in the blank spaces provided and to return the form within a specified time. This method is appropriate in those cases where the informants are literate and are spread over a wide area.

Merits:
1. It is relatively cheap.
2. It is preferable when the informants are spread over a wide area.

Limitations:
1. The greatest limitation is that the informants must be literate, able to understand and reply to the questions.
2. It is possible that some of the persons who receive the questionnaires do not return them.
3. It is difficult to verify the correctness of the information furnished by the respondents.
With a view to minimizing non-response and collecting correct information, the questionnaire should be carefully drafted. There is no hard and fast rule, but the following general principles may be helpful in framing the questionnaire. A covering letter and a self-addressed, stamped envelope should accompany the questionnaire.
The covering letter should politely point out the purpose of the survey and the privilege of the respondent in being one among the few associated with the investigation. It should give an assurance that the information will be kept confidential and will never be misused. It may promise a copy of the findings, free gifts, concessions, etc.

Characteristics of a good questionnaire:
1. The number of questions should be minimal.
2. Questions should be in a logical order, moving from easy to more difficult questions.
3. Questions should be short and simple. Technical terms and vague expressions capable of different interpretations should be avoided.
4. Questions fetching YES or NO answers are preferable. There may be some multiple-choice questions, but questions requiring lengthy answers are to be avoided.
5. Personal questions and questions which require memory power and calculations should also be avoided.
6. Questions should enable cross-checking, so that deliberate or unconscious mistakes can be detected to an extent.
7. Questions should be carefully framed so as to cover the entire scope of the survey.
8. The wording of the questions should be proper, without hurting feelings or arousing resentment.
9. As far as possible, confidential information should not be sought.
10. The physical appearance should be attractive, and sufficient space should be provided for answering each question.

5. Schedules sent through Enumerators:
Under this method enumerators or interviewers take the schedules, meet the informants and fill in their replies. A distinction is often made between a schedule and a questionnaire: a schedule is filled in by the interviewer in a face-to-face situation with the informant, whereas a questionnaire is filled in by the informant, who receives and returns it by post. This method is suitable for extensive surveys.

Merits:
1. It can be adopted even if the informants are illiterate.
2. Answers to questions of a personal and pecuniary nature can be collected.
3. Non-response is minimal, as enumerators go personally and contact the informants.
4. The information collected is reliable, and the enumerators can be properly trained for the purpose.
5. It is one of the most popular methods.

Limitations:
1. It is the costliest method.
2. Extensive training has to be given to the enumerators for collecting correct and uniform information.
3. Interviewing requires experience; unskilled investigators are likely to fail in their work.

Before the actual survey, a pilot survey is conducted: the questionnaire or schedule is pre-tested on a few of the people from whom the actual information is needed. If they misunderstand a question, find it difficult to answer or do not like its wording, it is to be altered. It must further be ensured that every question fetches the desired answer.

Merits and Demerits of primary data:
1. Collection of data by personal survey is possible only if the area covered by the investigator is small. Collection of data by sending enumerators is bound to be expensive, and care should be taken that the enumerators record the correct information provided by the informants.
2. Collection of primary data by framing schedules or by distributing and collecting questionnaires by post is less expensive and can be completed in a shorter time.
3. If the questions are embarrassing, of a complicated nature or probe into the personal affairs of individuals, the schedules may not be filled with accurate and correct information, and hence this method is unsuitable.
4. The information collected as primary data is more reliable than that collected from secondary data.

Secondary Data:
Secondary data are those data which have been already collected and analysed by some earlier agency for its own use; and later the same data are used by a different agency.

According to W.A.Neiswanger, ‘A primary source is a publication in which the data are published by the same authority which gathered and analysed them. A secondary source is a publication, reporting the data which have been gathered by other authorities and for which others are responsible’.

Sources of Secondary data:
In most of the studies the investigator finds it impracticable to collect first-hand information on all related issues and as such he makes use of the data collected by others. There is a vast amount of published information from which statistical studies may be made and fresh statistics are constantly in a state of production. The sources of secondary data can broadly be classified under two heads:

1. Published sources, and
2. Unpublished sources.

1. Published Sources:
The various sources of published data are:
1. Reports and official publications of
(i) International bodies such as the International Monetary Fund, International Finance Corporation and United Nations Organisation.
(ii) Central and State Governments such as the Report of the Tandon Committee and Pay Commission.
2. Semi-official publications of various local bodies such as Municipal Corporations and District Boards.
3. Private publications-such as the publications of –
(i) Trade and professional bodies such as the Federation of Indian Chambers of Commerce and Institute of Chartered Accountants.
(ii) Financial and economic journals such as ‘Commerce’ , ‘Capital’ and ‘ Indian Finance’ .
(iii) Annual reports of joint stock companies.
(iv) Publications brought out by research agencies, research scholars, etc.

It should be noted that the publications mentioned above vary with regard to the periodicity of publication. Some are published at regular intervals (yearly, monthly, weekly, etc.), whereas others are ad hoc publications, i.e., with no regularity about the periodicity of publication.

Note: A lot of secondary data is available on the internet and can be accessed at any time for further study.

2. Unpublished Sources
All statistical material is not always published. There are various sources of unpublished data, such as records maintained by various Government and private offices and studies made by research institutions, scholars, etc. Such sources can also be used where necessary.

Precautions in the use of Secondary data:
The following are some of the points to be considered in the use of secondary data:
1. How the data has been collected and processed.
2. The accuracy of the data.
3. How far the data has been summarized.
4. How comparable the data is with other tabulations.
5. How to interpret the data, especially when figures collected for one purpose are used for another.
Generally speaking, with secondary data, people have to compromise between what they want and what they are able to find.

Merits and Demerits of Secondary Data:
1. Secondary data is cheap to obtain. Many government publications are relatively cheap, and libraries stock quantities of secondary data produced by the government, by companies and by other organisations.
2. Large quantities of secondary data can be obtained through the internet.
3. Much of the available secondary data has been collected for many years and can therefore be used to plot trends.
4. Secondary data is of value to:
- the government, helping it make decisions and plan future policy;
- business and industry, in areas such as marketing and sales, in order to appreciate the general economic and social conditions and to provide information on competitors;
- research organisations, by providing social, economic and industrial information.

May 06, 2017 No comments
After knowing the relationship between two variables we may be interested in estimating (predicting) the value of one variable given the value of another. The variable predicted on the basis of other variables is called the “dependent” or the ‘explained’ variable and the other the ‘independent’ or the ‘predicting’ variable. The prediction is based on average relationship derived statistically by regression analysis. The equation, linear or otherwise, is called the regression equation or the explaining equation.

For example, if we know that advertising and sales are correlated we may find out expected amount of sales for a given advertising expenditure or the required amount of expenditure for attaining a given amount of sales.

The relationship between two variables can be considered between, say, rainfall and agricultural production, price of an input and the overall cost of product, consumer expenditure and disposable income. Thus, regression analysis reveals average relationship between two variables and this makes possible estimation or prediction.

 Definition: 

Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.

Types Of Regression:

The regression analysis can be classified into:
a) Simple and Multiple
b) Linear and Non-Linear
c) Total and Partial

a) Simple and Multiple: 

In case of simple relationship only two variables are considered, for example, the influence of advertising expenditure on sales turnover. In the case of multiple relationship, more than two variables are involved.

In this case, while one variable is a dependent variable, the remaining variables are independent ones. For example, the turnover (y) may depend on advertising expenditure (x) and the income of the people (z). The functional relationship can then be expressed as y = f(x, z).

b) Linear and Non-linear: 

Linear relationships are based on a straight-line trend, the equation of which has no power higher than one. Remember that a linear relationship can be either simple or multiple. Normally a linear relationship is assumed because, besides its simplicity, it has better predictive value: a linear trend can be easily projected into the future. In the case of a non-linear relationship, curved trend lines are derived; the equations of these are parabolic.

c) Total and Partial: 

In the case of total relationships all the important variables are considered. Normally, they take the form of multiple relationships because most economic and business phenomena are affected by a multiplicity of causes. In the case of a partial relationship one or more variables are considered, but not all, thus excluding the influence of those not found relevant for a given purpose.

Linear Regression Equation: 

If two variables have a linear relationship then as the independent variable (X) changes, the dependent variable (Y) also changes. If the different values of X and Y are plotted, then two straight lines of best fit can be made to pass through the plotted points. These two lines are known as regression lines. Again, these regression lines are based on two equations known as regression equations. These equations give the best estimate of one variable for a known value of the other. The equations are linear.
The linear regression equation of Y on X is
Y = a + bX ……. (1)
and that of X on Y is
X = a + bY ……. (2)
where a and b are constants (their values in the two equations are, in general, different).

From (1) we can estimate Y for a known value of X, and from (2) we can estimate X for a known value of Y.
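As a worked illustration of equation (1), the short Python sketch below estimates the constants a and b by the method of least squares, using the standard formulas b = Cov(X, Y) / Var(X) and a = mean(Y) − b·mean(X), on a handful of made-up (X, Y) points, and then uses the fitted line to estimate Y for a new X.

```python
# Minimal sketch: estimating the constants a and b in Y = a + bX by least squares.
# The data points are made up for illustration (e.g. advertising spend vs. sales).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable X
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable Y

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope: Cov(X, Y) / Var(X)
a = y.mean() - b * x.mean()                     # intercept: line passes through the means

print(f"Regression equation of Y on X: Y = {a:.2f} + {b:.2f} X")
print("Estimate of Y at X = 6:", a + b * 6)
```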
February 18, 2017 No comments
In the distributed age, news organizations are likely to see their stories shared more widely, potentially reaching thousands of readers in a short amount of time. At the Washington Post, we asked ourselves if it was possible to predict which stories will become popular. For the Post newsroom, this would be an invaluable tool, allowing editors to more efficiently allocate resources to support a better reading experience and richer story package, adding photos, videos, links to related content, and more, in order to more deeply engage the new and occasional readers clicking through to a popular story.
Here’s a behind-the-scenes look at how we approached article popularity prediction.

Data science application: Article popularity prediction

There has not been much formal work in article popularity prediction in the news domain, which made this an open challenge. For our first approach to this task, Washington Post data scientists identified the most-viewed articles on five randomly selected dates, and then monitored the number of clicks they received within 30 minutes after being published. These clicks were used to predict how popular these articles would be in 24 hours.
Using the clicks 30 minutes after publishing yielded poor results. As an example, here are five very popular articles:
Figures 1–5. Five very popular articles. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.
Table 1 lists the actual number of clicks these five articles received 30 minutes and 24 hours after being published. The takeaway: looking at how many clicks a story gets in the first 30 minutes is not an accurate way to measure its potential for popularity:
Table 1. Five popular articles.

Article            # clicks @ 30 mins    # clicks @ 24 hours
9/11 Flag          6,245                 67,028
Trump Policy       2,015                 128,217
North Carolina     1,952                 11,406
Hillary & Trump    1,733                 310,702
Gary Johnson       1,318                 196,798

Prediction features

In this prediction task, Washington Post data scientists have explored four groups of features: metadata, contextual, temporal, and social features. Metadata and contextual features, such as authors and readability, are extracted from the news articles themselves. Temporal features come mainly from an internal site-traffic collection system. Social features are statistics from social media sites, such as Twitter and Facebook.
Figure 6 lists all of the features we used in this prediction task. (More details about these features can be found in the paper "Predicting the Popularity of News Articles," on which we collaborated with Dr. Naren Ramakrishnan and Yaser Keneshloo from the Discovery Analytics Center at Virginia Tech.)
Figure 6. List of features used. Credit: Yaser Keneshloo, Shuguang Wang, Eui-Hong Han, Naren Ramakrishnan, used with permission.

Regression task

Figure 7 illustrates the process that we used to build regression models. In the training phase, we built several regression models using 41,000 news articles published by the Post. To predict the popularity of an article, we first collected all features within 30 minutes after its publication, and then used pre-trained models to predict its popularity in 24 hours.
Figure 7. Statistical Modeling. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.
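The exact model family and feature columns used at the Post are not spelled out here, so the following is only a minimal sketch of the train-then-predict flow in Figure 7: it uses scikit-learn, synthetic data, and an assumed gradient-boosting regressor standing in for whatever models were actually trained.

```python
# Hedged sketch of the training/prediction flow in Figure 7, not the Post's actual pipeline.
# The feature columns, their distributions and the model family are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.poisson(500, n),      # temporal feature: clicks in the first 30 minutes
    rng.integers(0, 2, n),    # metadata feature: e.g. a "has video" flag
    rng.normal(60, 10, n),    # contextual feature: e.g. a readability score
    rng.poisson(50, n),       # social feature: e.g. tweets in the first 30 minutes
])
y = 20 * X[:, 0] + rng.normal(0, 5000, n)   # synthetic 24-hour click counts

model = GradientBoostingRegressor()
print("cross-validated R^2:", cross_val_score(model, X, y, cv=10, scoring="r2").mean())

model.fit(X, y)                              # training phase on historical articles
new_article = [[800, 1, 55.0, 120]]          # features collected 30 minutes after publication
print("predicted clicks at 24 hours:", model.predict(new_article)[0])
```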

Evaluation

To measure the performance of the prediction task, we conducted two evaluations. First, we conducted a 10-fold cross validation experiment on the training articles. Table 2 enumerates the results of this evaluation. On average, the adjusted R2 is 79.4 (out of 100) with all features. At the same time, we realized that metadata information is the most useful feature aside from the temporal clickstream feature.
Table 2. 10-fold cross validation results.

Features               Predicted R2
Baseline               69.4
Baseline + Temporal    70.4
Baseline + Social      72.5
Baseline + Context     71.1
Baseline + Metadata    77.2
All                    79.4
The second evaluation was done in production after we deployed the prediction system at the Post. Using all articles published in May 2016, we got an adjusted R2 of 81.3 (out of 100).
Figure 8 shows scatter plots of prediction results for articles published that May. The baseline system on the left uses a single feature: the total number of clicks at 30 minutes. On the right is a more complete system using all features listed in Figure 6. The red lines in each plot are the lower and upper error bounds. Each dot represents an article, and ideally all dots would fall within the error bounds. As you can see, there are many more errors in the baseline system.
Figure 8. Scatter plot of prediction results (May 2016). Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.

Production deployment

We built a very effective regression model to predict the popularity of news articles. The next step was to deploy it to production at the Post.
The prediction quality relies on the accuracy of the features and the speed with which they can be obtained, so it is preferable to build this prediction task as a streaming service that collects up-to-date features in real time. However, this comes with a challenge: we have to process tens of millions of click data points every day to predict the popularity of thousands of Post articles. A streaming infrastructure enables fast prediction with minimal delay.

Architecture

Figure 9 illustrates the overall architecture of the prediction service in the production environment at the Post. Visitors who read news articles generate page view data, which is stored in a Kafka server and then fed into back-end Spark Streaming services. Other features such as metadata and social features are collected by separate services, and then fed into the same Spark Streaming services. With all these collected features, prediction is done with a pre-trained regression model, and results are stored to an HBase server and also forwarded to the Kafka server. The Post newsroom is also alerted of popular articles via Slack and email.
Figure 9. System Architecture. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.

Spark Streaming in clickstream collection

The Spark Streaming framework is used in several components of our prediction service. Figure 10 illustrates one process we use Spark Streaming for: collecting and transforming the clickstream (page view) data into prediction features.
The clickstream stored in Kafka is fed into the Spark Streaming framework in real time. The streaming process converts this real-time stream into smaller batches of clickstream data. Each batch is converted into a simplified form and then into a format the pre-trained regression model can easily consume. Finally, the page view features are stored in the HBase database.
Figure 10. Clickstream processing. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.
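A hedged sketch of this clickstream step is shown below, written against PySpark's DStream API with the Kafka connector that shipped with Spark 1.x/2.x. The topic name, message format, broker address and the save step are assumptions for illustration, not the Post's actual code.

```python
# Hedged sketch of the clickstream step in Figure 10: consume page views from Kafka,
# aggregate clicks per article in small batches, and hand the counts to a store.
# Topic name, message format and the save function are assumptions, not the Post's code.
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # DStream Kafka connector in Spark 1.x/2.x

sc = SparkContext(appName="clickstream-features")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

stream = KafkaUtils.createDirectStream(
    ssc, topics=["pageviews"], kafkaParams={"metadata.broker.list": "kafka:9092"})

def save_counts(rdd):
    # In production this would update the HBase feature table; here we just print.
    for article_id, clicks in rdd.collect():
        print(article_id, clicks)

clicks_per_article = (
    stream.map(lambda kv: json.loads(kv[1]))          # Kafka value holds a JSON page view
          .map(lambda view: (view["article_id"], 1))
          .reduceByKey(lambda a, b: a + b))           # clicks per article in this batch
clicks_per_article.foreachRDD(save_counts)

ssc.start()
ssc.awaitTermination()
```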

System in the real world

Washington Post journalists monitor predictions using real-time Slack and email notifications. The predictions can be used to drive promotional decisions on the Post home page and social media channels.
We created a Slack bot to notify the newsroom if, 30 minutes after being published, an article is predicted to be extremely popular. Figure 11 shows Slack notifications with the current number and forecasted number of clicks in 24 hours.
Figure 11. Slack bot. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.
We also automatically generate emails that gather that day’s predictions and summarize the articles’ predicted and actual performance at the end of the day. Figure 12 shows an example of these emails. The email contains the publication time, predicted clicks, actual clicks in the first 30 minutes, actual clicks in the first 24 hours, and actual clicks from social media sites in the first 30 minutes.
Figure 12. Popularity summary email. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.
In addition to being a tool for our newsroom, we are also integrating the prediction system into Washington Post advertising products such as PostPulse. PostPulse packages advertiser content with related editorial content and delivers a tailored, personalized advertisement to the target group. Figure 13 shows an example of this product in action, in which an advertiser’s video on 5G wireless technology is paired with editorially produced technology articles. A member of the advertising team puts the package together and receives candidate editorial articles as recommendations to include in the package. These articles are ranked according to relevance and expected popularity.
Figure 13. PostPulse example. Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission.

Practical challenges

We faced two main challenges when we deployed this service to production. The first is the scale of the data. Each day, we process a huge and increasing amount of data for prediction, and the system must scale with it using limited resources. We profiled the service’s performance in terms of execution time and identified persistent storage (HBase) as a significant bottleneck: every store or update to HBase is expensive. To reduce this cost, we accumulate multiple updates before we physically write to HBase. This runs the risk of some data loss and less accurate predictions if the service crashes between two updates, but we’ve tuned the system and found a good balance so updates are not delayed too long.
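The buffering idea can be illustrated with a minimal sketch. The happybase client, the flush threshold, the host name and the column names below are placeholders chosen for illustration, not the Post's implementation.

```python
# Minimal sketch of the "accumulate before writing" idea described above.
# The happybase client, host, flush threshold and column names are placeholders.
import happybase

class BufferedFeatureWriter:
    def __init__(self, table_name, flush_every=500):
        self.connection = happybase.Connection("hbase-host")
        self.table = self.connection.table(table_name)
        self.flush_every = flush_every
        self.buffer = {}

    def update(self, article_id, features):
        """Accumulate feature updates in memory instead of writing each one immediately.

        `features` maps HBase column names (e.g. "f:clicks_30m") to values.
        """
        self.buffer.setdefault(article_id, {}).update(features)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        """Write all buffered rows in one batch; a crash before this point loses the buffer."""
        with self.table.batch() as batch:
            for article_id, features in self.buffer.items():
                batch.put(article_id.encode(),
                          {col.encode(): str(val).encode() for col, val in features.items()})
        self.buffer.clear()
```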
The second challenge is dependencies on external services, which we use to collect various features. However, if these external APIs are not reachable, the prediction service should still be available. Thus, we adopted a decoupled microservice infrastructure, in which each feature collection process is a separate microservice. If one or more microservices are down, the overall prediction service will still be available, just with reduced accuracy until these external services are back online.

Continuous experiments and future work

Moving forward, we will explore a few directions. First, we want to identify the time frame in which an article is expected to reach peak traffic (after running some initial experiments, the results are promising). Second, we want to extend our prediction to articles not published by the Washington Post. Last but not least, we want to address distribution biases in the prediction process. Articles can get much more attention when they are in a prominent position on our home page or spread through large channels on social media.
February 18, 2017 No comments