## Tuesday, 3 December 2013

### R in Insurance Conference, London, 14 July 2014

Following the very positive feedback that Andreas and I have received from delegates of the first R in Insurance conference in July of this year, we are planning to repeat the event next year. We have already reserved a bigger auditorium.

The second conference on R in Insurance will be held on Monday 14 July 2014 at Cass Business School in London, UK.

This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation. Topics covered may include actuarial statistics, capital modelling, pricing, reserving, reinsurance and extreme events, portfolio allocation, advanced risk tools, high-performance computing, econometrics and more. All topics will be discussed within the context of using R as a primary tool for insurance risk management, analysis and modelling.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:
• Arthur Charpentier, Département de mathématiques, Université du Québec à Montréal
• Montserrat Guillen, Department of Econometrics, University of Barcelona, together with Leo Guelman, Royal Bank of Canada (RBC Insurance division)

The members of the scientific committee are: Katrien Antonio (University of Amsterdam and KU Leuven), Christophe Dutang (Université du Maine, France), Jens Nielsen (Cass), Andreas Tsanakas (Cass) and Markus Gesmann (ChainLadder project).

Details about the registration and abstract submission process will be published soon on www.RinInsurance.com.

The organisers, Andreas Tsanakas and Markus Gesmann, gratefully acknowledge the sponsorship of Mango Solutions, RStudio, Cybaea and PwC.

## Tuesday, 26 November 2013

### Not only verbs but also beliefs can be conjugated

Following on from last week, where I presented a simple example of a Bayesian network with discrete probabilities to predict the number of claims for a motor insurance customer, I will look at continuous probability distributions today. Here I follow example 16.17 in Loss Models: From Data to Decisions [1].

Suppose there is a class of risks that incurs random losses following an exponential distribution (density $$f(x) = \Theta {e}^{- \Theta x}$$) with mean $$1/\Theta$$. Further, I believe that $$\Theta$$ varies according to a gamma distribution (density $$f(x)= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha \,-\, 1} e^{- \beta x }$$) with shape $$\alpha=4$$ and rate $$\beta=1000$$.

In the same way as I had good and bad drivers in my previous post, here I have clients with different risk characteristics, reflected by the gamma distribution. I shall call the gamma distribution with the above parameters my prior parameter distribution and the exponential distribution the prior predictive distribution.

The textbook tells me that the unconditional mixed distribution of an exponential distribution with parameter $$\Theta$$, whereby $$\Theta$$ has a gamma distribution, is a Pareto II distribution (density $$f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}$$) with parameters $$\alpha,\, \beta$$. Its k-th moment is given in the general case by
$E[X^k] = \frac{\beta^k\,\Gamma(k+1)\,\Gamma(\alpha - k)}{\Gamma(\alpha)},\; -1 < k < \alpha.$ Thus, I can calculate the prior expected loss ($$k=1$$) as $$\frac{\beta}{\alpha-1} = 333.33$$.
Now suppose I have three independent observations, namely losses of $100, $950 and $450 over the last three years. The mean loss is $500, which is higher than the $333.33 of my model.

Question: How should I update my belief about the client's risk profile to predict the expected loss cost for year 4 given those 3 observations?

Visually I can regard this scenario as a graph, with evidence set for years 1 to 3 that I want to propagate through to year 4.
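Assuming the standard conjugacy result for an exponential likelihood with a gamma prior (the posterior is again gamma, with shape $$\alpha + n$$ and rate $$\beta + \sum x_i$$), the update can be sketched in a few lines of base R; the variable names are my own:

```r
## Conjugate gamma-exponential update (numbers follow example 16.17)
alpha <- 4; beta <- 1000            # prior gamma parameters
losses <- c(100, 950, 450)          # observed losses in years 1 to 3

## Prior predictive (Pareto) mean: beta / (alpha - 1)
prior.mean <- beta / (alpha - 1)    # 333.33

## Posterior gamma parameters: shape alpha + n, rate beta + sum(x)
alpha.post <- alpha + length(losses)  # 7
beta.post  <- beta + sum(losses)      # 2500

## Posterior predictive mean for year 4
post.mean <- beta.post / (alpha.post - 1)  # 416.67
```

The posterior predictive mean of about 417 sits between the prior expectation of 333.33 and the observed mean of 500, the usual credibility compromise.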

## Tuesday, 19 November 2013

### Predicting claims with a Bayesian network

Here is a little Bayesian Network to predict the claims for two different types of drivers over the next year, see also example 16.15 in [1].

Let's assume there are good and bad drivers. The probabilities that a good driver will have 0, 1 or 2 claims in any given year are set to 70%, 20% and 10%, while for bad drivers the probabilities are 50%, 30% and 20% respectively.

Further I assume that 75% of all drivers are good drivers and only 25% would be classified as bad drivers. Therefore the average number of claims per policyholder across the whole customer base would be:
0.75*(0*0.7 + 1*0.2 + 2*0.1) + 0.25*(0*0.5 + 1*0.3 + 2*0.2) = 0.475
Now a customer of two years asks for his renewal. Suppose he had no claims in the first year and one claim last year, how many claims should I predict for next year? Or in other words, how much credibility should I give him?

To answer the above question I present the data here as a Bayesian network using the gRain package [2]. I start with the probability table for the driver type and the conditional probability tables for 0, 1 and 2 claims in years 1 and 2. As I assume independence between the years, I use the same probabilities for both. I can now review my model as a mosaic plot (above) and as a graph (below) as well.

Next, I set the client's evidence (0 claims in year one and 1 claim in year two) and propagate it back through my network to estimate the probabilities that the customer is either a good (73.68%) or a bad (26.32%) driver. Knowing that a good driver has on average 0.4 claims a year and a bad driver 0.7 claims, I predict the number of claims for my customer with the given claims history as 0.4789.

Alternatively I could have added a third node for year 3 and queried the network for the probabilities of 0, 1 or 2 claims, given that the customer had zero claims in year 1 and one claim in year 2. The sum-product of the number of claims and their probabilities gives me again an expected claims number of 0.4789.
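The same numbers can be verified without gRain by applying Bayes' theorem directly in base R; this is just a quick sketch, and the variable names are mine:

```r
## Prior driver-type probabilities and claim probabilities per driver type
prior  <- c(good = 0.75, bad = 0.25)
claims <- rbind(good = c(0.7, 0.2, 0.1),   # P(0, 1, 2 claims | good)
                bad  = c(0.5, 0.3, 0.2))   # P(0, 1, 2 claims | bad)

## Likelihood of the evidence: 0 claims in year 1, 1 claim in year 2
lik  <- claims[, 1] * claims[, 2]
post <- prior * lik / sum(prior * lik)     # approx. good = 0.7368, bad = 0.2632

## Expected number of claims next year, weighted by the posterior
expected <- sum(post * claims %*% 0:2)     # approx. 0.4789
```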

### References

[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2004), Loss Models: From Data to Decisions, Wiley Series in Probability and Statistics.

[2] Søren Højsgaard (2012). Graphical Independence Networks with the gRain Package for R. Journal of Statistical Software, 46(10), 1-26. URL http://www.jstatsoft.org/v46/i10/

### Session Info

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rgraphviz_2.6.0 gRain_1.2-2     gRbase_1.6-12   graph_1.40.0

loaded via a namespace (and not attached):
[1] BiocGenerics_0.8.0 igraph_0.6.6       lattice_0.20-24    Matrix_1.1-0
[5] parallel_3.0.2     RBGL_1.38.0        stats4_3.0.2       tools_3.0.2

## Tuesday, 12 November 2013

### googleVis 0.4.7 with RStudio integration on CRAN

In my previous post, I presented a preview version of googleVis that provided an integration with RStudio's Viewer pane (introduced with version 0.98.441).

Over 80% in my little survey favoured the new default output mechanism of googleVis within RStudio. Hence, I uploaded googleVis 0.4.7 on CRAN over the weekend.

However, there were also some thoughtful comments suggesting that the RStudio Viewer pane is not always the best option. Flash charts and gvisMerge output will still be displayed in your default browser. And if you work on larger charts with a smaller screen, the browser may still be the better option compared to the Viewer pane; of course, you can launch the browser from the Viewer pane as well.

Hence, googleVis gained a new option 'googleVis.viewer' that controls the default output of the googleVis plot method. On package load it is set to getOption("viewer"), and if you use RStudio, its Viewer pane will be used for displaying non-Flash and un-merged charts. You can set options("googleVis.viewer" = NULL) and the googleVis plot function will open all output in the default browser again. Thanks to J.J. from RStudio for the tip.
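As a small sketch, switching between the two output targets looks like this (setting the option does not even require googleVis to be loaded):

```r
## Toggle the default display target for googleVis charts
old <- getOption("googleVis.viewer")

options(googleVis.viewer = NULL)                 # open charts in the browser
options(googleVis.viewer = getOption("viewer"))  # RStudio Viewer pane, if available

options(googleVis.viewer = old)                  # restore the previous setting
```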

The screen shot below shows a geo chart in the RStudio Viewer pane, displaying the track of the devastating typhoon Haiyan that hit Southeast Asia last week.

### Session Info

RStudio v0.98.456 and R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.2


## Tuesday, 5 November 2013

### Display googleVis charts within RStudio

The preview version 0.98.441 of RStudio introduced a new viewer pane to render local web content and with that it allows me to display googleVis charts within RStudio rather than in a separate browser window.

I think this is a rather nice feature and hence I have updated the plot method in googleVis to use the RStudio viewer pane as the default output. If you use another editor, or if the plot is using one of the Flash based charts, then the browser is still the default display.

The behaviour can also be controlled via the option viewer. Set options("viewer"=NULL) and googleVis will plot all output in the browser again.

Of course shiny apps can also run in the viewer pane. Here is the example from the renderGvis help page of googleVis. For more information about the new viewer pane see the online RStudio documentation.

For the time being you can get the new version 0.4.6 of googleVis only from our project site. Please get in touch if you find any issues or bugs with this version, or add them to our issues list.

Is this a step in the right direction? Please use the voting buttons below.

### Session Info

R Under development (unstable) (2013-10-25 r64109)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.1.0 

## Tuesday, 29 October 2013

### High resolution graphics with R

For most purposes PDF or other vector graphics formats such as Windows Metafile and SVG work just fine. However, if I plot lots of points, say 100,000, then those files can get quite large, and bitmap formats like PNG can be the better option. I just have to be mindful of the resolution.

As an example I create the following plot:
x <- rnorm(100000)

Saving the plot as a PDF creates a 5.2 MB file on my computer, while the PNG output is only 62 KB. Of course, the PNG doesn't look as crisp as the PDF file.
png("100kPoints72dpi.png", units = "px", width = 400, height = 400)
plot(x, main = "100,000 points", col = adjustcolor("black", alpha = 0.2))
dev.off()

Hence, I increase the resolution to 150 dots per inch (dpi).
png("100kHighRes150dpi.png", units = "px", width = 400, height = 400, res = 150)
plot(x, main = "100,000 points", col = adjustcolor("black", alpha = 0.2))
dev.off()

This looks a bit odd. The file size is only 29 KB, but the annotations look too big. Well, the file still has only 400 x 400 pixels and the size of a pixel is fixed. Thus, I have to provide more pixels, or in other words increase the plot size. Doubling the width and height as I double the resolution makes sense.
png("100kHighRes150dpi2.png", units = "px", width = 800, height = 800, res = 150)
plot(x, main = "100,000 points", col = adjustcolor("black", alpha = 0.2))
dev.off()

Next I increase the resolution further to 300 dpi and the graphic size to 1600 x 1600 pixels. The output is still very crisp. Of course the file size increased; now it is 654 KB, yet still only about 1/8 of the PDF, and I can embed it in LaTeX as well.
png("100kHighRes300dpi.png", units = "px", width = 1600, height = 1600, res = 300)
plot(x, main = "100,000 points", col = adjustcolor("black", alpha = 0.2))
dev.off()

Note, you can click on the charts to access the original files of this post.

## Tuesday, 22 October 2013

### Review: Kölner R Meeting 18 October 2013

The Cologne R user group met last Friday for two talks, on split-apply-combine in R and on XLConnect, given by Bernd Weiß and Günter Faes respectively, before the usual Schnitzel and Kölsch at the Lux.

### Split-apply-combine in R

The apply family of functions in R is incredibly powerful, yet often somewhat mysterious to newcomers. Thus, Bernd gave an overview of the different apply functions and their cousins. The various functions differ in the objects they accept as input, e.g. vectors, arrays, data frames or lists, and in their outputs. Other related functions are by, aggregate and ave. While functions like aggregate reduce the output size, others like ave return as many rows as the input object and repeat the results where necessary.
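A few toy calls illustrate these input/output differences (my own examples, not taken from Bernd's talk):

```r
## apply works on arrays, sapply on lists/vectors
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                      # by row: 9 12
sapply(list(a = 1:3, b = 4:6), mean)  # list in, named vector out: a = 2, b = 5

## aggregate reduces to one row per group; ave keeps the input length
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))
aggregate(x ~ g, data = df, FUN = mean)  # two rows, one per group
ave(df$x, df$g)                          # three values: group means repeated
```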

As an alternative to the base R functions, Bernd also touched on the **ply functions of the plyr package. The function names are certainly easier to remember, but their syntax can be a little awkward (.()). Bernd's slides, in German, are already available from our Meetup site.

### XLConnect

When dealing with data stored in spreadsheets, most members of the group rely on read.csv and write.csv in R. However, if you have a spreadsheet with multiple tabs and formatted numbers, read.csv becomes clumsy, as you would have to save each tab without any formatting in a separate file.

Günter presented the XLConnect package as an alternative to read.csv, or indeed RODBC, for reading spreadsheet data. It uses the Apache POI API as the underlying interface. XLConnect requires a Java runtime environment on your computer, but no installation of Excel. That makes it a truly platform-independent solution for exchanging data between spreadsheets and R. Not only can you read defined rows and columns, or indeed named ranges, from Excel into R, but data can be stored in Excel files in the same way, and, to top it all, graphic output from R as well.

### Next Kölner R meeting

The next meeting is scheduled for 13 December 2013. A discussion of the data.table package is already on the agenda.

Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.

Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.