Publishing data analysis post

2017-12-28 20:09:26 -05:00 · 2017-12-28 20:09:26 -05:00 · 61c8fac08c
commit 61c8fac08c
parent 4247ac4d79
2 changed files with 69 additions and 23 deletions
--- a/content/post/day-trading-generating-training-data.md
+++ b/content/post/day-trading-generating-training-data.md
@@ -1,38 +1,84 @@
 ---
 title: "Getting Into Day Trading: Analyzing The Moving Average"
 date: 2017-11-04T14:11:54-04:00
-draft: true
-tags: ["day trading", "data analysis", "julia"]
+draft: false
+tags: ["day trading", "data analysis", "python"]
 ---

-Now that we have a Julia environment good to go, and a dataset available, time to start doing some real analysis.
-
 I know that I have this bit of data for the WLTW symbol, and what would be helpful is to see that data completely
 plotted in all of its glory. Let's take a look at the closing costs (y) plotted against the date (x).

 ![Image](/img/post/WLTW_CLOSING_COSTS.png)

-Not bad, we can see an ok trend going from January to December 2016. This data isn't very useful yet but I can
-showcase some awesome Julia packages, and how I generated the graph.
+This is a good start, but how well do the SMAs track this closing cost? Let's first write a little Python that computes
+the SMA for each window, keyed by the position where that window starts so it can be placed on the X-axis.

-I used DataFrames.jl to store the data, Query.jl to grab a subset of the data, and Gadfly.jl to plot the data.
-All of these are excellent libraries for doing your thing when analyzing.
+```python
+import numpy as np

-```julia
-data = readtable("prices.csv", header=True)
-q = @from i in data begin
-    @where i.symbol == "WLTW"
-    @select {i.date, i.close}
-    @collect DataFrame
-end
-
-p = (q, y=:close, Geom.Point, Guide.Title("Closing Costs: WLTW - 2016"))
-draw(PNG("wltw_closing_costs.png", 6inch, 4inch), p)
+def moving_avs(col, window):
+    averages = {}  # window start position -> mean of that window
+    for i in range(0, len(col), window):
+        averages[i] = np.mean(col[i:i+window])  # last window may be shorter
+    return averages
 ```

-Now I'd like to add the plots for the 3-day SMA, and the 5-day SMA to the plot of WLTW closing costs. What these
-are, are the average of either the last 3 days or the last 5 days for a single datapoint. I believe that
-by doing so, we may be able to visualize if either datapoint is adequate in predicting trends in this data. I'll be looking for
-how close any given moving average is to the actual trend of the close costs for the WLTW security.
+Using NumPy for the analysis and pandas Series to hold my values, I can use this function to create a dictionary tracking exactly which
+position each SMA window starts at, as well as the SMA for that window. `window` becomes the step size in the `range` call, so the windows do
+not overlap, and `np.mean` does the work of averaging each slice of the data.

+Now I can plug in my values to the function to generate some simple moving averages.
+
+```python
+data = pd.read_csv("prices.csv")  # needs `import pandas as pd`
+wltw = data[data["symbol"] == "WLTW"].reset_index(drop=True)  # 0-based rows so both plots share an X-axis
+
+threedaysma = moving_avs(wltw["close"], 3)
+fivedaysma = moving_avs(wltw["close"], 5)
+```
+
+Back to the original question: how well do the SMAs track the closing cost? Let's find out.
+
+```python
+import matplotlib.pyplot as plt
+plt.plot(wltw["close"])
+plt.plot(list(threedaysma.keys()), list(threedaysma.values()))
+plt.show()
+```
+
+The resulting graph is here for the three day SMA:
+
+![Image](/img/post/threedaysma.png)
+
+That looks very promising for this small time range. The three-day SMA follows the closing cost closely.
+Now I don't know about you, but I'd like to measure just how closely the three-day SMA follows the closing cost. I learned
+in statistics about a little measure called correlation. From the interwebs:
+
+> Correlation is a statistical measure for how two or more variables fluctuate together.
+
+Now, I won't go into too many details here, as I have mammoth libraries at my disposal. However, I can explain the basics of
+the measure. A correlation coefficient lies between -1 and 1, inclusive. A value of 1 means that the two datasets are perfectly positively
+correlated (they fluctuate together), while a value of -1 means that they are perfectly negatively correlated (they fluctuate inversely). Any number in between
+indicates how strongly the datasets are correlated, positively or negatively, and 0 means that there is no linear correlation at all.
+
+To calculate correlation, I use NumPy's `np.corrcoef`, which computes the Pearson product-moment correlation for two array-like inputs. First, I clean the data.
+I'll do this by dropping close cost values that don't sit at the start of a 3-day SMA window, which are the positions `moving_avs` uses as keys.
+
+```python
+cleaned = wltw.iloc[list(threedaysma.keys()),:]
+```
+
+And now, to calculate our Pearson product-moment correlation coefficients:
+
+```python
+threedaysma_array = np.array(list(threedaysma.values()))
+print(np.corrcoef(cleaned["close"], threedaysma_array))
+
+#[[ 1.          0.96788571]
+# [ 0.96788571  1.        ]]
+
+```
+
+The off-diagonal entries of this 2D array are the correlation between the two series, and at about 0.97 the data are very strongly correlated for the points we kept.
+That is encouraging, and it may be useful moving forward for forecasting and short-term trading, though some of the correlation is expected because each SMA includes the closing value it is compared against.

--- a/static/img/post/threedaysma.png
+++ b/static/img/post/threedaysma.png