Hamming distance

In Information theory , the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

Source : Wikipedia Hamming distance

In KDB, finding the hamming distance of two strings would be really trivial.

q)hammDist:{count where x=y}
q)hammDist["karolin" ; "kathrin"]
3
q)hammDist[1011101b;1001001b]
2

K equivalent would be :

k)hammDist:{#&~x=y}

Though it would need some input checks like type check, string length comparison.

q)hammDist:{$[count[x]=count[y];count where  (),xy;'`mismatch]}
q)
q)hammDist["k" ; "k"]
0
q)hammDist["k" ; "ks"]
'mismatch
q)hammDist[1011101b;1001001b]
2

 

Euclidean distance

In mathematics, the Euclidean distance  is the “ordinary” straight-line distance or geometric distance between two points in multidimensional space. 

Source : Wikipedia

The Euclidean distance between points p and q is the length of the line segment connecting them ( \overline{\mathbf{p}\mathbf{q}}).

In Cartesian coordinates, if p = (p1p2,…, pn) and q = (q1q2,…, qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p is given by the Pythagorean formula:

Voronoi diagrams using Euclidean distance method

As a simple illustration, consider a group of shops in a city. Suppose we want to estimate the number of customers of a given shop. With all else being equal (price, products, quality of service, etc.), it is reasonable to assume that customers choose their preferred shop simply by distance considerations: they will go to the shop located nearest to them.

Squared Euclidean distance

The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart. In this case, the equation becomes

In KDB+ , the code would look like :

q)euclDist:{sqrt sum xexp[(x-y);2]}
q)euclDist[4 2;1 6]
5f

q)euclSqDist:{sum xexp[(x-y);2]}
q)euclSqDist[4 2;1 6]
25f

Covariance

Covariance

Covariance is a measure of how much two random variables vary together. It generalizes the concept of variance to multiple random variables.  Instead of measuring the variation of a single random variable, the covariance measures the variation of two variables with each other.

If greater values of one variable correspond to the greater values of another variable (same hold for the lesser values), the variables tend to show similar behavior and the Covariance is said to be positive.  On the other hand, if the greater number of values of one variable correspond to the lesser number of value of another variable, the variables tend to show opposite behavior and the Covariance is said to be negative. The sign of the Covariance shows the tendency of the linear relationship between the variables. A zero Covariance suggests the variable are uncorrelated.

In the Financial world, Covariance is a measure of the degree to which returns on two risky assets move in tandem. A positive covariance means that asset returns move together, while a negative covariance means returns move inversely. ( Source – Investopedia )

Definition

The covariance between two jointly distributed real-valued random variables X and Y is defined as the expected product of their deviations from their individual expected values:

\operatorname {cov} (X,Y)=\operatorname {E} {{\big [}(X-\operatorname {E} [X])(Y-\operatorname {E} [Y]){\big ]}},

where E[X] is the expected value of X, also known as the mean of X. The covariance is also sometimes denoted σXY or σ(X,Y), in analogy to variance.

Example

Example of covariance between Google and Apple prices in Excel:

covar.png

KDB

KDB has an inbuilt function cov to calculate the covariance.

q)x:1126.79 1143.75 1118.29 1104.73 1069.52 1078.92 1090.93 1095.06 1109.64 1126.00 1160.04 1164.50;
q)y:175.50 178.97 178.39 178.12 175.00 176.21 176.82 176.67 175.03 176.94 179.98 181.72
q)
q)cov[x;y]
44.7096
Resources

Wikipedia

Variance

Variance is the squared deviation of a random variable from its mean. It measures how far a set of (random) numbers are spread out from their average value.

it is an important tool is statistical science. it is the square of standard deviation, and the covariance of the random variable with itself, often represented by  ,  or .

Definition

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right].

where X is random variable &  μ is mean of   (    ) 

In terms of covariance  :

Following is an example from Wikipedia showing the two populations with the same mean but different variances. The red population has mean 100 and variance 100 (SD=10) while the blue population has mean 100 and variance 2500 (SD=50).

KDB has an inbuilt function to calculate the variance of a population :

q)var 3 5 5 5 6 6 8 10i
4f

Note that the type is automatically promoted to float even the input is of data type integer.

Standard Deviation Vs Absolute Deviation

Standard Deviation

Standard Deviation is the square root of the variance of the random variable or probability distribution. Variance in itself is an excellent measure of variability and range; a larger variance reflects a greater spread in the data set.

Check out my post around Standard Deviation

Absolute deviation of an element of a data set is the absolute difference between that element and a given point. Typically the deviation is reckoned from the central value, being construed as some type of average, most often the median or sometimes the mean of the data-set.

Average absolute deviation

The average absolute deviation of a data set is the average of the absolute deviations. Note here we are talking about averaging the absolute deviation calculated using the mean, median, mode, or some another measure of central tendency.

Check out my post on different ways of calculating the Average absolute deviation

 

 

Moving Average

In statistics, a Moving Average is used to analyze a set of data points by creating a series of averages of different subsets of the full data set.

Given a data series and fixed subset size(N), the first element of Moving Average is obtained by averaging the initial fixed subset (first N elements) of number series. Then the subset is move forward by 1 element, removing the first element and adding the N+1th element of the subset, and average is calculated for the 2nd element of the Moving average. And the process is repeated over the entire data series.

Following graph depicts the Moving Average of Google Weekly closing price for 8 years for window size 10 & 20.

MovingAverage

Moving Average is used commonly for analyzing the short term fluctuations from a time series data and highlighting the long term trends. It can be easily seen in the above graph that the line MA(20) is smoothing out a lot of fluctuation as compared to the MA(10).

Moving averages help smooth price action and filter out the noise. They also form the building blocks for many other technical indicators and overlays, such as the McClellan Oscillator and Bollinger Bands.

Types of Moving Average :

Simple moving average (SMA) and Exponential moving average (EMA) are the most popular type of Moving Average.

Moving Average Excel

Amicable numbers

Amicable numbers are two different numbers so related that the sum of the proper divisors of each is equal to the other number.

For example, the smallest pair of amicable numbers is (220, 284); for the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110, of which the sum is 284; and the proper divisors of 284 are 1, 2, 4, 71 and 142, of which the sum is 220.

Thâbit ibn Kurrah rule

The Thâbit ibn Kurrah rule is a method for discovering amicable numbers invented in the tenth century by the Arab mathematician Thâbit ibn Kurrah.

It states that if

p = 3 × 2n − 1 − 1,
q = 3 × 2n − 1,
r = 9 × 22n − 1 − 1,

where n > 1 is an integer and p, q, and r are prime numbers, then 2n×p×q and 2n×r are a pair of amicable numbers. This formula gives the pairs (220, 284) for n=2, (17296, 18416) for n=4, and (9363584, 9437056) for n=7, but no other such pairs are known. Numbers of the form 3 × 2n − 1 are known as Thabit numbers. In order for Thābit’s formula to produce an amicable pair, two consecutive Thabit numbers must be prime; this severely restricts the possible values of n.

Euler’s rule

Euler’s rule’ is a generalization of the Thâbit ibn Kurrah rule. It states that if

p = (2(n – m)+1) × 2m − 1,
q = (2(n – m)+1) × 2n − 1,
r = (2(n – m)+1)2 × 2m + n − 1,

where n>m> 0 are integers and p, q, and r are prime numbers, then 2n×p×q and 2n×r are a pair of amicable numbers. Thābit’s rule corresponds to the case m=n-1. Euler’s rule creates additional amicable pairs for (m,n)=(1,8), (29,40) with no others being known.

q code for Thabit rule

q)pThabit:{-1+3*2 xexp x-1};
q)qThabit:{-1+3*2 xexp x};
q)rThabit:{-1+9*2 xexp -1+2*x};
q)isPrime:{ $[x~2;1b;0~x mod 2;0b;x<1;0b;2=count where 0=( x mod ) each til x+1]};
q)amicableNos:{(x[0]*x[1]*2 xexp y;x[2]*2 xexp y) };
q)thabitNos:{$[all isPrime each r:`int$(pThabit;qThabit;rThabit)@\:x; amicableNos[r;x];0]} ;
q)thabitNos[2]
220 284f

Source: Wikipedia

Step Function

A mathematical function of a single variable that remains constant within each of a series of adjacent intervals but changes in value from one interval to the next.

Examples :

  • Mathematical functions like floor function, ceiling function, signum function
  • A constant function is a trivial example of a step function e.g.  y=1
  • The rectangular function, the normalized boxcar function, is the next simplest step function, and is used to model a unit pulse.

ceiling step function

Step function in KDB

Signum function

The function signum returns -1, 0 or 1 if the argument is negative, zero or positive respectively. It can be applied item-wise to lists, dictionaries and tables, and to all data types except symbol.

q)signum  -1 -3 9 0 2 -2 3
-1 -1 1 0 1 -1 1i

Floor function

The floor function maps a real number to the largest previous integer. More precisely, floor(x) = \lfloor x\rfloor is the largest integer not greater than x.

q)floor 8.4 6.3 9.3 5.4 3.8 9.7 8.8 5.8 6.8 4.5
8 6 9 5 3 9 8 5 6 4

Ceiling function

The ceiling function map a real number to the smallest following integer. More precisely,  ceiling(x) =  \lceil x \rceil is the smallest integer not less than x.

q)ceiling 8.4 6.3 9.3 5.4 3.8 9.7 8.8 5.8 6.8 4.5
9 7 10 6 4 10 9 6 7 5

Sorted attribute (`s#)

The sorted attribute(`s#)  when applied to a dictionary makes the dictionary into a step function.

q)sd:`s#1 3 5 7 9!1 3 25 49 81
q)d[6]
25
q)d[8]
49

Temporal Data

In traditional RDMSs, temporal changes in data are often represented by adding valid-time interval information to each relationship. This is usually achived by adding start and end columns to the relational tables.
Applying a sorted attribute(`s#) on a keyed table gives similar effect in KDB. check out more on Temporal data.

Parted attribute (`p#)

The parted attribute (`#p) indicates that the list represents a step function in which all occurrences of a particular output value are adjacent. The range is an int or temporal type that has an underlying int value, such as years, months, days, etc. You can also partition over a symbol provided it is enumerated.

q)`p#5 3 3 4 4 4 1 1 1 2

OHLC (Open-High-Low-Close)

An OHLC chart is typically used to illustrate the Opening , High, Low and Closing price of a security/Instrument during a certain period of time. It displays more information about the price movement as compared to a Line chart.

This type of chart is used by Technical Analysts to spot the trends of security on a short term basis.

Ways of displaying OHLC

  • Bar Chart
  • Candlestick Chart

Both charts display exactly same information but Candlestick chart is more preferred among traders as it is easier to read.

Bar Chart

Bar chart

Bar chart

Each vertical line on the chart shows the price range (the highest and lowest prices) over a period of time. The open price is marked by a dash pointing to the left and the close price is marked by a dash pointing to the right.

 

 

 

Candlestick Chart

Candlestick Chart

Candlesticks chart

In a candlestick chart the wider bodies of the candles show the opening and closing prices during the period. And the wicks (the lines on top and bottom of the candles) show the lowest and highest prices during the period.

 

 

q Implementation

Since KDB is specially designed for time series data ,there are some built-in functions which makes it trivial to write a query to find OHLC in KDB

q)select open:first price, high:max price, low:min price , close:last price 
by sym,date from quote

q)select open:first price, high:max price, low:min price , close:last price 
by date from quote where sym=`goog

Resources:

Truncatable Primes

In number theory, a left-truncatable prime is a prime number if the leading (“left”) digit is successively removed, then all resulting numbers are prime.
For example 9137, since 9137, 137, 37 and 7 are all prime.

A right-truncatable prime is a prime which remains prime when the last (“right”) digit is successively removed.

For example 7393, since 7393, 739, 73, 7 are all prime.

There are exactly 4260 decimal left-truncatable primes and 83 right-truncatable primes.

There are 15 primes which are both left-truncatable and right-truncatable. They have been called two-sided primes.

Source: Wikipedia

Project Euler: Problem 37

The number 3797 has an interesting property. Being prime itself, it is possible to continuously remove digits from left to right, and remain prime at each stage: 3797, 797, 97, and 7. Similarly we can work from right to left: 3797, 379, 37, and 3.

Find the sum of the only eleven primes that are both truncatable from left to right and right to left.

NOTE: 2, 3, 5, and 7 are not considered to be truncatable primes.

q/KDB Implementation

To start with, first i have wrote a function which identifies whether a number is prime or not. To speed up the calculation i am maintaining a dictionary which contains the prime numbers. If a number is not present in the dictionary it will return boolean null which is boolean false.
If a given number is divisible by any prime number lesser than that number than that number is not a prime number.

q)primeDict:(2 3 5 7)!(1111b)
q)isPrime:{$[x~1;0b;x~2;1b;0~x mod 2;0b;primeDict[x];1b;x<last key primeDict;0b;
    any 0=(x mod ) each key primeDict;0b;[primeDict[x]:1b;1b]]}

Warm up the dictionary till 1000000

q)isPrime each til 1000000

To check if all numbers in a given list are prime or not.

 q)areAllPrimes:{all isPrime distinct x}

To check whether a given number is Right Prime Truncatable & Left Prime Truncatable.

q)rightPrimeTrunc:{$[isPrime x;areAllPrimes {"J"$neg[y] _string x}[x] 
     each til count string x;0b]}

q)leftPrimeTrunc:{$[isPrime x;areAllPrimes {"J"$y _string x}[x] each til count string x;0b]}

We’ll abort the operation if any number contains either 0 4 6 or 8. Since 2 is prime number it might happen that in left or right function, 2 is generated in intermediate operation.

q)haveEvens:{any count[x]>x?/:"0468"}

Now check whether a number contains even number otherwise check its left truncatable & right truncatable.

q)bothPrimeTrunc:{$[haveEvens s:string x;:0b;not leftPrimeTrunc x;0b; rightPrimeTrunc x]}

q)where bothPrimeTrunc each til 1000000
2j, 3j, 5j, 7j, 23j, 37j, 53j, 73j, 313j, 317j, 373j, 797j, 3137j, 3797j, 739397j

q)sum 4_2j, 3j, 5j, 7j, 23j, 37j, 53j, 73j, 313j, 317j, 373j, 797j, 3137j, 3797j, 739397j
 748317j