Subset Sum Problem


Definition and Examples

Subset sum is one of many NP-complete computational problems. See the classic book “Computers and Intractability” by Garey and Johnson. The problem can be defined as follow: Given a set S of integers and one integer t, Is there a subset S’⊆S such that the sum of all values in S’ is t?

So for example, let S1={2, 3, 7, 25, 67, 179, 356, 819}. Is there a subset of S1 such that the sum is t = 390? The answer to this one is yes, as S1‘ = {2, 7, 25, 356} is a subset which sum is equal to 390. How about for t=385, is there also an answer for this one? Unfortunately no, as there is no such subset which sum will be equal to this one. The closest ones probably either S1‘={3, 25, 356} or S1‘={7, 25, 356}.

Just by looking at the above instance, determining the answer is quite easy. We can start by finding an element that is less than but closest to the target number. Deduct the number by that element, and repeat the process until the target number is zero or no more element from the set can be tested.

Now lets try another example. Let S2={114, 171, 399, 1425, 619, 603, 1092, 283}. Is there a subset for this S2 such that the sum is t=3030? You still can get the answer, but definitely not as easy as the above example.

The difference between this example and the previous one is the relative values among elements. The first example has super increasing elements, in which an element value in the set is always bigger than the sum of all smaller elements in the set. So when determining whether an element should / should not be included in the subset is quite direct: If it is bigger than target number it should be out, if it is smaller than (but closest to) the target number it should be in.

The second example has no such characteristics. If an element is smaller than (eventhough closest to) the target number, it is not necessarily in the subset. So we have to do trial and error to find the correct subset.

NP-hardness

This phenomenon is quite typical for NP-complete problems. For some instances, the solutions are quite easy to find, but for other instances, the solutions seem harder and may take a long time to answer.

Formally how do we know that this problem is NP-hard? By showing that any algorithm can solve this problem can also (be used to) solve another NP-complete problem. This is a kind of proof by contradiction.

As we know that it is very difficult to construct an efficient algorithm for solving any NP-complete problem. The proof basically will conclude that finding an efficient algorithm for solving this problem will not be easy because if there is one, then by the process shown, there is at least an NP-complete problem that can be solved efficiently. Hence contradiction.

For Subset Sum problem, 3-SAT problem is usually used as the known NP-complete problem.

To reiterate, we have to remember that proving a problem is NP-complete doesn’t mean all instances of the problem are hard to solve. It is in worst case instances that the problem is very hard to solve.

Merkle-Hellman Crypto

Although in general we are hoping efficient solution to this kind of problems, some applications do take the opportunity of  this charateristics. The problem can be used as a coding system. For example, let the set S1 be from the first example above. There are 8 elements in the set, each element will represent a bit in one byte system, i.e. a bit 1 is represented by membership in the subset and bit 0 otherwise. So for example, byte value 195, which is  in binary 1100 0011 can represented by subset S1‘={2, 3, 356, 819} or t=1180, meaning that value 195 can be coded as 1180.

By itself this coding system is not very useful. Merkle-Hellman developed a public-private key cryptosystem based on this characteristics. Using the first example is easy to retrieve the original subset back, however using the second one it is very hard to retrieve the original subset S’ from value t. So the first instance can be used as a private key, while the second instance can be used as a public key. The only thing left is to find a correspondence between the first and the second instance.

Function f(a)=(ω.a) mod N, i.e. multiply a by certain value ω, then modulo by N will throw the value a into some other value within range of 0 to N-1. Let b=f(a) and ω’.ω = 1. The inverse function g(b)=(ω’.b) mod N. It will be that g(b) = g(f(a)) = (ω’.((ω.a) mod N)) mod N = (ω’.ω.a) mod N = a, as long as a is originally within 0..N-1

Now we just have to find a good N and ω for the formula, so that f(a) values are not super increasing and close to random values.

  1. To cover all elements of S within range, N should be larger than the total sum of original elements of S. N=1600 is a good choice for our above example.
  2. Select ω relatively prime to (i.e. not a factor of) N. If ω is a factor of N, then the range of 0..N-1 will effectively be reduced to 0..(N/ω)-1. For our example we select ω=57
  3. Then we should find ω’, such that (ω’.ω) mod N = 1, i.e. x.(ω’.ω) == y.N+1 for some integer (x,y).  Because have selected N=1600 and ω=57, then ω’=393, as (ω’.ω) = (57×393) = 22401 = 14×1600 +1
  4. So if S1={2, 3, 7, 25, 67, 179, 356, 819}, then newS={f(2), f(3), …, f(819)}={114, 171, 399, 1425, 619, 603, 1092, 283} which is set S2 in our second example above.
  5. Set (S1 from first example above, the super increasing set, N, ω, and ω’) will be our private key. The set (S2 from second example, which is the result of f(a) function, and N) is our public key. While no body else should know our first set, everybody else should know this second set.

Sending Secret Message

Let say we are trying to send secret message “Done”, which in hexadecimal are 44h, 6Fh, 6Eh, 65h. Binary of 44h is 0100 0100b, and the ones in that binary number correspond to elements 171 and 603 in the set for public key and the total t=171+603=774. The other 3 numbers are also encrypted in the same way. So we have (774, 1567, 1284, 570) to be sent to the owner of the private key. You may want to verify whether it is true that it is not very easy to recover the set elements (of the public key) that make up those numbers.

On the receiver site, the data will be decrypted as follows. First the inverse function g(b) is applied to each number. g(774)=(393×774) mod 1600 = 182. We do the same for the rest, so we have (182, 1431, 612, 10). As the original set is super increasing, it easy to know that 182 (the first number) = 3 + 179, and the corresponding bits are 0100 0010b=44h=’D’. We do the same thing for the rest, having back the original message “Done”.

Breaking Merkle-Hellman Crypto

Given the public key set and the encrypted message, one can try to find the elements make up each of the numbers. For set of size N, there are 2N possible sums to compare to. With the set size of only 8 elements surely any brute force method will easily break the code in rather short time. And as same sequence of sub-messages is encrypted the same, occurence of them may tip off the actual message and can be used to speed up the breaking process.

Although you increase the set size, if the secret message is too long (compared to set size), probabilistically the message becoming easier to break. The reason is that the longer the message, the higher possibility that any part of the message differ only few (read: 1) bits than the other parts.

Using previous example, which public set is S2={114, 171, 399, 1425, 619, 603, 1092, 283}, the encrypted message is (774, 1567, 1284, 570). To break it we have to find out the elements from the set that make up those numbers. However, 1567-283=1284. As all the numbers in the encrypted message are supposely correct sum of the elements, we immediately know that the 2nd and the 3rd numbers differ on the last bit! So we now know the second byte should be xxxx xxx1 and the third byte should be xxxx xxx0. And by elimination like this we can reduce the need to do brute force until the last few bits.

Exercises

  1. Develop a faster than simple brute force algorithm to find elements of set that make up t.
  2. Modify Merkle-Hellman Crypto so it can avoid simple attack as describe above.
  3. Find a new attack method to that modified crypto.

2 responses to “Subset Sum Problem”

Leave a Reply