Parallel matrix multiplication in Python

Part III: Matrix multiplication on multiple cores in Python, Java and C++

This is Part III of my matrix multiplication series. Part I was about simple matrix multiplication algorithms and Part II was about the Strassen algorithm. Part III is about parallel matrix multiplication.

We have got some pretty interesting results for matrix multiplication so far. Now I would like to find out how much performance increases (or decreases) when I make use of multiple cores. I only have two cores, so I hope somebody with a better computer will also run the ikj single-core algorithm from Part I and the parallel version from this article and post the results as a comment.

The implementations

As last time, I've added the scripts to a Git repository, so you can test them on your machine.

Before we start implementing code for multiple processors, we have to get an algorithm that is actually parallelisable. You could use Cannon's algorithm, an algorithm that makes use of systolic arrays, or try to find a solution on your own. The Scalable Universal Matrix Multiplication Algorithm (short: SUMMA) could also work. The paper that I've linked is well written and easy to understand; you should definitely read it if you're interested in matrix multiplication.

I will not use any advanced algorithm in this article. I will simply make the outermost for loop of the ikj-algorithm (see Part I) execute in parallel.
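For reference, here is a minimal sketch of the serial ikj loop nest, written from the description in Part I rather than copied from the repository; the parallel versions below split the work of the outermost i loop across workers.

def ikj_matrix_product(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # this is the loop that gets parallelised
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C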



Python

Because of the global interpreter lock (GIL), you have to start new processes rather than threads in Python (source).

The ikj single-core algorithm implemented in Python needs:

time python ikjMultiplication.py -i 2000.in > 2000-nonparallel.out

real    36m0.699s
user    35m53.463s
sys     0m2.356s

The simplest way to parallelize the ikj algorithm is to use the multiprocessing module and compute every line of the result matrix C with a new process. But for the 2000×2000 example, this would mean starting 2000 processes. The overhead is much worse than the benefit:

time python ikjMultiplication.py -i 2000.in > 2000-parallel.out

real    20m47.693s
user    40m34.460s
sys     0m2.092s
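For illustration, a rough sketch of that per-row idea: one process per row of C, which is exactly where the startup overhead comes from. The helper name compute_row and the tiny stand-in matrices are my own and not taken from the repository.

import multiprocessing

A = [[1, 2], [3, 4]]   # tiny stand-in matrices; the real script reads 2000.in
B = [[5, 6], [7, 8]]

def compute_row(i, out):
    # each process computes one row of C; a 2000x2000 input means 2000 processes
    n = len(A)
    out[i] = [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    out = manager.dict()
    procs = [multiprocessing.Process(target=compute_row, args=(i, out))
             for i in range(len(A))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    C = [out[i] for i in range(len(A))]
    print(C)   # [[19, 22], [43, 50]]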

When we share memory, the code looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import multiprocessing, numpy, ctypes


def read(filename):
    lines = open(filename, "r").read().splitlines()
    A = []
    B = []
    matrix = A
    for line in lines:
        if line != "":
            matrix.append(list(map(int, line.split("\t"))))
        else:
            matrix = B
    return A, B


def printMatrix(matrix, f):
    for line in matrix:
        f.write("\t".join(map(str, line)) + "\n")


def lineMult(start):
    global A, B, mp_arr, part
    n = len(A)
    # create a new numpy array using the same memory as mp_arr
    arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
    C = arr.reshape((n, n))
    for i in range(start, start + part):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]


def ikjMatrixProduct(A, B, threadNumber):
    n = len(A)
    pool = multiprocessing.Pool(threadNumber)
    pool.map(lineMult, range(0, n, part))
    # mp_arr and arr share the same memory
    arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
    C = arr.reshape((n, n))
    return C


def extant_file(x):
    """'Type' for argparse - checks that the file exists but does not open it."""
    if not isfile(x):
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x


if __name__ == "__main__":
    import argparse, sys
    from os.path import isfile
    from argparse import ArgumentParser

    parser = ArgumentParser(description="ikjMatrix multiplication")
    parser.add_argument(
        "-i", "--input",
        dest="filename",
        required=True,
        type=extant_file,
        help="input file with two matrices",
        metavar="FILE",
    )
    parser.add_argument(
        "-o", "--output",
        type=argparse.FileType(mode="w"),
        default=sys.stdout,
        dest="output",
        help="file to write output to (default=stdout)",
    )
    args = parser.parse_args()

    A, B = read(args.filename)
    n, m, p = len(A), len(A[0]), len(B[0])
    threadNumber = 2
    part = len(A) // threadNumber
    if part < 1:
        part = 1
    # shared array, can be used from multiple processes
    mp_arr = multiprocessing.Array(ctypes.c_int, n * p)
    C = ikjMatrixProduct(A, B, threadNumber)
    printMatrix(C, args.output)

and it needs MUCH more time:

time python ikjMultiplication-shared.py -i 2000.in > 2000-parallel-2threads.out

real    131m35.433s
user    250m36.820s
sys     0m9.533s

When we don’t use shared memory, things run faster:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import multiprocessing


def read(filename):
    lines = open(filename, "r").read().splitlines()
    A = []
    B = []
    matrix = A
    for line in lines:
        if line != "":
            matrix.append(list(map(int, line.split("\t"))))
        else:
            matrix = B
    return A, B


def printMatrix(matrix, f):
    for line in matrix:
        f.write("\t".join(map(str, line)) + "\n")


def lineMult(start):
    global A, B, part
    n = len(A)
    # compute the rows start .. start+part-1 of C; the worker processes do not
    # share memory with the parent, so the rows are returned instead of being
    # written into a shared C
    block = []
    for i in range(start, min(start + part, n)):
        row = [0] * n
        for k in range(n):
            for j in range(n):
                row[j] += A[i][k] * B[k][j]
        block.append(row)
    return block


def ikjMatrixProduct(A, B, threadNumber):
    n = len(A)
    pool = multiprocessing.Pool(threadNumber)
    # collect the row blocks returned by the workers into C
    C = []
    for block in pool.map(lineMult, range(0, n, part)):
        C.extend(block)
    return C


def extant_file(x):
    """'Type' for argparse - checks that the file exists but does not open it."""
    if not isfile(x):
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x


if __name__ == "__main__":
    import argparse, sys
    from os.path import isfile
    from argparse import ArgumentParser

    parser = ArgumentParser(description="ikjMatrix multiplication")
    parser.add_argument(
        "-i", "--input",
        dest="filename",
        required=True,
        type=extant_file,
        help="input file with two matrices",
        metavar="FILE",
    )
    parser.add_argument(
        "-o", "--output",
        type=argparse.FileType(mode="w"),
        default=sys.stdout,
        dest="output",
        help="file to write output to (default=stdout)",
    )
    args = parser.parse_args()

    A, B = read(args.filename)
    n, m, p = len(A), len(A[0]), len(B[0])
    threadNumber = 2
    part = len(A) // threadNumber
    if part < 1:
        part = 1
    C = ikjMatrixProduct(A, B, threadNumber)
    printMatrix(C, args.output)
time python ikjMultiplication.py -i 2000.in > 2000-parallel-4threads.out

real    22m46.066s
user    41m42.396s
sys     0m2.324s

Java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Shell {
    static List<ArrayList<ArrayList<Integer>>> read(String filename) {
        ArrayList<ArrayList<Integer>> A = new ArrayList<ArrayList<Integer>>();
        ArrayList<ArrayList<Integer>> B = new ArrayList<ArrayList<Integer>>();

        String thisLine;
        try {
            BufferedReader br = new BufferedReader(new FileReader(filename));

            // Begin reading A
            while ((thisLine = br.readLine()) != null) {
                if (thisLine.trim().equals("")) {
                    break;
                } else {
                    ArrayList<Integer> line = new ArrayList<Integer>();
                    String[] lineArray = thisLine.split("\t");
                    for (String number : lineArray) {
                        line.add(Integer.parseInt(number));
                    }
                    A.add(line);
                }
            }

            // Begin reading B
            while ((thisLine = br.readLine()) != null) {
                ArrayList<Integer> line = new ArrayList<Integer>();
                String[] lineArray = thisLine.split("\t");
                for (String number : lineArray) {
                    line.add(Integer.parseInt(number));
                }
                B.add(line);
            }
            br.close();
        } catch (IOException e) {
            System.err.println("Error: " + e);
        }

        List<ArrayList<ArrayList<Integer>>> res = new LinkedList<ArrayList<ArrayList<Integer>>>();
        res.add(A);
        res.add(B);
        return res;
    }

    static void printMatrix(int[][] matrix) {
        for (int[] line : matrix) {
            int i = 0;
            StringBuilder sb = new StringBuilder(matrix.length);
            for (int number : line) {
                if (i != 0) {
                    sb.append("\t");
                } else {
                    i++;
                }
                sb.append(number);
            }
            System.out.println(sb.toString());
        }
    }

    public static int[][] parallelMult(ArrayList<ArrayList<Integer>> A,
            ArrayList<ArrayList<Integer>> B, int threadNumber) {
        int[][] C = new int[A.size()][B.get(0).size()];
        ExecutorService executor = Executors.newFixedThreadPool(threadNumber);
        List<Future<int[][]>> list = new ArrayList<Future<int[][]>>();

        int part = A.size() / threadNumber;
        if (part < 1) {
            part = 1;
        }
        for (int i = 0; i < A.size(); i += part) {
            System.err.println(i);
            Callable<int[][]> worker = new LineMultiplier(A, B, i, i + part);
            Future<int[][]> submit = executor.submit(worker);
            list.add(submit);
        }

        // now retrieve the result
        int start = 0;
        int CF[][];
        for (Future<int[][]> future : list) {
            try {
                CF = future.get();
                for (int i = start; i < start + part; i += 1) {
                    C[i] = CF[i];
                }
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (ExecutionException e) {
                e.printStackTrace();
            }
            start += part;
        }
        executor.shutdown();

        return C;
    }

    public static void main(String[] args) {
        String filename;
        int cores = Runtime.getRuntime().availableProcessors();
        System.err.println("Number of cores:\t" + cores);

        int threads;
        if (args.length < 3) {
            filename = "3.in";
            threads = cores;
        } else {
            filename = args[1];
            threads = Integer.parseInt(args[2]);
        }

        List<ArrayList<ArrayList<Integer>>> matrices = read(filename);
        ArrayList<ArrayList<Integer>> A = matrices.get(0);
        ArrayList<ArrayList<Integer>> B = matrices.get(1);
        int[][] C = parallelMult(A, B, threads);
        printMatrix(C);
    }
}
import java.util.ArrayList;
import java.util.concurrent.Callable;

public class LineMultiplier implements Callable<int[][]> {
    ArrayList<ArrayList<Integer>> A;
    ArrayList<ArrayList<Integer>> B;
    int start;
    int end;
    public int[][] C;

    public LineMultiplier(ArrayList<ArrayList<Integer>> a,
            ArrayList<ArrayList<Integer>> b, int s, int e) {
        A = a;
        B = b;
        C = new int[a.size()][b.get(0).size()];
        start = s;
        end = e;
    }

    @Override
    public int[][] call() {
        for (int i = start; i < end; i++) {
            for (int k = 0; k < B.size(); k++) {
                for (int j = 0; j < B.get(0).size(); j++) {
                    C[i][j] += A.get(i).get(k) * B.get(k).get(j);
                }
            }
        }
        return C;
    }
}

Execute it with only one thread:

time java Shell -i 2000.in 1 > 2000-paralllel.out
Number of cores:    2
0

real    0m40.571s
user    0m42.259s
sys     0m0.388s

Execute it with two threads:

time java Shell -i 2000.in 2 > 2000-paralllel.out
Number of cores:    2
0
1000

real    0m30.188s
user    0m54.999s
sys     0m0.512s

Note that real time is lower than user time. The reason is simply that user time adds up the execution time on all processors, so on two cores the user time can be up to twice as high as the real time!

We went from 40.571 s down to 30.188 s!


C++

Making the ikj-algorithm parallel is trivial with C++. You only need to add #pragma omp parallel for before the outermost for loop and add -fopenmp as a compile flag! (If you really want to see the code, go to my Git repository.)

$ time ./ikj-algorithm.out 2 2000.in > 2000-parallel.out

real    0m12.563s
user    0m20.569s
sys     0m0.156s

So we got from 20.407 seconds down to 12.563 seconds by adding only one line!



[Python] 3 Ways of Multi-threaded Matrix Multiplication

We want to create a matrix multiplication (3×3) program in a multi-threaded way.

Input: matrices A and B, each of size 3×3.
Output: matrix C, the result of A * B.

Input Decomposition

There are several ways to multiply two matrices. One of them is block matrix multiplication, in which you divide each matrix into sub-matrices (under some constraints), multiply the sub-matrices, and finally sum the partial products up into matrix C.

Here the matrices are 3×3, so we can split A into its columns A1, A2, A3 (each of size 3×1) and B into its rows B1, B2, B3 (each of size 1×3). Each thread calculates one of the products A1*B1, A2*B2, A3*B3; each of these products is a 3×3 matrix.

The only thing remaining is to add the three 3×3 matrices together to get matrix C, as in the sketch below.
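A minimal sketch of this input decomposition, assuming the threading module; the helper names (outer_product, partials) are my own.

import threading

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
n = 3

# partials[t] holds the 3x3 product of column t of A (3x1) and row t of B (1x3)
partials = [[[0] * n for _ in range(n)] for _ in range(n)]

def outer_product(t):
    for i in range(n):
        for j in range(n):
            partials[t][i][j] = A[i][t] * B[t][j]

threads = [threading.Thread(target=outer_product, args=(t,)) for t in range(n)]
for th in threads:
    th.start()
for th in threads:
    th.join()

# finally sum the three partial matrices up into C
C = [[sum(partials[t][i][j] for t in range(n)) for j in range(n)] for i in range(n)]
print(C)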

Output Decomposition

We know that each row of matrix C depends only on the corresponding row of matrix A and on the columns of matrix B. So we make each thread calculate one row of matrix C; the input of each thread is a row of matrix A and the whole matrix B.
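A minimal sketch of this output decomposition, again with the threading module and a helper name (compute_row) of my own:

import threading

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
n = 3
C = [[0] * n for _ in range(n)]

def compute_row(i):
    # row i of C needs only row i of A and the whole of B
    for j in range(n):
        C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))

threads = [threading.Thread(target=compute_row, args=(i,)) for i in range(n)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(C)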


Intermediate Decomposition

It is similar to input decomposition; the only difference here is that every thread returns its own intermediate matrix of size 3×3. So we will have three temporary 3×3 matrices, and then C = thread_1_matrix_C + thread_2_matrix_C + thread_3_matrix_C.
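A minimal sketch of this intermediate decomposition, assuming concurrent.futures and a helper name (intermediate) of my own; each worker returns its own 3×3 matrix and the main thread sums them.

from concurrent.futures import ThreadPoolExecutor

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
n = 3

def intermediate(t):
    # contribution of column t of A and row t of B, as a full 3x3 matrix
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            M[i][j] = A[i][t] * B[t][j]
    return M

with ThreadPoolExecutor(max_workers=n) as pool:
    results = list(pool.map(intermediate, range(n)))

# C = thread_1_matrix_C + thread_2_matrix_C + thread_3_matrix_C
C = [[sum(M[i][j] for M in results) for j in range(n)] for i in range(n)]
print(C)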


The input is: A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
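For this input, all three decompositions should produce the same result: C = [[30, 24, 18], [84, 69, 54], [138, 114, 90]].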


References

– Algorithms and Parallel Computing by Fayez Gebali
– Introduction to Parallel Computing, Second Edition by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar

