Nikolaos I. Ploskas Supervisors: ... simplex algorithm is one of the top ten algorithms with the ... Support Systems II-Recent Developments Applied to...

7 downloads 76 Views 2MB Size

PhD Thesis

Hybrid Optimization Algorithms Implementation on GPU

Nikolaos I. Ploskas

Supervisors: Nikolaos Samaras, Associate Professor (Principal) Georgios Evangelidis, Professor Konstantinos Paparrizos, Professor

January 2014

2

Hybrid Optimization Algorithms Implementation on GPU

Nikolaos Ploskas Department of Applied Informatics, School of Information Sciences, University of Macedonia

Abstract Linear Programming (LP) is a signiﬁcant area in the ﬁeld of operations research. The simplex algorithm is one of the top ten algorithms with the greatest inﬂuence in the 20th century and the most widely used method for solving LP problems. The main aim of this thesis is to investigate the computational aspects of two simplex type algorithms: (i) the revised simplex algorithm, and (ii) a primal-dual exterior point simplex algorithm. Furthermore, Graphical Processing Units (GPUs) have gained a lot of popularity and have been applied to LP algorithms. Hence, the simplex type algorithms and the diﬀerent methods in these algorithms are implemented both as CPU- and GPU-based implementations. Preconditioning techniques are important in solving linear problems, as they improve their computational properties. Scaling is the most widely used preconditioning technique in linear optimization algorithms and is used to reduce the condition number of the constraint matrix, improve the numerical behavior of the algorithms and reduce the number of iterations required to solve linear problems. Ten existing scaling techniques are reviewed and computationally compared. Moreover, we also propose a GPU-based implementation of these techniques. Computational results show that on average the speedup gained from the GPU-based implementations of all scaling methods is about 7x. The choice of the pivot element at each iteration is one of the most critical step in simplex type algorithms. Good choices of the entering variable can lead to fast convergence to the optimal solution, while poor choices lead to more iterations and worst execution times or even no solutions of the LPs. Six existing pivoting rules for the revised simplex algorithm are implemented. Furthermore, we also propose a GPU-based implementation of these rules. Computational results show that only the Steepest Edge rule can be implemented eﬃciently on a GPU. The computation of the basis inverse is the most time-consuming step in simplex type algorithms. This inverse does not have to be computed from scratch at any iteration, but updating schemes can be applied to accelerate this calculation. We review and implement ﬁve basis updating schemes. Then, we propose an implementation of two updating schemes on a CPU-GPU system. Computational results show that the Modiﬁcation of the Product Form of the Inverse method is fastest than the other basis updating methods both on a CPU and on a GPU and its’ speedup is up to 19 for the time of the basis inverse and 5 for the total time. 3

Finally, we propose two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. Both GPU-based algorithms have been implemented in MATLAB using MATLABs Parallel Computing Toolbox. Computational results on randomly generated optimal sparse and dense linear programming problems are also presented. The results show that the proposed GPU-based implementations outperform MATLABs interior point method.

Supervisor: Nikolaos Samaras Title: Associate Professor

4

List of Publications 1. Ploskas, N., Samaras, N. (2013). Eﬃcient GPU-based Implementations of Simplex Type Algorithms. Applied Mathematics and Computation (submitted for publication). 2. Ploskas, N., Samaras, N. (2013). GPU Accelerated Pivoting Rules for the Simplex Algorithm. Journal of Systems and Software (submitted for publication). 3. Ploskas, N., Samaras, N. (2013). A Computational Comparison of Basis Updating Schemes for the Simplex Algorithm on a CPU-GPU System. American Journal of Operations Research, 3, 497–505. 4. Ploskas, N., Samaras, N. (2013). A Computational Comparison of Scaling Techniques for Linear Optimization Problems on a GPU. International Journal of Computer Mathematics (accepted with minor revision). 5. Ploskas, N., Samaras, N. (2013). Combining Pivoting Rules for the Revised Simplex Algorithm. Optimization Letters (submitted for publication). 6. Ploskas, N., Samaras, N., Papathanasiou, J. (2013). A Decision Support System for Solving Linear Programming Problems. International Journal of Decision Support Systems Technology (submitted for publication). 7. Ploskas, N., Samaras, N. (2013). Computational Comparison of Pivoting Rules for the Revised Simplex Algorithm. In Proceedings of the XI Balkan Conference on Operational Research, Belgrade, Serbia, 309–317. 8. Ploskas, N., Samaras, N. (2013). Basis Update on Simplex Type Algorithms. In Book of Abstracts of the EWG-DSS Thessaloniki 2013, Thessaloniki, Greece, 11. 9. Ploskas, N., Samaras, N. (2013). The Impact of Scaling on Simplex Type Algorithms. In Proceedings of the 6th Balkan Conference in Informatics, ACM Proceedings, Thessaloniki, Greece, 17–22. 5

10. Ploskas, N., Samaras, N., Margaritis, K. (2013). A parallel implementation of the revised simplex algorithm using OpenMP: some preliminary results. Optimization Theory, Decision Making, and Operational Research Applications, Springer Proceedings in Mathematics & Statistics, 31, 163–175. 11. Ploskas, N., Samaras, N., Papathanasiou, J. (2013). A Web-Based Decision Support System using Basis Update on Simplex Type Algorithms. Decision Support Systems II-Recent Developments Applied to DSS Network Environments, Springer Berlin Heidelberg, Springer Proceedings in Business Information Processing, 164, 102–114. 12. Ploskas, N., Samaras, N., Papathanasiou, J. (2013). A Decision Support System for Solving Linear Programming Problems. In Book of Abstracts of the EURO Mini-Conference GRAZ-2013 on Collaborative Decision Systems in Economics, Complex Societal & Environmental Applications, Graz, Austria, 4. 13. Glavelis, T., Ploskas, N., Samaras N. (2013). Combining Interior and Exterior Simplex Type Algorithms. In Proceedings of the 17th Panhellenic Conference on Informatics, ACM Proceedings, Thessaloniki, Greece, 174–179. 14. Ploskas, N., Samaras, N., Papathanasiou, J. (2012). LU decomposition in the revised simplex algorithm. In Proceedings of the 23th National Conference, Hellenic Operational Research Society, Athens, Greece, 77–81. 15. Ploskas, N., Samaras, N., Margaritis K. (2011). A parallel implementation of the revised simplex algorithm using OpenMP: some preliminary results. In Proceedings of the 1st International Symposium & 10th Balkan Conference on Operational Research, Thessaloniki, Greece, 385–392. 16. Glavelis, Th., Ploskas, N., Samaras, N. (2010). A computational evaluation of some free mathematical software for scientiﬁc computing. Journal of Computational Science, 1(3), 150–158. 6

17. Glavelis, Th., Ploskas, N., Samaras, N. (2010). A computational evaluation of some software for mathematical programming. In Proceedings of the Applied Computing, Iadis International Conference, Timisoara, Romania, 256–260. 18. Ploskas, N., Samaras, N., Sifaleras, A. (2009). A parallel implementation of an exterior point algorithm for linear programming problems. In Proceedings of the 9th Balcan Conference on Operational Research (BALCOR 2009), Constanta, Romania. 19. Glavelis, Th., Ploskas, N., Samaras, N. (2008). A peer review of the parallel and distributed toolboxes of matrix programming languages for mathematical programming. In Proceedings of the 20th National Conference, Hellenic Operational Research Society, Spetses, Greece, 767–775.

7

8

Acknowledgments I consider a great honor having Dr. Nikolaos Samaras as supervisor. He gave me the opportunity to discover the world of scientiﬁc research and showed me how to properly conduct research whilst allowing me the room to work in my own way. I would never ﬁnish this dissertation without his inspirational guidance, help and conﬁdence, even when things looked diﬃcult. For me, he is not only a teacher, but also a lifetime friend and advisor. I am also grateful to my advisors, Prof. Konstantinos Paparrizos, and Prof. George Evangelidis, for their valuable comments and suggestions. I would like also to thank Prof. Konstantinos Margaritis for his valuable comments. I would like to thank all the anonymous reviewers that participated in the review process of the proposed methods and algorithms. Their feedback were essential and helped signiﬁcantly to improve the quality of the proposed methods and algorithms. I would like to thank NVIDIA for their support through the Academic Partnership Program. I would like also to thank my colleagues Themistoklis Glavelis, George Geranis, Charalampos Triantafyllidis, and Dorothea Petraki for our collaboration and friendship. I also wish to express my thanks to my friends for making my life emotionally ample. Special thanks to my friend Lazaros Karaliotas for proofreading my dissertation. Last but not least, I owe immense gratitude to my family, my father, Ioannis, my mother, Theodora, and my sister, Katerina for never stopping supporting me. My special thanks go to Nagia; her endless understanding and support proved essential for the completion of this thesis. Without their help this study would have never been possible.

9

10

Contents 1 Introduction

21

1.1

GPU Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

1.2

Results and Contributions . . . . . . . . . . . . . . . . . . . . . . . .

26

1.3

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

2 Linear Programming Algorithms

31

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

2.2

Revised Simplex Algorithm

. . . . . . . . . . . . . . . . . . . . . . .

34

2.3

Primal - Dual Exterior Point Simplex Algorithm . . . . . . . . . . . .

38

2.4

Interior Point Method . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3 Scaling Techniques

47

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

3.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48

3.3

Scaling Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . .

49

3.3.1

Mathematical Preliminaries . . . . . . . . . . . . . . . . . . .

49

3.3.2

Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . . . . .

50

3.3.3

de Buchet . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

3.3.4

Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.3.5

Equilibration . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

3.3.6

Geometric Mean . . . . . . . . . . . . . . . . . . . . . . . . .

54

3.3.7

IBM MPSX . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

3.3.8

Lp-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

11

3.4

3.5

Implementation of Scaling Techniques on a GPU . . . . . . . . . . . .

56

3.4.1

Implementation of GPU-Based Scaling Techniques . . . . . . .

56

3.4.2

Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

3.4.3

GPU-based Arithmetic Mean . . . . . . . . . . . . . . . . . .

58

3.4.4

GPU-based de Buchet . . . . . . . . . . . . . . . . . . . . . .

58

3.4.5

GPU-based Entropy . . . . . . . . . . . . . . . . . . . . . . .

63

3.4.6

GPU-based Equilibration . . . . . . . . . . . . . . . . . . . . .

64

3.4.7

GPU-based Geometric Mean . . . . . . . . . . . . . . . . . . .

64

3.4.8

GPU-based IBM MPSX . . . . . . . . . . . . . . . . . . . . .

68

3.4.9

GPU-based Lp-norm . . . . . . . . . . . . . . . . . . . . . . .

68

Computational Comparison of the CPU- and GPU- Based Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.6

3.7

71

Computational Study: Investigating the Impact Of Scaling on Linear Programming Algorithms . . . . . . . . . . . . . . . . . . . . . . . . .

77

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

4 Pivoting Rules

81

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

4.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

82

4.3

Pivoting Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3.1

Bland’s Rule

. . . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3.2

Dantzig’s Rule

. . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3.3

Greatest Increment Method . . . . . . . . . . . . . . . . . . .

85

4.3.4

Least Recently Considered Method . . . . . . . . . . . . . . .

85

4.3.5

Partial Pricing Rule . . . . . . . . . . . . . . . . . . . . . . .

85

4.3.6

Steepest Edge Rule . . . . . . . . . . . . . . . . . . . . . . . .

86

GPU Accelerated Pivoting Rules . . . . . . . . . . . . . . . . . . . .

86

4.4.1

Implementation of GPU-Based Pivoting Rules . . . . . . . . .

86

4.4.2

Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

87

4.4.3

GPU-based Bland’s Rule . . . . . . . . . . . . . . . . . . . . .

88

4.4

12

4.4.4

GPU-based Dantzig’s Rule . . . . . . . . . . . . . . . . . . . .

88

4.4.5

GPU-based Greatest Increment Method . . . . . . . . . . . .

88

4.4.6

GPU-based Least Recently Considered Method . . . . . . . .

89

4.4.7

GPU-based Partial Pricing Rule . . . . . . . . . . . . . . . . .

89

4.4.8

GPU-based Steepest Edge Rule . . . . . . . . . . . . . . . . .

91

4.5

Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . .

91

4.6

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

5 Basis Update Methods

101

5.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.3

Basis Inversion Updating Schemes . . . . . . . . . . . . . . . . . . . . 103 5.3.1

Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . 103

5.3.2

Built-in Function Inv of MATLAB . . . . . . . . . . . . . . . 104

5.3.3

LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3.4

MPFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3.5

PFI

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.4

Computational Results of CPU-based Implementations . . . . . . . . 106

5.5

Implementation of PFI and MPFI Updating Schemes on a CPU-GPU system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.5.1

GPU-based PFI . . . . . . . . . . . . . . . . . . . . . . . . . . 109

5.5.2

GPU-based MPFI . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.6

Computational Results of GPU-based Implementations . . . . . . . . 110

5.7

Computational Results of the Impact of Basis Update on Simplex Type Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.8

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6 Eﬃcient GPU-based Implementations of Simplex Type Algorithms117 6.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

6.3

GPU-based Implementations . . . . . . . . . . . . . . . . . . . . . . . 126 13

6.3.1

Implementation of the GPU-Based Revised Simplex Algorithm 126

6.3.2

Implementation of the GPU-Based Primal-Dual Exterior Point Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 129

6.4

Computational Results . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6.5

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7 Web-based Educational Software for Solving Linear Programming Problems

141

7.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7.2

A Web-Based Educational Software using Basis Update on Simplex Type Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.3

7.2.1

Object-Oriented Analysis . . . . . . . . . . . . . . . . . . . . . 143

7.2.2

Educational Software Presentation . . . . . . . . . . . . . . . 145

A Web-Based Educational Tool for Solving Linear Programming Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.4

7.3.1

Analysis and Design . . . . . . . . . . . . . . . . . . . . . . . 148

7.3.2

Educational Software Presentation . . . . . . . . . . . . . . . 150

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

8 Conclusions

155

8.1

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

8.2

Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

14

List of Figures 1-1 High Level Description of GPU Architecture . . . . . . . . . . . . . .

25

3-1 Flow Chart of the GPU-based Arithmetic Mean . . . . . . . . . . . .

57

4-1 Flow Chart of the GPU-based Pivoting Rules . . . . . . . . . . . . .

87

4-2 Average Speedup over the Randomly Generated Optimal Dense LPs .

96

4-3 Average Speedup over the Netlib and M´esz´aros Set of LPs . . . . . .

97

4-4 Average Portion of Pivoting Time to Total Execution Time over the Randomly Generated Optimal Dense LPs . . . . . . . . . . . . . . . .

98

4-5 Average Portion of Pivoting Time to Total Execution Time over the Netlib and M´esz´aros Set of LPs . . . . . . . . . . . . . . . . . . . . .

98

4-6 Speedup of the Total and the Pivoting Time of the Steepest Edge Rule 100 5-1 Basis Inverse Time Comparison . . . . . . . . . . . . . . . . . . . . . 108 5-2 Total Time Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 108 5-3 Speedup Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5-4 Average Total and Inverse Time of the Exterior and Revised Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set 115 6-1 Flow Chart of the GPU-based Revised Simplex Algorithm . . . . . . 127 6-2 Flow Chart of the GPU-based Primal-Dual Exterior Point Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6-3 Speedup over the randomly generated optimal dense LPs . . . . . . . 137 6-4 Speedup over the randomly generated optimal sparse LPs with density 10% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 15

6-5 Speedup over the randomly generated optimal sparse LPs with density 20% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7-1 Decision Making Process . . . . . . . . . . . . . . . . . . . . . . . . . 144 7-2 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7-3 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7-4 Initial Screen of the web-based DSS . . . . . . . . . . . . . . . . . . . 146 7-5 Report of the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7-6 Decision Making Process . . . . . . . . . . . . . . . . . . . . . . . . . 148 7-7 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 7-8 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 7-9 Initial Screen of the proposed DSS

. . . . . . . . . . . . . . . . . . . 152

7-10 Report Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

16

List of Tables 2.1

Revised Simplex Algorithm

. . . . . . . . . . . . . . . . . . . . . . .

34

2.2

Primal-Dual Exterior Point Simplex Algorithm . . . . . . . . . . . . .

39

2.3

Mehrotra’s predictor-corrector method . . . . . . . . . . . . . . . . .

45

3.1

GPU-based Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . .

59

3.2

GPU-based de Buchet for the case p = 1 . . . . . . . . . . . . . . . .

61

3.3

GPU-based de Buchet for the case p = 2 . . . . . . . . . . . . . . . .

62

3.4

GPU-based de Buchet for the case p = ∞ . . . . . . . . . . . . . . .

63

3.5

GPU-based Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . .

65

3.6

GPU-based Equilibration . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.7

GPU-based Geometric Mean . . . . . . . . . . . . . . . . . . . . . . .

67

3.8

GPU-based IBM MPSX . . . . . . . . . . . . . . . . . . . . . . . . .

68

3.9

GPU-based Lp-norm for the case p = 1 . . . . . . . . . . . . . . . . .

69

3.10 GPU-based Lp-norm for the case p = 2 . . . . . . . . . . . . . . . . .

70

3.11 Statistics of the Netlib LPs (optimal, Kennington and infeasible LPs)

73

3.12 Scaling time of the CPU-based implementations (secs) . . . . . . . .

74

3.13 Scaling time of the GPU-based implementations (secs) . . . . . . . .

75

3.14 Speedup Obtained by the GPU-based Implementations . . . . . . . .

76

3.15 Results of MATLAB’s IPM Algorithm over Randomly Optimal Generated Problems with Density 10% . . . . . . . . . . . . . . . . . . .

79

3.16 Results of EPSA Algorithm over Randomly Optimal Generated Problems with Density 10% . . . . . . . . . . . . . . . . . . . . . . . . . . 17

79

3.17 Results of Revised Simplex Algorithm over Randomly Optimal Generated Problems with Density 10% . . . . . . . . . . . . . . . . . . . .

79

3.18 Results of MATLAB’s IPM Algorithm over Randomly Optimal Generated Problems with Density 20% . . . . . . . . . . . . . . . . . . .

79

3.19 Results of EPSA Algorithm over Randomly Optimal Generated Problems with Density 20% . . . . . . . . . . . . . . . . . . . . . . . . . .

79

3.20 Results of Revised Simplex Algorithm over Randomly Optimal Generated Problems with Density 20% . . . . . . . . . . . . . . . . . . . .

80

4.1

GPU-based Bland’s Rule . . . . . . . . . . . . . . . . . . . . . . . . .

88

4.2

GPU-based Dantzig’s Rule . . . . . . . . . . . . . . . . . . . . . . . .

89

4.3

GPU-based Greatest Increment Method

. . . . . . . . . . . . . . . .

90

4.4

GPU-based Least Recently Considered Method . . . . . . . . . . . .

90

4.5

GPU-based Partial Pricing Rule . . . . . . . . . . . . . . . . . . . . .

91

4.6

GPU-based Steepest Edge Rule . . . . . . . . . . . . . . . . . . . . .

91

4.7

Statistics of the Netlib (optimal, Kennington and infeasible LPs) and M´esz´aros LPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

93

4.8

Number of Iterations over the Randomly Generated Optimal Dense LPs 94

4.9

Number of Iterations Over the Netlib and M´esz´aros Set of LPs . . . .

94

4.10 Total Execution Time of the CPU-based Implementations over the Randomly Generated Optimal Dense LPs . . . . . . . . . . . . . . . .

95

4.11 Total Execution Time of the CPU-based implementations over the Netlib and M´esz´aros Set of LPs . . . . . . . . . . . . . . . . . . . . .

95

4.12 Total Execution Time of the GPU-based Implementations over the Randomly Generated Optimal Dense LPs . . . . . . . . . . . . . . . .

96

4.13 Total Execution Time of the GPU-based Implementations over the Netlib and M´esz´aros Set of LPs . . . . . . . . . . . . . . . . . . . . .

96

4.14 Pivoting Execution Time on the CPU- and GPU-based Implementations and Speedup of the Steepest Edge Rule over the Randomly Generated Optimal Dense LPs . . . . . . . . . . . . . . . . . . . . . . . . 18

99

4.15 Pivoting Execution Time on the CPU- and GPU-based Implementations and Speedup of the Steepest Edge Rule over the Netlib and M´esz´aros Set of LPs . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

5.1

Basis Inverse Time (secs) . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.2

Total Time (secs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3

GPU-based PFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

5.4

GPU-based MPFI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.5

Basis Inverse and Total Time of the GPU-based Implementations (secs) 111

5.6

Speedup Obtained by the GPU-based Implementations . . . . . . . . 112

5.7

Results of the Exterior Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set . . . . . . . . . . . . . . . . . . . 114

5.8

Results of the Revised Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set . . . . . . . . . . . . . . . . . . . 114

6.1

LP Algorithms’ Parallelization . . . . . . . . . . . . . . . . . . . . . . 124

6.2

GPU-based Revised Simplex Algorithm . . . . . . . . . . . . . . . . . 130

6.3

GPU-based Primal-Dual Exterior Point Simplex Algorithm . . . . . . 132

6.4

Total execution time over the randomly generated optimal dense LPs

6.5

Total execution time over the randomly generated optimal sparse LPs

134

with density 10% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.6

Total execution time over the randomly generated optimal sparse LPs with density 20% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.7

Number of iterations over the randomly generated optimal dense LPs

6.8

Number of iterations over the randomly generated optimal sparse LPs

136

with density 10% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.9

Number of iterations over the randomly generated optimal sparse LPs with density 20% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

19

20

Chapter 1 Introduction The simplex algorithm is one of the top ten algorithms with the greatest inﬂuence in the 20th century [23]. Since the introduction of the simplex algorithm in 1947 [20], Linear Programming (LP) has been widely used in many practical problems. After the publication of the simplex algorithm, many researchers in the 1950s and 1960s published their papers showing the practical eﬃciency of the simplex algorithm. However, the size of linear programming problems (LPs) grew up signiﬁcantly. Consequently, the simplex algorithm began to encounter computational issues in the solution of large LPs. Furthermore, Klee & Minty [54] proved that the worst-case complexity of simplex algorithm is exponential by formulating an example. Later, Borgwardt [11] showed that the average number of pivot steps required by the simplex algorithm is polynomial in real LPs. The proof of Klee & Minty [54] lead the researchers to ﬁnd a polynomial algorithm for LPs. In 1979 Khachian [53] modiﬁed an algorithm proposed by Dikin [22] and Shor [103] and proved that the modiﬁed algorithm is polynomial. It was one of the most important results not only for mathematical programming but also for theoretical computer science. Although Khachian’s algorithm was polynomial, the computational results showed that it was very ineﬃcient compared to the simplex algorithm. Researchers focused to design polynomial algorithms that should also be eﬃcient. Such an algorithm proposed by Karmarkar in 1984 [52]. Karmarkar proposed an interior point method that proved to be more eﬃcient than the simplex algorithm. Since 21

then, many interior point methods have been proposed [39] [40] [67] [73]. Computational results showed that interior point methods outperform the simplex algorithm for large scale LPs [8]. However, the implementation of pivoting algorithms that will be more eﬃcient than simplex algorithm is still an active ﬁeld of research. An approach to enhance the performance of simplex algorithm is to move in the exterior of the feasible solution and construct basis infeasible solutions instead of constructing feasible solutions like simplex algorithm does. Such an algorithm is called exterior point simplex algorithm and was proposed by Paparrizos initially for the assignment problem [81] and then for the solution of LPs [82]. A more eﬀective approach is the Primal-Dual Exterior Point Simplex Algorithm (PDEPSA) proposed by Samaras [101]. PDEPSA [83] can deal with the problems of stalling and cycling more eﬀectively and as a result improves the performance of the primal dual exterior point algorithms. The advantage of PDEPSA stems from the fact that it uses an interior point in order to compute the leaving variable in contrast to primal dual exterior point algorithms which use a boundary point. As in the solution of any large scale mathematical system, the computational time for large-scale LPs is a major concern. Parallel programming is a good practice for solving computationally intensive problems in operations research. The application of parallel processing for LP has been introduced in the early 1970s. However, only since the beginning of the 1980s attempts have been made to develop parallel implementations. In the late 1980s and early 1990s parallel implementations of the simplex algorithm [17] [24] [63] [109] [112] and the revised simplex algorithm [24] [43] [44] [49] [9] [105] have been proposed. The parallel implementations of the simplex algorithm achieved large speedups in dense problems. For instance, Thomadakis & Liu [112] proposed a parallel implementation that was up to 1000 times better than the corresponding serial implementation in large-scale random LPs. On the other hand, the parallel implementations of the revised simplex algorithm cannot achieve such speedups due to the sparsity that characterizes the most real LPs. The most indicative parallel implementation of the revised simplex algorithm was that proposed 22

by Shu [105], which was up tp 17 times better than the corresponding serial implementation in small Netlib LPs [35]. Furthermore, with the advances made in hardware, Graphical Processing Units (GPUs) have gained a lot of popularity in the past decade and have been widely applied to scientiﬁc computing applications. GPUs have been successfully applied for the solution of LPs using the simplex algorithm. Jung and O’Leary [51] proposed a CPU-GPU implementation of the Interior Point Method for dense LPs and their computational results showed a speedup up to 1.4 on medium-sized Netlib LPs [35] compared to the corresponding CPU implementation. Spampinato and Elster [107] presented a GPU-based implementation of the revised simplex algorithm on a GPU with NVIDIA CUBLAS [78] and NVIDIA LAPACK libraries [58]. Their implementation showed a maximum speedup of 2.5 on large random LPs compared to the corresponding CPU implementation. Bieling et al. [7] also proposed an implementation of the revised simplex algorithm on a GPU. They compared their GPU-based implementation with the serial GLPK solver and reported a maximum speedup of 18 in single precision. Lalami et al. [56] proposed an implementation of the tableau simplex algorithm on a CPU-GPU system. Their computational results on randomly generated dense problems showed a maximum speedup of 12.5 compared to the corresponding CPU implementation. Lalami et al. [57] extended their previous work [56] on a multiGPU implementation and their computational results on randomly generated dense problems showed a maximum speedup of 24.5. Li et al. [64] presented a GPU-based parallel algorithm, based on Gaussian elimination, for large scale LPs that outperforms the CPU-based algorithm. Meyer et al. [75] proposed a mono- and a multi-GPU implementation of the tableau simplex algorithm and compared their implementation with the serial CLP solver. Their implementation outperformed CLP solver on large sparse LPs. Gade-Nielsen and Jorgensen [33] presented a GPU-based interior point method and their computational results showed a speedup of 6 on randomly generated dense LPs and a speedup of 2 on randomly generated sparse LPs compared to the MATLAB’s built-in function linprog. Smith et al. [106] proposed a GPU-based interior point method and their computational results showed a maximum speedup of 1.27 23

on large sparse LPs compared to the corresponding multi-core CPU implementation.

1.1

GPU Computing

This section brieﬂy describes the architecture of an NVIDIA GPU in terms of hardware and software. GPU is a multi-core processor having thousands of threads running concurrently. GPU has many cores aligned in a particular way forming a single hardware unit. Data parallel algorithms are well suited for such devices, since the hardware can be classiﬁed as SIMT (Single-Instruction Multiple Threads). GPUs outperform CPUs in terms of GFLOPS (Giga Floating Point Operations per Second). For example, concerning the equipment utilized in the computational study presented in the following Sections, a high-end Core i7 processor with 3.46 GHz delivers up to a peak of 55.36 GFLOPs, while a high-end NVIDIA Quadro 6000 delivers up to a peak of 1030.4 GFLOPs. GPUs oﬀer good computational performance and eﬃcient memory bandwidth. There are currently two major programming models available for programming on GPUs, CUDA (Compute Uniﬁed Device Architecture) and OpenCL (Open Computing Language). NVIDIA introduced CUDA in late 2006 and is only available with NVIDIA GPUs, while Apple introduced OpenCL in 2008 and is available on GPUs of diﬀerent vendors and even on CPUs. We have implemented the GPU codes in this thesis using CUDA, because it is currently more mature. NVIDIA CUDA is an architecture that manages data-parallel computations on a GPU. CUDA enables users to execute codes on their GPUs and it is based on a Single Instruction Multiple Threads (SIMT) programming model. A CUDA program includes two portions, one that is executed on the CPU and another that is executed on the GPU. The CPU should be viewed as the host device, while the GPU should be viewed as co-processor. The code that can be parallelized is executed on the GPU as kernels, while the rest is executed on the CPU. CPU starts the execution of each portion of code and invokes a kernel function, so, the execution is moved to the GPU. The connection between CPU memory and GPU memory is through a fast PCIe 16x 24

Figure 1-1: High Level Description of GPU Architecture

point to point link. Each code that is executed on the GPU is divided into many threads. Each thread executes the same code independently on diﬀerent data. A thread block is a group of threads that cooperate via shared memory and synchronize their execution to coordinate their memory accesses. A grid consists of a group of thread blocks and a kernel is executed on a group of grids. A kernel is the resulting code after the compilation. NVIDIA Quadro 6000, which was used in our computational experiment, consists of 14 stream processors (SP) with 32 cores each, resulting to 448 total cores. A typical use of a stream is that the GPU schedules a memory copy of results from CPU to GPU, a kernel launch and a copy of results from the GPU to CPU. A high level description of the GPU architecture is shown in Figure 1-1.

25

1.2

Results and Contributions

The main contributions of this thesis are as follows: • We implement and computationally compare ten well-known scaling techniques: (i) arithmetic mean, (ii) de Buchet for the case p = 1, (iii) de Buchet for the case p = 2, (iv) entropy, (v) equilibration, (vi) geometric mean, (vii) IBM MPSX, (viii) Lp -norm for the case p = 1, (ix) Lp -norm scaling technique for the case p = 2, and (x) Lp -norm scaling technique for the case p = ∞ and de Buchet for the case p = ∞. Moreover, we study the impact of scaling prior to the application of the linear programming algorithms. We apply these techniques in order to investigate the signiﬁcance of scaling on: (i) simplextype or pivoting algorithms, (ii) interior-point methods, and (iii) exterior point simplex type algorithms. Additionally, we propose GPU-based implementations of these techniques. A thorough computational study is presented to establish the practical value of the GPU-based implementations. • We implement six well-known pivoting rules used on simplex-type algorithms: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, and (vi) Steepest Edge rule. Furthermore, we propose GPU-based implementations of the aforementioned pivoting rules. Computational results are presented to show that that the proposed GPU-based implementations of the pivoting rules outperform the corresponding CPU-based implementations. • We implement ﬁve widely used basis update schemes: (i) Gaussian elimination, (ii) built-in function of MATLAB called inv, (iii) LU decomposition, (iv) Product Form of the Inverse, and (v) a Modiﬁcation of the Product Form of the Inverse. Moreover, we incorporate three of these methods on the exterior and the revised simplex algorithm in order to highlight the signiﬁcance of the choice of the basis update method on simplex type algorithms and the reduction that can oﬀer to the solution time. Additionally, we propose GPU-based imple26

mentations of these basis update schemes. A thorough computational study is presented to establish the practical value of the GPU-based implementations. • We propose two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. Computational results show that the proposed GPU-based implementations outperform MATLAB’s interior point method. The GPU-based primal-dual exterior point simplex algorithm is the ﬁrst GPU-based exterior point algorithm developed for the solution of a linear programming problem. This algorithm shows superior computational results compared to the serial MATLAB’s interior point method and our GPU-based implementation of the revised simplex algorithm. Hence, the development of the algorithm introduces an innovation to the ﬁeld of linear programming algorithms on GPUs since it is the only algorithm of its kind. • We design and implement two web-based educational tools for learning computational issues of simplex type algorithms.

1.3

Overview

In Chapter 2, mathematical preliminaries and the fundamentals of linear programming are presented. In addition, the revised simplex algorithm proposed by Dantzig [18], a primal-dual exterior point simplex algorithm proposed by Samaras [101] and the interior point method proposed by Mehrotra [73] are introduced. These algorithms will be used in the computational studies in other chapters. The aim of Chapter 3 is twofold. First, we review and implement ten scaling techniques with a focus on the implementation of them on a GPU: (i) arithmetic mean, (ii) de Buchet for the case p = 1, (iii) de Buchet for the case p = 2, (iv) entropy, (v) equilibration, (vi) geometric mean, (vii) IBM MPSX, (viii) Lp -norm for the case p = 1, (ix) Lp -norm scaling technique for the case p = 2, and (x) Lp -norm scaling technique for the case p = ∞ and de Buchet for the case p = ∞. A computational study on the Netlib set (optimal, Kennington and infeasible LPs) is presented to es27

tablish the practical value of GPU-based implementations. Finally, a computational study of the impact of scaling prior to the application of the simplex type algorithms is presented. In this computational study, we calculate both the CPU time and the number of iterations with and without scaling for a set of sparse randomly generated optimal LPs. The scaling techniques that we applied to the above mentioned algorithms are: (i) arithmetic mean, (ii) equilibration, and (iii) geometric mean scaling techniques. The problems with and without scaling are solved with three simplex type algorithms in order to investigate the signiﬁcance of scaling: (i) simplex-type or pivoting algorithms, (ii) interior-point methods (IPM) and (iii) exterior point simplex type algorithms (EPSA). The work of this chapter is based on our papers [89] [90]. Chapter 4 presents the implementation of six well-known pivoting rules for the revised simplex algorithm: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, and (vi) Steepest Edge rule. We also propose GPU-based implementations of the aforementioned pivoting rules. Computational results on randomly generated optimal dense linear programs and on a set of benchmark problems (netlib–optimal, kennington, netlib–infeasible, M´esz´aros) show that the proposed GPU-based implementations of the pivoting rules outperform the corresponding CPU-based implementations. The work of this chapter is based on our papers [87] [93] [91]. In Chapter 5, we implement ﬁve widely used basis update schemes: (i) Gaussian elimination, (ii) built-in function of MATLAB called inv, (iii) LU decomposition, (iv) Product Form of the Inverse, and (v) a Modiﬁcation of the Product Form of the Inverse. We perform a computational comparison in which the basis inverse is computed with ﬁve diﬀerent updating schemes. Then, we propose an implementation of two updating schemes on a CPU-GPU System using MATLAB and CUDA environment. A computational study on randomly generated full dense linear programs is presented to establish the practical value of the GPU-based implementations. Furthermore, we incorporate three of these methods on the exterior and the revised simplex algorithm in order to highlight the signiﬁcance of the choice of the basis update method on simplex type algorithms and the reduction that can oﬀer to the solution time. Finally, we 28

perform a computational comparison in which the basis inverse is computed with the aforementioned updating methods. The work of this chapter is based on our papers [84] [85] [86]. In Chapter 6, we propose two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. Computational results show that the proposed GPU-based implementations outperform MATLAB’s interior point method. The GPU-based primal-dual exterior point simplex algorithm is the ﬁrst GPU-based exterior point algorithm developed for the solution of a linear programming problem. This algorithm shows superior computational results compared to the serial MATLAB’s interior point method and our GPU-based implementation of the revised simplex algorithm. The work of this chapter is based on our paper [94]. Chapter 7 presents the design and implementation of two web-based educational tools for learning computational issues of simplex type algorithms. The ﬁrst software assists students in the selection of the linear programming algorithm and basis update method for solving their linear programming problems. The second software builds on the ﬁrst one and goes further to explore all diﬀerent steps of the linear programming algorithms. Two linear programming algorithms are incorporated in the software: (i) Revised Simplex Algorithm [18] and (ii) Exterior Primal Simplex Algorithm [83]. The DSS also includes a variety of diﬀerent methods for the diﬀerent steps of these algorithms. More speciﬁcally, ten scaling techniques, ﬁve basis update methods and eight pivoting rules have been implemented in the software. The student can either select the algorithm and the appropriate methods to solve a linear programming problem or perform a thorough computational study with all combinations of algorithms and methods in order to gain an insight on its linear programming problem. The work of this chapter is based on our papers [92] [95] [96]. Finally, in Chapter 8 we present the conclusions of this thesis and give ideas for future work.

29

30

Chapter 2 Linear Programming Algorithms 2.1

Introduction

Linear Programming (LP) is perhaps the most important and well–studied optimization problem. Lots of real world problems can be formulated as Linear Programming problems (LPs). LP is the process of minimizing or maximizing a linear objective ∑n function z = j=1 cj xj to a number of linear equality and inequality constraints. Simplex algorithm is the most widely used method for solving LPs. Consider the following linear program (LP.1) in the standard form:

min

cT x

s.t.

Ax = b

(LP.1)

x≥0 where A ∈ Rm×n , (c, x) ∈ Rn , b ∈ Rm , and T denotes transposition. We assume that A has full rank, rank(A) = m, m < n. Consequently, the linear system Ax = b is consistent. The simplex algorithm searches for an optimal solution by moving from one feasible solution to another, along the edges of the feasible region. The dual problem associated with the (LP.1) is presented in (DP.1): 31

bT w

min s.t.

AT w + s = c

(DP.1)

s≥0 where w ∈ Rm and s ∈ Rn . The most well-known method for the optimization problem is the simplex algorithm developed by George B. Dantzig [20]. The simplex algorithm begins with a primal feasible basis and uses pricing operations until an optimum solution is computed. It also guarantees monotonicity of the objective value. It has been proved that the expected number of iterations in the solution of a linear problem is polynomial [11]. Moreover, the worst case complexity has exponential behavior [54]. Using a basic partition (B, N ), the linear problem (LP.1) can be written as shown in (LP.2).

min

cTB xB + cTN xN

s.t.

AB xB + AN xN = b

(LP.2)

xB , xN ≥ 0 In (LP.2), AB is an m × m non-singular sub-matrix of A, called basic matrix or basis. The columns of A which belong to subset B are called basic and those which belong to N are called non basic. The solution xB = (AB )−1 b, xN = 0 is called a basic solution. A solution x = (xB , xN ) is feasible iﬀ x > 0. Otherwise, (LP.2) is infeasible. The solution of (DP.1) is computed by the relation s = c − AT w, where w = (cB )T (AB )−1 are the simplex multipliers and s are the dual slack variables. The basis AB is dual feasible iﬀ s ≥ 0. In each iteration, simplex algorithm interchanges a column of matrix AB with a column of matrix AN and constructs a new basis AB . Since Dantzig’s initial contribution, researchers have made many eﬀorts in order to en32

hance the performance of simplex algorithm. In the 1990s a totally diﬀerent approach arose; namely Exterior Point Simplex Algorithm (EPSA). The ﬁrst implementation of an EPSA was introduced for the assignment problem [81]. The main idea of EPSA is that it moves in the exterior of the feasible region and constructs basic infeasible solutions instead of feasible solutions calculated by the simplex algorithm. Although EPSA outperforms the original simplex algorithm, it also has some computational disadvantages. The main disadvantage is that in many LPs, EPSA can follow a path, which steps away from the optimal solution. This drawback can be avoided if the exterior path is replaced with a dual feasible simplex path. The most eﬀective types of EPSA algorithms are the primal-dual versions. It has been observed that replacing the exterior path of an EPSA with a dual feasible simplex path results in an algorithm free from the computational disadvantages of EPSA [83]. A more eﬀective approach is the Primal-Dual Exterior Point Simplex Algorithm (PDEPSA) [101]. PDEPSA can deal with the problems of stalling and cycling more eﬀectively and as a result improves the performance of the primal dual exterior point algorithms. The advantage of PDEPSA stems from the fact that it uses an interior point in order to compute the leaving variable in contrast to primal dual exterior point algorithms which use a boundary point. At the same time when Dantzig presented the simplex algorithm, other researchers proposed interior point algorithms that traverse across the interior of the feasible region [30] [50] [118]. However, these eﬀorts did not compete the simplex algorithm in practice due to the expensive execution time per iteration and the possibility of numerical instability. The ﬁrst polynomial algorithm for LP was proposed by Khachiyan [53]. The development of the ellipsoid method had a great impact on the theory of linear programming, but the proposed algorithm did not achieve to compete the simplex algorithm in practice due to the expensive execution time per iteration. The ﬁrst interior point method that outperformed simplex algorithm was proposed by Karmarkar [52]. Since Karmarkar’s algorithm, many improvements have been made both in theory and in practice of interior point methods. The best known interior point method for linear programming ﬁnds an ε-accurate solution of a linear problem 33

√ in O ( nln (1/ε)) iterations [100]. In the following sections of this chapter, we present the linear programming algorithms that will be used in the next chapters. The structure of the chapter is as follows. In Section 2.2, we present the revised simplex algorithm proposed by Dantzig [18]. In Section 2.3, a primal-dual exterior point simplex algorithm proposed by Samaras [101] is presented. Section 2.4 presents the interior point method proposed by Mehrotra [73].

2.2

Revised Simplex Algorithm

A formal description of the revised simplex algorithm is given in Table 2.1.

Table 2.1: Revised Simplex Algorithm Step 0. (Initialization). Start with a feasible partition (B, N ). Compute (AB )−1 and vectors xB , w and sN . Step 1. (Test of optimality). if sN ≥ 0 then STOP. (LP.2) is optimal. else Choose the index l of the entering variable using a pivoting rule. Variable xl enters the basis. Step 2. (Minimum ratio test). Compute the pivot column hl = (AB )−1 Al . if hl ≤ 0 then STOP. (LP.2) is unbounded. else Choose the leaving variable xB[r] = xk using the { following relation: } xB[r] =

xB[r] hil

= min

xB[i] hil

: hil < 0

Step 3. (Pivoting). Swap indices k and l. Update the new basis inverse (AB )−1 , using a basis update scheme. Go to Step 1.

We illustrate the algorithm by applying it to the following linear problem: 34

min

2x1 + 4x2 + 8x3 + 5x4 − 15x5

s.t. − x1 + x2 + 4x3 + 2x4 + 5x5 ≤ 6 12x1 + x2 + 2x3 − 3x4 + 9x5 ≥ −5 −5x1 − 6x2 − 2x3 − 2x4 − 4x5 ≤ 10

(EX.1)

x1 + 3x2 − 4x3 + 10x4 + 8x5 ≤ 14 9x1 + 5x2 + 3x3 + 2x4 − x5 ≤ 12 x1 , x2 , x3 , x4 , x5 ≥ 0

First, we introduce the slack variables x6 , x7 , x8 , x9 and x10 . The example (EX.1) can be rewritten in matrix notation as follows:

[ cT = 2 4 −1 1 12 1 A = −5 −6 1 3 9 5

8 5 −15 0 0 0 0 0 0

]

4

2

5

1

2

−3

9

0 −1 0 0

−2 −2 −4 0

0

1 0

−4 10

0

0

0 1

2 −1 0 6 −5 b = 10 14 12

0

0 0

3

8

0 0 0

[

0 0 0 1

]

In step 0, we determine the partition (B, N ), where B = 6 7 8 9 10 and N = [ ] 1 2 3 4 5 . Hence: 35

1 0 0 0 0 0 −1 0 0 0 AB = 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 −1 1 4 2 5 12 1 2 −3 9 AN = −5 −6 −2 −2 −4 1 3 −4 10 8 9 5 3 2 −1 6 5 xB = 10 14 12 [ ] w= 0 0 0 0 0 [ ] sN = 2 4 8 5 −15 It is easily seen that B is feasible because xB ≥ 0. So, we don’t have to apply Phase I. Then, we proceed with the ﬁrst iteration of the algorithm. We check if sN ≥ 0; sN 0, so we proceed with the selection of the entering variable. According to Dantzig’s rule, the entering variable is x5 . Next, we compute the pivot column:

5 −9 hl = −4 8 −1 Next, we ﬁnd that the leaving variable is x6 . Finally, we update the partition (B, N ) and the basis inverse. 36

[ ] B = 5 7 8 9 10 [ ] N= 1 2 3 4 6 1 0 0 0 5 9 5 −1 0 0 −1 (AB ) = 45 0 1 0 8 − 5 0 0 1 1 0 0 0 5

0 0 0 0 1

Now, we proceed with the second iteration of the algorithm. We check if sN ≥ 0: [ ] sN = −1 7 20 11 3 sN 0, so we proceed with the selection of the entering variable. According to Dantzig’s rule, the entering variable is x1 . Next, we compute the pivot column:

− 15

69 − 5 29 hl = − 5 13 5 44 5

Next, we ﬁnd that the leaving variable is x10 . Finally, we update the partition (B, N ) and the basis inverse. [ B= 5 7 8 [ N = 10 2 3 9 0 44 93 44 −1 −1 (AB ) = 41 0 44 73 − 44 0 1 0 44

] 9 1

]

4 6 0 0

1 44

0 0

69 44

29 1 0 44 13 0 1 − 44 5 0 0 44

Now, we proceed with the third iteration of the algorithm. We check if sN ≥ 0: [ ] 5 167 899 124 133 sN = 44 22 44 11 44 37

Indeed, sN ≥ 0, so the linear problem is optimal. The algorithm performed 3 iterations and found that the optimal solution is −19.5.

2.3

Primal - Dual Exterior Point Simplex Algorithm

A formal description of PDEPSA is given in Table 2.2. For a full description of PDEPSA see [101]. We illustrate the algorithm by applying it to the following linear problem:

min

2x1 + x2 + 4x3 + 3x4

s.t. − x1 − 2x2 + x3 − x4 ≤ −10 −x1 + 3x2 − x3 − x4 ≤ 5 4x1 − 2x2 − x3 − x4 ≤ −6

(EX.2)

x1 , x2 , x3 , x4 ≥ 0 First, we introduce the slack variables x5 , x6 , and x7 . The example (EX.2) can be rewritten in matrix notation as follows: [ cT = 2 1 4 3 0 −1 −2 1 −1 A = −1 3 −1 −1 4 −2 −1 −1 −10 b= 5 −6 38

] 0 0

1 0 0 0 1 0 0 0 1

Table 2.2: Primal-Dual Exterior Point Simplex Algorithm Step 0. (Initialization). A) Start with a dual feasible basic partition (B, N ) and an interior point y > 0 of (LP.2). Set: P = N, Q = ∅ and compute xB = (AB )−1 b, wT = (cB )T (AB )−1 , sN = (cN )T − wT AN B) Compute the direction dB from the relation: dB = yB − xB Step 1. (Test of optimality and choice of the leaving variable). if x ≥ 0 then STOP. (LP.2) is optimal. else Choose the leaving variable xk = xB[r]{from the relation: } x

x

al = −dB[r] = max −dB[r] : dB[i] > 0 ∧ xB[i] < 0 B[r] B[r] Step 2. (Computation of the next interior point). Set: a = al2+1 Compute the interior point: yB = xB + adB Step 3. (Choice ( −1 )of the entering variable). Set: HrN = AB r. A.N . Choose the entering variable xl from the{relation: } −sl HrN

= min

−sl HrN −1

: Hrj ∧ j ∈ N

Compute the pivoting column: hl = (AB ) A.l if l ∈ P then P ← P \ {l} else Q ← Q \ {l} Step 4. (Pivoting). Set: B[r] = l and Q ← Q ∪ {k} Using the new partition (B, N ) where N = (P, Q), compute the new basis inverse −1 AB and the variables xB , w, and sN . Go to step 0B. [ In step 0, we determine the partition (B, N ), where B = [ ] 1 2 3 4 . Hence: 1 0 0 AB = 0 1 0 0 0 1 39

] 5 6 7 and N =

−1 −2 1 −1 AN = −1 3 −1 −1 4 −2 −1 −1 −10 xB = 5 −6 [ ] w= 0 0 0 ] [ sN = 2 1 4 3 It is easily seen that current basis is dual feasible because sN ≥ 0. So, we do not need to apply the modiﬁed big-M method presented by Paparrizos et al. [83]. Furthermore, [ ] T the point yB = 1 16 5 is an interior point, so we can apply PDEPSA. Then, we proceed with the ﬁrst iteration of the algorithm. −10 xB = 5 −6 [ ] P =N = 1 2 3 4 Q=Ø [ ] [ ] [ ] T T T dB = yB − xB = 1 16 5 − −10 5 −6 = 11 11 11 We check if xB ≥ 0; xB 0, so we proceed with the selection of the leaving variable. The leaving variable is x5 and the entering variable xl = x2 . Next, we perform the pivoting step: [ ] P = 1 3 4 [ ] Q= 5 [ ] B= 2 6 7 [ ] N= 1 5 3 4 1 − 0 0 2 −1 3 (AB ) = 2 1 0 −1 0 1 40

5

xB = −10 4 [ ] sN = 32 12 92 25 Then, we proceed with the second iteration of the algorithm. The new direction is [ ] dTB = −4 26 1 . We check if xB ≥ 0; xB 0, so we proceed with the selection of the leaving variable. The leaving variable is x6 and the entering variable xl = x1 . Next, we perform the pivoting step: ] [ P = 3 4 [ ] Q= 5 6 [ ] B= 2 1 7 [ ] N= 6 5 3 4 1 1 0 − 5 5 (AB )−1 = − 35 − 25 0 2 2 1 3 xB = 4 −16 ] [ 7 24 3 sN = 5 5 5 1 Then, we proceed with the third iteration of the algorithm. The new direction is ] [ 269 . We check if x ≥ 0; x 0, so we proceed with the selection dTB = − 10 4 13 B B 13 of the leaving variable. The leaving variable is x7 and the entering variable xl = x4 . Next, we perform the pivoting step: [ ] P = 3 [ ] Q= 5 6 7 [ ] B= 2 1 4 [ ] N= 6 5 3 7 41

(AB )−1

− 15 51 0 1 = − 15 0 5 − 25 − 25 − 51 3 4 xB = 5 16 5

[ 3 5

sN =

7 5

24 5

] 0

xB ≥ 0, so the linear problem is optimal. The algorithm performed 3 iterations and found that the optimal solution is 14.2.

2.4

Interior Point Method

In this section, we present an interior point method proposed by Mehrotra [73]. This algorithm has been implemented by Zhang [124] and this implementation was incorporated in MATLAB. This implementation is used in the following chapters when there is a need to compare our algorithms with a sophisticated commercial solver. This implementation is a primal-dual infeasible-interior-point algorithm that takes advantage of MATLAB’s sparse-matrix functions and of existing Fortran sparse Cholesky codes. Prior to the description of Zhang’s implementation, we ﬁrst present the adequate notations used by Zhang [124]. First, consider the following primal linear program:

min

cT x

s.t.

Ax = b 0 ≤ xi ≤ ui , i ∈ I 0 ≤ xi , i ∈ J 42

(2.1)

where I and J are disjoint index sets that I ∪ J = {1, 2, ..., n} and I ∩ J = ∅. For some positive integer nu ≤ n:

I = {1, 2, ..., nu } , J = {nu + 1, nu + 2, ..., n}

(2.2)

Given a vector x ∈ ℜn , the notation xu denotes the vector in ℜnu whose elements are the ﬁrst nu elements of x. Moreover, Zhang [124] deﬁnes the appending operator “app” from ℜnu to ℜn that appends n − nu zeros to vectors ℜnu :

[app (wi )] =

w , 1 ≤ i ≤ n i

u

0, n < i ≤ n u

(2.3)

By adding the slack variables, the linear program in (2.1) can be rewritten into the standard form:

min

cT x

s.t.

Ax = b

(2.4)

xu + s = u x ≥ 0, s ≥ 0 The dual of the above standard primal linear program is:

bT y − u T w

max s.t.

AT y + z − app (w) = c z ≥ 0, w ≥ 0 43

(2.5)

where y, z and w are the dual variables and slacks. The solutions of the primal and the dual linear programs, if they exist, satisfy the following KKT conditions which is a system of linear-quadratic equations with nonnegativity constraints on some variables:

Ax − b xu + s − u F (x, z, s, w, y) = AT y + z − app (w) − c = 0, (x, z, s, w) ≥ 0 xz sw

(2.6)

where xz and sw denote component-wise multiplications and the equations xz = 0 and sw = 0 are called the complementarity conditions for the linear program.

As can be observed from (2.6), solving a linear program is equivalent to solving a system of linear equations with nonnegativity constraints on a subset of variables. The KKT system (2.6) can be rewritten into a constrained algebraic system of l equations and l variables with nonnegativity constraints on l+ variables:

F (v) = 0, vi ≥ 0, 1 ≤ i ≤ l +

(2.7)

where v = (x, z, s, w, y), l = 2n + 2nu + m and l+ = 2n + 2nu . The algorithm that was implemented by Zhang [124] was proposed by Mehrotra [73], which is called a predictor-corrector method. Mehrotra’s infeasible interior point algorithm is by far the most eﬃcient interior point method in practice [67] [73]. The algorithm can be described by the following algorithmic framework 2.3 [124]: Zhang’s [124] implementation of this algorithm was developed using MATLAB. Both m-ﬁles and mex-ﬁles were used; intensive tasks were implemented in Fortran in 44

1. 2. 3. 4. 5. 6.

Table 2.3: Mehrotra’s predictor-corrector method 0 0 Choose an

initial ( k ) point v such that vi > 0, 1 ≤ i ≤ l+. Set k = 0. do until F v is small ( )−1 ( k ) Compute ∆p v k = −F ′ v k F v and compute µk > 0. ( ) ( ( )) −1 Compute ∆c v k = −F ′ v k F v k +(∆p v k − µk e ). Choose ak > 0 and set v k+1 = v k + ak ∆p v k + ∆c v k such that vik+1 > 0, 1 ≤ i ≤ l+. Increment k by 1. end do

order to take advantage of it’s computational speed. The algorithm also includes a preprocessing and a scaling function.

45

46

Chapter 3 Scaling Techniques 3.1

Introduction

Preconditioning techniques are important in solving linear problems, as they improve their computational properties. Scaling is the most widely used preconditioning technique in linear optimization algorithms and is used to reduce the condition number of the constraint matrix, improve the numerical behavior of the algorithms and reduce the number of iterations required to solve linear problems. The aim of this chapter is twofold. First, we review and implement ten scaling techniques with a focus on their implementation on a GPU. All these techniques have been implemented under the MATLAB and CUDA environment. A computational study on the Netlib set (optimal, Kennington and infeasible LPs) is presented to establish the practical value of the GPU-based implementations. Finally, a computational study of the impact of scaling prior to the application of the linear programming algorithms is presented. In the computational study that we have conducted, we calculate both the CPU time and the number of iterations with and without scaling for a set of sparse randomly generated optimal LPs. The scaling techniques that we applied to the above mentioned algorithms are: (i) arithmetic mean, (ii) equilibration, and (iii) geometric mean scaling techniques. The problems with and without scaling are solved with three linear programming algorithms in order to investigate the signiﬁcance of scaling: (i) simplex-type or pivoting algorithms, (ii) interior-point methods (IPM) and 47

(iii) exterior point simplex type algorithms (EPSA). The structure of the chapter is as follows. In Section 3.2, we present the background of this chapter. In Section 3.3, ten scaling methods are presented and analyzed. Section 3.4 presents the implementations of these scaling methods on a GPU. In Section 3.5, the computational comparison of the CPU- and GPU- based implementations are presented over the Netlib set (optimal, Kennington and infeasible LPs). Section 3.6 presents a computational study of the impact of scaling prior to the application of the revised simplex algorithm, an inerior-point method and an exterior point simplex type algorithm over a set of sparse randomly generated optimal LPs. Finally, the conclusions of this chapter are outlined in section 3.7.

3.2

Background

As in the solution of any large scale mathematical system, the computational time and numerical accuracy for large LPs are major concerns. Preconditioning techniques can be applied to LPs prior to the application of a solver in order to improve their computational properties. Scaling is the most widely used preconditioning technique in linear optimization solvers. A matrix is badly-scaled if its nonzero elements are of diﬀerent magnitudes. Scaling is an operation in which the rows and columns of a matrix are multiplied by positive scalars and these operations lead to nonzero numerical values of similar magnitude. Scaling is used prior to the application of a linear programming algorithm for three reasons [113]: (a) to produce a compact representation of the bounds of variables, (b) to reduce the number of iterations required to solve LPs, (c) to simplify the setup of the tolerances, and (d) to reduce the condition number of the constraint matrix A and improve the numerical behavior of the algorithms. Tomlin [113] presented a thesis on scaling LPs and a computational study comparing arithmetic mean, geometric mean, equilibration, Curtis and Reid [16] scaling technique, Fulkerson and Wolfe [32] scaling technique, and various combinations on six 48

test problems of varying size. Tomlin concluded that the geometric mean scaling, optionally followed by equilibration, or the Curtis-Reid method are the best combined scaling techniques. Larsson [60] expanded on Tomlin’s study by presenting and comparing entropy, Lp norm [46], and de Buchet [21] scaling models over 135 randomly generated problems of varying size. Larsson remarked that the entropy model is often able to improve the conditioning of the randomly generated LPs. Elble and Sahinidis [26] performed a computational study comparing arithmetic mean, de Buchet scaling model, entropy scaling technique, equilibration, geometric mean, IBM MPSX method, Lp norm scaling method, binormilization scaling technique, and various combinations over Netlib and Kennington set. Elble and Sahinidis used four measures to evaluate the quality of each scaling technique: (a) scaling time, (b) solution time, (c) solution iterations, and (d) maximum condition number. Elble and Sahinidis concluded that on average no scaling technique outperforms the equilibration scaling technique despite the added complexity and computational cost. Furthermore, Elble and Sahinidis [25] have proposed an implementation of the binormilization scaling technique using a GPU that outperformed the CPU-based implementation.

3.3 3.3.1

Scaling Techniques Mathematical Preliminaries

Some necessary mathematical preliminaries should be introduced, before the presentation of the scaling techniques. We have adopted notations and terminology similar to [26]. Let A be an m x n matrix. Let ri be the row scaling factor for row i and sj be the column scaling factor for column j. Let Ni = {j|Aij ̸= 0}, where i = 1, ..., m, and Mj = {i|Aij ̸= 0}, where j = 1, ..., n. Let ni and mj be the cardinality numbers of the sets Ni and Mj , respectively. Furthermore, we assume that there are no zero rows or columns in matrix A (or if exist, we remove them with the application of a presolve technique). The scaled matrix is expressed as X = RAS, where R = diag(r1 ...rm ) 49

and S = diag(s1 ...sn ). All scaling methods presented in this chapter, perform ﬁrst a scaling of the rows and then a scaling of the columns (Gauss - Seidel based versions). A Jacobi based version of binormilization scaling technique has been proposed by Elble and Sahinidis [25] and showed smaller convergence rate, but high speed up in dense large scale LPs. We perform a computational study over sparse LPs, so we chose to implement Gauss - Seidel based versions of the scaling techniques. Row and column scales are applied iteratively. Let X 0 = A. Let k be the index of the scaling iterations, X k be the scaled matrix after k iterations, X k+1/2 be the scaled matrix only by row scaling factors in iteration k + 1, X k+1 be the scaled matrix both by row and column scaling factors in iteration k + 1, rk+1 be the row scaling factors in iteration k + 1, and sk+1 be the column scaling factors in iteration k + 1. Then, each iteration is given by: X k+1/2 = Rk+1 X k , (3.1) X

k+1

=X

k+1/2

S

k+1

where: R=

t ∏

Rk ,

k=1 t ∏

S=

(3.2) Sk

k=1

where t is the number of iterations for a scaling method to converge to the scaled matrix.

3.3.2

Arithmetic Mean

Arithmetic mean aims to decrease the variance between the nonzero elements in the coeﬃcient matrix A. In this method, each row is divided by the arithmetic mean of the absolute value of the elements in that row. The row scaling factors are presented 50

in equation 3.3:

(

∑ Xijk ni /

rik+1 =

) (3.3)

jϵNi

Similarly, each column is divided by the arithmetic mean of the absolute value of the elements in that column. The column scaling factors are presented in equation 3.4:

sk+1 j

∑ k+1/2 = mj / Xij

(3.4)

iϵMj

3.3.3

de Buchet

The de Buchet scaling model is based on the relative divergence and is formulated as shown in equation 3.5: min

∑

(r,s>0)

1/p {Aij ri sj + 1/ (Aij ri sj )}p

(3.5)

(i,j)ϵZ

where p is a positive integer and Z is the number of nonzero elements of A. We focus now our attention on the cases p = 1, 2 and ∞. For the case p = 1, the relation 3.5 is formulated as shown in equation 3.6: ∑

min

Aij ri sj + 1/ (Aij ri sj )

(r,s>0)

(3.6)

(i,j)ϵZ

The row scaling factors are presented in equation 3.7:

rik+1 =

{( ∑

)( )}1/2 ∑ k Xijk 1/ Xij

jϵNi

(3.7)

jϵNi

Similarly, the column scaling factors are presented in equation 3.8:

sk+1 j

1/2 ∑ k+1/2 ∑ k+1/2 X = 1/ Xij ij iϵMj

iϵMj

51

(3.8)

For the case p = 2, the relation 3.5 is stated as shown in equation 3.9: min

∑

(r,s>0)

1/2 {Aij ri sj + 1/ (Aij ri sj )}2

(3.9)

(i,j)ϵZ

The row scaling factors are presented in equation 3.10:

rik+1 =

{( ∑(

)2 1/ Xijk

jϵNi

)(

∑ ( )2 Xijk

)}1/4 (3.10)

jϵNi

Similarly, the column scaling factors are presented in equation 3.11:

sk+1 j

1/4 ∑( )2 ( ) k+1/2 ∑ k+1/2 2 = 1/ Xij Xij iϵMj

(3.11)

iϵMj

Finally, for the case p = ∞, the relation 3.5 is formulated as shown in equation 3.12: min max |log (Aij ri sj )|

(3.12)

(r,s>0) (i,j)ϵZ

The row scaling factors are described in equation 3.13: rik+1

{( )( )} k k 1/2 = 1/ max Xij min Xij jϵNi

jϵNi

(3.13)

Similarly, the column scaling factors are presented in equation 3.14: {( sk+1 j

3.3.4

= 1/

) ( )}1/2 k+1/2 k+1/2 max Xij min Xij iϵMj

iϵMj

(3.14)

Entropy

The entropy model was ﬁrst presented by Larsson [60] and is attributed to Dantzig and Erlander. This technique solves the model presented in equation 3.15, in order 52

to identify a scaling X, with all xij ̸= 0 of magnitude one: min

∑

Xij (log (Xij /Aij ) − 1)

(i,j)ϵZ

s.t.

∑

Xij = ni

i = 1, ..., m

Xij = mj

j = 1, ..., n

jϵNi )

∑

iϵMj

Xij ≥ 0

∀ (i, j) ϵZ (3.15)

According to Larsson [60], any algorithm for solving entropy programs can be used for scaling matrices, but it is recommended to apply the row and column scaling factors presented in equations 3.16 and 3.17: rik+1 = ni /

∑ Xijk

(3.16)

jϵNi

sk+1 = mj / j

∑ k+1/2 Xij

(3.17)

iϵMj

3.3.5

Equilibration

For each row of the coeﬃcient matrix A the largest element in absolute value is found, and the speciﬁed row of matrix A and the corresponding element of vector b is multiplied by the inverse of the largest element. Then, for each column of the coeﬃcient matrix A that does not include 1 as the largest element in absolute value, the largest element in absolute value is found, and the speciﬁed column of matrix A and the corresponding element of vector c is multiplied by the inverse of the largest element. Consequently, all the elements of matrix A will have values between -1 and 1. 53

3.3.6

Geometric Mean

Like arithmetic mean, geometric mean also aims to decrease the variance between the nonzero elements in the matrix. In this method, the row scaling factors are calculated as shown in equation 3.18: rik+1

( )−1/2 k k = max Xij min Xij jϵNi

jϵNi

(3.18)

Similarly, the column scaling factors are presented in equation 3.19: sk+1 j

3.3.7

( )−1/2 k+1/2 k+1/2 = max Xij min Xij jϵMj

jϵMj

(3.19)

IBM MPSX

The method was proposed by Benichou et al. [5] and was later adopted by IBM, which used this method in IBMs MPSX linear optimization solver. This method combines geometric mean and equilibration scaling techniques. Initially, geometric mean is performed four times or until the relation 3.20 is true. 1 Z

∑ (

) k 2

Xij

−

(i,j)ϵZ

∑ (

) k 2

Xij

/ Z < ε 2

(3.20)

(i,j)ϵZ

where Z is the cardinality number of nonzero elements of X k and ε is a tolerance, which often is set below ten. Then, a equilibration scaling is applied.

3.3.8

Lp-norm

The Lp -norm scaling model is formulated as shown in equation 3.21: min

∑

(r,s>0)

1/p |log (Aij ri sj )|p

(i,j)ϵZ

54

(3.21)

where p is a positive integer and Z is the cardinality number of nonzero elements of A. We focus now our attention on the cases p = 1, 2 and ∞. For the case p = 1, the model is formulated as shown in equation 3.22: ∑

min (r,s>0)

|log (Aij ri sj )|

(3.22)

(i,j)ϵZ

We divide each row and column by the median of the absolute value of the nonzero elements. The row scaling factors are presented in equation 3.23: { } rik+1 = 1/median Xijk |jϵNi

(3.23)

Similarly, the column scaling factors are presented in equation 3.24: { } k+1/2 sk+1 = 1/median X |iϵM j j ij

(3.24)

For the case p = 2, the model is stated as shown in equation 3.25: min

1/2

∑

|log (Aij ri sj )|2

(r,s>0)

(3.25)

(i,j)ϵZ

The row scaling factors are calculated as shown in equation 3.26: rik+1 = 1/

∏(

Xijk

)1/ni

(3.26)

jϵNi

Similarly, the column scaling factors are presented in equation 3.27: sk+1 j

= 1/

∏(

k+1/2 Xij

)1/mj

(3.27)

iϵMj

Finally, for the case p = ∞, the model is equivalent to the de Buchet method (case p = ∞).

55

3.4

Implementation of Scaling Techniques on a GPU

In this section, we present the GPU-based implementations of the aformentioned scaling methods, taking advantage of the power that modern GPUs oﬀer. The GPUbased implementations of these scaling techniques are implemented on MATLAB and CUDA. MATLAB’s GPU support was added in version R2011a. The scaling methods are built using CUDA MEX ﬁles and invoked from MATLAB. All methods take as input the constraint matrix A, the right-hand side vector b and the objective function coeﬃcients c. The outputs are the scaled matrices A, b and c, and the row and column scaling factors, r and s respectively.

3.4.1

Implementation of GPU-Based Scaling Techniques

Figure 3-1 presents the process that is performed in the GPU-based implementation of arithmetic mean. Similar steps are performed in each of the aforementioned scaling techniques. Firstly, the CPU reads the matrix A and vectors b and c from the input ﬁle. In the second step, the GPU gathers the input data and calculates the number and the sum of nonzero elements of each row. In step 3, the GPU calculates the row scaling factors as the number of nonzero elements of each row to the sum of the same row. Then, in the fourth step, the GPU updates matrix A and vector b according to the row scaling factors. In the ﬁfth step, the GPU calculates the number and the sum of nonzero elements of each column. In step 6, the GPU calculates the column scaling factors as the number of nonzero elements of each column to the sum of the same column. Then, in step 7, the GPU updates matrix A and vector c according to the column scaling factors. Finally, the CPU gathers the results and save matrix A and vectors b and c to the output ﬁle.

56

Figure 3-1: Flow Chart of the GPU-based Arithmetic Mean

3.4.2

Notations

Some necessary notations should be introduced, before the presentation of the pseudocodes of each scaling method. Let r be an 1xm vector with row scaling factors and s be an 1xn vector with column scaling factors. Let sum row be an 1xm vector with the sum of each row’s elements, sum col be an 1xn vector with the sum of each column’s elements, inv sum row be an 1xm vector with the inverse of the sum of each row’s elements, inv sum col be an 1xn vector with the inverse of the sum of each column’s elements. Furthermore, let row prod be an 1xm vector with the product of each row’s elements, col prod be an 1xn vector with the product of each column’s elements, row max be an 1xm vector with each row’s maximum element, row min be an 1xm vector with each row’s minimum element, col max be an 1xn vector with each column’s maximum element, col min be an 1xn vector with each column’s minimum element. Finally, let row med be an 1xm vector with the median of each row’s elements, col med be an 1xn vector with the product of each column’s elements, 57

count row be an 1xm vector with the number of each row’s nonzero elements, and count col be an 1xn vector with the sum of each column’s nonzero elements. The pseudocode of all scaling techniques are presented in the following sub-sections of this section. Pseudocodes include “do parallel” and “end parallel” sections, in which the workload is divided into warps that are executed sequentially on a multiprocessor. The warp size in NVIDIA Quadro 6000 is 32 threads. The pseudocodes presented in the following sub-sections do not work as they are for arrays larger than the warp size, because the scan of matrix A and vectors b and c is performed in place. The results of one warp will be overwritten by threads in another warp. We assume, without loss of generality, that matrix A and vectors b and c are double-buﬀered. This assumption is made merely to simplify the presentation of the subsequent pseudocodes. Although, the scaling process can be applied iteratively, pseudocodes present only one iteration. Finally, all pseudocodes are presented according to the equations of section 3.3.

3.4.3

GPU-based Arithmetic Mean

Table 3.1 shows the pseudocode of the implementation of the arithmetic mean scaling technique on a GPU. In the ﬁrst for-loop (lines 2:14), the row scaling factors are calculated in parallel as the number of nonzero elements of each row to the sum of the same row (line 10). If the absolute value of the sum and the inverse sum of a row are not zero (line 9), then matrix A and vector b are updated (lines 11:12). Finally, in the second for-loop (lines 17:31), the column scaling factors are calculated in parallel as the number of nonzero elements of each column to the sum of the same column (line 25). If the absolute value of the sum and the inverse sum of a column are not zero (line 24), then matrix A and vector c are updated (lines 26:27).

3.4.4

GPU-based de Buchet

Table 3.2 shows the pseudocode of the implementation of the de Buchet scaling technique for the case p = 1 on a GPU. In the ﬁrst for-loop (lines 2:14), the row scaling 58

Table 3.1: GPU-based Arithmetic Mean 1. do parallel 2. for i=1:m 3. for j=1:n 4. if A[i][j] != 0 5. sum_row[i] = sum_row[i] + |A[i][j]| 6. count_row[i] = count_row[i] + 1 7. end if 8. end for 9. if count_row[i] != 0 AND sum_row[i] != 0 10. r[i] = count_row[i] / sum_row[i] 11. A[i][:] = A[i][:] * r[i] 12. b[i] = b[i] * r[i] 13. end if 14. end for 15. end parallel 16. do parallel 17. for i=1:n 18. for j=1:m 19. if A[i][j] != 0 20. sum_col[i] = sum_col[i] + |A[i][j]| 21. count_col[i] = count_col[i] + 1 22. end if 23. end 24. if count_col[i] != 0 AND sum_col[i] != 0 25. s[i] = count_col[i] / sum_col[i] 26. A[:][i] = A[:][i] * s[i] 27. c[i] = c[i] * s[i] 30. end if 31. end for 32. end parallel

59

factors are calculated in parallel as square root of the division of the inverse sum of each row to the sum of the same row (line 10). If the absolute value of the sum and the inverse sum of a row are not zero (line 9), then matrix A and vector b are updated (lines 11:12). Finally, in the second for-loop (lines 17:29), the column scaling factors are calculated in parallel as the square root of the division of the inverse sum of each column to the sum of the same column (line 25). If the absolute value of the sum and the inverse sum of a column are not zero (line 24), then matrix A and vector c are updated (lines 26:27).

Similarly, Table 3.3 presents the pseudocode of the implementation of the de Buchet scaling technique for the case p = 2 on a GPU. In the ﬁrst for-loop (lines 2:15), the row scaling factors are calculated in parallel as the division of the inverse sum of each row to the sum of the same row to the power of 1/4 (line 11). If the absolute value of the sum and the inverse sum of a row are not zero (line 10), then matrix A and vector b are updated (lines 12:13). Finally, in the second for-loop (lines 18:30), the column scaling factors are calculated in parallel as the division of the inverse sum of each column to the sum of the same column to the power of 1/4 (line 26). If the absolute value of the sum and the inverse sum of a column are not zero (line 25), then matrix A and vector c are updated (lines 27:28).

Finally, Table 3.4 shows the pseudocode of the implementation of the de Buchet scaling technique for the case p = ∞ on a GPU. In the ﬁrst for-loop (lines 2:10), the row scaling factors are calculated in parallel as the inverse of the square root of the product of the maximum and the minimum element of the same row (line 6). If the maximum and minumum element are not zero (line 5), then matrix A and vector b are updated (lines 7:8). Finally, in the second for-loop (lines 13:21), the column scaling factors are calculated in parallel as the inverse of the square root of the product of the maximum and the minimum element of the same column (line 17). If the maximum and minumum element are not zero (line 16), then matrix A and vector c are updated (lines 18:19). 60

Table 3.2: GPU-based de Buchet for the case p = 1 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30.

do parallel for i=1:m for j=1:n if A[i][j] != 0 sum_row[i] = sum_row[i] + |A[i][j]| inv_sum_row[i] = inv_sum_row[i] + 1/|A[i][j]| end if end for if sum_row[i] != 0 AND inv_sum_row[i] != 0 r[i] = (inv_sum_row(i) / sum_row(i))^(1/2); A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n for j=1:m if A[i][j] != 0 sum_col[i] = sum_col[i] + |A[i][j]| inv_sum_col[i] = inv_sum_col[i] + 1/|A[i][j]| end if end if sum_col[i] != 0 AND inv_sum_col[i] != 0 s[i] = (inv_sum_col(i) / sum_col(i))^(1/2); A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

61

Table 3.3: GPU-based de Buchet for the case p = 2 1. 2. 3. 4. 5. 6. 7. 8. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.

do parallel for i=1:m for j=1:n if A[i][j] != 0 sum_row[i] = sum_row[i] + |A[i][j]*A[i][j]| inv_sum_row[i] = inv_sum_row[i] + 1 / |(A[i][j]*A[i][j])| end if end for if sum_row[i] != 0 AND inv_sum_row[i] != 0 r[i] = (inv_sum_row(i) / sum_row(i))^(1/4); A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n for j=1:m if A[i][j] != 0 sum_col[i] = sum_col[i] + |A[i][j]*A[i][j]| inv_sum_col[i] = inv_sum_col[i] + 1 / |(A[i][j]*A[i][j])| end if end if sum_col[i] != 0 AND inv_sum_col[i] != 0 s[i] = (inv_sum_col(i) / sum_col(i))^(1/4); A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

62

Table 3.4: GPU-based de Buchet for the case p = ∞ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22.

do parallel for i=1:m find the maximum absolute value of the element store it to row_max[i] find the minimum absolute value of the element store it to row_min[i] if row_max[i] != 0 AND row_min[i] != 0 r[i] = 1/(sqrt(row_max[i] * row_min[i])) A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n find the maximum absolute value of the element store it to col_max[i] find the minimum absolute value of the element store it to col_min[i] if col_max[i] != 0 AND col_min[i] != 0 s[i] = 1/(sqrt(col_max[i] * col_min[i])) A[i][:] = A[i][:] * s[i] c[i] = c[i] * s[i] end if end for end parallel

3.4.5

in row i and in row i and

in column i and in column i and

GPU-based Entropy

Table 3.5 presents the pseudocode of the implementation of the arithmetic mean scaling technique on a GPU. In the ﬁrst for-loop (lines 2:14), the row scaling factors are calculated in parallel as the division of the number to the sum of the nonzero elements of each row (line 10). If the number of the nonzero elements of a row are not zero (line 9), then matrix A and vector b are updated (lines 11:12). Finally, in the second for-loop (lines 17:29), the column scaling factors are calculated in parallel as 63

the division of the number to the sum of the nonzero elements of each column (line 25). If the number of the nonzero elements of a column are not zero (line 24), then matrix A and vector c are updated (lines 26:27).

3.4.6

GPU-based Equilibration

Table 3.6 shows the pseudocode of the implementation of the equilibration scaling technique on a GPU. In the ﬁrst for-loop (lines 2:9), the row scaling factors are calculated in parallel as the inverse of the maximum element of each row (line 5). If the the maximum element of a row is not zero (line 4), then matrix A and vector b are updated (lines 6:7). Similarly, in the second for-loop (lines 12:19), the column scaling factors are calculated in parallel as the inverse of the maximum element of each column (line 15). If the the maximum element of a column is not zero (line 14), then matrix A and vector c are updated (lines 16:17).

3.4.7

GPU-based Geometric Mean

Table 3.7 presents the pseudocode of the implementation of the geometric mean scaling technique on a GPU. In the ﬁrst for-loop (lines 2:11), the row scaling factors are calculated in parallel as the inverse of the product of the square root of the maximum element in a row and the square root of the minimum element in a row (line 6). If the maximum and minimum element of a row are not zero (line 5), then matrix A and vector b are updated (lines 7:8). Finally, in the second for-loop (lines 14:22), the column scaling factors are calculated in parallel as the inverse of the product of the square root of the maximum element in a column and the square root of the minimum element in a column (line 18). If the maximum and minimum element of a column are not zero (line 17), then matrix A and vector c are updated (lines 19:20).

64

Table 3.5: GPU-based Entropy 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30.

do parallel for i=1:m for j=1:n if A[i][j] != 0 sum_row[i] = sum_row[i] + |A[i][j]| count_row[i] = count_row[i] + 1 end if end for if count_row[i] != 0 AND sum_row[i] != 0 r[i] = count_row[i] / sum_row[i]; A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n for j=1:m if A[i][j] != 0 sum_col[i] = sum_col[i] + |A[i][j]| count_col[i] = count_col[i] + 1 end if end if count_col[i] != 0 AND sum_col[i] != 0 s[i] = count_col[i] / sum_col[i]; A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

65

Table 3.6: GPU-based Equilibration 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

do parallel for i=1:m find the maximum element in row i and store it to row_max[i] if row_max[i] != 0 r[i] = 1 / row_max[i]; A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n find the maximum element in column i and store it to col_max[i] if col_max[i] != 0 s[i] = 1 / col_max[i] A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

66

Table 3.7: GPU-based Geometric Mean 1. 2. 3. 4. 5. 6. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23.

do parallel for i=1:m find the maximum element in row i and store it to row_max[i] find the minimum element in row i and store it to row_min[i] if row_max[i] != 0 AND row_min[i] != 0 r[i] = 1/(sqrt(row_max[i]) * sqrt(row_min[i]))) A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n find the maximum element in column i and store it to col_max[i] find the minimum element in column i and store it to col_min[i] if col_max[i] != 0 AND col_min[i] != 0 s[i] = 1/(sqrt(col_max[i]) * sqrt(col_min[i]))) A[i][:] = A[i][:] * s[i] c[i] = c[i] * s[i] end if end for end parallel

67

Table 3.8: GPU-based IBM MPSX 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15.

do parallel find the maximum element in matrix A and store it to max_A find the minimum element in matrix A and store it to min_A end parallel if |max_A-min_A|>tol counter = 0 while counter < 4 apply geometric mean scaling technique (which is already parallelized) if Equation (20) is true break end if counter=counter+1; end while apply equilibration scaling technique (which is already parallelized) end if

3.4.8

GPU-based IBM MPSX

Table 3.8 shows the pseudocode of the implementation of the IBM MPSX scaling technique on a GPU. Initially, the maximum and minimum elements are calculated (lines 2:3); if the absolute value of their diﬀerence is greater than a very small positive number ’tol’ (line 5), then the algorithm performs four times the geometric mean scaling method (line 8) or stops when Equation 3.20 is true (lines 9:11). Finally, the equilibration scaling method is applied (line 14).

3.4.9

GPU-based Lp-norm

Table 3.9 presents the pseudocode of the implementation of the Lp -norm scaling technique for the case p = 1 on a GPU. In the ﬁrst for-loop (lines 2:9), the row scaling factors are calculated in parallel as the inverse of the median in each row (line 5). If the median of a row is not zero (line 4), then matrix A and vector b are updated (lines 6:7). Similarly, in the second for-loop (lines 12:19), the column scaling factors are calculated in parallel as the inverse of the median in each column (line 15). If the median 68

Table 3.9: GPU-based Lp-norm for the case p = 1 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

do parallel for i=1:m find the median in row i and store it to row_med[i] if row_med[i] != 0 r[i] = 1 / row_med[i] A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n find the median in column i and store it to col_med[i] if col_med[i] != 0 s[i] = 1 / col_med[i] A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

of a column is not zero (line 14), then matrix A and vector c are updated (lines 16:17).

Similarly, Table 3.10 shows the pseudocode of the implementation of the Lp -norm scaling technique for the case p = 2 on a GPU. In the ﬁrst for-loop (lines 2:15), the row scaling factors are calculated in parallel as the inverse of the product of each row’s elements to the power of the negative inverse of the number of each row’s nonzero elements (lines 10:11). If the number of a row’s nonzero elements is not zero (line 9), then matrix A and vector b are updated (lines 12:13). Finally, in the second for-loop (lines 18:31), the column scaling factors are calculated in parallel as the inverse of the product of each column’s elements to the power of the negative inverse of the number of each column’s nonzero elements (lines 26:27). If the number of a column’s nonzero elements is not zero (line 25), then matrix A and vector c are updated (lines 28:29).

69

Table 3.10: GPU-based Lp-norm for the case p = 2 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32.

do parallel for i=1:m for j=1:n if A[i][j] != 0 row_prod[i] = row_prod[i] * A[i][j] count_row[i] = count_row[i] + 1 end if end for if row_prod[i] != 0 row_prod[i] = 1 / row_prod[i]^(-1 / count_row[i]) r[i] = 1 / row_prod[i] A[i][:] = A[i][:] * r[i] b[i] = b[i] * r[i] end if end for end parallel do parallel for i=1:n for j=1:m if A[i][j] != 0 col_prod[i] = col_prod[i] * A[i][j] count_col[i] = count_col[i] + 1 end if end if col_prod[i] != 0 col_prod[i] = 1 / col_prod[i]^(-1 / count_col[i]) s[i] = 1 / col_prod[i] A[:][i] = A[:][i] * s[i] c[i] = c[i] * s[i] end if end for end parallel

70

Finally, for the case p = ∞, the pseudocode is equivalent to the de Buchet (Table 3.4).

3.5

Computational Comparison of the CPU- and GPU- Based Implementations

Computational studies have been widely used, in order to examine the practical eﬃciency of an algorithm or even compare algorithms. The computational comparison of the aforementioned updating schemes has been performed on a quad-processor Intel Core i7 3.4 GHz with 32 Gbyte of main memory and 8 cores, a clock of 3700 MHz, an L1 code cache of 32 KB per core, an L1 data cache of 32 KB per core, an L2 cache of 256 KB per core, an L3 cache of 8 MB and a memory bandwidth of 21 GB/s, running under Microsoft Windows 7 64-bit and on a NVIDIA Quadro 6000 with 6 GB GDDR5 384-bit memory, a core clock of 574 MHz, a memory clock of 750 MHz and a memory bandwidth of 144 GB/s. It consists of 14 stream processors with 32 cores each, resulting in 448 total cores. The graphics card driver installed in our system is NVIDIA 64 kernel module 306.23. The serial algorithms have been implemented using MATLAB Professional R2012b. MATLAB (MATrix LABoratory) is a powerful programming environment and is especially designed for matrix computations in general. The mex ﬁles of the GPU-based implementations have been implemented using CUDA SDK 4.2 and Microsoft Visual Studio 2012. Serial scaling techniques automatically execute on multiple computational threads, in order to take advantage of the multiple cores of the CPU. The execution time of all serial scaling techniques already includes the performance beneﬁt of the inherent multithreading in MATLAB. MATLAB supports multithreaded computation for some built-in functions. These functions automatically utilize multiple threads without the need to specify commands to handle the threads in a code. Of course, MATLABs inherent multithreading is not so eﬃcient as a pure parallel implementa71

tion. Execution times both on CPU and GPU-based scaling techniques have been measured using tic and toc MATLAB’s built-in functions. The test set used in the computational study was the Netlib set (optimal, Kennington and infeasible LPs) [35]. The Netlib library is a well known suite containing many real world LPs. Ord´on ˜ez and Freund [79] have shown that 71% of the Netlib LPs are ill-conditioned. Hence, numerical diﬃculties may occur. Table 3.11 presents some useful information about the test bed, which was used in the computational study. The ﬁrst column includes the name of the problem, the second the number of constraints, the third the number of variables, the fourth the nonzero elements of matrix A and the ﬁfth the objective value. The test bed includes 53 optimal and 7 infeasible LPs from Netlib and 5 Kennington LPs that do not have ranges and bounds sections in their mps ﬁles.

Table 3.12 presents the results from the execution of the CPU-based implementations of the scaling methods, while Table 3.13 presents the results of the GPU-based ones. Table 3.14 presents the speedup obtained by the GPU-based implementations. In Tables 3.12 - 3.14, the following abbreviations are used: (i) S1 - arithmetic mean, (ii) S2 - de Buchet for the case p = 1, (iii) S3 - de Buchet for the case p = 2, (iv) S4 - entropy, (v) S5 - equilibration, (vi) S6 - geometric mean, (vii) S7 - IBM MPSX, (viii) S8 - Lp -norm for the case p = 1, (ix) S9 - Lp -norm scaling technique for the case p = 2, and (x) S10 - Lp -norm scaling technique for the case p = ∞ and de Buchet for the case p = ∞. For each instance, we averaged times over 10 runs. All times in the Tables 3.12 - 3.13 are measured in seconds.

We must highlight that the computational results for all scaling techniques are very promising. From the above results, we observe that: (i) both the CPU- and GPU-based implementations of all scaling techniques are very close in terms of time except entropy and IBM MPSX scaling methods, (ii) GPU outperforms CPU in all problems and for all scaling techniques, and (iii) on average the speedup gained from the GPU-based implementations of all scaling methods is about 7x. The application 72

Table 3.11: Statistics of the Netlib LPs (optimal, Kennington and infeasible LPs) Name 25FV47 ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BEACONFD BGINDY BGPRTR BLEND BNL1 BNL2 BRANDY CRE-A CRE-C D2Q06C DEGEN2 DEGEN3 E226 FFFFF800 ISRAEL ITEST2 ITEST6 KLEIN1 KLEIN2 KLEIN3 LOTFI MAROS-R7 OSA-07 OSA-14 OSA-30 QAP12 QAP15 QAP8 SC105 SC205 SC50A SC50B SCAGR25 SCAGR7 SCFXM1 SCFXM2 SCFXM3 SCORPION SCRS8 SCSD1 SCSD6 SCSD8 SCTAP1 SCTAP2 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 STOCFOR2 TRUSS WOOD1P WOODW

Constraints 822 57 28 489 517 517 306 174 2,672 21 75 644 2,325 221 3,517 3,069 2,172 445 1,504 224 525 175 10 12 55 478 995 154 3,137 1,119 2,338 4,351 3,193 6,331 913 106 206 51 51 472 130 331 661 991 389 491 78 148 398 301 1,091 1,481 118 97 403 403 779 779 1,152 1,152 118 2,158 1,001 245 1,099

Variables 1,571 97 32 163 302 302 472 262 10,116 34 83 1,175 3,489 249 4,067 3,678 5,167 534 1,818 282 854 142 4 8 54 54 88 308 9,408 23,949 52,460 100,024 8,856 22,275 1,632 103 203 48 48 500 140 457 914 1,371 358 1,169 760 1,350 2,750 480 1,880 2,480 225 79 2,118 1,458 4,283 2,387 5,427 2,763 111 2,031 8,806 2,594 8,405

73

Nonzeros A 11,127 465 88 2,541 4,515 4,531 2,659 3,476 75,019 90 521 6,129 16,124 2,150 19,054 16,922 35,674 4,449 26,230 2,767 6,235 2,358 17 23 696 4,585 12,107 1,086 151,120 167,643 367,220 700,160 44,244 110,700 8,304 281 552 131 119 2,029 553 2,612 5,229 7,846 1708 4,029 3,148 5,666 5,500 2,052 8,124 1,0734 1,182 730 8,450 5,810 17,085 9,501 21,597 10,941 474 9,492 36,642 70,216 37,478

Objective value 5.50E+03 2.25E+05 -4.65E+02 -3.60E+07 -2.02E+07 1.03E+07 -1.59E+02 3.36E+04 Infeasible Infeasible -3.08E+01 1.98E+03 1.81E+03 1.52E+03 2.35E+07 2.52E+07 1.23E+05 -1.44E+03 -9.87E+02 -1.88E+01 5.56E+05 -8.97E+05 Infeasible Infeasible Infeasible Infeasible Infeasible -2.53E+01 1.50E+06 5.35E+10 1.10E+06 2.13E+06 5.23E+02 1.04E+03 2.04E+02 -5.22E+01 -5.22E+01 -6.46E+01 -7.00E+01 -1.48E+07 -2.33E+06 1.84E+04 3.67E+04 5.49E+04 1.88E+03 9.04E+02 8.67E+00 5.05E+01 9.05E+02 1.41E+03 1.72E+03 1.42E+03 -7.66E+04 -4.16E+02 1.79E+06 1.80E+06 1.91E+06 1.92E+06 1.47E+06 1.49E+06 -4.11E+04 -3.90E+04 4.59E+05 1.44E+00 1.30E+00

Table 3.12: Scaling time of the CPU-based implementations (secs) Problem 25FV47 ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BEACONFD BGINDY BGPRTR BLEND BNL1 BNL2 BRANDY CRE A CRE C D2Q06C DEGEN2 DEGEN3 E226 FFFFF800 ISRAEL ITEST2 ITEST6 KLEIN1 KLEIN2 KLEIN3 LOTFI MAROS-R7 OSA-07 OSA-14 OSA-30 QAP12 QAP15 QAP8 SC105 SC205 SC50A SC50B SCAGR25 SCAGR7 SCFXM1 SCFXM3 SCORPION SCRS8 SCSD1 SCSD6 SCSD8 SCTAP1 SCTAP2 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 STOCFOR2 TRUSS WOOD1P WOODW Average

S1 0.1872 0.0001 0.0001 0.0156 0.0312 0.0468 0.0312 0.0001 3.8376 0.0001 0.0001 0.0780 0.4836 0.0156 1.1232 0.8736 1.2948 0.0468 0.7176 0.0156 0.0468 0.0001 0.0001 0.0001 0.0001 0.0468 0.1716 0.0156 2.9016 4.0872 19.6561 81.5573 2.3244 11.5441 0.1248 0.0001 0.0156 0.0001 0.0001 0.0312 0.0001 0.0312 0.1404 0.0156 0.0468 0.0156 0.0312 0.1092 0.0156 0.1404 0.2652 0.0001 0.0001 0.0468 0.0468 0.1092 0.0624 0.1716 0.0624 0.0156 0.2808 0.6240 0.1872 0.3588 2.0953

S2 0.2028 0.0001 0.0001 0.0156 0.0624 0.0468 0.0156 0.0001 3.9312 0.0001 0.0001 0.0936 0.5616 0.0156 1.2012 0.9204 1.3260 0.0624 0.7176 0.0312 0.0468 0.0001 0.0001 0.0001 0.0156 0.0312 0.1872 0.0001 3.0576 4.2588 19.7497 85.0049 2.4180 11.5285 0.1560 0.0001 0.0156 0.0001 0.0001 0.0312 0.0156 0.0312 0.1560 0.0156 0.0624 0.0156 0.0468 0.1248 0.0312 0.1560 0.2808 0.0001 0.0156 0.0624 0.0468 0.1560 0.0624 0.2184 0.0780 0.0001 0.3120 0.6864 0.2184 0.4056 2.1704

S3 0.2028 0.0156 0.0001 0.0156 0.0624 0.0624 0.0156 0.0156 4.0560 0.0001 0.0001 0.1092 0.6084 0.0156 1.2948 0.9828 1.4040 0.0468 0.7176 0.0156 0.0468 0.0156 0.0001 0.0156 0.0001 0.0468 0.1872 0.0156 3.1668 4.4304 20.2177 83.3045 2.5116 11.8873 0.1560 0.0001 0.0156 0.0156 0.0001 0.0312 0.0156 0.0312 0.1716 0.0156 0.0624 0.0312 0.0468 0.1404 0.0312 0.1872 0.2964 0.0156 0.0001 0.0780 0.0468 0.1716 0.0624 0.2496 0.0780 0.0156 0.3276 0.7332 0.2184 0.4368 2.1745

S4 4.9296 0.0001 0.0156 0.0156 0.0312 0.0312 0.0780 0.0001 3.8532 0.0001 0.0156 2.7612 106.6579 0.0156 178.3559 126.3140 162.3346 0.4524 19.8589 0.0312 0.3276 0.0156 0.0001 0.0001 0.0001 0.0468 0.1716 0.0312 2.9172 4.0872 19.6717 81.6977 2.3712 11.4661 0.1248 0.0156 0.0312 0.0001 0.0001 0.5304 0.0156 0.0936 3.9936 0.1248 0.0468 0.0624 0.3588 7.3632 0.0156 0.1716 26.0522 0.0001 0.0156 0.0468 0.0468 11.1385 1.6536 0.1716 2.9796 0.0001 90.6210 211.3190 0.1872 0.3432 16.9700

S5 0.1872 0.0001 0.0001 0.0156 0.0468 0.0468 0.0156 0.0001 3.8064 0.0001 0.0156 0.0780 0.4680 0.0001 1.1388 0.8268 1.2636 0.0468 0.6708 0.0001 0.0468 0.0001 0.0001 0.0001 0.0001 0.0312 0.1716 0.0156 2.9484 3.9936 18.9541 79.9193 2.2776 11.1229 0.1248 0.0001 0.0001 0.0156 0.0001 0.0156 0.0156 0.0156 0.1404 0.0156 0.0312 0.0312 0.0312 0.0936 0.0156 0.1404 0.2496 0.0156 0.0001 0.0468 0.0468 0.1092 0.0468 0.1716 0.0624 0.0001 0.2808 0.6084 0.1872 0.3432 2.0463

74

S6 0.1872 0.0156 0.0001 0.0156 0.0468 0.0624 0.0156 0.0001 3.9624 0.0001 0.0156 0.0936 0.4992 0.0156 1.1324 0.9032 1.3572 0.0468 0.6864 0.0312 0.0312 0.0156 0.0001 0.0001 0.0001 0.0468 0.1872 0.0156 3.0576 4.2717 19.8277 81.9824 2.4960 11.6221 0.1404 0.0001 0.0156 0.0001 0.0001 0.0312 0.0156 0.0312 0.1404 0.0156 0.0624 0.0156 0.0468 0.1092 0.0312 0.1560 0.2808 0.0156 0.0001 0.0624 0.0468 0.1404 0.0624 0.2028 0.0780 0.0001 0.2964 0.6708 0.2028 0.3900 2.1235

S7 0.3588 0.0156 0.0001 0.0312 0.0936 0.0936 0.0624 0.0156 7.9093 0.0001 0.0156 0.1716 0.9672 0.0312 1.8746 1.5016 2.6208 0.0936 1.3416 0.0312 0.0780 0.0312 0.0001 0.0001 0.0156 0.0624 0.3432 0.0312 15.2881 8.3617 39.2343 140.0167 4.8828 23.6342 0.2652 0.0001 0.0156 0.0156 0.0001 0.0624 0.0000 0.0780 0.4368 0.0312 0.0936 0.0312 0.0780 0.2184 0.0468 0.2964 0.5304 0.0156 0.0001 0.1248 0.0780 0.2496 0.1092 0.3900 0.1248 0.0156 0.5616 1.2792 0.9360 0.7488 4.0007

S8 0.2496 0.0001 0.0156 0.0312 0.0624 0.0780 0.0312 0.0156 4.2588 0.0001 0.0156 0.1248 0.6240 0.0312 1.3016 1.0013 1.4976 0.0624 0.7644 0.0312 0.0624 0.0156 0.0156 0.0001 0.0001 0.0468 0.2184 0.0156 3.2448 5.1168 21.6529 89.1428 2.7768 12.7141 0.2028 0.0156 0.0156 0.0156 0.0001 0.0468 0.0156 0.0468 0.2028 0.0312 0.0780 0.0468 0.0936 0.2184 0.0312 0.2184 0.3588 0.0156 0.0156 0.1248 0.0780 0.2340 0.1092 0.3432 0.1404 0.0156 0.3900 0.9828 0.2340 0.5304 2.3449

S9 0.2184 0.0156 0.0001 0.0156 0.0624 0.0468 0.0156 0.0156 4.2120 0.0156 0.0001 0.1092 0.5616 0.0156 1.2014 0.9845 1.5132 0.0624 0.8268 0.0156 0.0468 0.0156 0.0001 0.0001 0.0001 0.0468 0.2340 0.0156 3.8220 5.1697 22.3614 91.1643 2.9016 14.6485 0.1560 0.0156 0.0001 0.0156 0.0001 0.0312 0.0000 0.0156 0.1560 0.0312 0.0468 0.0312 0.0312 0.1248 0.0156 0.1716 0.2652 0.0156 0.0001 0.0780 0.0468 0.1560 0.0624 0.2340 0.0780 0.0156 0.3432 0.7800 0.1716 0.3900 2.4033

S10 0.2028 0.0001 0.0001 0.0312 0.0468 0.0468 0.0312 0.0001 3.9156 0.0001 0.0001 0.0936 0.5148 0.0001 1.1354 0.9514 1.3260 0.0468 0.6708 0.0156 0.0468 0.0001 0.0156 0.0001 0.0001 0.0312 0.1872 0.0156 2.9640 4.4176 20.0015 83.4710 2.4024 12.1993 0.1404 0.0001 0.0156 0.0001 0.0001 0.0312 0.0156 0.0156 0.1404 0.0156 0.0624 0.0156 0.0468 0.1092 0.0312 0.1560 0.2652 0.0156 0.0001 0.0780 0.0468 0.1248 0.0624 0.2028 0.0780 0.0001 0.2964 0.6396 0.1872 0.3900 2.1557

Table 3.13: Scaling time of the GPU-based implementations (secs) Problem 25FV47 ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BGINDY BGPRTR BLEND BNL1 BNL2 BRANDY CRE A CRE C D2Q06C DEGEN2 DEGEN3 E226 FFFFF800 ISRAEL ITEST2 ITEST6 KLEIN1 KLEIN2 KLEIN3 LOTFI MAROS-R7 OSA-07 OSA-14 OSA-30 QAP12 QAP12 QAP15 QAP8 SC105 SC205 SC50A SC50B SCAGR25 SCAGR7 SCFXM1 SCFXM3 SCORPION SCRS8 SCSD1 SCSD6 SCSD8 SCTAP1 SCTAP2 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 STOCFOR2 TRUSS WOOD1P WOODW Average

S1 0.03100 0.00001 0.00001 0.00200 0.00371 0.00547 0.00482 0.00001 0.83950 0.00001 0.00001 0.00816 0.12500 0.00242 0.22409 0.15557 0.24900 0.00592 0.06300 0.00187 0.00647 0.00001 0.00001 0.00001 0.00001 0.00687 0.02792 0.00242 0.40600 0.81430 3.50249 15.26764 0.78000 2.89943 0.01600 0.00001 0.00254 0.00001 0.00001 0.00804 0.00001 0.00366 0.03200 0.00226 0.00677 0.00187 0.00336 0.01008 0.00184 0.03200 0.06660 0.00001 0.00001 0.01600 0.00637 0.02617 0.00737 0.03466 0.01454 0.00290 0.05713 0.17747 0.02297 0.10000 0.40762

S2 0.02970 0.00001 0.00001 0.00201 0.00665 0.00524 0.00235 0.00001 0.87065 0.00001 0.00001 0.00814 0.14531 0.00265 0.24709 0.19787 0.19874 0.00952 0.06846 0.00304 0.00591 0.00001 0.00001 0.00001 0.00232 0.00447 0.03638 0.00001 0.55283 0.86469 5.21370 17.85277 0.97736 3.02863 0.02146 0.00001 0.00304 0.00001 0.00001 0.00535 0.00159 0.00297 0.05455 0.00269 0.00937 0.00213 0.00425 0.01396 0.00322 0.04538 0.05087 0.00001 0.00141 0.01384 0.00583 0.03316 0.00932 0.06528 0.01762 0.00002 0.08173 0.16288 0.02509 0.07539 0.4851

S3 0.02891 0.00151 0.00001 0.00190 0.00625 0.00886 0.00278 0.00179 0.91531 0.00179 0.00001 0.01243 0.18299 0.00271 0.31339 0.24019 0.25630 0.00546 0.07157 0.00153 0.00602 0.00134 0.00001 0.00183 0.00001 0.00763 0.04499 0.00296 0.52233 0.92810 6.07788 25.61921 1.27751 3.27338 0.01990 0.00001 0.00215 0.00136 0.00001 0.00828 0.00176 0.00464 0.04656 0.00290 0.01033 0.00319 0.00495 0.01334 0.00330 0.07130 0.05360 0.00154 0.00001 0.02318 0.00746 0.03363 0.00776 0.07758 0.01879 0.00219 0.05732 0.17709 0.02685 0.16596 0.6354

S4 0.65986 0.00001 0.00138 0.00229 0.00277 0.00372 0.01265 0.00001 0.91778 0.00001 0.00131 0.28449 27.31438 0.00276 43.40088 32.25917 23.55395 0.07014 1.70134 0.00314 0.03454 0.00141 0.00001 0.00001 0.00001 0.00910 0.02953 0.00329 0.51442 1.34225 4.97106 19.02405 1.01333 2.84412 0.01871 0.00168 0.00607 0.00001 0.00001 0.06918 0.00145 0.00915 1.21183 0.02504 0.00989 0.00708 0.03095 0.69999 0.00197 0.07595 5.02743 0.00001 0.00151 0.00888 0.00494 1.73876 0.30476 0.03530 0.71014 0.00002 19.45784 62.00310 0.01903 0.04957 3.9298

S5 0.03122 0.00001 0.00001 0.00248 0.00480 0.00870 0.00278 0.00001 0.74331 0.00001 0.00145 0.00928 0.12210 0.00002 0.22181 0.16531 0.17434 0.00596 0.06922 0.00001 0.00625 0.00001 0.00001 0.00001 0.00001 0.00604 0.03321 0.00400 0.45905 0.88147 4.13408 18.00210 4.48332 2.65684 0.01672 0.00001 0.00002 0.00123 0.00001 0.00876 0.00218 0.00213 0.06618 0.00454 0.00583 0.00278 0.00395 0.00868 0.00148 0.08251 0.05244 0.00157 0.00001 0.01075 0.00673 0.01629 0.00600 0.04687 0.01563 0.00002 0.04200 0.14148 0.02965 0.46377 0.5198

75

S6 0.03018 0.00198 0.00001 0.00191 0.00382 0.00837 0.00243 0.00001 0.60835 0.00001 0.00124 0.01023 0.23583 0.00236 0.22239 0.18131 0.19170 0.00672 0.05070 0.00306 0.00377 0.00143 0.00001 0.00001 0.00001 0.00586 0.04527 0.00192 0.55706 0.81303 4.18020 15.16495 0.78392 2.60497 0.02059 0.00001 0.00392 0.00001 0.00001 0.00389 0.00130 0.00325 0.03744 0.00236 0.01158 0.00202 0.00357 0.01182 0.00325 0.04657 0.06257 0.00156 0.00001 0.01062 0.00470 0.02202 0.01800 0.03066 0.02286 0.00003 0.09733 0.14678 0.02268 0.04625 0.4119

S7 0.06200 0.00149 0.00001 0.00537 0.00943 0.01761 0.00980 0.00223 1.98650 0.00223 0.00135 0.02368 0.23551 0.00685 0.47963 0.37619 0.30062 0.01091 0.15392 0.00285 0.01210 0.00260 0.00001 0.00001 0.00278 0.00958 0.08300 0.01055 2.22572 1.55536 7.18940 21.30879 3.75597 5.36347 0.03562 0.00001 0.00338 0.00107 0.00001 0.01681 0.00000 0.00895 0.10954 0.01177 0.02054 0.00254 0.01083 0.02373 0.00462 0.69984 0.16913 0.00157 0.00001 0.02009 0.00943 0.04672 0.01152 0.14144 0.02490 0.00430 0.07296 0.29365 0.11958 0.33944 0.7408

S8 0.04091 0.00001 0.00125 0.00440 0.00501 0.01147 0.00504 0.00180 1.02990 0.00180 0.00112 0.01377 0.19031 0.00502 0.34241 0.25022 0.19092 0.00937 0.05400 0.00287 0.00836 0.00157 0.00156 0.00001 0.00001 0.00759 0.04250 0.00191 0.85959 1.24668 5.00709 17.96421 0.74726 2.66738 0.02443 0.00145 0.00501 0.00121 0.00001 0.00595 0.00131 0.00495 0.04719 0.00413 0.01781 0.00783 0.00768 0.02468 0.00283 0.04641 0.06608 0.00155 0.00152 0.01844 0.00819 0.03749 0.02516 0.07095 0.03909 0.00415 0.15298 0.15760 0.02635 0.06574 0.4937

S9 0.03121 0.00145 0.00001 0.00205 0.00645 0.00799 0.00255 0.00190 0.93329 0.00190 0.00001 0.01182 0.09340 0.00330 0.28947 0.24357 0.16268 0.00868 0.09665 0.00127 0.01018 0.00137 0.00001 0.00001 0.00001 0.00904 0.03583 0.00990 0.67973 1.07228 5.13130 21.81985 1.67142 3.07322 0.02799 0.00216 0.00003 0.00098 0.00001 0.01391 0.00000 0.00155 0.02753 0.00999 0.00823 0.00231 0.00513 0.01489 0.00168 -0.64884 0.21182 0.00135 0.00001 0.01760 0.00529 0.03070 0.00643 0.27420 0.01881 0.00563 0.05906 0.25373 0.02064 0.21618 0.5625

S10 0.04018 0.00001 0.00001 0.00365 0.00393 0.00672 0.00607 0.00001 0.76881 0.00001 0.00001 0.01193 0.23715 0.00002 0.27043 0.18973 0.14766 0.00621 0.05439 0.00139 0.00644 0.00001 0.00208 0.00001 0.00001 0.00556 0.03473 0.00231 0.59845 1.08673 4.32704 18.01645 0.96327 2.39131 0.02186 0.00001 0.01352 0.00001 0.00001 0.00376 0.00118 0.00197 0.02265 0.00275 0.01046 0.00200 0.00360 0.01221 0.00305 0.03282 0.04411 0.00164 0.00001 0.01574 0.00487 0.02039 0.01393 0.03767 0.02383 0.00003 0.10596 0.12771 0.01798 0.03914 0.4651

Table 3.14: Speedup Obtained by the GPU-based Implementations Problem 25FV47 ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BEACONFD BGINDY BGPRTR BLEND BNL1 BNL2 BRANDY CRE A CRE C D2Q06C DEGEN2 DEGEN3 E226 FFFFF800 ISRAEL ITEST2 ITEST6 KLEIN1 KLEIN2 KLEIN3 LOTFI MAROS-R7 OSA-07 OSA-14 OSA-30 QAP12 QAP15 QAP8 SC105 SC205 SC50A SC50B SCAGR25 SCAGR7 SCFXM1 SCFXM3 SCORPION SCRS8 SCSD1 SCSD6 SCSD8 SCTAP1 SCTAP2 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 STOCFOR2 TRUSS WOOD1P WOODW Average

S1 6.0387 10.0000 10.0000 7.8000 8.4000 8.5500 6.4700 10.0000 4.5713 10.0000 10.0000 9.5600 3.8688 6.4500 5.0123 5.6155 5.2000 7.9000 11.3905 8.3400 7.2300 10.0000 10.0000 10.0000 10.0000 6.8131 6.1452 6.4590 7.1468 5.0193 5.6120 5.3418 2.9800 3.9815 7.8001 10.0000 6.1520 10.0000 10.0000 3.8790 10.0000 8.5230 4.3875 6.9150 6.9158 8.3520 9.2930 10.8350 8.4630 4.3875 3.9820 10.0000 10.0000 2.9250 7.3520 4.1720 8.4720 4.9514 4.2918 5.3713 4.9153 3.5162 8.1492 3.5880 7.1795

S2 6.8287 8.9340 11.3980 7.7720 9.3880 8.9300 6.6400 9.8440 4.5153 10.0000 10.2420 11.4980 3.8648 5.8840 4.8614 4.6515 6.6720 6.5520 10.4825 10.2520 7.9160 11.6960 10.0000 10.0000 6.7193 6.9814 5.1461 8.4370 5.5308 4.9252 3.7880 4.7614 2.4740 3.8065 7.2681 11.0840 5.1300 9.6300 11.9940 5.8270 9.7960 10.5090 2.8595 5.8070 6.6578 7.3180 11.0230 8.9370 9.6830 3.4375 5.5200 9.1780 11.0680 4.5090 8.0300 4.7040 6.6980 3.3454 4.4258 5.1493 3.8173 4.2142 8.7032 5.3800 7.2359

S3 7.0147 10.3500 10.8080 8.2080 9.9920 7.0420 5.6160 8.7280 4.4313 10.0000 9.3160 8.7860 3.3248 5.7500 4.1316 4.0918 5.4780 8.5680 10.0265 10.2100 7.7680 11.6160 10.0000 8.5145 10.0000 6.1341 4.1613 5.2730 6.0628 4.7736 3.3264 3.2516 1.9660 3.6315 7.8381 9.5820 7.2700 11.4580 10.8340 3.7690 8.8400 6.7270 3.6855 5.3730 6.0418 9.7700 9.4570 10.5230 9.4470 2.6255 5.5300 10.1200 8.0480 3.3650 6.2760 5.1020 8.0380 3.2174 4.1518 7.1313 5.7153 4.1402 8.1332 2.6320 6.9249

S4 7.4707 7.8540 11.3080 6.8020 11.2600 8.3920 6.1660 9.6720 4.1984 10.0000 11.9360 9.7060 3.9048 5.6480 4.1095 3.9156 6.8920 6.4500 11.6725 9.9380 9.4840 11.0940 10.0000 10.0000 10.0000 5.1416 5.8120 9.4870 5.6708 3.0450 3.9572 4.2944 2.3400 4.0315 6.6721 9.2820 5.1420 10.9980 12.8600 7.6670 10.7820 10.2250 3.2955 4.9850 4.7298 8.8100 11.5930 10.5190 7.9310 2.2595 5.1820 10.3520 10.3360 5.2690 9.4780 6.4060 5.4260 4.8614 4.1958 4.8893 4.6573 3.4082 9.8372 6.9240 7.3535

S5 5.9967 10.9040 11.9780 6.2960 9.7440 5.3780 5.6140 8.5120 5.1209 10.0000 10.7220 8.4060 3.8328 5.0500 5.1341 5.0016 7.2480 7.8520 9.6905 9.8120 7.4880 11.0160 10.0000 10.0000 10.0000 5.1625 5.1679 3.8990 6.4228 4.5306 4.5848 4.4394 0.5080 4.1865 7.4621 7.7420 5.3840 12.6960 11.4960 1.7810 7.1580 7.3410 2.1215 3.4370 5.3518 11.2100 7.9050 10.7870 10.5110 1.7015 4.7600 9.9540 7.5160 4.3550 6.9580 6.7020 7.7980 3.6614 3.9918 5.5553 6.6853 4.3002 6.3132 0.7400 6.7668

76

S6 6.2027 7.8740 13.2640 8.1840 12.2360 7.4580 6.4100 9.8360 6.5134 8.7456 12.5440 9.1500 2.1168 6.6000 5.0919 4.9814 7.0800 6.9640 13.5385 10.2100 8.2680 10.9360 10.0000 10.0000 10.0000 7.9810 4.1356 8.1190 5.4888 5.2540 4.7432 5.4060 3.1840 4.4615 6.8181 9.9380 3.9820 11.8680 12.5740 8.0270 12.0400 9.6110 3.7495 6.6010 5.3898 7.7260 13.1250 9.2370 9.6030 3.3495 4.4880 9.9740 9.6080 5.8730 9.9480 6.3760 3.4660 6.6154 3.4118 3.8153 3.0453 4.5702 8.9432 8.4320 7.5810

S7 5.7867 10.5000 11.1480 5.8100 9.9220 5.3160 6.3680 6.9840 3.9815 8.7145 11.5300 7.2460 4.1068 4.5580 3.9084 3.9916 8.7180 8.5800 8.7165 10.9520 6.4480 11.9880 10.0000 10.0000 5.6189 6.5147 4.1349 2.9570 6.8688 5.3760 5.4572 6.5708 1.3000 4.4065 7.4461 6.7540 4.6220 14.5700 12.8820 3.7110 7.3320 8.7170 3.9875 2.6510 4.5578 12.2960 7.2010 9.2050 10.1310 0.4235 3.1360 9.9600 8.4120 6.2110 8.2720 5.3420 9.4800 2.7574 5.0118 3.6313 7.6973 4.3562 7.8272 2.2060 6.8010

S8 6.1007 7.4800 12.4960 7.0840 12.4560 6.7980 6.1960 8.6740 4.1352 10.0000 13.9640 9.0620 3.2788 6.2120 3.8013 4.0016 7.8440 6.6600 14.1545 10.8640 7.4600 9.9120 10.0000 10.0000 10.0000 6.1620 5.1394 8.1690 3.7748 4.1044 4.3244 4.9622 3.7160 4.7665 8.3021 10.7240 3.1160 12.8860 11.9620 7.8650 11.8760 9.4610 4.2975 7.5570 4.3798 5.9800 12.1850 8.8510 11.0270 4.7055 5.4300 10.0960 10.2340 6.7670 9.5200 6.2420 4.3400 4.8374 3.5918 3.7593 2.5493 6.2362 8.8792 8.0680 7.4918

S9 6.9987 10.7300 9.2560 7.6180 9.6680 5.8560 6.1120 8.1920 4.5131 5.6190 12.0640 9.2380 6.0128 4.7280 4.1503 4.0419 9.3020 7.1900 8.5545 12.2960 4.5960 11.3560 6.7145 10.0000 10.0000 5.1782 6.5303 1.5750 5.6228 4.8212 4.3578 4.1780 1.7360 4.7665 5.5741 7.2060 3.3200 15.8820 10.9160 2.2430 8.9480 10.0650 5.6675 3.1230 5.6878 13.5100 6.0850 8.3790 9.2690 -0.2645 1.2520 11.5600 7.6420 4.4310 8.8420 5.0820 9.7100 0.8534 4.1478 2.7693 5.8113 3.0742 8.3132 1.8040 6.6324

S10 5.0467 6.9740 14.2500 8.5580 11.8980 6.9600 5.1380 9.4120 5.0931 10.0000 13.5220 7.8440 2.1708 5.1940 4.1985 5.0145 8.9800 7.5400 12.3325 11.1880 7.2720 9.1300 7.5145 10.0000 10.0000 5.6134 5.3901 6.7410 4.9528 4.0650 4.6224 4.6330 2.4940 5.1015 6.4221 10.3000 1.1540 12.6580 12.8820 8.3050 13.2460 7.9310 6.1975 5.6810 5.9658 7.7860 12.9890 8.9410 10.2250 4.7535 6.0120 9.5200 9.7040 4.9570 9.6140 6.1220 4.4800 5.3834 3.2738 3.2053 2.7973 5.0082 10.4112 9.9640 7.4177

is memory-bound, so if we compare GPU main memory bandwidth of 144 GB/s to 21 GB/s available per processor on CPU motherboards, we end up with a ratio of about 6.85. The slightly greater speedup can be explained by the use of shared memory. Furthermore, we observe a slightly smaller speedup in large scale LPs. This occurs mainly for two reasons. First, the communication and synchronization cost for small linear problems, e.g. AFIRO (28 constraints and 32 variables), is almost zero. On the other hand, the communication and synchronization cost for large scale LPs is expensive. For example, the communication cost for OSA-30 (4, 351 constraints and 100, 024 variables) is 3.12 seconds, approximately 1/6 of the GPU’s total execution time. The second factor aﬀecting the speedup in large scale LPs is that MATLAB’s inherent multithreading in CPU performs better in large scale LPs. Finally, the results of the GPU are very accurate, because NVIDIA Quadro 6000 is fully IEEE 764 − 2008 compliant 32- and 64-bit fast double-precision.

3.6

Computational Study: Investigating the Impact Of Scaling on Linear Programming Algorithms

In Section 3.5, we reviewed and compared both the CPU- and GPU-based implementations of ten scaling techniques. We have performed a computational study over Netlib and Kennington set and concluded that arithmetic mean, equilibration and geometric mean are the best serial scaling techniques in terms of execution time. In this section, a computational study over a set of sparse randomly generated LPs is performed in order to highlight the impact of scaling prior to the application of IPM, EPSA and simplex algorithm. To the best of our knowledge, this is the ﬁrst attempt that investigates the eﬀect of scaling on IPM, EPSA and simplex algorithm. The IPM that was used in the computational study is MATLAB’s large-scale linprog built-in function. MATLAB’s IPM algorithm is based on Interior Point Solver [124], 77

a primal-dual interior point algorithm, that takes advantage of MATLAB’s sparse matrix functions. The EPSA algorithm that we used has been proposed by [83]. This algorithm outperforms simplex algorithm as the problem size increases and the density decreases. Finally, we have also included in the computational study the revised simplex algorithm proposed by Dantzig [18]. The basis inverse step is performed using a Modiﬁcation of the Product Form of the Inverse (MPFI), proposed by Benhamadou [4]. MPFI updating scheme is faster than other updating schemes [1][86], like Product Form of the Inverse (PFI). The test set used in the computational study was randomly generated. Problem instances have the same number of constraints and variables. The largest problem tested has 3000 constraints and 3000 variables. We have generated these instances with 10% and 20% density. For each problem type of a particular size 10 instances were generated using a diﬀerent seed number. For each instance we averaged times over 10 runs. All times in the following tables are measured in seconds. In this computational study we study the impact of scaling prior to the application of IPM, EPSA and simplex algorithm. So, we have executed these algorithms with and without scaling. Three diﬀerent scaling techniques have been used: (i) arithmetic mean, (ii) equilibration, and (iii) geometric mean scaling techniques. Table 3.15, Table 3.16 and Table 3.17 present the results from the execution of MATLAB’s IPM algorithm, EPSA algorithm and revised simplex algorithm, respectively, over randomly optimal generated problems with density 10%. Table 3.18, Table 3.19 and Table 3.20 present the results from the execution of MATLAB’s IPM algorithm, EPSA algorithm and revised simplex algorithm, respectively, over randomly optimal generated problems with density 20%. In Tables 3.15 – 3.20 the following abbreviations are used: (i) AM - arithmetic mean, (ii) EQ - equilibration, and (iii) GM geometric mean. From the results, we observe that: (i) equilibration is the best scaling technique on all algorithms, (ii) the eﬀect of scaling is more signiﬁcant on IPM and revised simplex algorithm, and (iii) EPSA is scaling invariant.

78

Table 3.15: Results of MATLAB’s IPM Algorithm over Randomly Optimal Generated Problems with Density 10% Density 10% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

IPM time (sec) 8.7 216.81 321.88 1,089.31 1,512.48 629.84

iterations 22 26.1 28.4 29.8 24.3 26.1

IPM with AM scaling time (sec) 7.68 96.72 162.68 399.74 670.84 267.53

iterations 20.2 24.7 24 28.1 25.9 24.6

IPM with EQ scaling time (sec) 7.69 99.54 150.42 395.48 399.08 210.44

iterations 20.5 24.8 23.5 28.1 25.6 24.5

IPM with GM scaling time (sec) 7.69 84.44 150.51 451.47 574.04 253.63

iterations 20 22.3 23.6 29.7 25.8 24.3

Table 3.16: Results of EPSA Algorithm over Randomly Optimal Generated Problems with Density 10% Density 10% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

EPSA time (sec) 6.27 14.4 28.53 35.8 51.9 27.38

iterations 947.5 1,353.50 4,397.30 2,007.90 2,051.20 2,151.50

EPSA with AM scaling

EPSA with EQ scaling

EPSA with GM scaling

time (sec) 5.69 14.35 27.67 34.69 51.26 26.73

time (sec) 5.85 14.35 27.28 34.69 51.07 26.65

time (sec) 5.99 14.21 27.69 34.69 50.83 26.68

iterations 944.3 1,333.40 4,284.50 1,957.60 2,036.70 2,111.30

iterations 965.3 1,350.40 3,770.90 2,007.50 2,104.10 2,039.60

iterations 970.3 1.310.1 4.202.9 1.973.7 2.018.2 2,095.00

Table 3.17: Results of Revised Simplex Algorithm over Randomly Optimal Generated Problems with Density 10% Density 10% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

RSM time (sec) 348.02 1,375.35 3,651.17 7,945.33 15,450.59 5,754.09

iterations 33,792.40 60,393.30 94,639.70 135,998.50 173,233.60 99,611.50

RSM with AM scaling time (sec) 222.38 864.65 2,512.86 5,119.14 9,858.00 3,715.41

iterations 19,787.80 37,145.60 63,071.70 84,925.10 111,358.20 63,257.70

RSM with EQ scaling time (sec) 114.74 495.85 1,768.99 3,657.21 6,680.26 2,543.41

iterations 10,046.10 21,911.40 41,846.50 57,218.90 73,303.10 40,865.20

RSM with GM scaling time (sec) 318.49 1,124.91 3,245.30 6,514.19 12,005.41 4,641.66

iterations 32,061.30 53,154.40 82,613.50 115,013.60 146,513.10 85,871.20

Table 3.18: Results of MATLAB’s IPM Algorithm over Randomly Optimal Generated Problems with Density 20% Density 20% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

IPM time (sec) 69.4 131.46 260.74 633.91 840.08 387.12

iterations 23 19.1 24.4 23.7 24.2 22.8

IPM with AM scaling time (sec) 26.89 51.5 111.45 271.46 328.9 158.04

iterations 22.3 23.5 26 25.1 25.3 24.4

IPM with EQ scaling time (sec) 21.81 51.22 110.68 246.72 271.82 140.45

iterations 22.1 24.9 26.8 26.4 27.6 25.6

IPM with GM scaling time (sec) 28.3 56.5 96.6 312.45 269.13 152.6

iterations 23 22.1 25.6 25 26.4 24.4

Table 3.19: Results of EPSA Algorithm over Randomly Optimal Generated Problems with Density 20% Density 20% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

EPSA time (sec) 2.29 6.91 22.14 35.05 73.65 28.01

iterations 612.4 819.7 1,361.20 1,433.40 2,032.70 1,251.70

EPSA with AM scaling

EPSA with EQ scaling

EPSA with GM scaling

time (sec) 2.23 6.87 21.81 34.9 72.48 27.66

time (sec) 2.28 6.88 21.84 33.62 72.71 27.47

time (sec) 2.26 6.33 21.9 34.76 72.66 27.58

iterations 618.4 840.5 1,354.70 1,466.10 2,028.80 1,261.30

79

iterations 620.1 824.4 1,350.20 1,407.70 2,032.90 1,246.70

iterations 641.1 771.4 1,373.30 1,445.80 2,027.00 1,251.50

Table 3.20: Results of Revised Simplex Algorithm over Randomly Optimal Generated Problems with Density 20% Density 20% 1000x1000 1500x1500 2000x2000 2500x2500 3000x3000 Average

3.7

RSM time (sec) 179.9 899.75 2,136.92 5,671.15 12,549.91 4,287.53

iterations 19,907.10 41,427.40 58,260.80 105,135.20 145,013.40 73,948.80

RSM with AM scaling time (sec) 139.54 468.39 1,975.36 4,561.93 7,984.15 3,025.87

iterations 13,235.10 22,501.60 48,373.40 70,193.40 100,451.70 50,951.00

RSM with EQ scaling time (sec) 73.51 395.71 1,316.99 3,813.18 6,198.10 2,359.50

iterations 7,462.10 17,265.70 32,272.20 50,145.40 70,135.10 35,456.10

RSM with GM scaling time (sec) 197.09 781.89 2,010.17 5,103.51 9,814.13 3,581.36

iterations 18,858.50 35,868.30 55,272.80 91,031.10 125,012.30 65,208.60

Conclusions

Scaling is a de facto preconditioning technique that is applied by linear optimization solvers prior to the execution of a linear programming algorithm. By scaling an LP, the condition number of the constraint matrix can be reduced and that can lead to the reduction of the iterations and the solution time of the simplex algorithm. In this chapter, we reviewed and implemented ten scaling methods and proposed GPUbased implementations for them using MATLAB and CUDA. Nowadays, GPUs are used in scientiﬁc programming, because they outperform high-end CPUs in data intensive problems. We performed a computational study and found that GPU-based implementations outperform the serial ones. More speciﬁcally, the speedup gained from all scaling techniques is about 7x. The results are very promising and provide hope for fast GPU-based implementations for linear programming algorithms. Furthermore, we studied the impact of scaling prior to the application of the interior point methods, exterior point methods and revised simplex algorithm. We performed a computational study and found that equilibration is the best scaling technique on all types of algorithms. Furthermore, scaling is more important on IPM and revised simplex algorithm achieving to reduce signiﬁcantly both the solution time and the number of iterations needed by the algorithms. Finally, EPSA is proved to be scaling invariant.

80

Chapter 4 Pivoting Rules 4.1

Introduction

Simplex type algorithms perform successive pivoting operations (or iterations) in order to reach the optimal solution. The choice of the pivot element at each iteration is one of the most critical steps in simplex type algorithms. The ﬂexibility of the entering and leaving variable selection allows to develop various pivoting rules. We propose some of the most well–known pivoting rules for the revised simplex algorithm on a CPU–GPU computing environment. All pivoting rules have been implemented using MATLAB and CUDA. Computational results on randomly generated optimal dense linear programs and on a set of benchmark problems (netlib–optimal, kennington, netlib–infeasible, M´esz´aros) are also presented. These results showed that the proposed GPU-based implementations of the pivoting rules outperform the corresponding CPU-based implementations. The structure of the chapter is as follows. In Section 4.2, we present the background of this chapter. In Section 4.3, six pivoting rules are presented and analyzed. Section 4.4 presents the GPU-based implementations of these pivoting rules. In Section 4.5, the computational comparison of the CPU- and GPU- based implementations obtained with a set of randomly generated optimal dense LPs and a set of benchmark instances is presented. Finally, the conclusions of this chapter are outlined in section 4.6. 81

4.2

Background

A critical step in solving a linear program is the selection of the entering variable in each iteration. Good choices of the entering variable can lead to fast convergence to the optimal solution, while poor choices lead to more iterations and worst execution times or even no solutions of the LPs. The pivoting rule applied to a simplex type algorithm is the key factor that will determine the number of iterations that the algorithm performs [70]. Diﬀerent pivoting rules yield diﬀerent basis sequences in the simplex algorithm. Many pivoting rules have been proposed in the literature. Six of them are implemented and compared in this chapter, namely: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, and (vi) Steepest Edge rule. Other well-known pivoting rules include Devex [47], Modiﬁed Devex [5], Steepest Edge approximation scheme [110], Murty’s Bard type scheme [76], Edmonds-Fukuda rule [31] and its variants [15] [119] [123] [125]. Forrest and Goldfarb [28] presented several new serial implementations of steepest edge pivoting rule and compared them with Devex variants and Dantzig’s rule over large LPs and concluded that steepest-edge variants are clearly superior to Devex variants and Dantzig’s rule for solving diﬃcult large-scale LPs. Thomadakis [111] has implemented and compared the serial implementations of ﬁve pivoting rules: (i) Dantzig’s rule, (ii) Bland’s rule, (iii) Least-Recently Considered Method, (iv) Greatest-Increment Method, and (v) Steepest-Edge rule. Thomadakis examined the trade-oﬀ between the number of iterations and the execution time per iteration and concluded that: (i) Bland’s rule has the shortest execution time per iteration, but it usually needs many more iterations than the other pivoting rules to converge to the optimal solution of a linear program, (ii) Dantzig’s rule and Least Recently Considered Method perform comparably, but the latter requires fewer iterations when degenerate pivots exist, (iii) Greatest Increment Method has the worst execution time per 82

iteration, but it usually needs fewer iterations to converge to the optimal solution, and (iv) Steepest-Edge rule requires fewer iterations than all the other pivoting rules and its execution time per iterations is lower than Greatest Increment Method but higher than the other three pivoting rules. Ploskas and Samaras [87] have implemented and compared the serial implementations of eight pivoting rules over small- and medium-sized Netlib LPs: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, (vi) Queue rule, (vii) Stack rule, and (viii) Steepest Edge rule. In contrast to Thomadakis, Ploskas and Samaras also examined the total execution time of the simplex algorithm relating to the pivoting rule that is used and concluded that: (i) with a limit of 70, 000 iterations, only Dantzig’s rule has solved all instances of the test set, while Bland’s rule solved 45 out of 48 instances, Greatest Increment method 46 out of 48, Least Recently Considered method 45 out of 48, Partial Pricing rule 45 out of 48, Queue’s rule 41 out of 48, Stacks’ rule 43 out of 48, and Steepest Edge rule 46 out of 48, (ii) Dantzig’s rule requires the shortest execution time both on average and on almost all instances, while Steepest Edge rule has the worst execution time both on average and on almost all instances, and (iii) despite its computational cost, Steepest Edge rule needs the fewest number of iterations than all the other pivoting rules, while Bland’s rule is by far the worst pivoting rule in terms of the number of iterations. To the best of our knowledge, Forrest’s and Goldfarb’s paper [28], Thomadakis’ report [111] and Ploskas’ and Samaras’ paper [87] are the only papers that compares some of the most widely-used pivoting rules. Yet, there has been no attempt to implement and compare pivoting rules using GPUs. This work builds on the work of Ploskas and Samaras [87]. The novelty of this work is that we propose GPU-based implementations of six pivoting rules and perform a detailed computational study on large-scale randomly generated optimal dense LPs and on a set of benchmark LPs of various sources. The aim of the computational study is to establish the practical value of the GPU-based implementations. The computational results demonstrate that the proposed GPU-based implementations outperform the CPU-based implementations 83

of the six pivoting rules.

4.3

Pivoting Rules

Six pivoting rules are presented in this section: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing Rule, and (vi) Steepest Edge Rule. Some necessary notations should be introduced, before the presentation of the aforementioned pivoting rules. Let l be the index of the entering variable and cl be the diﬀerence in the objective value when the non-basic variable xl is increased by one unit and the basic variables are adjusted appropriately. Reduced cost is the amount by which the objective function on the corresponding variable must be improved before the value of the variable will be positive in the optimal solution.

4.3.1

Bland’s Rule

Bland’s rule [10] selects as entering variable the ﬁrst among the eligible ones, i.e. the leftmost among columns with negative relative cost coeﬃcient. Whereas Bland’s rule avoids cycling, it has been observed in practice that this pivoting rule can lead to stalling, a phenomenon where long degenerate paths are produced.

4.3.2

Dantzig’s Rule

Dantzig’s rule or largest coeﬃcient rule [20] is the ﬁrst pivoting rule that was proposed for the simplex algorithm. It has been widely-used in simplex implementations [3] [80]. This pivoting rule selects the column Al with the most negative cl . Dantzig’s rule ensures the largest reduction in the objective value per unit of non-basic variable cl increase. Although its worst-case complexity is exponential [54], Dantzig’s rule is considered as simple but powerful enough to guide simplex algorithm into short paths [111]. 84

4.3.3

Greatest Increment Method

The entering variable in the Greatest Increment Method [54] is the one with the largest total objective value improvement. The improvement of the objective value is computed for each non-basic variable. The variable that oﬀers the largest improvement in the objective value is selected as the entering variable. This pivoting rule can lead to fast convergence to the optimal solution. However, this advantage is eliminated by the additional computational cost per iteration. Finally, G¨artner [34] constructed LPs where Greatest Increment Method showed exponential complexity.

4.3.4

Least Recently Considered Method

In the ﬁrst iteration of the Least Recently Considered Method [122], the entering variable l is selected according to Dantzig’s rule (i.e. the column Al with the most negative cl ). In the following iterations, it starts searching for the ﬁrst eligible variable with index greater than l. If l = n then Least Recently Considered Method starts searching from the ﬁrst column again. Least Recently Considered Method prevents stalling and has been shown that it performs fairly well in practice [111]. However, its worst-case complexity has not been proved yet.

4.3.5

Partial Pricing Rule

Full pricing considers the selection of the entering variable from all eligible variables. Partial Pricing methods are variants of the standard rules that take only a part of the non-basic variables into account. In the computational study presented in Section 4.5, we have implemented the partial pricing rule as a variant of Dantzig’s rule. In static partial pricing, non-basic variables are divided into equal segments with predeﬁned size. Then, the pricing operation is carried out segment by segment until a reduced cost is found. In dynamical partial pricing, the segments’ size is determined dynamically during the execution of the algorithm. 85

4.3.6

Steepest Edge Rule

Steepest Edge Rule or All-Variable Gradient Method [38] selects as entering variable the variable with the most objective value reduction per unit distance, as shown in Equation (4.1): { dj = min

√ 1+

cl ∑m i=1

} x2il

: l = 1, 2, ..., n

(4.1)

This pivoting rule can lead to fast convergence to the optimal solution. However, this advantage is debatable due to the additional computational cost. Approximate methods have been proposed in order to improve the computational eﬃciency of this method [110] [115].

4.4

GPU Accelerated Pivoting Rules

In this section, we present the GPU-based implementations of the aforementioned pivoting rules, taking advantage of the power that modern GPUs oﬀer. The GPUbased implementations of these pivoting rules are implemented using MATLAB and CUDA. MATLAB’s GPU support was added in version R2011a. MATLAB’s GPU library does not support sparse matrices, so, all matrices are stored as dense (including zero elements).

4.4.1

Implementation of GPU-Based Pivoting Rules

Figure 4-1 presents the process that is performed in the GPU-based implementation of the simplex algorithm. Firstly, the CPU initializes the algorithm by reading all the necessary data. In the second step, the CPU computes a feasible solution and checks if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates, while if it is not the CPU transfers the adequate variables to the GPU. In step 3, the GPU ﬁnds the index of the entering variable according to a pivoting rule and then transfers the index to the CPU. The CPU checks if the linear problem 86

Figure 4-1: Flow Chart of the GPU-based Pivoting Rules

is unbounded in order to terminate the algorithm. Otherwise, it ﬁnds the index of the leaving variable and updates the basis and all the necessary variables. Then, the algorithm continues with the next iteration until a solution is found.

4.4.2

Notations

Some necessary notations should be introduced, before the presentation of the pseudocodes of each pivoting rule. Let A be an mxn matrix. Let N onBasicList be an mx(n − m) vector with the indices of the non basic variables. Let BasisInv be an mxm matrix with the basis inverse. Let Xb be an 1xm vector with the values of the basic variables. Let Sn be an 1xn vector with dual slack variables. Let l be the index of the entering variable. Let ./ denote the element–wise division of two vectors and X ′ denotes the transpose of the matrix X. Let lrcmLast be the index of the last selected entering variable in the Least Recently Considered Method. Let segmentSize be the ﬁxed size of the segments and lastSegment the segment that the last entering 87

Table 4.1: GPU-based Bland’s Rule 1. 2. 3. 4. 5.

transfer to GPU the vector Sn do parallel find the index of the leftmost negative element of vector Sn end parallel transfer the index to CPU

variable was found in the Partial Pricing rule. The pseudocodes of all pivoting rules are presented in the following sub-sections of this section. Pseudocodes include “do parallel” and “end parallel” sections, in which the workload is divided into warps that are executed sequentially on a multiprocessor. The warp size in NVIDIA Quadro 6000 is 32 threads.

4.4.3

GPU-based Bland’s Rule

Table 4.1 shows the pseudocode of the implementation of Bland’s rule on a GPU. Initially, the CPU transfers the vector Sn to the GPU (line 1). Then, the index of the leftmost negative element of Sn is calculated in parallel (lines 2 - 4). Finally, the GPU transfers the index to the CPU (line 5).

4.4.4

GPU-based Dantzig’s Rule

Table 4.2 shows the pseudocode of the implementation of Dantzig’s rule on a GPU. Initially, the CPU transfers the vector Sn to the GPU (line 1). Then, the index of the minimum element of Sn is calculated in parallel (lines 2 - 4). Finally, the GPU transfers the index to the CPU (line 5).

4.4.5

GPU-based Greatest Increment Method

Table 4.3 shows the pseudocode of the implementation of Greatest Increment Method on a GPU. Initially, the CPU transfers the vectors Sn, N onBasicList, Xb and A(:, l) (the lth column of array A) and array BasisInv to the GPU (line 1). Then, for each 88

Table 4.2: GPU-based Dantzig’s Rule 1. 2. 3. 4. 5.

transfer to GPU the vector Sn do parallel find the index of the minimum element of Sn end parallel transfer the index to CPU

candidate variable with negative coeﬃcient cost, the improvement in the objective value is calculated (lines 4 - 17). The index of the variable that oﬀers the largest improvement in the objective value is selected (lines 18 - 19). Finally, the GPU transfers the index to the CPU (line 21).

4.4.6

GPU-based Least Recently Considered Method

Table 4.4 shows the pseudocode of the implementation of Least Recently Considered Method on a GPU. Initially, the CPU transfers the vectors Sn and N onBasicList to GPU (line 1). In the ﬁrst iteration of the algorithm, the incoming variable is selected according to Dantzig’s rule (lines 2 - 5). In the following iterations, Least Recently Considered Method ﬁnds the index of the ﬁrst eligible variable with index greater than lrcmLast (lines 6 - 10). Finally, the GPU transfers the index to the CPU (line 11).

4.4.7

GPU-based Partial Pricing Rule

Table 4.5 shows the pseudocode of the implementation of the Partial Pricing Rule on a GPU. Initially, the CPU transfers the values segmentSize and lastSegment and vector Sn to the GPU (line 1). Dantzig’s rule is performed in the segment denoted by the variable lastSegment (line 4). If a reduced cost does not exist in the speciﬁc segment, then we search in the next segment (lines 5 - 7). Finally, the GPU transfers the index to the CPU (line 10). 89

Table 4.3: GPU-based Greatest Increment Method 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21.

transfer to GPU vectors Sn, NonBasicList, Xb and arrays A(:,l) and BasisInv do parallel localMaxDecrease = inf for i=1:length(NonBasicList) if(Sn(i) < 0) l = NonBasicList(i) h_l = BasisInv * A(:,l) mrt = indices of the positive elements of h_l Xb_hl_div = Xb(mrt) ./ h_l(mrt) theta0 = minimum index of positive elements of vector Xb_hl_div currentDecrease = theta0 * Sn(column) if(currDecrease < localMaxDecrease) localMaxDecrease = currDecrease index = i end end end maxDecrease = minimum decrease of all localMaxDecrease find the index of the variable where the maxDecrease occurs end parallel transfer the index to CPU

Table 4.4: GPU-based Least Recently Considered Method 1. 2. 3. 4.

transfer to GPU the vectors Sn and NonBasicList if(iterations == 1) do parallel find the index of the minimum element of Sn and store it to variable lrcmLast 5. end parallel 6. else 7. do parallel 8. find the index of the first eligible variable with index greater than lrcmLast and store it to variable lrcmLast 9. end parallel 10. end 11. transfer the index to CPU

90

Table 4.5: GPU-based Partial Pricing Rule 1. 2. 3. 4. 6. 7. 8. 9. 10.

transfer to GPU the values segmentSize and lastSegment and vector Sn while(index == null) do parallel find the index of the minimum element in the lastSegment of Sn lastSegment = lastSegment + 1 end end parallel end transfer the index to CPU

Table 4.6: GPU-based Steepest Edge Rule 1. 2. 3. 4. 5. 6. 7. 8.

transfer to GPU the vectors Sn, NonBasicList and arrays A and BasisInv do parallel Y = BasisInv * A(:,NonBasicList) dj = sqrt(1 + diag(Y’ * Y)) rj = Sn’ ./ dj find the index of the minimum element of the vector rj end parallel transfer the index to CPU

4.4.8

GPU-based Steepest Edge Rule

Table 4.6 shows the pseudocode of the implementation of the Steepest Edge rule on a GPU. Initially, the CPU transfers the vectors Sn, N onBasicList and arrays A and BasisInv to the GPU (line 1). Then, the index of the incoming variable is calculated according to the Equation (5.1) (lines 3 - 6). Finally, the GPU transfers the index to the CPU (line 8).

4.5

Computational Results

Computational studies have been widely used in order to examine the practical eﬃciency of an algorithm or even compare algorithms. The computational comparison of the aforementioned pivoting rules has been performed on a quad-processor Intel Core i7 3.4 GHz with 32 Gbyte of main memory and 8 cores, a clock of 3700 MHz, 91

an L1 code cache of 32 KB per core, an L1 data cache of 32 KB per core, an L2 cache of 256 KB per core, an L3 cache of 8 MB and a memory bandwidth of 21 GB/s, running under Microsoft Windows 7 64-bit and on a NVIDIA Quadro 6000 with 6 GB GDDR5 384-bit memory, a core clock of 574 MHz, a memory clock of 750 MHz and a memory bandwidth of 144 GB/s. It consists of 14 stream processors with 32 cores each resulting in 448 total cores. The graphics card driver installed in our system is NVIDIA 64 kernel module 306.23. The CPU-based algorithms have been implemented using MATLAB Professional R2012b. MATLAB (MATrix LABoratory) is a powerful programming environment and is speciﬁcally designed for matrix computations in general. CPU-based pivoting rules automatically execute on multiple computational threads in order to take advantage of the multiple cores of the CPU. The execution time of all CPU-based pivoting rules already includes the performance beneﬁt of the inherent multithreading in MATLAB. Execution times both on CPU- and GPU-based pivoting rules have been measured in seconds using tic and toc MATLAB’s built-in functions. Finally, the results of the GPU are very accurate, because NVIDIA Quadro 6000 is fully IEEE 764-2008 compliant 32- and 64-bit fast double-precision. In the computational study, two test beds were used: (i) randomly generated dense optimal LPs (problem instances have the same number of constraints and variables and the largest problem tested has 3000 constraints and 3000 variables), and (ii) ten large-scale LPs from the Netlib set (optimal, Kennington and infeasible LPs) [13] [35] and the Misc section of M´esz´aros collection [74]. These benchmarks do not have bounds and ranges sections in their mps ﬁles. We have implemented an mps reader to read mps ﬁles and translate the data into MATLAB mat ﬁles. The Netlib library is a well known suite containing many real world LPs. Ord´on ˜ez and Freund [79] have shown that 71% of the Netlib LPs are ill-conditioned. Hence, numerical diﬃculties may occur. All runs terminated with correct optimal objective values reached as those in Netlib index ﬁle. For each instance we averaged times over 10 runs. Table 4.7 presents some useful information about the second test bed, which was used in the computational study. The ﬁrst column includes the name of the problem, the 92

Table 4.7: Statistics of the Netlib (optimal, Kennington and infeasible LPs) and M´esz´aros LPs Name Constraints Variables Nonzeros A Optimal Objective value BNL1 644 1,175 6,129 1.98E+03 3,517 4,067 19,054 2.35E+07 CRE-A CRE-C 3,069 3,678 16,922 2.52E+07 KLEIN1 55 54 696 Infeasible KLEIN2 478 54 4,585 Infeasible ROSEN1 520 1,544 23,794 -2.76E+04 ROSEN2 1,032 3,080 47,536 5.44e+04 SCFXM3 991 1,371 7,846 5.49E+04 1,481 2,480 1,0734 1.42E+03 SCTAP3 STOCFOR2 2,158 2,031 9,492 -3.90E+04

second the number of constraints, the third the number of variables, the fourth the nonzero elements of matrix A and the ﬁfth the optimal objective value. The test bed includes 4 optimal and 2 infeasible LPs from Netlib, 2 Kennington and 2 LPs from M´esz´aros collection.

In Tables 4.8 - 4.13 and Figures 4-2 - 4-5, the following abbreviations are used: (i) Bland’s rule - BR, (ii) Dantzig’s rule - DR, (iii) Greatest Increment method - GIM, (iv) Least Recently Considered method - LRCM, (v) Partial Pricing rule - PPR, and (vi) Steepest Edge rule - SER. Tables 4.8 and 4.9 present the iterations needed by each pivoting rule over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively. Both on the randomly generated optimal dense LPs and on the Netlib and M´esz´aros set of LPs, Steepest Edge rule requires the fewest iterations than the other rules, whereas Bland’s rule is by far the worst pivoting rule in terms of the number of iterations.

Tables 4.10 and 4.11 present the results from the execution of the CPU-based implementations of the pivoting rules over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively. Tables 4.12 and 4.13 present the results from the execution of the GPU-based implementations of pivoting rules over the randomly generated optimal dense LPs and over the Netlib and 93

Table 4.8: Number of Iterations Name BR 500x500 8,451 1,000x1,000 29,786 1,500x1,500 77,028 2,000x2,000 84,224 2,500x2,500 144,216 3,000x3,000 214,875 AVERAGE 93,097

over the DR 542 1,792 3,938 3,986 5,872 8,130 4,043

Randomly Generated Optimal Dense LPs GIM LRCM PPR SER 362 4,851 5,528 197 894 24,683 25,931 548 1,856 54,987 61,243 779 1,934 77,139 79,134 1,042 4,569 101,369 112,041 1,180 6,431 152,310 160,134 1,508 2,674 69,223 74,002 876

Table 4.9: Number of Iterations Over the Netlib and M´esz´aros Set of LPs Name BR DR GIM LRCM PPR SER BNL1 41,816 4,577 1,570 32,400 15,120 843 39,371 5,431 3,091 14,231 4,821 2,066 CRE-A CRE-C 30,565 5,710 3,102 22,461 14,735 1,736 KLEIN1 102 115 99 238 173 106 KLEIN2 1,398 258 221 913 940 272 ROSEN1 12,322 3,444 958 6,341 10,267 519 ROSEN2 28,282 6,877 2,103 14,151 22,156 1,023 SCFXM3 7,411 1,167 2,313 6,684 5,948 879 SCTAP3 2,459 2,215 1,655 3,195 3,396 589 6,454 1,074 1,534 10,985 5,992 1,125 STOCFOR2 AVERAGE 17,018.00 3,086.80 1,664.60 11,159.90 8,354.80 915.80

94

Table 4.10: Total Execution Time of the CPU-based Implementations over domly Generated Optimal Dense LPs Name BR DR GIM LRCM PPR 500x500 54.71 3.42 12.85 31.57 36.34 1,000x1,000 1,317.40 62.75 945.34 861.55 906.77 1,500x1,500 8,677.64 433.02 8,134.51 6,056.12 6,834.04 2,000x2,000 19,057.93 888.48 16,098.19 12,365.41 13,654.98 2,500x2,500 59,884.37 2,872.15 51,013.45 38,561.09 41,361.77 3,000x3,000 148,775.20 7,037.62 125,678.43 90,561.65 98,134.15 AVERAGE 39,627.87 1,882.91 33,647.13 24,739.57 26,821.34

the RanSER 5.58 162.43 713.43 2,034.12 4,761.34 9,451.34 2,854.71

Table 4.11: Total Execution Time of the CPU-based implementations over the Netlib and M´esz´aros Set of LPs Name BR DR GIM LRCM PPR SER BNL1 436.91 58.29 180.41 337.88 158.14 113.94 CRE-A 28,028.03 4,286.12 22,013.45 17,984.34 19,349.18 16,541.09 CRE-C 12,525.00 2,583.08 9,845.45 7,034.98 8,045.61 7,683.43 KLEIN1 0.02 0.02 0.03 0.04 0.03 0.03 KLEIN2 5.89 1.30 1.22 3.74 3.90 1.51 ROSEN1 102.34 28.51 88.45 52.54 87.89 51.49 ROSEN2 1,381.47 345.88 2,564.34 694.96 983.23 690.52 SCFXM3 206.21 38.56 167.34 186.28 168.25 156.14 SCTAP3 222.12 214.36 287.41 286.76 309.60 543.28 STOCFOR2 1,402.98 258.01 2,985.91 2,394.38 1,362.05 768.93 AVERAGE 4,431.10 781.41 3,813.40 2,897.59 3,046.79 2,655.04 M´esz´aros set of LPs, respectively. Figures 4-2 and 4-3 present the average speedup for each pivoting rule over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively.

The results show that only the Steepest Edge rule can be eﬃciently implemented on a GPU. All the other pivoting rules present trivial speedups. Figures 4-4 and 4-5 present the average portion of time that each CPU-based pivoting rule needs to calculate the entering variable against the total time over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively. The results of Figures 4-4 and 4-5 highlight that only Steepest Edge rule’s total execution time is dictated by the pivoting time. The other pivoting rules perform trivial pivoting operations, so, it is not worth to implement them on a GPU (Greatest Increment 95

Table 4.12: Total Execution Time of the GPU-based Implementations over the Randomly Generated Optimal Dense LPs Name BR DR GIM LRCM PPR SER 500x500 45.59 3.16 10.44 28.90 32.35 2.07 1,000x1,000 1,029.86 58.61 716.14 842.31 880.05 53.34 1,500x1,500 8,592.08 412.31 6,025.51 5,834.13 6,004.89 213.45 2,000x2,000 16,429.25 800.43 13,416.43 11,845.64 12,566.67 501.49 2,500x2,500 52,530.14 2,804.20 38,355.97 36,798.88 38,905.64 1,040.54 3,000x3,000 123,979.31 6,459.08 95,937.73 85,139.04 91,003.85 2,305.66 AVERAGE 33,767.71 1,756.30 25,743.70 23,414.82 24,898.91 686.09

Table 4.13: Total Execution Time of the GPU-based Implementations over the Netlib and M´esz´aros Set of LPs Name BR DR GIM LRCM PPR SER BNL1 368.08 50.42 152.34 322.87 146.82 17.93 CRE-A 22,098.69 3,477.51 17,800.55 16,490.55 17,234.05 2,832.34 CRE-C 9,899.01 2,145.31 8,156.44 6,500.67 7,231.55 1,448.00 KLEIN1 0.02 0.02 0.03 0.03 0.03 0.04 4.96 1.15 1.03 3.56 3.65 1.08 KLEIN2 ROSEN1 88.78 24.16 74.98 51.66 81.47 15.45 ROSEN2 1,211.45 276.70 2,231.01 663.49 910.22 200.05 SCFXM3 172.51 26.66 145.50 177.86 155.87 36.85 SCTAP3 190.41 201.54 245.08 273.88 286.60 95.20 STOCFOR2 1,201.40 223.45 2,501.58 2,156.00 1,190.51 354.61 AVERAGE 3,523.53 642.69 3,130.85 2,664.06 2,724.08 500.16

Figure 4-2: Average Speedup over the Randomly Generated Optimal Dense LPs

96

Figure 4-3: Average Speedup over the Netlib and M´esz´aros Set of LPs

Method also spends a lot of time in the communication between the CPU and the GPU). These ﬁndings lead us to conclude that only the Steepest Edge rule can be implemented on a GPU. Many CPU parallel implementations of the Steepest Edge rule have been proposed [9] [42] [112], while Meyer et. al. [75] proposed a multi-GPU implementation for the classical simplex algorithm using Steepest Edge as the pivoting rule.

Tables 4.14 and 4.15 present the time to perform the selection of the entering variable both on the CPU- and GPU-based implementation of the Steepest Edge rule and the speedup over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively. Figure 4-6 presents the speedup of the total and the pivoting time of the Steepest Edge rule over the randomly generated optimal dense LPs and over the Netlib and M´esz´aros set of LPs, respectively. The results show a maximum speedup on the pivoting step of 16.72 (on the 3000x3000 dense random LP) and an average of 10.47 and 9.21 for the the randomly generated optimal dense LPs and the Netlib and M´esz´aros set of LPs, respectively. The speedup obtained from the pivoting is not depicted exactly in the speedup of the total time for two reasons: (i) only the selection of the entering variable has been implemented on the GPU, and (ii) the communication cost is still expensive (CPU transfers the adequate 97

Figure 4-4: Average Portion of Pivoting Time to Total Execution Time over the Randomly Generated Optimal Dense LPs

Figure 4-5: Average Portion of Pivoting Time to Total Execution Time over the Netlib and M´esz´aros Set of LPs

98

Table 4.14: Pivoting Execution Time on the CPU- and GPU-based Implementations and Speedup of the Steepest Edge Rule over the Randomly Generated Optimal Dense LPs Problem CPU GPU Speedup 500x500 4.32 0.87 4.95 1,000x1,000 115.77 20.51 5.64 1,500x1,500 545.33 73.59 7.41 12.91 2,000x2,000 1,602.21 124.11 2,500x2,500 3,982.61 262.00 15.20 3,000x3,000 7,532.62 450.45 16.72 AVERAGE 2,297.14 155.26 10.47 Table 4.15: Pivoting Execution Time on the CPU- and GPU-based Implementations and Speedup of the Steepest Edge Rule over the Netlib and M´esz´aros Set of LPs Problem CPU GPU Speedup BNL1 98.32 9.17 10.72 CRE-A 14,105.19 1,078.01 13.08 CRE-C 6,503.03 573.55 11.34 KLEIN1 0.02 0.02 1.00 KLEIN2 0.87 0.41 2.15 40.67 5.65 7.20 ROSEN1 ROSEN2 533.34 50.74 10.51 136.97 10.80 12.69 SCFXM3 SCTAP3 485.27 36.99 13.12 STOCFOR2 533.79 51.79 10.31 AVERAGE 2,243.75 181.71 9.21 variables to the GPU at each iteration). For example, the speedup of the pivoting time achieved for ROSEN1 linear program is 7.20, while the speedup of the total time is 3.33. The communication cost for ROSEN1 is 4.45 seconds, approximately 1/3 of the total execution time.

4.6

Conclusions

A good selection of the pivoting rule for linear optimization solvers can lead to the reduction of the iterations and the solution time of an LP. GPUs have been already applied for the solution of linear optimization algorithms, but GPU-based implementations of the pivoting rules have not yet been studied. In this chapter, we reviewed 99

Figure 4-6: Speedup of the Total and the Pivoting Time of the Steepest Edge Rule

and implemented six widely-used pivoting rules and proposed GPU-based implementations for them using MATLAB and CUDA. We performed a computational study on large-scale randomly generated optimal dense LPs, the Netlib (optimal, Kennington, and infeasible LPs) and the M´esz´aros set and found that only Steepest Edge rule is suitable for GPUs, because the total time of the algorithm is dictated by the time to perform the pivoting. The maximum speedup gained from the pivoting operation of steepest edge rule is 16.72.

100

Chapter 5 Basis Update Methods 5.1

Introduction

The computation of the basis inverse is the most time-consuming step in simplex type algorithms. This inverse does not have to be computed from scratch at any iteration, but updating schemes can be applied to accelerate this calculation. In this chapter, we perform a computational comparison in which the basis inverse is computed with ﬁve diﬀerent updating schemes. Then, we propose an implementation of two updating schemes on a CPU-GPU System using MATLAB and CUDA environment. A computational study on randomly generated full dense linear programs is presented to establish the practical value of the GPU-based implementations. Furthermore, we incorporate three of these methods on the exterior and the revised simplex algorithm in order to highlight the signiﬁcance of the choice of the basis update method on simplex type algorithms and the reduction that it can oﬀer to the solution time. Finally, we perform a computational comparison in which the basis inverse is computed with the aforementioned updating methods. The Netlib set of LPs was used in the computational experiments. All these algorithms and methods have been implemented with MATLAB. The structure of this chapter is as follows. In Section 5.2, we present the background of this chapter. In Section 5.3, ﬁve methods that have been widely used for basis inversion are presented and analyzed. Section 5.4 presents the computational comparison 101

of the basis updating schemes. Computational tests were carried out on randomly generated LPs of at most 5000 variables and 5000 constraints. Section 5.5 includes the implementations of PFI and MPFI updating schemes on a CPU-GPU system, while Section 5.6 presents the computational results of GPU-based implementations over randomly generated LPs of at most 5000 variables and 5000 constraints. In Section 5.7, we perform a computational comparison in which the basis inverse is computed with three updating methods and incorporate them on the exterior and the revised simplex algorithm in order to highlight the signiﬁcance of the choice of the basis update method on simplex type algorithms. The Netlib set of LPs was used in these computational experiments. Finally, the conclusions of this chapter are outlined in section 5.8.

5.2

Background

As in the solution of any large scale mathematical system, the computational time for large LP problems is a major concern. The basis inverse dictates the total computational eﬀort of an iteration of simplex type algorithms. This inverse does not have to be computed from scratch at each iteration, but can be updated through a number of updating schemes. All eﬃcient versions of the simplex algorithm work with some factorization of the basis matrix B or its inverse B −1 . Dantzig and Orchard-Hays [19] have proposed the Product Form of the Inverse (PFI), which maintains the basis inverse using a set of eta vectors. Benhamadou [4] proposed a Modiﬁcation of the Product Form of the Inverse (MPFI). The key idea is that the current basis inverse (AB )−1 can be computed from the previous inverse (AB )−1 using a simple outer product of two vectors and one matrix addition. LU decomposition produces generally sparser factorizations that PFI [12]. The LU factorization for the basis inverse has been proposed by Markowitz [69]. Markowitz used LU decomposition to fully invert a matrix, but used the PFI scheme to update the basis inverse during simplex iterations. Bartels and Golub [2] have later proposed a scheme to update a sparse factorization, which was more stable than the 102

PFI. Their computational experiments, however, proved that it was more computationally expensive. Forrest and Tomlin [29] created a variant of the Bartels-Golub method by sacriﬁcing some stability characteristics causing the algorithm to have a smaller growth rate in the number of non-zero elements relative to the PFI scheme. Reid [99] proposed two variants of the Bartels-Golub updating scheme that aim to balance the sparsity and the numerical stability of the factorization. A variant of the Forrest-Tomlin update was proposed by Suhl and Suhl [108]. Other important updating techniques can be found in Saunders [102] and Goldfarb [37]. A full description of most of these updating methods can be found in Nazareth [77] and Chvatal [14]. There have been many reviews and variants of these methods individually, but only a few comparisons between them that are either obsolete or do not compare all these updating schemes. McCoy and Tomlin [72] report the results of some experiments on measuring the accuracy of the PFI scheme, Bartels-Golub method and Forrest-Tomlin scheme. Lim et al. [65] provide a comparative study between Bartels-Golub method, Forrest-Tomlin method and Reid method. Badr et al. [1] perform a computational evaluation of PFI and MPFI updating schemes. Ploskas et al. [88] compare PFI and MPFI updating schemes both on their serial and their parallel implementation.

5.3 5.3.1

Basis Inversion Updating Schemes Gaussian Elimination

Gaussian elimination is a method for solving systems of linear equations, which can be used to compute the inverse of a matrix. Gaussian elimination performs the following two steps: (i) Forward Elimination: reduces the given matrix to a triangular or echelon form and (ii) Back Substitution: ﬁnds the solution of the given system. Gaussian elimination with partial pivoting requires O(n3 ) time complexity. Gaussian elimination has been implemented on MATLAB using the mldivide operator. In order to ﬁnd the new basis inverse using Gaussian elimination, one can use 103

the following equation: (AB )−1 = AB \ I

5.3.2

(5.1)

Built-in Function Inv of MATLAB

The basis inverse can be computed using the built-in function of MATLAB called inv, which uses LAPACK routines to compute the basis inverse. Due to the fact that this function is already compiled and optimized for MATLAB, its execution time is smaller compared with the other relevant methods that compute the explicit basis inverse; time-complexity, though, remains O(n3 ).

5.3.3

LU Decomposition

LU decomposition method factorizes a matrix as the product of an upper (U ) and a lower (L) triangular factors, which can be used to compute the inverse of a matrix. In order to compute the U and L factors, the built-in function of MATLAB called lu has been used. LU decomposition can be computed in time O(n3 ).

5.3.4

MPFI

MPFI updating scheme has been presented by Benhamadou [4]. The main idea of this method is that the current basis inverse (AB )−1 can be computed from the previous inverse (AB )−1 using a simple outer product of two vectors and one matrix addition, as shown in the following equation: (

AB

)−1

( )−1 = AB r. + v ⊗ (AB )−1 r. 104

(5.2)

The updating scheme of the inverse is shown in equation (5.3).

(AB )−1

b11 .. . = 0 .. . bm1

(AB ) : br1 · · · brr · · · brm h1l − ··· b1m hrl .. .. .. . . . 0 0 + − h1rl .. .. ... . . hml ··· bmm − hrl (5.3) −1

The outer product of equation (5.3) requires m2 multiplications and the addition of two matrices requires m2 additions. Hence, the complexity is Θ(m2 ).

5.3.5

PFI

The PFI scheme, in order to update the new basis (AB )−1 , uses only information on the entering and leaving variables along with the current basis (AB )−1 . The new basis inverse can be updated at any iteration using the (5.4). (AB )−1 = (AB E)−1 = E −1 (AB )−1

(5.4)

where E −1 is the inverse of the eta-matrix and can be computed by the following equation:

E −1

1 .. . 1 (hl − el ) eTl = =I− hrl

−h1l .. . 1/hrl .. . −hml /hrl

... 1

(5.5)

If the current basis inverse is computed using regular multiplication, then the complexity of the PFI is Θ (m3 ). 105

5.4

Computational Results of CPU-based Implementations

The test set used in the computational study was randomly generated. Problem instances have the same number of constraints and variables. The largest problem tested has 5000 constraints and 5000 variables. All instances are dense. For each instance, we averaged times over 10 runs. A time limit of 20 hours was set. This explains why there are no measurements for some updating methods on large instances. It should be noted that in MATLAB R2012a multithreading is enabled by default thus our implementations are automatically parallelized and executed using the available multicore CPU. Table 5.2 presents the results from the execution of the above mentioned updating schemes. We have also included the execution time from MATLAB’s linprog built-in function, a function for solving linear programming problems. MATLAB’s linprog function includes two algorithms for large-scale and medium-scale optimization. The large-scale algorithm is based on Interior Point Solver [124], a primal-dual interiorpoint algorithm. LIPSOL used a Cholesky-inﬁnity factorization that causes overhead during the factorization of dense matrices and as a result it cannot solve problems with more than 1500 variables and constraints. Due to this restriction, we have used in our comparison the medium-scale algorithm, which is a variation of the simplex algorithm. Table 5.1 includes the execution time for the basis inverse of each updating scheme, while Table 5.2 presents the total execution time. The execution time of the basis inverse and the whole algorithm for each updating scheme is also graphically illustrated in Figures 5-1 and 5-2, respectively. The MPFI updating scheme has the best performance. On the other hand, the LU updating method has the worst performance. Another signiﬁcant issue is the performance of Gaussian elimination, PFI, function inv and linprog of MATLAB which are close to each other and the results are not quite satisfactory.

106

Table Gaussian Problem Eliminasize tion 1,000 793.30 1,250 1,173.10 1,500 3,487.13 1,750 7,746.69 2,000 12,157.33 2,250 22,547.04 2,500 21,209.41 2,750 47,205.29 3,000 60,134.08 3,250 3,500 3,750 4,000 4,250 4,500 4,750 5,000 -

Gaussian Problem Eliminasize tion 1,000 871.81 1,250 1,269.92 1,500 3,746.93 1,750 8,245.07 2,000 12,875.85 2,250 23,709.13 2,500 34,249.22 2,750 49,312.02 3,000 62,646.43 3,250 3,500 3,750 4,000 4,250 4,500 4,750 5,000 -

5.1: Basis Inverse Time LU inv Decomposition 702.85 2,314.62 1,036.74 2,820.17 3,075.34 10,351.52 6,843.46 23,153.03 10,622.34 36,338.81 19,288.01 66,437.50 18,336.32 63,775.88 34,422.09 50,964.37 -

Table 5.2: Total Time LU inv Decomposition 774.15 2,025.42 1,134.50 2,446.59 3,330.55 8,946.89 7,347.78 19,905.57 11,340.48 31,160.31 20,453.34 56,804.33 22,975.68 59,233.99 36,202.77 53,472.82 107

(secs) PFI 555.65 845.49 2,160.71 5,743.58 9,411.65 17,156.07 22,266.25 27,952.71 41,204.74 -

MPFI 83.36 121.18 292.43 680.61 981.08 1,667.49 2,228.82 2,648.20 3,762.74 7,100.16 8,793.45 11,785.46 13,087.66 19,432.13 23,344.92 28,534.90 32,317.85

(secs) PFI 631.22 940.03 2,384.46 6,255.89 10,140.09 18,345.68 23,828.22 29,762.26 43,740.40 -

MPFI

linprog

155.51 216.97 509.71 1,161.56 1,670.73 2,798.21 3,730.49 4,384.83 6,242.79 14,200.80 20,147.04 28,776.84 34,235.18 42,196.15 51,210.67 62,919.07 70,058.59

171.68 3,453.09 7,097.17 11,682.54 19,267.61 36,614.81 44,998.08 -

4

7

x 10

Gaussian Elimination inv LU Decomposition PFI MPFI

6

Time (sec)

5

4

3

2

1

0 1000

1500

2000

2500 3000 3500 Problem Size

4000

4500

5000

Figure 5-1: Basis Inverse Time Comparison

4

8

x 10

Gaussian Elimination inv LU Decomposition PFI MPFI linprog

7

6

Time (sec)

5

4

3

2

1

0 1000

1500

2000

2500

3000 Problem Size

3500

4000

4500

5000

Figure 5-2: Total Time Comparison

108

5.5

Implementation of PFI and MPFI Updating Schemes on a CPU-GPU system

PFI and MPFI were the fastest updating schemes. In this section, we present the GPU-based implementations of these updating methods taking advantage of the power that modern GPUs oﬀer. The GPU-based implementations of these updating methods are implemented using MATLAB and CUDA. The updating methods are built using both native MATLAB code and CUDA MEX ﬁles. Both methods take as input the previous basis inverse (AB )−1 , the pivot column (hl ), the index of the leaving variable (k) and the number of the constraints (m).

5.5.1

GPU-based PFI

Let us assume that we have t GPU cores. Table 5.3 shows the steps that we used to compute the new basis inverse (AB )−1 with the PFI scheme on the GPU.

Table 5.3: GPU-based PFI Step 0. Compute the column vector: [ ]T h1l hml 1 v = − hrl · · · hrl · · · − hrl Each core computes in parallel m/t elements of v. The pivot element is shared between all t cores. Step 1. Replace the rth column of an identity matrix with the column vector v. Each core assigns in parallel m/t elements to the identity matrix. This matrix is the inverse of the Eta-matrix. Step 2. Perform a matrix multiplication according to (5.4). Each core will compute a block of the new basis.

109

5.5.2

GPU-based MPFI

Table 5.4 shows the steps that we used to compute the new basis inverse (AB )−1 with the MPFI scheme on the GPU.

Table 5.4: GPU-based MPFI Step 0. Compute the column vector: [ ]T 1l · · · h1rl · · · − hhml v = − hhrl rl Each core computes in parallel m/t elements of v. The pivot element is shared between all t cores. Step 1. Compute the outer product v ⊗ (ABr )−1 with matrix multiplication. Each core will compute a block of the new matrix. Step 2. Set the rth row of (AB )−1 equal to zero. Each core computes in parallel t/p rows of (AB )−1 . Step 3. Add matrix (AB )−1 with the resulted matrix from step 1. Each core will compute a block of the new basis inverse.

5.6

Computational Results of GPU-based Implementations

The same randomly generated test set is also used in order to test the performance of the GPU-based implementations. The computational comparison of the GPUbased implementations has been also performed on a quad-processor Intel Core i7 3.4 GHz with 32 Gbyte of main memory running under Microsoft Windows 7 64-bit and NVIDIA Quadro 6000 with 6 Gbyte of memory and 448 CUDA cores. The mex ﬁles have been implemented using CUDA 4.2 and Microsoft Visual Studio 2012. Table 5.5 presents the results from the execution of the GPU-based implementations of PFI and MPFI updating schemes. For each implementation, the table shows the CPU time for the basis inverse and the total time.

110

Table 5.5: Basis Inverse and Total Time of the GPU-based Implementations (secs) Problem

1,000 1,250 1,500 1,750 2,000 2,250 2,500 2,750 3,000 3,250 3,500 3,750 4,000 4,250 4,500 4,750 5,000

PFI Time of basis inverse 149.56 198.98 455.32 1,122.77 1,611.04 2,868.02 3,838.33 4,480.71 5,846.40 -

Total time 208.57 275.93 625.03 1,509.01 2,166.60 3,765.82 5,043.59 5,867.20 7,820.77 -

MPFI Time of Total basis time inverse 41.56 101.42 37.50 113.80 65.55 239.37 123.63 510.37 147.14 704.66 230.96 1,147.32 293.38 1,512.32 309.58 1,712.80 405.98 2,372.87 702.99 5,108.20 775.44 6,806.43 938.33 8,964.75 1,005.20 9,923.24 1,305.05 11,163.00 1,409.72 12,738.97 1,659.01 14,139.12 1,709.94 14,067.99

Table 5.6 presents the speedup obtained by the GPU-based implementations regarding the CPU time for the basis inverse and the total time. We now plot the ratios taken from Table 5.6 in Fig. 5-3. The total time is in logarithmic scale.

From the above results, we observe: (i) the MPFI scheme is much faster than PFI both on CPU- and on GPU-based implementation, (ii) using PFI scheme, the speedup gained from the GPU-based implementation is around 7 for the time of the basis inverse and 5.5 for the total time when the problem size reaches to 3000x3000, and (iii) using MPFI scheme, the speedup gained from the GPU-based implementation is around 19 for the time of the basis inverse and 5 for the total time when the problem size reaches to 5000x5000.

111

Table 5.6: Speedup Obtained by the GPU-based Implementations PFI Basis inverse 3.72 4.25 4.75 5.12 5.84 5.98 5.80 6.24 7.05 -

Problem 1,000 1,250 1,500 1,750 2,000 2,250 2,500 2,750 3,000 3,250 3,500 3,750 4,000 4,250 4,500 4,750 5,000

MPFI Basis inverse 2.01 3.23 4.46 5.51 6.67 7.22 7.60 8.55 9.27 10.10 11.34 12.56 13.02 14.89 16.56 17.20 18.90

Total 3.03 3.41 3.81 4.15 4.68 4.87 4.72 5.07 5.59 -

20 PFI Basis Inverse PFI Total MPFI Basis Inverse MPFI Total

18 16 14

Speedup

12 10 8 6 4 2 0 1000

1500

2000

2500 3000 3500 Problem Size

4000

4500

5000

Figure 5-3: Speedup Comparison

112

Total 1.53 1.91 2.13 2.28 2.37 2.44 2.47 2.56 2.63 2.78 2.96 3.21 3.45 3.78 4.02 4.45 4.98

5.7

Computational Results of the Impact of Basis Update on Simplex Type Algorithms

In Section 5.4, we reviewed and compared ﬁve updating methods, namely: (i) Gaussian Elimination, (ii) the built-in function inv of MATLAB, (iii) LU decomposition, (iv) Product Form of the Inverse and (v) a Modiﬁcation of the Product Form of the Inverse; and incorporated them with the revised simplex algorithm. We have performed a computational study over dense randomly optimal generated LPs and concluded that MPFI, PFI and MATLAB’s inv are the best CPU-based scaling techniques. In this section, a computational study is performed in order to highlight the impact of the choice of the basis update methods on the exterior and the revised simplex algorithm. The Exterior Primal Simplex Algorithm (EPSA) that we used has been proposed by Paparrizos et al. [83]. This algorithm outperforms the Revised Simplex Algorithm (RSA) as the problem size increases and the density decreases. The RSA in the computational study has been proposed by Dantzig [18]. The Netlib set of LPs [35], which is a well-known suite containing many read world LPs, was used in this computational study. 71% of the Netlib LPs are ill-conditioned [79], so numerical diﬃculties may occur. For each instance, we averaged times over 10 runs. In this computational study, we examine the signiﬁcance of the choice of the basis updating technique on the exterior and the revised simplex algorithm. Thus, we have executed these algorithms with the three diﬀerent basis update methods presented: (i) MPFI, (ii) PFI, and (iii) MATLAB’s inv. Table 5.7 and Table 5.8 present the results from the execution of the exterior and the revised simplex algorithm, respectively; with the aforementioned basis update methods over the Netlib set. All times in the following tables are measured in seconds using MATLAB’s builtin command cputime. In Tables 5.7 – 5.8, the following abbreviations are used: (i) INV - MATLAB’s built-in function, (ii) PFI - Product Form of the Inverse, and (iii) MPFI - Modiﬁcation of the Product Form of the Inverse. The results are graphically illustrated in Figure 5-4.

113

Table 5.7: Results of the Exterior Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set Name ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BEACONFD BLEND DEGEN2 E226 FFFFF800 ISRAEL LOTFI SC50A SC50B SC105 SC205 SCAGR7 SCFXM1 SCFXM2 SCFXM3 SCORPION SCRS8 SCTAP1 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 Average

EPSA-INV Total Time Inverse Time 0.1872 0.0653 0.0312 0.0123 0.1092 0.0468 0.2808 0.0312 0.3432 0.0468 0.7176 0.1716 0.0312 0.0153 0.0936 0.078 5.2884 2.9172 0.4056 0.1716 1.2792 0.2496 0.1404 0.0468 0.2496 0.0312 0.0936 0.0624 0.6396 0.4836 0.0312 0.0156 0.0156 0.0068 0.0936 0.0312 0.936 0.1716 7.0356 1.482 25.8338 7.644 0.7488 0.0624 6.474 1.1856 0.1716 0.078 3.7752 2.2776 0.234 0.0312 0.0624 0.0241 5.8812 0.0936 2.2308 0.0156 25.163 0.234 4.2432 0.1092 55.3648 0.1716 7.644 0.0936 0.0468 0.025 4.5846 0.5348

EPSA-PFI Total Time Inverse Time 0.1746 0.0362 0.0162 0.0042 0.0772 0.0179 0.2317 0.0176 0.1647 0.0265 0.6816 0.0885 0.028 0.0094 0.0681 0.0425 3.1063 1.8425 0.247 0.1101 0.7301 0.1408 0.1081 0.0238 0.1869 0.0171 0.0616 0.0341 0.4236 0.1847 0.0243 0.0083 0.0094 0.0032 0.0618 0.0156 0.671 0.1076 4.6819 0.876 18.7696 3.0041 0.5746 0.0396 4.894 0.4658 0.1251 0.0341 1.7438 1.0635 0.2096 0.0155 0.0339 0.0158 5.5653 0.0465 1.9913 0.007 24.0501 0.0932 4.0686 0.041 55.0681 0.0787 7.5187 0.0334 0.418 0.0101 4.0231 0.2516

EPSA-MPFI Total Time Inverse Time 0.1544 0.0377 0.0268 0.005 0.0932 0.019 0.2642 0.0187 0.3193 0.0276 0.6943 0.0948 0.0219 0.0108 0.0782 0.0507 3.4833 2.072 0.2967 0.1256 0.9131 0.1497 0.1189 0.0266 0.2117 0.0203 0.0622 0.0364 0.4633 0.2048 0.0234 0.0088 0.0127 0.0036 0.0782 0.0182 0.8369 0.1211 5.1343 1.0299 19.141 3.2197 0.7092 0.0455 5.0458 0.5091 0.1374 0.039 1.8615 1.121 0.2195 0.0181 0.0464 0.0178 5.738 0.051 2.1237 0.0084 24.4794 0.1004 4.1066 0.0436 55.2934 0.0852 7.612 0.0357 0.4271 0.0114 4.1244 0.2761

Table 5.8: Results of the Revised Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set Name ADLITTLE AFIRO AGG AGG2 AGG3 BANDM BEACONFD BLEND DEGEN2 E226 FFFFF800 ISRAEL LOTFI SC50A SC50B SC105 SC205 SCAGR7 SCFXM1 SCFXM2 SCFXM3 SCORPION SCRS8 SCTAP1 SCTAP3 SHARE1B SHARE2B SHIP04L SHIP04S SHIP08L SHIP08S SHIP12L SHIP12S STOCFOR1 Average

RSA-INV Total Time Inverse Time 0.0936 0.0468 0.001 0.0001 0.3432 0.2496 1.014 0.8736 1.5912 1.404 1.716 1.2012 0.0312 0.023 0.078 0.0624 17.0509 15.4285 1.2324 1.1544 1.9032 1.0608 0.5616 0.546 0.3276 0.1716 0.0936 0.078 0.5304 0.4056 0.0468 0.0034 0.0312 0.018 0.0936 0.056 1.5756 0.9204 13.7281 8.7985 54.9748 38.5478 0.936 0.234 9.8281 5.3664 0.468 0.3744 104.7391 103.5223 0.312 0.156 0.0468 0.0468 5.46 0.0312 1.9188 0.0312 26.177 2.5896 4.368 0.5304 59.1244 5.6628 8.4709 1.4196 0.0624 0.0156 9.3803 5.6185

RSA-PFI Total Time Inverse Time 0.0312 0.002 0.001 0.0001 0.078 0.0312 0.2184 0.078 0.2652 0.0936 0.6552 0.1404 0.0312 0.002 0.0312 0.0156 5.7408 3.8844 0.2808 0.2028 1.1232 0.2808 0.1092 0.0624 0.1716 0.0468 0.0624 0.003 0.2184 0.1092 0.0312 0.002 0.0156 0.002 0.078 0.0156 0.7956 0.156 6.2556 1.3416 21.4813 5.226 0.7644 0.0624 5.2416 0.8112 0.1716 0.1248 7.0356 5.9124 0.1872 0.0156 0.0312 0.0212 5.4444 0.0156 1.9188 0.0312 23.8214 0.234 3.8688 0.078 53.8359 0.3588 7.1292 0.078 0.0468 0.003 4.3286 0.5718

114

RSA-MPFI Total Time Inverse Time 0.0468 0.0312 0.001 0.0001 0.0936 0.0468 0.2496 0.0624 0.2652 0.1248 0.6552 0.156 0.0156 0.005 0.0312 0.0016 6.4428 4.5864 0.2496 0.1404 1.1232 0.2808 0.0936 0.0624 0.1716 0.0468 0.0312 0.005 0.1872 0.078 0.0156 0.005 0 0.003 0.078 0.0156 0.7956 0.1716 6.5052 1.716 23.1661 6.9576 0.7332 0.078 5.304 0.9048 0.1872 0.0624 7.2072 6.2088 0.156 0.0468 0.0312 0.0212 5.4756 0.0468 1.9032 0.0312 23.9462 0.2964 3.8688 0.0936 53.9763 0.3588 7.1448 0.0624 0.0312 0.005 4.4172 0.668

Figure 5-4: Average Total and Inverse Time of the Exterior and Revised Simplex Algorithm Using Three Diﬀerent Updating Methods over the Netlib Set

From the above results, we observe that: (i) the percentage of the time to compute the basis inverse to the total time is bigger in the revised than the exterior simplex algorithm, and (ii) PFI is much faster than MPFI and MATLAB’s inv over the Netlib set.

5.8

Conclusions

The basis inverse is the most time-consuming step in simplex type algorithms, so it is essential to implement a fast and numerically stable updating method. In this chapter, we performed a computational comparison of ﬁve updating schemes and incorporated them to the revised simplex algorithm. The results of the computational study showed that MPFI updating scheme is the fastest when solving large dense LPs. We proposed a GPU-based implementation for PFI and MPFI updating schemes, which were the fastest serial implementations, and implemented them on MATLAB and CUDA. We performed again a computational study and ﬁnd out that GPU-based implementations of PFI and MPFI outperform the CPU-based ones. More speciﬁcally, the speedup for PFI method is up to 7 for the time of basis inverse and 5.5 for the total time and the speedup for MPFI method is up to 19 for the time of 115

basis inverse and 5 for the total time. Our approach allows us to solve problems of size 10, 000x10, 000. Moreover, we performed a computational comparison of three CPU-based basis update methods over sparse optimal randomly generated LPs and incorporated them on the exterior and the revised simplex algorithms. The results showed that the PFI scheme is faster than MATLAB’s inv and MPFI. Finally, the percentage of the time to compute the basis inverse to the total time is bigger in the revised than the exterior simplex algorithm.

116

Chapter 6 Eﬃcient GPU-based Implementations of Simplex Type Algorithms 6.1

Introduction

Recent hardware advances have made it possible to solve large scale LPs in a short amount of time. Graphical Processing Units (GPUs) have gained a lot of popularity and have been applied to linear programming algorithms. In this chapter, we propose two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. Both GPU-based algorithms have been implemented in MATLAB using MATLAB’s Parallel Computing Toolbox. Computational results on randomly generated optimal sparse and dense linear programming problems are also presented. The results show that the proposed GPU-based implementations outperform MATLAB’s interior point method. The structure of this chapter is as follows. In Section 6.2, we present the background of this chapter. Section 6.3 presents the GPU-based implementations of these algorithms. In Section 6.4, the computational comparison of the GPU-based implementations with MATLAB’s interior point method on a set of randomly generated 117

optimal dense and sparse LPs is presented. Finally, the conclusions of this chapter are outlined in section 6.5.

6.2

Background

The increasing size of real life LPs demands more computational power and parallel computing capabilities. Recent hardware advances have made it possible to solve large LPs in a short amount of time. In the past two decades, many parallel implementations of LP algorithms have been proposed. Some of these implementations use dense matrix algebra (parallelization of the tableau simplex algorithm using dense matrix algebra or parallelization of the revised simplex algorithm using dense basis inverse), other use sparse matrix algebra, while some other use special LP algorithms variants. Table 6.1 presents a representative list of parallel (CPU- and GPU-based) simplex implementations in chronological order.

118

119

and

Netlib LPs

one small Netlib LP

other LPs

on medium sized Netlib and

Netlib LPs

sors

duced cost and with the

rithm with explicit inverse

and revised simplex algo-

steepest edge pivot rules

to MINOS 5.4 on some

with thousands of proces-

with the most-negative re-

al. [24]

Connection Machine CM-2 Iteration speed is superior

Balance 8000

Intel iPSC/2 and Sequent An average speedup of 1.5

machine

Dense CPU

Sparse CPU

16 processor shared memory Up to 12 on one large and Dense CPU

cube

Tableau simplex algorithm

Revised simplex algorithm

Tableau simplex algorithm

or

Dense GPU

or

Speedup and Comments Sparse CPU

16 processor Intel hyper Between 8 and 12 on small Dense CPU

Hardware Platform

1995 Eckstein et

[49]

Sundarraj

1994 Ho

et al. [17]

1991 Cvetanovic

[109]

Reed

Tableau simplex algorithm

1988 Stunkel

and

Algorithm

Date Author

120

and

with Devex pivot rule

Revised simplex algorithm

Revised simplex algorithm

Tableau simplex algorithm

[112]

and

Liu

rule

Sparse CPU

Sparse CPU

LP

on a medium sized Netlib

LPs

Dense CPU

3 in terms of iteration speed Sparse CPU

Kennington LPs

LPs and up to 8 on large

Up to 17 on small Netlib

frame

solver running on a main-

to a sparse revised simplex

column-row ratio compared

lem with a relatively large

LPs and 5.2 on a prob-

small medium sized Netlib

Between 0.5 and 2.7 on

128 x 128 cores MasPar ma- Up to 1000 on large random

6 processors of a Cray T3D

Intel iPSC/2

8 transputers

with the steepest edge pivot chine

1996 Thomadakis Tableau simplex algorithm

[43]

McKinnon

1996 Hall

1995 Shu [105]

al. [63]

1995 Lentini et

121

and

[121]

and Slyke

2009 Yarmish

Tableau simplex algorithm

predictor-

algorithm

corrector algorithm)

point

[51]

Interior

Tableau simplex algorithm

ants

clusters,

Sun

S20-

GeForce

Between 1 and 3 on Netlib

medium sized Netlib LPs

Between 2.5 and 4.8 on

7 workstations

dense LPs

Up to 5 on random small

Dense GPU

Dense CPU

Sparse CPU

Sparse CPU

on a large random dense LP

7 in terms of iteration speed Dense CPU

mentation

corresponding CPU imple-

Netlib LPs compared to the

7800 Up to 1.4 on medium-sized

GTX and Intel Xeon 3GHz

NVIDIA

8 processors SMP

processors and an IBM SP2

502, Silicon Graphics multi-

rule and steepest edge vari- tion

(Mehrotra’s

and

Diﬀerent platforms, includ-

Cray T3D

with the steepest edge pivot ing heterogeneous worksta- LPs

Revised simplex algorithm

Revised simplex algorithm

OLeary

2008 Jung

[1]

2006 Badr et al.

Martin [9]

2000 Bixby and

[44]

McKinnon

1998 Hall

122

al. [56]

Tableau simplex algorithm

NVIDIA GeForce GTX 260

ning on an Intel Core 2 Duo

rule

2011 Lalami et

the serial GLPK solver run-

and the steepest edge pivot

tel Xeon 3 GHz

mentation running on an In-

corresponding CPU imple-

dense LPs compared to the

Up to 12.5 on large random

3GHz

on random LPs compared to

with the PFI basis update

Up to 18 in single precision

al. [7]

NVIDIA GeForce 9600 GT

Core 2 Quad 2.83 GHz

tation running on an Intel

responding CPU implemen-

LPs compared to the cor-

Up to 2.5 on large random

Revised simplex algorithm

with dense PFI basis update

NVIDIA GeForce GTX 280

2010 Bieling et

[107]

and Elster

2009 Spampinato Revised simplex algorithm

Dense GPU

Dense GPU

Dense GPU

123

et

al. [75]

2011 Meyer

[64]

2011 Li

al. [57]

et

al.

2011 Lalami et

Tableau simplex algorithm

Tableau simplex algorithm

Tableau simplex algorithm

4 NVIDIA Tesla S1070

NVIDIA GeForce 9800GT

2 NVIDIA Tesla C2050

Compared a

dense LPs

to the serial CLP solver on

multi-GPU implementation

itly stated.

The speedup is not explic-

1.6 GHz

ning on an Intel Core 2 Duo

CPU implementation run-

pared to the corresponding

Up to 120 on large LPs com-

tel Xeon 2.66 GHz

mentation running on an In-

corresponding CPU imple-

dense LPs compared to the

Up to 24.5 on large random

Dense GPU

Dense GPU

Dense GPU

124

al. [106]

2012 Smith

NVIDIA Tesla C2070

trix free method)

Table 6.1: LP Algorithms’ Parallelization

an AMD Opteron 2378

implementation running on

responding multi-core CPU

LPs compared to the cor-

Up to 1.27 on large sparse

Core i7 2.8GHz

prog running on an Intel

[33]

LPs compared to the MAT-

and 2 on sparse random

6 on dense random LPs

LAB’s built-in function lin-

et

NVIDIA Tesla C2050

gensen

Interior point method (ma-

corrector algorithm)

Jor-

predictor-

method

and

point

(Mehrotra’s

Interior

Nielsen

2012 Gade-

Sparse GPU

Sparse GPU

Jung and O’Leary [51] proposed a CPU-GPU implementation of the Interior Point Method for dense LPs and their computational results showed a speedup up to 1.4 on medium-sized Netlib LPs [35] compared to the corresponding CPU-based implementation. Spampinato and Elster [107] presented a GPU-based implementation of the revised simplex algorithm on a GPU with NVIDIA CUBLAS [78] and NVIDIA LAPACK libraries [58]. Their implementation showed a maximum speedup of 2.5 on large random LPs compared to the corresponding CPU-based implementation. Bieling et al. [7] also proposed a parallel implementation of the revised simplex algorithm on a GPU. They compared their GPU-based implementation with the serial GLPK solver and reported a maximum speedup of 18 in single precision. Lalami et al. [56] proposed an implementation of the tableau simplex algorithm on a CPU-GPU system. Their computational results on randomly generated dense problems showed a maximum speedup of 12.5 compared to the corresponding CPU-based implementation. Lalami et al. [57] extended their previous work [56] on a multi-GPU implementation and their computational results on randomly generated dense problems showed a maximum speedup of 24.5. Li et al. [64] presented a GPU-based algorithm, based on Gaussian elimination, for large scale LPs that outperforms the CPU-based algorithm. Meyer et al. [75] proposed a mono- and a multi-GPU implementation of the tableau simplex algorithm and compared their implementation with the serial CLP solver. Their implementation outperformed CLP solver on large sparse LPs. GadeNielsen and Jorgensen [33] presented a GPU-based interior point method and their computational results showed a speedup of 6 on randomly generated dense LPs and a speedup of 2 on randomly generated sparse LPs compared to the MATLAB’s built-in function linprog. Smith et al. [106] proposed a GPU-based interior point method and their computational results showed a maximum speedup of 1.27 on large sparse LPs compared to the corresponding multi-core CPU implementation. No parallel implementation of the simplex algorithm has yet oﬀered signiﬁcantly better performance relative to an eﬃcient sequential simplex solver. For this reason, there is a strong motivation for exploring how the simplex type algorithms exploit high performance computing architectures. To the best of our knowledge, these are 125

all the papers that proposed a GPU-based implementation of an LP algorithm. Yet, there has been no attempt to implement a GPU-based exterior point simplex algorithm. The novelty of this chapter is that we propose GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. The computational results demonstrate that the proposed GPU-based implementations outperform MATLAB’s interior point method on randomly generated optimal dense and sparse LPs.

6.3

GPU-based Implementations

This section presents the GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm.

6.3.1

Implementation of the GPU-Based Revised Simplex Algorithm

Figure 6-1 presents the process that is performed in the GPU-based implementation of the revised simplex algorithm. In the ﬁrst step, the CPU initializes the algorithm by reading all the necessary data. In the second step, the CPU transfers the adequate variables (A, b, and c) to the GPU and the GPU scales the linear problem. In the third step, the GPU computes a feasible solution and transfers to the CPU a ﬂag variable. The CPU checks the ﬂag variable to determine if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates, while if it is not the GPU ﬁnds the index of the entering variable in the fourth step and then transfers to the CPU another ﬂag variable. The CPU checks the ﬂag variable to determine if the linear problem is unbounded in order to terminate the algorithm. Otherwise, the GPU ﬁnds the index of the leaving variable in the ﬁfth step and updates the basis and all the necessary variables in the sixth step. Then, the algorithm continues with the next iteration until a solution is found. In this point, we should describe the methods that we used for three separate 126

Figure 6-1: Flow Chart of the GPU-based Revised Simplex Algorithm

127

steps of the algorithm: (i) scaling, (ii) pivoting, and (iii) basis update. Scaling is the most widely used preconditioning technique in linear optimization solvers. Scaling is an operation in which the rows and columns of a matrix are multiplied by positive scalars and these operations lead to nonzero numerical values of similar magnitude. Scaling is used prior to the application of a linear programming algorithm for four reasons [113]: (a) to produce a compact representation of the bounds of variables, (b) to reduce the number of iterations required to solve LPs, (c) to simplify the setup of the tolerances, and (d) to reduce the condition number of the constraint matrix A and improve the numerical behavior of the algorithms. Scaling has been proven to reduce both the number of iterations and the total execution time of the revised simplex algorithm [89]. In a previous paper [90], we have reviewed and implemented ten scaling techniques with a focus on their implementation on a GPU under the MATLAB and CUDA environment. The computational study showed that the GPU-based arithmetic mean is the fastest scaling method. However, the equilibration scaling technique leads the revised simplex algorithm to less iterations than the other methods [89]. So, in this chapter the scaling is performed using the arithmetic mean scaling method followed by the equilibration scaling method. A crucial step in solving a linear problem with the simplex algorithm is the selection of the entering variable. This step is performed in each iteration. Good choices can lead to a fast convergence to the optimal solution, while poor choices lead to worse execution times or even no solutions of the LPs. A pivoting rule is one of the main factors that will determine the number of iterations that simplex algorithm performs [70]. In a previous paper [91], we have proposed six well-known pivoting rules for the revised simplex algorithm on a CPU-GPU computing environment. The computational study showed that the GPU-based steepest-edge is the fastest pivoting rule. So, in this chapter the pivoting step is performed using the steepest-edge pivoting rule [37]. The total work of an iteration of simplex type algorithms is dominated by the determination of the basis inverse [85] [88]. This inverse, however, does not have to be computed from scratch during each iteration. Simplex type algorithms main128

tain a factorization of basis and update this factorization in each iteration. There are several schemes for updating basis inverse. In a previous paper [86], we proposed a GPU-based implementation for the Product Form of the Inverse (PFI) [19] and a Modiﬁcation of the Product Form of the Inverse (MPFI) [4] updating schemes. The computational study showed that the GPU-based MPFI outperformed the GPUbased PFI. So, in this chapter the basis update is performed using the MPFI updating scheme. Finally, we have tried to optimize the use of GPU memory. The frequent and heavy back-and-forth transmission of data between the CPU and GPU will dominate the computational time, so we reduced the communication time as far as possible. CPU is used to control the whole iteration while GPU is used for computing intensive steps. In both GPU-based algorithms presented in this chapter, the communication between the CPU and GPU occurs only in the following steps of the algorithm: (i) initially, the CPU transfers to the GPU the matrix A and the vectors b and c, (ii) in each iteration the GPU transfers a ﬂag variable to the CPU in order to determine if the linear problem is optimal and terminate the algorithm, (iii) in each iteration the GPU transfers another ﬂag variable to the CPU in order to determine if the linear problem is unbounded and terminate the algorithm, and (iv) ﬁnally, the GPU transfers the objective value and the iterations needed to ﬁnd the solution to the CPU. Table 6.2 shows the pseudocode of the implementation of the revised simplex algorithm on a GPU.

6.3.2

Implementation of the GPU-Based Primal-Dual Exterior Point Simplex Algorithm

Figure 6-2 presents the process that is performed in the GPU-based implementation of the PDEPSA. In the ﬁrst step, the CPU initializes the algorithm by reading all the necessary data. In the second step, the CPU transfers the adequate variables (A, b, and c) to the GPU and the GPU scales the linear problem. In the third step, the GPU computes a dual feasible solution and an interior point. In the fourth step, the 129

Table 6.2: GPU-based Revised Simplex Algorithm 1. 2. 3. 4. 5. 6. 7.

8. 9. 10.

11. 12. 13. 14.

The CPU reads the linear problem and initialize all the necessary variables. The CPU transfers to the GPU the matrix A and the vectors b and c The GPU scales the linear problem using arithmetic mean followed by equilibration. The GPU calculates a feasible partition. The GPU transfers a flag variable to the CPU. while linear problem is not optimal The CPU checks the flag variable to determine if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates. The GPU calculates the index of the entering variable using the steepest-edge pivoting rule. The GPU transfers a flag variable to the CPU. The CPU checks the flag variable to determine if the linear problem is unbounded. If the linear problem is unbounded, the algorithm terminates. The GPU calculates the index of the leaving variable. The GPU updates the basis using the MPFI updating scheme. The GPU transfers a flag variable to the CPU. end

130

Figure 6-2: Flow Chart of the GPU-based Primal-Dual Exterior Point Simplex Algorithm

GPU computes the direction dB and transfers to the CPU a ﬂag variable. The CPU checks the ﬂag variable to determine if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates, while if it is not the GPU ﬁnds the index of the entering variable in the ﬁfth step. In the sixth step, the GPU computed the next interior point. In the seventh step, the GPU ﬁnds the index of the leaving variable and transfers to the CPU another ﬂag variable. The CPU checks the ﬂag variable to determine if the linear problem is unbounded in order to terminate the algorithm. Otherwise, in the eighth step the GPU updates the basis and all the necessary variables. Then, the algorithm continues with the next iteration until a solution is found. As described in section 6.3.2, the scaling is performed using the arithmetic mean scaling method followed by the equilibration scaling method and the basis update is performed using the MPFI updating scheme. Again, we reduced the communication 131

Table 6.3: GPU-based Primal-Dual Exterior Point Simplex Algorithm 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.

14. 15. 16.

The CPU reads the linear problem and initialize all the necessary variables. The CPU transfers to the GPU the matrix A and the vectors b and c. The GPU scales the linear problem using arithmetic mean followed by equilibration. The GPU calculates a dual feasible partition and an interior point. The GPU computes the direction $d_{B}. The GPU transfers a flag variable to the CPU. while linear problem is not optimal The CPU checks the flag variable to determine if the linear problem is optimal. If the linear problem is optimal, the algorithm terminates. The GPU calculates the index of the entering variable. The GPU computes the next interior point. The GPU finds the index of the leaving variable. The GPU transfers a flag variable to the CPU. The CPU checks the flag variable to determine if the linear problem is unbounded. If the linear problem is unbounded, the algorithm terminates. The GPU updates the basis using the MPFI updating scheme. The GPU transfers a flag variable to the CPU. end

time between the CPU and GPU in the same manner explained in section 6.3.2. Table 6.3 shows the pseudocode of the implementation of the PDEPSA on a GPU.

6.4

Computational Results

Computational studies have been widely used in order to examine the practical eﬃciency of an algorithm or even compare algorithms. The computational comparison has been performed on a quad-processor Intel Core i7 3.4 GHz with 32 Gbyte of main memory and 8 cores, a clock of 3700 MHz, an L1 code cache of 32 KB per core, an L1 data cache of 32 KB per core, an L2 cache of 256 KB per core, an L3 cache of 8 MB and a memory bandwidth of 21 GB/s, running under Microsoft Windows 7 64-bit and on a NVIDIA Quadro 6000 with 6 GB GDDR5 384-bit memory, a core clock of 574 MHz, a memory clock of 750 MHz and a memory bandwidth of 144 GB/s. It consists of 14 stream processors with 32 cores each resulting in 448 total cores. The 132

graphics card driver installed in our system is NVIDIA 64 kernel module 320.92. Both GPU-based algorithms have been implemented in MATLAB 2013b using MATLAB’s Parallel Computing Toolbox. In this computational study, we compare the proposed GPU-based algorithms with MATLAB’s large-scale linprog built-in function. The large-scale algorithm is based on Interior Point Solver [124], a primal-dual interior point algorithm. MATLAB’s linprog function automatically executes on multiple computational threads, in order to take advantage of the multiple cores of the CPU. The execution time of this algorithm already includes the performance beneﬁt of the inherent multithreading in MATLAB. MATLAB supports multithreaded computation for some built-in functions. These functions automatically utilize multiple threads without the need to specify commands to handle the threads in a code. Of course, MATLABs inherent multithreading is not as eﬃcient as a pure parallel implementation. Execution times for all algorithms have been measured in seconds using tic and toc MATLAB’s built-in functions. Finally, the results of the GPU-based implementations are very accurate, because NVIDIA Quadro 6000 is fully IEEE 764-2008 compliant 32- and 64-bit fast double-precision. The test set used in the computational study was a set of randomly generated sparse and dense optimal LPs (problem instances have the same number of constraints and variables and the largest problem tested has 6, 000 constraints and 6, 000 variables). Sparse LPs were generated with 10% and 20% density. For each instance we averaged times over 10 runs. All runs were made as a batch job.

In Tables 6.4 - 6.9 and Figures 6-3 - 6-5, the following abbreviations are used: (i) MATLAB’s large-scale linprog built-in function - IPM, (ii) Revised Simplex Algorithm - RSA, and (iii) Primal-Dual Exterior Point Simplex Algorithm - PDEPSA. Tables 6.4 - 6.6 present the total execution time of the algorithms over the randomly generated optimal dense LPs, the randomly generated optimal sparse LPs with density 10% and the randomly generated optimal sparse LPs with density 20%, respectively. The total execution time of the GPU-based implementations of RSA and PDEPSA also include 133

Table 6.4: Total execution time over the randomly generated optimal dense LPs Problem IPM RSA PDESPA 1,000x1,000 279.68 7.41 (1.66%) 3.02 (1.23%) 1,500x1,500 499.05 12.65 (1.41%) 5.06 (1.12%) 2,000x2,000 1,128.46 23.69 (0.93%) 8.94 (1.23%) 2,500x2,500 2,794.41 51.20 (0.72%) 17.32 (1.04%) 3,000x3,000 4,399.06 80.35 (0.63%) 26.46 (0.97%) 3,500x3,500 7,040.20 125.04 (0.54%) 40.26 (0.93%) 4,000x4,000 8,938.19 154.06 (0.46%) 49.34 (0.85%) 4,500x4,500 252.41 (0.42%) 70.60 (0.81%) 5,000x5,000 428.00 (0.41%) 95.41 (0.79%) 5,500x5,500 778.84 (0.34%) 146.23 (0.75%) 6,000x6,000 1,261.53 (0.30%) 205.42 (0.73%) Average 3,582.72 288.65 (0.71%) 60.73 (0.90%)

the communication time. The percentage of the communication time to the total execution time for the GPU-based algorithms is presented in the parentheses. Tables 6.7 - 6.9 present the iterations needed by each algorithm to solve the linear problem over the randomly generated optimal dense LPs, the randomly generated optimal sparse LPs with density 10% and the randomly generated optimal sparse LPs with density 20%, respectively. A time limit of 3 hours was imposed which explains why there are no measurements for IPM on randomly generated optimal dense LPs above n = 4000.

Figures 6-3 - 6-5 present the speedup over the randomly generated optimal dense LPs, the randomly generated optimal sparse LPs with density 10% and the randomly generated optimal sparse LPs with density 20%, respectively. The following abbreviations are used: (i) PDESPA/IPM is the speedup of PDEPSA over IPM, (ii) RSA/IPM is the speedup of RSA over IPM, and (iii) PDESPA/RSA is the speedup of PDEPSA over RSA. Before the discussion of the results, we should note again that MATLAB’s largescale linprog built-in function is not completely parallelized, but utilizes the inherent multithreading in MATLAB. We have selected this algorithm because it is a sophisticated commercial LP solver and it is implemented in MATLAB as our algorithms. So, we believe that it is fair to use MATLAB’s large-scale linprog built-in function as 134

Table 6.5: Total execution time over the randomly generated optimal sparse LPs with density 10% Problem IPM RSA PDESPA 1,000x1,000 78.86 (0.12%) 61.02 9.84 (0.10%) 1,500x1,500 280.80 (0.08%) 196.35 28.58 (0.08%) 2,000x2,000 415.70 (0.06%) 270.31 37.46 (0.08%) 2,500x2,500 1,298.86 (0.06%) 688.14 88.66 (0.07%) 3,000x3,000 1,683.41 (0.05%) 873.17 104.01 (0.06%) 3,500x3,500 2,539.07 (0.05%) 1,236.40 148.83 (0.05%) 4,000x4,000 3,128.60 (0.03%) 1,511.56 182.00 (0.04%) 4,500x4,500 3,710.59 (0.03%) 1,731.46 213.49 (0.04%) 5,000x5,000 4,466.21 (0.02%) 2,049.69 255.90 (0.03%) 5,500x5,500 5,736.32 (0.02%) 2,435.80 315.66 (0.03%) 6,000x6,000 7,234.65 (0.02%) 3,035.53 385.42 (0.02%) Average 2,779.37 1,280.86 (0.05%) 160.90 (0.05%)

Table 6.6: Total execution time over the randomly generated optimal sparse LPs with density 20% Problem IPM RSA PDESPA 1,000x1,000 88.50 34.26 (0.16%) 8.18 (0.12%) 1,500x1,500 162.82 59.39 (0.10%) 13.21 (0.09%) 2,000x2,000 355.79 127.77 (0.08%) 25.84 (0.09%) 2,500x2,500 758.21 232.88 (0.06%) 51.24 (0.07%) 3,000x3,000 1,077.84 314.60 (0.05%) 71.28 (0.06%) 3,500x3,500 2,487.42 649.23 (0.05%) 139.57 (0.05%) 4,000x4,000 3,072.17 776.63 (0.05%) 171.23 (0.05%) 4,500x4,500 3,853.41 953.61 (0.04%) 206.81 (0.04%) 5,000x5,000 5,024.56 1,237.60 (0.04%) 267.17 (0.04%) 5,500x5,500 6,539.02 1,535.80 (0.04%) 339.31 (0.04%) 6,000x6,000 8,104.58 1,835.53 (0.03%) 415.32 (0.03%) Average 2,865.85 705.21 (0.06%) 155.38 (0.06%)

135

Table 6.7: Number of iterations over the randomly generated optimal dense LPs Problem IPM RSA PDESPA 1,000x1,000 34 118 134 1,500x1,500 32 68 103 2,000x2,000 29 148 163 2,500x2,500 42 249 257 3,000x3,000 42 163 170 3,500x3,500 45 180 258 4,000x4,000 38 220 242 4,500x4,500 216 323 5,000x5,000 131 149 5,500x5,500 280 299 6,000x6,000 330 415 Average 37.43 191.18 228.45

Table 6.8: Number of iterations over the randomly generated optimal sparse LPs with density 10% Problem IPM RSA PDESPA 1,000x1,000 22 1,275 1,256 1,500x1,500 26 1,836 2,069 2,000x2,000 28 2,450 2,581 2,500x2,500 29 2,845 3,503 3,000x3,000 24 3,285 4,688 3,500x3,500 25 4,675 5,332 4,000x4,000 26 5,035 6,801 4,500x4,500 27 5,508 7,517 28 6,516 8,669 5,000x5,000 5,500x5,500 30 7,945 10,314 31 9,287 12,456 6,000x6,000 Average 26.91 4,605.18 5,926.00

136

Table 6.9: Number of iterations over the randomly generated optimal sparse LPs with density 20% Problem IPM RSA PDESPA 1,000x1,000 23 1,024 1,041 1,500x1,500 19 1,340 1,832 2,000x2,000 24 1,811 2,460 2,500x2,500 23 2,650 3,282 24 2,836 4,190 3,000x3,000 3,500x3,500 30 3,454 4,987 4,000x4,000 26 3,889 5,627 4,500x4,500 28 4,536 6,438 5,000x5,000 30 4,888 7,764 5,500x5,500 29 5,653 9,135 6,000x6,000 32 6,781 11,341 Average 26.18 3,532.91 5,281.55

Figure 6-3: Speedup over the randomly generated optimal dense LPs

137

Figure 6-4: Speedup over the randomly generated optimal sparse LPs with density 10%

Figure 6-5: Speedup over the randomly generated optimal sparse LPs with density 20%

138

a reference point for the comparison with our GPU-based algorithms. The results show a large speedup of both proposed GPU-based algorithms over MATLAB’s large-scale linprog built-in function. From the above results, we observe: (i) GPU-based PDEPSA achieves a maximum speedup over IPM of about 181 (on average 143) on dense LPs, about 19 (on average 15) on sparse LPs with density 10%, and about 20 (on average 16) on sparse LPs with density 20%, (ii) GPU-based RSA achieves a maximum speedup over IPM of about 58 (on average 50) on dense LPs, about 2 (on average 2) on sparse LPs with density 10%, and about 4 (on average 4) on sparse LPs with density 20%, (iii) GPU-based PDEPSA is much faster than GPU-based RSA on all LPs, and (iv) the percentage of the communication time to the total execution time on both GPU-based algorithms is very small. For the sake of completeness, we also present the number of iterations needed by each algorithm to solve the linear problem in Tables 6.7 - 6.9, but only the total execution time of the algorithms is used to compare the algorithms. This is because interior point methods converge after a few iterations, but with a large computational cost per iteration. These ﬁndings are signiﬁcant because they show that primal-dual exterior point simplex algorithms are more eﬃcient for GPU-based implementations than the revised simplex algorithm. To the best of our knowledge this is the only work presenting an implementation of an exterior point simplex algorithm on a GPU. Furthermore, our GPU-based implementations presented great speedup not only on dense LPs, but also on sparse LPs.

6.5

Conclusions

GPUs have been already applied for the solution of linear optimization algorithms, but GPU-based implementations of an exterior point simplex algorithm have not yet been studied. In this chapter, we proposed two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. We performed a computational study on large-scale randomly generated 139

optimal sparse and dense LPs and found that both GPU-based algorithms outperform MATLAB’s interior point method. The primal-dual exterior point simplex algorithm was the fastest GPU-based implementation and the maximum speedup gained over MATLAB’s interior point method was about 181 on dense LPs and about 20 on sparse LPs.

140

Chapter 7 Web-based Educational Software for Solving Linear Programming Problems 7.1

Introduction

Web-based Decision Support Systems (DSS) are computerized information systems that provide decision support tools to managers or business analysts using only a thin-client Web Browser [98]. Web-based DSS can assist a decision maker to: (i) retrieve, analyze and display data from large databases, (ii) provide access to a model, and (iii) establish communication and decision making in distributed teams [2]. In general, all types of DSS, communication-driven, knowledge-driven and documentdriven [6], can be implemented as a web-based DSS [97]. Linear programming algorithms have been widely used in DSS for supplier selection [36], forest management planning systems [59], assignment of parking spaces [117], schedule of student attendants [61], portfolio robustness evaluation [66], optimality in open air reservoir strategies [115], energy planning [71] and water resource management [27] among others. However, the special structure of each linear problem should be taken into consideration in order to take advantage of diﬀerent linear programming 141

algorithms and methods. Educational tools for solving linear programming problems have been already proposed [41] [45] [55] [104] [116] [120]. Many of them allow to deﬁne their linear programming problems and solve them graphically and/or algorithmically. In this chapter, we propose two educational tools that provide decision support tools to students that want to solve their linear programming problems. Our main aim is not to provide a tool to teach students how to solve their linear programming problems, like the aforementioned educational tools, but assist them to select the appropriate method for each step of the linear programming algorithm. This chapter presents two educational software for the solution of a linear programming problem. The ﬁrst software assists students in the selection of the linear programming algorithm and basis update method for solving their linear programming problems. The second software builds on the ﬁrst one and goes further to explore all diﬀerent steps of the linear programming algorithms. Two linear programming algorithms are incorporated in the software: (i) Revised Simplex Algorithm [18] and (ii) Exterior Primal Simplex Algorithm [83]. The DSS also includes a variety of diﬀerent methods for the diﬀerent steps of these algorithms. More speciﬁcally, ten scaling techniques, ﬁve basis update methods and eight pivoting rules have been implemented in the software. The student can either select the algorithm and the appropriate methods to solve a linear programming problem or perform a thorough computational study with all combinations of algorithms and methods in order to gain an insight on its linear programming problem. The structure of this chapter is as follows. In Section 7.2, we present the ﬁrst education software that assists students in the selection of the linear programming algorithm and basis update method for solving their linear programming problems. Section 7.3 presents the second educational software that extends the ﬁrst one and includes a variety of diﬀerent methods for the diﬀerent steps of these algorithms. Finally, the conclusions of this chapter are outlined in section 7.4.

142

7.2

A Web-Based Educational Software using Basis Update on Simplex Type Algorithms

This section presents a web-based educational software that that assists students in the selection of the algorithm and basis update method in order to solve their Linear Programming problems. In Section 5.4 and 5.6, we reviewed and compared both the CPU- and GPU- based implementations of ﬁve updating methods [86], namely: (i) Gaussian Elimination, (ii) the built-in function inv of MATLAB, (iii) LU decomposition, (iv) Product Form of the Inverse (PFI) and (v) a Modiﬁcation of the Product Form of the Inverse (MPFI); and incorporated them with the revised simplex algorithm. We have performed a computational study over dense randomly optimal generated LPs and concluded that MPFI, PFI and MATLABs inv are the best serial basis update methods. The proposed educational software include these methods and incorporate them on the exterior and the revised simplex algorithm.

7.2.1

Object-Oriented Analysis

Figure 7-1 presents the decision making process that the student can perform using the DSS. Firstly, the student formulates the problem under examination as a linear programming problem. Then, the data acquisition, validation and veriﬁcation phase follows, so, the student may upload the input data to the DSS and select the algorithms and the basis update methods to be evaluated, in the next step. Then, the algorithms evaluation and execution step follows. The last step includes the presentation and the analysis of the results. Finally, the student validates the results and if necessary provides feedback on the operation and the updated decision making process is performed again. The sequence diagram in Figure 7-2 shows the whole interaction between the decision maker and the DSS in order to produce the report that will assist further in the decision making process. The decision-policy maker interacts with the initial screen of the DSS and uploads the input ﬁle, selects the algorithms (RSM and\or 143

Figure 7-1: Decision Making Process

Figure 7-2: Sequence Diagram

EPSA) and basis update methods (full inverse and\or PFI and\or MPFI) and presses the ’Report’ button. Then, the system validates the input data and executes the RSM and EPSA algorithms for each basis update method, collects the total execution time, the time to perform the basis inverse, the number of iterations and the objective value, and presents these results in the report screen. Finally, the decision maker can export the results as a pdf ﬁle. Figure 7-3 presents the class diagram of the DSS. InitialScreen is a boundary class that includes three methods responding to decision maker’s action events: i) upload input ﬁle, ii) select algorithms and basis update methods, and iii) press ’Report’ button. SimplexAlgorithm is an abstract class that incorporates the common attributes and methods of RevisedSimplexAlgorithm and ExteriorPrimalSimplexAlgorithm. Matrix A contains the constraints coeﬃcients, vector c the objective function coeﬃcients, vector b the right-hand side values, vector Eqin the type of constraints 144

Figure 7-3: Class Diagram

(equality or inequality), and variable minMax the type of problem (minimization or maximization). Moreover, the SimplexAlgorithm class includes three methods that perform the basis update methods and an abstract method for the execution of the algorithms’ logic. Finally, the RevisedSimplexAlgorithm and ExteriorPrimalSimplexAlgorithm classes override the abstract method executeAlgorithm of the SimplexAlgorithm and perform their steps for the solution of the linear programming problem.

7.2.2

Educational Software Presentation

All algorithms and update methods have been implemented using MATLAB. These algorithms have been converted to Java classes using the MATLAB Builder JA. The web interface of the DSS has been designed using jsp and multiple users can access it through a web browser. The initial screen of the DSS is shown in Figure 7-4. The decision maker presses the ’Browse’ button in order to upload the ﬁle containing the LP in mps format. In addition, the decision maker selects the algorithms and the basis update methods 145

Figure 7-4: Initial Screen of the web-based DSS

that will be included in the comparison. By pressing the ’Report’ button, a screen with a report is presented, as shown in Figure 7-5. This screen includes the objective value and the iterations made by each algorithm. Furthermore, the total and the inverse time of each basis update method are presented both as numerical values and in a illustrative ﬁgure. Finally, the decision maker can export this report as a pdf ﬁle. Many LPs of the Netlib set are real-world problems. Figures 7-4 and 7-5 present a case study for SCRS8, which is a problem of the Netlib set [35]. SCRS8 is a technological assessment model for the transition from fossil to renewable energy resources in the United States [48] adapted from Manne [68]. SCRS8 includes 1, 169 variables with 491 constraints. From the results that are presented in Figure 7-5, it is concluded that EPSA with the PFI updating method is the best choice for the solution of this problem. 146

Figure 7-5: Report of the Results

7.3

A Web-Based Educational Tool for Solving Linear Programming Problems

This educational tool builds on the work of 7.2. Ploskas et al. [92] have implemented a web-based DSS that assists students in the selection of the linear programming algorithm and basis update method for solving their linear programming problems. In this educational software, we do not only take into consideration the basis update step, but go further to explore all diﬀerent steps of the linear programming algorithms. Two linear programming algorithms are incorporated in the DSS: (i) Revised Simplex Algorithm [18] and (ii) Exterior Primal Simplex Algorithm [83]. The DSS also includes a variety of diﬀerent methods for the diﬀerent steps of these algorithms. More speciﬁcally, ten scaling techniques, ﬁve basis update methods and eight pivoting rules have been implemented in the DSS. The student can either select the algorithm and the appropriate methods to solve a linear programming problem or perform a thorough computational study with all combinations of algorithms and methods in order to gain an insight on its linear programming problem. In the proposed DSS, we have implemented ten widely used scaling techniques, as presented in Section 3.3: (i) arithmetic mean, (ii) de Buchet for the case p = 1, (iii) de Buchet for the case p = 2, (iv) entropy, (v) equilibration, (vi) geometric mean, (vii) IBM MPSX, (viii) Lp-norm for the case p = 1, (ix) Lp-norm for the case p = 2, 147

Figure 7-6: Decision Making Process

and (x) Lp-norm for the case p = ∞ and de Buchet for the case p = ∞. Moreover, we have implemented ﬁve widely used basis update methods, as presented in Section 5.3: (i) Gaussian elimination, (ii) MATLAB’s built-in function called inv, (iii) LU decomposition, (iv) Modiﬁcation of the Product Form of the Inverse, and (v) Product Form of the Inverse. Finally, we have also implemented eight widely used pivoting rules, as presented in Section 4.3: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, (vi) Queue rule, (vii) Stack rule, and (viii) Steepest Edge rule.

7.3.1

Analysis and Design

The decision making process that the student can perform using the proposed DSS is presented in Figure 7-6. Initially, the student formulates the given problem as a linear programming problem. In step 2, the student gathers, validates and veriﬁes the adequate data and the input data are uploaded to the DSS. Then, the student either selects the desired algorithm and the appropriate methods to solve a linear programming problem or selects the option to perform a computational study with all combinations of algorithms and methods. Then, the algorithms’ evaluation and execution step follows. In the last step, the results are presented and analyzed. Finally, the student validates the results and considers if the provision of further feedback on the operation of the DSS is necessary; if so, the updated decision making process is performed again. The interaction between the student and the DSS is presented in Figure 7-7. The student uploads the input ﬁle in the standardized mps format, selects the algorithms (RSM, EPSA), the scaling methods (arithmetic mean, de Buchet for the case p = 1, de Buchet for the case p = 2, entropy, equilibration, geometric mean, IBM MPSX, 148

Figure 7-7: Sequence Diagram

Lp-norm for the case p = 1, Lp-norm for the case p = 2, Lp-norm for the case p = ∞, de Buchet for the case p = ∞), the basis update methods (Gaussian elimination, MATLAB’s built-in function called inv, LU decomposition, Modiﬁcation of the Product Form of the Inverse, Product Form of the Inverse), the pivoting rules (Bland’s rule, Dantzig’s rule, Greatest Increment Method, Least Recently Considered Method, Partial Pricing Rule, Queue Rule, Stack Rule, Steepest Edge Rule) and presses the ’Report’ button. Then, the DSS validates the input data and executes the algorithms for each combination of methods (scaling methods, basis update methods and pivoting rules). Then, it collects: (i) the total execution time, (ii) the time to perform the scaling, (iii) the time to perform the basis inverse, (iv) the time to perform the pivoting, (v) the number of iterations, and (vi) the objective value; and presents these results to the student. Finally, the decision maker can export the results as a pdf ﬁle for further analysis. Figure 7-8 presents the class diagram of the proposed DSS. InitialScreen is a boundary class that includes three methods that respond to the decision maker’s action events: (i) upload input ﬁle, (ii) select algorithms, scaling methods, basis update methods and pivoting rules, and (iii) press ’Report’ button. SimplexAlgorithm is an abstract class that includes the common attributes and methods of RevisedSimplexAlgorithm and ExteriorPrimalSimplexAlgorithm. Matrix A contains the constraints coeﬃcients, vector c the objective function coeﬃcients, vector b the right-hand side values, vector Eqin the type of constraints (equality or inequality), and variable 149

minM ax the type of the linear programming problem (minimization or maximization). Furthermore, SimplexAlgorithm class includes three methods that perform the scaling, the basis inverse and the pivoting according to the selected methods. RevisedSimplexAlgorithm and ExteriorPrimalSimplexAlgorithm classes override the abstract method executeAlgorithm of the SimplexAlgorithm and perform their unique steps for the solution of the linear programming problem. Scaling is an abstract class that includes the common attributes and methods of all diﬀerent scaling methods. Matrix A again contains the constraints coeﬃcients, vector c the objective function coeﬃcients, vector b the right-hand side values, vector r the row scaling factors and vector s the column scaling factors. All the derived scaling classes override the abstract method scaling of the Scaling class and perform their steps to scale the linear programming problem. BasisUpdateMethod is an abstract class that includes the common attributes and methods of all diﬀerent basis update methods. Matrix Ab contains the previous basis inverse, vector hl the pivot column, k the index of the leaving variable and m the number of the constraints. All the derived basis update classes override the abstract method inverse of the BasisUpdateMethod class and perform their steps to update the basis matrix. PivotingRule is an abstract class that includes the common attributes and methods of all diﬀerent pivoting rules. Vector Sn contains the cost coeﬃcients. All the derived pivoting classes override the abstract method pivoting of the PivotingRule class and perform their steps to make the pivoting step. Finally, some of the derived pivoting classes, like Steepest, contain some unique attributes, i.e. Steepest contains vector nonBasicList that holds the indices of the non-basic variables, matrix A the constraints coeﬃcients and matrix Ab the basis matrix.

7.3.2

Educational Software Presentation

Simplex type algorithms, scaling methods, basis update methods and pivoting rules have been implemented using MATLAB. Then, these algorithms and methods were converted to Java classes using the MATLAB Builder JA. The web interface of the 150

Figure 7-8: Class Diagram

DSS was designed using Java Server Pages (JSP). The initial screen of the educational software is presented in Figure 7-9. The student presses the ’Browse’ button in order to upload the ﬁle containing the LP in mps format. Moreover, the student selects the algorithms, the scaling methods, the basis update methods and the pivoting rules that will be included in the comparison. By pressing the ’Report’ button a screen with a thorough report is presented (Figure 710). This screen includes the objective value, the number of the iterations, the total time, the times needed to perform the scaling and the basis update, and the number of iterations for each pivoting rule. Finally, the decision maker may export the report as a pdf ﬁle for further analysis.

7.4

Conclusions

Many problems from diﬀerent scientiﬁc ﬁelds can be formulated as linear programming problems. Many DSS that utilize linear programming algorithms exist, but they do not take into consideration the structure of the problem in order to suggest the 151

Figure 7-9: Initial Screen of the proposed DSS

Figure 7-10: Report Screen

152

best combination of the linear programming algorithm and the appropriate methods for each step of the algorithm. Furthermore, there are many educational software that allow users to deﬁne their linear programming problems and solve them graphically and/or algorithmically. In this chapter, we presented two educational tools that provide decision support tools to students that want to solve their linear programming problems. The main aim of these educational tools is not to teach students how to solve their linear programming problems, but to assist them to select the appropriate method for each step of the linear programming algorithm.

153

154

Chapter 8 Conclusions This chapter summarizes the results of this thesis and presents ideas for future work.

8.1

Results

The main contributions and results of this thesis were as follows: • We surveyed, implemented and computationally compared ten well-known scaling techniques: (i) arithmetic mean, (ii) de Buchet for the case p = 1, (iii) de Buchet for the case p = 2, (iv) entropy, (v) equilibration, (vi) geometric mean, (vii) IBM MPSX, (viii) Lp -norm for the case p = 1, (ix) Lp -norm scaling technique for the case p = 2, and (x) Lp -norm scaling technique for the case p = ∞ and de Buchet for the case p = ∞. Initially, we proposed GPU-based implementations of these techniques. A thorough computational study was presented to establish the practical value of the GPU-based implementations. The GPU-based implementations outperformed the CPU-based implementations in all problems and for all scaling techniques. The speedup gained from all scaling techniques is about 7x. Moreover, we studied the impact of scaling prior to the application of linear programming algorithms. We applied three of the aforementioned techniques, arithmetic mean, equilibration, and geometric mean, in order to investigate the signiﬁcance of scaling on: (i) simplex-type or pivoting algorithms, (ii) interior-point methods, and (iii) exterior point simplex type al155

gorithms. From the computational study that we conducted, we observed that: (i) equilibration is the best scaling technique on all algorithms, (ii) the eﬀect of scaling is more signiﬁcant on IPM and revised simplex algorithm, and (iii) EPSA is scaling invariant. • We studied and implemented six well-known pivoting rules used on simplextype algorithms: (i) Bland’s rule, (ii) Dantzig’s rule, (iii) Greatest Increment Method, (iv) Least Recently Considered Method, (v) Partial Pricing rule, and (vi) Steepest Edge rule. Moreover, we proposed GPU-based implementations of the aforementioned pivoting rules. Computational results were presented to show that that the proposed GPU-based implementations of the pivoting rules outperform the corresponding CPU-based implementations. More speciﬁcally, we concluded that only the Steepest Edge rule can be eﬃciently implemented on a GPU. All the other pivoting rules presented trivial speedups. These results stem from the fact that only Steepest Edge rule’s total execution time is dictated by the pivoting time. The other pivoting rules perform trivial pivoting operations, so, it is not worth to implement them on a GPU. A maximum speedup on the pivoting step of 16.72 was obtained for Steepest Edge rule. The speedup obtained from the pivoting was not depicted exactly in the speedup of the total time, which was on average 4, for two reasons: (i) only the selection of the entering variable has been implemented on a GPU, and (ii) the communication cost is still expensive. • We surveyed and implemented ﬁve widely used basis update schemes: (i) Gaussian elimination, (ii) built-in function of MATLAB called inv, (iii) LU decomposition, (iv) Product Form of the Inverse, and (v) a Modiﬁcation of the Product Form of the Inverse. A computational study highlighted that MPFI and PFI update schemes had the best performance, while LU updating method had the worst performance. Then, we proposed GPU-based implementations for the two fastest update schemes, PFI and MPFI. A thorough computational study over dense optimal LPs was presented to establish the practical value of the GPU156

based implementations. The MPFI scheme was much faster than PFI both in CPU- and in GPU-based implementation. Moreover, the speedup gained from the GPU-based implementation of PFI was around 7 for the time of the basis inverse and 5.5 for the total time when the problem size reaches to 3000x3000. Respectively, and (iii) the speedup gained from the GPU-based implementation of PFI was around 19 for the time of the basis inverse and 5 for the total time when the problem size reaches to 5000x5000. Finally, we incorporated three of these methods, MATLAB’s function inv, MPFI, and PFI, on the exterior and the revised simplex algorithm in order to highlight the signiﬁcance of the choice of the basis update method on simplex type algorithms and the reduction that can oﬀer to the solution time. A computational studied that we conducted pointed out that the percentage of the time to compute the basis inverse to the total time is bigger in the revised than the exterior simplex algorithm, and that PFI is much faster than MPFI and MATLABs inv over the Netlib set.

• We proposed two eﬃcient GPU-based implementations of the revised simplex algorithm and a primal-dual exterior point simplex algorithm. The GPU-based primal-dual exterior point simplex algorithm is the ﬁrst GPU-based exterior point algorithm developed for the solution of a linear programming problem. So, the development of the algorithm introduced an innovation to the ﬁeld of linear programming algorithms on GPUs since it is the only algorithm of its kind. The scaling was performed on both algorithms using the GPU-based implementation of the arithmetic mean scaling method followed by the GPU-based implementation of the equilibration scaling method. The basis update was performed on both algorithms using the GPU-based implementation of MPFI updating scheme. The pivoting step was performed on the revised simplex algorithm using the GPU-based implementation of Steepest Edge pivoting rule. Computational results highlighted that the proposed GPU-based implementations outperform MATLAB’s interior point method. GPU-based PDEPSA achieved a maximum speedup over IPM of about 181 (on average 143) on dense LPs, about 157

19 (on average 15) on sparse LPs with density 10%, and about 20 (on average 16) on sparse LPs with density 20%. GPU-based RSM achieved a maximum speedup over IPM of about 58 (on average 50) on dense LPs, about 2 (on average 2) on sparse LPs with density 10%, and about 4 (on average 4) on sparse LPs with density 20%. So, GPU-based PDEPSA is much faster than GPUbased RSM and MATLAB’s interior point method on all LPs. The percentage of the communication time to the total execution time on both GPU-based algorithms was very small due to memory optimizations that were applied. We concluded that primal-dual exterior point simplex algorithms are more eﬃcient for GPU-based implementations than the revised simplex algorithm. • We designed and implemented two web-based educational tools for learning computational issues of simplex type algorithms. More speciﬁcally, the ﬁrst software assists students in the selection of the linear programming algorithm and basis update method for solving their linear programming problems, while the second software builds on the ﬁrst one and goes further to explore all different steps of the linear programming algorithms. Two linear programming algorithms were incorporated in the software: (i) Revised Simplex Algorithm, and (ii) Exterior Primal Simplex Algorithm. The web-based DSS included a variety of diﬀerent methods for the diﬀerent steps of these algorithms. More speciﬁcally, ten scaling techniques, ﬁve basis update methods and eight pivoting rules have been implemented in the software. The main aim of these educational tools is not to teach students how to solve their linear programming problems, but to assist them to select the appropriate method for each step of the linear programming algorithm.

8.2

Future Research

A subject for future work is the implementation on GPUs of other scaling, pivoting and basis update techniques. More speciﬁcally, the scaling techniques presented in Chapter 3 are Gauss - Seidel based versions (perform ﬁrst a scaling of the rows and 158

then a scaling of the columns). It would be very interesting to implement Jacobi based scaling techniques like binormilization scaling technique. A GPU-based implementation of binormilization scaling technique proposed by Elble and Sahinidis [25] and showed smaller convergence rate, but high speedup in dense large scale LPs. Furthermore, from the six pivoting rules presented in Chapter 4 only Steepest Edge rule can be eﬃciently implemented on a GPU. It would be valuable to implement GPU-based implementations of other approximate methods of Steepest Edge rule [110] [115]. Moreover, in Chapter 5 we studied three diﬀerent types of basis update methods: (i) Product Form of the Inverse, (ii) a Modiﬁcation of the Product Form of the Inverse, and (iii) methods that compute the explicit basis inverse, i.e. LU decomposition, MATLAB’s built-in function inv, and Gaussian elimination. It would be interesting to implement GPU-based implementations of other LU factorization methods [2] [29] [99] [108] [102] [37]. The GPU-based primal-dual exterior point simplex algorithm presented in Chapter 6 is the ﬁrst GPU-based exterior point algorithm developed for the solution of a linear programming problem. It would be valuable to experiment with other primal-dual exterior point simplex algorithm variants. Moreover, we plan to port our implementation using CUDA Fortran in order to take advantage of a high performance computing language and compare it with other state-of-the-art solvers, like CPLEX and GUROBI. Finally, it would be very interesting to examine the performance of the GPU-based PDEPSA algorithm on more modern GPUs and even better to expand this implementation to a hybrid version taking advantage of multi-core CPUs and multi-GPUs.

159

160

Bibliography [1]

Badr, E.S., Moussa, M., Papparrizos, K., Samaras, N., & Sifaleras, A. (2006). Some computational results on MPI parallel implementations of dense simplex method. In Proceedings of World Academy of Science, Engineering and Technology, 23, (CISE 2006), Cairo, Egypt, 39–32.

[2]

Bartels, R.H., & Golub, G.H. (1969). The simplex method of linear programming using LU decomposition. Communications of the ACM, 12, 266–268.

[3]

Bazaraa, M.S., Jarvis, J.J., & Sherali, H.D. (1990). Linear Programming and Network Flows. John Wiley & Sons, Inc.

[4]

Benhamadou, M. (2002). On the simplex algorithm ’revised form’. Advances in Engineering Software, 33(11-12), 769–777.

[5]

Benichou, M., Gautier, J., Hentges, G., & Ribiere, G. (1977). The eﬃcient solution of large-scale linear programming problems. Mathematical Programming, 13, 280–322.

[6]

Bhargava, H.K., Power, D.J., & Sun, D. (2007). Progress in Web-based Decision Support Technologies. Decision Support Systems, 43(4), 1083–1095.

[7]

Bieling, J., Peschlow, P., & Martini, P. (2010). An eﬃcient GPU implementation of the revised Simplex method. In Proceddings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Atlanta, USA.

[8]

Bixby, E.R. (1992). Implementing the simplex method: The initial basis. ORSA Journal on Computing, 4, 267–284. 161

[9]

Bixby, R.E., & Martin, A. (2000). Parallelizing the dual simplex method. INFORMS Journal on Computing, 12(1), 45–56.

[10] Bland, R.G. (1977). New ﬁnite pivoting rules for the simplex method. Mathematics of Operations Research, 2(2), 103–107. [11] Borgwardt, H.K. (1982). The average number of pivot steps required by the simplex method is polynomial. Zeitschrift fur Operational Research, 26(1), 157– 177. [12] Brayton, R.K., Gustavson, F.G., & Willoughby, R.A. (1970) Some results on sparse matrices. Mathematics of Computation, 24, 937–954. [13] Carolan, W.J., Hill, J.E., Kennington, J.L., Niemi, S., & Wichmann, S.J. (1990). An Empirical Evaluation of the KORBX Algorithms for Military Airlift Applications. Operations Research, 38(2), 240–248. [14] Chvatal, V. (1983). Linear Programming. W.H. Freeman and Company, New York, USA. [15] Clausen, J. (1987). A note on Edmonds-Fukuda’s pivoting rule for the simplex method. Eur. J. Oper. Res., 29, 378–383. [16] Curtis, A.R., & Reid, J.K. (1972). On the automatic scaling of matrices for Gaussian elimination. J. Inst. Math. Appl., 10, 118–124. [17] Cvetanovic, Z., Freedman, E., & Nofsinger, C. (1991). Eﬃcient decomposition and performance of parallel PDE, FFT, Monte Carlo simulations, simplex, and sparse solvers. The Journal of Supercomputing, 5, 219–238. [18] Dantzig, G.B. (1953). Computational Algorithm of the Revised Simplex Method. RAND Report RM-1266, The RAND Corporation, Santa Monica, CA. [19] Dantzig, G.B., & Orchard-Hays, W. (1954). The product form of the inverse in the simplex method. Math. Comp., 8, 64–67. 162

[20] Dantzig, G.B. (1963). Linear programming and extensions. Princeton, NJ: Princeton University Press. [21] de Buchet, J. (1966). Experiments and statistical data on the solving of largescale linear programs. In Hertz, D.A., Melese, J. (eds.) Proceedings of the Fourth International Conference on Operational Research, Wiley-Interscience, New York, 3–13. [22] Dikin, I.I. (1967). Iterative Solution of Problems of Linear and Quadratic Programming. Soviet Mathematics Doklady, 8, 674–675. [23] Dongarra, J., & Sullivan, F. (2000). Guest Editors’ Introduction: The Top 10 Algorithms. Computing in Science & Engineering, 22–23. [24] Eckstein, J., Boduroglu, I.I., Polymenakos, L.C., & Goldfarb, D. (1995). Dataparallel implementations of dense simplex methods on the Connection Matching CM-2. ORSA Journal on Computing, 7(4), 402–416. [25] Elble, J.M., & Sahinidis, N.V. (2008). Sparse Matrix Binormalization on a GPU. In Proceedings of the 9th International Workshop on State-of-the-Art in Scientiﬁc and Parallel Computing, Trondheim, Norway. [26] Elble, J.M., & Sahinidis, N.V. (2012). Scaling linear optimization problems prior to application of the simplex method. Computational Optimization and Applications, 52(2), 345–371. [27] Faye, R.M., Mora-Camino, F., Sawadogo, S., & Niang, A. (1998). An Intelligent Decision Support System for Irrigation System Management. In Systems, Man, and Cybernetics, IEEE International Conference, 4, 3908–3913. [28] Forrest, J.J., & Goldfarb, D. (1992). Steepest-edge simplex algorithms for linear programming. Mathematical programming, 57(1-3), 341–374. [29] Forrest, J.J.H., & Tomlin, J.A. (1972). Updated triangular factors of the basis to maintain sparsity in the product form simplex method. Mathematical Programming, 2, 263–278. 163

[30] Frisch, K.F. (1955). The logarithmic potential method of convex programming. Technical report, University Institute of Economics, Oslo, Norway. [31] Fukuda, K. (1982). Oriented matroid programming. Ph.D. Thesis, Waterloo University, Waterloo, Ontario, Canada. [32] Fulkerson, D.R., & Wolfe, P. (1962). An algorithm for scaling matrices. SIAM Rev., 4, 142-146. [33] Gade-Nielsen, N.F., Jorgensen, J.B., & Dammann, B. (2012). MPC Toolbox with GPU Accelerated Optimization Algorithms. In Proceedings of the 10th European Workshop on Advanced Control and Diagnosis (ACD 2012), Copenhagen, Denmark. [34] G¨artner, B. (1995). Randomized optimization by Simplex-type methods. PhD thesis, Freien Universit¨at, Berlin. [35] Gay, D.M. (1985). Electronic mail distribution of linear programming test problems. Mathematical Programming Society COAL Newsletter, 13, 10–12. [36] Ghodsypour, S.H., & O’Brien, C. (1998). A Decision Support System for Supplier Selection Using an Integrated Analytic Hierarchy Process and Linear Programming. International Journal of Production Economics, 56, 199–212. [37] Goldfarb, D. (1977). On the Bartels-Golub decomposition for linear programming bases. Mathematical Programming, 13, 272–279. [38] Goldfarb, D., & Reid, J.K. (1977). A Practicable Steepest-Edge Simplex Algorithm. Mathematical Programming, 12(3), 361–371. [39] Gondzio, J. (1992). Splitting dense columns of constraint matrix in interior point methods for large-scale linear programming. Optimization, 24, 285–297. [40] Gondzio, J. (1996). Multiple centrality corrections in a primal-dual method for linear programming. Computational Optimization and Applications, 6, 137– 156. 164

[41] Green, L. (2010). Exploring linear programming. Lake Tahoe Community College. http://www.ltcconline.net/greenl/java/IntermedCollegeAlgebra/ LinearProgramming/LinearProgramming.html (Last accessed on 16/11/2013) [42] Hall, J.A.J., & McKinnon, K.I.M. (1995). An asynchronous parallel revised simplex algorithm. Tech. Rep. MS 95-50, Department of Mathematics and Statistics, University of Edinburgh. [43] Hall, J.A.J., & McKinnon, K.I.M. (1996). PARSMI, a parallel revised simplex algorithm incorporating minor iterations and Devex pricing. In Wasniewski, J., Dongarra, J., Madsen, K., Olesen, D. (eds.) Applied Parallel Computing, Lecture Notes in Computer Science, Springer, 1184, 67–76. [44] Hall, J.A.J., & McKinnon, K.I.M. (1998). ASYNPLEX, an asynchronous parallel revised simplex algorithm. Annals of Operations Research, 81, 27–49. [45] Hall, J., & Baird, M. (2002). LP Explorer 1.0. University of Edinburgh. http: //www.maths.ed.ac.uk/LP-Explorer/ (Last accessed on 16/11/2013) [46] Hamming, R.W. (1971). Introduction to Applied Numerical Analysis. McGrawHill, New York. [47] Harris, P.M.J. (1973). Pivot selection methods for the Devex LP code. Mathematical Programming, 5, 1–28. [48] Ho, J.K. (1977). Nested decomposition of a dynamic energy model. Management Science 23, 1022–1026. [49] Ho, J., & Sundarraj, R. (1994). On the eﬃcacy of distributed simplex algorithms for linear programming. Computational Optimization and Applications, 3, 349– 363. [50] Hoﬀman, A.J., Mannos, M., Sokolowsky, D., & Wiegman, N. (1953). Computational experience in solving linear programs. Journal of the Society for Industrial and Applied Mathematics, 1, 17-33. 165

[51] Jung, J.H., & OLeary, D.P. (2008). Implementing an interior point method for linear programs on a CPU-GPU system. Electronic Transaction on Numerical Analysis, 28, 174-189. [52] Karmarkar, N.K. (1984). A new polynomialtime algorithm for linear programming. Combinatorica, 4, 373-395. [53] Khachiyan, L.G. (1979). A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20, 191-194. [54] Klee, V., & Minty, G.J. (1972). How good is the simplex algorithm. In O. Shisha (ed.) Inequalities - III. New York and London: Academic Press Inc. [55] Kydd, C. (2010). Linear programming applet. University of Delaware. http:// www.udel.edu/present/tools/lpapplet/lpapplet.html (Last accessed on 16/11/2013) [56] Lalami, M.E., Boyer, V., & El-Baz, D. (2011). Eﬃcient Implementation of the Simplex Method on a CPU-GPU System. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW 2011), Washington, USA, 1999–2006. [57] Lalami, M.E., El-Baz, D., & Boyer, V. (2011). Multi gpu implementation of the simplex algorithm. In Proceedings of the 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), Banﬀ, Canada, 179–186. [58] LAPACK Library (2013). http://www.cudatools.com (Last accessed on 16/11/2013) [59] Lappi, J., Nuutinen, T., & Siitonen, M. (1996). A Linear Programming Software for Multilevel Forest Management Planning. Management Systems for a Global Economy with Global Resource Concerns, 470–482. 166

[60] Larsson, T. (1993). On scaling linear programsSome experimental results. Optimization, 27, 335-373. [61] Lauer, J., Jacobs, L.W., Brusco, M.J., & Bechtold, S.E. (1994). An Interactive, Optimization-based Decision Support System for Scheduling Part-time, Computer Lab Attendants. Omega, 22(6), 613–626. [62] Lazaridis, V., Paparrizos, K., Samaras, N., & Sifaleras, A. (2007). Visual LinProg: A web-based educational software for linear programming. Computer Applications in Engineering Education, 15(1), 1–14. [63] Lentini, M., Reinoza, A., Teruel, A., & Guillen, A. (1995). SIMPAR: a parallel sparse simplex. Computational and Applied Mathematics, 14(1), 49–58. [64] Li, J., Lv, R., Hu, X., & Jiang, Z. (2011). A GPU-Based Parallel Algorithm for Large Scale Linear Programming Problem. In Watada, J., Phillips-Wren, G., Jai, L., Howlett, R.J. (eds.) Intelligent Decision Technologies, SIST 10, Springer Berlin Heidelberg, Springer-Verlag, Berlin, 37–46. [65] Lim, S., Kim, G., & Park, S. (2003). A comparative study between various LU update methods in the simplex method. Journal of the Military Operations Research Society of Korea, 29(1). [66] Louren¸co, J.C., Morton, A., & Bana e Costa, C.A. (2012). PROBE-A Multicriteria Decision Support System for Portfolio Robustness Evaluation. Decision Support Systems, 54, 534–550. [67] Lustig, I.J., Marsten, R.E., & Shanno, D.F. (1992). On implementing Mehrotra’s predictor corrector interior point method for linear programming. SIAM J. Optimization, 2, 435–449. [68] Manne, A.S. (1970). Suﬃcient conditions for optimality in an inﬁnite horizon development plan. Econometrica, 38, 18–38. 167

[69] Markowitz, H. (1957). The elimination form of the inverse and its applications to linear programming. Management Science, 3, 255–269. [70] Maros, I., & Khaliq, M.H. (1999). Advances in design and implementation of optimization software. European Journal of Operational Research, 140(2), 322– 337. [71] Mavrotas, G. (2000). Multiple Objective Linear Programming under Uncertainty: Development of a DSS and Application in Energy Planning. PhD Thesis, School of Chemical Engineering, National Technical University of Athens, Greece. [72] McCoy, P.F., & Tomlin, J.A. (1974). Some experiments on the accuracy of three methods of updating the inverse in the simplex method. Technical Report, Stanford University. [73] Mehrotra, S. (1992). On the implementation of a primal-dual interior point method. SIAM J. Optimization, 2, 575–601. [74] M´esz´aros, C. (2013). Linear programming test problems. http://www.sztaki. hu/~meszaros/public_ftp/lptestset/misc/ (Last accessed on 16/11/2013) [75] Meyer, X., Albuquerque, P., & Chopard, B. (2011). A multi-GPU implementation and performance model for the standard simplex method. In Proceedings of the 1st International Symposium and 10th Balkan Conference on Operational Research, Thessaloniki, Greece, 312–319. [76] Murty, K.G. (1974). A note on a Bard type scheme for solving the complementarity problem. Opsearch, 11, 123–130. [77] Nazareth, J.L. (1987). Computer Solution of Linear Programs. Oxford University Press, Oxford, UK. [78] NVIDIA CUDA-CUBLAS Library 2.0, NVIDIA Corp. (2013). http:// developer.nvidia.com/object/cuda.html (Last accessed on 16/11/2013) 168

[79] Ord´on ˜ez, F., & Freund, R. (2003). Computational experience and the explanatory value of condition measures for linear optimization. SIAM J. Optimization, 14(2), 307–333. [80] Papadimitriou, C.H., & Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc. [81] Paparrizos, K. (1991). An infeasible exterior point simplex algorithm for assignment problems. Mathematical Programming, 51(1–3), 45–54. [82] Paparrizos, K. (1993). An exterior point simplex algorithm for (general) linear programming problems. Annals of Operations Research, 47, 497–508. [83] Paparrizos, K., Samaras, N., & Stephanides, G. (2003). An eﬃcient simplex type algorithm for sparse and dense linear programs. European Journal of Operational Research, 148(2), 323–334. [84] Ploskas, N., Samaras, N., & Papathanasiou, J. (2012). LU decomposition in the revised simplex algorithm. In Proceedings of the 23th National Conference, Hellenic Operational Research Society, Athens, Greece, 77–81. [85] Ploskas, N., & Samaras, N. (2013). Basis Update on Simplex Type Algorithms. In Proceedings of the EWG-DSS Thessaloniki 2013, Thessaloniki, Greece, 11. [86] Ploskas, N., & Samaras, N. (2013). A Computational Comparison of Basis Updating Schemes for the Simplex Algorithm on a CPU-GPU System. American Journal of Operations Research, 3, 497–505. [87] Ploskas, N., & Samaras, N. (2013). Computational Comparison of Pivoting Rules for the Revised Simplex Algorithm. In Proceedings of the XI Balkan Conference on Operational Research, Belgrade, Serbia, 309–317. [88] Ploskas, N., Samaras, N., & Margaritis, K. (2013). A parallel implementation of the revised simplex algorithm using OpenMP: some preliminary results. Optimization Theory, Decision Making, and Operational Research Applications, Springer Proceedings in Mathematics & Statistics, 31, 163–175. 169

[89] Ploskas, N., & Samaras, N. (2013). The impact of scaling on simplex type algorithms. In Proceedings of the 6th Balkan Conference in Informatics, ACM Proceedings, Thessaloniki, Greece, 17–22. [90] Ploskas, N., & Samaras, N. (2013). A Computational Comparison of Scaling Techniques for Linear Optimization Problems on a GPU. International Journal of Computer Mathematics (accepted with minor changes). [91] Ploskas, N., & Samaras, N. (2013). GPU Accelerated Pivoting Rules for the Simplex Algorithm. Journal of Systems and Software (submitted for publication). [92] Ploskas, N., Samaras, N., & Papathanasiou, J. (2013). A Web-Based Decision Support System Using Basis Update on Simplex Type Algorithms. In J.E. Hernandez et al. (eds.) EWG-DSS 2012, LNBIP, 164, 102–114. [93] Ploskas, N., & Samaras, N. (2013). Combining Pivoting Rules for the Revised Simplex Algorithm. Optimization Letters (submitted for publication). [94] Ploskas, N., & Samaras, N. (2013). Eﬃcient GPU-based Implementations of Simplex Type Algorithms. Applied Mathematics and Computation (submitted for publication). [95] Ploskas, N., Samaras, N., & Papathanasiou, J. (2013). A Decision Support System for Solving Linear Programming Problems. In Book of Abstracts of the EURO Mini-Conference GRAZ-2013 on Collaborative Decision Systems in Economics, Complex Societal & Environmental Applications, Graz, Austria, 4. [96] Ploskas, N., Samaras, N., & Papathanasiou, J. (2013). A Decision Support System for Solving Linear Programming Problems. International Journal of Decision Support Systems Technology (submitted for publication). [97] Power, D.J. (2000). Web-based and Model-driven Decision Support Systems: Concepts and Issues. In Proceedings of the Americas Conference on Information Systems, Long Beach, California. 170

[98] Power, D.J., & Kaparthi, S. (2002). Building Web-based Decision Support Systems. Studies in Informatics and Control, 11(4), 291–302. [99] Reid, J. (1982). A sparsity-exploiting variant of the Bartels-Golub decomposition for linear programming bases. Mathematical Programming, 24, 55–69. [100] Renegar, J. (1988). A polynomial-time algorithm, based on Newtons method, for linear programming. Mathematical Programming, 40, 59-93. [101] Samaras, N. (2001). Computational improvements and eﬃcient implementation of two path pivoting algorithms. Ph.D. dissertation, Department of Applied Informatics, University of Macedonia. [102] Saunders, M. (1976). A fast and stable implementation of the simplex method using Bartels-Golub updating. In Bunch, J., Rachev, S.T. (eds.) Sparse Matrix Computation, Academic Press, New York, 213–226. [103] Shor, N.Z. (1977). Cut-oﬀ Method with Space Extension in Convex Programming Problems. Cybernetics, 13, 94–96. [104] Shepard, B. (2010). Linear Programming and Pivoting in 2D. The Computational Geometry Lab at McGill. http://cgm.cs.mcgill.ca/~beezer/cs601/ main.htm (Last accessed on 16/11/2013) [105] Shu, W. (1995). Parallel implementation of a sparse simplex algorithm on MIMD distributed memory computers. Journal of Parallel and Distributed Computing, 31(1), 25–40. [106] Smith, E., Gondzio, J., & Hall, J. (2012). GPU acceleration of the matrixfree interior point method. In Parallel Processing and Applied Mathematics, Springer Berlin Heidelberg, 681–689. [107] Spampinato, D.G., & Elster, A.C. (2009). Linear optimization on modern GPUs. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), Rome, Italy. 171

[108] Suhl, L.M., & Suhl, U.H. (1993). A fast LU update for linear programming. Annals of Operations Research, 43(1), 33–47. [109] Stunkel, C.B., & Reed, D.A. (1988). Hypercube implementation of the simplex algorithm. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, 2, New York, USA, 1473–1482. ´ [110] Swietanowski, A. (1988). A new steepest edge approximation for the simplex method for linear programming. Computational Optimization and and Applications, 10(3), 271–281. [111] Thomadakis, M.E. (1994). Implementation and Evaluation of Primal and Dual Simplex Methods with Diﬀerent Pivot-Selection Techniques in the LPBench Environment. A Research Report. Texas A&M University, Department of Computer Science. [112] Thomadakis, M.E., & Liu, J.C. (1996). An eﬃcient steepest-edge simplex algorithm for SIMD computers. In Proceedings of the 10th International Conference on Supercomputing (ICS 1996), Philadelphia, Pennsylvania, USA, 286–293. [113] Tomlin, J.A. (1975). On scaling linear programming problems. Math. Program. Stud. 4, 146-166. [114] Van Vuuren, J.H., & Grundlingh, W.R. (2001). An Active Decision Support System for Optimality in Open Air Reservoir Release Strategies. International Transactions in Operational Research, 8(4), 439–464. [115] Vanderbei, R.J. (2001). Linear Programming: Foundations and Extensions (2nd ed.). Kluwer Academic Publishers. [116] Vanderbei, R.J. (2013). Simplex Pivot Tool. Princeton University. http:// campuscgi.princeton.edu/~rvdb/JAVA/pivot/simple.html (Last accessed on 16/11/2013) 172

[117] Venkataramanan, M.A., & Bornstein, M. (1991). A Decision Support System for Parking Space Assignment. Mathematical and Computer Modelling, 15(8), 71–76. [118] Von Neumann, J. (1947). On a maximization problem. Technical report, Institute for Advanced Study, Princeton, NJ, USA. [119] Wang, Z. (1989). A modiﬁed version of the Edmonds-Fukuda algorithm for LP problems in the general form. Asia-Paciﬁc J. Oper. Res. [120] Wright,

D. (2010). Animated linear programming applet. ST. Ed-

ward’s University Computer Sciences Advanced Computing Lab. http:// www.cs.stedwards.edu/~wright/linprog/AnimaLP.html (Last accessed on 16/11/2013) [121] Yarmish, G., & Slyke, R. (2009). A distributed, scaleable simplex method. The Journal of Supercomputing, 49, 373–381. [122] Zadeh, N. (1980). What is the worst case behavior of the simplex algorithm? Technical report, Department of Operations Research, Stanford University. [123] Zhang, S. (1991). On anti-cycling pivoting rules for the simplex method. Oper. Res. Lett., 10, 189–192. [124] Zhang, Y. (1998). Solving Large-Scale Linear Programs by Interior-Point Methods under the MATLAB Environment. Optimization Methods and Software, 10(1), 1–31. [125] Ziegler, G.M. (1990). Linear programming in oriented matroids. Technical Report No. 195, Institut f¨ ur Mathematik, Universit¨at Augsburg, Germany.

173