GPU pytorch compile options for older card - i.e. GT710


I was curious if anyone has managed to compile PyTorch to get DeepStack working with older cards where the Compute Capability is <= 3.5?

I have successfully compiled PyTorch as per GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
and copied the “torch” directory under DeepStack/windows_packages, but I get the following error:

Traceback (most recent call last):
  File "C:\DeepStack\intelligencelayer\shared\", line 21, in <module>
    import torch
  File "C://DeepStack\windows_packages\torch\", line 135, in <module>
    raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C://DeepStack\windows_packages\torch\lib\backend_with_compiler.dll" or one of its dependencies.

Any ideas how to get this working?

Thanks again!!


Just got this working with the Deepstack-GPU-2022.01.1 installer!
So much faster on my older machine.

The problem above was that not all of the libs were copied over from my compile directory.
So I just needed to copy the contents of the folder pytorch-1.10.1\pytorch\build\lib
into C:\DeepStack\windows_packages\torch\lib
and the error message went away; then I had to do the same with torchvision.
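For anyone who wants to script that copy step instead of doing it by hand, the fix boils down to copying every file from the build output into the installed torch\lib folder (overwriting what the installer put there). A minimal sketch, assuming the paths from the posts above; the function name is my own, not anything from DeepStack or PyTorch:

```python
import shutil
from pathlib import Path

def copy_missing_libs(build_lib: str, torch_lib: str) -> list[str]:
    """Copy every file from the PyTorch build output into the installed
    torch lib folder, returning the names copied. Existing files are
    overwritten so the freshly built DLLs win."""
    src, dst = Path(build_lib), Path(torch_lib)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.iterdir():
        if f.is_file():
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return sorted(copied)

# Example with the paths from this thread (adjust to your machine):
# copy_missing_libs(r"pytorch-1.10.1\pytorch\build\lib",
#                   r"C:\DeepStack\windows_packages\torch\lib")
```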

This is with:
Driver Version: 472.98
CUDA Version: 11.4
cuDNN: 8.2.4


Thanks for reporting this @loopy12 . We will review the Windows GPU build again to figure out what could be wrong that caused the PyTorch build to be missing.

1 Like

With “older” (aka MUCH cheaper) cards such as the GT710 being the go-to for tinkerers and testers, could I ask for the inclusion of these compiled kernel images in DeepStack’s GPU version?

1 Like

@OlafenwaMoses To clarify: the installer works fine, but it does not include support for the older cards, because PyTorch dropped Compute Capability <= 3.5 from their default builds; that support can still be enabled when compiling from source. So I compiled PyTorch with the support, copied the compiled libs over the Deepstack-GPU install, and it worked.
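For anyone else attempting the recompile: PyTorch's source build reads the TORCH_CUDA_ARCH_LIST environment variable (e.g. `set TORCH_CUDA_ARCH_LIST=3.5` before building on Windows) to decide which GPU architectures to compile in. As a rough illustration of what that list means, each entry expands to an nvcc `-gencode` flag; this is a sketch of the semantics, not PyTorch's actual build code:

```python
def arch_list_to_gencode(arch_list: str) -> list[str]:
    """Expand a TORCH_CUDA_ARCH_LIST-style string such as "3.5;5.0"
    into nvcc -gencode flags. A "+PTX" suffix also embeds PTX so the
    driver can JIT-compile for GPUs newer than the listed ones."""
    flags = []
    for arch in arch_list.replace(" ", ";").split(";"):
        if not arch:
            continue
        ptx = arch.endswith("+PTX")
        num = arch.removesuffix("+PTX").replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if ptx:
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(arch_list_to_gencode("3.5"))
# ['-gencode=arch=compute_35,code=sm_35']
```

So setting the list to "3.5" before building is what pulls Kepler-era cards like the GT710 back in.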


Thanks for sharing this @loopy12 . To address issues with running DeepStack on different generations of GPUs, we are working on the following for future releases:

  • the default :gpu tag will support latest GPUs and CUDA version
  • there will be other GPU tags for older CUDA versions

cc: @john

1 Like

Great! I can confirm that DeepStack works well on the GT710 and runs about 3-4x faster than the CPU version on the same machine.


@loopy12 I have a GT 710 and cannot get DeepStack GPU to work either. Can you share your compiled PyTorch with support for compute <= 3.5? Thanks!

You can try it. You will probably need to make sure you have everything installed exactly as in my second post above (i.e. the same driver, CUDA, and cuDNN versions).

The two zip files below (check inside for where to put the contents) should look like what is already there from the DeepStack GPU installer:
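Before and after swapping the binaries in, it is worth checking what the torch you are actually running was built against. A quick sanity check (this is a generic snippet of mine, not part of DeepStack; it runs fine even without a working GPU):

```python
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)      # e.g. "11.4", or None for a CPU-only build
print("cuDNN:", torch.backends.cudnn.version())    # e.g. 8204 for 8.2.4, or None
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # (major, minor) compute capability, e.g. (3, 5) for a GT 710
    print("device capability:", torch.cuda.get_device_capability(0))
```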


Hey @loopy12, first… thanks for finding the issue with torch and torchvision only being built for Compute Capability > 3.5. Indeed I am running an even older card (a GeForce GT 530, Compute Capability 2.1), so CUDA detects it as a valid GPU… but torch ignores it, and DeepStack then apparently fails while trying to deserialize the model…

I did try your recompiled torch/torchvision, but had no real success… I installed the same v11.4 CUDA and cuDNN and overwrote a fresh DeepStack install with your binaries, but I still get the same result in the log file:
(NOT complaining, I understand there is no warranty expressed or implied, just sharing my experience…)

C://DeepStack\windows_packages\torch\cuda\ UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "C:\DeepStack\intelligencelayer\shared\", line 68, in <module>
    detector = YOLODetector(model_path, reso, cuda=CUDA_MODE)
  File "C:\DeepStack\intelligencelayer\shared\.\", line 31, in __init__
    self.model = attempt_load(model_path, map_location=self.device)
  File "C:\DeepStack\intelligencelayer\shared\.\models\", line 158, in attempt_load
    torch.load(w, map_location=map_location)["model"].float().fuse().eval()
  File "C://DeepStack\windows_packages\torch\", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C://DeepStack\windows_packages\torch\", line 882, in _load
    result = unpickler.load()
  File "C://DeepStack\windows_packages\torch\", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "C://DeepStack\windows_packages\torch\", line 846, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "C://DeepStack\windows_packages\torch\", line 827, in restore_location
    return default_restore_location(storage, str(map_location))
  File "C://DeepStack\windows_packages\torch\", line 175, in default_restore_location
    result = fn(storage, location)
  File "C://DeepStack\windows_packages\torch\", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "C://DeepStack\windows_packages\torch\", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
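Reading that traceback, the failure is in torch's CUDA deserializer: the DeepStack checkpoint stores CUDA tensors, but torch.cuda.is_available() is returning False (per the UserWarning at the top), so the problem is upstream of model loading. The error message's own suggestion, map_location, does make the load itself succeed on CPU; a tiny self-contained round trip showing the same mechanism (this only proves the checkpoint file loads, it does not give you GPU inference):

```python
import os
import tempfile
import torch

# Save a toy checkpoint, then reload it the way the error message
# suggests: map_location forces all storages onto the CPU, so the
# load succeeds even when torch.cuda.is_available() is False.
path = os.path.join(tempfile.mkdtemp(), "toy.pt")
torch.save({"w": torch.ones(2, 2)}, path)

ckpt = torch.load(path, map_location=torch.device("cpu"))
print(ckpt["w"].device)  # cpu
```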

I think the 2.1 compute capability is the problem. If I recall correctly, when I compiled, I only included >= 3.5.
You might be able to compile with a lower version, but I am not sure…
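You can actually see which compute capabilities a given torch build contains without recompiling anything (this is a standard torch call; it returns an empty list on a CPU-only build or when no CUDA device is usable):

```python
import torch

# Lists the SM architectures baked into this torch build,
# e.g. ['sm_35', 'sm_50', ...]. A GT 530 (compute 2.1) would need
# an sm_21 entry here, which modern CUDA toolkits no longer emit.
print(torch.cuda.get_arch_list())
```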

FYI, the most informative tool for making sure you have the correct versions is the deviceQuery.exe tool from the NVIDIA demo_suite.

This is the result when run on my system:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\extras\demo_suite>deviceQuery.exe
deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GT 710"
  CUDA Driver Version / Runtime Version          11.4 / 11.3
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            954 MHz (0.95 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               zu bytes
  Total amount of shared memory per block:       zu bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          zu bytes
  Texture alignment:                             zu bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.3, NumDevs = 1, Device0 = NVIDIA GeForce GT 710
Result = PASS

Wow! Thanks for this. This tells me Compute Capability is not some scalar performance multiple, but a versioned compatibility number. When I try to run this on my system, I get:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\extras\demo_suite\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

So it is not clear why the CUDA installer detects my card and installs CUDA if the GPU’s CUDA Compute Version (which seems a MUCH better name for it) is not compatible… days of chasing my tail…

Again, much gratitude for the knowledge here!!!

If you want to try more tail chasing: according to here, this is a CUDA 11.x limitation (it requires a card with compute capability >= 3.5), but CUDA 10.x might work.