For an application, choosing the right memory object transfer path can significantly improve performance. Below are some buffer bandwidth examples.

1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()

If the application has already allocated its memory with malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice. There are two ways to use this path.

The first:

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )

The pinning cost is incurred in step a; step d incurs no pinning cost. Typically the application performs steps a, b, c, and e once, and then repeatedly reads or modifies the data in pinnedMemory in combination with step d. (A minimal code sketch of this pattern is given after section 3.)

The second:

clEnqueueRead/WriteBuffer is used directly on the user's memory buffer. Before data can be copied (host to device), the pages must first be pinned (page-locked), and only then does the transfer take place. This path reaches roughly 2/3 of the peak interconnect bandwidth.

2. clEnqueueCopyBuffer() on a pre-pinned host buffer

Similar to 1, clEnqueueCopyBuffer transfers data at peak interconnect bandwidth when a pre-pinned buffer is used:

a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )

Or, in the other direction:

g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )

Because the pinned memory resides in host memory, the map and unmap calls do not move any data, and the CPU can operate on these pinned buffers at host memory bandwidth.

3. clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a device buffer

For a buffer whose memory was allocated with malloc or mmap, the transfer cost includes, in addition to the interconnect transfer, a memcpy that copies the data into the mapped device buffer.

a. Data transfer from host to device buffer.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
Because the buffer is mapped write-only, no data is transferred from the device to the host, so the map is low cost. A pointer to a pinned host buffer is returned.
2. The application fills the host buffer via memset( ptr ), memcpy( ptr, srcptr ), fread( ptr ), or direct CPU writes. These operations run at full host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device at peak interconnect bandwidth.

b. Data transfer from device buffer to host.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ, .. )
This command initiates the device-to-host transfer; the data is moved into a pre-pinned temporary buffer at peak interconnect bandwidth. A pointer to the pinned memory is returned.
2. The application reads and processes the data, or executes memcpy( dstptr, ptr ), fwrite( ptr ), or similar functions. Because the buffer resides in host memory, these operations run at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
Because the buffer was mapped read-only, no actual data transfer occurs, so the unmap is low cost.
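To make the staging pattern of sections 1 and 2 concrete, here is a minimal host-side sketch in plain C. The function name and its parameters (ctx, queue, deviceBuffer, srcData, nBytes) are illustrative and not part of the original text; error handling is reduced to early returns.

/* Sketch, assuming a valid context/queue and a device buffer created elsewhere. */
#include <CL/cl.h>
#include <string.h>

static cl_int upload_via_pinned_staging(cl_context ctx, cl_command_queue queue,
                                        cl_mem deviceBuffer,   /* step b, created by the caller */
                                        const void *srcData, size_t nBytes)
{
    cl_int err;

    /* a. Pre-pinned host-side staging buffer */
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nBytes, NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* c. Map it once; the pinning cost is paid here, not per transfer */
    void *pinnedMemory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE,
                                            CL_MAP_WRITE, 0, nBytes,
                                            0, NULL, NULL, &err);
    if (err != CL_SUCCESS) { clReleaseMemObject(pinnedBuffer); return err; }

    /* Fill the pinned memory at host memory bandwidth ... */
    memcpy(pinnedMemory, srcData, nBytes);

    /* d. ... then transfer to the device buffer; no pinning cost is incurred here */
    err = clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, nBytes,
                               pinnedMemory, 0, NULL, NULL);

    /* e. Unmap and release the staging buffer */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, pinnedMemory, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(pinnedBuffer);
    return err;
}

In a real application that transfers repeatedly, pinnedBuffer and pinnedMemory would be created and mapped once (steps a and c) and reused across many step-d transfers; that reuse is where the bandwidth advantage over plain clEnqueueWriteBuffer on unpinned memory comes from.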
4. Direct host access to a device zero copy buffer

This access pattern allows data transfer and GPU computation to be overlapped, which is useful for sparse writes or updates.

a. A zero copy device buffer is created with:

buf = clCreateBuffer( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )

The CPU can access this buffer directly through the uncached write-combining (WC) path. Typically a double-buffering scheme is used: while the GPU processes the data in one buffer, the CPU fills the other. A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Due to the WC path, it is a good design choice to try to align writes to the cache line size, and to pick the write block size as large as possible.

b. Transfer from the host to the device.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr, srcptr ), or direct CPU writes. The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the buffer continues to reside on the device.

c. If the buffer content must be read back later, use clEnqueueReadBuffer( .., buf, .. ) or clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ). This bypasses slow host reads through the uncached path.

5. Direct GPU access to host zero copy memory

This option allows the GPU to read or write host memory directly. A GPU kernel can import data from the host without an explicit transfer, and write data directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to fold the transfer latency into the kernel execution time.

a. The application creates a zero copy host buffer:

buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )

b. Next, the application modifies or reads the zero copy host buffer.

1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ | CL_MAP_WRITE, .. )
This operation is very low cost because it is a map of a buffer already residing in host memory.
2. The application modifies the data through memset( ptr ), memcpy (in either direction), or sparse or dense CPU reads or writes. Since the application is modifying a host buffer, these operations take place at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.

c. The application runs clEnqueueNDRangeKernel(), using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of the kernel execution time. The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth. For discrete graphics cards, note that the resulting GPU kernel bandwidth is an order of magnitude lower than that of a kernel accessing a regular device buffer located on the device.

d. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
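As a rough host-side sketch of section 5's flow: the names ctx, queue, program, and the kernel name "update" are assumptions for illustration only, and the kernel source itself is not shown. The buffer creation and map/unmap steps of section 4 follow the same shape, with CL_MEM_USE_PERSISTENT_MEM_AMD in place of CL_MEM_ALLOC_HOST_PTR and the allocation residing on the device.

/* Sketch: GPU kernel reading/writing a zero copy host buffer directly. */
#include <CL/cl.h>

static cl_int run_on_zero_copy_host_buffer(cl_context ctx, cl_command_queue queue,
                                           cl_program program, size_t nFloats)
{
    cl_int err;
    size_t nBytes = nFloats * sizeof(cl_float);

    /* a. Zero copy host buffer; the allocation lives in host memory */
    cl_mem hostBuf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nBytes, NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* b. CPU-side initialization through a cheap map/unmap pair */
    cl_float *ptr = (cl_float *)clEnqueueMapBuffer(queue, hostBuf, CL_TRUE,
                                                   CL_MAP_WRITE, 0, nBytes,
                                                   0, NULL, NULL, &err);
    if (err != CL_SUCCESS) { clReleaseMemObject(hostBuf); return err; }
    for (size_t i = 0; i < nFloats; ++i)
        ptr[i] = 0.0f;                      /* runs at host memory bandwidth */
    clEnqueueUnmapMemObject(queue, hostBuf, ptr, 0, NULL, NULL);

    /* c. The kernel takes the buffer as a normal argument; its global reads and
     *    writes cross the interconnect, so the transfer is part of kernel time */
    cl_kernel k = clCreateKernel(program, "update", &err);  /* "update" is a placeholder name */
    err = clSetKernelArg(k, 0, sizeof(cl_mem), &hostBuf);
    err = clEnqueueNDRangeKernel(queue, k, 1, NULL, &nFloats, NULL, 0, NULL, NULL);

    /* d. After the kernel finishes, the CPU can map hostBuf again to read results */
    err = clFinish(queue);
    clReleaseKernel(k);
    clReleaseMemObject(hostBuf);
    return err;
}

Whether this beats an explicit copy depends on the access pattern: per the note in section 5.c, on discrete cards a kernel that streams large amounts of data should still prefer a regular device buffer, while small or latency-hidden I/O is where zero copy host memory pays off.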