Commit f71dcfe

ffelixg, gargsaumya, jahnvi480, and bewithgaurav authored
FIX: varchar columnsize does not account for utf8 conversion (#392)
### Work Item / Issue Reference

> GitHub Issue: #391

-------------------------------------------------------------------

### Summary

This PR enlarges the fetch buffer size for char/varchar columns by a factor of 4 to account for characters that take up more space in utf8 than in whichever encoding the database is using. It also adds a test, which passes at each commit and therefore tracks how the interface changes. I removed some of the fallback mechanisms, and I hope that the evolution of the error messages over the different iterations of the test shows why that is preferable.

For the SQLGetData path for wchars, a `columnSize == 0` check is added to avoid needing the fallback branch for `nvarchar(max)`. Previously, SQLGetData was called once with a buffer of length 0, and only then did the real attempt to fetch data start, after entering the fallback branch. `columnSize == 0` was actually also the only case where the fallback branch behaved correctly; anything else discarded the first `columnSize` characters. A test that covers all of latin1 and documents behavior with unmapped characters is added as well.

Note that if the database is using a utf8 collation, then a buffer of size `columnSize` would be enough, as it tells us the number of bytes and not the number of characters (this distinction only matters for utf8).
---------

Co-authored-by: gargsaumya <saumyagarg.100@gmail.com>
Co-authored-by: Jahnvi Thakkar <61936179+jahnvi480@users.noreply.github.com>
Co-authored-by: Gaurav Sharma <sharmag@microsoft.com>
1 parent 63376fd commit f71dcfe

3 files changed

Lines changed: 206 additions & 59 deletions

File tree

mssql_python/pybind/ddbc_bindings.cpp

Lines changed: 55 additions & 29 deletions
```diff
@@ -2924,7 +2924,9 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
                     row.append(
                         FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false, charEncoding));
                 } else {
-                    uint64_t fetchBufferSize = columnSize + 1 /* null-termination */;
+                    // Multiply by 4 because utf8 conversion by the driver might
+                    // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+                    uint64_t fetchBufferSize = 4 * columnSize + 1 /* null-termination */;
                     std::vector<SQLCHAR> dataBuffer(fetchBufferSize);
                     SQLLEN dataLen;
                     ret = SQLGetData_ptr(hStmt, i, SQL_C_CHAR, dataBuffer.data(), dataBuffer.size(),
```
```diff
@@ -2953,12 +2955,15 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
                             row.append(raw_bytes);
                         }
                     } else {
-                        // Buffer too small, fallback to streaming
-                        LOG("SQLGetData: CHAR column %d data truncated "
-                            "(buffer_size=%zu), using streaming LOB",
-                            i, dataBuffer.size());
-                        row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
-                                                      charEncoding));
+                        // Reaching this case indicates an error in mssql_python.
+                        // Theoretically, we could still compensate by calling SQLGetData or
+                        // FetchLobColumnData more often, but then we would still have to process
+                        // the data we already got from the above call to SQLGetData.
+                        // Better to throw an exception and fix the code than to risk returning corrupted data.
+                        ThrowStdException(
+                            "Internal error: SQLGetData returned data "
+                            "larger than expected for CHAR column"
+                        );
                     }
                 } else if (dataLen == SQL_NULL_DATA) {
                     LOG("SQLGetData: Column %d is NULL (CHAR)", i);
```
```diff
@@ -2995,7 +3000,7 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
             case SQL_WCHAR:
             case SQL_WVARCHAR:
             case SQL_WLONGVARCHAR: {
-                if (columnSize == SQL_NO_TOTAL || columnSize > 4000) {
+                if (columnSize == SQL_NO_TOTAL || columnSize == 0 || columnSize > 4000) {
                     LOG("SQLGetData: Streaming LOB for column %d (SQL_C_WCHAR) "
                         "- columnSize=%lu",
                         i, (unsigned long)columnSize);
```
```diff
@@ -3024,12 +3029,15 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
                         "length=%lu for column %d",
                         (unsigned long)numCharsInData, i);
                 } else {
-                    // Buffer too small, fallback to streaming
-                    LOG("SQLGetData: NVARCHAR column %d data "
-                        "truncated, using streaming LOB",
-                        i);
-                    row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
-                                                  "utf-16le"));
+                    // Reaching this case indicates an error in mssql_python.
+                    // Theoretically, we could still compensate by calling SQLGetData or
+                    // FetchLobColumnData more often, but then we would still have to process
+                    // the data we already got from the above call to SQLGetData.
+                    // Better to throw an exception and fix the code than to risk returning corrupted data.
+                    ThrowStdException(
+                        "Internal error: SQLGetData returned data "
+                        "larger than expected for WCHAR column"
+                    );
                 }
             } else if (dataLen == SQL_NULL_DATA) {
                 LOG("SQLGetData: Column %d is NULL (NVARCHAR)", i);
```
```diff
@@ -3291,8 +3299,15 @@ SQLRETURN SQLGetData_wrap(SqlHandlePtr StatementHandle, SQLUSMALLINT colCount, p
                     row.append(py::bytes(
                         reinterpret_cast<const char*>(dataBuffer.data()), dataLen));
                 } else {
-                    row.append(
-                        FetchLobColumnData(hStmt, i, SQL_C_BINARY, false, true, ""));
+                    // Reaching this case indicates an error in mssql_python.
+                    // Theoretically, we could still compensate by calling SQLGetData or
+                    // FetchLobColumnData more often, but then we would still have to process
+                    // the data we already got from the above call to SQLGetData.
+                    // Better to throw an exception and fix the code than to risk returning corrupted data.
+                    ThrowStdException(
+                        "Internal error: SQLGetData returned data "
+                        "larger than expected for BINARY column"
+                    );
                 }
             } else if (dataLen == SQL_NULL_DATA) {
                 row.append(py::none());
```
```diff
@@ -3434,7 +3449,9 @@ SQLRETURN SQLBindColums(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& column
                 // TODO: handle variable length data correctly. This logic wont
                 // suffice
                 HandleZeroColumnSizeAtFetch(columnSize);
-                uint64_t fetchBufferSize = columnSize + 1 /*null-terminator*/;
+                // Multiply by 4 because utf8 conversion by the driver might
+                // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+                uint64_t fetchBufferSize = 4 * columnSize + 1 /*null-terminator*/;
                 // TODO: For LONGVARCHAR/BINARY types, columnSize is returned as
                 // 2GB-1 by SQLDescribeCol. So fetchBufferSize = 2GB.
                 // fetchSize=1 if columnSize>1GB. So we'll allocate a vector of
```
```diff
@@ -3580,8 +3597,7 @@ SQLRETURN SQLBindColums(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& column
 // Fetch rows in batches
 // TODO: Move to anonymous namespace, since it is not used outside this file
 SQLRETURN FetchBatchData(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& columnNames,
-                         py::list& rows, SQLUSMALLINT numCols, SQLULEN& numRowsFetched,
-                         const std::vector<SQLUSMALLINT>& lobColumns) {
+                         py::list& rows, SQLUSMALLINT numCols, SQLULEN& numRowsFetched) {
     LOG("FetchBatchData: Fetching data in batches");
     SQLRETURN ret = SQLFetchScroll_ptr(hStmt, SQL_FETCH_NEXT, 0);
     if (ret == SQL_NO_DATA) {
```
```diff
@@ -3600,19 +3616,28 @@ SQLRETURN FetchBatchData(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& colum
         SQLULEN columnSize;
         SQLULEN processedColumnSize;
         uint64_t fetchBufferSize;
-        bool isLob;
     };
     std::vector<ColumnInfo> columnInfos(numCols);
     for (SQLUSMALLINT col = 0; col < numCols; col++) {
         const auto& columnMeta = columnNames[col].cast<py::dict>();
         columnInfos[col].dataType = columnMeta["DataType"].cast<SQLSMALLINT>();
         columnInfos[col].columnSize = columnMeta["ColumnSize"].cast<SQLULEN>();
-        columnInfos[col].isLob =
-            std::find(lobColumns.begin(), lobColumns.end(), col + 1) != lobColumns.end();
         columnInfos[col].processedColumnSize = columnInfos[col].columnSize;
         HandleZeroColumnSizeAtFetch(columnInfos[col].processedColumnSize);
-        columnInfos[col].fetchBufferSize =
-            columnInfos[col].processedColumnSize + 1; // +1 for null terminator
+        switch (columnInfos[col].dataType) {
+            case SQL_CHAR:
+            case SQL_VARCHAR:
+            case SQL_LONGVARCHAR:
+                // Multiply by 4 because utf8 conversion by the driver might
+                // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+                columnInfos[col].fetchBufferSize =
+                    4 * columnInfos[col].processedColumnSize + 1; // +1 for null terminator
+                break;
+            default:
+                columnInfos[col].fetchBufferSize =
+                    columnInfos[col].processedColumnSize + 1; // +1 for null terminator
+                break;
+        }
     }
 
     std::string decimalSeparator = GetDecimalSeparator(); // Cache decimal separator
```
```diff
@@ -3630,7 +3655,6 @@ SQLRETURN FetchBatchData(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& colum
         columnInfosExt[col].columnSize = columnInfos[col].columnSize;
         columnInfosExt[col].processedColumnSize = columnInfos[col].processedColumnSize;
         columnInfosExt[col].fetchBufferSize = columnInfos[col].fetchBufferSize;
-        columnInfosExt[col].isLob = columnInfos[col].isLob;
 
         // Map data type to processor function (switch executed once per column,
         // not per cell)
```
```diff
@@ -3739,7 +3763,7 @@ SQLRETURN FetchBatchData(SQLHSTMT hStmt, ColumnBuffers& buffers, py::list& colum
             // types) to just 10 (setup only) Note: Processor functions no
             // longer need to check for NULL since we do it above
             if (columnProcessors[col - 1] != nullptr) {
-                columnProcessors[col - 1](row, buffers, &columnInfosExt[col - 1], col, i, hStmt);
+                columnProcessors[col - 1](row, buffers, &columnInfosExt[col - 1], col, i);
                 continue;
             }
 
```
```diff
@@ -3916,7 +3940,9 @@ size_t calculateRowSize(py::list& columnNames, SQLUSMALLINT numCols) {
             case SQL_CHAR:
             case SQL_VARCHAR:
             case SQL_LONGVARCHAR:
-                rowSize += columnSize;
+                // Multiply by 4 because utf8 conversion by the driver might
+                // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+                rowSize += 4 * columnSize;
                 break;
             case SQL_SS_XML:
             case SQL_WCHAR:
```
```diff
@@ -4070,7 +4096,7 @@ SQLRETURN FetchMany_wrap(SqlHandlePtr StatementHandle, py::list& rows, int fetch
     SQLSetStmtAttr_ptr(hStmt, SQL_ATTR_ROW_ARRAY_SIZE, (SQLPOINTER)(intptr_t)fetchSize, 0);
     SQLSetStmtAttr_ptr(hStmt, SQL_ATTR_ROWS_FETCHED_PTR, &numRowsFetched, 0);
 
-    ret = FetchBatchData(hStmt, buffers, columnNames, rows, numCols, numRowsFetched, lobColumns);
+    ret = FetchBatchData(hStmt, buffers, columnNames, rows, numCols, numRowsFetched);
     if (!SQL_SUCCEEDED(ret) && ret != SQL_NO_DATA) {
         LOG("FetchMany_wrap: Error when fetching data - SQLRETURN=%d", ret);
         return ret;
```
```diff
@@ -4203,7 +4229,7 @@ SQLRETURN FetchAll_wrap(SqlHandlePtr StatementHandle, py::list& rows,
 
     while (ret != SQL_NO_DATA) {
         ret =
-            FetchBatchData(hStmt, buffers, columnNames, rows, numCols, numRowsFetched, lobColumns);
+            FetchBatchData(hStmt, buffers, columnNames, rows, numCols, numRowsFetched);
         if (!SQL_SUCCEEDED(ret) && ret != SQL_NO_DATA) {
             LOG("FetchAll_wrap: Error when fetching data - SQLRETURN=%d", ret);
             return ret;
```
