qstring-overview.qdoc
// Copyright (C) 2025 The Qt Company Ltd.
// SPDX-License-Identifier: LicenseRef-Qt-Commercial OR GFDL-1.3-no-invariants-only

/*!
    \group string-processing

    \title Classes for string data

    \section1 Overview

    This page gives an overview of the string classes in Qt, in particular the
    large number of string containers and how to use them efficiently in
    performance-critical code.

    The following instructions for efficient use are aimed at experienced
    developers working on performance-critical code that contains considerable
    amounts of string processing, such as a parser or a text file
    generator. \e {Generally, \l QString can be used everywhere and it will
    perform fine.} It also provides APIs for handling several encodings (for
    example \l{QString::fromLatin1()}). For many applications, and especially when
    string processing plays an insignificant role for performance, \l QString
    will be a simple and sufficient solution. Some Qt functions return a \l
    QStringView. It can be converted to a QString with
    \l{QStringView::toString()} if required.

    \section2 Impactful tips

    The following three rules improve string handling substantially without
    increasing the complexity too much. Follow these rules to get nearly
    optimal performance in most cases. The first two rules address encoding of
    string literals and marking them in source code. The third rule addresses
    deep copies when using parts of a string.

    \list

    \li All strings that only contain ASCII characters (for example log
    messages) can be encoded with Latin-1. Use the
    \l{Qt::Literals::StringLiterals::operator""_L1}{string literal}
    \c{"foo"_L1}. Without this suffix, string literals in source code are
    assumed to be UTF-8 encoded and processing them will be slower. Generally,
    try to use the tightest encoding, which is Latin-1 in many cases.

    \li User-visible strings are usually translated and thus passed through the
    \l {QObject::tr()} function. This function takes a string literal (const char
    array) and returns a \l QString with UTF-16 encoding as demanded by all UI
    elements. If the translation infrastructure is not used, you should use
    UTF-16 encoding throughout the whole application. Use the string literal
    \c{u"foo"} to create UTF-16 string literals or the Qt-specific literal
    \c{u"foo"_s} to directly create a \l QString.

    \li When processing parts of a \l QString, create \l QStringView objects
    instead of copying each part into its own \l QString object.
    These can be converted back to \l QString using
    \l{QStringView::toString()}, but avoid doing so as much as possible. If
    functions return \l QStringView, it is most efficient to keep working with
    this class, if possible. The API is similar to a constant \l QString.

    \endlist

    \section2 Efficient usage

    To use string classes efficiently, one should understand the three concepts
    of:
    \list
    \li Encoding
    \li Owning and non-owning containers
    \li Literals
    \endlist

    \section3 Encoding

    Qt supports UTF-16, UTF-8, Latin-1 (ISO 8859-1) and US-ASCII
    (that is, the common subset of Latin-1 and UTF-8) in one form or another.
    \list
    \li Latin-1 is a character encoding that uses a single byte per character,
        which makes it the most efficient, but also the most limited, encoding.
    \li UTF-8 is a variable-length character encoding that encodes all
        characters using one to four bytes. It is backwards compatible with
        US-ASCII and it is the common encoding for source code and similar
        files. Qt assumes that source code is encoded in UTF-8.
    \li UTF-16 is a variable-length encoding that uses two or four bytes per
        character. It is the common encoding for user-exposed text in Qt.
    \endlist
    See the \l{Unicode in Qt}{information about support for Unicode in Qt} for
    more information.

    Other encodings are supported in the form of single functions like
    \l{QString::fromUcs4()} or of the \l{QStringConverter} classes. Furthermore,
    Qt provides an encoding-agnostic container for data, \l QByteArray, that is
    well-suited to storing binary data. \l QAnyStringView keeps track of the
    encoding of the underlying string and can thus carry a view onto strings
    with any of the supported encoding standards.

    Converting between encodings is expensive, so avoid it where possible. On
    the other hand, a more compact encoding, particularly for string literals,
    can reduce binary size, which can increase performance. Where string
    literals can be expressed in Latin-1, that encoding is a good compromise
    between these competing factors, even if the literal has to be converted to
    UTF-16 at some point. When a Latin-1 string must be converted to a
    \l QString, it is done relatively efficiently.

    \section3 Functionality

    String classes can be further distinguished by the functionality they
    support. One major distinction is whether they own, and thus control, their
    data or merely reference data held elsewhere. The former are called \e
    owning containers, the latter \e non-owning containers or views. A
    non-owning container type typically just records a pointer to the start of
    the data and its size, making it lightweight and cheap, but it only remains
    valid as long as the data remains available. An owning string manages the
    memory in which it stores its data, ensuring that data remains available
    throughout the lifetime of the container, but its creation and destruction
    incur the costs of allocating and releasing memory. Views typically support
    a subset of the functions of the owning string, lacking the possibility to
    modify the underlying data.

    As a result, string views are particularly well-suited to representing
    parts of larger strings, for example in a parser, while owning strings are
    good for persistent storage, such as members of a class. Where a function
    returns a string that it has constructed, for example by combining
    fragments, it has to return an owning string; but where a function returns
    part of some persistently stored string, a view is usually more suitable.

    Note that owning containers in Qt share their data \l{Implicit
    Sharing}{implicitly}, meaning that it is also efficient to pass or return
    large containers by value, although slightly less efficient than passing by
    reference due to the reference counting. If you want to make use of the
    implicit data sharing mechanism of Qt classes, you have to pass the string
    as an owning container or a reference to one. Conversion to a view and back
    will always create an additional copy of the data.

    Finally, Qt provides classes for single characters, lists of strings and
    string matchers. These classes are available for most supported encoding
    standards in Qt, with some exceptions. Higher level functionality is
    provided by specialized classes, such as \l QLocale or \l
    QTextBoundaryFinder. These high level classes usually rely on \l QString
    and its UTF-16 encoding. Some classes are templates and work with all
    available string classes.

    \section3 Literals

    The C++ standard provides
    \l{https://en.cppreference.com/w/cpp/language/string_literal} {string
    literals} to create strings at compile-time. There are string literals
    defined by the language and literals defined by Qt, so-called
    \l{https://en.cppreference.com/w/cpp/language/user_literal}{user-defined
    literals}. A string literal defined by C++ is enclosed in double quotes and
    can have a prefix that tells the compiler how to interpret its content. For
    Qt, the UTF-16 string literal \c{u"foo"} is the most important. It creates
    a string encoded in UTF-16 at compile-time, saving the need to convert from
    some other encoding at run-time. A \l QStringView can be easily and
    efficiently constructed from one, so such literals can be passed to
    functions that accept a \l QStringView argument (or, as a result, a
    \l QAnyStringView).

    User-defined literals have the same form as those defined by C++ but add a
    suffix after the closing quote. The encoding remains determined by the
    prefix, but the resulting literal is used to construct an object of some
    user-defined type. Qt thus defines these for some of its own string types:
    \c{u"foo"_s} for \l QString, \c{"foo"_L1} for \l QLatin1StringView and
    \c{"foo"_ba} for \l QByteArray. These are provided by using the
    \l{Qt::Literals::StringLiterals}{StringLiterals Namespace}. A plain C++
    string literal \c{"foo"} is interpreted as UTF-8, so converting it to
    \l QString, and thus UTF-16, is expensive. When you have string literals
    in plain ASCII, use \c{"foo"_L1} to interpret them as Latin-1, gaining the
    various benefits outlined above.

    \section1 Basic string classes

    The following table gives an overview of the basic string classes for the
    various standards of text encoding.

    \table
    \header
        \li Encoding
        \li C++ String literal
        \li Qt user-defined literal
        \li C++ Character
        \li Qt Character
        \li Owning string
        \li Non-owning string
    \row
        \li Latin-1
        \li -
        \li ""_L1
        \li -
        \li \l QLatin1Char
        \li -
        \li \l QLatin1StringView
    \row
        \li UTF-8
        \li u8""
        \li -
        \li char8_t
        \li -
        \li -
        \li \l QUtf8StringView
    \row
        \li UTF-16
        \li u""
        \li u""_s
        \li char16_t
        \li \l QChar
        \li \l QString
        \li \l QStringView
    \row
        \li Binary/None
        \li -
        \li ""_ba
        \li std::byte
        \li -
        \li \l QByteArray
        \li \l QByteArrayView
    \row
        \li Flexible
        \li any
        \li -
        \li -
        \li -
        \li -
        \li \l QAnyStringView
    \endtable

    Some of the missing entries can be substituted with built-in and standard
    library C++ types: An owning Latin-1 or UTF-8 encoded string can be
    \c{std::string} or any 8-bit \c char array. \l QStringView can also reference
    any 16-bit character arrays, such as \c{std::u16string} or, on some
    platforms, \c{std::wstring}.

    Qt also provides specialized lists for some of those types, namely \l
    QStringList and \l QByteArrayList, as well as matchers, \l
    QLatin1StringMatcher and \l QByteArrayMatcher. The matchers also have
    static versions that are created at compile-time, \l
    QStaticLatin1StringMatcher and \l QStaticByteArrayMatcher.

    Also worth noting:

    \list

    \li \l QStringLiteral is a macro which is identical to \c{u"foo"_s} and
    available without the \l{Qt::Literals::StringLiterals}{StringLiterals
    Namespace}. Prefer the modern string literal.

    \li \l QLatin1String is a synonym for \l QLatin1StringView and exists for
    backwards compatibility. It is not an owning string and might be removed in
    future releases.

    \li \l QAnyStringView provides a view for a string with any of the three
    supported encodings. The encoding is stored alongside the reference to the
    data. This class is well suited to create interfaces that take a wide
    spectrum of string types and encodings. In contrast to other classes, no
    processing is conducted on \l QAnyStringView directly. Processing is
    conducted on the underlying \l QLatin1StringView, \l QUtf8StringView or
    \l QStringView in the respective encoding. Use \l QAnyStringView::visit()
    to do the same in your own functions that take this class as an argument.

    \li A \l QLatin1StringView with non-ASCII characters is not straightforward
    to construct in a UTF-8 encoded source code file and requires special
    treatment, see the \l QLatin1StringView documentation.

    \li \l QStringRef is a reference to a portion of a \l QString, available in
    the Qt5Compat module for backwards compatibility. It should be replaced by
    \l QStringView.

    \endlist

    \section1 High-level string-related classes

    More high-level classes that provide additional functionality work
    mostly with \l QString and thus UTF-16. These are:

    \list
    \li \l QRegularExpression, \l QRegularExpressionMatch and
        \l QRegularExpressionMatchIterator to work with pattern matching
        and regular expressions.
    \li \l QLocale to convert numbers and data to and from strings in a
        manner appropriate to the user's language and culture.
    \li \l QCollator and \l QCollatorSortKey to compare strings with
        respect to the user's language, script or territory.
    \li \l QTextBoundaryFinder to break up text ready for typesetting
        in accord with Unicode rules.
    \li \c{QStringBuilder}, an internal class that will substantially
        improve the performance of string concatenations with the \c{+}
        operator, see the \l QString documentation.
    \endlist

    Some classes are templates or have a flexible API and work with various
    string classes. These are:

    \list
    \li \l QTextStream to stream into \l QIODevice, \l QByteArray or
        \l QString
    \li \l QStringTokenizer to split strings
    \endlist

    \section1 Which string class to use?

    The general guidance in using string classes is:
    \list
    \li Avoid copying and memory allocations,
    \li Avoid encoding conversions, and
    \li Choose the most compact encoding.
    \endlist

    Qt provides many functionalities to avoid memory allocations. Most Qt
    containers employ \l{Implicit Sharing} of their data. For implicit sharing
    to work, there must be an uninterrupted chain of the same class:
    converting from \l QString to \l QStringView and back will result in two \l
    {QString}{QStrings} that do not share their data. Therefore, functions need
    to pass their data as \l QString (either by value or by reference).
    Extracting parts of a string is not possible with implicit data sharing. To
    use parts of a longer string, make use of string views, an explicit form of
    data sharing.

    Conversions between encodings can be reduced by sticking to a certain
    encoding. Data received, for example in UTF-8, is best stored and processed
    in UTF-8 if no conversion to any other encoding is required. Comparisons
    between strings of the same encoding are fastest and the same is the case
    for most other operations. If strings of a certain encoding are often
    compared or converted to any other encoding it might be beneficial to
    convert and store them once. Some operations provide many overloads (or a
    \l QAnyStringView overload) to take various string types and encodings and
    they should be the second choice to optimize performance, if using the same
    encoding is not feasible. Explicit encoding conversions before calling a
    function should be a last resort when no other option is available. Latin-1
    is a very simple encoding and operations between Latin-1 and any other
    encoding are almost as efficient as operations between the same encoding.

    The most efficient encoding (from most to least efficient: Latin-1, UTF-8,
    UTF-16) should be chosen when no other constraints determine the encoding.
    For error handling and logging \l QLatin1StringView is usually sufficient.
    User-visible strings in Qt are always of type \l {QString} and as such
    UTF-16 encoded. Therefore it is most effective to use \l
    {QString}{QStrings}, \l {QStringView}{QStringViews} and \l
    {QStringLiteral}{QStringLiterals} throughout the lifetime of a
    user-visible string. The \l QObject::tr() function provides the correct
    encoding and type. \l QByteArray should be used if encoding does not play a
    role, for example to store binary data, or if the encoding is unknown.

    \section2 String class for creating API

    \image string_class_api.svg "String class for an optimal API"

    \section3 Member variables

    Member variables should be of an owning type in nearly all cases. Views can only
    be used as member variables if the lifetime of the referenced owning string
    is guaranteed to exceed the lifetime of the object.

    \section3 Function arguments

    Function arguments should be string views of a suitable encoding in most
    cases. \l QAnyStringView can be used as a parameter to support more than
    one encoding and \l QAnyStringView::visit() can be used internally to fork
    off into per-encoding functions. If the function is limited to a single
    encoding, \l QLatin1StringView, \l QUtf8StringView, \l QStringView or \l
    QByteArrayView should be used.

    If the function saves the argument in an owning string (usually a
    setter function), it is most efficient to use the same owning string as
    function argument to make use of the implicit data sharing functionality of
    Qt. The owning string can be passed as a \c const reference. Overloading
    functions with multiple owning and non-owning string types can lead to
    overload ambiguity and should be avoided. Owning string types in Qt can be
    automatically converted to their non-owning version or to \l
    QAnyStringView.

    \section3 Return values

    Temporary strings have to be returned as an owning string, usually
    \l QString. If the returned string is known at compile-time, use
    \c{u"foo"_s} to construct the \l QString structure at compile-time. If
    existing owning strings (for example \l QString) are returned from a
    function in full (for example a getter function), it is most efficient to
    return them by reference. They can also be returned by value to allow
    returning a temporary in the future. Qt's use of implicit sharing avoids
    the performance impact of allocation and copying when returning by value.

    Parts of existing strings can be returned efficiently with a string view
    of the appropriate encoding. For an example, see \l
    QRegularExpressionMatch::capturedView(), which returns a \l QStringView.

    \section2 String class for using API

    \image string_class_calling.svg "String class for calling a function"

    To use a Qt API efficiently you should try to match the function argument
    types. If you are limited in your choice, Qt will conduct various
    conversions: Owning strings are implicitly converted to non-owning
    strings, and non-owning strings can create their owning counterparts,
    see for example \l QStringView::toString(). Encoding conversions are
    conducted implicitly in many cases but this should be avoided if possible.
    To avoid accidental implicit conversion from UTF-8 you can define the
    macro \l QT_NO_CAST_FROM_ASCII.

    If you need to assemble a string at runtime before passing it to a function
    you will need an owning string and thus \l QString. If the function
    argument is \l QStringView or \l QAnyStringView it will be implicitly
    converted.

    If the string is known at compile-time, there is room for optimization. If
    the function accepts a \l QString, you should create it with \c{u"foo"_s}
    or the \l QStringLiteral macro. If the function expects a \l QStringView,
    it is best constructed with an ordinary UTF-16 string literal \c{u"foo"};
    if a \l QLatin1StringView is expected, construct it with \c{"foo"_L1}. If
    you have the choice between both, for example if the function expects \l
    QAnyStringView, use the tightest encoding, usually Latin-1.

    \section1 List of all string related classes
*/

// The list is autogenerated by qdoc because this is a group page