# Process-variation Effects on 3D TLC Flash Reliability: Characterization and Mitigation Scheme

Yuqian Pan, Department of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, yuqianpan@126.com

Haichun Zhang, Department of Optical and Electronic Information, Huazhong University of Science and Technology, Wuhan, China, 2533776360@qq.com

Mingyang Gong, Department of Optical and Electronic Information, Huazhong University of Science and Technology, Wuhan, China, d201677550@hust.edu.cn

Zhenglin Liu, Department of Optical and Electronic Information, Huazhong University of Science and Technology, Wuhan, China, liuzhenglin@hust.edu.cn

Abstract—In solid-state drives, flash management techniques such as wear-leveling and refresh usually assume that all NAND flash blocks have the same endurance value. However, the actual endurance values differ from block to block. This reliability difference is introduced by process variation during flash fabrication. In recent years, various works have studied the reliability variation of 2D flash memory in order to improve flash management techniques. As the industry transitioned from 2D NAND to 3D NAND flash, the vertical structure and multi-layer stacking changed the effect of previously known reliability problems. In this paper, we first characterize the process-variation effects on 3D TLC flash reliability. The characterization includes two parts: endurance variation and error feature variation. Second, we propose an adaptive error prediction scheme to mitigate the process-variation effects. This scheme uses a machine-learning model to perform the error prediction. We also discuss the implications of this scheme for the main flash management techniques.

Keywords—3D NAND, storage, reliability, process variation, error analysis, Machine learning

## I. INTRODUCTION

NAND flash memory has been one of the most popular storage media of the last decades. As process technology developed, NAND flash memory was on a one-year cadence for a new generation, whereas other memories were on a two-year cadence [1]. The continued scaling down of process technology reduces the reliability of flash memory. To enhance NAND flash reliability, many strategies have been proposed, such as garbage collection, wear-leveling, and refresh. Generally, these strategies use the standard endurance value as the limit of the flash lifetime. However, due to process variation during fabrication, the actual endurance differs from one flash block to another. Therefore, previous works [2]-[6] have studied the process-variation effects on 2D NAND flash memory in order to improve reliability.

To overcome the scaling challenge, researchers introduced 3D NAND flash. Instead of planar arrays in 2D NAND, 3D NAND flash memory stacks storage cells vertically in layers [1].



Fig. 1. The schematic circuit of 3D NAND flash array.

Fig. 1 shows the schematic circuit of the 3D NAND flash array. The vertical structure of 3D NAND allows a higher storage density than a planar device on an equivalent cell area. However, it also changes the characteristics of the known reliability problems in NAND flash memory. At present, most studies of process-variation (PV) effects address planar NAND flash. More research on 3D NAND flash is needed as flash manufacturing technology transitions to the next generation. To gain a further understanding of the process-variation effects on 3D flash reliability, we studied the endurance value and error features of different flash blocks through testing. The key contributions are as follows:

- We analyze and characterize the process-variation effects on 3D TLC flash reliability from two aspects: the NAND flash endurance distribution and error features. We find that the endurance difference between chips is larger than the difference within the same chip, and that the variance of block errors is higher at the later stage of the lifetime than at the early stage.

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61874047) and National Key Research and Development Project of China (Grant No. 2019YFB1300601).

- We propose an error prediction scheme to mitigate the process-variation effects, which adaptively estimates the number of flash errors. This scheme is based on a machine-learning model. We also compare six types of machine-learning models and report their accuracy.
- We discuss the implications of our scheme on main flash management techniques.

## II. BACKGROUND

### A. Process-variation Effects on NAND Flash Reliability

As manufacturing technology scales, process variation (PV) is becoming a critical problem because it introduces uncertainty in electrical attributes. Process variation has been identified as fabrication-induced changes in electrical parameters among different transistors [7]. These changes make the reliability of transistors with the same design vary from one to another [8].

The process-variation in oxide thicknesses and cell dimensions influences the capacitive conditions and results in reliability differences among NAND flash blocks. Endurance is one of the most important reliability parameters in NAND flash memory, which represents the number of P/E cycles that the flash block can sustain before a failure [9].

Previous work [2] has found that the actual endurance of a flash chip differs from the standard value. In NAND flash memory, different blocks and pages have different reliabilities. The number of flash block errors grows at different speeds during program/erase (P/E) cycling, and the number of page errors within the same block varies significantly at the same P/E cycle [3]. Due to the uncertainty of the manufacturing process, PV effects on reliability cannot be avoided. Therefore, it is essential to build a strong understanding of PV effects and to develop management techniques that alleviate their influence.

### B. Related Works

To build a strong understanding of PV effects on NAND flash, many researchers have studied the variations of reliability characteristics among flash blocks. Pan et al. [2] explored the actual endurance P/E cycles of planar flash blocks. They pointed out that the endurance of the tested blocks lies in the range of [1500, 24600] P/E cycles, which is higher than the standard value [2]. Woo et al. [4] analyzed PV effects on different flash metrics: bit error rate (BER), endurance P/E cycles, and operation latency. They reported that the average endurance of flash blocks in the same chip is 8524 P/E cycles, with a standard deviation of 1318 [4]. Jimenez et al. [5] found that the flash page BER grows at different speeds under PV influence. Meza et al. [6] performed a large-scale study on various types of failures. The results showed that flash blocks have varied reliability characteristics.

Many strategies have been proposed to tolerate PV in NAND flash memory. Pan et al. [2] proposed a wear-leveling method that identifies the reliability of flash blocks by error rate. In [4], the authors introduced two techniques to extend the lifetime of SSDs: a new wear index that takes the PV effects into account, and a dynamic wear-leveling algorithm. A write-speed detection approach that exploits PV was proposed in [10]. Di et al. [11] presented a refresh minimization scheme that considers PV among flash blocks. In 2018, they designed a refresh frequency matching scheme [12] that allocates data to blocks with higher retention ability.

Recently, several studies have addressed PV-tolerating techniques for 3D MLC NAND flash. In [13], a PV-tolerant reliability management strategy was proposed for 3D charge-trapping flash memory. This strategy predicts the status of physical pages and assigns data to reliable pages. Luo et al. [14] observed early retention loss and layer-to-layer PV in 3D MLC NAND flash. They also proposed two schemes to mitigate layer-to-layer PV.

### C. Motivation

Current studies of PV-induced flash reliability issues focus on planar flash memory and 3D MLC flash memory. As process technology develops, more and more 3D TLC flash memories are used in SSDs. Therefore, it is necessary to characterize PV effects on 3D TLC flash. In this work, we tested 3D TLC flash blocks from different chips and collected flash metrics such as raw page bit errors, P/E cycle number, and endurance value. We then analyzed the endurance distribution and error features of selected samples to characterize the PV effects on 3D TLC flash reliability.

To enhance the reliability of storage systems, many PV-tolerant schemes have been presented in the past decade. They allocate data to blocks according to parameters such as BER, erase count, and program latency. Since flash blocks differ in reliability, the way these parameters change as P/E cycles increase may vary among flash blocks. How often the parameters are measured therefore becomes a critical issue for flash management techniques. To reduce the measurement frequency, we propose an error prediction scheme that adaptively estimates the number of flash errors. The error prediction is realized by a machine-learning method.

## III. 3D TLC RELIABILITY VARIATION CHARACTERIZATION

### A. Methodology

In this work, we designed a NAND flash test platform based on the Xilinx Zynq-7020. With this platform, we tested 3D TLC flash blocks from different chips. To account for the PV influence on different regions, we randomly selected sample blocks from the front, middle, and back positions of each flash chip. The total number of tested flash blocks is 128. The size of each block is 2304.

The testing procedure is shown in Fig. 2. We test 3D TLC flash blocks by stressing them with repeated P/E cycles. During P/E cycling, the test platform samples flash metrics every 50 P/E cycles at room temperature. The metrics are transmitted to the host computer and stored in a database there. The test of a block stops when the page BER exceeds the preset ECC capability, which is 73 bit errors per 1 kB of data.



Fig. 2. The flash testing procedure.
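The procedure in Fig. 2 can be sketched as a simple loop. This is only an illustration under stated assumptions, not the platform firmware: the `SimulatedBlock` class, its linear wear model, and all function names are hypothetical stand-ins for the real hardware interface. Only the 50-cycle sampling interval and the stop criterion of 73 bit errors per 1 kB come from the text.

```python
import random

SAMPLE_INTERVAL = 50   # P/E cycles between metric samples (from the text)
ECC_LIMIT_BITS = 73    # correctable bit errors per 1 kB of data (from the text)


class SimulatedBlock:
    """Hypothetical stand-in for a physical block; the wear model is invented."""

    def __init__(self, wear_rate, seed=0):
        self.wear_rate = wear_rate   # assumed bit errors added per P/E cycle
        self.pe_cycles = 0
        self._rng = random.Random(seed)

    def program_erase(self):
        self.pe_cycles += 1

    def max_errors_per_kb(self):
        # Assumed linear growth plus small noise; real flash behaves differently.
        return int(self.pe_cycles * self.wear_rate) + self._rng.randint(0, 2)


def run_endurance_test(block, max_cycles=100_000):
    """Cycle a block until a sample exceeds the ECC limit.

    Returns (pe_cycles_at_stop, samples), where samples is a list of
    (pe_cycle, errors) pairs taken every SAMPLE_INTERVAL cycles. If the
    limit is never exceeded, the loop stops at max_cycles.
    """
    samples = []
    while block.pe_cycles < max_cycles:
        block.program_erase()
        if block.pe_cycles % SAMPLE_INTERVAL == 0:
            errors = block.max_errors_per_kb()
            samples.append((block.pe_cycles, errors))
            if errors > ECC_LIMIT_BITS:
                break
    return block.pe_cycles, samples


endurance, samples = run_endurance_test(SimulatedBlock(wear_rate=0.01, seed=1))
```

The recorded `samples` play the role of the metrics stored in the host database, and `endurance` corresponds to the block endurance analyzed in the next subsection.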

### B. Test Results Analysis

In NAND flash memory, the metrics used to estimate reliability generally are endurance and raw bit error. Thus, to study the PV influence on 3D TLC flash reliability, the following subsections analyze the endurance distribution and error features of selected sample blocks.

#### 1) Endurance distribution

The endurance distribution of 3D TLC flash blocks is shown in Fig. 3. According to this figure, the block endurance lies in the range of 6500 to 13500 P/E cycles. The range [10500, 11500] has a higher density than the others, and the range [12500, 13500] contains the fewest blocks. Table I reports the variance of endurance across different blocks. The results in Table I show that blocks from the same chip have a smaller endurance variance than blocks from different chips. Therefore, it can be concluded that, for the same type of 3D TLC flash memory, blocks from different chips differ more in endurance than blocks within the same chip.



Fig. 3. 3D TLC flash endurance distribution.

TABLE I. ENDURANCE METRICS

| Type  | Variance across different chips | Variance within the same chip | Min  | Max   | Average |
|-------|-------------------------------|--------------------------|------|-------|---------|
| Value | 2364237                       | 353381                   | 6650 | 13150 | 10259   |
|       | 0                             |                          |      |       |         |
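The comparison between the two variances can be illustrated with a short sketch. The paper does not state how the two values in Table I are computed, so the definitions here are assumptions: the across-chip variance is taken as the population variance of all blocks pooled together, and the within-chip variance as the mean of the per-chip population variances. The endurance numbers are made up for illustration and are not the measured data.

```python
from statistics import mean, pvariance


def endurance_variances(endurance_by_chip):
    """Return (across_chip_variance, within_chip_variance).

    endurance_by_chip maps a chip id to the endurance values (in P/E
    cycles) of the blocks sampled from that chip.
    """
    # Assumed definition: pool every block regardless of its chip.
    pooled = [e for blocks in endurance_by_chip.values() for e in blocks]
    across = pvariance(pooled)
    # Assumed definition: average the variance computed inside each chip.
    within = mean(pvariance(blocks) for blocks in endurance_by_chip.values())
    return across, within


# Made-up values for illustration only.
example = {
    "chip_a": [9000, 9200, 9400],
    "chip_b": [11800, 12000, 12200],
}
across, within = endurance_variances(example)
```

With these toy values the pooled variance is far larger than the per-chip variance, which is the same qualitative pattern that Table I reports.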

#### 2) Error features

In order to gain further understanding of PV effects on flash reliability, we also observe the error features of different 3D TLC flash blocks and characterize the error variation of blocks and pages. We analyze the error features by visual graphics of raw bit errors.

The average error number of each block at different P/E cycles is shown in Fig. 4. For a clearer visual inspection, we plot the errors of the sample blocks at 50 P/E, 100 P/E, 500 P/E, 1000 P/E, 3000 P/E, 5000 P/E, and 7000 P/E. In Fig. 4, at 50 P/E cycles, the average error number of the 119 sample blocks is between 10 and 15. At 100 P/E, the range of error values is slightly wider than at 50 P/E. At 500 P/E, the range of block errors is [10, 22], which is wider than the range at 50 P/E and 100 P/E. As the P/E cycle count increases, the error distribution becomes wider and wider, and it is widest at 7000 P/E. In other words, the error difference between blocks is larger at higher P/E cycles than at lower P/E cycles.



Fig. 4. The average error number of each block at P/E 100, 500, 1000, 3000, 5000, and 7000.

To further analyze the relationship between error difference and P/E cycle, we plot the variance of block errors in Fig. 5 and the variance of page errors in Fig. 6. These figures show that the variance of the error number across blocks from different chips increases with P/E cycles. Before 1000 P/E, the variance is small and grows slowly. When the P/E cycle reaches 7000, the variance increases to 70.877, far above the value of 4.616 at 1000 P/E. According to Fig. 5, the variance of block errors within the same chip is much lower than that among different chips. From 1 P/E to 1000 P/E, the variance of block errors within the same chip changes only slightly (from 0.392 to 1.130) as P/E cycles increase.

The variance of page errors has a similar tendency to that of block errors: it increases smoothly in the early phase and faster in the later period of the lifetime. However, at the same P/E cycle, the page error variance is larger than the block error variance. Therefore, it can be concluded that block errors grow at different speeds during P/E cycling. The variance of block errors within the same chip is smaller than that among different chips, and the variance of both block errors and page errors increases slowly in the early phase and accelerates in the later phase.

Fig. 5. The variance of block errors.

Fig. 6. The average variance of page errors.
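The variance curves discussed here can be derived from the sampled metrics by grouping the per-block error counts by P/E checkpoint. The sketch below is a hedged illustration: the data layout (`errors_by_block`) and the function name are hypothetical, the use of population variance is an assumption, and the numbers are invented rather than taken from the measurements.

```python
from statistics import pvariance


def variance_by_checkpoint(errors_by_block):
    """Compute the variance of block errors at each shared P/E checkpoint.

    errors_by_block maps block_id -> {pe_cycle: average raw bit errors}.
    Only checkpoints present for every block are reported.
    """
    shared = set.intersection(*(set(s) for s in errors_by_block.values()))
    return {
        pe: pvariance([series[pe] for series in errors_by_block.values()])
        for pe in sorted(shared)
    }


# Invented values: the blocks drift apart as P/E cycles accumulate.
toy = {
    "b0": {1000: 20, 7000: 60},
    "b1": {1000: 22, 7000: 80},
    "b2": {1000: 24, 7000: 100},
}
variances = variance_by_checkpoint(toy)
```

Running the same function once over blocks of a single chip and once over blocks of all chips would give the two kinds of curves that the text compares.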

The variance of errors for different page types is shown in Fig. 7. This figure shows that the variance of upper-page errors is higher than that of lower-page and middle-page errors at the same P/E cycle. The variance of middle-page errors is slightly smaller than that of upper-page errors. Moreover, the variance of upper-page and middle-page errors increases much faster than that of lower-page errors across the entire P/E cycling process. We conclude that different types of pages in 3D TLC flash have different error variation, and that the variance of upper-page errors is larger than that of the other page types.

Fig. 7. The average variance of page errors.

## IV. ADAPTIVE ERROR PREDICTION SCHEME

### A. Overview

In this work, we propose an error prediction scheme that adaptively estimates the level of flash errors in order to tolerate PV effects on flash reliability. The flow diagram of the proposed scheme is illustrated in Fig. 8. In light of the preceding analysis, the error prediction operation begins at 1000 P/E cycles, before which PV has little effect on the blocks. When the P/E cycle reaches 1000, the system measures the raw bit errors of the blocks, and the machine-learning model predicts the error level. Then, the blocks are assigned to different groups according to the error level. Each group corresponds to an update interval. Based on the observed error features, we tentatively set the interval values to 50 P/E, 100 P/E, 500 P/E, 1000 P/E, and 1500 P/E. However, the interval values that would best improve system efficiency demand further study. After the corresponding interval, the system measures the raw bit errors of the blocks in the group again. If the errors exceed the ECC capability, the block is removed from the group. Otherwise, the machine-learning model predicts the error level again and the block is remapped to the matching group. The error prediction operation repeats until no block is available or a stop command is received.



Fig. 8. The flow diagram of the error prediction scheme.
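The flow in Fig. 8 can be sketched as follows. This is a minimal sketch under several assumptions, not the authors' implementation: the text lists five tentative interval values but does not say which error level gets which interval, so the mapping in `INTERVAL_BY_LEVEL` (higher level, shorter interval) is assumed; all blocks are assumed to be cycled together; and `fake_measure` and `fake_predict` are hypothetical placeholders for the real error measurement and the trained classifier.

```python
START_PE = 1000  # prediction begins at 1000 P/E cycles (from the text)

# Assumed mapping: a higher predicted error level gets a shorter interval.
INTERVAL_BY_LEVEL = {1: 1500, 2: 1000, 3: 500, 4: 100, 5: 50}


def update_groups(state, pe_cycle, measure_errors, predict_level, ecc_limit):
    """Re-measure every block whose interval has elapsed and regroup it.

    state maps block_id -> {"level": int, "next_check": int}. Blocks whose
    errors exceed ecc_limit are deleted from state and returned as retired.
    """
    retired = []
    for block_id in list(state):
        entry = state[block_id]
        if pe_cycle < entry["next_check"]:
            continue  # interval not elapsed yet
        errors = measure_errors(block_id, pe_cycle)
        if errors > ecc_limit:
            retired.append(block_id)
            del state[block_id]
            continue
        # Inputs follow the text: P/E cycle, raw bit errors, former level.
        entry["level"] = predict_level(pe_cycle, errors, entry["level"])
        entry["next_check"] = pe_cycle + INTERVAL_BY_LEVEL[entry["level"]]
    return retired


def fake_measure(block_id, pe_cycle):
    """Hypothetical wear model: errors per 1000 P/E cycles for each block."""
    return pe_cycle * {"b0": 50, "b1": 200}[block_id] // 1000


def fake_predict(pe_cycle, errors, former_level):
    """Hypothetical placeholder for the trained classifier."""
    return min(5, 1 + errors // 200)


state = {b: {"level": 1, "next_check": START_PE} for b in ("b0", "b1")}
retired = []
for pe in range(START_PE, 6001, 50):
    retired += update_groups(state, pe, fake_measure, fake_predict, ecc_limit=1100)
```

In this toy run the fast-wearing block `b1` is checked more and more often and is finally retired, while `b0` stays in a low error level and is measured only every 1000 to 1500 cycles.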

### B. Machine-learning Model

In the machine-learning field, classification is a supervised learning method for predicting discrete random variables [15]. In this work, we use the following six classification methods to build an error prediction model: Decision Trees, K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Naive Bayes, and Bagged Classification Trees. We use the MATLAB Statistics and Machine Learning Toolbox for modeling. Data samples for modeling and verification are randomly selected from the test data. The training data set and the test data set contain 3750 and 1250 samples, respectively. Large-scale training and testing of the machine-learning models are left for future study. 5-fold cross-validation is applied to each model to avoid over-fitting.
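To make the validation step concrete, the following sketch shows 5-fold cross-validation around a small K-Nearest Neighbors classifier written in plain Python. It does not reproduce the MATLAB toolbox workflow used in this work: the function names, the choice of k, the fold assignment, and the synthetic two-cluster data are all assumptions for illustration, and real features such as P/E cycle number and raw bit errors would normally need scaling before a distance-based method is applied.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to query."""
    nearest = sorted(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validate(samples, folds=5, k=3, seed=0):
    """Return the mean accuracy over `folds` held-out splits.

    samples is a list of (feature_tuple, label) pairs.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    accuracies = []
    for i in range(folds):
        held_out = shuffled[i::folds]  # every folds-th sample, offset i
        train = [s for j, s in enumerate(shuffled) if j % folds != i]
        correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / folds


# Synthetic, well-separated clusters; not the measured flash data.
toy_samples = [((i, i), 1) for i in range(10)]
toy_samples += [((100 + i, 100 + i), 2) for i in range(10)]
toy_accuracy = cross_validate(toy_samples)
```

Each sample is held out exactly once, so the averaged accuracy reflects performance on data the model did not see during fitting.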

TABLE II. EVALUATION RESULTS

| Method                      | Error-level1 accuracy | Error-level2 accuracy | Error-level3 accuracy | Error-level4 accuracy | Error-level5 accuracy | Average accuracy |
|-----------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| Decision Trees              | 100%                  | 93%                   | 79%                   | 85%                   | 99%                   | 91.4%            |
| KNN                         | 100%                  | 97%                   | 83%                   | 81%                   | 99%                   | 92.1%            |
| LDA                         | 100%                  | 99%                   | 75%                   | 76%                   | 97%                   | 89.2%            |
| SVM                         | 100%                  | 98%                   | 79%                   | 85%                   | 99%                   | 92.3%            |
| Naive Bayes                 | 100%                  | 95%                   | 87%                   | 73%                   | 99%                   | 90.9%            |
| Bagged Classification Trees | 100%                  | 93%                   | 80%                   | 86%                   | 99%                   | 91.6%            |

**Model inputs and outputs**. The model inputs are the P/E cycle number, the raw bit error number of the block, and the former error level. The model output is 1, 2, 3, 4, or 5 if the maximum page error number at the next 50 P/E cycles is in the range [0, 300), [300, 500), [500, 700), [700, 900), or [900, 1100), respectively. The five outputs represent error-level1, error-level2, error-level3, error-level4, and error-level5. The number of data samples is the same for each output value.
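The labeling rule above maps directly onto a lookup over the range boundaries. The sketch below shows one way to express it. The function name is hypothetical, and raising an error for counts outside [0, 1100) is an assumption, since the text does not define a level for such values.

```python
from bisect import bisect_right

# Exclusive upper bounds of error-level1 .. error-level5 (from the text).
LEVEL_UPPER_BOUNDS = [300, 500, 700, 900, 1100]


def error_level(max_page_errors):
    """Map the maximum page error number to an error level from 1 to 5."""
    if max_page_errors < 0 or max_page_errors >= LEVEL_UPPER_BOUNDS[-1]:
        # Assumed behavior: the text defines no level outside [0, 1100).
        raise ValueError("error count outside the ranges of levels 1-5")
    # bisect_right counts how many upper bounds are <= the value, which is
    # the number of full ranges that lie below it.
    return bisect_right(LEVEL_UPPER_BOUNDS, max_page_errors) + 1


levels = [error_level(n) for n in (0, 299, 300, 499, 500, 700, 899, 900, 1099)]
```

Because each range is closed on the left and open on the right, a count of exactly 300 falls in error-level2 and a count of 299 in error-level1.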

The evaluation results of the six classification models are shown in Table II. The results show that the SVM model has the best average accuracy (92.3%) and LDA the worst (89.2%). These results differ from those reported for disks, where the SVM model performs worse in predicting faulty disks [16]. This suggests that model evaluation results obtained on one type of storage device do not necessarily carry over to other devices.

Although the LDA model has the worst average accuracy, it performs best on the error-level2 dataset. Most of the models have weak predictive capability on the error-level3 and error-level4 datasets. We hypothesize that this weakness arises because the current inputs do not provide enough information to distinguish these error levels. The best-performing model also differs from dataset to dataset: for error-level2, error-level3, and error-level4, the best models are LDA, Naive Bayes, and Bagged Classification Trees, respectively. This indicates that different error levels may call for different prediction methods. In future work, we will explore the prediction performance of a hybrid model on NAND flash memory devices.

### C. Implementation

Flash-memory-based storage systems use many management techniques, such as refresh and wear-leveling. To improve 3D TLC flash reliability and reduce PV influence, we present several implications of the error prediction scheme for the design of these management techniques:

Refresh. Refresh is a fundamental scheme for reducing retention errors [9]. Since retention errors are related to the degree of wear and the retention time, the refresh frequency can be determined by the output of the error prediction model. While the storage system is running, the blocks are assigned to groups and refreshed at different rates. The refresh rate of each group depends on its error level: the refresh interval decreases as the error level increases. By applying the scheme to refresh techniques, we can refresh flash blocks with PV effects taken into account and reduce the number of refresh operations.

Wear-leveling. In flash storage devices, it is crucial to keep all blocks aging at a similar rate [9]. Wear-leveling was proposed for this purpose [9]. Generally, wear-leveling algorithms adopt the P/E cycle number as the measure of block aging. However, previous research [2] found that, because of PV, the P/E cycle number is not a good indicator of reliability. Besides, the actual endurance in P/E cycles is larger than the standard endurance. Thus, we use the output of the error prediction model as the wear-leveling information to mitigate the PV effects and extend the number of executable operations.
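As an illustration of this idea, the sketch below selects the next block to write by predicted error level instead of by P/E cycle count alone. It is a hypothetical example, not a complete wear-leveling algorithm: the record layout, the function names, and the tie-break on P/E cycles are all assumptions.

```python
def pick_block_by_error_level(free_blocks):
    """Choose the free block with the lowest predicted error level.

    free_blocks is a list of dicts with the keys "id", "error_level",
    and "pe_cycles". Ties are broken by the lower P/E cycle count, which
    is an assumed policy.
    """
    best = min(free_blocks, key=lambda b: (b["error_level"], b["pe_cycles"]))
    return best["id"]


def pick_block_by_pe_count(free_blocks):
    """Conventional policy for comparison: lowest P/E cycle count wins."""
    return min(free_blocks, key=lambda b: b["pe_cycles"])["id"]


# Invented example: block A is young but already error-prone due to PV.
free_blocks = [
    {"id": "A", "error_level": 3, "pe_cycles": 1000},
    {"id": "B", "error_level": 1, "pe_cycles": 3000},
    {"id": "C", "error_level": 1, "pe_cycles": 2500},
]
by_level = pick_block_by_error_level(free_blocks)
by_count = pick_block_by_pe_count(free_blocks)
```

In this example the conventional policy picks block A because it has the fewest P/E cycles, whereas the error-level policy picks block C, steering writes away from a block that PV has made weaker than its cycle count suggests.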

## V. CONCLUSION

As process technology has transitioned from 2D to 3D, new reliability issues have arisen in flash memories. Process variation, which is induced by fabrication, is a critical problem in NAND flash memory. To gain a strong understanding of process variation in 3D flash, we analyzed and characterized the process-variation effects on 3D TLC flash reliability. The analysis covers two aspects: endurance distribution and error features. From the endurance distribution, we found that the endurance difference among blocks in different chips is greater than that among blocks in the same chip. Regarding error features, we found that the variance of block errors and page errors increases faster at the later stage than at the early stage. The variance among blocks in the same chip is lower than that among blocks in different chips. Moreover, in 3D TLC flash, different page types show different error growth: the variance of upper-page errors grows faster than that of the other page types.

We also proposed an error prediction scheme to mitigate the process-variation effects, which adaptively estimates the number of flash errors. We evaluated the performance of six types of machine-learning models. The results show that the SVM model has the best performance of 92.3%.

To reduce PV influence on 3D flash reliability, we discussed the implications of the error prediction scheme on the designs of the management techniques like refresh and wear-leveling. Since our observation of PV effects is mainly on limited flash chips, the influence among different types of chips still demands further exploration. We hope that this work will inspire studies of the PV effects and new flash management techniques to enhance 3D flash reliability in the future.

## ACKNOWLEDGMENT

We gratefully acknowledge the support of the National Natural Science Foundation of China and the National Key Research and Development Project of China.

## REFERENCES

- [1] Rino Micheloni, "Reliability issues of NAND Flash memories," in Inside NAND Flash Memories, 1st ed. New York: Springer, 2010, pp. 89–113.
- [2] Yangyang Pan, Guiqiang Dong, and Tong Zhang, "Error Rate-Based Wear-Leveling for NAND Flash Memory at Highly Scaled Technology Nodes," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, pp. 1350–1354, July 2013.
- [3] E. Yaakobi, J. Ma, L. Grupp, P. Siegel, S. Swanson, and J. Wolf, "Error characterization and coding schemes for Flash memories," in IEEE Globecom Workshops, pp. 1856–1860, December 2010.
- [4] Y.-J. Woo and J.-S. Kim, "Diversifying wear index for MLC NAND flash memory to extend the lifetime of SSDs," in EMSOFT, pp. 1–10, September 2013.
- [5] X. Jimenez, D. Novo, and P. Ienne, "Wear unleveling: Improving NAND flash lifetime by balancing page endurance," in USENIX FAST, pp. 47–59, February 2015.
- [6] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, "A large-scale study of flash memory failures in the field," ACM SIGMETRICS, vol. 43, pp. 177–190, June 2015.
- [7] Meng-Fan Chang and Shin-Jang Shen, "A Process Variation Tolerant Embedded Split-Gate Flash Memory Using Pre-Stable Current Sensing Scheme," IEEE Journal of Solid-State Circuits, vol. 44, pp. 987–994, February 2009.
- [8] S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, vol. 25, pp. 10–16, December 2005.
- [9] Rino Micheloni, Inside Solid State Drives, 1st ed. New York: Springer, 2013, pp. 1–28.
- [10] L. Shi et al., "Exploiting process variation for write performance improvement on NAND flash memory storage systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, pp. 334–337, February 2015.
- [11] Yejia Di, Liang Shi, Kaijie Wu, and Chun Jason Xue, "Exploiting process variation for retention induced refresh minimization on flash memory," in DATE, March 2016.
- [12] Yejia Di et al., "Minimizing Retention Induced Refresh Through Exploiting Process Variation of Flash Memory," IEEE Transactions on Computers, vol. 68, pp. 83–98, July 2018.
- [13] Y. Wang, L. Dong, and R. Mao, "P-Alloc: Process-Variation Tolerant Reliability Management for 3D Charge-Trapping Flash Memory," ACM Trans. Embedded Comput. Syst., vol. 16, pp. 1–19, October 2017.
- [14] Y. Luo et al., "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation," in SIGMETRICS, pp. 1–48, December 2018.
- [15] Shan Suthaharan, "Understanding Machine Learning," in Machine Learning Models and Algorithms for Big Data Classification, 1st ed. Greensboro: Springer, 2016, pp. 121–269.
- [16] Xu et al., "Improving Service Availability of Cloud Systems by Predicting Disk Error," in Proc. 2018 USENIX Annual Technical Conference (USENIX ATC '18), Boston, MA, USA, July 2018.