Make export 10 times faster (!456) · Merge requests · SMASCH / scheduling-system

Carlos Vega requested to merge fix/slow_export into master Apr 19, 2024

The export functionality was found to be too slow.

Profiling was applied adding the decorator @profile in the function subject_to_row_for_fields and the following imports and sets:

import line_profiler
import atexit
profile = line_profiler.LineProfiler()
atexit.register(profile.print_stats)

The test was done with just 100 subjects.

Total time: 26.8099 s
File: /Users/carlos.vega/dev/scheduling-system/smash/web/views/export.py
Function: subject_to_row_for_fields at line 153

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   153                                           @profile
   154                                           def subject_to_row_for_fields(study_subject: StudySubject, subject_fields):
   155       100      32000.0    320.0      0.0      row = []
   156       100  112104000.0    1e+06      0.4      custom_fields = CustomStudySubjectField.objects.filter(study=study_subject.study).all()
   157
   158      5600     880000.0    157.1      0.0      for field in subject_fields:
   159      5500     502000.0     91.3      0.0          cell = None
   160     77000   93804000.0   1218.2      0.3          for custom_field in custom_fields:
   161     71500   29996000.0    419.5      0.1              if get_study_subject_field_id(custom_field) == field.name:
   162      1300        3e+10    2e+07     96.8                  val = study_subject.get_custom_data_value(custom_field)
   163      1300     250000.0    192.3      0.0                  if val is not None:
   164      1300     310000.0    238.5      0.0                      cell = val.value
   165
   166      5500    1995000.0    362.7      0.0          if field == DROP_OUT_FIELD:
   167       100      15000.0    150.0      0.0              if not study_subject.resigned:
   168        83      10000.0    120.5      0.0                  cell = False
   169                                                       else:
   170        51    3384000.0  66352.9      0.0                  finished_appointments = Appointment.objects.filter(visit__subject=study_subject).filter(
   171        34   17637000.0 518735.3      0.1                      status=Appointment.APPOINTMENT_STATUS_FINISHED).count()
   172        17       3000.0    176.5      0.0                  if finished_appointments > 0:
   173         4       1000.0    250.0      0.0                      cell = True
   174                                                           else:
   175        13       2000.0    153.8      0.0                      cell = False
   176                                                   else:
   177      5400  184002000.0  34074.4      0.7              if hasattr(study_subject, field.name):
   178      2200    1047000.0    475.9      0.0                  cell = getattr(study_subject, field.name)
   179      3200  281638000.0  88011.9      1.1              elif hasattr(study_subject.subject, field.name):
   180      1900    2966000.0   1561.1      0.0                  cell = getattr(study_subject.subject, field.name)
   181      5400     602000.0    111.5      0.0              if cell is None:
   182       979     103000.0    105.2      0.0                  cell = ""
   183      5500    1335000.0    242.7      0.0          if isinstance(cell, BaseManager):
   184       100      13000.0    130.0      0.0              collection_value = ""
   185       214  113770000.0 531635.5      0.4              for value in cell.all():
   186       114      74000.0    649.1      0.0                  collection_value += str(value) + ","
   187       100      60000.0    600.0      0.0              cell = collection_value
   188      5500    1183000.0    215.1      0.0          row.append(cell)
   189       100      11000.0    110.0      0.0      return row

As can be seen, the main time losses come from:

line 162, val = study_subject.get_custom_data_value(custom_field) with 2e+07 per hit and a 96.8% of time lost.
line 156, custom_fields = CustomStudySubjectField.objects.filter(study=study_subject.study).all() with 1e+06 per hit.
also the two hasattr from lines 177 and 179.

Then, after the changes.

Total time: 2.23502 s
File: /Users/carlos.vega/dev/scheduling-system/smash/web/views/export.py
Function: subject_to_row_for_fields at line 167

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   167                                               @profile
   168                                               def subject_to_row_for_fields(self, study_subject: StudySubject):
   169
   170                                                   #cache fields
   171       100      38000.0    380.0      0.0          if len(self.study_subject_fields) == 0:
   172        56       7000.0    125.0      0.0              for field in self.subject_fields:
   173        55    3050000.0  55454.5      0.1                  if hasattr(study_subject, field.name):
   174        22      10000.0    454.5      0.0                      self.study_subject_fields.add(field.name)
   175        33    5461000.0 165484.8      0.2                  elif hasattr(study_subject.subject, field.name):
   176        19       7000.0    368.4      0.0                      self.study_subject_subject_fields.add(field.name)
   177
   178
   179                                                   # to avoid re-execution of this function (property) inside get_custom_data_value
   180       100  397573000.0    4e+06     17.8          custom_data_values = study_subject.custom_data_values
   181       100 1219816000.0    1e+07     54.6          custom_data_values_map = {value.study_subject_field:value for value in custom_data_values} #considerably faster than for loop
   182
   183       100      19000.0    190.0      0.0          row = []
   184
   185      5600     659000.0    117.7      0.0          for field in self.subject_fields:
   186      5500     445000.0     80.9      0.0              cell = None
   187
   188      5500    1342000.0    244.0      0.1              if field.name in self.custom_fields_map:
   189      1300     221000.0    170.0      0.0                  custom_field = self.custom_fields_map[field.name]
   190      1300    2695000.0   2073.1      0.1                  if custom_field in custom_data_values_map:
   191      1300    2498000.0   1921.5      0.1                      cell = custom_data_values_map[custom_field].value
   192
   193      5500    1988000.0    361.5      0.1              if field == DROP_OUT_FIELD:
   194       100      13000.0    130.0      0.0                  if not study_subject.resigned:
   195        83      14000.0    168.7      0.0                      cell = False
   196                                                           else:
   197        51    3430000.0  67254.9      0.2                      finished_appointments = Appointment.objects.filter(visit__subject=study_subject).filter(
   198        34   19763000.0 581264.7      0.9                          status=Appointment.APPOINTMENT_STATUS_FINISHED).count()
   199        17       5000.0    294.1      0.0                      if finished_appointments > 0:
   200         4          0.0      0.0      0.0                          cell = True
   201                                                               else:
   202        13       1000.0     76.9      0.0                          cell = False
   203                                                       else:
   204      5400     859000.0    159.1      0.0                  if field.name in self.study_subject_fields:
   205      2200  206676000.0  93943.6      9.2                      cell = getattr(study_subject, field.name)
   206      3200     448000.0    140.0      0.0                  elif field.name in self.study_subject_subject_fields:
   207      1900  255779000.0 134620.5     11.4                      cell = getattr(study_subject.subject, field.name)
   208      5400     707000.0    130.9      0.0                  if cell is None:
   209       979      86000.0     87.8      0.0                      cell = ""
   210      5500    1282000.0    233.1      0.1              if isinstance(cell, BaseManager):
   211       100       9000.0     90.0      0.0                  collection_value = ""
   212       214  108952000.0 509121.5      4.9                  for value in cell.all():
   213       114      81000.0    710.5      0.0                      collection_value += str(value) + ","
   214       100      52000.0    520.0      0.0                  cell = collection_value
   215      5500    1012000.0    184.0      0.0              row.append(cell)
   216
   217       100      18000.0    180.0      0.0          return row

72.4% of the time is now lost in the following lines:

180, custom_data_values = study_subject.custom_data_values
181, custom_data_values_map = {value.study_subject_field:value for value in custom_data_values}

But at least now they execute once per subject (100) instead of one per subject*custom field (1300).

The main changes that make the improvement are the changes in study_subject for the left join and the cache changes done to prevent multiple runs of the same loop.

Please, try to run it locally and run export to see that all works fine. I tried multiple times and found no issues.

Should fix #525 (closed)

To be checked: After the function export_to_csv is finished, it takes a lot of time to fire the csv file to the client so not sure where time goes after line 33 in export.py. I tried with PRC data and it takes 60 seconds to export and then 5 seconds or so more after line 33.

Edited Apr 19, 2024 by Carlos Vega

Make export 10 times faster

Merge request reports