Make export 10 times faster
The export functionality was found to be too slow.
Profiling was applied adding the decorator @profile
in the function subject_to_row_for_fields
and the following imports and sets:
import line_profiler
import atexit
profile = line_profiler.LineProfiler()
atexit.register(profile.print_stats)
The test was done with just 100 subjects.
Total time: 26.8099 s
File: /Users/carlos.vega/dev/scheduling-system/smash/web/views/export.py
Function: subject_to_row_for_fields at line 153
Line # Hits Time Per Hit % Time Line Contents
==============================================================
153 @profile
154 def subject_to_row_for_fields(study_subject: StudySubject, subject_fields):
155 100 32000.0 320.0 0.0 row = []
156 100 112104000.0 1e+06 0.4 custom_fields = CustomStudySubjectField.objects.filter(study=study_subject.study).all()
157
158 5600 880000.0 157.1 0.0 for field in subject_fields:
159 5500 502000.0 91.3 0.0 cell = None
160 77000 93804000.0 1218.2 0.3 for custom_field in custom_fields:
161 71500 29996000.0 419.5 0.1 if get_study_subject_field_id(custom_field) == field.name:
162 1300 3e+10 2e+07 96.8 val = study_subject.get_custom_data_value(custom_field)
163 1300 250000.0 192.3 0.0 if val is not None:
164 1300 310000.0 238.5 0.0 cell = val.value
165
166 5500 1995000.0 362.7 0.0 if field == DROP_OUT_FIELD:
167 100 15000.0 150.0 0.0 if not study_subject.resigned:
168 83 10000.0 120.5 0.0 cell = False
169 else:
170 51 3384000.0 66352.9 0.0 finished_appointments = Appointment.objects.filter(visit__subject=study_subject).filter(
171 34 17637000.0 518735.3 0.1 status=Appointment.APPOINTMENT_STATUS_FINISHED).count()
172 17 3000.0 176.5 0.0 if finished_appointments > 0:
173 4 1000.0 250.0 0.0 cell = True
174 else:
175 13 2000.0 153.8 0.0 cell = False
176 else:
177 5400 184002000.0 34074.4 0.7 if hasattr(study_subject, field.name):
178 2200 1047000.0 475.9 0.0 cell = getattr(study_subject, field.name)
179 3200 281638000.0 88011.9 1.1 elif hasattr(study_subject.subject, field.name):
180 1900 2966000.0 1561.1 0.0 cell = getattr(study_subject.subject, field.name)
181 5400 602000.0 111.5 0.0 if cell is None:
182 979 103000.0 105.2 0.0 cell = ""
183 5500 1335000.0 242.7 0.0 if isinstance(cell, BaseManager):
184 100 13000.0 130.0 0.0 collection_value = ""
185 214 113770000.0 531635.5 0.4 for value in cell.all():
186 114 74000.0 649.1 0.0 collection_value += str(value) + ","
187 100 60000.0 600.0 0.0 cell = collection_value
188 5500 1183000.0 215.1 0.0 row.append(cell)
189 100 11000.0 110.0 0.0 return row
As can be seen, the main time losses come from:
- line 162,
val = study_subject.get_custom_data_value(custom_field)
with 2e+07 per hit and a 96.8% of time lost. - line 156,
custom_fields = CustomStudySubjectField.objects.filter(study=study_subject.study).all()
with 1e+06 per hit. - also the two
hasattr
from lines 177 and 179.
Then, after the changes.
Total time: 2.23502 s
File: /Users/carlos.vega/dev/scheduling-system/smash/web/views/export.py
Function: subject_to_row_for_fields at line 167
Line # Hits Time Per Hit % Time Line Contents
==============================================================
167 @profile
168 def subject_to_row_for_fields(self, study_subject: StudySubject):
169
170 #cache fields
171 100 38000.0 380.0 0.0 if len(self.study_subject_fields) == 0:
172 56 7000.0 125.0 0.0 for field in self.subject_fields:
173 55 3050000.0 55454.5 0.1 if hasattr(study_subject, field.name):
174 22 10000.0 454.5 0.0 self.study_subject_fields.add(field.name)
175 33 5461000.0 165484.8 0.2 elif hasattr(study_subject.subject, field.name):
176 19 7000.0 368.4 0.0 self.study_subject_subject_fields.add(field.name)
177
178
179 # to avoid re-execution of this function (property) inside get_custom_data_value
180 100 397573000.0 4e+06 17.8 custom_data_values = study_subject.custom_data_values
181 100 1219816000.0 1e+07 54.6 custom_data_values_map = {value.study_subject_field:value for value in custom_data_values} #considerably faster than for loop
182
183 100 19000.0 190.0 0.0 row = []
184
185 5600 659000.0 117.7 0.0 for field in self.subject_fields:
186 5500 445000.0 80.9 0.0 cell = None
187
188 5500 1342000.0 244.0 0.1 if field.name in self.custom_fields_map:
189 1300 221000.0 170.0 0.0 custom_field = self.custom_fields_map[field.name]
190 1300 2695000.0 2073.1 0.1 if custom_field in custom_data_values_map:
191 1300 2498000.0 1921.5 0.1 cell = custom_data_values_map[custom_field].value
192
193 5500 1988000.0 361.5 0.1 if field == DROP_OUT_FIELD:
194 100 13000.0 130.0 0.0 if not study_subject.resigned:
195 83 14000.0 168.7 0.0 cell = False
196 else:
197 51 3430000.0 67254.9 0.2 finished_appointments = Appointment.objects.filter(visit__subject=study_subject).filter(
198 34 19763000.0 581264.7 0.9 status=Appointment.APPOINTMENT_STATUS_FINISHED).count()
199 17 5000.0 294.1 0.0 if finished_appointments > 0:
200 4 0.0 0.0 0.0 cell = True
201 else:
202 13 1000.0 76.9 0.0 cell = False
203 else:
204 5400 859000.0 159.1 0.0 if field.name in self.study_subject_fields:
205 2200 206676000.0 93943.6 9.2 cell = getattr(study_subject, field.name)
206 3200 448000.0 140.0 0.0 elif field.name in self.study_subject_subject_fields:
207 1900 255779000.0 134620.5 11.4 cell = getattr(study_subject.subject, field.name)
208 5400 707000.0 130.9 0.0 if cell is None:
209 979 86000.0 87.8 0.0 cell = ""
210 5500 1282000.0 233.1 0.1 if isinstance(cell, BaseManager):
211 100 9000.0 90.0 0.0 collection_value = ""
212 214 108952000.0 509121.5 4.9 for value in cell.all():
213 114 81000.0 710.5 0.0 collection_value += str(value) + ","
214 100 52000.0 520.0 0.0 cell = collection_value
215 5500 1012000.0 184.0 0.0 row.append(cell)
216
217 100 18000.0 180.0 0.0 return row
72.4% of the time is now lost in the following lines:
- 180,
custom_data_values = study_subject.custom_data_values
- 181,
custom_data_values_map = {value.study_subject_field:value for value in custom_data_values}
But at least now they execute once per subject (100) instead of one per subject*custom field (1300).
The main changes that make the improvement are the changes in study_subject
for the left join and the cache changes done to prevent multiple runs of the same loop.
Please, try to run it locally and run export to see that all works fine. I tried multiple times and found no issues.
Should fix #525 (closed)
To be checked: After the function export_to_csv is finished, it takes a lot of time to fire the csv file to the client so not sure where time goes after line 33 in export.py. I tried with PRC data and it takes 60 seconds to export and then 5 seconds or so more after line 33.